Friday, 29 August 2025

Getting UTF-8 from PHP and MySQL

This turned out to be more annoying problem than I expected. 

Consider this MySQL table:- 

 MariaDB [demo]> show create table users\G
*************************** 1. row ***************************
       Table: users
Create Table: CREATE TABLE `users` (
  `name` varchar(64) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_general_ci
1 row in set (0.000 sec)

MariaDB [demo]> select * from users;
+------------------------+
| name                   |
+------------------------+
| Éámónn McGóníglé       |
+------------------------+
1 row in set (0.000 sec)

Note all of the accented characters (called fádas in Irish), which are about to cause us some headaches.

Now consider this bit of PHP code which retrieves names from that database:- 

<?php
$dsn = "mysql:host=localhost;dbname=demo";
$dbh = new PDO($dsn, "root");
$stmt = $dbh->query("SELECT name FROM users");
$res = $stmt->fetch(PDO::FETCH_ASSOC);
print($res['name']."\n");
 

The result is not what you might expect:-

$ php demo.php 
nn McGlamonn 

All of the accented characters have been lost and there is clearly some other weirdness going on. In a web-browser, there is a weird "placeholder" character everywhere an accented character should be:-


What's going on?  The charset of the database table is set to "UTF-8" and the "Content-type:" header of the response coming from the web server is "text/html; charset=UTF-8", just as it should be.  So why the mess?  

StackExchange proposed all sorts of solutions, but the one that worked was a change to the MySQL connect string, adding a ";charset=utf8" suffix:-

 <?php
$dsn = "mysql:host=localhost;dbname=demo;charset=utf8";

With that small change...

eamonn@redranger:~/public_html$ php demo.php 
Éámónn McGóníglé

Credit where it is due, the Stack Overflow post that helped was https://stackoverflow.com/questions/4475548/pdo-mysql-and-broken-utf-8-encoding/21373793#21373793

 One nuance is that

 <?php
$dsn = "mysql:host=localhost;dbname=demo;charset=utf8mb4";

 ...might be a better choice.  Apparently, in MySQL's world, "utf8" admits only three-byte UTF-8 characters (so not full UTF-8) while "utf8mb4" allows for four-byte UTF-8 characters.  Sure enough, if I try to insert a name with a four-byte UTF-8 character into the database with a charset of "utf8", it fails:-

MariaDB [demo]> insert into users values ("Smiley McSmiley 🙂");
ERROR 1366 (22007): Incorrect string value: '\xF0\x9F\x99\x82' for column `demo`.`users`.`name` at row 1

So I recreate the database table with a charset of "utf8mb4":-

MariaDB [demo]> create table users  (name varchar(64)) charset utf8mb4;
Query OK, 0 rows affected (0.214 sec)

MariaDB [demo]> set names utf8mb4;
Query OK, 0 rows affected (0.000 sec)

MariaDB [demo]> insert into users values ("Smiley McSmiley 🙂");
Query OK, 1 row affected (0.046 sec)

MariaDB [demo]> select * from users;
+----------------------+
| name                 |
+----------------------+
| Smiley McSmiley 🙂     |
+----------------------+
1 row in set (0.000 sec)

However, the PHP script fails to crack a smile:-

$ php demo.php 
Smiley McSmiley ?


 But if I change...

<?php
$dsn = "mysql:host=localhost;dbname=demo;charset=utf8";
 

...to...

<?php
$dsn = "mysql:host=localhost;dbname=demo;charset=utf8mb4";
 

 ...it's smiles all-round:-

$ php demo.php 
Smiley McSmiley 🙂


 

Another solution which is more trouble and less good is to use the PHP call mb_convert_encoding.  I did get it to work for accented characters.  The trick is to convert the ISO-8859-1 character that MySQL returns (without the ";charset=utf8" specified in the connect string) to a UTF-8 character. This works for accented characters such as á, é etc. that have a single-byte mapping in ISO-8859-1, but not for 🙂.  UTF-8 is a better way to do it.