I'm a programmer specialising in performant and scalable systems using PHP and Ruby and cooking


PHP, UTF-8, DOMDocument and HTML entities

Recently I was using DOMDocument to parse HTML snippets, add some attributes and output the amended HTML.

Someone pointed out to me that odd characters were appearing in the output after this process, for example a stray Â in front of each £.

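After a bit of experimenting and digging I found the cause: the data coming from Postgres into PHP was UTF-8 encoded, but when you load an HTML string into DOMDocument without a declared character set, it is treated as Latin-1 (ISO-8859-1). Each multi-byte UTF-8 character is then read as several separate single-byte characters. To fix this, you can force DOMDocument to read the string as UTF-8 by prefixing your snippet with an XML encoding declaration: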

$document = new \DOMDocument();
// The XML-style prefix makes the parser read $html as UTF-8 rather than Latin-1.
$document->loadHTML('<?xml encoding="UTF-8">' . $html);
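To see the difference for yourself, here is a minimal sketch that loads the same snippet with and without the prefix and compares the text that comes back. The helper name loadUtf8Html and the sample snippet are my own inventions for illustration, and the exact garbled characters you get without the prefix may vary with your libxml version.

```php
<?php
// Hypothetical helper (not part of PHP): load a UTF-8 HTML snippet
// using the XML encoding prefix described above.
function loadUtf8Html(string $html): \DOMDocument
{
    $document = new \DOMDocument();
    $document->loadHTML('<?xml encoding="UTF-8">' . $html);

    return $document;
}

// Sample input, assumed to be UTF-8. "\u{00A3}" is the pound sign.
$html = "<p>Price: \u{00A3}5</p>";

// Without the prefix the parser does not know the bytes are UTF-8.
$plain = new \DOMDocument();
$plain->loadHTML($html);
$brokenText = $plain->getElementsByTagName('p')->item(0)->textContent;

// With the prefix the pound sign survives the round trip.
$fixedText = loadUtf8Html($html)
    ->getElementsByTagName('p')
    ->item(0)
    ->textContent;

var_dump($brokenText === "Price: \u{00A3}5");
var_dump($fixedText === "Price: \u{00A3}5");
```

One thing to watch for: the prefix is parsed as part of the document, so if you serialise the whole document rather than individual nodes you may need to strip it from the output again.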