Load an HTML file and force UTF-8 with PHP


I am fetching an external URL and extracting specific content from it with XPath.

I tried several approaches, but each one ended up with some small problem. After a lot of research, this is what I do now:

I create a stream context so the request is sent with a header asking for UTF-8:

// Send an Accept-Charset header asking for UTF-8.
$opts = array('http' => array('header' => 'Accept-Charset: UTF-8, *;q=0'));
$context = stream_context_create($opts);
// $url is the external page to fetch.
$html = file_get_contents($url, false, $context);
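
Note that `Accept-Charset` is only a request: the server may ignore it and reply in another encoding. A minimal sketch for checking which charset the response actually declares, assuming the header lines come from `$http_response_header` (which PHP fills in after an HTTP `file_get_contents` call). The helper name `charsetFromHeaders` is hypothetical and not part of PHP:

```php
<?php
// Hypothetical helper: returns the charset declared in the last Content-Type
// header line (upper-cased), or null when no charset is declared.
function charsetFromHeaders(array $headers): ?string
{
    $charset = null;
    foreach ($headers as $line) {
        // Matches lines such as "Content-Type: text/html; charset=ISO-8859-1".
        if (preg_match('/^content-type:.*?charset\s*=\s*"?([\w.:-]+)/i', $line, $m)) {
            // Keep overwriting so the last response wins when redirects occurred.
            $charset = strtoupper($m[1]);
        }
    }
    return $charset;
}
```

Calling `charsetFromHeaders($http_response_header)` right after the `file_get_contents` line would show whether the server honoured the header. A `null` result only means the header declares nothing; the page may still name its charset in a `<meta>` tag.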

Then, inside my class, which holds a DOMDocument object, I load the fetched HTML string as follows:

$this->dom->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', "UTF-8"), LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

It works fine in almost every case, but it sometimes strips accented characters such as á, ó, ç, etc.

Example: "gobierno marroquí para" turns into "gobierno marroqu para"
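
A loss like this is consistent with the page not being valid UTF-8 in the first place (for example ISO-8859-1), although that is only a guess about this particular URL. A minimal sketch under that assumption: validate the bytes, convert from a fallback encoding when they are not valid UTF-8, and turn non-ASCII characters into numeric entities before `loadHTML`. The function names `toUtf8` and `toNumericEntities` and the ISO-8859-1 fallback are assumptions for illustration, and the mbstring extension is required:

```php
<?php
// Hypothetical helper: return $html as valid UTF-8.
// Assumption: anything that is not valid UTF-8 is in $fallback.
function toUtf8(string $html, string $fallback = 'ISO-8859-1'): string
{
    if (mb_check_encoding($html, 'UTF-8')) {
        return $html; // already valid UTF-8, leave untouched
    }
    return mb_convert_encoding($html, 'UTF-8', $fallback);
}

// Hypothetical helper: replace every non-ASCII character with a numeric entity.
function toNumericEntities(string $utf8): string
{
    // Convmap: code points 0x80..0x10FFFF, no offset, full mask.
    return mb_encode_numericentity($utf8, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');
}

// "í" stored as the single ISO-8859-1 byte 0xED, which is invalid UTF-8 here.
$latin1 = "gobierno marroqu\xED para";
$entities = toNumericEntities(toUtf8($latin1));
```

The result of `toNumericEntities(toUtf8($html))` could then be passed to `loadHTML` in place of the `mb_convert_encoding($html, 'HTML-ENTITIES', "UTF-8")` call. The `'HTML-ENTITIES'` target is also deprecated as of PHP 8.2, which is another reason to consider `mb_encode_numericentity`.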

I also tried prefixing my HTML with a plain-text <?xml encoding... declaration, and that works fine, but then I run into issues with later HTMLPurifier operations.

Any information is appreciated. I am not looking for somebody to do this task for me, but for the right and most efficient way to do it. I need to understand it all so I can work with it.

Peace.
