I am fetching an external URL and extracting specific content from it with XPath.
I tried several approaches, but each one ended up with some small problem. After a lot of research, this is what I do now:
First, I create a stream context so the file is requested with the right charset header (UTF-8):
$opts=array('http' => array('header' => 'Accept-Charset: UTF-8, *;q=0'));
$context=stream_context_create($opts);
$html=file_get_contents($url,false,$context);
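Note that `Accept-Charset` is only a request to the server, which is free to ignore it, so it may be worth checking what actually came back. The following is a minimal diagnostic sketch, not part of the original code: `charsetFromHeaders()` is a hypothetical helper, and the sample headers and body are made-up stand-ins for `$http_response_header` and `$html`.

```php
<?php
// Hypothetical helper (not from the original code): pulls the charset out of
// raw response header lines such as those PHP exposes in $http_response_header
// after a file_get_contents() call over HTTP.
function charsetFromHeaders(array $headers): ?string
{
    foreach ($headers as $line) {
        if (preg_match('/^Content-Type:.*charset=\s*"?([\w.:-]+)/i', $line, $m)) {
            return strtoupper($m[1]);
        }
    }
    return null; // no charset declared in the headers
}

// Made-up sample data standing in for a real response.
$sampleHeaders = ['HTTP/1.1 200 OK', 'Content-Type: text/html; charset=ISO-8859-1'];
$sampleBody    = "gobierno marroqu\xED para"; // "í" as a single ISO-8859-1 byte

var_dump(charsetFromHeaders($sampleHeaders));      // charset the server declared, if any
var_dump(mb_check_encoding($sampleBody, 'UTF-8')); // whether the bytes are valid UTF-8
```

If the body turns out not to be valid UTF-8, converting it with "UTF-8" as the source encoding could plausibly damage the accented characters, although that is only an assumption about the cause here.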
Then, inside my class, where I have created a DOMDocument object, I load the fetched HTML string as follows:
$this->dom->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', "UTF-8"), LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
It works fine in almost every case, but it sometimes strips accented characters such as á, ó, ç, etc.
Example: "gobierno marroquí para" turns into "gobierno marroqu para"
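To narrow this down, the same `loadHTML()` call can be run on a fixed string in isolation. This is only a reproduction sketch under stated assumptions: the sample markup is made up around the example phrase, and the file is assumed to be saved as UTF-8. If the accent survives here, the input bytes from the remote page would be the more likely suspect than the DOM call itself.

```php
<?php
// Mirrors the loadHTML() call from the question on a fixed sample string.
// Note: the 'HTML-ENTITIES' target is deprecated as of PHP 8.2, so newer
// versions emit a deprecation notice on this call.
$sample = '<p>gobierno marroquí para</p>'; // assumes this file is saved as UTF-8

$dom = new DOMDocument();
$dom->loadHTML(
    mb_convert_encoding($sample, 'HTML-ENTITIES', 'UTF-8'),
    LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
);

// Read the text back from the parsed <p> element (returned as UTF-8).
$text = $dom->getElementsByTagName('p')->item(0)->textContent;

var_dump($text === 'gobierno marroquí para'); // true means the accent survived
```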
I also tried loading my HTML with a plain-text prefix, <?xml encoding...,
and it works fine, but then I run into issues with later HTMLPurifier operations.
Any information is appreciated. I am not looking for somebody to do this task for me, but for the right and most efficient way to do it. I need to understand it all so I can work with it.
Peace.