I'm scraping from a UTF-8 site, using Goutte, which internally uses Guzzle. The site declares a meta tag of UTF-8, thus:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
However, the Content-Type response header is:
Content-Type: text/html
and not:
Content-Type: text/html; charset=utf-8
Thus, when I scrape, Goutte does not spot that it is UTF-8, and grabs data incorrectly. The remote site is not under my control, so I can't fix the problem there! Here's a set of scripts to replicate the problem. First, the scraper:
<?php
require_once realpath(__DIR__ . '/..') . '/vendor/goutte/goutte.phar';
$url = 'http://crawler-tests.local/utf-8.php';
use Goutte\Client;
$client = new Client();
$crawler = $client->request('get', $url);
$text = $crawler->text();
echo 'Whole page: ' . $text . "\n";
Now a test page to be placed on a web server:
<?php
// Correct
#header('Content-Type: text/html; charset=utf-8');
// Incorrect
header('Content-Type: text/html');
?>
<!DOCTYPE html>
<html>
<head>
<title>UTF-8 test</title>
<meta charset="utf-8" />
</head>
<body>
<p>When the Content-Header header is incomplete, the pound sign breaks:
£15,216</p>
</body>
</html>
Here's the output of the Goutte test:
Whole page: UTF-8 test When the Content-Header header is incomplete, the pound sign breaks: Â£15,216
As you can see from the comments in the last script, properly declaring the character set in the header fixes things. I've hunted around in Goutte to see if there is anything that looks like it would force the character set, but to no avail. Any ideas?
The issue is actually with symfony/browser-kit and symfony/dom-crawler. BrowserKit's Client does not examine the HTML meta tags to determine the charset; it looks at the Content-Type header only. When the response body is handed over to the DomCrawler, it is treated as being in the default charset, ISO-8859-1. After examining the meta tags, that decision should be reverted and the DOMDocument rebuilt, but that never happens.
The easy workaround is to wrap $crawler->text() with utf8_decode(). This works if the input is UTF-8. I suppose something similar can be achieved for other encodings with iconv() or the like. However, you have to remember to do that every time you call text().
A more generic approach is to make the DomCrawler believe that it is dealing with UTF-8. To that end I've come up with a Guzzle plugin that overwrites (or adds) the charset in the Content-Type response header. You can find it at https://gist.github.com/pschultz/6554265. Usage is like this:
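Separately from the plugin's own usage, both ideas in this answer can be checked at the byte level without Goutte or a web server. The sketch below is only an illustration under stated assumptions: forceCharset() and fixMisdecoded() are hypothetical helper names invented here (they are not the gist's API, nor part of Guzzle or Symfony), PHP 7 or later with the iconv extension is assumed, and iconv() stands in for utf8_decode() because the latter is deprecated as of PHP 8.2. It reproduces the broken pound sign, reverses it the way the utf8_decode() workaround does, and shows the kind of header rewrite that forcing the charset amounts to.

```php
<?php
// Sketch only: forceCharset() and fixMisdecoded() are hypothetical helpers
// written for this illustration; they are not part of Goutte, Guzzle,
// Symfony, or the gist linked above.

/**
 * Return a Content-Type header value whose charset parameter is $charset,
 * replacing an existing charset or appending one if none is present.
 */
function forceCharset(string $contentType, string $charset = 'utf-8'): string
{
    // Drop any existing "; charset=..." parameter (case-insensitive).
    $stripped = preg_replace('/;\s*charset=[^;]*/i', '', $contentType);

    return rtrim($stripped) . '; charset=' . $charset;
}

/**
 * Repair text that was really UTF-8 but was read as ISO-8859-1.
 * Same idea as wrapping the string in utf8_decode().
 */
function fixMisdecoded(string $text): string
{
    return iconv('UTF-8', 'ISO-8859-1', $text);
}

// Reproduce the symptom: the UTF-8 bytes of the pound sign, read as ISO-8859-1.
$pound  = "\xC2\xA3";                           // the pound sign in UTF-8
$broken = iconv('ISO-8859-1', 'UTF-8', $pound); // what the crawler hands back

echo bin2hex($broken), "\n";                 // c382c2a3 (two characters, not one)
echo bin2hex(fixMisdecoded($broken)), "\n";  // c2a3 (the original pound sign)
echo forceCharset('text/html'), "\n";        // text/html; charset=utf-8
echo forceCharset('text/html; charset=ISO-8859-1'), "\n"; // text/html; charset=utf-8
```

In a real scraper the header rewrite has to happen before the response reaches the DomCrawler, which is the job the plugin takes on. The after-the-fact conversion has to be applied to every string pulled out of the crawler.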