PHP: SimpleXML and different code pages: getting the data out correctly


I am working on a project where I receive XML files from different sources. My PHP script should read them, parse them, and store them in a MySQL database.

To parse the XML files, I use the SimpleXMLElement class in PHP. I receive files from Belgium in UTF-8, from Germany in ISO-8859-1, from the Czech Republic in cp1250 (Windows-1250), and so on.

When I pass the XML data to SimpleXMLElement and print the result of asXML() on that object, I see the XML exactly as it was in the original file. But when I assign a field to a PHP variable and print that variable, the text looks corrupted, and it is also corrupted when inserted into the MySQL database.

Example:

The XML:

<?xml version="1.0" encoding="cp1250"?>
...
<name>Labe Dìèín - Rozb 741,85km  ;  Dìèín - Rozb 741,85km </name>
...

The PHP code:

$sxml = file_get_contents("test.xml");
$xml = new SimpleXMLElement($sxml);
//echo $xml->asXML() . "\n"; // content will show up correctly in the shell
$name = (string)$xml->ftm->fairway_section->geo_object->name;
echo $name . "\n";

Running this code in a bash shell on Linux moves the cursor upwards and then prints: bín - Rozb 741,85km ; DÄ (the cursor movement is presumably caused by the incorrect bytes that PHP prints).

I think that PHP converts the data to UTF-8 when it stores it in a string variable, so I presumed that using mb_convert_encoding to convert from UTF-8 to cp1250 would show the correct result, but it doesn't. I also need to store the data in a format that is compatible with all the other sources.

I don't know much about encodings and code pages, which is probably why I can't get this to work. What I do know is that if I copy and paste the texts in the different languages into, for example, a new UltraEdit file, all of them show up correctly. How does UltraEdit handle this? Does it use UTF-8 (which I presume can represent anything)?

How can I convert my data so that it always displays correctly, whatever the encoding of the source?

There are 2 answers

netcoder On BEST ANSWER

Try iconv instead:

$str = iconv('UTF-8', 'WINDOWS-1250', $str);
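
A minimal sketch of that conversion with error handling, assuming the string came out of SimpleXML as UTF-8 and the consumer (for example the terminal) expects Windows-1250. The helper name to_codepage and the sample string are hypothetical and not taken from the original post:

```php
<?php
// Hypothetical helper: convert a UTF-8 string to a legacy code page.
// iconv() returns false (and raises a notice) when a character
// cannot be represented in the target encoding.
function to_codepage(string $utf8, string $target = 'WINDOWS-1250'): string
{
    $converted = iconv('UTF-8', $target, $utf8);
    if ($converted === false) {
        throw new RuntimeException("String cannot be represented in $target");
    }
    return $converted;
}

// "Děčín" written as UTF-8 byte escapes so the example does not
// depend on the encoding this script file is saved in.
$name = "D\xC4\x9B\xC4\x8D\xC3\xADn";

// Show the resulting bytes as hex instead of printing them raw,
// because raw output depends on the terminal's own encoding.
echo bin2hex(to_codepage($name)), "\n"; // 44ece8ed6e
```

If all sources have to end up in one table, it is probably simpler to keep the UTF-8 string unchanged for storage and to convert only when a specific consumer needs a legacy code page.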
Artefacto On

The problem is that your input file is malformed: there is no character ì (LATIN SMALL LETTER I WITH GRAVE, U+00EC) in Windows-1250.

The closest character is U+00ED (LATIN SMALL LETTER I WITH ACUTE).

The fact that such a character displays correctly in the shell is likely a coincidence.
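
One way to check what the file really contains is to decode the same bytes under both code pages and compare the results. The sketch below assumes the name fragment is stored as the bytes 44 EC E8 ED 6E, which the post does not confirm; decode_bytes is a hypothetical helper:

```php
<?php
// Hypothetical helper: interpret raw single-byte data as the given
// code page and return the equivalent UTF-8 string.
function decode_bytes(string $raw, string $from): string
{
    $utf8 = iconv($from, 'UTF-8', $raw);
    if ($utf8 === false) {
        throw new RuntimeException("Bytes are not valid $from");
    }
    return $utf8;
}

// Assumed raw bytes of the name fragment (not taken from the real file).
$raw = "D\xEC\xE8\xEDn";

// Hex of the UTF-8 result makes the difference visible on any terminal.
echo bin2hex(decode_bytes($raw, 'ISO-8859-1')), "\n";   // 44c3acc3a8c3ad6e
echo bin2hex(decode_bytes($raw, 'WINDOWS-1250')), "\n"; // 44c49bc48dc3ad6e
```

Under this assumption the byte 0xEC reads as ì (U+00EC) in ISO-8859-1 but as ě (U+011B) in Windows-1250, so comparing both decodings against the expected place name shows which interpretation fits the data.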