Regex to delete faulty characters within a csv file to make SplFileObject work correctly in PHP

455 views Asked by At

I try to parse a csv file in PHP via SplFileObject. Sadly SplFileObject stucks sometimes if there are erroneous invisible characters in the text. The function detects a quote instead of skipping or read it as normal character while iterating over the lines in the csv file. The screenshot below is from Textwrangler:

enter image description here

I also copied it from Textwrangler here (invisible char should be between "forgé." and "Circa"):

Fer forgé.� Circa

My code (SplFileObject part):

$splFile = new \SplFileObject($file);
$splFile->setFlags(\SplFileObject::DROP_NEW_LINE | \SplFileObject::SKIP_EMPTY | \SplFileObject::READ_AHEAD | \SplFileObject::READ_CSV);
$splFile->setCsvControl(",", '"', '"');

I tried to figure out which charset the csv file has via file -I my.csv. Output: my.csv: application/octet-stream; charset=binary. That is a weird result as the file is readable via Textwrangler and is therfore NOT binary. I also read another csv generated in the same way and the output is as expected: second.csv: text/plain; charset=utf-8. The tool used to generate the csv files is called Visual Web Ripper (tool for crawling web pages).

How I can determine which character this upside-down question mark is (it seems not to be the spanish upside down question mark - maybe just a placeholder inserted by Textwrangler)? How can I delete this character and all "invalid" characters in my csv file? Is there a regular expression which matches every character, number, sign (punctuation and other textual symbols) which is in fact a real character and leave out something like in the example above? I am looking for an unicode-safe regular expression (need to preserve german umlauts, french, russian, chinese, japan and korean characters as well). Alternatively: How can I convert a csv file with charset=binary to UTF-8?

Edit: If I open it via nano editor it shows forgé.^@ Circa. After a quick search it seems to be a NUL character or \u0000 (see comments and https://en.wikipedia.org/wiki/Null_character for reference).

Edit 2: I digged a little more into it: It seems that there is a problem with the $splFile->current() function, which reads a line at the current file pointer. The line gets truncated after the NUL character (no matter if I try to read it via SplFileObject::READ_CSV or just as normal string (without SplFileObject::READ_CSV parameter)).

1

There are 1 answers

0
koseduhemak On

The solution was to omit the SplFileObject::DROP_NEW_LINE parameter. I also checked if the NUL character is present: It is present, but it is now considered as part of the text value of the specific column in the csv and is NOT detected as quote or column enclosure.

Of course you have to filter out empty lines by yourself now with f. e. something like:

$splFileObject = new \SplFileObject();
$splFileObject->setFlags(\SplFileObject::SKIP_EMPTY | \SplFileObject::READ_AHEAD | \SplFileObject::READ_CSV);

$columns = $splFileObject->current();
if (count($columns) === 1 && array_key_exists(0, $columns) && $columns[0] === NULL) {
   // empty csv line
}