European 'é' character with ASCII code 101 204 129


I have an issue with the character 'é'.

Using ftp_nlist($this->ftpStream, $directory); I get a string like 'Parté.mp4', but the 'é' doesn't match the regex [\p{L}]*\.mp4


The ASCII code of the 'é' that doesn't work is '101 204 129'. The function ord($e), where $e is the weird character, returns '101', which is the code of the plain letter e.

It seems that my 'é' is composed of three characters, because I have to call
$e = substr($fileName, 4, 3); to extract this single character.

I would like to allow these characters in my regex. If you have any leads, thanks.


There are 2 answers

Nathan On

Use the extended Unicode sequence escape, \X:

\X*\.mp4

Regex Demo

The PHP manual describes the extended Unicode sequence as follows.

The \X escape matches a Unicode extended grapheme cluster. An extended grapheme cluster is one or more Unicode characters that combine to form a single glyph. In effect, this can be thought of as the Unicode equivalent of . as it will match one composed character, regardless of how many individual characters are actually used to render it.
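As a rough sketch of the difference, the snippet below compares the pattern from the question with the \X version. The ^ and $ anchors, the u modifier (so that PCRE treats the subject as UTF-8), and the variable names are assumptions added for illustration and are not part of the original answer. Keep in mind that \X accepts any grapheme cluster (digits, spaces, punctuation), so it is looser than \p{L}.

```php
<?php
// "Parté.mp4" with a decomposed é: "e" (U+0065) followed by U+0301.
$fileName = "Parte\u{0301}.mp4";

// Pattern from the question: U+0301 is a combining mark, not a letter,
// so the run of letters stops before it and the anchored match fails.
$lettersOnly = preg_match('/^[\p{L}]*\.mp4$/u', $fileName);

// \X consumes "e" + U+0301 as one extended grapheme cluster.
$withClusters = preg_match('/^\X*\.mp4$/u', $fileName);

var_dump($lettersOnly);   // int(0)
var_dump($withClusters);  // int(1)
```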

Jukka K. Korpela On

When you say that the “ASCII code” of the 'é' is '101 204 129', you probably mean that its bytes have those values in decimal. (They are not ASCII codes: they are not to be interpreted according to ASCII and, besides, ASCII ends at 127 decimal.) In hexadecimal, the bytes are 65 CC 81. This is the correct UTF-8 representation of the Basic Latin letter “e” (U+0065) followed by U+0301 COMBINING ACUTE ACCENT, which in turn is the correct decomposed representation of “é”.
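The byte-level picture can be checked with a few lines of PHP. This is only a sketch; the variable names are made up for illustration, and the "\u{...}" escapes assume PHP 7 or later.

```php
<?php
$decomposed  = "e\u{0301}";  // U+0065 followed by U+0301 COMBINING ACUTE ACCENT
$precomposed = "\u{00E9}";   // U+00E9, the single-code-point "é"

echo bin2hex($decomposed), "\n";                     // 65cc81
echo implode(' ', unpack('C*', $decomposed)), "\n";  // 101 204 129
echo ord($decomposed), "\n";                         // 101 (ord() only looks at the first byte)
echo strlen($decomposed), "\n";                      // 3 (strlen() counts bytes)
echo bin2hex($precomposed), "\n";                    // c3a9
```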

Thus, you first have a character encoding problem to fix. You should not be dealing with the UTF-8 bytes of a character but the character itself. You may need to modify the routines for reading the data, or maybe fix the data itself, if it has been munged.

Even if you have read the UTF-8 data correctly, the combining acute accent is still a problem for matching, since it is not a letter. You may need to convert the data to Normalization Form C (NFC), which turns the two-character combination into the single letter “é”.
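A minimal sketch of that normalization step, assuming the intl extension is installed (it provides the Normalizer class). The anchors, the u modifier, and the variable names are illustrative assumptions. The second pattern is an alternative that is not part of this answer: it leaves the data untouched and also accepts combining marks (\p{M}).

```php
<?php
// Decomposed file name as in the question: "e" + U+0301 (bytes 65 CC 81).
$fileName = "Parte\u{0301}.mp4";

// Option 1: convert to NFC first, so that "e" + U+0301 becomes U+00E9, a letter.
if (class_exists('Normalizer')) {
    $nfc = Normalizer::normalize($fileName, Normalizer::FORM_C);
    $matchesAfterNfc = preg_match('/^\p{L}*\.mp4$/u', $nfc);
    var_dump($matchesAfterNfc);  // int(1)
}

// Option 2 (assumption, not from the answer): allow combining marks as well.
$matchesWithMarks = preg_match('/^[\p{L}\p{M}]*\.mp4$/u', $fileName);
var_dump($matchesWithMarks);     // int(1)
```

Normalizing once, right after ftp_nlist() returns, would keep the rest of the code working with a single representation of each name.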