I have some Hebrew websites that contain character references like: `&#1504;&#1493;&#1507;`
I can only view these letters if I save the file as .html and view in UTF-8 encoding.
If I try to open it as a regular text file then UTF-8 encoding does not show the proper output.
I noticed that if I open a text editor and write Hebrew in UTF-8, each character takes two bytes, not 4 bytes like in this example (`&#1493;`).
Any ideas if this is UTF-16 or any other kind of UTF representation of letters?
How can I convert it to normal letters if possible?
Using latest PHP version.
Those are character references that refer to characters in ISO 10646 by specifying the code point of the character in decimal (`&#n;`) or hexadecimal (`&#xn;`) notation.

You can use `html_entity_decode`, which decodes such character references as well as the entity references for entities defined for HTML 4, so other references like `&lt;`, `&gt;`, and `&amp;` will also get decoded.

If you just want to decode the numeric character references, you can match them with a regular expression and convert each code point to UTF-8 in a callback.
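A minimal sketch of both approaches. The helper name `decode_numeric_refs` and the sample string are made up for illustration, and the code point conversion assumes the mbstring extension is loaded.

```php
<?php
// Sample input: three Hebrew letters as decimal references, plus named entities.
$input = '&#1504;&#1493;&#1507; &lt;b&gt; &amp;';

// Option 1: decode numeric references and named HTML 4 entities alike.
$all = html_entity_decode($input, ENT_QUOTES, 'UTF-8');

// Option 2: decode only numeric references (hypothetical helper).
function decode_numeric_refs($str)
{
    return preg_replace_callback(
        '/&#(x[0-9a-f]+|[0-9]+);/i',
        function ($m) {
            $ref = $m[1];
            // A leading "x" marks hexadecimal notation, otherwise it is decimal.
            $codepoint = (strtolower($ref[0]) === 'x')
                ? hexdec(substr($ref, 1))
                : (int) $ref;
            // Leave references outside the Unicode range, or to surrogates, untouched.
            if ($codepoint > 0x10FFFF || ($codepoint >= 0xD800 && $codepoint <= 0xDFFF)) {
                return $m[0];
            }
            // Encode the code point as UTF-32BE, then convert it to UTF-8.
            return mb_convert_encoding(pack('N', $codepoint), 'UTF-8', 'UTF-32BE');
        },
        $str
    );
}

$numericOnly = decode_numeric_refs($input);
```

With this input, `$all` holds `נוף <b> &`, while `$numericOnly` holds `נוף &lt;b&gt; &amp;` because the named entities are left alone.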
As YuriKolovsky and thirtydot have pointed out in another question, it seems that browser vendors have ‘silently’ agreed on a mapping for some character references that differs from the specification and is largely undocumented.
Some character references (those with code points 128 to 159) would normally be mapped onto the Latin 1 supplement, but are actually mapped onto different characters. The mapping browsers use is the one that results from treating the number as a Windows-1252 byte instead of an ISO 8859-1 byte, the encoding on which the Unicode character set is built. Jukka Korpela wrote an extensive article on this topic.
Now here’s an extension to the function mentioned above that handles this quirk:
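A minimal sketch of such an extension. The function name `decode_numeric_refs_cp1252` is hypothetical, the table lists the Windows-1252 assignments for 0x80 to 0x9F, and mbstring is again assumed.

```php
<?php
// Hypothetical helper: numeric references in 0x80-0x9F are read as Windows-1252.
function decode_numeric_refs_cp1252($str)
{
    // Windows-1252 byte => Unicode code point (bytes undefined in Windows-1252 are omitted).
    static $cp1252 = array(
        0x80 => 0x20AC, 0x82 => 0x201A, 0x83 => 0x0192, 0x84 => 0x201E,
        0x85 => 0x2026, 0x86 => 0x2020, 0x87 => 0x2021, 0x88 => 0x02C6,
        0x89 => 0x2030, 0x8A => 0x0160, 0x8B => 0x2039, 0x8C => 0x0152,
        0x8E => 0x017D, 0x91 => 0x2018, 0x92 => 0x2019, 0x93 => 0x201C,
        0x94 => 0x201D, 0x95 => 0x2022, 0x96 => 0x2013, 0x97 => 0x2014,
        0x98 => 0x02DC, 0x99 => 0x2122, 0x9A => 0x0161, 0x9B => 0x203A,
        0x9C => 0x0153, 0x9E => 0x017E, 0x9F => 0x0178,
    );
    $map = $cp1252;

    return preg_replace_callback(
        '/&#(x[0-9a-f]+|[0-9]+);/i',
        function ($m) use ($map) {
            $ref = $m[1];
            $codepoint = (strtolower($ref[0]) === 'x')
                ? hexdec(substr($ref, 1))
                : (int) $ref;
            // Leave out-of-range values and surrogates as they are.
            if ($codepoint > 0x10FFFF || ($codepoint >= 0xD800 && $codepoint <= 0xDFFF)) {
                return $m[0];
            }
            // Apply the browser quirk where the table has an entry.
            if (isset($map[$codepoint])) {
                $codepoint = $map[$codepoint];
            }
            return mb_convert_encoding(pack('N', $codepoint), 'UTF-8', 'UTF-32BE');
        },
        $str
    );
}

$quirky = decode_numeric_refs_cp1252('&#128;10 &#x93;hi&#148; &#1493;');
```

Here `$quirky` becomes `€10 “hi” ו`: the three references in the 128 to 159 range are remapped, and the Hebrew letter is decoded as usual.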
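If anonymous functions are not available (they were introduced with PHP 5.3.0), you could also use `create_function`. Note that `create_function` was deprecated in PHP 7.2 and removed in PHP 8.0, so on current versions a named function passed as the callback is the safer fallback.

Another variant of the function tries to comply with the behavior of HTML 5, which also defines what to do with invalid code points.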
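A rough sketch of an HTML 5 flavored decoder, written with a named callback so it also works where closures are unavailable. The names `html5_dereference` and `html5_decode_refs` are hypothetical, and the rules reflect my reading of the HTML 5 parsing algorithm (null, out-of-range, and surrogate values become U+FFFD; 0x80 to 0x9F follow Windows-1252), so treat it as an approximation rather than a conforming implementation.

```php
<?php
// Named callback (hypothetical): turns one matched reference into UTF-8.
function html5_dereference($m)
{
    // Windows-1252 byte => Unicode code point, as used for 0x80-0x9F.
    static $cp1252 = array(
        0x80 => 0x20AC, 0x82 => 0x201A, 0x83 => 0x0192, 0x84 => 0x201E,
        0x85 => 0x2026, 0x86 => 0x2020, 0x87 => 0x2021, 0x88 => 0x02C6,
        0x89 => 0x2030, 0x8A => 0x0160, 0x8B => 0x2039, 0x8C => 0x0152,
        0x8E => 0x017D, 0x91 => 0x2018, 0x92 => 0x2019, 0x93 => 0x201C,
        0x94 => 0x201D, 0x95 => 0x2022, 0x96 => 0x2013, 0x97 => 0x2014,
        0x98 => 0x02DC, 0x99 => 0x2122, 0x9A => 0x0161, 0x9B => 0x203A,
        0x9C => 0x0153, 0x9E => 0x017E, 0x9F => 0x0178,
    );

    $ref = $m[1];
    $codepoint = (strtolower($ref[0]) === 'x')
        ? hexdec(substr($ref, 1))
        : (int) $ref;

    if ($codepoint == 0 || $codepoint > 0x10FFFF
        || ($codepoint >= 0xD800 && $codepoint <= 0xDFFF)) {
        // Invalid values are replaced with U+FFFD REPLACEMENT CHARACTER.
        $codepoint = 0xFFFD;
    } elseif (isset($cp1252[$codepoint])) {
        $codepoint = $cp1252[$codepoint];
    }

    return mb_convert_encoding(pack('N', $codepoint), 'UTF-8', 'UTF-32BE');
}

// Hypothetical wrapper: passes the callback by name instead of as a closure.
function html5_decode_refs($str)
{
    return preg_replace_callback('/&#(x[0-9a-f]+|[0-9]+);/i', 'html5_dereference', $str);
}

$decoded = html5_decode_refs('&#0;&#x110000;&#xD800;&#150;&#65;');
```

For this input, the first three references each turn into U+FFFD, `&#150;` becomes an en dash, and `&#65;` becomes `A`.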
I’ve also noticed that in PHP 5.4.0 the `html_entity_decode` function gained another flag named `ENT_HTML5` for HTML 5 behavior.
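As a small illustration of the flag, the following compares the HTML 4.01 and HTML 5 entity tables on a sample string of my own choosing. `&apos;` is documented as not being part of the HTML 4.01 table, so it is only decoded in the second call.

```php
<?php
// Sample string (illustrative): one HTML 5 named entity and one numeric reference.
$sample = 'it&apos;s &#1493;';

// HTML 4.01 table: &apos; is not defined there, so it stays as-is.
$html4 = html_entity_decode($sample, ENT_QUOTES | ENT_HTML401, 'UTF-8');

// HTML 5 table (PHP 5.4.0 or later): &apos; is decoded as well.
$html5 = html_entity_decode($sample, ENT_QUOTES | ENT_HTML5, 'UTF-8');
```

`$html4` ends up as `it&apos;s ו` and `$html5` as `it's ו`; the numeric reference is decoded in both cases.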