I found this statement under another SO question concerning Unicode and I'd like to ask for further elaboration of this rather surprising fact.
- Code that believes once you successfully create a file by a given name, that when you run ls or readdir on its enclosing directory, you'll actually find that file with the name you created it under is buggy, broken, and wrong. Stop being surprised by this!
When does this happen and what to do about it?
The first example that comes to mind: if you create a file under OS X named `é` (the single code point `U+00E9`), the OS will actually store it as `U+0065 U+0301` (its Unicode decomposition). The file will still be accessible under the original name, but it will be listed in decomposed form.

How to avoid: don't look up your files manually unless you are sure their names are pure ASCII.
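
If you do have to look a name up in a directory listing, one option is to compare normalized forms instead of raw strings. The following is only a minimal sketch of that idea; `nfc` and `find_listed_name` are hypothetical helper names, not part of any library, and it assumes both names are ordinary Unicode strings.

```python
import os
import unicodedata


def nfc(name: str) -> str:
    """Return the NFC (composed) form of a file name."""
    return unicodedata.normalize("NFC", name)


def find_listed_name(directory: str, wanted: str):
    """Return the entry of `directory` that equals `wanted` after
    normalization, or None if there is no such entry."""
    target = nfc(wanted)
    for listed in os.listdir(directory):
        if nfc(listed) == target:
            return listed
    return None


composed = "\u00e9"      # é as a single code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

print(composed == decomposed)            # False: different code point sequences
print(nfc(composed) == nfc(decomposed))  # True: same text once normalized
```

The value returned by `find_listed_name` is the name as the file system lists it, which is the safer one to pass on to later calls.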
Second: on Windows, if you have a file called `e` and try to create (with overwriting enabled) a file called `E`, the OS will still list a file called `e`. If `e` didn't exist beforehand, a file called `E` would be created.

How to avoid: don't look up your files manually unless you are sure their names are pure ASCII, and take case into account. Try to use a consistent capitalisation style; I suggest going all lowercase.
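
A case-insensitive comparison against the directory listing can be sketched as follows. `match_ignoring_case` is a hypothetical helper, and `str.casefold()` is only an approximation: the file system applies its own case-mapping rules, which may differ for some non-ASCII characters.

```python
import os


def match_ignoring_case(names, wanted: str) -> list:
    """Return every name in `names` that equals `wanted` when case is ignored."""
    key = wanted.casefold()
    return [name for name in names if name.casefold() == key]


# Look for "E" among the entries of the current directory, whatever
# capitalisation the file system chose to keep.
print(match_ignoring_case(os.listdir("."), "E"))
```

On a case-sensitive file system the result can contain more than one name, which is why the helper returns a list.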
Third: on Windows, if your system encoding is, for example, Windows-1250 and you create a file named `ê` via the narrow, `char`-based API, a file called `e` will be created instead. This is of course easy to avoid, but this exact problem bit me once: WinRAR extracted the files `ê.png`, `è.png` and `e.png` all into `e.png`, overwriting data. Similar problems can happen with other encoding mix-ups, too.

How to avoid: don't use APIs that take the file name as a `char*` on Windows.
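
If you cannot avoid such a code path, you can at least check names before writing anything. The sketch below makes two assumptions that are mine: the code page is `cp1250`, and the lossy conversion is approximated by dropping combining marks, which is not the exact mapping Windows applies. `representable`, `strip_marks` and `collisions` are hypothetical helper names.

```python
import unicodedata


def representable(name: str, codepage: str = "cp1250") -> bool:
    """True if every character of `name` exists in `codepage`."""
    try:
        name.encode(codepage)
        return True
    except UnicodeEncodeError:
        return False


def strip_marks(name: str) -> str:
    """Rough stand-in for a lossy conversion: decompose, then drop accents."""
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def collisions(names) -> dict:
    """Group names that would end up identical after `strip_marks`."""
    groups = {}
    for name in names:
        groups.setdefault(strip_marks(name), []).append(name)
    return {key: group for key, group in groups.items() if len(group) > 1}


archive = ["ê.png", "è.png", "e.png", "a.png"]
print([name for name in archive if not representable(name)])
print(collisions(archive))
```

Any name reported by either check is a candidate for being silently renamed or overwritten, so it is worth refusing to extract it or renaming it explicitly first.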