I have some text that will be written to two files, one using UTF-8 and one using Windows-1252 encoding.
Observations when comparing the two files:
- most text characters will remain the same.
- some UTF-8 characters that don't exist in 1252 will be represented as "?" in the 1252 file version.
- some characters will be converted somehow: e.g. a "Ф" or "σ" (Greek phi and sigma) will be converted to a "F" or "s" (which makes sense).
Question: Can I calculate which character in the UTF-8 file will be represented by which character in the 1252 file without actually writing the files?
Or to put it another way: Is there more efficient code than this to find out the differences without writing to a text file?
    File.WriteAllText("tmp-utf8.txt", text, Encoding.UTF8);
    File.WriteAllText("tmp-cp1252.txt", text, Encoding.GetEncoding(1252));
    string textUtf8 = File.ReadAllText("tmp-utf8.txt", Encoding.UTF8);
    string text1252 = File.ReadAllText("tmp-cp1252.txt", Encoding.GetEncoding(1252));
    if (textUtf8 != text1252)
    {
        // ... do something
    }
Finally I want to print out something like this:
"a"->"a"
"b"->"b"
"Ф"->"F"
"σ"->"s"
"ξ"->"?"
"ψ"->"?"
You can use Encoding.GetBytes to get the exact byte representation, and SequenceEqual to compare. Finding the exact index of the differences is difficult, because UTF-8 uses multi-byte sequences in some cases.
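One way to avoid the byte-index problem is to skip the byte comparison and round-trip each character through the target encoding in memory: encode it with GetBytes, decode it again with GetString, and compare the result with the original. The following is only a minimal sketch of that idea. The names EncodingDiff, MapThrough and SurvivesRoundTrip are made up for illustration, and the sample string is taken from the question. What a character that does not exist in 1252 turns into ("?" or a look-alike such as "F") depends on the fallback the runtime configures for that encoding, so no output is shown here.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

// Hypothetical helper class, not part of the .NET library.
static class EncodingDiff
{
    // True if the whole text comes back unchanged after encoding and decoding.
    public static bool SurvivesRoundTrip(string text, Encoding target)
    {
        return text == target.GetString(target.GetBytes(text));
    }

    // Yields each distinct text element together with what it becomes after
    // a round trip through the target encoding. Text elements are used
    // instead of chars so that surrogate pairs are not split.
    public static IEnumerable<(string Original, string RoundTripped)> MapThrough(
        string text, Encoding target)
    {
        var seen = new HashSet<string>();
        TextElementEnumerator elements = StringInfo.GetTextElementEnumerator(text);
        while (elements.MoveNext())
        {
            string element = elements.GetTextElement();
            if (!seen.Add(element))
            {
                continue; // already reported this element
            }
            string roundTripped = target.GetString(target.GetBytes(element));
            yield return (element, roundTripped);
        }
    }
}

class Program
{
    static void Main()
    {
        // Assumption: running on .NET Core / .NET 5+, where code page 1252 is
        // only available after registering the code pages provider. On the
        // classic .NET Framework this line should not be needed.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding cp1252 = Encoding.GetEncoding(1252);

        string text = "abФσξψ";
        if (!EncodingDiff.SurvivesRoundTrip(text, cp1252))
        {
            foreach (var (original, roundTripped) in EncodingDiff.MapThrough(text, cp1252))
            {
                Console.WriteLine($"\"{original}\"->\"{roundTripped}\"");
            }
        }
    }
}
```

No file is written, and because the comparison happens on decoded strings rather than bytes, the multi-byte UTF-8 sequences never enter the picture.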
Maybe something like