Compare Windows-1252 string to UTF-8 string


My goal is to convert a .NET string (which is Unicode internally) into Windows-1252 and - if the conversion is lossy - store the original string, UTF-8-encoded, in a Base64 entity.

For example, the string "DJ Doena" converted to 1252 is still "DJ Doena".

However, if you convert the Japanese kanji for tree (木) into 1252, you end up with a question mark.
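That lossy substitution can be observed directly. A minimal sketch (note that on .NET Core / .NET 5+ the Windows code pages must first be registered via `Encoding.RegisterProvider(CodePagesEncodingProvider.Instance)`; on .NET Framework they are available by default):

```csharp
using System;
using System.Text;

class Fallback1252Demo
{
    static void Main()
    {
        // On .NET Core / .NET 5+, register the code-page encodings first:
        // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding win1252 = Encoding.GetEncoding(1252);

        // Windows-1252 has no mapping for 木; the default replacement
        // fallback substitutes '?' (byte 0x3F).
        byte[] bytes = win1252.GetBytes("木");
        Console.WriteLine(bytes[0]);                  // 63
        Console.WriteLine(win1252.GetString(bytes));  // ?
    }
}
```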

These are my test strings:

String doena = "DJ Doena";
String umlaut = "äöüßéèâ";
String allIn = "< ä ß á â & 木 >";

This is how I convert the string in the first place:

using (MemoryStream ms = new MemoryStream())
{
    using (StreamWriter sw = new StreamWriter(ms, Encoding.UTF8))
    {
        sw.Write(decoded);
        sw.Flush();
        ms.Seek(0, SeekOrigin.Begin);
        using (StreamReader sr = new StreamReader(ms, Encoding.GetEncoding(1252)))
        {
            encoded = sr.ReadToEnd();
        }
    }
}

Problem is, while debugging, string comparison claims that both strings are identical, so a simple == or .Equals() doesn't suffice.

This is how I try to find out if I need base64 and produce it:

private static String GetBase64Alternate(String utf8Text, String windows1252Text)
{
    Byte[] utf8Bytes;
    Byte[] windows1252Bytes;
    String base64;

    utf8Bytes = Encoding.UTF8.GetBytes(utf8Text);
    windows1252Bytes = Encoding.GetEncoding(1252).GetBytes(windows1252Text);
    base64 = null;
    if (utf8Bytes.Length != windows1252Bytes.Length)
    {
        base64 = Convert.ToBase64String(utf8Bytes);
    }
    else
    {
        for(Int32 i = 0; i < utf8Bytes.Length; i++)
        {
            if(utf8Bytes[i] != windows1252Bytes[i])
            {
                base64 = Convert.ToBase64String(utf8Bytes);
                break;
            }
        }
    }
    return (base64);
}

The first string, doena, is completely identical and doesn't produce a Base64 result.

Console.WriteLine(String.Format("{0} / {1}", windows1252Text, base64Text));

results in

DJ Doena /

But the second string, umlaut, already takes twice as many bytes in UTF-8 as in 1252 and thus produces a Base64 string even though it does not appear to be necessary:

äöüßéèâ / w6TDtsO8w5/DqcOow6I=

And the third one does what it's supposed to do (no more "木" but a "?", so Base64 is needed):

< ä ß á â & ? > / PCDDpCDDnyDDoSDDoiAmIOacqCA+

Any clues how my Base64 getter could be enhanced (a) for performance and (b) for better results?

Thank you in advance. :-)


There are 2 answers

Peter Duniho (best answer)

I'm not sure I completely understood the question. But I tried. :) If I do understand correctly, this code does what you want:

static void Main(string[] args)
{
    string[] testStrings = { "DJ Doena", "äöüßéèâ", "< ä ß á â & 木 >" };

    foreach (string text in testStrings)
    {
        Console.WriteLine(ReencodeText(text));
    }
}

private static string ReencodeText(string text)
{
    Encoding encoding = Encoding.GetEncoding(1252);
    string text1252 = encoding.GetString(encoding.GetBytes(text));

    return text.Equals(text1252, StringComparison.Ordinal) ?
        text : Convert.ToBase64String(Encoding.UTF8.GetBytes(text));
}

I.e. it encodes the text to Windows-1252, then decodes back to a string object, which it then compares with the original. If the comparison succeeds, it returns the original string, otherwise it encodes it to UTF8, and then to base64.

It produces the following output:

DJ Doena
äöüßéèâ
PCDDpCDDnyDDoSDDoiAmIOacqCA+

In other words, the first two strings are left intact, while the third is encoded as base64.

Guffa

In your first code you are encoding the string using one encoding, then decoding it using a different encoding. That doesn't give you any reliable result at all; it's the equivalent of writing out a number in octal, then reading it as if it was in decimal. It seems to work just fine for numbers up to 7, but after that you get useless results.
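A small sketch of that mismatch: encoding a string with UTF-8 and then decoding the bytes as Windows-1252 reinterprets each UTF-8 byte as its own 1252 character (again assuming the code-page encodings are available):

```csharp
using System;
using System.Text;

class MixedEncodingDemo
{
    static void Main()
    {
        // "ä" encodes to two bytes in UTF-8: 0xC3 0xA4.
        byte[] utf8Bytes = Encoding.UTF8.GetBytes("ä");

        // Decoding those bytes as Windows-1252 treats each byte as one
        // character: 0xC3 -> 'Ã', 0xA4 -> '¤'.
        string mangled = Encoding.GetEncoding(1252).GetString(utf8Bytes);
        Console.WriteLine(mangled); // Ã¤
    }
}
```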

The problem with the GetBase64Alternate method is that it's encoding a string to two different encodings, and assumes that the first encoding doesn't support some of the characters if the second encoding resulted in a different set of bytes.

Comparing the byte sequences doesn't tell you whether any of the encodings failed. The sequences will be different if it failed, but it will also be different if there are any characters that are encoded differently between the encodings.
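For instance (a hedged sketch, same code-page caveat as above): "ä" is perfectly representable in both encodings, yet the byte sequences still differ, so the byte comparison reports a false positive:

```csharp
using System;
using System.Text;

class ByteLengthDemo
{
    static void Main()
    {
        // Both encodings succeed for "ä", but produce different bytes:
        byte[] b1252 = Encoding.GetEncoding(1252).GetBytes("ä"); // { 0xE4 }
        byte[] bUtf8 = Encoding.UTF8.GetBytes("ä");              // { 0xC3, 0xA4 }
        Console.WriteLine(b1252.Length); // 1
        Console.WriteLine(bUtf8.Length); // 2
    }
}
```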

What you want to do is to determine if the encoding actually worked for all characters. You can do that by creating an Encoding instance with a fallback for unsupported characters. There is an EncoderExceptionFallback class that you can use for that, which throws an EncoderFallbackException if it's called.

This code will try to use the Windows-1252 encoding on a string, and set the ok variable to false if the encoding doesn't support all characters in the string:

Encoding e = Encoding.GetEncoding(1252, new EncoderExceptionFallback(), new DecoderExceptionFallback());
bool ok = true;
try {
  e.GetByteCount(allIn);
} catch (EncoderFallbackException) {
  ok = false;
}

As you are not actually going to use the encoded result for anything, you can use the GetByteCount method. It checks how all characters would be encoded without producing the encoded result.

Used in your method it would be:

private static String GetBase64Alternate(string text) {
  Encoding e = Encoding.GetEncoding(1252, new EncoderExceptionFallback(), new DecoderExceptionFallback());
  bool ok = true;
  try {
    e.GetByteCount(text);
  } catch (EncoderFallbackException) {
    ok = false;
  }
  return ok ? null : Convert.ToBase64String(Encoding.UTF8.GetBytes(text));
}