Strings encoded as ASCII and UTF-8 have different lengths!


I'm reading a stream and am wondering why the UTF-8 encoded string is shorter than the ASCII one.

  ASCIIEncoding encoder = new ASCIIEncoding();
  UTF8Encoding enc = new UTF8Encoding();
  string response = encoder.GetString(message, 0, bytesRead);  // response.Length == 4096
  string responseUtf8 = enc.GetString(message, 0, bytesRead);  // responseUtf8.Length == 3955

There are 4 answers

0
Guffa On BEST ANSWER

That's because the stream is actually UTF-8 encoded. If it were pure ASCII, the two strings would be identical.

When read as ASCII, each byte of a multi-byte sequence (a sequence that represents a character outside the 0-127 range) is read as a separate character, so those characters come out as garbage.

When read as UTF-8, the byte combinations will be decoded into the correct characters, each multi-byte combination ending up as a single character.

(Note: Strings are not encoded; it is the stream that is encoded. You decode the stream's bytes, as ASCII or as UTF-8, into a Unicode character string.)
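
To make the length difference concrete, here is a minimal sketch that decodes the same bytes both ways. The sample text and the variable names are illustrative assumptions, not the asker's actual stream, and the top-level statements assume C# 9 or later.

```csharp
using System;
using System.Text;

// Sample payload (an assumption): "naïve café" encoded as UTF-8.
// ï and é each take two bytes, so 10 characters become 12 bytes.
byte[] message = Encoding.UTF8.GetBytes("na\u00EFve caf\u00E9");

string asAscii = Encoding.ASCII.GetString(message); // every byte becomes one char
string asUtf8 = Encoding.UTF8.GetString(message);   // each multi-byte sequence becomes one char

Console.WriteLine(message.Length);  // 12
Console.WriteLine(asAscii.Length);  // 12
Console.WriteLine(asUtf8.Length);   // 10
Console.WriteLine(asAscii);         // na??ve caf??
```

The ASCII result always has one character per byte, which is why it matches bytesRead in the question, while the UTF-8 result is shorter by the number of extra bytes in the multi-byte sequences.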

0
Martin Törnwall On

Perhaps the message contained some characters that couldn't be encoded as a single byte in UTF-8.

6
Adonais On

UTF-8 handles strings differently than ASCII. In UTF-8, a character may be 1, 2, 3 or 4 bytes long, whereas ASCII treats each byte as one character. The C# UTF-8 decoder counts well-formed UTF-8 characters rather than bytes (a 4-byte character becomes a surrogate pair, so it adds 2 to the string length). I hope this helps.
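
As a rough illustration of the variable width, the sketch below prints the UTF-8 byte count for one sample character per sequence length. UTF-8 sequences can be up to four bytes long. The sample characters are arbitrary choices for illustration, and the top-level statements assume C# 9 or later.

```csharp
using System;
using System.Text;

// One sample character for each UTF-8 sequence length: A, é, 語, and an emoji (U+1F600).
string[] samples = { "A", "\u00E9", "\u8A9E", "\U0001F600" };

foreach (string s in samples)
{
    int byteCount = Encoding.UTF8.GetByteCount(s);
    // s.Length counts UTF-16 code units, so the 4-byte sample reports a length of 2.
    Console.WriteLine($"U+{char.ConvertToUtf32(s, 0):X4}: {byteCount} byte(s), string length {s.Length}");
}
```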

0
Timwi On

Because when decoding bytes, ASCIIEncoding replaces all bytes greater than 127 (0x7F) with a question mark (?) which is one character, while UTF8Encoding decodes UTF-8 multi-byte sequences correctly into single characters (for example, the three bytes 232,170,158 become the single character 語).
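
The three-byte example above can be checked with a short sketch like this one (variable names are illustrative; top-level statements assume C# 9 or later):

```csharp
using System;
using System.Text;

byte[] bytes = { 232, 170, 158 }; // 0xE8 0xAA 0x9E, the UTF-8 form of U+8A9E

string asAscii = Encoding.ASCII.GetString(bytes); // each byte above 127 becomes '?'
string asUtf8 = Encoding.UTF8.GetString(bytes);   // the whole sequence becomes one char

Console.WriteLine(asAscii);                   // ???
Console.WriteLine(asUtf8.Length);             // 1
Console.WriteLine((int)asUtf8[0] == 0x8A9E);  // True
```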