Check if file contains only < 10 bit characters


This is homework. I'm not looking for code, just discussion and high-level suggestions on how to proceed.

I am currently working on an assignment where we are converting UTF-16 chars in a file to UTF-32 in an output file and vice versa. The assignment says that, as a first step, we should handle files containing only characters less than 10 bits, but I am stumped. This is our first assignment, and while I've used C++, I've never really used C.

I have been reading the RFC about such conversions (S.2.1) and I feel like I understand it pretty well. I understand that UTF-32 characters are actually 10 bits preceded by 6 bits defining its composition (I believe 110110 indicates the first pair of 16 bits and 110111 indicates the second pair of the "32"). Do UTF-16 chars start with 6 leading 0s?

Or is it that UTF-16 chars are just less than 10 bits, and once you hit a 10-bit character you know you've run into a UTF-32 one?

I guess my real question is what they mean by "10 bit chars" when it can be 8, 16, etc. But any insight into anything I mentioned would be great!

There is 1 answer

Remy Lebeau (best answer)

The assignment is badly worded and misleading.

Unicode defines codepoint values from U+0000 to U+10FFFF, which take up to 21 bits. All of the UTF encodings (UTF-8, UTF-16, and UTF-32) can represent that entire range, just in different ways.

UTF-8 and UTF-16 are variable-length encodings. The number of bytes needed to encode a given codepoint depends on the actual codepoint value. UTF-8 uses 1, 2, 3, or 4 8-bit codeunits. UTF-16 uses either 1 or 2 16-bit codeunits.
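To make the variable-length point concrete, here is a minimal sketch in C that reports how many codeunits a given codepoint needs in each encoding. The function names `utf8_units` and `utf16_units` are made up for this illustration, and returning 0 for values that cannot be encoded (surrogates, or anything above U+10FFFF) is just one possible convention.

```c
#include <stdint.h>

/* Number of 8-bit codeunits UTF-8 needs for codepoint cp.
 * Returns 0 if cp is not a valid Unicode scalar value. */
int utf8_units(uint32_t cp)
{
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0; /* reserved for UTF-16 surrogates */
    if (cp <= 0x7F)     return 1;
    if (cp <= 0x7FF)    return 2;
    if (cp <= 0xFFFF)   return 3;
    if (cp <= 0x10FFFF) return 4;
    return 0;                                   /* beyond the Unicode range */
}

/* Number of 16-bit codeunits UTF-16 needs for codepoint cp.
 * Returns 0 if cp is not a valid Unicode scalar value. */
int utf16_units(uint32_t cp)
{
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0; /* reserved for UTF-16 surrogates */
    if (cp <= 0xFFFF)   return 1;               /* fits in a single codeunit */
    if (cp <= 0x10FFFF) return 2;               /* needs a surrogate pair */
    return 0;                                   /* beyond the Unicode range */
}
```

For example, U+20AC takes 3 codeunits in UTF-8 but only 1 in UTF-16, while U+1F600 takes 4 in UTF-8 and 2 in UTF-16.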

UTF-32 is a fixed-length encoding. It always uses 1 32-bit codeunit, since most systems do not have a 21-bit data type.

Implementing UTF conversions is very easy (they are designed to be interchangeable), but you first need to know which encoding the source file is actually using. If the file starts with a UTF-16 or UTF-32 BOM, that is very easy to detect. However, if no BOM is present, then you need to either ask the user for the encoding, or else use heuristic analysis of the data to try to detect the encoding dynamically.
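As a rough sketch of the BOM check, assuming the first few bytes of the file have already been read into a buffer, something like the following would do. The `encoding_t` enum and `detect_bom` function are hypothetical names invented for this example. Note that the UTF-32 little-endian BOM (FF FE 00 00) begins with the UTF-16 little-endian BOM (FF FE), so the 4-byte patterns are tested first; a UTF-16LE file whose first character is U+0000 would be misdetected, which is one reason this is only a heuristic.

```c
#include <stddef.h>

/* Hypothetical result type for this sketch. */
typedef enum {
    ENC_UNKNOWN,
    ENC_UTF8,
    ENC_UTF16_LE,
    ENC_UTF16_BE,
    ENC_UTF32_LE,
    ENC_UTF32_BE
} encoding_t;

/* Inspect the first len bytes of a file and report which BOM, if any, is present. */
encoding_t detect_bom(const unsigned char *buf, size_t len)
{
    /* Check the 4-byte BOMs first: FF FE 00 00 also starts with FF FE. */
    if (len >= 4 && buf[0] == 0xFF && buf[1] == 0xFE && buf[2] == 0x00 && buf[3] == 0x00)
        return ENC_UTF32_LE;
    if (len >= 4 && buf[0] == 0x00 && buf[1] == 0x00 && buf[2] == 0xFE && buf[3] == 0xFF)
        return ENC_UTF32_BE;
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return ENC_UTF8;
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return ENC_UTF16_LE;
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return ENC_UTF16_BE;
    return ENC_UNKNOWN; /* no BOM: ask the user or fall back to heuristics */
}
```

The BOM also tells you the byte order, which you need to know before assembling bytes from the file into 16-bit or 32-bit codeunits.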

Once you know the encoding, the rest is straightforward:

  1. If UTF-16, read the file in 16-bit chunks (1 codeunit at a time), combining adjacent UTF-16 surrogate codeunits as needed (very easy to detect). For each completed sequence, recover the codepoint (the 16 bits of a single codeunit, or the 20 bits carried by a surrogate pair plus 0x10000) and output it as a single UTF-32 codeunit.

  2. If UTF-32, read the file in 32-bit chunks (1 codeunit at a time), extract the codepoint (up to 21 bits), and output it as either 1 or 2 UTF-16 codeunits as needed.
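The two steps above can be sketched in C roughly as follows. This is an illustration under assumptions, not a complete converter: the names `utf16_decode` and `utf16_encode` are invented here, the codeunits are assumed to already be in host byte order, and returning 0 on malformed input is only one way to report errors. It also shows where the "10 bits" comes from: a high surrogate is the 6-bit prefix 110110 followed by 10 payload bits (0xD800 to 0xDBFF), and a low surrogate is the prefix 110111 followed by 10 more (0xDC00 to 0xDFFF).

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one codepoint from the UTF-16 codeunits in[0..len).
 * Stores the codepoint (a UTF-32 codeunit) in *cp and returns how many
 * UTF-16 codeunits were consumed (1 or 2), or 0 on malformed input. */
size_t utf16_decode(const uint16_t *in, size_t len, uint32_t *cp)
{
    if (len == 0) return 0;

    uint16_t hi = in[0];
    if (hi < 0xD800 || hi > 0xDFFF) {   /* not a surrogate: codeunit is the codepoint */
        *cp = hi;
        return 1;
    }
    if (hi > 0xDBFF) return 0;          /* low surrogate with no high surrogate before it */
    if (len < 2) return 0;              /* high surrogate at end of input */

    uint16_t lo = in[1];
    if (lo < 0xDC00 || lo > 0xDFFF) return 0; /* high surrogate not followed by a low one */

    /* 10 payload bits from each surrogate form a 20-bit value; add 0x10000. */
    *cp = 0x10000 + (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(lo - 0xDC00));
    return 2;
}

/* Encode codepoint cp as UTF-16 into out[0..1].
 * Returns the number of codeunits written (1 or 2), or 0 if cp is invalid. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return 0;

    if (cp < 0x10000) {                 /* fits in a single codeunit */
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;                      /* leaves a 20-bit value */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));   /* upper 10 bits */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF)); /* lower 10 bits */
    return 2;
}
```

As a worked case, U+1F600 minus 0x10000 is 0xF600; its upper 10 bits are 0x03D and its lower 10 bits are 0x200, so the surrogate pair is 0xD83D 0xDE00. Any codeunit below 0xD800 or above 0xDFFF passes through unchanged.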

The most difficult part of the assignment is determining the encoding of the source file.