I have a working algorithm to convert a UTF-8 string to a UTF-32 string; however, I have to allocate all the space for my UTF-32 string ahead of time. Is there any way to know how many characters in UTF-32 a UTF-8 string will take up?
For example, the UTF-8 string "¥0" is 3 chars (bytes), and once converted to UTF-32 it is 2 unsigned ints (code points). Is there any way to know the number of UTF-32 'chars' I will need before doing the conversion? Or am I going to have to re-write the algorithm?
There are two basic options:
1. Make two passes through the UTF-8 string: the first counts the number of UTF-32 characters you'll need to generate, and the second actually writes them to a buffer.
2. Allocate the maximum number of 32-bit chars you could possibly need -- i.e., the length of the UTF-8 string in bytes. This is wasteful of memory, but it means you can transform UTF-8 to UTF-32 in one pass.
You could also use a hybrid -- e.g., if the string is shorter than some threshold, use the second approach; otherwise, use the first.
For the first approach, the first pass would look something like this:
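Here is a minimal sketch of that counting pass in C (the function name is my own). It relies on the fact that in valid UTF-8 every code point begins with exactly one byte that is *not* a continuation byte (continuation bytes have the bit pattern `10xxxxxx`), so counting the non-continuation bytes gives the number of UTF-32 code points:

```c
#include <stddef.h>

/* Count how many UTF-32 code points a NUL-terminated UTF-8 string
 * will decode to, assuming the input is valid UTF-8.
 * A byte is a continuation byte iff its top two bits are 10, i.e.
 * (b & 0xC0) == 0x80; every code point has exactly one byte that
 * fails this test, so we count those. */
size_t utf8_codepoint_count(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For the "¥0" example above, the UTF-8 bytes are `C2 A5 30`; only `C2` and `30` are non-continuation bytes, so the function returns 2 -- the number of unsigned ints you need to allocate.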