Converting UTF-8 to UTF-32


I have not been able to find any help on this online. As I have seen in many C++23 programs, what I want to write is:

for(char32_t c : utf8string | utf8to32())

so that I can work on each individual code point. Preferably I'd like to use boost::locale for that, since I already call

boost::locale::normalize(in.begin(),in.end(),boost::locale::norm_nfc)

in the constructor. Online I've seen mention of a mysterious boost/text/transcode_iterator.hpp, which does not exist on my system. But even that wouldn't provide the right class for the utf8to32 above. Any hints on where I could find that? How do I even go about writing such a class? Obviously I need to wrap another iterator around the existing iterator of std::string. And then do I need to put that new iterator into some container-like type so that the range-based for loop accepts it? Are there any examples of how to do that, assuming I have already found a fitting iterator implementation?

Just to be clear: I want to read UTF-8 code points, compare them to some given code points, and maybe mark where they are for future use. That is easy to do in 7-bit ASCII; how do I do it in UTF-8? It's not as if there are plenty of fully 21-bit, UTF-8-capable grammar parsers around yet...


There is 1 answer

n. m. could be an AI
  1. Boost.Text is a proposed library, not an actual part of Boost. You can download it from GitHub, but it won't be in your package manager yet.
  2. Boost does ship several implementations of more or less what you want. In particular, boost/regex/pending/unicode_iterator.hpp provides a u8_to_u32_iterator class, which seems particularly close.
  3. If you are not firmly set on using iterator adaptors, ranges, or other such machinery, I would recommend https://github.com/simdutf/simdutf, which is a Unicode transcoding library built for speed.
  4. Having said that, you don't really need any of this. If you only need to check your input against a few characters, just store those characters as UTF-8 strings and compare them to subsequences of your input. As an optimisation, you can skip bytes that cannot start a UTF-8 sequence (a really simple check to write).
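
A minimal sketch of point 4, assuming the input is already valid UTF-8 and C++17 is available. The names can_start_sequence and find_code_points are made up for this illustration and do not come from Boost or simdutf. The code records the byte offset of every match so the positions can be used later, and it skips continuation bytes (those of the form 10xxxxxx), which can never begin a code point.

```cpp
#include <cstddef>
#include <string_view>
#include <utility>
#include <vector>

// True if this byte can begin a UTF-8 sequence, i.e. it is not a
// continuation byte of the form 10xxxxxx.
constexpr bool can_start_sequence(unsigned char b) {
    return (b & 0xC0) != 0x80;
}

// Hypothetical helper: returns one (byte offset, needle index) pair for each
// position in `text` where one of the UTF-8 encoded `needles` occurs.
// Assumes `text` and every needle are valid UTF-8.
std::vector<std::pair<std::size_t, std::size_t>>
find_code_points(std::string_view text,
                 const std::vector<std::string_view>& needles) {
    std::vector<std::pair<std::size_t, std::size_t>> hits;
    for (std::size_t i = 0; i < text.size(); ++i) {
        // Optimisation from point 4: a needle starts with a lead byte,
        // so it cannot match in the middle of a multi-byte sequence.
        if (!can_start_sequence(static_cast<unsigned char>(text[i])))
            continue;
        for (std::size_t n = 0; n < needles.size(); ++n) {
            // substr clamps the length at the end of the text, so a needle
            // that would run past the end simply fails to compare equal.
            if (!needles[n].empty() &&
                text.substr(i, needles[n].size()) == needles[n]) {
                hits.emplace_back(i, n);
                break;  // at most one needle is reported per position
            }
        }
    }
    return hits;
}
```

For example, with the needles "\xE2\x82\xAC" (U+20AC) and "\xC3\xA9" (U+00E9), each returned pair gives the byte offset in the input and the index of the needle that matched there. No decoding to UTF-32 is involved, which is why this works on the std::string directly.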