CoreFoundation UTF-16 unpaired surrogate


I'm trying to convert from UTF-16 to, say, UTF-32 using Apple's Core Foundation API:

cfString = CFStringCreateWithBytes(nullptr, str, strLen, kCFStringEncodingUTF16, FALSE);

auto range = CFRangeMake(0, CFStringGetLength(cfString));

CFStringGetBytes(cfString, range, kCFStringEncodingUTF32, 0, false, buffer, bufferSize, &usedSize);

Most of the time this works, until the input buffer contains an unpaired surrogate, say U+DF9F. Core Foundation then simply returns the output without the ill-formed characters.

So, to be somewhat Unicode compliant, I have to detect that situation manually and follow the Unicode documentation to emit the standard substitution, U+FFFD: http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf
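The manual detection described above could be sketched roughly as follows. This is a minimal illustration in plain C++ that does not call Core Foundation at all; `utf16ToUtf32Lossy` is a hypothetical helper name, and the sketch assumes the input has already been read as native-endian UTF-16 code units.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not a Core Foundation API): decodes UTF-16 code units
// into UTF-32 code points and replaces every unpaired surrogate with U+FFFD.
std::vector<char32_t> utf16ToUtf32Lossy(const char16_t* in, std::size_t len) {
    constexpr char32_t kReplacement = 0xFFFD;
    std::vector<char32_t> out;
    out.reserve(len);
    for (std::size_t i = 0; i < len; ++i) {
        const char16_t u = in[i];
        if (u >= 0xD800 && u <= 0xDBFF) {
            // High (lead) surrogate: valid only if a low surrogate follows.
            if (i + 1 < len && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                const char32_t hi = static_cast<char32_t>(u - 0xD800);
                const char32_t lo = static_cast<char32_t>(in[i + 1] - 0xDC00);
                out.push_back(0x10000 + (hi << 10) + lo);
                ++i;  // the low surrogate has been consumed as well
            } else {
                out.push_back(kReplacement);
            }
        } else if (u >= 0xDC00 && u <= 0xDFFF) {
            // Low (trail) surrogate with no preceding high surrogate.
            out.push_back(kReplacement);
        } else {
            out.push_back(u);  // ordinary BMP code unit
        }
    }
    return out;
}
```

With input like A followed by U+DF9F, this yields two code points: U+0041 and U+FFFD. The result could then be written out in whichever UTF-32 byte order is needed.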

The same happens with other encodings: with, say, a 0x80 byte in the middle of UTF-8 input, CFStringCreateWithBytes always returns nullptr instead of pointing at the invalid character.
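For the UTF-8 case, one way to find the offending position myself before handing the bytes to CFStringCreateWithBytes would be a small pre-validation pass. The following is only a sketch in plain C++; `firstInvalidUtf8` is a hypothetical helper, not something Core Foundation provides, and it follows the well-formed byte ranges from chapter 3 of the Unicode Standard.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not a Core Foundation API): returns the offset of the
// first byte that starts an ill-formed UTF-8 sequence, or len if the whole
// buffer is well-formed.
std::size_t firstInvalidUtf8(const std::uint8_t* s, std::size_t len) {
    std::size_t i = 0;
    while (i < len) {
        const std::uint8_t b = s[i];
        if (b <= 0x7F) { ++i; continue; }   // ASCII
        std::size_t need = 0;               // number of continuation bytes
        std::uint8_t lo = 0x80, hi = 0xBF;  // allowed range for the 2nd byte
        if (b >= 0xC2 && b <= 0xDF)      { need = 1; }
        else if (b == 0xE0)              { need = 2; lo = 0xA0; }  // no overlongs
        else if (b == 0xED)              { need = 2; hi = 0x9F; }  // no surrogates
        else if (b >= 0xE1 && b <= 0xEF) { need = 2; }
        else if (b == 0xF0)              { need = 3; lo = 0x90; }  // no overlongs
        else if (b >= 0xF1 && b <= 0xF3) { need = 3; }
        else if (b == 0xF4)              { need = 3; hi = 0x8F; }  // <= U+10FFFF
        else return i;  // stray continuation byte (0x80..0xBF) or invalid lead
        for (std::size_t k = 1; k <= need; ++k) {
            if (i + k >= len) return i;           // truncated sequence
            const std::uint8_t c = s[i + k];
            if (c < lo || c > hi) return i;       // bad continuation byte
            lo = 0x80; hi = 0xBF;                 // later bytes use full range
        }
        i += need + 1;
    }
    return len;
}
```

For a buffer such as 0x41 0x80 0x42 this returns 1, the offset of the stray 0x80. The caller could then report that offset or splice in the UTF-8 form of U+FFFD (EF BF BD) before creating the CFString.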

Is that expected behaviour or undefined behaviour in Core Foundation, or is there maybe a way to tune CF so that it reports malformed input?

UPDATE:

I did exactly the following:

UInt8 str[] = {0x41, 0x00, 0x9f, 0xdf}; // corresponds to Unicode 'A' followed by an unpaired surrogate (U+DF9F), little-endian

CFStringRef mystr = CFStringCreateWithBytes(nullptr, str, 4, kCFStringEncodingUTF16, false);

After that, mystr has a length of 2 according to CFStringGetLength(), so it looks like the invalid character gets processed.

std::vector<char> utf8(7); // renamed so it does not clash with the str byte array above
CFStringGetCString(mystr, utf8.data(), utf8.size(), kCFStringEncodingUTF8);

That returns false, so no conversion to UTF-8 is possible, and the Xcode debugger watch shows nothing for mystr. So there is no output for UTF-8 or as a C string. After that, I checked the conversion to UTF-32 with CFStringGetBytes:

result = CFStringGetBytes(mystr, range, kCFStringEncodingUTF32BE, 0, false, buffer, bufferSize, &usedSize);

That gives me usedSize=4 and result=1, and the output contains 0x0041, so only the A character was converted. That is why I think no substitution happened for the malformed surrogate.
