Handling a UTF-8 encoded char* array

A file contains non-Latin content and is encoded in UTF-8. The existing code opens the file with fopen, parses it, and passes the non-Latin content to my validate function as a char*.

void validate(const char* str)
{
    ....
}

I have to do some validation on the passed char array.

The application is built with Sun C++ 5.11, which I think doesn't support Unicode. (I searched for Unicode support in Sun C++ 5.11 but found no useful pointers, so I wrote a simple program to check whether the compiler supports Unicode, and it didn't compile.)

How do I do the validation on the input char*? Is it possible using wchar_t?

1 answer

Answered by eerorika

The application uses <compiler>, which I think doesn't support Unicode

This isn't a problem. You only need compiler support for Unicode to embed Unicode string literals in the code, or to get fixed-width character types for representing UTF-16 or UTF-32. Your text is UTF-8 and arrives as input at run time, so no Unicode support from the compiler should be needed.
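As a minimal sketch of that point, the function below works on UTF-8 text using nothing but plain bytes, so it should compile with a pre-C++11 compiler. The name count_code_points is hypothetical, and the code assumes the input is already well-formed, NUL-terminated UTF-8.

```cpp
#include <cstddef>

// Hypothetical helper: counts Unicode code points in a NUL-terminated
// UTF-8 string by counting every byte that is NOT a continuation byte
// (continuation bytes have the bit pattern 10xxxxxx).
// Assumption: the input is already well-formed UTF-8.
std::size_t count_code_points(const char* str)
{
    std::size_t count = 0;
    const unsigned char* p = reinterpret_cast<const unsigned char*>(str);
    for (; *p != 0; ++p) {
        if ((*p & 0xC0) != 0x80) {
            ++count;  // ASCII byte or lead byte of a multi-byte sequence
        }
    }
    return count;
}
```

The cast to unsigned char matters because plain char may be signed, in which case bytes above 0x7F would compare as negative values.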

How do I validate the input char*?

The C++ standard library has very few tools for processing Unicode. Those it does provide mainly convert between Unicode encodings, and even those were not available before C++11.

Input and output are mostly just copying bytes, so they need no significant processing. Any other processing (which you presumably need for "validation") you will have to implement yourself or get from a third-party library. If you implement it yourself, you will need to refer to the roughly 1000 pages of the Unicode Standard: http://www.unicode.org/versions/Unicode9.0.0/UnicodeStandard-9.0.pdf
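If "validation" here means at least checking that the bytes are well-formed UTF-8, that part is small enough to write by hand. The following is only a sketch under that assumption: is_well_formed_utf8 is a hypothetical name, the input is assumed to be NUL-terminated, and the code sticks to C++98 so an older compiler should accept it. Anything beyond well-formedness (normalization, character classes, allowed scripts) is not covered.

```cpp
// Hypothetical helper: returns true if the NUL-terminated byte string is
// well-formed UTF-8. It rejects stray continuation bytes, truncated
// sequences, overlong encodings, UTF-16 surrogates, and values above
// U+10FFFF.
bool is_well_formed_utf8(const char* str)
{
    const unsigned char* p = reinterpret_cast<const unsigned char*>(str);
    while (*p != 0) {
        unsigned long cp;      // code point being decoded
        int extra;             // continuation bytes expected after the lead
        unsigned long min_cp;  // smallest code point legal for this length

        if (*p < 0x80)                { cp = *p;        extra = 0; min_cp = 0; }
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; min_cp = 0x80; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; min_cp = 0x800; }
        else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; min_cp = 0x10000; }
        else return false;  // stray continuation byte or invalid lead byte
        ++p;

        for (int i = 0; i < extra; ++i, ++p) {
            // A terminating NUL fails this test too, so truncated
            // sequences are rejected without reading past the end.
            if ((*p & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (*p & 0x3F);
        }

        if (cp < min_cp) return false;                   // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
        if (cp > 0x10FFFF) return false;                 // outside Unicode range
    }
    return true;
}
```

A validate(const char* str) function like the one in the question could call this first and return early on failure, before applying any application-specific rules.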

Is it possible using wchar_t?

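wchar_t is the native wide character type, used for the system's native wide character encoding. UTF-8 does not use wide code units: its code units are 8 bits, so char is the right type for holding it. wchar_t only becomes relevant if you first convert the text to a wide encoding.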
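If the validation needs whole code points rather than bytes, one option is to decode into a plain integer type instead of wchar_t, whose width and encoding are implementation-defined. This is a sketch only: decode_utf8 is a hypothetical name, and the code assumes the input has already been checked to be well-formed, NUL-terminated UTF-8.

```cpp
#include <vector>

// Hypothetical helper: decodes NUL-terminated UTF-8 into a sequence of
// code points. unsigned long is used because it is at least 32 bits wide,
// enough for any code point up to U+10FFFF.
// Assumption: the input is well-formed UTF-8 (no validation is done here).
std::vector<unsigned long> decode_utf8(const char* str)
{
    std::vector<unsigned long> out;
    const unsigned char* p = reinterpret_cast<const unsigned char*>(str);
    while (*p != 0) {
        unsigned long cp;
        int extra;  // continuation bytes that follow the lead byte
        if (*p < 0x80)      { cp = *p;        extra = 0; }
        else if (*p < 0xE0) { cp = *p & 0x1F; extra = 1; }
        else if (*p < 0xF0) { cp = *p & 0x0F; extra = 2; }
        else                { cp = *p & 0x07; extra = 3; }
        ++p;
        // The NUL check only guards against reading past the terminator
        // if the well-formedness assumption is violated.
        for (int i = 0; i < extra && *p != 0; ++i, ++p) {
            cp = (cp << 6) | (*p & 0x3F);
        }
        out.push_back(cp);
    }
    return out;
}
```

For example, decoding the bytes 0x41 0xC3 0xA9 ("A" followed by "é") yields the two code points U+0041 and U+00E9.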