I have created a Java wrapper around a native C library and have a question about string encodings. There are slight differences between the "modified UTF-8" encoding that Java uses and regular UTF-8, and these differences can cause serious problems: the JNI functions may crash the app when passed regular UTF-8, because it may contain byte sequences that are forbidden in modified UTF-8. Please see the following topic: What does it mean to say "Java Modified UTF-8 Encoding"?
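For illustration, the two encodings differ on exactly two points: U+0000 is encoded as the two-byte sequence C0 80 instead of a single 00 byte, and supplementary characters are encoded as two separate three-byte surrogate sequences rather than one four-byte sequence. A small sketch can show both, using the fact that DataOutputStream.writeUTF happens to emit modified UTF-8 (after a two-byte length prefix):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ModifiedUtf8Demo {

    // Standard UTF-8 encoding of a string.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Modified UTF-8 encoding of a string; DataOutputStream.writeUTF emits it
    // preceded by a two-byte length, which we strip off here.
    static byte[] modifiedUtf8(String s) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        new DataOutputStream(bos).writeUTF(s);
        return Arrays.copyOfRange(bos.toByteArray(), 2, bos.size());
    }

    public static void main(String[] args) throws IOException {
        // U+0000 plus one supplementary character (U+1F600, a surrogate pair in Java)
        String s = "\u0000" + new String(Character.toChars(0x1F600));
        // Standard UTF-8: 00 F0 9F 98 80 (5 bytes)
        System.out.println(standardUtf8(s).length);  // 5
        // Modified UTF-8: C0 80 ED A0 BD ED B8 80 (8 bytes)
        System.out.println(modifiedUtf8(s).length);  // 8
    }
}
```

The 8-byte form contains both patterns a strict UTF-8 decoder rejects: the overlong C0 80 and bytes in the surrogate range ED A0..ED BF.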
My question is: what is a standard, reliable way to convert Java's modified UTF-8 to regular UTF-8 and back?
First, consider whether you really need or want to do that. The only reason I can think of for doing so in the context of wrapping a C library is to use the JNI functions that work with Java Strings in terms of byte arrays encoded in modified UTF-8, but that's neither the only nor the best way to proceed, except in rather specific circumstances.

For most cases, I would recommend going directly from UTF-8 to String objects, and getting Java to do most of that work. Simple tools Java provides for that include the constructor
String(byte[], String), which initializes a String from data in an encoding you specify, and String.getBytes(String), which gives you the string's character data in the encoding of your choice. Both are limited to encodings known to the JVM, but UTF-8 is guaranteed to be among them. You can use these directly from your JNI code, or provide purpose-built wrapper methods for your JNI code to invoke.

If you really do want the modified UTF-8 form for its own sake, then your JNI code can obtain it from the corresponding Java String (obtained as summarized above) via the GetStringUTFChars JNI function, and you can go the other way with NewStringUTF. Of course, this makes Java Strings the intermediate form, which is entirely apropos in this case.
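As a concrete sketch of the String-as-intermediate approach, and assuming the data stays under the 64 KB limit of the DataInput/DataOutput contract, a pure-Java conversion in both directions might look like this (DataInputStream.readUTF and DataOutputStream.writeUTF speak modified UTF-8, with a two-byte length prefix that has to be added or stripped):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ModifiedUtf8Converter {

    // Modified UTF-8 (e.g. bytes obtained via GetStringUTFChars) -> standard UTF-8.
    // readUTF decodes modified UTF-8 but expects a two-byte length prefix, so prepend one.
    static byte[] modifiedToStandard(byte[] modified) throws IOException {
        ByteArrayOutputStream prefixed = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(prefixed);
        out.writeShort(modified.length);
        out.write(modified);
        String s = new DataInputStream(
                new ByteArrayInputStream(prefixed.toByteArray())).readUTF();
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Standard UTF-8 -> modified UTF-8, again with a String as the intermediate form.
    static byte[] standardToModified(byte[] standard) throws IOException {
        String s = new String(standard, StandardCharsets.UTF_8);
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        new DataOutputStream(bos).writeUTF(s);
        // Drop the two-byte length prefix that writeUTF adds.
        return Arrays.copyOfRange(bos.toByteArray(), 2, bos.size());
    }

    public static void main(String[] args) throws IOException {
        // U+0000 is the classic trouble case: 00 in standard UTF-8, C0 80 in modified.
        byte[] modified = standardToModified(new byte[] { 0 });
        System.out.printf("%02X %02X%n", modified[0] & 0xFF, modified[1] & 0xFF);  // C0 80
        System.out.println(modifiedToStandard(modified).length);  // 1
    }
}
```

In actual JNI code, though, it is usually simpler to skip the byte-level conversion entirely: either stay in UTF-16 via NewString/GetStringChars, or have the native side call the String(byte[], String) and getBytes(String) pair described above.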