I have created a Java wrapper around a native C library and have a question about string encodings. There are slight differences between the "modified UTF-8" encoding that Java uses and regular UTF-8, and these differences can cause serious problems: the JNI functions may crash the app when passed regular UTF-8, because it may contain byte sequences that are forbidden in modified UTF-8. Please see the following topic: What does it mean to say "Java Modified UTF-8 Encoding"?
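For illustration, the two encodings differ on exactly two points: U+0000 is encoded as the two-byte sequence C0 80 instead of a single 00 byte, and supplementary characters are encoded as two separate three-byte surrogate sequences rather than one four-byte sequence. A small sketch can show both, using the fact that DataOutputStream.writeUTF happens to emit modified UTF-8 (after a two-byte length prefix):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ModifiedUtf8Demo {

    // Standard UTF-8 encoding of a string.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Modified UTF-8 encoding of a string; DataOutputStream.writeUTF emits it
    // preceded by a two-byte length, which we strip off here.
    static byte[] modifiedUtf8(String s) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        new DataOutputStream(bos).writeUTF(s);
        return Arrays.copyOfRange(bos.toByteArray(), 2, bos.size());
    }

    public static void main(String[] args) throws IOException {
        // U+0000 plus one supplementary character (U+1F600, a surrogate pair in Java)
        String s = "\u0000" + new String(Character.toChars(0x1F600));
        // Standard UTF-8: 00 F0 9F 98 80 (5 bytes)
        System.out.println(standardUtf8(s).length);  // 5
        // Modified UTF-8: C0 80 ED A0 BD ED B8 80 (8 bytes)
        System.out.println(modifiedUtf8(s).length);  // 8
    }
}
```

The 8-byte form contains both patterns a strict UTF-8 decoder rejects: the overlong C0 80 and bytes in the surrogate range ED A0..ED BF.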
My question is: what is a standard, reliable way to convert Java's modified UTF-8 to regular UTF-8 and back?
First, consider whether you really need or want to do that. The only reason I can think of for doing so in the context of wrapping a C library is to use the JNI functions that work with Java Strings in terms of byte arrays encoded in modified UTF-8, but that's neither the only nor the best way to proceed, except in rather specific circumstances.

For most cases, I would recommend going directly from UTF-8 to String objects, and getting Java to do most of that work. Simple tools Java provides for that include the constructor
String(byte[], String), which initializes a String from data in an encoding you specify, and String.getBytes(String), which gives you the string's character data in the encoding of your choice. Both are limited to encodings known to the JVM, but UTF-8 is guaranteed to be among them. You can use these directly from your JNI code, or provide purpose-built wrapper methods for your JNI code to invoke.

If you really do want the modified UTF-8 form for its own sake, then your JNI code can obtain it from the corresponding Java String (obtained as summarized above) via the GetStringUTFChars JNI function, and you can go the other way with NewStringUTF. Of course, this makes Java Strings the intermediate form, which is entirely apropos in this case.
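As a concrete sketch of the String-as-intermediate approach, and assuming the data stays under the 64 KB limit of the DataInput/DataOutput contract, a pure-Java conversion in both directions might look like this (DataInputStream.readUTF and DataOutputStream.writeUTF speak modified UTF-8, with a two-byte length prefix that has to be added or stripped):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ModifiedUtf8Converter {

    // Modified UTF-8 (e.g. bytes obtained via GetStringUTFChars) -> standard UTF-8.
    // readUTF decodes modified UTF-8 but expects a two-byte length prefix, so prepend one.
    static byte[] modifiedToStandard(byte[] modified) throws IOException {
        ByteArrayOutputStream prefixed = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(prefixed);
        out.writeShort(modified.length);
        out.write(modified);
        String s = new DataInputStream(
                new ByteArrayInputStream(prefixed.toByteArray())).readUTF();
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Standard UTF-8 -> modified UTF-8, again with a String as the intermediate form.
    static byte[] standardToModified(byte[] standard) throws IOException {
        String s = new String(standard, StandardCharsets.UTF_8);
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        new DataOutputStream(bos).writeUTF(s);
        // Drop the two-byte length prefix that writeUTF adds.
        return Arrays.copyOfRange(bos.toByteArray(), 2, bos.size());
    }

    public static void main(String[] args) throws IOException {
        // U+0000 is the classic trouble case: 00 in standard UTF-8, C0 80 in modified.
        byte[] modified = standardToModified(new byte[] { 0 });
        System.out.printf("%02X %02X%n", modified[0] & 0xFF, modified[1] & 0xFF);  // C0 80
        System.out.println(modifiedToStandard(modified).length);  // 1
    }
}
```

In actual JNI code, though, it is usually simpler to skip the byte-level conversion entirely: either stay in UTF-16 via NewString/GetStringChars, or have the native side call the String(byte[], String) and getBytes(String) pair described above.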