Java reading in character streams with supplementary unicode characters

Question

2k views Asked by wabledoodle At 11 October 2011 at 04:12

I'm having trouble reading supplementary Unicode characters in Java. I have a file that may contain characters in the supplementary set (anything greater than \uFFFF). When I set up my InputStreamReader to read the file as UTF-8, I would expect the read() method to return a single character for each supplementary character. Instead, it seems to split them at the 16-bit threshold.

I saw some other questions about basic Unicode character streams, but nothing seems to deal with the case of characters wider than 16 bits.

Here's some simplified sample code:

InputStreamReader input = new InputStreamReader(file, "UTF8");
int nextChar = input.read();
while(nextChar != -1) {
    ...
    nextChar = input.read();
}

Does anyone know what I need to do to correctly read in a UTF-8 encoded file that contains supplementary characters?

There are 2 answers

John Flatness On 11 October 2011 at 04:26

Though read() is defined to return int, and could theoretically return a supplementary character's code point "all at once", I believe the return type is only int to allow a value of -1 to be returned.

The value you're getting from read() is basically a char by another name, and in Java a char is limited to 16 bits.

Java can only represent supplementary characters as a UTF-16 surrogate pair. As far as Java is concerned, there is no such thing as a "single character" (at least in the char sense) once you get above 0xFFFF.
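
To make the point about surrogate pairs concrete, here is a minimal sketch. The class name SurrogateDemo and the choice of U+1F600 as the sample code point are assumptions made for illustration; any code point above 0xFFFF should behave the same way.

```java
// Hypothetical demo class, not part of the original question or answers.
final class SurrogateDemo {

    // Number of 16-bit chars Java needs to store the given code point.
    static int charsNeeded(int codePoint) {
        return Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        int codePoint = 0x1F600; // sample supplementary code point (assumption)

        // Character.toChars produces the UTF-16 encoding of the code point.
        String s = new String(Character.toChars(codePoint));

        System.out.println(s.length());                              // 2 chars
        System.out.println(s.codePointCount(0, s.length()));         // 1 code point
        System.out.println(Character.isHighSurrogate(s.charAt(0)));  // true
        System.out.println(Character.isLowSurrogate(s.charAt(1)));   // true
        System.out.println(s.codePointAt(0) == codePoint);           // true
    }
}
```

So a String holding one supplementary character has a length of 2, even though it contains a single code point.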

**C. K. Young** · Accepted Answer · 2011-10-11T04:24:49+00:00

Java works with UTF-16. So, if your input stream has astral characters, each one will appear as a surrogate pair, i.e., as two chars. The first char is the high surrogate, and the second char is the low surrogate.
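
One way to get whole code points out of a Reader is to recombine the two halves yourself. The following is only a sketch under stated assumptions: the class CodePointReading and its methods readCodePoint and decodeUtf8 are hypothetical helpers, not part of the JDK, and treating an unpaired high surrogate as an error is a design choice rather than a requirement.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class for illustration only.
final class CodePointReading {

    // Reads one full code point from the reader, or returns -1 at end of stream.
    static int readCodePoint(Reader in) throws IOException {
        int first = in.read();
        if (first == -1 || !Character.isHighSurrogate((char) first)) {
            return first; // end of stream, or a char that stands on its own
        }
        // A high surrogate must be followed by a low surrogate.
        int second = in.read();
        if (second == -1 || !Character.isLowSurrogate((char) second)) {
            throw new IOException("Unpaired high surrogate in input"); // assumption: fail fast
        }
        return Character.toCodePoint((char) first, (char) second);
    }

    // Decodes a UTF-8 byte array into an array of code points.
    static int[] decodeUtf8(byte[] utf8) {
        int[] out = new int[utf8.length]; // never more code points than bytes
        int n = 0;
        try (Reader in = new InputStreamReader(
                new ByteArrayInputStream(utf8), StandardCharsets.UTF_8)) {
            for (int cp = readCodePoint(in); cp != -1; cp = readCodePoint(in)) {
                out[n++] = cp;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return Arrays.copyOf(out, n);
    }

    public static void main(String[] args) {
        // "a", U+1F600 written as a surrogate pair, then "b".
        byte[] utf8 = "a\uD83D\uDE00b".getBytes(StandardCharsets.UTF_8);
        for (int cp : decodeUtf8(utf8)) {
            System.out.printf("U+%04X%n", cp); // U+0061, U+1F600, U+0062
        }
    }
}
```

If the whole text already fits in memory as a String, String.codePoints() (Java 8 and later) or String.codePointAt() may be simpler than pairing surrogates by hand.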
