How to generate a random Unicode string including supplementary characters?


I'm working on some code for generating random strings. The resulting string appears to contain invalid char combinations. Specifically, I find high surrogates which are not followed by a low surrogate.

Can anyone explain why this is happening? Do I have to explicitly generate a random low surrogate to follow a high surrogate? I had assumed this wasn't needed, as I was using the int variants of the Character class.

A recent run of the test code below produced the following bad pairings:

Bad pairing: d928 - d863
Bad pairing: da02 - 7bb6
Bad pairing: dbbc - d85c
Bad pairing: dbc6 - d85c
public static void main(String[] args) {
  Random r = new Random();
  StringBuilder builder = new StringBuilder();

  int count = 500;
  while (count > 0) {
    int codePoint = r.nextInt(Character.MAX_CODE_POINT + 1);

    if (!Character.isDefined(codePoint)
        || Character.getType(codePoint) == Character.PRIVATE_USE) {
      continue;
    }

    builder.appendCodePoint(codePoint);
    count--;
  }

  String result = builder.toString();

  // Test the result
  char lastChar = 0;
  for (int i = 0; i < result.length(); i++) {
    char c = result.charAt(i);
    if (Character.isHighSurrogate(lastChar) && !Character.isLowSurrogate(c)) {
      System.out.println(String.format("Bad pairing: %s - %s",
          Integer.toHexString(lastChar), Integer.toHexString(c)));
    }
    lastChar = c;
  }
}

There are 2 answers

nwellnhof (BEST ANSWER)

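The random code point can itself fall in the surrogate range (U+D800 to U+DFFF): Character.isDefined returns true for surrogates, and appendCodePoint appends such a value as a single char. If that produces a lone low surrogate, or a high surrogate not followed by a low surrogate, the resulting string is invalid. The solution is to simply exclude all surrogates: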

if (!Character.isDefined(codePoint)
    || (codePoint <= Character.MAX_VALUE && Character.isSurrogate((char) codePoint))
    || Character.getType(codePoint) == Character.PRIVATE_USE) {
  continue;
}
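
To see why the original filter lets surrogates through, here is a minimal sketch that probes a single surrogate code point. The class name SurrogateProbe is made up for illustration; the Character and StringBuilder calls are the standard JDK ones.

```java
class SurrogateProbe {
    public static void main(String[] args) {
        int highSurrogate = 0xD800;    // first high surrogate
        int supplementary = 0x1F600;   // a code point above U+FFFF

        // Surrogates count as "defined", so isDefined alone does not reject them.
        System.out.println(Character.isDefined(highSurrogate));

        // getType is what identifies them as surrogates.
        System.out.println(Character.getType(highSurrogate) == Character.SURROGATE);

        // appendCodePoint writes a surrogate code point as one lone char,
        // while a supplementary code point becomes a proper two-char pair.
        System.out.println(new StringBuilder().appendCodePoint(highSurrogate).length());
        System.out.println(new StringBuilder().appendCodePoint(supplementary).length());
    }
}
```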

Alternatively, checking only the type returned by getType should work:

int type = Character.getType(codePoint);
if (type == Character.PRIVATE_USE ||
    type == Character.SURROGATE ||
    type == Character.UNASSIGNED)
    continue;

(Technically, you could also allow randomly generated high surrogates and add another random low surrogate, but this would only create other random code points >= 0x10000 which might in turn be undefined or for private use.)
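
Putting the getType variant together with the question's loop, a complete version might look like the sketch below. The class name RandomUnicodeString and the helpers randomString and hasUnpairedSurrogate are names chosen here for illustration, not part of the original code; the check also flags lone low surrogates, which the question's test did not.

```java
import java.util.Random;

class RandomUnicodeString {

    // Builds a string of `count` random code points, skipping unassigned,
    // surrogate, and private-use code points.
    static String randomString(Random random, int count) {
        StringBuilder builder = new StringBuilder();
        int remaining = count;
        while (remaining > 0) {
            int codePoint = random.nextInt(Character.MAX_CODE_POINT + 1);
            int type = Character.getType(codePoint);
            if (type == Character.UNASSIGNED
                    || type == Character.SURROGATE
                    || type == Character.PRIVATE_USE) {
                continue;
            }
            builder.appendCodePoint(codePoint);
            remaining--;
        }
        return builder.toString();
    }

    // Returns true if s contains a high surrogate that is not followed by a
    // low surrogate, or a low surrogate that is not preceded by a high one.
    static boolean hasUnpairedSurrogate(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 >= s.length() || !Character.isLowSurrogate(s.charAt(i + 1))) {
                    return true;
                }
                i++; // skip the low surrogate of a valid pair
            } else if (Character.isLowSurrogate(c)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String result = randomString(new Random(), 500);
        System.out.println("Code points: " + result.codePointCount(0, result.length()));
        System.out.println("Unpaired surrogates: " + hasUnpairedSurrogate(result));
    }
}
```

Because every appended code point is either a non-surrogate BMP character or a supplementary character, any surrogate chars in the result can only come from well-formed pairs.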

JosefZ

You need to exclude all orphaned surrogates, both high and low.

FYI, the following excerpt from UnicodeData.txt shows the code point ranges for surrogates:

D800;<Non Private Use High Surrogate, First>;Cs;0;L;;;;;N;;;;;
DB7F;<Non Private Use High Surrogate, Last>;Cs;0;L;;;;;N;;;;;
DB80;<Private Use High Surrogate, First>;Cs;0;L;;;;;N;;;;;
DBFF;<Private Use High Surrogate, Last>;Cs;0;L;;;;;N;;;;;
DC00;<Low Surrogate, First>;Cs;0;L;;;;;N;;;;;
DFFF;<Low Surrogate, Last>;Cs;0;L;;;;;N;;;;;
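
Since the ranges above form one contiguous block from D800 to DFFF, the exclusion can also be written as a plain range check, or avoided entirely by shifting the random index past the block. The sketch below is one possible way to do that; the class SurrogateRange and its methods are hypothetical names, and the shifted value would still need the unassigned and private-use checks.

```java
class SurrogateRange {

    // Size of the surrogate block D800..DFFF.
    static final int SURROGATE_COUNT =
            Character.MAX_SURROGATE - Character.MIN_SURROGATE + 1;

    // Range check equivalent to getType(codePoint) == Character.SURROGATE.
    static boolean isSurrogateCodePoint(int codePoint) {
        return codePoint >= Character.MIN_SURROGATE
                && codePoint <= Character.MAX_SURROGATE;
    }

    // Maps an index in [0, MAX_CODE_POINT + 1 - SURROGATE_COUNT) to a code
    // point, jumping over the surrogate block so none is ever produced.
    static int skipSurrogates(int index) {
        return index < Character.MIN_SURROGATE ? index : index + SURROGATE_COUNT;
    }

    public static void main(String[] args) {
        java.util.Random random = new java.util.Random();
        int bound = Character.MAX_CODE_POINT + 1 - SURROGATE_COUNT;
        int codePoint = skipSurrogates(random.nextInt(bound));
        System.out.println(Integer.toHexString(codePoint)
                + " surrogate? " + isSurrogateCodePoint(codePoint));
    }
}
```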