NLTokenizer fails to enumerate words when text starts with Unicode "Other Symbol" followed directly by word


I'm trying to tokenize a string into words in a Cocoa app, but ran into a problem with NLTokenizer.

When the input string starts with a symbol from the Unicode category "Other Symbol" or the block "Specials", such as NSTextAttachment.character, tokenization fails (i.e. it returns an empty list).

The problem only occurs when the symbol is followed directly by a word without a space (see examples below).
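As a side note, the characters in the failing examples below do not all belong to the same general category. The following is a minimal sketch for checking this with the standard library's `Unicode.Scalar.Properties`; the helper name `categoryLabel(for:)` is made up for illustration and only labels the three categories that occur here.

```swift
/// Hypothetical helper: returns a short label for the general category of a scalar.
/// Only the categories relevant to the examples in this question are spelled out.
func categoryLabel(for scalar: Unicode.Scalar) -> String {
    switch scalar.properties.generalCategory {
    case .otherSymbol: return "So (Other Symbol)"
    case .mathSymbol: return "Sm (Math Symbol)"
    case .otherPunctuation: return "Po (Other Punctuation)"
    default: return "other"
    }
}

// The leading characters used in the failing test cases.
let leadingCharacters: [Unicode.Scalar] = ["\u{FFFC}", "©", "®", "|", "\\"]

for scalar in leadingCharacters {
    let codePoint = "U+" + String(scalar.value, radix: 16, uppercase: true)
    print(codePoint, categoryLabel(for: scalar))
}
```

U+FFFC, © and ® are reported as "Other Symbol", while | is a "Math Symbol" and \ is "Other Punctuation", so the behavior does not seem to be limited to a single category.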

Use case:

I have an NSAttributedString that can contain images anywhere in the text. Those are represented internally by the Object Replacement Character (U+FFFC). When a document starts with an image followed directly by a word, not a space, tokenizing fails.
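To illustrate the use case, here is a minimal sketch (assuming AppKit on macOS, with an empty attachment standing in for a real image) of how such a document ends up with U+FFFC as its first character. The variable names are placeholders, not my actual code.

```swift
import AppKit

// An empty attachment stands in for an image here (assumption for illustration).
let image = NSAttributedString(attachment: NSTextAttachment())

// The document starts with the image, followed directly by a word.
let document = NSMutableAttributedString(attributedString: image)
document.append(NSAttributedString(string: "hello world"))

// The plain string is what gets handed to the tokenizer.
let plain = document.string
let startsWithAttachment =
    plain.unicodeScalars.first?.value == UInt32(NSTextAttachment.character)
print(startsWithAttachment)
```

The plain string of such a document corresponds to the "\u{FFFC}hello world" input in the test cases below.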

To reproduce:

// Requires `import NaturalLanguage`; defined as a static member of the test case class.
/// Splits by natural language words.
static let tokenizeByWord: (String) -> [String] = { input in
    
    let tokenizer = NLTokenizer(unit: .word)
    tokenizer.string = input
    
    var tokens = [String]()
    
    tokenizer.enumerateTokens(in: input.startIndex..<input.endIndex) { tokenRange, _ in
        let token = input[tokenRange]
        tokens.append(String(token))
        return true
    }
    return tokens
}
// These all fail: (string starts with symbol, followed directly by word)
XCTAssertEqual(tokenizeByWord("\u{FFFC}hello world"), ["hello", "world"])
XCTAssertEqual(tokenizeByWord("©hello world"), ["hello", "world"])
XCTAssertEqual(tokenizeByWord("®hello world"), ["hello", "world"])
XCTAssertEqual(tokenizeByWord("|hello world"), ["hello", "world"])
XCTAssertEqual(tokenizeByWord("\\hello world"), ["hello", "world"])

// ✅ These all pass: (space after symbol)
XCTAssertEqual(tokenizeByWord("\u{FFFC} hello world"), ["\u{FFFC}", "hello", "world"])
XCTAssertEqual(tokenizeByWord("© hello world"), ["©", "hello", "world"])
XCTAssertEqual(tokenizeByWord("® hello world"), ["®", "hello", "world"])
XCTAssertEqual(tokenizeByWord("| hello world"), ["|", "hello", "world"])
XCTAssertEqual(tokenizeByWord("\\ hello world"), ["\\", "hello", "world"])

// ✅ These all pass: (no space, but symbol right before second word)
XCTAssertEqual(tokenizeByWord("hello \u{FFFC}world"), ["hello", "world"])
XCTAssertEqual(tokenizeByWord("hello ©world"), ["hello", "world"])
XCTAssertEqual(tokenizeByWord("hello ®world"), ["hello", "world"])
XCTAssertEqual(tokenizeByWord("hello |world"), ["hello", "world"])
XCTAssertEqual(tokenizeByWord("hello \\world"), ["hello", "world"])

// ✅ Emoji pass with and without space:
XCTAssertEqual(tokenizeByWord("hello world" ), ["", "hello", "world"])
XCTAssertEqual(tokenizeByWord(" hello world"), ["", "hello", "world"])

System:

  • macOS Catalina 10.15.7 (19H2)
  • Xcode 12.4 (12D4e)