I'm trying to tokenize a string into words in a Cocoa app, but ran into a problem with NLTokenizer.
When the input string starts with a symbol from the Unicode category "Other Symbol" or the block "Specials", like the NSTextAttachment.character, tokenizing fails (i.e. it returns an empty list).
The problem only occurs when the symbol is followed directly by a word without a space (see examples below).
Use case:
I have an NSAttributedString that can contain images anywhere in the text. Those are represented internally by the Object Replacement Character (U+FFFC). When a document starts with an image followed directly by a word, not a space, tokenizing fails.
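For illustration, a minimal sketch of how such a document ends up with a leading U+FFFC. The names `attachment`, `document`, and `plainText` are made up for this example, and the attachment is left empty instead of holding a real image:

```swift
import AppKit

// Hypothetical document: an (empty) attachment immediately followed by text.
let attachment = NSTextAttachment()
let document = NSMutableAttributedString()
document.append(NSAttributedString(attachment: attachment))
document.append(NSAttributedString(string: "hello world"))

// The plain string that gets passed to the tokenizer starts with
// U+FFFC, directly followed by "hello" with no space in between.
let plainText = document.string
```

This `plainText` has the same shape as the first failing input in the test cases below.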
To reproduce:
import NaturalLanguage

/// Splits by natural language words.
static let tokenizeByWord: (String) -> [String] = { input in
    let tokenizer = NLTokenizer(unit: .word)
    tokenizer.string = input
    var tokens = [String]()
    // Collect every token the tokenizer reports for the whole input.
    tokenizer.enumerateTokens(in: input.startIndex..<input.endIndex) { tokenRange, _ in
        let token = input[tokenRange]
        tokens.append(String(token))
        return true // keep enumerating
    }
    return tokens
}
// ❌ These all fail: (string starts with symbol, followed by word)
XCTAssertEqual(tokenizeByWord("\u{FFFC}hello world"), ["hello", "world"])
XCTAssertEqual(tokenizeByWord("©hello world"), ["hello", "world"])
XCTAssertEqual(tokenizeByWord("®hello world"), ["hello", "world"])
XCTAssertEqual(tokenizeByWord("|hello world"), ["hello", "world"])
XCTAssertEqual(tokenizeByWord("\\hello world"), ["hello", "world"])
// ✅ These all pass: (space after symbol)
XCTAssertEqual(tokenizeByWord("\u{FFFC} hello world"), ["\u{FFFC}", "hello", "world"])
XCTAssertEqual(tokenizeByWord("© hello world"), ["©", "hello", "world"])
XCTAssertEqual(tokenizeByWord("® hello world"), ["®", "hello", "world"])
XCTAssertEqual(tokenizeByWord("| hello world"), ["|", "hello", "world"])
XCTAssertEqual(tokenizeByWord("\\ hello world"), ["\\", "hello", "world"])
// ✅ These all pass: (no space, but symbol right before second word)
XCTAssertEqual(tokenizeByWord("hello \u{FFFC}world"), ["hello", "world"])
XCTAssertEqual(tokenizeByWord("hello ©world"), ["hello", "world"])
XCTAssertEqual(tokenizeByWord("hello ®world"), ["hello", "world"])
XCTAssertEqual(tokenizeByWord("hello |world"), ["hello", "world"])
XCTAssertEqual(tokenizeByWord("hello \\world"), ["hello", "world"])
// ✅ Emoji pass with and without space:
XCTAssertEqual(tokenizeByWord("hello world" ), ["", "hello", "world"])
XCTAssertEqual(tokenizeByWord(" hello world"), ["", "hello", "world"])
System:
- macOS Catalina 10.15.7 (19H2)
- Xcode 12.4 (12D4e)