Consider the string "abc를". According to Unicode's demo implementation of word segmentation, this string should be split into two words, "abc" and "를". However, three different Rust implementations of word boundary detection (regex, unic-segment, unicode-segmentation) all disagree with the demo and group that string into one word. Which behavior is correct?
As a follow-up: if the grouped behavior is correct, what would be a good way to scan this string for the search term "abc" while still mostly respecting word boundaries (for the purpose of checking the validity of string translations)? I'd want to match something like "abc를" but not match something like "abcdef".
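For reference, a minimal reproduction of the grouped behaviour (a sketch assuming the `regex` and `unicode-segmentation` crates as dependencies):

```rust
use regex::Regex;
use unicode_segmentation::UnicodeSegmentation;

fn main() {
    let text = "abc를";

    // unicode-segmentation: the whole string comes back as a single word.
    println!("{:?}", text.unicode_words().collect::<Vec<_>>()); // ["abc를"]

    // regex: the Unicode-aware \b sees no word boundary between "c" and "를",
    // so the search term is not found as a standalone word.
    let re = Regex::new(r"\babc\b").unwrap();
    println!("{}", re.is_match(text)); // false
}
```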
I'm not so certain that the demo for word segmentation should be taken as the ground truth, even if it is on an official site. For example, it considers "abc를" ("abc\uB97C") to be two separate words but considers "abc를" ("abc\u1105\u1173\u11af") to be one, even though the former decomposes to the latter.

The idea of a word boundary isn't exactly set in stone. Unicode has a Word Boundary specification which outlines where word breaks should and should not occur. However, it also has an extensive notes section elaborating on other cases.
My understanding is that the crates you list are following the spec without further contextual analysis. Why the demo disagrees I cannot say, but it may be an attempt to implement one of these edge cases.
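To illustrate, here's a quick sketch (assuming the `unicode-segmentation` crate as a dependency) comparing the two forms. Under the default UAX #29 rules both strings are runs of ALetter characters, so both should come back as a single word, in contrast to the demo, which splits only the precomposed form:

```rust
use unicode_segmentation::UnicodeSegmentation;

fn main() {
    let precomposed = "abc\u{B97C}";                 // "abc를" as one precomposed syllable
    let decomposed = "abc\u{1105}\u{1173}\u{11AF}";  // "abc를" as conjoining jamo

    // Expected (per the default spec rules): neither string is split,
    // so each prints a single "word".
    println!("{:?}", precomposed.unicode_words().collect::<Vec<_>>());
    println!("{:?}", decomposed.unicode_words().collect::<Vec<_>>());
}
```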
To address your specific problem, I'd suggest using `Regex` with `\b` for matching a word boundary. This unfortunately follows the same Unicode rules and will not consider "를" to be a new word. However, this regex implementation offers an escape hatch to fall back to ASCII behaviour: simply use `(?-u:\b)` to match a non-Unicode boundary. You can run it yourself on the playground to test your cases and see if this works for you.
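A minimal sketch of that approach (assuming the `regex` crate as a dependency; the exact pattern is only an illustration):

```rust
use regex::Regex;

fn main() {
    // (?-u:\b) is an ASCII word boundary: it treats "를" as a non-word
    // character, so a boundary exists right after "abc".
    let re = Regex::new(r"(?-u:\b)abc(?-u:\b)").unwrap();

    println!("{}", re.is_match("abc를"));   // true: "abc" is followed by a non-ASCII character
    println!("{}", re.is_match("abcdef"));  // false: "abc" is followed by more ASCII word characters
    println!("{}", re.is_match("abc def")); // true: an ordinary ASCII word boundary
}
```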