Why is this end of line (\\b) not recognised as a word boundary in stringr/ICU and Perl?

Question

424 views Asked by Rentrop At 15 December 2016 at 23:23

Using stringr, I tried to detect a € sign at the end of a string as follows:

str_detect("my text €", "€\\b") # FALSE

Why is this not working? It is working in the following cases:

str_detect("my text a", "a\\b") # TRUE - letter instead of €
grepl("€\\b", "2009in €") # TRUE - base R solution

But it also fails in perl mode:

grepl("€\\b", "2009in €", perl=TRUE) # FALSE

So what is wrong with the €\\b regex? The regex €$ works in all cases.

There are 2 answers

ikegami On 15 December 2016 at 23:47

\b

is equivalent to

(?:(?<!\w)(?=\w)|(?<=\w)(?!\w))

which is to say it matches

between a word char and a non-word char,
between a word char and the start of the string, and
between a word char and the end of the string.
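
One way to check this equivalence in R is to append both forms to the same prefix and compare the results under perl = TRUE. This is only a minimal sketch: the variable names are illustrative, and "\u20ac" is the euro sign written as an escape so the snippet does not depend on the file encoding.

```r
# Test strings from the question; "\u20ac" is the euro sign.
x <- c("my text \u20ac", "my text a", "2009in \u20ac")

# \b spelled out with lookarounds, as in the expansion above.
wb <- "(?:(?<!\\w)(?=\\w)|(?<=\\w)(?!\\w))"

# Euro sign followed by a word boundary: shorthand vs. expansion.
euro_b   <- grepl("\u20ac\\b", x, perl = TRUE)
euro_exp <- grepl(paste0("\u20ac", wb), x, perl = TRUE)

# Letter "a" followed by a word boundary: shorthand vs. expansion.
a_b   <- grepl("a\\b", x, perl = TRUE)
a_exp <- grepl(paste0("a", wb), x, perl = TRUE)

identical(euro_b, euro_exp)  # both forms agree for the euro sign
identical(a_b, a_exp)        # both forms agree for the letter
```

The euro pattern matches none of the strings in either form, while the letter pattern matches only "my text a".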

€ is a symbol, and symbols aren't word characters.

$ uniprops €
U+20AC <€> \N{EURO SIGN}
    \pS \p{Sc}
    All Any Assigned Common Zyyy Currency_Symbol Sc Currency_Symbols S Gr_Base Grapheme_Base Graph X_POSIX_Graph GrBase Print X_POSIX_Print Symbol Unicode
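
The same classification can be checked from R. The sketch below assumes the PCRE library behind perl = TRUE was built with Unicode property support (pcre_config() should report this); the variable names are illustrative.

```r
euro <- "\u20ac"  # the euro sign, written as an escape

# Is it a word character as far as \w is concerned?
is_word <- grepl("\\w", euro, perl = TRUE)

# Does it carry the Currency_Symbol (Sc) general category?
is_currency <- grepl("\\p{Sc}", euro, perl = TRUE)
```

Here is_word should come out FALSE and is_currency TRUE, which is why no word boundary exists between € and the end of the string.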

If your language supports look-behinds and look-aheads, you could use the following to find a boundary between a space and a non-space character (treating the start and end of the string as a space).

(?:(?<!\S)(?=\S)|(?<=\S)(?!\S))
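
Applied to the strings from the question, this could look like the following in R with perl = TRUE. It is a sketch, not a tested stringr recipe; ws_boundary and the extra test strings are made up for illustration.

```r
# Boundary between space and non-space, treating string start/end as a space.
ws_boundary <- "(?:(?<!\\S)(?=\\S)|(?<=\\S)(?!\\S))"

# Euro sign that ends a run of non-space characters.
pat <- paste0("\u20ac", ws_boundary)

x <- c("my text \u20ac", "2009in \u20ac", "\u20ac5", "\u20ac 5")
res <- grepl(pat, x, perl = TRUE)
# The first, second and fourth strings match; "\u20ac5" does not,
# because the euro sign is directly followed by a non-space character.
```

When only the right-hand side matters, as in the question, the shorter pattern "\u20ac(?!\\S)" expresses the same condition.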

**Wiktor Stribiżew** · Accepted Answer · 2016-12-15T23:47:55+00:00

When you use base R regex functions without perl=TRUE, TRE regex flavor is used.

It appears that the TRE word boundary:

matches the end-of-string position when used after a non-word character, and
matches the start-of-string position when used before a non-word character.

See the R tests:

> gsub("\\b\\)", "HERE", ") 2009in )")
[1] "HERE 2009in )"
> gsub("\\)\\b", "HERE", ") 2009in )")
[1] ") 2009in HERE"
>
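
For contrast, here are the same two calls with perl = TRUE (PCRE). This is a small sketch using the test string from above; the variable names are illustrative.

```r
x <- ") 2009in )"

# \b before ")" needs a word character right before the parenthesis.
before_paren <- gsub("\\b\\)", "HERE", x, perl = TRUE)

# \b after ")" needs a word character right after the parenthesis.
after_paren <- gsub("\\)\\b", "HERE", x, perl = TRUE)

# Neither condition holds anywhere in x, so both results equal x.
```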

This is not the usual behavior of a word boundary. In the PCRE and ICU regex flavors, a word boundary before a non-word character matches only when that character is preceded by a word character, so it cannot match at the start of the string. Likewise, a word boundary after a non-word character requires a word character to follow immediately, so it cannot match at the end of the string:

There are three different positions that qualify as word boundaries:

- Before the first character in the string, if the first character is a word character.
- After the last character in the string, if the last character is a word character.
- Between two characters in the string, where one is a word character and the other is not a word character.
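
The three positions can be probed one at a time in R with perl = TRUE. A minimal sketch with made-up ASCII test strings:

```r
# 1. Start of string: a boundary only if the first character is a word character.
at_start <- grepl("^\\b", c("abc", ")abc"), perl = TRUE)

# 2. End of string: a boundary only if the last character is a word character.
at_end <- grepl("\\b$", c("abc", "abc)"), perl = TRUE)

# 3. Between a word character and a non-word character.
between <- grepl("c\\b\\)", "abc)", perl = TRUE)
```

at_start and at_end are each TRUE for the first string and FALSE for the second, and between is TRUE. The € at the end of "my text €" falls into the FALSE case of the second position.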
