Word boundary between ASCII and Unicode letters in Python3

Question

Asked by Sakurai on 30 May 2022 at 07:58 (97 views)

Python3:

import re
k = "X"
s = "X测试一Q测试二XQ测试三"
print(re.split((r"\b" + k + r"\b"), s))

Output:

['X测试一Q测试二XQ测试三']

Expected:

['', '测试一Q测试二XQ测试三']

There is 1 answer

**Wiktor Stribiżew** · Accepted Answer · 2022-05-30T08:05:15+00:00

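The character 测 is a letter belonging to the \p{Lo} (Letter, other) class, so it counts as a word character. Both X and 测 are therefore word characters, and there is no word boundary between them.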
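The claim above can be checked directly. The following is a minimal sketch, not part of the original answer, that uses the standard unicodedata module to look up the category of 测 and lists the positions where \b matches in the sample string:

```python
import re
import unicodedata

s = "X测试一Q测试二XQ测试三"

# General category of 测: "Lo" means "Letter, other".
category = unicodedata.category("测")

# In a str pattern, \w is Unicode-aware by default, so 测 is a word character.
is_word_char = re.fullmatch(r"\w", "测") is not None

# Every character in s is a word character, so \b can only match
# at the very start and the very end of the string.
boundaries = [m.start() for m in re.finditer(r"\b", s)]

print(category)      # Lo
print(is_word_char)  # True
print(boundaries)    # [0, 13]
```

Since the only boundaries are at the two ends of the string, \bX\b cannot match anywhere, which is why re.split returns the input unchanged.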

The \b word boundary construct is Unicode-aware by default in Python 3.x re patterns (for str patterns), so you can switch this behavior off with the re.ASCII / re.A option or the inline (?a) flag:

import re
k = "X"
print(re.split(fr"(?a)\b{k}\b", "X测试一Q测试二XQ测试三"))
# ['', '测试一Q测试二XQ测试三']
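The same effect can be had by passing re.ASCII as a flag instead of the inline (?a). The sketch below compares the default and ASCII behavior side by side. The use of re.escape is an addition that the original answer does not include; it is assumed here only as a precaution in case k ever contains regex metacharacters.

```python
import re

k = "X"
s = "X测试一Q测试二XQ测试三"

# re.escape is an assumption added for safety; it is not in the original answer.
body = rf"\b{re.escape(k)}\b"

# Default (Unicode-aware) \b: 测 is a word character, so nothing matches.
unicode_result = re.split(body, s)

# re.ASCII restricts \w (and therefore \b) to [a-zA-Z0-9_].
ascii_result = re.compile(body, re.ASCII).split(s)

print(unicode_result)  # ['X测试一Q测试二XQ测试三']
print(ascii_result)    # ['', '测试一Q测试二XQ测试三']
```

The X inside XQ is not split on in either mode, because Q is an ASCII word character and there is no boundary between X and Q.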


If you only need to make sure that X is neither preceded nor followed by an ASCII letter, use (?<![a-zA-Z])X(?![a-zA-Z]). To exclude ASCII digits as well, use (?<![a-zA-Z0-9])X(?![a-zA-Z0-9]).
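A short sketch of the two lookaround variants follows. The second test string, "X1 测X试", is a made-up example chosen to show where the letters-only and letters-plus-digits patterns differ; re.escape is again an added precaution rather than part of the original answer.

```python
import re

k = re.escape("X")  # escaping is an assumption, not in the original answer
s = "X测试一Q测试二XQ测试三"
t = "X1 测X试"  # hypothetical input with a digit right after X

letters_only = rf"(?<![a-zA-Z]){k}(?![a-zA-Z])"
letters_digits = rf"(?<![a-zA-Z0-9]){k}(?![a-zA-Z0-9])"

# Both patterns give the expected result on the question's string.
print(re.split(letters_only, s))    # ['', '测试一Q测试二XQ测试三']
print(re.split(letters_digits, s))  # ['', '测试一Q测试二XQ测试三']

# They differ when a digit is adjacent to X.
print(re.split(letters_only, t))    # ['', '1 测', '试']
print(re.split(letters_digits, t))  # ['X1 测', '试']
```

Unlike (?a)\b, these lookarounds do not treat the underscore as a word character, so X_ would still be split on by both patterns.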
