How to find word with subscript?


Input: s = "test1 this is a sample subscript o₁"

I've tried: re.compile(r'\b[^\W\d_]{2,}\b').findall(s)

It finds words of at least 2 characters that contain no digits: 'this', 'is', 'sample', 'subscript', 'o₁',

but the result still includes the word with the subscript number.

Is there a way to exclude words that contain a subscript?

Desired output: 'this', 'is', 'sample', 'subscript'

1 Answer

Wiktor Stribiżew

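The point is that the Unicode-aware \d in Python 3 regex matches only characters from the Nd (decimal digit) Unicode category, not those from the No (other number) category that ₁ belongs to. Since \w still matches ₁, the [^\W\d_] class lets it through.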
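
To make the category difference concrete, here is a small standard-library check. It is only a sketch, and the variable names are illustrative. It shows that ₁ is in the No category, is matched by \w, and is not matched by \d:

```python
import re
import unicodedata

sub = "₁"  # U+2081 SUBSCRIPT ONE, the character from the question

category = unicodedata.category(sub)                  # Unicode general category
matches_digit = re.fullmatch(r'\d', sub) is not None  # does \d accept it?
matches_word = re.fullmatch(r'\w', sub) is not None   # does \w accept it?

print(category)       # No
print(matches_digit)  # False
print(matches_word)   # True
```

Because [^\W\d_] means "a word character that is neither a \d digit nor an underscore", ₁ passes that test and o₁ is returned as a two-character word.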

If you only need to match words made of ASCII letters, use

r'\b[a-zA-Z]{2,}\b'

Or, make the pattern non-Unicode aware by using re.A / re.ASCII flag:

re.compile(r'\b[^\W\d_]{2,}\b', re.A)

See this Python 3 demo.
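
For reference, a side-by-side run of the two ASCII-only options on the question's input might look like the sketch below. The extra string "ab₁" is an added illustration, not part of the question. It is there because the two patterns can differ when a subscript follows two or more ASCII letters.

```python
import re

s = "test1 this is a sample subscript o₁"

# Option 1: explicit ASCII letter class, word boundaries stay Unicode-aware
ascii_class = re.compile(r'\b[a-zA-Z]{2,}\b')
# Option 2: original pattern, but \W, \d and \b are restricted to ASCII by re.A
ascii_flag = re.compile(r'\b[^\W\d_]{2,}\b', re.A)

print(ascii_class.findall(s))  # ['this', 'is', 'sample', 'subscript']
print(ascii_flag.findall(s))   # ['this', 'is', 'sample', 'subscript']

# Illustrative edge case: with re.A the subscript is a non-word character,
# so a word boundary appears right before it and "ab" is matched.
print(ascii_class.findall("ab₁"))  # []
print(ascii_flag.findall("ab₁"))   # ['ab']
```

If a word such as ab₁ should be dropped entirely rather than trimmed to ab, the explicit [a-zA-Z] class without re.A is the safer of the two.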

If you need to work with any Unicode letters, you may fix it either by adding all the No characters to the negated character class (which is tedious), or by adding a programmatic check after a match is found to see whether the match contains any character from the No category.
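
As a rough sketch of the first option, the negated class can be extended with a few explicit code points. The choice below is an assumption made for illustration: it covers only ¹, ² and ³ plus the Superscripts and Subscripts block (U+2070 to U+209F). That is far from every No character, and the block also contains a few subscript and superscript letters, which get excluded too.

```python
import re

s = "test1 this is a sample subscript o₁"

# Word characters that are not \d digits, not "_", and not in the
# hand-picked superscript/subscript ranges (incomplete by design).
pattern = re.compile(r'\b[^\W\d_\u00B2\u00B3\u00B9\u2070-\u209F]{2,}\b')

result = pattern.findall(s)
print(result)                   # ['this', 'is', 'sample', 'subscript']
print(pattern.findall("km²"))   # []
```

Other No characters, such as fractions or circled numbers, would still slip through, which is why this route gets tedious quickly.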

See this Python 3 demo:

import re, unicodedata
s = "test1 this is a sample subscript o₁"
# Find candidate words first, then drop any that contain a character from the No category
words = re.findall(r'\b[^\W\d_]{2,}\b', s)
print([w for w in words if not any(unicodedata.category(c) == 'No' for c in w)])
# => ['this', 'is', 'sample', 'subscript']

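Make sure you are using a recent Python version so that the latest Unicode standard is supported, or rely on the PyPI regex module:

import regex  # third-party module: pip install regex
s = "test1 this is a sample subscript o₁"
# \p{L} matches any Unicode letter, so digits and subscript numbers are never part of a match
p = regex.compile(r"\b\p{L}{2,}\b")
print(p.findall(s))
# => ['this', 'is', 'sample', 'subscript']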