Regex - Extract substrings starting with capitalized letter in a list, with french special symbols

524 views Asked by At

I have a set of french strings like this one:

text = "Français Langues bantoues Presse écrite Gabon Particularité linguistique"

I want to extract the substrings starting with capitalized letters into a list, as follows:

list = ["Français", "Langues bantoues", "Presse écrite", "Gabon", "Particularité linguistique"]

I did try something like that but it does not take the following words, and it stops because of the french symbols.

import re
pattern = "([A-Z][a-z]+)"

text = "Français Langues bantoues Presse écrite Gabon Particularité linguistique"

list = re.findall(pattern, text)
list

Output ['Fran', 'Langues', 'Presse', 'Gabon', 'Particularit']

I did not manage to find the solution on the forum unfortunately.

1

There are 1 answers

1
Wiktor Stribiżew On BEST ANSWER

Since this is related to specific Unicode character handling, I'd recommend to use the PyPi regex module (install using pip install regex) and then you can use

import regex
text = "Français Langues bantoues Presse écrite Gabon Particularité linguistique"
matches = regex.split(r'(?!\A)\b(?=\p{Lu})', text)
print( list(map(lambda x: x.strip(), matches)) )
# => ['Français', 'Langues bantoues', 'Presse écrite', 'Gabon', 'Particularité linguistique']

See the online Python demo and the regex demo. Details:

  • (?!\A) - position other than start of a string
  • \b - a word boundary
  • (?=\p{Lu}) - a positive lookahead that requires the next char to be a Unicode uppercase letter.

Note that map(lambda x: x.strip(), matches) is used to strip off redundant whitespace from the resulting chunks.

You can do this with re, too:

import re, sys
text = "Français Langues bantoues Presse écrite Gabon Particularité linguistique"
pLu = '[{}]'.format("".join([chr(i) for i in range(sys.maxunicode) if chr(i).isupper()]))
matches = re.split(fr'(?!\A)\b(?={pLu})', text)
print( list(map(lambda x: x.strip(), matches)) )
# => ['Français', 'Langues bantoues', 'Presse écrite', 'Gabon', 'Particularité linguistique']

See this Python demo, but bear in mind that the amount of supported Unicode uppercase letters differs from version to version, and using the PyPi regex module makes it more consistent.