I have a set of french strings like this one:
text = "Français Langues bantoues Presse écrite Gabon Particularité linguistique"
I want to extract the substrings starting with capitalized letters into a list, as follows:
list = ["Français", "Langues bantoues", "Presse écrite", "Gabon", "Particularité linguistique"]
I did try something like that but it does not take the following words, and it stops because of the french symbols.
import re
pattern = "([A-Z][a-z]+)"
text = "Français Langues bantoues Presse écrite Gabon Particularité linguistique"
list = re.findall(pattern, text)
list
Output
['Fran', 'Langues', 'Presse', 'Gabon', 'Particularit']
I did not manage to find the solution on the forum unfortunately.
Since this is related to specific Unicode character handling, I'd recommend to use the PyPi regex module (install using
pip install regex
) and then you can useSee the online Python demo and the regex demo. Details:
(?!\A)
- position other than start of a string\b
- a word boundary(?=\p{Lu})
- a positive lookahead that requires the next char to be a Unicode uppercase letter.Note that
map(lambda x: x.strip(), matches)
is used to strip off redundant whitespace from the resulting chunks.You can do this with
re
, too:See this Python demo, but bear in mind that the amount of supported Unicode uppercase letters differs from version to version, and using the PyPi regex module makes it more consistent.