Looking for Regex pattern to return similar results to my current function

102 views Asked by Ian Thompson At 28 March 2024 at 02:21

I have some PascalCase text that I'm trying to split into separate tokens/words. For example, "Hello123AIIsCool" would become ["Hello", "123", "AI", "Is", "Cool"].

Some Conditions

"Words" will always start with an upper-cased letter. E.g., "Hello"
A contiguous sequence of numbers should be left together. E.g., "123" -> ["123"], not ["1", "2", "3"]
A contiguous sequence of upper-cased letters should be kept together except when the last letter is the start to a new word as defined in the first condition. E.g., "ABCat" -> ["AB", "Cat"], not ["ABC", "at"]
There is no guarantee that each condition will have a match in a string. E.g., "Hello", "HelloAI", "HelloAIIsCool", "Hello123", "123AI", "AIIsCool", and any other combination I haven't listed are potential candidates.

I've tried a couple regex variations. The following two attempts got me pretty close to what I want, but not quite.

Version 0

import re

def extract_v0(string: str) -> list[str]:
    word_pattern = r"[A-Z][a-z]*"
    num_pattern = r"\d+"
    pattern = f"{word_pattern}|{num_pattern}"
    extracts: list[str] = re.findall(
        pattern=pattern, string=string
    )
    return extracts

string = "Hello123AIIsCool"
extract_v0(string)

['Hello', '123', 'A', 'I', 'Is', 'Cool']

Version 1

import re

def extract_v1(string: str) -> list[str]:
    word_pattern = r"[A-Z][a-z]+"
    num_pattern = r"\d+"
    upper_pattern = r"[A-Z][^a-z]*"
    pattern = f"{word_pattern}|{num_pattern}|{upper_pattern}"
    extracts: list[str] = re.findall(
        pattern=pattern, string=string
    )
    return extracts

string = "Hello123AIIsCool"
extract_v1(string)

['Hello', '123', 'AII', 'Cool']

Best Option So Far

This uses a combination of regex and looping. It works, but is this the best solution? Or is there some fancy regex that can do it?

import re

def extract_v2(string: str) -> list[str]:
    word_pattern = r"[A-Z][a-z]+"
    num_pattern = r"\d+"
    upper_pattern = r"[A-Z][A-Z]*"
    groups = []
    for pattern in [word_pattern, num_pattern, upper_pattern]:
        while string.strip():
            group = re.search(pattern=pattern, string=string)
            if group is not None:
                groups.append(group)
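                # Pad with one space per matched character so later match offsets stay aligned.
                string = string[:group.start()] + " " * len(group.group()) + string[group.end():]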
            else:
                break
    
    ordered = sorted(groups, key=lambda g: g.start())
    return [grp.group() for grp in ordered]

string = "Hello123AIIsCool"
extract_v2(string)

['Hello', '123', 'AI', 'Is', 'Cool']

Original Q&A

There are 5 answers

Johnny C. On 28 March 2024 at 02:55

Use re.sub to put a space before each token, then split():

import re

def pascal_case_split(identifier):
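    # Prefix each kind of token with a space, then split on whitespace.
    spaced = re.sub(r'([0-9]+)', r' \1', identifier)   # digit runs
    spaced = re.sub(r'([A-Z]+)', r' \1', spaced)       # uppercase runs
    spaced = re.sub(r'([A-Z][a-z]+)', r' \1', spaced)  # capitalized words
    return spaced.split()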

a = pascal_case_split("Hello123AIIsCool")
a

['Hello', '123', 'AI', 'Is', 'Cool']


Chris On 28 March 2024 at 02:56

re.findall should do the trick with much less work on your part, with re.X allowing the pattern to be spaced out a bit.

>>> import re
>>> re.findall(
...   r'( [A-Z]{2,} (?! [a-z] ) | \d+ | [A-Z] [a-z]+ )',
...   'Hello123AIIsCool',
...   re.X
... )
['Hello', '123', 'AI', 'Is', 'Cool']

Pattern	Explanation
`[A-Z]{2,} (?! [a-z] )`	Matches two or more capital letters, not followed by a lowercase letter.
`\d+`	One or more digits.
`[A-Z] [a-z]+`	A single uppercase letter followed by one or more lowercase letters.

As noted in comments, the first subpattern does not match a single capital letter. We can amend this by replacing [A-Z]{2,} with [A-Z]+ to match one or more capital letters not followed by a lowercase letter.
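The amendment described above can be sketched as follows. The names AMENDED, ORIGINAL and split_with are only illustrative, and "IsAMarkup" is an assumed input chosen because it contains a lone capital letter:

```python
import re

# Original subpattern: needs at least two capitals, so a lone "A" is skipped.
ORIGINAL = r"( [A-Z]{2,} (?! [a-z] ) | \d+ | [A-Z] [a-z]+ )"
# Amended subpattern: one or more capitals not followed by a lowercase letter.
AMENDED = r"( [A-Z]+ (?! [a-z] ) | \d+ | [A-Z] [a-z]+ )"


def split_with(pattern: str, text: str) -> list[str]:
    # re.X lets the pattern contain spaces purely for readability.
    return re.findall(pattern, text, re.X)


print(split_with(ORIGINAL, "IsAMarkup"))        # ['Is', 'Markup']
print(split_with(AMENDED, "IsAMarkup"))         # ['Is', 'A', 'Markup']
print(split_with(AMENDED, "Hello123AIIsCool"))  # ['Hello', '123', 'AI', 'Is', 'Cool']
```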

Hao Wu On 28 March 2024 at 03:01

You may try this regex:

[A-Z](?:[a-z]+|[A-Z]+(?![a-z]))?|\d+

Test case:

import re

pattern = r"[A-Z](?:[a-z]+|[A-Z]+(?![a-z]))?|\d+"
text = "Hello123AIIsCoolAndHTML5IsAMarkupLanguage"

print(re.findall(pattern, text))
# ['Hello', '123', 'AI', 'Is', 'Cool', 'And', 'HTML', '5', 'Is', 'A', 'Markup', 'Language']
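For readers who want to see how the pieces of that pattern fit together, here is the same regex spelled out with re.VERBOSE. The comments are my reading of the pattern, not part of the original answer, and the name PATTERN is illustrative:

```python
import re

PATTERN = re.compile(
    r"""
    [A-Z]                  # every word or acronym starts with one capital
    (?:
        [a-z]+             # then lowercase letters: an ordinary word
      |
        [A-Z]+ (?![a-z])   # or more capitals, giving back the last one
                           # when a lowercase letter follows it
    )?                     # or nothing more: a lone capital
  |
    \d+                    # a run of digits
    """,
    re.VERBOSE,
)

text = "Hello123AIIsCoolAndHTML5IsAMarkupLanguage"
print(PATTERN.findall(text))
```

Because the group after the first capital is optional, a single capital such as the "A" before "Markup" still matches on its own.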

no comment On 28 March 2024 at 04:20

Seems easier to do backwards:

import re

def extract(string: str) -> list[str]:
    backwards = re.findall(r'[a-z]+[A-Z]|\d+|[A-Z]+', string[::-1])
    return [s[::-1] for s in backwards[::-1]]

string = "Hello123AIIsCool"
print(extract(string))

Output:

['Hello', '123', 'AI', 'Is', 'Cool']
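A small sketch of why the reversed approach needs no lookahead, as far as I can tell: in the reversed text a word becomes a run of lowercase letters ending in one capital, so plain greedy matching finds the boundary. The function name extract_backwards and the intermediate variables are illustrative only:

```python
import re


def extract_backwards(string: str) -> list[str]:
    reversed_text = string[::-1]
    # In reversed text a word looks like lowercase letters followed by
    # exactly one capital, e.g. "Cool" becomes "looC".
    reversed_tokens = re.findall(r"[a-z]+[A-Z]|\d+|[A-Z]+", reversed_text)
    # Restore the original token order and un-reverse each token.
    return [token[::-1] for token in reversed(reversed_tokens)]


print("Hello123AIIsCool"[::-1])               # looCsIIA321olleH
print(extract_backwards("Hello123AIIsCool"))  # ['Hello', '123', 'AI', 'Is', 'Cool']
print(extract_backwards("ABCat"))             # ['AB', 'Cat']
```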

Chris Fu On 28 March 2024 at 02:52 (Accepted Answer)

Based on your Version 1:

import re


def extract_v1(string: str) -> list[str]:
    word_pattern = r"[A-Z][a-z]+"
    num_pattern = r"\d+"
    upper_pattern = r"[A-Z]+(?![a-z])"  # Fixed
    pattern = f"{word_pattern}|{num_pattern}|{upper_pattern}"
    extracts: list[str] = re.findall(
        pattern=pattern, string=string
    )
    return extracts


string = "Hello123AIIsCool"
extract_v1(string)

Result:

['Hello', '123', 'AI', 'Is', 'Cool']

The fixed upper_pattern matches as many uppercase letters as possible. When a lowercase letter follows, the negative lookahead (?![a-z]) makes it give back the last uppercase letter, leaving that letter to start the next word.
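As a quick sanity check, the fixed pattern can be run against the example inputs listed in the question. This is only a sketch: split_pascal and CASES are hypothetical names, and the expected lists are derived from the conditions in the question rather than supplied by the asker:

```python
import re

# Same alternation as the fixed extract_v1: word | number | uppercase run.
PATTERN = r"[A-Z][a-z]+|\d+|[A-Z]+(?![a-z])"


def split_pascal(string: str) -> list[str]:
    return re.findall(PATTERN, string)


# Inputs from the question, with outputs implied by its conditions.
CASES = {
    "Hello": ["Hello"],
    "HelloAI": ["Hello", "AI"],
    "HelloAIIsCool": ["Hello", "AI", "Is", "Cool"],
    "Hello123": ["Hello", "123"],
    "123AI": ["123", "AI"],
    "AIIsCool": ["AI", "Is", "Cool"],
    "ABCat": ["AB", "Cat"],
}

for text, expected in CASES.items():
    result = split_pascal(text)
    print(f"{text!r:18} -> {result}", "ok" if result == expected else "MISMATCH")
```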
