Add a single space and comma between words that are connected using regex


I have a nested list, list_3, which looks like:

[['Company OverviewCompany: HowSector: SoftwareYear Founded: 2010One Sentence Pitch: Easily give and request low-quality feedback with your team to achieve more togetherUniversity Affiliation(s): Duke$ Raised: $240,000Investors: Friends & familyTraction to Date: 10% of monthly active users (MAU) are also active weekly'], [['Company OverviewCompany: GrubSector: SoftwareYear Founded: 2018One Sentence Pitch: Find food you likeUniversity Affiliation(s): Stanford$ Raised: $340,000Investors: Friends & familyTraction to Date: 40% of monthly active users (MAU) are also active weekly']]]

I would like to use regex to add a comma followed by a single space between each pair of joined words (e.g. HowSector:, SoftwareYear, 2010One). So far I have tried to write a re.sub call to do this, by selecting all the characters without whitespace and replacing them, but I have run into some issues:


for i, list in enumerate(list_3):
    list_3[i] = [re.sub('r\s\s+', ', ', word) for word in list]
    list_33.append(list_3[i])
print(list_33)

error:

return _compile(pattern, flags).sub(repl, string, count)

TypeError: expected string or bytes-like object

I would like the output to be:

[['Company Overview, Company: How, Sector: Software, Year Founded: 2010, One Sentence Pitch: Easily give and request low-quality feedback with your team to achieve more together University, Affiliation(s): Duke, $ Raised: $240,000, Investors: Friends & family, Traction to Date: 10% of monthly active users (MAU) are also active weekly'],[...]]

Any ideas how I can use regex to do this?


There are 3 answers

0
DeepSpace On BEST ANSWER

The main problem is that your nested list does not have a consistent depth: some items are nested 2 levels deep and others 3. This is why you are getting the above error. When an item is nested 3 levels deep, re.sub receives a list as its third argument (the string to search) instead of a string.
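As a minimal sketch of that failure mode, the snippet below passes both a string and a list to re.sub. The `sample` list and the `try_sub` helper are hypothetical stand-ins for illustration, not the asker's data or code.

```python
import re

# Hypothetical miniature of list_3: the first item sits 2 levels deep,
# the second sits 3 levels deep.
sample = [['AlphaBeta'], [['GammaDelta']]]

def try_sub(item):
    """Return True if re.sub accepts the item, False if it raises TypeError."""
    try:
        re.sub(r'\s\s+', ', ', item)
        return True
    except TypeError:
        return False

# Record the type of each second-level item and whether re.sub accepted it.
results = [(type(item).__name__, try_sub(item))
           for inner in sample
           for item in inner]
print(results)
```

The string item is accepted, while the list item raises the TypeError shown in the question.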

The second problem is that the regex you are using does not match what you want to replace. The most naive regex we can use here should (at the very least) be able to find a non-whitespace character followed by a capital letter.

In the example code below, I'm using re.compile (since the same regex will be used over and over again, we might as well pre-compile it and gain a small performance boost) and I'm just printing the output. You'll need to figure out a way to get the output in the format you want.

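import re

# nested_list is the list from the question (list_3)
# Compile once: a non-whitespace character immediately followed by a capital letter
regex = re.compile(r'(\S)([A-Z])')
# Re-emit both captured characters with ', ' inserted between them
replacement = r'\1, \2'
for inner_list in nested_list:
    for string_or_list in inner_list:
        if isinstance(string_or_list, str):
            # 2-level case: the item is already a string
            print(regex.sub(replacement, string_or_list))
        else:
            # 3-level case: the item is a list of strings
            for string in string_or_list:
                print(regex.sub(replacement, string))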

Outputs

Company Overview, Company: How, Sector: Software, Year Founded: 2010, One Sentence Pitch: Easily give and request low-quality feedback with your team to achieve more together, University Affiliation(s): Duke$ Raised: $240,000, Investors: Friends & family, Traction to Date: 10% of monthly active users (, MA, U) are also active weekly
Company Overview, Company: Grub, Sector: Software, Year Founded: 2018, One Sentence Pitch: Find food you like, University Affiliation(s): Stanford$ Raised: $340,000, Investors: Friends & family, Traction to Date: 40% of monthly active users (, MA, U) are also active weekly
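The "(, MA, U)" fragments in the output come from \S matching any non-whitespace character, including "(" and capital letters. The sketch below compares the pattern above with a stricter one that only splits after a lowercase letter or digit. The stricter pattern is an assumption about what the asker wants, not part of the original answer, and the names `naive`, `stricter` and `text` are illustrative.

```python
import re

# Pattern from the answer: any non-whitespace character, then a capital letter.
naive = re.compile(r'(\S)([A-Z])')
# Assumed refinement: only a lowercase letter or digit may precede the capital.
stricter = re.compile(r'([a-z\d])([A-Z])')

text = 'likeUniversity (MAU)'

naive_out = naive.sub(r'\1, \2', text)
stricter_out = stricter.sub(r'\1, \2', text)

print(naive_out)     # like, University (, MA, U)
print(stricter_out)  # like, University (MAU)
```

The stricter pattern still leaves "Duke$ Raised" joined, because "$" is not a capital letter.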
2
dawg On

If your list of lists is arbitrarily deep, you can recursively traverse it, process the strings with the regex below, and yield the same structure:

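import re
from collections.abc import Iterable

def process(l):
    for el in l:
        if isinstance(el, Iterable) and not isinstance(el, (str, bytes)):
            # Nested container: recurse and rebuild the same container type
            yield type(el)(process(el))
        else:
            # String: split at each lowercase-to-uppercase boundary, rejoin with ', '
            yield ', '.join(re.split(r'(?<=[a-z])(?=[A-Z])', el))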

Given your example as LoL here is the result:

>>> list(process(LoL))
[['Company Overview, Company: How, Sector: Software, Year Founded: 2010One Sentence Pitch: Easily give and request low-quality feedback with your team to achieve more together, University Affiliation(s): Duke$ Raised: $240,000Investors: Friends & family, Traction to Date: 10% of monthly active users (MAU) are also active weekly'], [['Company Overview, Company: Grub, Sector: Software, Year Founded: 2018One Sentence Pitch: Find food you like, University Affiliation(s): Stanford$ Raised: $340,000Investors: Friends & family, Traction to Date: 40% of monthly active users (MAU) are also active weekly']]]
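The result above still leaves "2010One", "Duke$ Raised" and "$240,000Investors" joined, because the lookbehind only accepts lowercase letters and the lookahead only accepts capitals. A possible variant, sketched here under the assumption that digits may end a field and "$" may start one, widens both character classes. `BOUNDARY`, `add_separators` and `demo` are hypothetical names, and the function handles nested lists only.

```python
import re

# Zero-width boundary: a lowercase letter or digit on the left,
# a capital letter or '$' on the right (an assumed widening of the answer's regex).
BOUNDARY = re.compile(r'(?<=[a-z\d])(?=[A-Z$])')

def add_separators(obj):
    """Return a copy of a nested list with ', ' inserted at each boundary."""
    if isinstance(obj, str):
        return BOUNDARY.sub(', ', obj)
    # Anything that is not a string is treated as a list of further items.
    return [add_separators(el) for el in obj]

demo = [['HowSector: SoftwareYear Founded: 2010One'],
        [['Duke$ Raised: $240,000Investors (MAU)']]]
print(add_separators(demo))
```

This is a heuristic: it would also split ordinary camel-case words such as product names, so the character classes may need tuning for other data.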
0
Cary Swoveland On

I believe you can use the following Python code.

import re

rgx = r'(?<=[a-z\d])([A-Z$][A-Za-z]*(?: +\S+?)*):'
rep = r', \1:'
result = re.sub(rgx, rep, s)

where s is the string.


Python's regex engine performs the following operations when matching.

(?<=          : begin positive lookbehind
  [a-z\d]     : match a lowercase letter or digit
)             : end positive lookbehind
(             : begin capture group 1
  [A-Z$]      : match a capital letter or '$'
  [A-Za-z]*   : match 0+ letters
  (?: +\S+?)  : match 1+ spaces greedily, 1+ non-spaces
                non-greedily in a non-capture group
  *           : execute non-capture group 0+ times
)             : end capture group
:             : match ':'

Note that the positive lookbehind and permissible characters for each token in the capture group may need to be adjusted to suit requirements.

The replacement string (, \1:) produces ', ' followed by the contents of capture group 1, followed by a colon.
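As a rough usage sketch, the snippet below applies the pattern as described in the breakdown (a single capture group with no quantifier after it) to the second string from the question. `RGX`, `REP` and `add_commas` are hypothetical names introduced for illustration.

```python
import re

# Pattern as laid out in the breakdown: a field label that starts right after
# a lowercase letter or digit and ends at the next ':'.
RGX = re.compile(r'(?<=[a-z\d])([A-Z$][A-Za-z]*(?: +\S+?)*):')
REP = r', \1:'

def add_commas(s):
    """Insert ', ' before every field label found in s."""
    return RGX.sub(REP, s)

s = ('Company OverviewCompany: GrubSector: SoftwareYear Founded: 2018'
     'One Sentence Pitch: Find food you likeUniversity Affiliation(s): '
     'Stanford$ Raised: $340,000Investors: Friends & familyTraction to Date: '
     '40% of monthly active users (MAU) are also active weekly')

result = add_commas(s)
print(result)
```

Because the pattern is anchored on the trailing colon, it only splits where a field label begins, so "(MAU)" is left intact and "$ Raised:" is treated as a label.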