Python re: I would like to capture multiple lines between a string delimiter


I have a file like the one below, with multiple lines between two delimiters. I want to capture everything between start_of_compile and end_of_compile, excluding the comment lines.

The text I want to parse is below:

#####start_of_compile - DO NOT MODIFY THIS LINE#############
##################################################################

parse these lines in between them
...
....

###################################################################
#####end_of_compile -DO NOT MODIFY THIS LINE#################
###################################################################

I expected (.*) to match the multiple lines between the start and end delimiters, but currently it does not.

Instead I get the error below:

def checkcompileqel():
    compiledelimiters = ['start_of_compile_setup']
    with open("compile.qel", "r") as compile_fh:
        lines = compile_fh.read()
        matchstart = re.compile(r'^#+\n#+start.*#+\n#+(.*)#+\n#+end.*#+\n#+',re.MULTILINE)
        print(matchstart.match(lines).group(1))

Traceback (most recent call last):
  File "/process_tools/testcode.py", line 25, in <module>
    print(checkcompileqel())
  File "/home/process_tools/testcode.py", line 10, in checkcompileqel
    print(matchstart.match(lines).group(0))
AttributeError: 'NoneType' object has no attribute 'group'

There are 2 answers

Answer by The fourth bird

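Your regex begins by trying to match a line that consists only of # characters before the start_of_compile line, and no such line is present in the sample. The match therefore fails, match() returns None, and calling .group() on None raises the AttributeError.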
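To make the failure concrete, here is a minimal sketch that runs the question's pattern against a string laid out like the posted sample. The sample string is an assumption (the real compile.qel may differ, for example by having an extra #-only line at the top), and the guard at the end is just one possible way to avoid the AttributeError.

```python
import re

# The pattern from the question, unchanged.
question_pattern = re.compile(
    r'^#+\n#+start.*#+\n#+(.*)#+\n#+end.*#+\n#+', re.MULTILINE
)

# Assumed sample: it begins directly with the start_of_compile line,
# exactly as the text is shown in the question.
sample = (
    "#####start_of_compile - DO NOT MODIFY THIS LINE#############\n"
    "##################################################################\n"
    "\n"
    "parse these lines in between them\n"
    "...\n"
    "....\n"
    "\n"
    "###################################################################\n"
    "#####end_of_compile -DO NOT MODIFY THIS LINE#################\n"
    "###################################################################\n"
)

result = question_pattern.match(sample)
print(result)  # None: no #-only line precedes the start line

# Checking for None before calling .group() avoids the AttributeError.
captured = result.group(1) if result else None
```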

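Apart from that, for (.*) to span several lines your pattern needs re.S (re.DOTALL) instead of re.MULTILINE, because by default the dot does not match a newline. The quantifier should also be non-greedy, (.*?), otherwise the group will also capture a trailing line made up only of # characters.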
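A minimal sketch of that re.S plus non-greedy approach follows. The pattern here is an illustrative simplification rather than the question's exact regex, and the sample string assumes the file starts directly with the start_of_compile line. re.M is kept alongside re.S only so that ^ can anchor at the start of any line.

```python
import re

# Assumed sample, mirroring the layout shown in the question.
sample = (
    "#####start_of_compile - DO NOT MODIFY THIS LINE#############\n"
    "##################################################################\n"
    "\n"
    "parse these lines in between them\n"
    "...\n"
    "....\n"
    "\n"
    "###################################################################\n"
    "#####end_of_compile -DO NOT MODIFY THIS LINE#################\n"
    "###################################################################\n"
)

# re.S lets the dot cross newlines; (.*?) is non-greedy, so the capture
# stops before the #-only line that precedes the end_of_compile line.
pattern = re.compile(
    r'^#+start_of_compile.*?\n#+\n(.*?)\n#+\n#+end_of_compile',
    re.S | re.M,
)

m = pattern.search(sample)
# strip() drops the blank lines that surround the captured block.
body = m.group(1).strip() if m else None
print(body)
```

Using search() rather than match() means the start line does not have to be the very first thing in the file.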

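If the data always looks like that, you can do without re.S and the non-greedy quantifier, which prevents unnecessary backtracking. Instead, match the content line by line and stop at the first line that consists only of # characters: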

^#+\n#+start_of_compile\b.*\n#+\n\s*^(.+(?:\n(?!#+$).*)*)


Example

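import re

# Same pattern as above; re.MULTILINE lets ^ and $ match at every line boundary
matchstart = re.compile(
    r'^#+\n#+start_of_compile\b.*\n#+\n\s*^(.+(?:\n(?!#+$).*)*)',
    re.MULTILINE
)
# lines holds the whole file contents, as read with compile_fh.read()
print(matchstart.match(lines).group(1))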

Output

parse these lines in between them
...
....
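Note that the example above calls match(), which only tries at the very start of the string, and the pattern expects a #-only line before the start_of_compile line. If the file may or may not have that leading line, a more defensive variant could look like the sketch below. The helper name extract_compile_block and the optional leading group are assumptions added for illustration, not part of the answer.

```python
import re

# Variant of the pattern above: the leading #-only line is optional
# (an assumption, since the posted sample does not show one).
COMPILE_BLOCK = re.compile(
    r'^(?:#+\n)?#+start_of_compile\b.*\n#+\n\s*^(.+(?:\n(?!#+$).*)*)',
    re.MULTILINE,
)


def extract_compile_block(text):
    """Return the text between the delimiters, or None if not found."""
    m = COMPILE_BLOCK.search(text)
    if m is None:
        return None
    # The capture can end with a blank line; rstrip() removes it.
    return m.group(1).rstrip()


# Assumed sample laid out like the text in the question.
sample = (
    "#####start_of_compile - DO NOT MODIFY THIS LINE#############\n"
    "##################################################################\n"
    "\n"
    "parse these lines in between them\n"
    "...\n"
    "....\n"
    "\n"
    "###################################################################\n"
    "#####end_of_compile -DO NOT MODIFY THIS LINE#################\n"
    "###################################################################\n"
)

print(extract_compile_block(sample))
print(extract_compile_block("no delimiters here\n"))  # None
```

Returning None instead of raising leaves it to the caller to decide how a file without the delimiters should be handled.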

Answer by salah

You can also collect the result in a pandas DataFrame column. The pattern spans several lines, so read the whole file into one string, run the regex over it, and store each captured block as a row:

import re
import pandas as pd

t1 = pd.DataFrame(columns=['List'])

# re.MULTILINE is required so that ^ and $ match at line boundaries
pattern0 = re.compile(
    r'\bstart_of_compile\b.*\n#+\n\s*^(.+(?:\n(?!#+$).*)*)',
    re.MULTILINE
)

# Read the whole file as one string; the pattern cannot match a single line
with open('compile.qel', 'r') as file:
    text = file.read()

# Append each captured block to the DataFrame as a new row
for match in pattern0.finditer(text):
    t1.loc[len(t1)] = [match.group(1).strip()]

This should leave the DataFrame t1 holding the parsed data.