Scrape text file with Python and save to new text file as values


I have multiple files with different names in a directory with content like this:

Combination :3   Tuple Number:3
Request Type:ADD
Firewall Type:JP
Firewall Policy Name :STI-CEP31

Rule Type: ALLOW

Requested Values:

Policy Name Suffix: 

Source IP: GRN

Source Groups:
10.151.2.0/24
10.151.1.0/24

Destination IP: Untrusted 
Destination Group:
169.176.39.0/24
169.176.38.0/24

Application(s):
Application Group:

Service Mode:Use Protocol/Ports
Service Group:

Protocol/Ports:

TCP      |       21099


Combination :5   Tuple Number:16
Request Type:ADD
Firewall Type:JP
Firewall Policy Name :STI-CEP31

Rule Type: ALLOW

Requested Values:

Policy Name Suffix: 

Source IP: GRN

Source Groups:
10.151.2.0/24
10.151.1.0/24


Destination IP: Untrusted 
Destination Group:
169.176.39.0/24
169.176.38.0/24
154.154.55.221
154.25.55.662
148.55.465.653

Application(s):
Application Group:

Service Mode:Use Protocol/Ports
Service Group:

Protocol/Ports:

TCP      |       219


Combination :100   Tuple Number:100
Request Type:ADD
Firewall Type:JP
Firewall Policy Name :STI-CEP31

Rule Type: ALLOW

Requested Values:

Policy Name Suffix: 

Source IP: GRN

Source Groups:
10.151.2.0/24
10.151.1.0/24

Destination IP: Untrusted 
Destination Group:
169.176.38.0/24
154.154.55.222
154.25.55.61
148.55.465.651

Application(s):
Application Group:

Service Mode:Use Protocol/Ports
Service Group:

Protocol/Ports:

TCP      |       210

I am trying to write Python code that treats each line starting with Combination as the start of one block. For example, there are three blocks in this text file (one starting with Combination :3, one starting with Combination :5, and one starting with Combination :100). The script should search each block for the IP provided by the user and should accept partial matches: if the input is 169.176.39.0 and the file contains 169.176.39.0/24, that counts as a match. As long as the base IP matches the input, the subnet suffix is irrelevant.
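One way to read the matching rule described here is to compare only the address part and ignore anything after the slash. The following is a minimal sketch under that assumption; `base_address` and `ip_matches` are hypothetical helper names, not part of the original code. It compares plain strings rather than parsing the addresses, because some entries in the sample (such as 154.25.55.662) are not valid IPv4 addresses and a strict parser would reject them.

```python
def base_address(entry: str) -> str:
    """Return the address part of an entry, dropping any '/prefix' suffix."""
    return entry.strip().split("/", 1)[0]


def ip_matches(query: str, entry: str) -> bool:
    """True when the query and the entry share the same base address."""
    return base_address(query) == base_address(entry)


print(ip_matches("169.176.39.0", "169.176.39.0/24"))  # True
print(ip_matches("169.176.39.0", "169.176.38.0/24"))  # False
```

A plain substring test (`query in entry`) would also accept the first case, but it would additionally match shorter fragments such as 169.176.3, which may or may not be wanted.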

If a match is found (for example, the blocks starting with Combination :3 and Combination :5 both list 169.176.39.0 in the Destination Group section), the script should print the Source Groups IPs, the Destination Group IPs, and the ports. The code I have is hard-coded to print the exact lines that follow the Source Groups, Destination Group, and Protocol/Ports headings, but it does not print them as values under their respective headings (Source Groups, Destination Group, Ports). Maybe the solution is a dictionary per block with the keys Combination, Tuple Number, Source Groups, Destination Group, and Protocol/Ports, where each value is the list of entries found under that heading.
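The dictionary idea could look roughly like the sketch below. It assumes the layout shown in the sample file: a block starts at a "Combination" line, any line containing a colon is a heading, and lines without a colon are values of the most recent heading. `parse_blocks`, `SECTION_HEADINGS`, `HEADER_RE`, and `SAMPLE` are hypothetical names introduced for illustration.

```python
import re

# Headings whose following lines are collected as values (taken from the sample file).
SECTION_HEADINGS = ("Source Groups", "Destination Group", "Protocol/Ports")
# Matches e.g. "Combination :3   Tuple Number:3" and captures both numbers.
HEADER_RE = re.compile(r"Combination\s*:\s*(\d+)\s+Tuple Number\s*:\s*(\d+)")


def parse_blocks(text: str) -> list[dict]:
    """Split the text into one dictionary per 'Combination' block."""
    blocks = []
    block = None
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        header = HEADER_RE.match(line)
        if header:
            # A new block begins; start with empty lists for every heading.
            block = {"Combination": int(header.group(1)),
                     "Tuple Number": int(header.group(2)),
                     **{name: [] for name in SECTION_HEADINGS}}
            blocks.append(block)
            section = None
        elif block is None:
            continue  # Ignore anything before the first block.
        elif ":" in line:
            # Assumption: a colon marks a heading, so values never contain one.
            name = line.split(":", 1)[0].strip()
            section = name if name in SECTION_HEADINGS else None
        elif section == "Protocol/Ports":
            # Keep only the port from lines like "TCP      |       21099".
            block[section].append(line.split("|", 1)[-1].strip())
        elif section:
            block[section].append(line)
    return blocks


SAMPLE = """\
Combination :3   Tuple Number:3
Request Type:ADD

Source IP: GRN

Source Groups:
10.151.2.0/24
10.151.1.0/24

Destination IP: Untrusted
Destination Group:
169.176.39.0/24

Protocol/Ports:

TCP      |       21099
"""

for parsed in parse_blocks(SAMPLE):
    print(parsed)
```

Because every block starts with empty lists for all three headings, later code can read `block["Source Groups"]` without checking whether the key exists.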

For example, take the file above, st.txt. When I run the script, it asks for an IP address and I enter 169.176.39.0. The script should then search for matches; in this scenario, matches are found in the Combination :3 Tuple Number:3 block and the Combination :5 Tuple Number:16 block. The script should print the values from the file, or save them to a separate file called output.txt, in the following format:

Output Expected

File Name: st.txt
Combination Number: 3
Tuple Number: 3

Source Groups:
10.151.2.0/24
10.151.1.0/24

Destination Groups:
169.176.39.0/24
169.176.38.0/24


Ports: 21099

Combination Number: 5
Tuple Number: 16

Source Groups:
10.151.2.0/24
10.151.1.0/24

Destination Groups:
169.176.39.0/24
169.176.38.0/24
154.154.55.221
154.25.55.662
148.55.465.653

Ports: 219

The script should scan the directory for all text files (every text file has the same format and indentation as the example above) and write the results to output.txt in the same format, with the File Name line changed at the start of each file's results.
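The directory scan could be sketched as follows. This is only an outline under stated assumptions: `find_text_files` and `write_report` are hypothetical names, and `render` stands for whatever function turns one file's text into the report body for that file. The report file itself is skipped so that a second run does not scan its own output.

```python
from pathlib import Path


def find_text_files(directory: str, output_name: str = "output.txt") -> list[Path]:
    """Return the .txt files in a directory, sorted by name, skipping the report file."""
    return sorted(
        path for path in Path(directory).glob("*.txt")
        if path.name != output_name
    )


def write_report(directory: str, render, output_name: str = "output.txt") -> Path:
    """Write one 'File Name:' section per text file and return the report path."""
    out_path = Path(directory) / output_name
    with out_path.open("w", encoding="utf-8") as out:
        for path in find_text_files(directory, output_name):
            out.write(f"File Name: {path.name}\n")
            # 'render' is supplied by the caller and returns the text to write.
            out.write(render(path.read_text(encoding="utf-8")))
            out.write("\n")
    return out_path
```

Calling `write_report(".", render)` with a suitable `render` function would then produce output.txt in the current directory.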

Here is the code I have. It works when scanning one file, but it gives the results in a hard-coded form:

import os

filename = input("Please provide the name of the file: ")
ip_search = input("Please provide the IP:")
blocks = []
attributes = ("Source Groups:", "Destination Group", "Protocol/Ports")

with open(filename, 'r') as f:
    block = {}
    in_section = None
    for line in f:
        line = line.strip()
        if line:
            # Stop consuming values for multi-line attributes if the line 
            # contains a colon. It's assumed to be another single-line 
            # attribute:value pair.
            if ":" in line:
                in_section = None
            # The end of a block starts/ends at a line starting with 
            # "Combination" or the end of the file
            if line.startswith("Combination"):
                if block:
                    blocks.append(block)
                    block = {}
                block["block"] = line
            # Consume the line since we're in a multi-line attribute section
            elif in_section:
                values = block.setdefault(in_section, [])
                
                # We only want the port
                if in_section == "Protocol/Ports":
                    line = line.split("|", maxsplit=5)[1].strip()
                
                values.append(line)
                
            # Check if the line is the start of a multi-line attribute
            else:
                for attribute in attributes:
                    if line.startswith(attribute):
                        in_section = attribute
                        break

# The end of a block starts/ends at a line starting with 
# "Combination" or the end of the file
if block:
    blocks.append(block)

# Keep only the blocks that mention the IP address. A substring test on each
# entry lets 169.176.39.0 match 169.176.39.0/24, and .get() avoids a KeyError
# when a block has no entries under a heading.
blocks_with_certain_ip = []
for block in blocks:
    entries = block.get("Source Groups:", []) + block.get("Destination Group", [])
    if any(ip_search in entry for entry in entries):
        blocks_with_certain_ip.append(block)

# Format and print the blocks as desired
for block in blocks_with_certain_ip:
    string = (f'{filename}'
              f'# {block["block"]} '
              f'# Source Groups: {" ".join(block.get("Source Groups:", []))} '
              f'# Destination Group: {" ".join(block.get("Destination Group", []))} '
              f'# Ports # {" ".join(block.get("Protocol/Ports", []))}')
    print(string)

If anyone can help me solve this, it would be a great help.

There is 1 answer

SimonUnderwood

I don't have the output format exactly as you requested it, but I did wrap the results in a list of dictionaries, which you can then format however you want. Note that the code uses match statements, so it needs Python 3.10 or later. Hope this helps:

import os

def scrape_ip(txt: str) -> list[dict]:
    blocks = txt.split('\n\n')
    res: list[dict] = []
    block: dict = {}
    for unparsed in blocks:
        tokens = [word for line in unparsed.splitlines() for word in line.split(' ') if word]
        match tokens:
            case ['Combination', comb, 'Tuple', tup, *_]:
                if block: 
                    res.append(block)
                block = {'Combination': int(comb[1:]), 'Tuple': int(tup[7:])}
            case ['Source', 'Groups:', *ips]:
                block['Source Groups'] = ips
            case ['Destination', 'IP:', *rest]:
                s = ' '.join(rest)
                # Keep only the entries listed after the 'Destination Group:' heading
                s = s[s.index('Destination Group:') + len('Destination Group:'):]
                block['Destination Groups'] = s.split()
            case ['TCP', '|', port]:
                block['Port'] = int(port)
        
    if block:
        res.append(block)
    return res


def main():
    files = [file for file in os.listdir() if file.endswith('.txt')]
    res: dict[str, list[dict]] = {}
    for file in files:
        with open(file, 'r') as f:
            txt = f.read()
            f_res = scrape_ip(txt)
            res[file] = f_res
    
    for file, f_res in res.items():
        print(file)
        for block in f_res:
            for k, v in block.items():
                print(f'{k}: {v}')
            print()
        print()

    while search := input('Search For an IP (Enter nothing to exit) \n>>> '):
        search_res: list[str] = []
        for file, f_res in res.items():
            for block in f_res:
                if 'Source Groups' in block:
                    for ip in block['Source Groups']:
                        if search in ip:
                            search_res.append(f'{file} - Combination: {block["Combination"]}, Tuple: {block["Tuple"]} - Source Group: {ip}')
                if 'Destination Groups' in block:
                    for ip in block['Destination Groups']:
                        if search in ip:
                            search_res.append(f'{file} - Combination: {block["Combination"]}, Tuple: {block["Tuple"]} - Destination Group: {ip}')
        if search_res:
            print('Search Results:')
            for result in search_res:
                print(result)
        else:
            print('No Results Found')


if __name__ == '__main__':
    main()

(I'm a bit tired right now, so I don't have it commented, but if you want a detailed walk-through of the code, just leave a reply and I'll make an edit tomorrow morning.)

Edit: Added searching functionality.

Edit 2: Fixed pattern matching for edge cases.
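For anyone who wants output closer to the layout requested in the question, a formatter along these lines might work. It is a sketch, not part of the code above: `format_block` is a hypothetical helper, and it assumes each block is a dictionary with the keys that `scrape_ip` produces ('Combination', 'Tuple', 'Source Groups', 'Destination Groups', 'Port').

```python
def format_block(block: dict) -> str:
    """Render one parsed block with each value listed under its heading."""
    lines = [
        f"Combination Number: {block['Combination']}",
        f"Tuple Number: {block['Tuple']}",
        "",
        "Source Groups:",
        *block.get("Source Groups", []),   # One address per line.
        "",
        "Destination Groups:",
        *block.get("Destination Groups", []),
        "",
        f"Ports: {block.get('Port', '')}",  # Empty when no port was parsed.
    ]
    return "\n".join(lines)


example = {
    "Combination": 3,
    "Tuple": 3,
    "Source Groups": ["10.151.2.0/24", "10.151.1.0/24"],
    "Destination Groups": ["169.176.39.0/24", "169.176.38.0/24"],
    "Port": 21099,
}
print(format_block(example))
```

Printing "File Name: " followed by the file name before the blocks of each file, and writing the same strings to output.txt instead of the console, would cover the rest of the requested format.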