Rearranging sections of lines of a file using regular expressions in Python


I am creating a script that goes through a file with a certain format and rearranges it to match the format of another file. Here is a sample of the unformatted file:

, 0x40a846, mov [ecx+2bh],al, 88 41 2B, , , , \par
, 0x40a849, jmp $+001775cbh (0x581e14), E9 C6 75 17 00, , , , \par
, 0x40a84e, int3, CC, , , , \par
, 0x40a84f, int3, CC, , , , \par
, 0x40a850, push esi, 56, , , , \par
, 0x40a851, mov esi,ecx, 8B F1, , , , \par

The end goal is to have each line of the file look like this:

0x40a846, 0x 88 41 2B ,"mov [ecx+2bh],al",,,

My main issue is that some lines of the file have only one section of source code while others have two, which makes it difficult to write a regular expression that grabs both without also grabbing the code bytes by accident. I want to use capture groups to rearrange the information on each line. Below is my script so far:

import csv
import string
import re, sys
file_to_change = 'testingthecodexlconverter.csv'
    # = raw_input("Please specify what codexl file you would like to convert: ")
file1 = open(file_to_change, 'r+')

with file1  as f:
    for line in f:
        line = line[2:-12]
        line = line.rstrip('\n') + ',,'
       # mo = re.search(r'(.*?),.*?.*?,.*?(.*?),.*?.*?,.*?(.*?),.*?.*?,.*?(.*?)', line)
       #mo = re.search(r'(.*?),.*?(.*?,.*?.*?,).*?.*?,.*?(.*?),.*?.*?,.*?(.*?)', line)
        mo = re.search(r'(.*?),.*?(.*?.*?,\S*?,).*?.*?.*?,.*?(.*?),', line)  
        if mo:
            print(mo.group(2))

Can anyone lend me a hand?

There are 3 answers

1
Alexander McFarlane On

I'd use pandas and just rearrange the columns according to your need as it seems they are in a reasonable csv format. This method also allows you to visualise how you manipulate the data in your csv whilst you edit it:

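import pandas as pd

# The file has no header row; replace missing fields with empty strings.
df = pd.read_csv('inputCSV.csv', header=None).fillna('')
df = df.astype(str)
# With no path given, to_csv returns the CSV text as a string.
# (Older pandas releases spell lineterminator as line_terminator.)
out = df[[4, 1, 2]].to_csv(index=False, header=False, encoding='utf-8', lineterminator='\r\n')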

Your problem is a little unclear about what data format you are expecting in each individual column.

I believe you might have missing commas in your input CSV file. My suggestion is to search for the missing commas and add them so that the input file is properly formatted.

The fastest way, of course, is to split the string with .split() as mentioned above, but since you seem unsure about the parsing, I suggest pandas.
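As a rough illustration of the .split() route, here is a minimal sketch. It assumes every line looks like the sample in the question (a leading comma, the address, the instruction text, the code bytes, then empty fields and a trailing \par). The function name split_fields is made up for this example. Splitting once from the left and once from the right keeps any commas inside the instruction text intact.

```python
def split_fields(line):
    """Return (address, instruction, code_bytes) for one sample-style line."""
    body = line.strip()
    # Drop the trailing '\par' marker if it is present.
    if body.endswith('\\par'):
        body = body[:-len('\\par')]
    # Remove the leading comma and the empty trailing fields.
    body = body.strip(' ,')
    # The address is the first field; the code bytes are the last one.
    address, rest = body.split(',', 1)
    instruction, code_bytes = rest.rsplit(',', 1)
    return address.strip(), instruction.strip(), code_bytes.strip()
```

Calling split_fields on the first sample line should give the address, the whole instruction "mov [ecx+2bh],al" as one piece, and the code bytes, regardless of whether the instruction has one operand or two.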

1
Dan On

You can tokenize your lines, as suggested by others, by splitting at the commas and then adding the commas back when you print:

file_to_change = 'testingthecodexlconverter.csv'

file1 = open(file_to_change, 'r+')

with file1  as f:
    for line in f:
        line = line[2:-12]

        tokens = line.split(',')

        # If token index 3 is empty, the instruction contains no comma and
        # tokens[2] holds the code bytes. Otherwise tokens[3] holds the code
        # bytes and tokens[2] is the second operand.
        if not tokens[3]:
            print(tokens[0] + ", " + tokens[2].strip(" ") + ", " + tokens[1] + ",,,")
        else:
            print(tokens[0] + "," + tokens[3] +  ", " + tokens[2].strip(" ") + ", " + tokens[1] + ",,,")

This will print in the format:

0x40a846, 88 41 2B, al,  mov [ecx+2bh],,,
0x40a849, E9 C6 75 17 00,  jmp $+001775cbh (0x581e14),,,
0x40a84e, CC,  int3,,,
0x40a84f, CC,  int3,,,
0x40a850, 56,  push esi,,,
0x40a851, 8B F1, ecx,  mov esi,,,
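The output above leaves the two operands in separate columns. If the instruction should stay in one quoted field, as in the target line from the question, one option is to rejoin the tokens between the address and the code bytes and let csv.writer add the quotes. The sketch below assumes tokens shaped like the ones produced by the loop above; rebuild_row and rows_to_text are hypothetical helper names, and the " 0x ... " padding is copied from the question's target line.

```python
import csv
import io

def rebuild_row(tokens):
    """Turn the comma-split tokens of one line into the target column order."""
    # Keep only non-empty tokens and ignore a stray '\par' marker.
    fields = [t.strip() for t in tokens if t.strip() and t.strip() != '\\par']
    address, code_bytes = fields[0], fields[-1]
    # Everything between the address and the code bytes is the instruction;
    # rejoining with ',' restores operands that the split separated.
    instruction = ','.join(fields[1:-1])
    return [address, ' 0x {} '.format(code_bytes), instruction, '', '', '']

def rows_to_text(rows):
    """Serialize rows as CSV text; fields containing a comma get quoted."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)  # default quoting is csv.QUOTE_MINIMAL
    writer.writerows(rows)
    return buffer.getvalue()
```

With this approach the instruction field is quoted only when it contains a comma, so single-operand instructions such as int3 come out unquoted.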
0
Rick Sullivan On

You can use the csv module, which you have already imported but aren't currently using.

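import csv

file_path = 'test.csv'

# newline='' lets the csv module handle line endings itself.
with open(file_path, newline='') as csvfile, \
        open('tempfile.csv', 'w', newline='') as outfile:
    reader = csv.reader(csvfile)
    writer = csv.writer(outfile, delimiter=',')
    for row in reader:
        # Drop empty fields and the trailing '\par' marker from the sample input.
        new_row = [e.strip() for e in row if e.strip() not in ('', r'\par')]
        if not new_row:
            continue  # skip blank lines
        # The new row should have the first element, then the last,
        # followed by everything else that wasn't empty.
        new_row = [new_row[0], new_row[-1]] + new_row[1:-1]
        writer.writerow(new_row)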

The new csv file looks like this:

0x40a846,88 41 2B,mov [ecx+2bh],al 
0x40a849,E9 C6 75 17 00,jmp $+001775cbh (0x581e14) 
0x40a84e,CC,int3
0x40a84f,CC,int3
0x40a850,56,push esi
0x40a851,8B F1,mov esi,ecx
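For the capture-group route the question originally asked about, a regular expression can also work if the code bytes are anchored to the end of the line. The sketch below assumes the line layout from the question's sample (code bytes written as space-separated two-digit hex pairs, followed only by empty fields and \par). LINE_RE and rearrange are names invented for this example. Because nothing but empty fields may follow the byte group, a second operand such as al or ecx cannot be mistaken for code bytes.

```python
import re

# Assumed layout: ", <address>, <instruction>, <code bytes>, , , , \par"
LINE_RE = re.compile(
    r'^\s*,\s*(0x[0-9a-fA-F]+)\s*,\s*'         # group 1: address
    r'(.*?)\s*,\s*'                             # group 2: instruction (may contain commas)
    r'((?:[0-9A-Fa-f]{2}\s)*[0-9A-Fa-f]{2})'    # group 3: space-separated hex byte pairs
    r'\s*(?:,\s*)*(?:\\par)?\s*$'               # only empty fields and \par may follow
)

def rearrange(line):
    """Return the line in the question's target layout, or None if it does not match."""
    mo = LINE_RE.match(line)
    if mo is None:
        return None
    address, instruction, code_bytes = mo.groups()
    return '{}, 0x {} ,"{}",,,'.format(address, code_bytes, instruction)
```

Unlike csv.writer, this version always wraps the instruction in quotes, which matches the single target line shown in the question; whether one-operand instructions should also be quoted is not stated there.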