Regular Expression with parenthesis in a string from PyPDF2 and DataFrame with positive and negative numbers

379 views Asked by At

I am currently using a script that pulls a PDF to a string via PyPDF2, then searches that PDF with regular expressions for expense amounts in brackets. These expenses are then summed and manipulated in a pd.DataFrame and the result is outputted to a CSV file.

All was fine until a refund was issued. Essentially all expenses where contained within parenthesis and I would use re to search for the x-expense eg.

adminFee = re.findall('Administration Fees\s*\((.*?)\)', data) 

This was then added to list of totalFees

totalFees = adminFee + govFee + commFee....ect. 

The list is then converted to a DataFrame for manipulation.

Recently, the format of the PDF changed to contain a further expense string. Which also at times may constitute as a refund. The string I search for with re looks as such

""Excess Expenses (Refund from client) (150.00)""

or

""Excess Expenses (Refund from client) 200.00""

My first issue is with the parenthesis, I have tried the following to no avail

excessExpenses = re.findall('Excess expenses \(Refund From client\)\s*(\d*\,*\d*\.+\d+\))',data)

But it returns an empty list [] ? I have tried several re's but none seem to be working.

My next issue is that previously these where all expenses (negative amounts), but now is including positive values with refunds. This messes with the DataFrame calculation. How best should I resolve this. I could either go back to the previous expenses within parenthesis and make them negative via something like:

adminFee = re.findall('Administration Fees\s*(\(?.*?\)?)',data)
adminFee = ''.join(adminFee) 
adminFee = adminFee.replace(',','') #for float manipulation in dataframe
adminFee = adminFee.replace('(','-') # replace ( to make -
adminFee = adminFee.replace(')','') #remove last braket
adminFee = [adminFee] #make list again

And then pull the positive amounts, it would allow the DataFrame to be calculated correctly, less expenses and adding refunds.

Or would it be more advisable to follow another root?

I know this code may be the least pythonic you have ever seen, but it has been working well for the past year.

1

There are 1 answers

3
JosefScript On BEST ANSWER

It's mainly typos: a missing space after the opening bracket before 'Refund', a lower case 'e' instead of 'E' in ''Expenses' and a superfluous closing bracket '\)' following the number. This one should work:

'Excess Expenses\s*\(\s*Refund From client\)\s*(\d*\,*\d*\.+\d+)'

But it is not the best solution. You will never have more than one dot, so no need for the quantifier '+'. '?' does it. In case you are dealing with more than millions, you might want to handle more than one comma, which by the way needs no escape, so instead of '\d*\,*' this would be better: '(?:\d*,)*'. And finally, to match negative numbers just add an optional minus in the beginning: '-?'.

Putting it together you get:

'Excess Expenses\s*\(\s*Refund From client\)\s*(-?/(?(?:\d*,)*\d*\.?\d+/)?)'

Hope this helps.

Edit : I added optional brackets (same as in your comment.) '(?:' signifies a non-capturing group, i. e. a group that can for example quantified by '*' without being captured as separate result group.