Splitting string numerical values into new columns - Pandas DataFrame


I have a dataframe column having values like this:

Salary Offered
----------------------
£18,323 per annum 
£18,000 - £22,000 per annum 
Salary not specified 
£15,000 - £17,000 per annum, pro-rata 
£37,000 - £45,000 per annum 
£9,100 - £9,152 per annum, OTE 
£9.25 - £10.15 per hour 
£35,000 - £40,000 per annum 
£23,000 - £26,600 per annum 
£18,000 - £25,000 per annum, inc benefits 

So I ran the following command. It did a good job of replacing the pure string values (like "Salary not specified") with None, which I can later replace with random values, but I would still have to split the remaining values on £ again:

In[13]: df = pd.DataFrame(df.salary_offered.str.split('£', n=1).tolist(),
                                   columns = ['flips','row'])
In[14]: df['row']
Out[14]: 
0                                     18,323 per annum 
1                           18,000 - £22,000 per annum 
2                                                  None
3                 15,000 - £17,000 per annum, pro-rata 
4                           37,000 - £45,000 per annum 
5                        9,100 - £9,152 per annum, OTE 
6                               9.25 - £10.15 per hour 
7                           35,000 - £40,000 per annum 
8                           23,000 - £26,600 per annum 
9             18,000 - £25,000 per annum, inc benefits 

There are also a few rows with salaries given per hour, so I will need to convert those as well, which should be straightforward. But I want to separate the values into a new column holding the mean of each row, something like this:

Salary (£)
---------------
18323
20000
18000
16000
41000
...

There are 2 answers

sacuL

If I understand correctly, you can extract what you need (numbers) with a regex, and do your calculations on the result:

# Strip thousands separators, pull out every number, then average per row
salaries = (df['Salary Offered']
            .str.replace(',','')
            .str.findall(r'(\d+\.?\d+)')
            .apply(lambda x: pd.Series(x).astype(float))
            .mean(axis=1))


>>> salaries
0    18323.0
1    20000.0
2        NaN
3    16000.0
4    41000.0
5     9126.0
6        9.7
7    37500.0
8    24800.0
9    21500.0

regex explanation: \d matches any digit character. \d+ matches any sequence of one or more digits (+ means one or more in regex). \.? means "optionally, match a literal .".

So all together, \d+\.?\d+ says: "match a sequence of digits, optionally followed by a ., then another sequence of digits". Because both \d+ parts are required, the pattern only matches numbers with at least two digits, which is fine for the salaries here.
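To see how the pattern behaves outside of pandas, here is a small sketch using only the standard `re` module. The sample strings are taken from the question with the commas and £ signs already removed; the last one, with a single-digit value, is a made-up case added for illustration.

```python
import re

PATTERN = r'\d+\.?\d+'

samples = [
    '18000 - 22000 per annum',
    '9.25 - 10.15 per hour',
    'Salary not specified',
    '5 per hour',  # hypothetical single-digit value, not from the question
]

# Map each sample to the list of numbers the pattern finds in it
matches = {text: re.findall(PATTERN, text) for text in samples}

for text, found in matches.items():
    print(f'{text!r} -> {found}')
```

The single-digit sample yields an empty list, so if such values can occur in your data a pattern like `\d+(?:\.\d+)?` may be a safer choice.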

Dealing with per hour vs per annum

I'm not sure what you mean to do about the per hour rows, but you said that you can do it intuitively, so I suppose you have a plan for it.

Personally, I would do something along the lines of the following, though you might have to tweak it based on your dataframe and what you're trying to capture specifically.

salaries = (df['Salary Offered']
            .str.replace(',','')
            .str.findall(r'(\d+\.?\d+)')
            .apply(lambda x: pd.Series(x).astype(float))
            .mean(axis=1)
            .to_frame('salary offered'))

# expand=False returns a Series, so it can be assigned straight to one column
salaries['per'] = df['Salary Offered'].str.extract(r'(per\s\w+)', expand=False)

>>> salaries
   salary offered        per
0         18323.0  per annum
1         20000.0  per annum
2             NaN        NaN
3         16000.0  per annum
4         41000.0  per annum
5          9126.0  per annum
6             9.7   per hour
7         37500.0  per annum
8         24800.0  per annum
9         21500.0  per annum
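With the period in its own column, one possible next step is to put the hourly rows on an annual footing. This is only a sketch: the conversion factor `HOURS_PER_YEAR` (37.5 hours a week for 52 weeks) and the column name `annual salary` are assumptions made for illustration, not something stated in the question, and the small frame below just mirrors a few rows of the output above.

```python
import pandas as pd

# ASSUMPTION: 37.5 hours/week * 52 weeks; adjust to whatever fits your data
HOURS_PER_YEAR = 37.5 * 52

# A few rows shaped like the `salaries` frame shown above
salaries = pd.DataFrame({
    'salary offered': [18323.0, 20000.0, None, 9.7],
    'per': ['per annum', 'per annum', None, 'per hour'],
})

# True only for hourly rows; rows with a missing period compare as False
is_hourly = salaries['per'].eq('per hour')

# Keep annual (and missing) values as they are, scale up the hourly ones
salaries['annual salary'] = salaries['salary offered'].where(
    ~is_hourly, salaries['salary offered'] * HOURS_PER_YEAR)

print(salaries)
```

Rows with no salary stay NaN, so they can still be filled in separately afterwards.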
user3483203

Setup

df = pd.DataFrame({'salary': ['£18,000 - £22,000 per annum', '£9.25 - £10.15 per hour']})

Helper function to extract all currency values from a string and return their average:

import re

def extract_average(s):
    # Collect every number that follows a £ sign, then average them
    money = list(map(float, re.findall(r'£([\d\.]+)', s)))
    return sum(money)/len(money)

str.replace and apply:

df.salary.str.replace(',', '').apply(extract_average)

0    20000.0
1        9.7
Name: salary, dtype: float64
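The setup above only contains rows that have a £ amount. For rows like "Salary not specified" from the question, the list of amounts is empty and the division fails. A minimal sketch of one way to handle that, using only the standard library; `average_salary` is a hypothetical name, and returning NaN for such rows is an assumption about what you want:

```python
import math
import re

def average_salary(text):
    """Average every £ amount in text; NaN when there is none."""
    cleaned = text.replace(',', '')  # drop thousands separators
    amounts = [float(m) for m in re.findall(r'£(\d+(?:\.\d+)?)', cleaned)]
    return sum(amounts) / len(amounts) if amounts else math.nan

rows = [
    '£18,323 per annum',
    '£18,000 - £22,000 per annum',
    'Salary not specified',
    '£9.25 - £10.15 per hour',
]

averages = [average_salary(row) for row in rows]
print(averages)
```

On a DataFrame the same function could be used as `df.salary.apply(average_salary)`.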