I have a dataframe column with values like this:
Salary Offered
----------------------
£18,323 per annum
£18,000 - £22,000 per annum
Salary not specified
£15,000 - £17,000 per annum, pro-rata
£37,000 - £45,000 per annum
£9,100 - £9,152 per annum, OTE
£9.25 - £10.15 per hour
£35,000 - £40,000 per annum
£23,000 - £26,600 per annum
£18,000 - £25,000 per annum, inc benefits
So I ran the following command, which did a good job of replacing the pure string values (like "Salary not specified") with None. I can replace those with random values, but I still have to split the remaining values by £ again:
In[13]: df = pd.DataFrame(df.salary_offered.str.split('£', n=1).tolist(),
                          columns=['flips', 'row'])
In[14]: df['row']
Out[14]:
0 18,323 per annum
1 18,000 - £22,000 per annum
2 None
3 15,000 - £17,000 per annum, pro-rata
4 37,000 - £45,000 per annum
5 9,100 - £9,152 per annum, OTE
6 9.25 - £10.15 per hour
7 35,000 - £40,000 per annum
8 23,000 - £26,600 per annum
9 18,000 - £25,000 per annum, inc benefits
There are also a few rows with salaries given per hour, so those will need to be replaced as well, which can be done intuitively. But I want to end up with a separate column holding the mean values, something like this:
Salary (£)
---------------
18323
20000
18000
16000
41000
...
If I understand correctly, you can extract what you need (numbers) with a regex, and do your calculations on the result:
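A minimal sketch of that idea, not necessarily the exact code originally posted here. It assumes the column is named salary_offered as in the question, and that thousands separators should be stripped before matching (otherwise 18,323 would be picked up as 18 and 323). The sample rows and the salary_mean column name are illustrative only.

```python
import pandas as pd

# Sample rows copied from the question; the real dataframe is assumed to look similar.
df = pd.DataFrame({
    "salary_offered": [
        "£18,323 per annum",
        "£18,000 - £22,000 per annum",
        "Salary not specified",
        "£9.25 - £10.15 per hour",
    ]
})

# Strip thousands separators so "18,323" is matched as a single number.
cleaned = df["salary_offered"].str.replace(",", "", regex=False)

# extractall returns one row per match, indexed by (original row, match number).
numbers = cleaned.str.extractall(r"(\d+\.?\d+)")[0].astype(float)

# Average the matches per original row; rows with no match end up as NaN.
df["salary_mean"] = numbers.groupby(level=0).mean()
```

Rows without any number, such as "Salary not specified", come out as NaN here, so they can be filled in afterwards however you prefer.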
regex explanation:
\d finds any digit character.
\d+ finds any sequence of one or more digits (+ means "one or more" in regex).
\.? means "optionally, find a literal .".
So all together, \d+\.?\d+ says: "find a sequence of digits, optionally followed by a ., and then another sequence of digits".
dealing with the per hour vs per annum
I'm not sure what you mean to do about the per hour rows, but you said that you can do it intuitively, so I suppose you have a plan for it. Personally, I would do something along the lines of the following, though you might have to tweak it based on your dataframe and what you're trying to capture specifically.
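One possible way to handle the per hour rows, offered as a hedged sketch rather than the code originally posted here: flag the hourly rows and scale them to an annual figure. The HOURS_PER_YEAR constant (37.5 hours a week for 52 weeks) is purely an assumption for illustration, as are the sample rows and the salary_mean column name.

```python
import pandas as pd

# Assumption: a full-time year is 37.5 hours/week for 52 weeks. Adjust to your data.
HOURS_PER_YEAR = 37.5 * 52

# Sample rows copied from the question.
df = pd.DataFrame({
    "salary_offered": [
        "£18,000 - £22,000 per annum",
        "£9.25 - £10.15 per hour",
        "Salary not specified",
    ]
})

# Same extraction as before: drop commas, pull out the numbers, average per row.
cleaned = df["salary_offered"].str.replace(",", "", regex=False)
numbers = cleaned.str.extractall(r"(\d+\.?\d+)")[0].astype(float)
df["salary_mean"] = numbers.groupby(level=0).mean()

# Flag hourly rows and scale them to an annual figure; other rows are left unchanged.
is_hourly = df["salary_offered"].str.contains("per hour", regex=False)
df.loc[is_hourly, "salary_mean"] *= HOURS_PER_YEAR
```

If you would rather keep hourly and annual pay apart, the is_hourly mask could instead be stored as its own column and the scaling step skipped.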