Remove special characters and non standard values in a pandas


I am working with a pandas DataFrame in which a column contains non-standard values. Is there a way I can extract or replace characters and digits in the column? I am very new to applying regex patterns to clean data.

One column is Precise_Age and the second column is Browser.

In the Browser column I want only the name and the major version (if the version is 10.1.2, I want only 10), e.g. Android 10, Android 4, iOS 11.

Browser

Browser                                        desired_output
75.0.3770.143 | Chrome Dev    | Android | 9       Android 9
78.0.3904.108 | Chrome Dev    | Android | 9       Android 9
79.0.3945.93  | Chrome Dev    | Android | 9       Android 9
79.0.3945.93  | Chrome Dev    | Android | 8.0.0   Android 8
              |               | Android | 8.1.0   Android 8
79.0.3945.116 | Chrome Dev    | Android | 10      Android 10
79.0.3945.93  | Chrome Dev    | Android | 5.1     Android 5
              |               | Android | 10      Android 10
              | Facebook      | Android | 8.1.0   Android 8
79.0.3945.116 | Chrome Dev    | Android | 4.4.4   Android 4
              |               | Android | 8.1.0   Android 8
79.0.3945.79  | Chrome Dev    | Windows | 8       Windows 8
77.0.3865.116 | Chrome Dev    | Android | 9       Android 9
88.1.284108841| Google Search | iOS     | 13.3    iOS 13

In the Age column I want only standard values, with blanks, commas, etc. removed. If an age is greater than 100, it should be set to missing.

Age

Age            desired_output
67                 67
66                 66
67.5               67
60대후반        60
1949ë…„            null
63세              63
83ë…„ìƒ        83
11세              11
7217861839         null
59 years           59
60세              60
73.87083774        73
54ë…„ìƒ        54
55세              55
327                null
37ë…„ìƒ        37
642                null
523                null
0.61               0
53세              53
42ë…„ìƒ        42
757575             null
91.98192554        91
1.11991            1
83세(만82세)    83
4324234            null
8827               null
11 Years           11

There is 1 answer

n1colas.m On

After splitting the Browser column on the | separator, you can use map to transform each piece the way you need. Then join the last two columns (the OS name and the major version) to obtain the desired output.

The same principle applies to the Age column, this time using re.sub as the mapped function to keep only "the standard values".

import re

import pandas as pd

br = pd.read_csv("browser.csv")
age = pd.read_csv("age.csv")

# Split "build | app | OS | OS version" into separate columns.
parts = br.iloc[:, 0].str.split("|", expand=True)
os_name = parts.iloc[:, -2].str.strip()
# Keep only the major version, e.g. " 8.1.0" -> "8".
os_major = parts.iloc[:, -1].str.strip().str.split(".").str[0]

df = pd.DataFrame()
df["Browser Version"] = os_name + " " + os_major

# Keep the leading digits and drop everything from the first non-digit on
# (decimals, blanks, commas, symbols, etc.).
digits = age.iloc[:, -1].astype(str).map(lambda x: re.sub(r"\D+.*", "", x))

# Keep the number if it is at most 100, else replace it with "null".
# An empty string (no leading digits) also becomes "null" instead of raising.
age[age.columns[-1]] = digits.map(
    lambda x: int(x) if x and int(x) <= 100 else "null"
)

print(df)
print(age)

Output from df and age

   Browser Version    |          Age
0        Android 9    |    0      67
1        Android 9    |    1      66
2        Android 9    |    2      67
3        Android 8    |    3      60
4        Android 8    |    4    null
5       Android 10    |    5      63
6        Android 5    |    6      83
7       Android 10    |    7      11
8        Android 8    |    8    null
9        Android 4    |    9      59
10       Android 8    |    10     60
11       Windows 8    |    11     73
12       Android 9    |    12     54
13          iOS 13    |    13     55
14                    |    14   null
15                    |    15     37
16                    |    16   null
17                    |    17   null
18                    |    18      0
19                    |    19     53
20                    |    20     42
21                    |    21   null
22                    |    22     91
23                    |    23      1
24                    |    24     83
25                    |    25   null
26                    |    26   null
27                    |    27     11
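
Since the question asks about regex specifically, the same cleaning can also be sketched with pandas' vectorized str.extract instead of split and map. This is only a minimal sketch under some assumptions: the helper names clean_browser and clean_age and the max_age parameter are made up for illustration, the sample rows are copied from the question rather than read from a CSV, and missing ages are represented as pandas <NA> rather than the string "null".

```python
import pandas as pd

# Small inline samples copied from the question (no CSV files assumed).
browser = pd.Series([
    "75.0.3770.143 | Chrome Dev    | Android | 9",
    "              |               | Android | 8.1.0",
    "79.0.3945.79  | Chrome Dev    | Windows | 8",
    "88.1.284108841| Google Search | iOS     | 13.3",
])
age = pd.Series(["67", "67.5", "63세", "59 years", "0.61", "327", "7217861839"])


def clean_browser(s: pd.Series) -> pd.Series:
    """Build 'OS major_version' from the last two '|'-separated fields.

    Hypothetical helper, not part of pandas.
    """
    # Group 1: the OS name; group 2: the leading digits of its version.
    parts = s.str.extract(r"\|\s*([^|]*?)\s*\|\s*(\d+)[^|]*$")
    return parts[0] + " " + parts[1]


def clean_age(s: pd.Series, max_age: int = 100) -> pd.Series:
    """Keep the leading integer of each value; <NA> if absent or > max_age.

    Hypothetical helper; max_age is an assumed parameter name.
    """
    # Leading run of digits only, so "67.5" -> "67" and "59 years" -> "59".
    leading = s.astype(str).str.extract(r"^\s*(\d+)", expand=False)
    nums = pd.to_numeric(leading, errors="coerce")
    # Values above max_age are masked, then stored as nullable integers.
    return nums.where(nums <= max_age).astype("Int64")


print(clean_browser(browser))
print(clean_age(age))
```

Using the nullable Int64 dtype keeps the column numeric, so later steps such as mean() or isna() work directly, whereas a "null" string turns the column into mixed objects. If the literal text "null" is needed for export, it can be filled in at the very end.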