Delete rows where any column contains a certain string

Question

Delete rows where any column contains a certain string

942 views Asked by Niamh At 12 December 2022 at 19:55

I am dealing with a dataset that uses ".." as a placeholder for null values. These null values span across all of my columns. My dataset looks as follows:

Country Code	Year	GDP growth (%)	GDP (constant)
AFG	2010	3.5	..
AFG	2011	..	2345
AFG	2012	1.4	3372
ALB	2010	..	4567
ALB	2011	..	5678
ALB	2012	4.2	..
DZA	2010	2.0	4321
DZA	2011	..	5432
DZA	2012	3.8	6543

I want to remove the rows containing missing data from my data however my solutions are not very clean.

I have tried:

df_GDP_1[df_GDP_1.str.contains("..")==False]

Which I had hoped to be a solution to deal with all columns at once, however this returns an error.

Otherwise I have tried:

df_GDP_1[df_GDP_1.col1 != '..' | df_GDP_1.col2 != '..']

However this solution requires me to alter names of columns to remove spaces and then reverse this after, and even at that, which seems unnecessarily long for the task at hand.

Any ideas which enable me to perform this in a cleaner manner would be greatly appreciated!

Original Q&A

There are 3 answers

Tom On 12 December 2022 at 20:15

This is the typical case of world bank data. Here's the simplest way to deal with this:

# This is just for reproducting your example dataset
your_example = """Country Code  Year    GDP growth (%)  GDP (constant)
AFG 2010    3.5 ..
AFG 2011    ..  2345
AFG 2012    1.4 3372
ALB 2010    ..  4567
ALB 2011    ..  5678
ALB 2012    4.2 ..
DZA 2010    2.0 4321
DZA 2011    ..  5432
DZA 2012    3.8 6543"""
your_example = your_example.split("\n")
your_example = pd.DataFrame(
    [row.split("\t") for row in your_example[1:]], columns=your_example[0].split("\t")
)
# You just have to do this:
your_example = your_example.replace({"..": None})
your_example = your_example.dropna()
print("DF after dropping rows with ..", your_example)


>>>  Country Code  Year GDP growth (%) GDP (constant)
>>> 2          AFG  2012            1.4           3372
>>> 6          DZA  2010            2.0           4321
>>> 8          DZA  2012            3.8           6543

I'm just replacing the ".." by None since you are saying this ".." represents a NULL. Then I'm deleting it using dropna() method of pandas dataframe, which is what you wanted to achieve.

user19077881 On 12 December 2022 at 20:21

Following you original approach (you were almost there!) you could use:

df_GDP_1 = df_GDP_1[(df_GDP_1['GPD Growth (%)']+ != '..') & (df_GDP_1['GDP (constant)'] != '..')]

names with spaces have to go in [ ] instead of dot notation. Also you want to keep rows where both the columns do not have the .. marker so use & not |. Each condition needs to be in ( ) brackets.

**RomanPerekhrest** · Accepted Answer · 2022-12-12T20:19:01+00:00

With combination of pandas.DataFrame.eq and pandas.DataFrame.any functions.

.any(1) tells to find a match over the columns (axis=1)
the negation ~ tells to omit records with matches

In [269]: df[~df.eq("..").any(1)]
Out[269]: 
  Country Code  Year GDP growth (%) GDP (constant)
2          AFG  2012            1.4           3372
6          DZA  2010            2.0           4321
8          DZA  2012            3.8           6543

TechQA.

Delete rows where any column contains a certain string

There are 3 answers

Related Questions in PYTHON

Related Questions in DATA-CLEANING

Related Questions in EDA

Popular Questions

Popular Tags

Trending Questions