How to replace non-duplicated values in columns of csv files by stars("*")?

Question

58 views Asked by hiva At 28 December 2019 at 06:16

Hi everybody. I need to anonymize a raw table to produce an anonymized table. In other words, I need to replace the non-duplicated values with stars.

So far, I have run this code:

    for j in range(len(zz_new)):
        for i in range(len(zz)):
            if zz_new.iloc[j][0] != zz.iloc[i][0]:
                zz_new.iat[j,0]="*"

            if zz_new.iloc[j][1] != zz.iloc[i][1]:
                zz_new.iat[j,1]="*"

            if zz_new.iloc[j][2] != zz.iloc[i][2]:
                zz_new.iat[j,2]="*"

            if zz_new.iloc[j][3] != zz.iloc[i][3]:
                zz_new.iat[j,3]="*"

            if zz_new.iloc[j][4] != zz.iloc[i][4]:
                zz_new.iat[j,4]="*"

However, the result is not the table I want (see "My anonymized table"). Could you help me get the correct anonymized table?

There are 2 answers

Yacine Mahdid On 28 December 2019 at 06:28

What you need to do is iterate over each row and find out which rows are duplicates. There are many ways of doing this, but the brute-force algorithm looks like this:

1. Start an empty list, non_duplicate_id, to keep track of the rows that have no duplicate.
2. Iterate over each row and check whether any other row is exactly identical to the current one.
3. If an identical row exists, do nothing; if not, add the id of the current row to the non_duplicate_id list.
4. Iterate over the non_duplicate_id list and set the two fields of interest (age and education) to a star in each of those rows.
5. Save the new anonymized table.

However, this solution does a lot of redundant lookups at steps 2 and 3, so it might not scale well if your dataset is large.
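
The brute-force steps above could be sketched roughly as follows. This is only an illustration, not the answerer's own code: the helper name star_unique_rows is made up, the sample rows are borrowed from the table shown in the other answer, and the columns are assumed to hold strings.

```python
import pandas as pd


def star_unique_rows(df, columns):
    """Hypothetical helper: star `columns` in rows whose value combination occurs only once."""
    # Plain tuples make the row-by-row comparison straightforward.
    rows = list(df[columns].itertuples(index=False, name=None))

    non_duplicate_id = []  # step 1: positions of rows without a duplicate
    for i, row in enumerate(rows):  # step 2: compare each row with all others
        has_twin = any(i != j and row == other for j, other in enumerate(rows))
        if not has_twin:  # step 3: remember rows that have no identical twin
            non_duplicate_id.append(i)

    # step 4: star the fields of interest in the non-duplicated rows
    anonymized = df.copy()
    col_positions = [df.columns.get_loc(c) for c in columns]
    anonymized.iloc[non_duplicate_id, col_positions] = "*"
    return anonymized


# Sample data (assumed, for illustration only)
df = pd.DataFrame({
    "age": ["30-39", "40-49", "30-39", "30-39"],
    "education": ["HS-grad", "Bachelors", "HS-grad", "11th"],
})
anonymized = star_unique_rows(df, ["age", "education"])
print(anonymized)
```

With this sample, rows 1 and 3 have no identical twin, so both of their fields become stars, while rows 0 and 2 are left alone. The nested loop is quadratic in the number of rows; the same row mask can probably be obtained without it from df.duplicated(subset=columns, keep=False). Step 5 (saving) would be a call such as anonymized.to_csv(...) with a file name of your choice.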

**kantal** · Accepted Answer · 2019-12-28T13:28:56+00:00

Use the value_counts() method:

    # Original table
    df
         age  education
    0  30-39    HS-grad
    1  40-49  Bachelors
    2  30-39    HS-grad
    3  30-39       11th

    # True for education values that occur exactly once
    vcnt = df.education.value_counts().eq(1)

    HS-grad      False
    Bachelors     True
    11th          True
    Name: education, dtype: bool

    # Replace those single-occurrence values with a star
    df["education"] = df.education.replace(vcnt.loc[vcnt].index, "*")

         age education
    0  30-39   HS-grad
    1  40-49         *
    2  30-39   HS-grad
    3  30-39         *
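
If several columns need the same treatment, the idea can be wrapped in a small helper. The sketch below is an assumption-laden variation rather than the answer's exact code: star_unique_values is a hypothetical name, and it uses Series.map plus Series.mask instead of replace, which should give the same result for string columns.

```python
import pandas as pd


def star_unique_values(df, columns):
    """Hypothetical helper: per column, replace values that occur exactly once with "*"."""
    out = df.copy()
    for col in columns:
        # Look up, for every row, how often its value occurs in this column.
        counts = out[col].map(out[col].value_counts())
        # Star only the values whose count is exactly one.
        out[col] = out[col].mask(counts.eq(1), "*")
    return out


# Same sample table as above
df = pd.DataFrame({
    "age": ["30-39", "40-49", "30-39", "30-39"],
    "education": ["HS-grad", "Bachelors", "HS-grad", "11th"],
})
result = star_unique_values(df, ["age", "education"])
print(result)
```

Each column is handled independently here: 40-49 is starred in age, Bachelors and 11th are starred in education, and row 3 keeps its age because 30-39 also occurs in other rows.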
