When I run Pandas df.drop_duplicates() in python the duplicated lines don't disappear. It behaves just as if the three 3542.87 were different values (and indeed they are, read the end of my post).
a slice of my data:
print(df.iloc[234:237,:])
Date Libellé Montant solde
291 04/11/2009 CHEQUE N°9667007 -65.00 5030.93
2 05/11/2009 DEBIT CARTE BANCAIRE DIFFERE -1488.06 3542.87
131 05/11/2009 DEBIT CARTE BANCAIRE DIFFERE -1488.06 3542.87
290 05/11/2009 DEBIT CARTE BANCAIRE DIFFERE -1488.06 3542.87
1 05/11/2009 VIREMENT SUR PEA 3E350018328051 VOTRE OPERATI...-25.00 3517.87
My dataframe has unusual dtypes:
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 9404 entries, 111 to 8814
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 9404 non-null object
1 Libellé 9404 non-null object
2 Montant 9404 non-null float64
3 solde 9404 non-null float64
dtypes: float64(2), object(2)
memory usage: 367.3+ KB
As "Date" are datetime.date objects and "Libellé" are strings.
After if I try df.groupby(['Date', 'Libellé', 'Montant'])['solde'].nunique()
On aggregated row (05/11/2009, DEBIT CARTE BANCAIRE DIFFERE, -1488.06)
it returns 3 instead of 1.
Do you have any idea on how this is possible and how to clean my solde values.
Thanks.
I tried to force float format using .astype(float)
And also:
all4.iloc[233:238,:].groupby(['Date', 'Libellé', 'Montant'])['solde'].apply(list)
Out:
Date Libellé Montant
2009-11-04 CHEQUE N°9667007 -65.00 [5030.93]
2009-11-05 DEBIT CARTE BANCAIRE DIFFERE -1488.06 [3542.87, 3542.8700000000003, 3542.870000000001]
VIREMENT SUR PEA 3E350018328051 VOTRE OPERATION DU 04/11/2009 -25.00 [3517.87]
Name: solde, dtype: object
Now the question is how to clean these data and avoid this mistake in the future.
Solved my issue, however the data were extracted from .csv, .xls and .xlsx files maybe it could explain the bad import :
[3542.87, 3542.8700000000003, 3542.870000000001]
In hope this would help somebody facing this tricky problem.After checking it doesn't come from import but from calculation inside my python code. I further investigate and let you know.
The strange thing is that editor doesn't show more than 2 decimals vs code variables editor
Poking around in the code, I came across this:
In:
Out:
Typical floating point numbers issue ! Fixed using round()