I'm currently working with pandas-profiling and have a problem creating a proper report. When I just read the CSV file, some columns get the wrong data type: values that should be categorical are treated as numeric. When I try to specify the data types in read_csv, report generation gets stuck at a certain point and takes forever (I canceled it after 30 minutes). When I don't change the data types, the report finishes in less than a minute.
Here is the output of df_data.isnull().sum():
A 0
B 0
C 3
D 0
E 0
F 0
G 86317
H 39
I 6871
J 0
I tried to cast the datatypes within the read_csv:
df_data = pd.read_csv('example.csv', parse_dates=['A', 'B'], dtype={
'C' : 'string',
'D' : 'string',
'E' : 'string'
}
)
I've also tried casting the data types with astype() after a plain read_csv:
df_data = pd.read_csv('example.csv')
df_data['C'] = df_data['C'].astype(str)
df_data['D'] = df_data['D'].astype(str)
df_data['E'] = df_data['E'].astype(str)
Both approaches had the same result: the report gets stuck halfway through.
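For reference, here is a minimal self-contained sketch of the kind of cast described above, using a small made-up CSV instead of example.csv. The sample data and the helper name load_with_categories are hypothetical, and it casts to pandas' 'category' dtype rather than to strings, which is an assumption about what "categorical" should mean here. It also prints the number of distinct values per column, since a column with very many distinct values may be one reason a categorical summary is slow.

```python
import io

import pandas as pd

# Hypothetical stand-in for example.csv; the real file is not shown here.
CSV_TEXT = """A,B,C,D,E
2021-01-01,2021-01-02,1,10,100
2021-01-03,2021-01-04,2,10,200
2021-01-05,2021-01-06,,20,100
"""


def load_with_categories(csv_source):
    """Read the CSV, parse A/B as dates and cast C/D/E to 'category'."""
    df = pd.read_csv(csv_source, parse_dates=["A", "B"])
    for col in ["C", "D", "E"]:
        # Cast each column from itself; missing values stay missing.
        df[col] = df[col].astype("category")
    return df


df_data = load_with_categories(io.StringIO(CSV_TEXT))

# Resulting dtypes after the cast.
print(df_data.dtypes)

# Distinct values per column (missing values are not counted).
print(df_data.nunique())
```

Unlike astype(str), casting to 'category' keeps missing values as missing instead of turning them into the literal string 'nan', so the isnull().sum() counts stay the same.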
I also tried specifying the types via type_schema, like this:
df_data = pd.read_csv('example.csv')
type_schema = {
    'A' : 'datetime',
    'B' : 'categorical',
    'C' : 'categorical',
    'D' : 'categorical',
    'E' : 'categorical',
    'F' : 'categorical',
    'G' : 'categorical',
    'H' : 'categorical',
    'I' : 'categorical'
}
profile = ProfileReport(df_data, type_schema=type_schema)