I am reading a bulk CSV file with pandas. The code below works fine when the file has around 5 million rows, but it fails on a massive file of around 300 million rows. Is there any way to enhance the code, or to process the file in chunks, so that it can handle a file of that size in a reasonable time?
import pandas as pd

# Read the whole file into memory in one go.
df = pd.read_csv(
    '/home/mahmoudod/Desktop/to_dict/text1.txt',
    sep=';',
    dtype='unicode',
    index_col=False,
    error_bad_lines=False,
    low_memory=False,
    names=['DATE', 'IMSI', 'WEBSITE', 'LINKUP',
           'LINKDOWN', 'COUNT', 'CONNECTION'],
)

#df.DATE = pd.to_datetime(df.DATE)

# Aggregate per (IMSI, WEBSITE) pair.
group = df.groupby(['IMSI', 'WEBSITE']).agg(
    {'DATE': ['min', 'max'],
     'LINKUP': 'sum',
     'LINKDOWN': 'sum',
     'COUNT': 'max',
     'CONNECTION': 'sum'}
)

group.to_csv('/home/mahmoudod/Desktop/to_dict/output.txt')
One solution is offered by dask.dataframe, which chunks internally; a sketch follows below. This isn't tested, so I suggest you read the documentation to familiarize yourself with the syntax. The important point to understand is that dd.read_csv does not read the whole file into memory, and no operations are processed until compute is called, at which point dask processes the data in constant memory via chunking.