Use pandas to handle a massive CSV file


When reading a bulk CSV file, I have no problem if the file has around 5 million rows, but the code below doesn't work for me on a massive file of around 300 million rows. Is there any way to enhance the code, or a chunking approach, that would improve the response time?

import pandas as pd
import timeit

# Reads the entire file into memory in one go.
df = pd.read_csv(
    '/home/mahmoudod/Desktop/to_dict/text1.txt',
    dtype='unicode',
    index_col=False,
    error_bad_lines=False,
    sep=';',
    low_memory=False,
    names=['DATE', 'IMSI', 'WEBSITE', 'LINKUP',
           'LINKDOWN', 'COUNT', 'CONNECTION'],
)
#df.DATE = pd.to_datetime(df.DATE)
group = df.groupby(['IMSI', 'WEBSITE']).agg({
    'DATE': [min, max],
    'LINKUP': 'sum',
    'LINKDOWN': 'sum',
    'COUNT': 'max',
    'CONNECTION': 'sum',
})
group.to_csv('/home/mahmoudod/Desktop/to_dict/output.txt')

1 Answer

Accepted answer by jpp:

One solution is offered by dask.dataframe, which chunks internally:

import dask.dataframe as dd

df = dd.read_csv(...)
group = df.groupby(...).aggregate({...}).compute()
group.to_csv('output.txt')

This isn't tested; I suggest you read the documentation to familiarize yourself with the syntax. The important point to understand is that dd.read_csv does not read the whole file into memory, and no operations are executed until compute is called, at which point Dask processes the data in constant memory via chunking.
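As an illustration only, a minimal, untested sketch using the columns from the question might look like the following. The dtype mapping (treating LINKUP, LINKDOWN, COUNT and CONNECTION as numeric) and the blocksize value are assumptions I am adding here, not something stated in the original post:

import dask.dataframe as dd

# Assumption: the four traffic/count columns are numeric; adjust dtypes
# (or drop the dtype argument) if they are not.
df = dd.read_csv(
    '/home/mahmoudod/Desktop/to_dict/text1.txt',
    sep=';',
    names=['DATE', 'IMSI', 'WEBSITE', 'LINKUP',
           'LINKDOWN', 'COUNT', 'CONNECTION'],
    dtype={'DATE': 'object', 'IMSI': 'object', 'WEBSITE': 'object',
           'LINKUP': 'float64', 'LINKDOWN': 'float64',
           'COUNT': 'float64', 'CONNECTION': 'float64'},
    blocksize='64MB',  # assumed partition size read from disk at a time
)

# Lazy: nothing is read yet, only a task graph is built.
group = df.groupby(['IMSI', 'WEBSITE']).agg({
    'DATE': ['min', 'max'],
    'LINKUP': 'sum',
    'LINKDOWN': 'sum',
    'COUNT': 'max',
    'CONNECTION': 'sum',
})

# compute() triggers the chunked read and aggregation and returns an
# ordinary pandas DataFrame, which can be written out as usual.
group.compute().to_csv('/home/mahmoudod/Desktop/to_dict/output.txt')

Note that blocksize only controls how much of the file is read per partition; how much memory the groupby itself needs still depends on the number of distinct (IMSI, WEBSITE) pairs in the data.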