Use pandas to handle a massive CSV file


When reading a bulk CSV file, I have no problem if the file has around 5 million rows, but the code below doesn't work for me on a massive file of around 300 million rows. Is there any way to enhance the code, or a chunking approach, that would improve the response time?

import pandas as pd
import timeit

# Reads the entire file into memory in one go.
df = pd.read_csv(
    '/home/mahmoudod/Desktop/to_dict/text1.txt',
    dtype='unicode',
    index_col=False,
    error_bad_lines=False,
    sep=';',
    low_memory=False,
    names=['DATE', 'IMSI', 'WEBSITE', 'LINKUP',
           'LINKDOWN', 'COUNT', 'CONNECTION'],
)
#df.DATE = pd.to_datetime(df.DATE)
group = df.groupby(['IMSI', 'WEBSITE']).agg({
    'DATE': [min, max],
    'LINKUP': 'sum',
    'LINKDOWN': 'sum',
    'COUNT': 'max',
    'CONNECTION': 'sum',
})
group.to_csv('/home/mahmoudod/Desktop/to_dict/output.txt')

1 Answer

Accepted answer by jpp:

One solution is offered by dask.dataframe, which chunks internally:

import dask.dataframe as dd

df = dd.read_csv(...)
group = df.groupby(...).aggregate({...}).compute()
group.to_csv('output.txt')

This isn't tested; I suggest you read the documentation to familiarize yourself with the syntax. The important point to understand is that dd.read_csv does not read the whole file into memory, and no operations are executed until compute is called, at which point Dask processes the data in constant memory via chunking.
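As an illustration only, a minimal, untested sketch using the columns from the question might look like the following. The dtype mapping (treating LINKUP, LINKDOWN, COUNT and CONNECTION as numeric) and the blocksize value are assumptions I am adding here, not something stated in the original post:

import dask.dataframe as dd

# Assumption: the four traffic/count columns are numeric; adjust dtypes
# (or drop the dtype argument) if they are not.
df = dd.read_csv(
    '/home/mahmoudod/Desktop/to_dict/text1.txt',
    sep=';',
    names=['DATE', 'IMSI', 'WEBSITE', 'LINKUP',
           'LINKDOWN', 'COUNT', 'CONNECTION'],
    dtype={'DATE': 'object', 'IMSI': 'object', 'WEBSITE': 'object',
           'LINKUP': 'float64', 'LINKDOWN': 'float64',
           'COUNT': 'float64', 'CONNECTION': 'float64'},
    blocksize='64MB',  # assumed partition size read from disk at a time
)

# Lazy: nothing is read yet, only a task graph is built.
group = df.groupby(['IMSI', 'WEBSITE']).agg({
    'DATE': ['min', 'max'],
    'LINKUP': 'sum',
    'LINKDOWN': 'sum',
    'COUNT': 'max',
    'CONNECTION': 'sum',
})

# compute() triggers the chunked read and aggregation and returns an
# ordinary pandas DataFrame, which can be written out as usual.
group.compute().to_csv('/home/mahmoudod/Desktop/to_dict/output.txt')

Note that blocksize only controls how much of the file is read per partition; how much memory the groupby itself needs still depends on the number of distinct (IMSI, WEBSITE) pairs in the data.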