Pandas / odo / bcolz selective loading of rows from a large CSV file


Say we have a large CSV file (e.g. 200 GB) where only a small fraction of rows (e.g. 0.1% or less) contain data of interest.

Say we define such a condition as one specific column containing a value from a pre-defined list (e.g. 10K values of interest).

Does either odo or pandas provide methods for this type of selective loading of rows into a dataframe?

1 Answer

Answer by root (accepted):

I don't know of anything in odo or pandas that does exactly what you're looking for, in the sense that you just call a function and everything else is done under the hood. However, you can write a short pandas script that gets the job done.

The basic idea is to iterate over chunks of the csv file that will fit into memory, keeping only the rows of interest, and then combining all the rows of interest at the end.

import pandas as pd

# Values of interest; in practice this could be your ~10K-entry list.
pre_defined_list = ['foo', 'bar', 'baz']

good_data = []
# Read the file in chunks of one million rows so memory use stays bounded.
for chunk in pd.read_csv('large_file.csv', chunksize=10**6):
    # Keep only the rows whose value in the relevant column is in the list.
    chunk = chunk[chunk['column_to_check'].isin(pre_defined_list)]
    good_data.append(chunk)

# Stitch the surviving rows together into a single dataframe.
df = pd.concat(good_data)

Add or alter parameters for pd.read_csv and pd.concat as necessary for your specific situation; a sketch of some commonly useful options is below.
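For instance, assuming the same placeholder file and column names as above (plus a hypothetical second column named 'value'), usecols restricts parsing to the columns you actually need, dtype skips type inference on the filter column, and ignore_index=True in pd.concat renumbers the combined rows:

import pandas as pd

pre_defined_list = ['foo', 'bar', 'baz']

good_data = []
for chunk in pd.read_csv(
    'large_file.csv',
    chunksize=10**6,
    usecols=['column_to_check', 'value'],  # parse only the columns needed
    dtype={'column_to_check': str},        # skip type inference on the filter column
):
    good_data.append(chunk[chunk['column_to_check'].isin(pre_defined_list)])

# ignore_index=True renumbers the result 0..n-1 instead of keeping
# each chunk's original row labels.
df = pd.concat(good_data, ignore_index=True)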

If performance is an issue, you may be able to speed things up by using an alternative to .isin, as described in this answer.
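The linked answer isn't reproduced here, but as one illustration, here is a sketch that swaps Series.isin for numpy's np.isin. It builds the same boolean mask; whether it is actually faster depends on your data and library versions, so profile both variants:

import numpy as np
import pandas as pd

pre_defined_list = ['foo', 'bar', 'baz']
values = np.array(pre_defined_list)

good_data = []
for chunk in pd.read_csv('large_file.csv', chunksize=10**6):
    # np.isin produces the same True/False mask as Series.isin,
    # operating on the underlying numpy array directly.
    mask = np.isin(chunk['column_to_check'].to_numpy(), values)
    good_data.append(chunk[mask])

df = pd.concat(good_data, ignore_index=True)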