Say we have a large CSV file (e.g. 200 GB) where only a small fraction of rows (e.g. 0.1% or less) contain data of interest.
Say we define that condition as one specific column containing a value from a predefined list (e.g. 10K values of interest).
Does `odo` or pandas provide methods for this type of selective loading of rows into a DataFrame?
I don't know of anything in `odo` or `pandas` that does exactly what you're looking for, in the sense that you just call a function and everything else is done under the hood. However, you can write a short `pandas` script that gets the job done.

The basic idea is to iterate over chunks of the CSV file that will fit into memory, keeping only the rows of interest, and then combining all the rows of interest at the end.
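
The chunked approach can be sketched roughly as follows. This is a minimal sketch, not a tuned implementation: the function name `load_matching_rows`, its parameters, and the default chunk size are placeholders of mine, and it assumes the file has a header row and that the values of interest fit comfortably in memory.

```python
import pandas as pd


def load_matching_rows(csv_path, column, wanted_values,
                       chunksize=1_000_000, **read_csv_kwargs):
    """Return only the rows whose `column` value is in `wanted_values`.

    All names and defaults here are illustrative assumptions.
    """
    # Note: the values must have the same type as the parsed column
    # (e.g. ints vs. strings), otherwise .isin will match nothing.
    wanted = list(wanted_values)

    matches = []
    # With chunksize set, read_csv returns an iterator of DataFrames,
    # so only one chunk is held in memory at a time.
    for chunk in pd.read_csv(csv_path, chunksize=chunksize, **read_csv_kwargs):
        # Keep only the rows of interest from this chunk.
        matches.append(chunk[chunk[column].isin(wanted)])

    if not matches:
        # No chunks were produced at all; return an empty frame.
        return pd.DataFrame()

    # Combine the kept rows from all chunks into one DataFrame.
    return pd.concat(matches, ignore_index=True)
```

You would call it as something like `load_matching_rows("big.csv", "id", ids_of_interest)`, where the file name, column name, and list are hypothetical. Pick a `chunksize` that fits your memory budget; extra keyword arguments such as `usecols` or `dtype` are passed straight through to `pd.read_csv`.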
Add/alter parameters for `pd.read_csv` and `pd.concat` as necessary for your specific situation.

If performance is an issue, you may be able to speed things up by using an alternative to `.isin`, as described in this answer.