Processing data on disk with a Pandas DataFrame


Is there a way to take a very large amount of data on disk (a few hundred GB) and interact with it on disk as a pandas DataFrame?

Here's what I've done so far:

  1. Described the data using PyTables, following this example: http://www.pytables.org/usersguide/introduction.html

  2. Ran a test by loading a portion of the data (a few GB) into an HDF5 file

  3. Converted the data into a DataFrame using pd.DataFrame.from_records()

This last step loads all of the data into memory.
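For concreteness, here is a minimal sketch of steps 1-3; the record layout, file name, and node name are made up for illustration, and the real data would have its own columns:

```python
import pandas as pd
import tables

# Hypothetical record layout -- substitute the real columns.
class Record(tables.IsDescription):
    timestamp = tables.Int64Col()
    value = tables.Float64Col()

# Steps 1 and 2: describe the data and load a portion into an HDF5 file.
with tables.open_file("data.h5", mode="w") as h5:
    table = h5.create_table("/", "records", Record)
    row = table.row
    for i in range(1_000_000):  # stand-in for the real ingest loop
        row["timestamp"] = i
        row["value"] = i * 0.5
        row.append()
    table.flush()

# Step 3: table.read() materializes the whole table as a NumPy
# structured array, so from_records() pulls everything into memory --
# which is exactly the problem described above.
with tables.open_file("data.h5", mode="r") as h5:
    df = pd.DataFrame.from_records(h5.root.records.read())
```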

I've looked for some way to describe the data as a pandas DataFrame in step 1, but haven't been able to find a good set of instructions for doing that. Is what I want to do feasible?

1 Answer

Answered by Jeff:

blaze is a nice way to interact with out-of-core data using lazy expression evaluation. It uses pandas and PyTables under the hood (along with a host of format conversions via odo).
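As a rough sketch of what that can look like, assuming a recent blaze and the hypothetical data.h5::/records file from the question's example above (blaze's data() accepts file.h5::/node URIs, and odo can materialize a small result as an in-memory DataFrame):

```python
import blaze as bz
import pandas as pd
from odo import odo

# Point blaze at the on-disk PyTables node; nothing is loaded yet.
d = bz.data("data.h5::/records")

# Expressions are lazy -- building them reads nothing from disk.
expr = d[d.value > 100.0].value.mean()

# compute() evaluates the expression, with pandas/PyTables doing
# the actual work against the on-disk data.
print(bz.compute(expr))

# Convert a small slice of the result into an in-memory DataFrame.
df = odo(d[d.value > 100.0].head(10), pd.DataFrame)
```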