Is there a way to take a very large amount of data on disk (a few hundred GB) and interact with it on disk as a pandas DataFrame?
Here's what I've done so far:
1. Described the data using PyTables, following this example: http://www.pytables.org/usersguide/introduction.html
2. Ran a test by loading a portion of the data (a few GB) into an HDF5 file
3. Converted the data into a DataFrame using pd.DataFrame.from_records()
This last step loads all of the data into memory (a rough sketch of what I'm doing is below).
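Roughly what that looks like; the file name and node path are placeholders:

```python
import tables
import pandas as pd

# Open the HDF5 file that was written with PyTables
h5 = tables.open_file("data.h5", mode="r")
table = h5.root.mygroup.mytable  # placeholder node path

# table.read() pulls the entire table into a NumPy record array,
# so the whole dataset ends up in memory before the DataFrame is built
df = pd.DataFrame.from_records(table.read())

h5.close()
```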
I've looked for a way to describe the data as a pandas DataFrame in step 1, but haven't found a good set of instructions for doing that. Is what I want to do feasible?
blaze is a nice way to interact with out-of-core data using lazy expression evaluation. It uses pandas and PyTables under the hood (as well as a host of conversions via odo).
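A minimal sketch of what that can look like; the file path and column names are made up, and the exact spelling of the entry point (blaze.data vs. blaze.Data) depends on the blaze version you have installed:

```python
import blaze as bz
import pandas as pd
from odo import odo

# Point blaze at a table inside an HDF5 file; nothing is read yet
# ("data.h5::/mygroup/mytable" is a placeholder URI)
d = bz.data('data.h5::/mygroup/mytable')

# Build a lazy expression: filter rows, then project two columns.
# Still no data has been loaded into memory.
expr = d[d.amount > 100][['id', 'amount']]

# Materialize only the (hopefully small) result as a pandas DataFrame
result = odo(expr, pd.DataFrame)
```

The point is that the filtering and projection are pushed down to the on-disk store, so only the reduced result is ever converted into an in-memory DataFrame.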