I have the following code that reads from CSV files and writes to PyTables. However, pd.read_csv produces a DataFrame, which PyTables does not handle. How do I solve this problem? I could convert each chunk to a NumPy array, but that seems like overkill and possibly time-consuming. (Transaction_Record is a class I created with the right data types; I would have to replicate it if using NumPy.)
import glob
import os

import pandas as pd
import tables

def get_transaction_report_in_chunks(transaction_file):
    # Read the CSV lazily; each 500,000-row chunk comes back as a DataFrame.
    transaction_report_data = pd.read_csv(transaction_file, index_col=None, parse_dates=False, chunksize=500000)
    return transaction_report_data

def write_to_hdf_from_multiple_csv(transaction_file_path):
    hdf = tables.open_file(filename='MyDB.h5', mode='a')
    transaction_report_table = hdf.create_table(hdf.root, 'Transaction_Report_Table_x', Transaction_Record, "Transaction Report Table")
    all_files = glob.glob(os.path.join(transaction_file_path, "*.csv"))
    for transaction_file in all_files:
        for transaction_chunk in get_transaction_report_in_chunks(transaction_file):
            # This is where it breaks: Table.append expects a sequence of rows
            # (e.g. a structured array), not a pandas DataFrame.
            transaction_report_table.append(transaction_chunk)
            transaction_report_table.flush()
    hdf.close()
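For reference, the NumPy route mentioned above would look roughly like this (a sketch, assuming each chunk's column order and dtypes line up with Transaction_Record):

for transaction_file in all_files:
    for transaction_chunk in get_transaction_report_in_chunks(transaction_file):
        # to_records(index=False) converts the DataFrame chunk into a
        # structured array, which Table.append does understand.
        transaction_report_table.append(transaction_chunk.to_records(index=False))
        transaction_report_table.flush()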
I would use pandas' HDFStore, which is a very convenient API for PyTables under the hood:
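Here is a minimal sketch of the same pipeline using HDFStore (the file name 'MyDB.h5' and chunk size follow the question; the key name 'transaction_report' is just an example, and data_columns=True is optional, only needed if you want to query on those columns later):

import glob
import os

import pandas as pd

def write_to_hdf_from_multiple_csv(transaction_file_path):
    # HDFStore accepts DataFrames directly, so the chunks from read_csv
    # can be appended as-is; no Transaction_Record class or NumPy
    # conversion is required.
    with pd.HDFStore('MyDB.h5', mode='a') as store:
        all_files = glob.glob(os.path.join(transaction_file_path, "*.csv"))
        for transaction_file in all_files:
            for chunk in pd.read_csv(transaction_file, index_col=None,
                                     parse_dates=False, chunksize=500000):
                # append() writes in table format, which supports
                # incremental writes; dtypes are inferred from the
                # first chunk.
                store.append('transaction_report', chunk, data_columns=True)

Reading it back later is then just pd.read_hdf('MyDB.h5', 'transaction_report'), or store.select('transaction_report') if you want to filter on the data columns.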