Loading larger-than-memory data into bcolz from Redshift


I would like to save the output of a Redshift query locally. I have tried using Blaze/odo, but with default settings it tries to load all the data into memory before writing, and trying to stream the data instead throws other errors, described in another question: Streaming results with Blaze and SqlAlchemy
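Roughly, the kind of odo call I tried looks like this (the connection string and target path are placeholders, not my real ones):

from odo import odo

# with default settings this materializes the whole query result in memory
# before writing it out as an on-disk bcolz ctable
odo('postgresql://user@host:5439/dbname::table_name',
    'query_result.bcolz')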

Since that question does not have any answers, I am looking for an alternative way to do this.

What is a simple, reliable and reasonably efficient way of loading a larger-than-memory query result from Redshift into bcolz?


1 Answer

Answered by tihovin

I see this is an old question, but I'll answer it to help anyone with the same problem who stumbles upon it. I assume you use Blaze to access Redshift. You'll have to partition the table on some column and process it iteratively, like this:

import os
import blaze as bz
import bcolz

table = bz.Data('postgresql://user@host/dbname::table_name')  # placeholder connection URI
val_list = bz.odo(table['column_name'].distinct(), list)      # distinct partition keys
np_dtype = table.schema[0].to_numpy_dtype()                   # row dtype for bcolz
compression_params = bcolz.cparams(clevel=5)                  # tune compression as needed

for val in val_list:
    table_partition = table[table['column_name'] == val]
    partition_len = int(table_partition.count())
    # write each partition to its own rootdir so mode='w' does not overwrite earlier ones
    bcolz.fromiter(table_partition, np_dtype, partition_len,
                   rootdir=os.path.join('rootdir_for_partitions', str(val)),
                   mode='w', cparams=compression_params)

and then use bcolz.walk to iterate over the resulting partitions.
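For completeness, a minimal sketch of the read-back step, assuming the partitions were written under a common parent directory named 'rootdir_for_partitions' as above (process_row is a hypothetical stand-in for whatever you do with each row):

import bcolz

# bcolz.walk yields every on-disk carray/ctable found under the directory tree
for partition in bcolz.walk('rootdir_for_partitions'):
    for row in partition:      # rows are read lazily, block by block
        process_row(row)       # hypothetical per-row handler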