When selecting from an HDF5 file in chunks, I would like to know how many chunks there are in the resulting selection.
The number of rows in the input data, nrows, can be up to 100 million and chunksize is 100,000, but for most selections the number of rows in a chunk, nrows_chunk, is smaller, so for different where conditions I can get selections with one or many chunks. Before doing any operations on the chunks, at the time of calling iteratorGenerator(), I would like to know how many chunks there will be. Intuitively, I want something like len(list(enumerate(iteratorGenerator()))), but this gives length 1 (I suppose because only one chunk at a time is considered by iteratorGenerator()).
I suspected there is no solution to this issue, as the whole idea of using a generator is not to perform the entire selection at once but to do it chunk by chunk. However, when I actually run the for loop below, the very first iteration takes really long, while the following iterations take only a few seconds, which suggests that most of the information about the chunks is collected during the first iteration. This is puzzling to me, and I would appreciate any explanation of how selection by chunks works.
iteratorGenerator = lambda: inputStore.select(
    groupInInputStore,
    where=where,
    columns=columns,
    iterator=True,
    chunksize=args.chunksize
)

nrows = inputStore.get_storer(groupInInputStore).nrows

# if there is more than one chunk in the selection:
for i, chunk in enumerate(iteratorGenerator()):
    # check the size of a chunk
    nrows_chunk = len(chunk)
    # do stuff with chunks, mainly groupby operations
# if there is only one chunk, do other stuff
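For reference, the closest thing to knowing the number of chunks up front that I can think of is the sketch below. It assumes that select_as_coordinates returns the coordinates of the rows matched by where and that chunksize counts rows of the resulting selection; I am not certain either assumption holds, which is part of why I am asking.

import math

# sketch only: assumes select_as_coordinates returns the coordinates of the
# rows matched by `where`, and that chunksize counts rows of the selection
# rather than rows read from the store (both are assumptions on my part)
coords = inputStore.select_as_coordinates(groupInInputStore, where=where)
nrows_selected = len(coords)
nchunks = math.ceil(nrows_selected / args.chunksize)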
Moreover, I am not sure what chunksize in HDFStore.select refers to. From my experience, it is the maximal size of a selected chunk after the where condition has been applied. On the other hand, http://pandas.pydata.org/pandas-docs/stable/generated/pandas.HDFStore.select.html defines chunksize as "nrows to include in iteration", which to me sounds like the number of rows read from the store per iteration, before the where condition is applied. Which is correct?
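To make the two readings concrete, here is a toy sketch (file name, group name and condition are made up) that I would use to check empirically: if chunksize counts stored rows read per iteration, the printed sizes should be roughly 50 spread over about 10 chunks; if it counts rows of the selection, they should be 100 over about 5 chunks.

import numpy as np
import pandas as pd

# toy store: 1000 rows, column 'b' is standard normal, so 'b > 0' keeps
# roughly half of the rows, scattered across the whole table
df = pd.DataFrame({"a": np.arange(1000), "b": np.random.randn(1000)})
with pd.HDFStore("toy.h5", mode="w") as store:
    store.append("g", df, data_columns=True)
    sizes = [len(chunk) for chunk in
             store.select("g", where="b > 0", iterator=True, chunksize=100)]
    print(sizes)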