Can a dask dataframe with an unordered index cause silent errors?


Methods around dask.DataFrame all seem to make sure that the index column is sorted. However, by using from_delayed, it is possible to construct a dask dataframe that has an index column which is not sorted:

import pandas as pd
import dask.dataframe as dd
from dask import delayed
pdf1 = delayed(pd.DataFrame(dict(A=[1,2,3], B=[1,1,1])).set_index('A'))
pdf2 = delayed(pd.DataFrame(dict(A=[1,2,3], B=[1,1,1])).set_index('A'))
ddf = dd.from_delayed([pdf1, pdf2])  # dask.DataFrame with unordered index

The combination [index is set, index is not sorted, divisions are unknown] is something that I have never seen among dataframes that dask created itself. So my questions are:

  • Is dask tested to work well with dataframes like this?
  • Might it even be that calculations on such dataframes give wrong results silently, e.g. because they assume the index to be sorted or are performed on an incomplete subset of data?
  • Or, more generally: if the index column is not sorted, does it only slow down access by index, or does it break functionality?

Answer by MRocklin (best answer)

Many dask.dataframe operations will refuse to operate or will operate with slower algorithms on dataframes without known divisions. See http://dask.pydata.org/en/latest/dataframe-design.html#partitions

For example, df.loc is fast if dask.dataframe knows that the index is sorted and knows the min/max of each partition. However, if this information is not known, df.loc has to look through all of the partitions exhaustively.
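The difference can be pictured with a small toy model in plain Python. This is only a sketch of the idea, not dask's actual implementation: the function name `partitions_to_search` is hypothetical, and it assumes divisions are a sorted tuple of `npartitions + 1` boundaries where the last boundary is inclusive.

```python
from bisect import bisect_right


def partitions_to_search(label, divisions, npartitions):
    """Return the indices of partitions that might contain `label` (toy model)."""
    # Unknown divisions: nothing can be ruled out, so scan every partition.
    if divisions is None or divisions[0] is None:
        return list(range(npartitions))
    # Label lies outside the overall min/max: no partition can contain it.
    if label < divisions[0] or label > divisions[-1]:
        return []
    # Known, sorted divisions: a binary search picks a single partition.
    i = bisect_right(divisions, label) - 1
    # The last boundary is inclusive, so clamp to the final partition.
    return [min(i, npartitions - 1)]


print(partitions_to_search(5, (0, 10, 20), 2))
print(partitions_to_search(5, (None, None, None), 2))
```

With known divisions only one partition has to be read, whereas with unknown divisions every partition is a candidate.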

Generally speaking, dask.dataframe is aware of the possibility that you bring up and should act accordingly: some operations will be slower, and some will refuse to operate.