Methods around dask.DataFrame all seem to make sure that the index column is sorted. However, by using from_delayed, it is possible to construct a dask dataframe whose index column is not sorted:
import pandas as pd
import dask.dataframe as dd
from dask import delayed
pdf1 = delayed(pd.DataFrame(dict(A=[1, 2, 3], B=[1, 1, 1])).set_index('A'))
pdf2 = delayed(pd.DataFrame(dict(A=[1, 2, 3], B=[1, 1, 1])).set_index('A'))
ddf = dd.from_delayed([pdf1, pdf2])  # dask.DataFrame with unordered index (1, 2, 3, 1, 2, 3)
The combination [index is set, index is not sorted, divisions are unknown] is something that I have never seen among dataframes that dask created itself. So my questions are:
- Is dask tested to work well with dataframes like this?
- Might it even be that calculations on such dataframes give wrong results silently, e.g. because they assume the index to be sorted or are performed on an incomplete subset of data?
- Or more generally: if the index column is not sorted, does it only slow down access by index, or does it break functionality?
Many dask.dataframe operations will refuse to operate or will operate with slower algorithms on dataframes without known divisions. See http://dask.pydata.org/en/latest/dataframe-design.html#partitions
For example, df.loc is fast if dask.dataframe knows that the index is sorted and it knows the min/max of each partition. However, if this information is not known, then df.loc has to look through all of the partitions exhaustively.
Generally speaking, dask.dataframe is aware of the possibility that you bring up and should act accordingly. Some operations will be slower. Some operations will refuse to operate.
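To make the difference between the two lookup paths concrete, here is a minimal pure-Python sketch. It is not dask's implementation: the partitions are plain dicts, and the names loc_with_divisions and loc_without_divisions are hypothetical helpers invented for this illustration. It only shows why known divisions let a lookup touch one partition, while unknown divisions force a scan of all of them. The result is still correct in both cases.

```python
from bisect import bisect_right

# Hypothetical stand-in for a partitioned dataframe: each partition maps index -> value.
sorted_parts = [{1: "a", 2: "b", 3: "c"}, {4: "d", 5: "e", 6: "f"}]
# Mirrors the question: both partitions carry the index 1, 2, 3, so no divisions exist.
unsorted_parts = [{1: "x", 2: "y", 3: "z"}, {1: "p", 2: "q", 3: "r"}]


def loc_with_divisions(partitions, divisions, key):
    """Divisions known: pick the single partition whose range contains `key`.

    `divisions` holds the minimum index of each partition plus the maximum
    index of the last one, e.g. (1, 4, 6) for `sorted_parts`.
    Returns (matching values, number of partitions searched).
    """
    if not divisions[0] <= key <= divisions[-1]:
        raise KeyError(key)
    i = min(bisect_right(divisions, key) - 1, len(partitions) - 1)
    matches = [partitions[i][key]] if key in partitions[i] else []
    return matches, 1


def loc_without_divisions(partitions, key):
    """Divisions unknown: every partition has to be searched."""
    matches = [p[key] for p in partitions if key in p]
    return matches, len(partitions)


print(loc_with_divisions(sorted_parts, (1, 4, 6), 5))  # (['e'], 1)
print(loc_without_divisions(sorted_parts, 5))          # (['e'], 2)
print(loc_without_divisions(unsorted_parts, 2))        # (['y', 'q'], 2)
```

In dask itself, the ddf.known_divisions attribute reports whether divisions are available for a given dataframe, which is a quick way to check which situation you are in.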