I'm evaluating the possibility of using Arrow-based data types in our data flows.
Our flows are based on pandas, and using dtype_backend='pyarrow' seems to work pretty well (basically this option makes constructors and I/O readers return Arrow-backed types).
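For context, this is roughly how we opt in (a minimal sketch, assuming pandas >= 2.0; the frame contents are just placeholders):

import pandas as pd

# existing frames can be converted; I/O readers such as pd.read_csv
# accept the same dtype_backend='pyarrow' option
df = pd.DataFrame({'x': [1, 2, None], 'y': ['a', 'b', 'c']})
df = df.convert_dtypes(dtype_backend='pyarrow')
print(df.dtypes)  # x: int64[pyarrow], y: Arrow-backed string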
I'm running into some problems when using pyarrow dictionary types (pyarrow.DictionaryArray) in place of pd.Categorical.
Here is an example:
import pyarrow as pa
import pandas as pd

def pyarrow_cat_dtype(vals):
    # dictionary-encode the values and wrap the resulting
    # pyarrow DictionaryType in a pandas ArrowDtype
    as_dict_vals = pa.array(vals).dictionary_encode()
    return pd.ArrowDtype(as_dict_vals.type)

vals = ['A', 'B', 'C']
ser = pd.Series(vals * 2, dtype=pyarrow_cat_dtype(vals))
The series is of type:
dictionary<values=string, indices=int32, ordered=0>[pyarrow]
During the ETL process the series is then modified, and it is not clear to me how to reflect those changes in the dictionary values (the categories).
Suppose we need to drop most of the values, or merge categoricals coming from multiple sources.
With a Categorical this can be done by accessing the .categories attribute (or the .cat accessor methods),
or by passing observed=True to a groupby as described in the documentation,
but I cannot find anything similar to manage a DictionaryArray.
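To make the comparison concrete, this is roughly the Categorical workflow I have in mind (a sketch using the real helpers .cat.remove_unused_categories and pd.api.types.union_categoricals; the variable names are mine):

import pandas as pd
from pandas.api.types import union_categoricals

# drop values, then prune the categories that are no longer used
ser_cat = pd.Series(['A', 'B', 'C', 'A'], dtype='category')
trimmed = ser_cat[ser_cat != 'C']
trimmed = trimmed.cat.remove_unused_categories()
print(trimmed.cat.categories)  # Index(['A', 'B'], dtype='object')

# merge categoricals coming from multiple sources
c1 = pd.Categorical(['A', 'B'])
c2 = pd.Categorical(['B', 'C'])
merged = union_categoricals([c1, c2])
print(merged.categories)  # Index(['A', 'B', 'C'], dtype='object')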
How and where are these values stored in pandas? Is there a way to introspect the dtype and read or manipulate the underlying dictionary and indices?
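The closest I have found is going through pandas internals, which I would rather avoid. A sketch of what I mean (note that _pa_array is a private attribute of ArrowExtensionArray, called _data in pandas 2.0, so this is not a stable API):

# the dtype itself exposes the underlying pyarrow type
print(ser.dtype.pyarrow_dtype)             # dictionary<values=string, indices=int32, ordered=0>
print(ser.dtype.pyarrow_dtype.value_type)  # string
print(ser.dtype.pyarrow_dtype.index_type)  # int32

# reaching the actual arrays requires a private attribute
chunked = ser.array._pa_array              # pa.ChunkedArray, pandas-internal
dict_arr = chunked.combine_chunks()        # a single pa.DictionaryArray
print(dict_arr.dictionary)                 # the values: ['A', 'B', 'C']
print(dict_arr.indices)                    # the codes: [0, 1, 2, 0, 1, 2]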
It is also not clear to me why, for example, assigning a value outside the existing categories to a pd.Categorical Series raises a TypeError, while the same assignment on a DictionaryArray-backed Series does not.
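A minimal reproduction of the asymmetry I mean, reusing ser from the example above (the exact error message may differ across pandas versions):

ser_cat = pd.Series(['A', 'B', 'C'], dtype='category')
try:
    ser_cat[0] = 'Z'  # 'Z' is not among the existing categories
except TypeError as exc:
    print(exc)  # Cannot setitem on a Categorical with a new category...

ser[0] = 'Z'  # ...while the dictionary-backed series accepts it on my setup
print(ser)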