I'm evaluating the possibility of using Arrow-based data types in our data flows.
Our flows are based on pandas, and using dtype_backend='pyarrow' seems to work pretty well (basically this option makes constructors and I/O readers return Arrow-backed types).
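For context, this is roughly how we opt in (a minimal sketch, assuming pandas >= 2.0; the frame contents are just placeholders):

import pandas as pd

# existing frames can be converted; I/O readers such as pd.read_csv
# accept the same dtype_backend='pyarrow' option
df = pd.DataFrame({'x': [1, 2, None], 'y': ['a', 'b', 'c']})
df = df.convert_dtypes(dtype_backend='pyarrow')
print(df.dtypes)  # x: int64[pyarrow], y: Arrow-backed string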
I'm running into some problems when using pyarrow dictionary types (pyarrow.DictionaryArray) in place of pd.Categorical.
Here is an example:
import pyarrow as pa
import pandas as pd

def pyarrow_cat_dtype(vals):
    # dictionary-encode the values and wrap the resulting
    # pyarrow DictionaryType in a pandas ArrowDtype
    as_dict_vals = pa.array(vals).dictionary_encode()
    return pd.ArrowDtype(as_dict_vals.type)

vals = ['A', 'B', 'C']
ser = pd.Series(vals * 2, dtype=pyarrow_cat_dtype(vals))
The series is of type:
dictionary<values=string, indices=int32, ordered=0>[pyarrow]
During the ETL process the series is then modified, and it is not clear to me how to reflect those changes in the dictionary values (the categories).
Suppose we need to drop most of the values, or merge categoricals coming from multiple sources.
With a Categorical this can be done by accessing the .categories attribute (or the .cat accessor methods),
or by passing observed=True to a groupby as described in the documentation,
but I cannot find anything similar to manage a DictionaryArray.
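To make the comparison concrete, this is roughly the Categorical workflow I have in mind (a sketch using the real helpers .cat.remove_unused_categories and pd.api.types.union_categoricals; the variable names are mine):

import pandas as pd
from pandas.api.types import union_categoricals

# drop values, then prune the categories that are no longer used
ser_cat = pd.Series(['A', 'B', 'C', 'A'], dtype='category')
trimmed = ser_cat[ser_cat != 'C']
trimmed = trimmed.cat.remove_unused_categories()
print(trimmed.cat.categories)  # Index(['A', 'B'], dtype='object')

# merge categoricals coming from multiple sources
c1 = pd.Categorical(['A', 'B'])
c2 = pd.Categorical(['B', 'C'])
merged = union_categoricals([c1, c2])
print(merged.categories)  # Index(['A', 'B', 'C'], dtype='object')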
How and where are these values stored in pandas? Is there a way to introspect the dtype and read or manipulate the underlying dictionary and indices?
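The closest I have found is going through pandas internals, which I would rather avoid. A sketch of what I mean (note that _pa_array is a private attribute of ArrowExtensionArray, called _data in pandas 2.0, so this is not a stable API):

# the dtype itself exposes the underlying pyarrow type
print(ser.dtype.pyarrow_dtype)             # dictionary<values=string, indices=int32, ordered=0>
print(ser.dtype.pyarrow_dtype.value_type)  # string
print(ser.dtype.pyarrow_dtype.index_type)  # int32

# reaching the actual arrays requires a private attribute
chunked = ser.array._pa_array              # pa.ChunkedArray, pandas-internal
dict_arr = chunked.combine_chunks()        # a single pa.DictionaryArray
print(dict_arr.dictionary)                 # the values: ['A', 'B', 'C']
print(dict_arr.indices)                    # the codes: [0, 1, 2, 0, 1, 2]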
It is also not clear to me why, for example, assigning a value outside the existing categories to a pd.Categorical Series raises a TypeError, while the same assignment on a DictionaryArray-backed Series does not.
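A minimal reproduction of the asymmetry I mean, reusing ser from the example above (the exact error message may differ across pandas versions):

ser_cat = pd.Series(['A', 'B', 'C'], dtype='category')
try:
    ser_cat[0] = 'Z'  # 'Z' is not among the existing categories
except TypeError as exc:
    print(exc)  # Cannot setitem on a Categorical with a new category...

ser[0] = 'Z'  # ...while the dictionary-backed series accepts it on my setup
print(ser)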