How can I use Pandera to assert whether a column has one of multiple data types?

My Pandas dataframes need to adhere to the following Pandera schema:

import pandera as pa
from pandera.typing import Series

class schema(pa.SchemaModel):
    name: Series[str]
    id: Series[str]

However, in some dataframe instances, the "id" column will only contain integers and thus will get the "int" datatype when using pd.read_csv().
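To illustrate the inference behaviour described above, here is a minimal sketch using made-up CSV content (the names and ids are assumptions, not my real data). It also shows that passing `dtype={"id": str}` to `pd.read_csv()` forces the column to be read as strings, which may or may not be an option depending on where the dataframes come from.

```python
import io

import pandas as pd

# Hypothetical CSV contents, for illustration only.
csv_ints = "name,id\nalice,1\nbob,2\ncarol,3\n"
csv_mixed = "name,id\nalice,1\nbob,a2\ncarol,3\n"

# With only digits in "id", pandas infers an integer dtype.
df_ints = pd.read_csv(io.StringIO(csv_ints))

# With at least one non-numeric value, "id" is read as strings instead.
df_mixed = pd.read_csv(io.StringIO(csv_mixed))

# Forcing the dtype at read time keeps "id" as strings in both cases.
df_forced = pd.read_csv(io.StringIO(csv_ints), dtype={"id": str})

print(df_ints["id"].dtype)
print(df_mixed["id"].dtype)
print(df_forced["id"].tolist())
```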

For example, I have the following dataframe:

[Image: example of a dataframe containing columns "name" and "id" with three rows, where "id" is always an integer]

When I run schema(df).validate() I get the error: pandera.errors.SchemaError: expected series 'id' to have type str, got int64

However, in other cases the dataframe might look something like this:

[Image: example of a dataframe containing columns "name" and "id" with three rows, where "id" is sometimes a string]

I would like to account for both situations by allowing the column to have either of the two data types.

This is what I tried (but it doesn't seem to be the correct syntax, as the validation method won't run):

import pandera as pa
from pandera.typing import Series
from typing import Union

class schema(pa.SchemaModel):
    name: Series[str]
    id: Union[Series[str], Series[int]]

Is there any way to do this in Pandera?

There is 1 answer

thalhamm On

It's not implemented yet: https://github.com/unionai-oss/pandera/issues/1152

There is also a linked pull request that seems to have been stale for a couple of months: https://github.com/unionai-oss/pandera/pull/1227