How to create mixed type data in pandas

This is a rather non-standard question. For educational purposes, I'm trying to create a mixed type column in a csv file, so that I get a warning message when importing the dataset into a pandas DataFrame and can then deal with that column to show how it's done.

The problem is that when I type 0s into a string column in Excel, save, and close the file, the clever pandas still imports that column as a string column, so it doesn't detect that there are in fact floats in it.

I also tried to change the format of only these 0s in pandas using astype('float'), exporting and re-importing. Still doesn't work.

Does anyone have an idea how I can create a column that pandas will read as mixed type?

Thanks in advance!

There are 2 answers

jso On

You can create a DataFrame with mixed values.

When reading a large file, pandas guesses the type of each column chunk by chunk. If the chunks of a column get different guessed types, a DtypeWarning is emitted, but each value keeps the type that was guessed for its chunk.

import pandas as pd

# Column 'a' is mostly digits, with a block of non-numeric values in the middle
df = pd.DataFrame({'a': (['1'] * 100000 +
                         ['X'] * 100000 +
                         ['1'] * 100000),
                   'b': ['b'] * 300000})
df.to_csv('test.csv', index=False)

# when reading it pandas emits a DtypeWarning: Columns (0) have mixed types
df2 = pd.read_csv('test.csv')

>>> type(df2.iloc[262140, 0])
<class 'str'>
>>> type(df2.iloc[262150, 0])
<class 'int'>

It's probably not something you want in production-ready code, but it can be useful in tests and when debugging your code.

See the documentation: https://pandas.pydata.org/docs/reference/api/pandas.errors.DtypeWarning.html
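Since the question also asks how to deal with such a column afterwards, here is a minimal sketch of one possible approach: declare the column's dtype when reading, so pandas does not have to guess per chunk, then use pd.to_numeric to find the entries that are not numbers. The temporary directory and the variable names (csv_path, df_str, as_number, n_bad) are assumptions made for this illustration, not part of the answer above.

```python
import os
import tempfile

import pandas as pd

# Rebuild the example file from the answer in a temporary directory
# (the location is an assumption made for this sketch).
df = pd.DataFrame({'a': ['1'] * 100000 + ['X'] * 100000 + ['1'] * 100000,
                   'b': ['b'] * 300000})
csv_path = os.path.join(tempfile.mkdtemp(), 'test.csv')
df.to_csv(csv_path, index=False)

# Declaring the dtype up front means pandas does not guess it chunk by chunk,
# so every value in column 'a' comes back as a string.
df_str = pd.read_csv(csv_path, dtype={'a': str})
types_seen = set(df_str['a'].map(type))

# Convert what can be converted; entries that cannot be parsed become NaN.
as_number = pd.to_numeric(df_str['a'], errors='coerce')
n_bad = int(as_number.isna().sum())

print(types_seen)
print(n_bad)
```

Reading everything as strings first keeps the load step predictable; which rows to drop, fix, or keep as text is then an explicit decision rather than a side effect of type guessing.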

Matus Dubrava On

I'm trying to create a mixed type column in a csv file, so that I get a warning message when importing the dataset in a pandas

pandas will always infer the type of a column (a Series object), and this is always going to be a single type. If every value in the column is a string, then pandas will load it as a column of strings.

If there are "mixed" values that can't reasonably be loaded as strings, integers, and so on, then the inferred type will simply be dtype: object. For a small file like the one below, this also means that you will get no warning.


You can force the type when loading a DataFrame via the dtype parameter.

pd.read_csv("test_file.csv", index_col=0, dtype=int)

Now pandas will try to convert everything to int, and if there are values that can't be converted to int, you will get an exception such as

ValueError: invalid literal for int() with base 10: 'a'

when trying to load a dataset that contains the string a. But again, this will not produce a warning; the operation will simply fail.
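To see this failure without stopping a script, the error can be caught. The following is a small sketch; the in-memory CSV text and the names csv_text and error_message are made up for the illustration.

```python
import io

import pandas as pd

# Hypothetical sample data: two integers followed by a non-numeric value.
csv_text = "mix\n1\n2\na\n"

try:
    # Forcing int on a column that contains 'a' cannot succeed.
    pd.read_csv(io.StringIO(csv_text), dtype={"mix": int})
    error_message = None
except ValueError as exc:
    # pandas raises instead of warning, so the load is all-or-nothing.
    error_message = str(exc)

print(error_message)
```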


Here is how you can create a mixed column.

import pandas as pd

df = pd.DataFrame()
df["mix"] = ["a", "b", 1, True]

# Write the mixed column to disk and read it back
df.to_csv("test_file.csv")
df_again = pd.read_csv("test_file.csv", index_col=0)
print(df_again["mix"])

The type of the mix column is object:

...

Name: mix, dtype: object


If you change the read_csv call in the above code to

df_again = pd.read_csv("test_file.csv", index_col=0, dtype=int)

you will get the mentioned error.
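It may also help to look at the Python type of each individual value, before and after the trip through CSV. The sketch below uses an in-memory buffer instead of test_file.csv (an assumption made so the example leaves no file behind); the names types_in_memory and types_after_csv are illustrative.

```python
import io

import pandas as pd

df = pd.DataFrame({"mix": ["a", "b", 1, True]})

# In memory, the column really does hold different Python types.
types_in_memory = [type(v).__name__ for v in df["mix"]]

# Round-trip through CSV using an in-memory buffer.
buffer = io.StringIO()
df.to_csv(buffer)
buffer.seek(0)
df_again = pd.read_csv(buffer, index_col=0)

# CSV is plain text, so after reading back every value is a string.
types_after_csv = [type(v).__name__ for v in df_again["mix"]]

print(types_in_memory)   # ['str', 'str', 'int', 'bool']
print(types_after_csv)   # ['str', 'str', 'str', 'str']
```

This suggests why a short hand-made column rarely shows up as mixed after import: the original types are not stored in the file, and pandas settles on one interpretation for the column it reads.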