Pandas HDFStore: append fails when min_itemsize is set to the maximum of the string column


I'm detecting the maximum length of every string column across multiple DataFrames, then attempting to build an HDFStore:

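import pandas as pd

# Assumed to be defined elsewhere: `paths` (pickled DataFrame files),
# `hdf_path` (a pathlib.Path) and `hdf_key` (the key used in the store)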

# Detect max string length for each column across all DataFrames
max_lens = {}
for df_path in paths:
    df = pd.read_pickle(df_path)
    for col in df.columns:
        ser = df[col]
        if ser.dtype == 'object' and isinstance(
            ser.loc[ser.first_valid_index()], str
        ):
            max_lens[col] = max(
                ser.dropna().map(len).max(), max_lens.setdefault(col, 0)
            )
print('Setting min itemsizes:', max_lens)

hdf_path.unlink(missing_ok=True)  # Delete the file for a clean retry (missing_ok needs Python 3.8+)
store = pd.HDFStore(hdf_path, complevel=9)
for df_path in paths:
    df = pd.read_pickle(df_path)
    store.append(hdf_key, df, min_itemsize=max_lens, data_columns=True)
store.close()

The detected maximum string lengths are as follows:

max_lens = {'hashtags': 139,
            'id': 19,
            'source': 157,
            'text': 233,
            'urls': 2352,
            'user_mentions_user_ids': 199,
            'in_reply_to_screen_name': 17,
            'in_reply_to_status_id': 19,
            'in_reply_to_user_id': 19,
            'media': 286,
            'place': 56,
            'quoted_status_id': 19,
            'user_id': 19}

Yet still I'm getting this error:

ValueError: Trying to store a string with len [220] in [hashtags] column but
this column has a limit of [194]!
Consider using min_itemsize to preset the sizes on these columns

Which is weird, because the detected maximum length of hashtags is 139.

1 Answer

tsorn (best answer):

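HDFStore encodes strings as UTF-8 before writing them, so min_itemsize is a size in bytes, not in characters. Any non-ASCII character (accented letters, emoji, CJK text) takes 2 to 4 bytes in UTF-8, so the longest string in a column can need more room than its character count suggests. Encode the strings as UTF-8 first, then take the maximum length: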

a_pandas_string_series.str.encode('utf-8').str.len().max()
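
To see how far the two measurements can drift apart, here is a minimal sketch of the detection loop from the question, rewritten to count UTF-8 bytes. The helper name `max_utf8_lens` and the sample data are made up for illustration, and it takes DataFrames that are already loaded rather than pickle paths. It also skips empty columns instead of probing `first_valid_index()`.

```python
import pandas as pd


def max_utf8_lens(frames):
    """Return {column: max UTF-8 byte length} over the string columns of `frames`.

    Hypothetical helper, not part of pandas.
    """
    max_lens = {}
    for df in frames:
        for col in df.columns:
            values = df[col].dropna()
            # Skip empty columns and columns holding anything other than str
            if values.empty or not values.map(lambda v: isinstance(v, str)).all():
                continue
            # Encode first, then measure: this counts bytes, not characters
            byte_len = int(values.str.encode('utf-8').str.len().max())
            max_lens[col] = max(byte_len, max_lens.get(col, 0))
    return max_lens


# Made-up sample data: 'día' is 3 characters but 4 bytes,
# '日本語' is 3 characters but 9 bytes
df1 = pd.DataFrame({'hashtags': ['python', 'día'], 'n': [1, 2]})
df2 = pd.DataFrame({'hashtags': ['日本語', None], 'n': [3, 4]})

# Character-based maximum, as computed in the question
char_len = int(pd.concat([df1, df2])['hashtags'].dropna().str.len().max())
# Byte-based maximum, as suggested in this answer
byte_lens = max_utf8_lens([df1, df2])

print(char_len, byte_lens)  # 6 {'hashtags': 9}
```

The resulting dict should be usable as `min_itemsize` in the `store.append(...)` call from the question in place of the character-based `max_lens`.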