How to create synthetic data based on a dataset with mixed data types for a classification problem?


I am trying to build a classification model, but I don't have enough data. What would be the most appropriate way to create synthetic data based on my existing dataset if I have both numerical and categorical features? I looked at using Vine copulas as described here: https://sdv.dev/Copulas/tutorials/03_Multivariate_Distributions.html#Vine-Copulas but sampling from such a copula gives floats even for the columns that I would like to be integers (label-encoded categorical features), and I don't know how to convert those floats back to categorical features. Toy code that reproduces my problem is below.

import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.datasets import fetch_openml
from copulas.multivariate import VineCopula, GaussianMultivariate

X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)
X['label'] = y

# reducing features and removing nulls to keep things simple
X = X[['sex', 'age', 'fare', 'embarked', 'label']]
row_keep = X.isnull().sum(axis=1) == 0
df = X.loc[row_keep, :].copy()
df.reset_index(drop=True, inplace=True)

# encoding columns
cat_cols = ['sex', 'embarked', 'label']
num_cols = ['age', 'fare']

label_encoders = {}
for c in cat_cols:
    cat_proc = preprocessing.LabelEncoder()
    col_proc = cat_proc.fit_transform(df[c])
    df[c] = col_proc
    label_encoders[c] = cat_proc

# Fit a copula
copula = VineCopula('regular')
copula.fit(df)

# Sample synthetic data
df_synthetic = copula.sample(1000)

All the columns of df_synthetic are floats. How would I convert those back to ints that I can map back to categorical features? Is there another way to augment this sort of dataset? It would be even better if the approach is performant enough to sample 7000-10000 new synthetic entries. The toy problem above, with 5 columns, took about 1 minute to sample 1000 rows, but my real problem has 27 columns, which I imagine would take a lot longer.


There is 1 answer

rikyeah On

To have your columns converted to ints, use round and then .astype(int):

df_synthetic["sex"] = df_synthetic["sex"].round().astype(int)
df_synthetic["embarked"] = df_synthetic["embarked"].round().astype(int)
df_synthetic["label"] = df_synthetic["label"].round().astype(int)

You might have to adjust values manually (e.g., cap sex to [0, 1] if a value outside that range has been generated), but that will strongly depend on the characteristics of your data.
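
As a rough sketch of how rounding, capping, and mapping back to the original categories could be combined, the snippet below uses a hypothetical helper, decode_categoricals, which is not part of Copulas or scikit-learn. It assumes a label_encoders dict of fitted LabelEncoder objects like the one built in the question. The small df_synthetic frame is stand-in data made up for illustration, not real copula output. Keep in mind that rounding treats the label codes as if they were ordered, which alphabetical label encoding does not guarantee, so this is only an approximation.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def decode_categoricals(df_synthetic, label_encoders):
    """Hypothetical helper: round, clip, and decode label-encoded columns."""
    decoded = df_synthetic.copy()
    for col, encoder in label_encoders.items():
        max_code = len(encoder.classes_) - 1
        # Round to the nearest code, then cap to the valid range [0, max_code].
        codes = decoded[col].round().clip(0, max_code).astype(int)
        # Map the integer codes back to the original category labels.
        decoded[col] = encoder.inverse_transform(codes.to_numpy())
    return decoded


# Stand-in data (assumption): fitted encoders plus float samples
# similar in shape to what copula.sample() returns.
label_encoders = {
    "sex": LabelEncoder().fit(["female", "male"]),
    "embarked": LabelEncoder().fit(["C", "Q", "S"]),
}
df_synthetic = pd.DataFrame({
    "sex": [-0.3, 0.4, 0.6, 1.7],
    "embarked": [0.2, 1.1, 2.6, -0.8],
    "age": [22.5, 35.1, 41.0, 8.3],
})

df_decoded = decode_categoricals(df_synthetic, label_encoders)
print(df_decoded)
```

With the question's code, the same helper would be called as decode_categoricals(df_synthetic, label_encoders) right after copula.sample(1000). Numerical columns such as age and fare are left untouched.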