How to create synthetic data based on dataset with mixed data types for classification problem?

Question

Asked by gmh on 21 April 2022 at 04:32 · 740 views

I am trying to build a classification model, but I don't have enough data. What would be the most appropriate way to create synthetic data based on my existing dataset if I have both numerical and categorical features? I looked at using Vine copulas like here: https://sdv.dev/Copulas/tutorials/03_Multivariate_Distributions.html#Vine-Copulas but sampling such copulas gives floats even for the columns that I would like to be integers (label-encoded categorical features), and I don't know how to convert those floats back to categorical features. Sample toy code for my problem is below:

import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.datasets import fetch_openml
from copulas.multivariate import VineCopula, GaussianMultivariate

X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)
X['label'] = y

# reducing features and removing nulls to keep things simple
X = X[['sex', 'age', 'fare', 'embarked', 'label']]
row_keep = X.isnull().sum(axis=1) == 0
df = X.loc[row_keep, :].copy()
df.reset_index(drop=True, inplace=True)

# encoding columns
cat_cols = ['sex', 'embarked', 'label']
num_cols = ['age', 'fare']

label_encoders = {}
for c in cat_cols:
    cat_proc = preprocessing.LabelEncoder()
    col_proc = cat_proc.fit_transform(df[c])
    df[c] = col_proc
    label_encoders[c] = cat_proc

# Fit a copula
copula = VineCopula('regular')
copula.fit(df)

# Sample synthetic data
df_synthetic = copula.sample(1000)

All the columns of df_synthetic are floats. How would I convert those back to ints that I can map back to categorical features? Is there another way to augment this sort of dataset? It would be even better if it were performant, so that I could sample 7000-10000 new synthetic entries. The toy problem above, with 5 columns, took about 1 minute to sample 1000 rows, but my real problem has 27 columns, which I imagine would take a lot longer.

Original Q&A

There is 1 answer

**rikyeah** · Answer 1 · 2022-04-22T11:52:56+00:00

To have your columns converted to ints, use round and then .astype(int):

df_synthetic["sex"] = round(df_synthetic["sex"]).astype(int)
df_synthetic["embarked"] = round(df_synthetic["embarked"]).astype(int)
df_synthetic["label"] = round(df_synthetic["label"]).astype(int)

You might have to adjust values manually (e.g. clip sex to [0, 1] if the copula generated a value outside that range), but that will strongly depend on the characteristics of your data.
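The round, clip, and map-back steps can be combined with the `label_encoders` dict from the question. The following is only a sketch under assumptions: `decode_categoricals` is a hypothetical helper name, and `fake_samples` is a hand-written stand-in for the output of `copula.sample(...)` so the snippet runs without fitting a copula. Keep in mind that rounding treats the label codes as ordered numbers, which may distort nominal columns such as embarked.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def decode_categoricals(df_synthetic, label_encoders):
    """Round float codes, clip them to the valid range, and map them back to labels.

    Hypothetical helper, not part of copulas or scikit-learn.
    """
    decoded = df_synthetic.copy()
    for col, encoder in label_encoders.items():
        n_classes = len(encoder.classes_)
        codes = (
            decoded[col]
            .round()                              # nearest integer code
            .clip(lower=0, upper=n_classes - 1)   # force out-of-range codes into range
            .astype(int)
        )
        decoded[col] = encoder.inverse_transform(codes)
    return decoded


# Stand-in for the fitted encoders built in the question's loop.
label_encoders = {
    "sex": LabelEncoder().fit(["female", "male"]),
    "embarked": LabelEncoder().fit(["C", "Q", "S"]),
}

# Stand-in for copula.sample(...): float codes, some outside the valid range.
fake_samples = pd.DataFrame({
    "sex": [-0.3, 0.2, 0.9, 1.4],
    "embarked": [0.1, 1.2, 2.6, 1.8],
    "age": [22.5, 38.0, 4.2, 61.7],
})

decoded = decode_categoricals(fake_samples, label_encoders)
print(decoded)
```

With the real data you would pass `df_synthetic` and the `label_encoders` dict from the question instead of the stand-ins. Numerical columns such as age are left untouched.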
