I am fitting a Kernel Density Estimation instance on multivariate data using the scikit-learn implementation, with a 'gaussian' kernel and 'silverman' as the bandwidth estimator.
- .fit() is used to fit the KDE on the training data
- .sample(n_instances) is used to generate n new data points from the estimated distribution
However, I have noticed that many of the generated values fall outside the ranges of the original variables. As an example, I attach the output of the code I used, where I compare the ranges of the original training data with those of the newly generated data points.
from sklearn.neighbors import KernelDensity

kde = KernelDensity(kernel=kernel, bandwidth='silverman').fit(data)
print('data:', data)
examples = kde.sample(n_instances, random_state=0)
Each tuple corresponds to one variable, in the form: (column index, variable minimum, variable maximum, average).
Imagine this behaviour replicated across n dimensions, since I am working with many variables; it is problematic. Is there a way to draw a random sample from the distribution such that the data stays within the ranges of my original variables?