I have a dataset in the following format and I'm trying to compute a kernel density estimate (KDE) with an optimal bandwidth.
data = np.array([[1, 4, 3], [2, .6, 1.2], [2, 1, 1.2],
[2, 0.5, 1.4], [5, .5, 0], [0, 0, 0],
[1, 4, 3], [5, .5, 0], [2, .5, 1.2]])
But I couldn't figure out how to approach it, or how to find the Σ matrix.
UPDATE
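I tried the KernelDensity class from scikit-learn to compute a univariate (1D) KDE:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KernelDensity
# sklearn.grid_search was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import GridSearchCV

# x is assumed to be a 1D array of samples and x_grid a 1D array of
# evaluation points; neither is defined in this snippet

# kde function
def kde_sklearn(x, x_grid, bandwidth):
    # KernelDensity expects 2D input of shape (n_samples, n_features)
    kde = KernelDensity(kernel='gaussian', bandwidth=bandwidth).fit(x[:, np.newaxis])
    log_pdf = kde.score_samples(x_grid[:, np.newaxis])
    return np.exp(log_pdf)

# optimal bandwidth selection
grid = GridSearchCV(KernelDensity(), {'bandwidth': np.linspace(.1, 1.0, 30)}, cv=20)
grid.fit(x[:, np.newaxis])
bw = grid.best_params_['bandwidth']  # best_params_ is a dict, so pull out the value

# pdf using kde
pdf = kde_sklearn(x, x_grid, bw)
fig, ax = plt.subplots()
ax.plot(x_grid, pdf, label='bw={}'.format(bw))
ax.legend(loc='best')
plt.show()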
Can anyone help me extend this to multivariate data, in this case 3D?
Interesting problem. You have a few options:
This blog post goes into detail about the relative merits of various library implementations of Kernel Density Estimation (KDE).
I'm going to show you what is, in my opinion (yes, this is a bit opinion-based), the simplest way, which I think is option 2 in your case.
NOTE: This method uses a rule of thumb, as described in the linked docs, to determine the bandwidth. The exact rule used is Scott's rule. Your mention of the Σ matrix makes me think rule-of-thumb bandwidth selection is OK for you, but you also mention optimal bandwidth, and the example you present uses cross-validation to choose the best bandwidth. So if this method doesn't suit your purposes, let me know in the comments.
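A minimal sketch of the rule-of-thumb approach on the sample data from the question. It assumes the method meant here is `scipy.stats.gaussian_kde`, whose default bandwidth is Scott's rule, and it reads "the Σ matrix" as the covariance of the Gaussian kernel; both are my assumptions rather than something the question states.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Sample data from the question: 9 observations, 3 dimensions
data = np.array([[1, 4, 3], [2, .6, 1.2], [2, 1, 1.2],
                 [2, 0.5, 1.4], [5, .5, 0], [0, 0, 0],
                 [1, 4, 3], [5, .5, 0], [2, .5, 1.2]])

# gaussian_kde expects shape (n_dims, n_points), hence the transpose.
# With no bw_method given, Scott's rule is used.
kde = gaussian_kde(data.T)

# Scott's factor is n ** (-1 / (d + 4)); computed by hand for comparison
n, d = data.shape
scott_factor = n ** (-1.0 / (d + 4))

# Kernel covariance (assumed to be the "Σ matrix" from the question):
# the sample covariance of the data scaled by the squared bandwidth factor
sigma = kde.covariance

# Evaluate the estimated density at the observed points
density = kde(data.T)

print("bandwidth factor:", kde.factor)
print("kernel covariance:\n", sigma)
print("density at the data points:", density)
```

To evaluate the estimate elsewhere, pass any array of shape `(3, m)` to `kde(...)`; the result has shape `(m,)`.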
Caveat
This may give terrible results, depending on your particular problem. Things to bear in mind:
As the number of dimensions goes up, the number of observed data points you need goes up exponentially. Your sample data of 9 points in 3 dimensions is pretty sparse, although I assume the dots indicate that you in fact have many more.
As mentioned in the main body, the bandwidth is selected in a particular way, which may result in over-smoothing (or, conceivably but less likely, under-smoothing) of the estimated pdf.
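If the rule-of-thumb bandwidth smooths too much, the cross-validation approach from the question carries over to 3D, because scikit-learn's `KernelDensity` accepts input of shape `(n_samples, n_features)`. The sketch below is illustrative only: the bandwidth grid and `cv=3` are arbitrary choices of mine, picked so that it runs on the 9 sample points. Note that `KernelDensity` uses a single scalar bandwidth for all dimensions, so it may be worth standardizing the columns first if their scales differ a lot.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

# Sample data from the question: 9 observations, 3 dimensions
data = np.array([[1, 4, 3], [2, .6, 1.2], [2, 1, 1.2],
                 [2, 0.5, 1.4], [5, .5, 0], [0, 0, 0],
                 [1, 4, 3], [5, .5, 0], [2, .5, 1.2]])

# Candidate bandwidths (arbitrary illustrative grid). GridSearchCV scores
# each one by the held-out log-likelihood returned by KernelDensity.score.
params = {'bandwidth': np.linspace(0.1, 2.0, 20)}
grid = GridSearchCV(KernelDensity(kernel='gaussian'), params, cv=3)
grid.fit(data)  # rows are samples, columns are dimensions

best_bw = grid.best_params_['bandwidth']
kde = grid.best_estimator_  # refit on all of the data with best_bw

# score_samples returns log-densities; exponentiate to get the pdf values
pdf = np.exp(kde.score_samples(data))

print("selected bandwidth:", best_bw)
print("density at the data points:", pdf)
```

With this few points the selected bandwidth will be noisy, so treat the result as a demonstration of the mechanics rather than a recommendation.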