I have a binary dataset of 1s and 0s (7248x2048) and I want to reduce the number of features from 2048 to 256. I have already tried an autoencoder, which performs well, and now I wonder whether a deep belief network (a stack of BernoulliRBMs from scikit-learn) could also reduce the features, and perhaps do it faster. I followed this previous implementation of a DBN.
How can I assess the performance of the DBN? I tried building a pipeline with layers 1024 -> 512 -> 256 -> 512 -> 1024 -> 2048 and then calculating its "reconstruction error". Does this make sense?
Is the decreasing pseudo-likelihood in the encoding part a promising sign? If you know of other similar DBN implementations in TensorFlow or PyTorch, I would appreciate pointers.
The .score_samples method calculates the pseudo-likelihood, and I am not sure how to interpret it.
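To make the pseudo-likelihood question concrete, here is a minimal sketch of reading .score_samples on a single RBM. The small random array X_small and the layer size are only illustrative assumptions, not the real data. As far as I understand, the method returns one value per row, the values are non-positive, and values closer to zero suggest a better fit. The estimate is stochastic, so repeated calls can differ slightly.

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM

# Hypothetical small binary array, used only to keep the example fast
rng = np.random.RandomState(0)
X_small = rng.randint(0, 2, size=(200, 64)).astype(float)

rbm = BernoulliRBM(n_components=16, learning_rate=0.1, batch_size=16,
                   n_iter=5, random_state=0)
rbm.fit(X_small)

# One pseudo-likelihood value per sample (a proxy for the log-likelihood)
scores = rbm.score_samples(X_small)

# The number printed with verbose=1 during fitting is the mean of such values
mean_pl = scores.mean()
print(scores.shape, mean_pl)
```

Because the value appears to be scaled by the number of visible features, comparing it between layers of different widths (for example 2048 inputs versus 1024 inputs) is probably not meaningful; comparing epochs within the same layer is safer.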
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline

# Random binary data with the same shape as the real dataset
df = pd.DataFrame(np.random.randint(0, 2, size=(7248, 2048)))
X_train, X_test = train_test_split(df, test_size=0.2, random_state=0)
X_train, X_val = train_test_split(X_train, test_size=0.15, random_state=0)

learning_rate = 0.1
total_units = 2048
total_epochs = 20
batch_size = 16

rbm1 = BernoulliRBM(n_components=total_units // 2, learning_rate=learning_rate, batch_size=batch_size, n_iter=total_epochs, verbose=1)
rbm2 = BernoulliRBM(n_components=total_units // 4, learning_rate=learning_rate, batch_size=batch_size, n_iter=total_epochs, verbose=1)
rbm3 = BernoulliRBM(n_components=total_units // 8, learning_rate=learning_rate, batch_size=batch_size, n_iter=total_epochs, verbose=1)
rbm4 = BernoulliRBM(n_components=total_units // 4, learning_rate=learning_rate, batch_size=batch_size, n_iter=total_epochs, verbose=1)
rbm5 = BernoulliRBM(n_components=total_units // 2, learning_rate=learning_rate, batch_size=batch_size, n_iter=total_epochs, verbose=1)
rbmout = BernoulliRBM(n_components=total_units, learning_rate=learning_rate, batch_size=batch_size, n_iter=total_epochs, verbose=1)

model = Pipeline(steps=[('rbm1', rbm1),('rbm2', rbm2),('rbm3',rbm3),('rbm4',rbm4),('rbm5', rbm5),('rbmout', rbmout)])
model.fit(X_train)
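
# Reset the index so the rows line up with preds (subtract aligns on labels)
actual = X_val.reset_index(drop=True)
# Use transform, not fit_transform, so the model is not refit on the validation set
preds = pd.DataFrame(model.transform(X_val))
dif = preds.subtract(actual)
dif2 = np.square(dif)
# Sum of squared differences per row, then averaged over the validation rows
dif2['loss'] = dif2.sum(axis=1)
dif2['loss'].mean()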
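
One caveat about the "reconstruction error" above: the output of the pipeline is the hidden activation of the last, separately trained RBM, not a decoder tied to the encoder. An alternative that may be closer to an autoencoder-style reconstruction is to train only the encoding half and decode back through each RBM's own weights. The following is a minimal sketch of that different technique, not of the pipeline above. The names X_demo, sizes and decode are hypothetical, and the small layer sizes are chosen only so it runs quickly.

```python
import numpy as np
from scipy.special import expit
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline

# Hypothetical small binary array standing in for the real data
rng = np.random.RandomState(0)
X_demo = rng.randint(0, 2, size=(300, 64)).astype(float)

# Encoder only: 64 -> 32 -> 16 -> 8, trained greedily layer by layer
sizes = [32, 16, 8]
encoder = Pipeline(steps=[
    (f"rbm{i}", BernoulliRBM(n_components=n, learning_rate=0.1,
                             batch_size=16, n_iter=5, random_state=0))
    for i, n in enumerate(sizes, start=1)
])
encoder.fit(X_demo)

# Reduced features: hidden activation probabilities of the last RBM
codes = encoder.transform(X_demo)

def decode(pipeline, hidden):
    """Map codes back to input space through each RBM's visible units."""
    v = hidden
    for _, rbm in reversed(pipeline.steps):
        # P(v=1 | h) = sigmoid(h @ W + visible bias), W has shape (n_hidden, n_visible)
        v = expit(v @ rbm.components_ + rbm.intercept_visible_)
    return v

recon = decode(encoder, codes)

# Two simple error measures on the reconstructed probabilities
mse = np.mean((recon - X_demo) ** 2)
bit_error = np.mean((recon > 0.5) != (X_demo > 0.5))
print(codes.shape, mse, bit_error)
```

With this setup the same mse or bit_error can be computed for the autoencoder on the same validation rows, which would give a like-for-like comparison at the 256-unit bottleneck.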