The LogisticRegression classifier in Python's sklearn library has a .fit() method that takes x_train (features) and y_train (labels) as arguments to train the classifier.
It seems that x_train.shape = (number_of_samples, number_of_features).
For x_train I should use the extracted xvector.scp file, which I am reading like so:
b = kaldiio.load_scp('xvector.scp')
And I can print the content like so:
for file_id in b:
    xvector = b[file_id]
    print(xvector)
Right now b behaves like a dictionary: indexing it with a file ID returns the corresponding x-vector. I want to use sklearn's logistic regression to classify the x-vectors, and to call .fit() I have to pass an array as the argument.
My question is: how can I build an array that contains only the x-vectors?
PS: there are about 1 million file IDs and each x-vector has length 512, which seems too big for a single array.
It seems you are trying to store the dictionary's values in a numpy array. If the dictionary is small, you can stack the values directly into one 2-D array.
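For the small case, something along these lines should work. This is only a sketch: it assumes every x-vector has the same length, the helper name xvectors_to_array is made up for illustration, and a plain dict stands in for the object returned by kaldiio.load_scp so the snippet runs without any Kaldi files.

```python
import numpy as np


def xvectors_to_array(xvectors):
    """Stack a mapping of id -> 1-D vector into (ids, 2-D array).

    `xvectors` is anything dict-like; the assumption here is that the
    object returned by kaldiio.load_scp can be used the same way.
    """
    ids = list(xvectors)  # fix the key order once so rows and ids line up
    x_train = np.stack(
        [np.asarray(xvectors[file_id], dtype=np.float32) for file_id in ids]
    )
    return ids, x_train


# Stand-in data (hypothetical ids); replace with b = kaldiio.load_scp('xvector.scp')
demo = {"utt1": np.zeros(512), "utt2": np.ones(512)}
ids, x_train = xvectors_to_array(demo)
print(x_train.shape)  # (2, 512)
```

Keeping the list of ids next to the array matters because y_train has to be built in the same row order.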
However, this will run into out-of-memory (OOM) issues if the dictionary is large. In that case, you would need to use np.memmap, as explained here: https://ipython-books.github.io/48-processing-large-numpy-arrays-with-memory-mapping/
Essentially, you write rows into the array one at a time and flush it periodically, before you run out of memory. The array is stored directly on disk, so it avoids OOM issues.
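A rough sketch of the memmap approach is below. The function name dict_to_memmap, the flush_every parameter, and the file name are all made up for illustration, and a small plain dict again stands in for the kaldiio object; float32 storage is also an assumption.

```python
import os
import tempfile

import numpy as np


def dict_to_memmap(xvectors, path, dim, flush_every=10000):
    """Write id -> vector pairs into a disk-backed float32 array, row by row."""
    ids = list(xvectors)  # row i of the array corresponds to ids[i]
    shape = (len(ids), dim)
    mm = np.memmap(path, dtype=np.float32, mode="w+", shape=shape)
    for row, file_id in enumerate(ids):
        mm[row] = xvectors[file_id]  # only one x-vector is loaded at a time
        if (row + 1) % flush_every == 0:
            mm.flush()  # push pending writes to disk
    mm.flush()
    del mm  # drop the writable mapping
    return ids, shape


# Stand-in data (hypothetical ids); replace with b = kaldiio.load_scp('xvector.scp')
demo = {"utt1": np.arange(512), "utt2": np.ones(512)}
path = os.path.join(tempfile.mkdtemp(), "xvectors.dat")
ids, shape = dict_to_memmap(demo, path, dim=512, flush_every=1)

# Reopen read-only; the raw file stores no shape or dtype, so pass them again.
x_train = np.memmap(path, dtype=np.float32, mode="r", shape=shape)
print(x_train.shape)  # (2, 512)
```

For scale, 1 million x-vectors of length 512 in float32 come to roughly 2 GB on disk. Whether .fit() then stays within memory depends on the solver and on any copies sklearn makes internally, so treat this as a starting point.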