I have a dataset for sequence labeling. I ran PCA on it and plotted the first two principal components on the x and y axes; the result is shown below:
I then trained an LSTM network to classify this dataset and extracted the activations from its hidden layer. The result looks like the figure below:
My question is: what conclusions can I draw from comparing the two results? Is it fair to say that the features of the original dataset have become self-organized after passing through the LSTM classifier?
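A minimal sketch of the kind of comparison described above, assuming the hidden activations are projected onto two principal components in the same way as the raw features. The function name `pca_2d` and the arrays `X_raw` and `H` are hypothetical: the arrays are random placeholders standing in for the input features and the extracted LSTM hidden activations, and their shapes are chosen only for illustration.

```python
import numpy as np

def pca_2d(X):
    """Project the rows of X (n_samples, n_features) onto the top 2 principal components."""
    Xc = X - X.mean(axis=0)  # center each feature
    # Rows of Vt are the principal directions, ordered by decreasing singular value
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T

rng = np.random.default_rng(0)
# Placeholder data (assumption): 200 time steps, 10 input features, 32 hidden units
X_raw = rng.normal(size=(200, 10))   # stands in for the original features
H = rng.normal(size=(200, 32))       # stands in for the LSTM hidden activations

Z_raw = pca_2d(X_raw)  # 2-D coordinates for the first plot
Z_hid = pca_2d(H)      # 2-D coordinates for the second plot
print(Z_raw.shape, Z_hid.shape)
```

Plotting `Z_raw` and `Z_hid` side by side, colored by label, would reproduce the two figures being compared; each projection is fitted separately, so the axes of the two plots are not directly comparable.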