I am using t-SNE to create a 2D projection for visualization from a higher-dimensional dataset (30 dimensions in this case), and I have a question about the perplexity hyperparameter.
It's been a while since I used t-SNE, and I had previously only applied it to smaller datasets (<1,000 data points), where the recommended perplexity range of 5-50 (van der Maaten and Hinton) was sufficient to reveal the underlying data structure.
Currently I am working with a dataset of 340,000 data points, and since perplexity controls the balance between local and non-local structure in the representation, I suspect that this many data points would require a perplexity much higher than 50 (especially if the data are not well separated in the high-dimensional space).
Does anyone have any experience with setting the optimal perplexity on datasets with a larger number of data points (>100k)?
I would be really interested to hear your experiences and which methods you use to determine the optimal perplexity (or optimal perplexity range).
An interesting article suggests that the optimal perplexity follows a simple power law (roughly N^0.5); I would be interested to know what others think about that.
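Just to make that heuristic concrete for my own dataset (this is only the article's rule of thumb, not something I have validated):

```python
import math

n_points = 340_000                  # size of my dataset
perplexity = math.sqrt(n_points)    # the ~N^0.5 heuristic from the article
print(round(perplexity))            # -> 583, far above the usual 5-50 range
```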
Thanks for your help
This is largely empirical, so I recommend just playing around with values. But I can share my experience...
I had a dataset of about 400k records, each of ~70 dimensions. I ran scikit-learn's implementation of t-SNE with perplexity values of 5, 15, 50, and 100, and noticed that the clusters looked the same beyond 50. I gathered that 5-15 was too small, 50 was enough, and increasing the perplexity further didn't make much difference. The run time was a nightmare, though.
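For what it's worth, the sweep was nothing fancier than something like this (a sketch; `X` is a placeholder for your own feature matrix, and each fit on 400k points takes a long time):

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# X is the (n_samples, ~70) feature matrix -- placeholder for your own data
for perplexity in (5, 15, 50, 100):
    embedding = TSNE(
        n_components=2,
        perplexity=perplexity,
        init="pca",        # more stable starting layout than random init
        random_state=0,
    ).fit_transform(X)

    # eyeball how the cluster structure changes with perplexity
    plt.figure()
    plt.scatter(embedding[:, 0], embedding[:, 1], s=1)
    plt.title(f"perplexity = {perplexity}")

plt.show()
```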
The openTSNE implementation is much faster, and it offers an interesting guide on how to use smaller and larger perplexity values at different stages of the same run of the algorithm in order to get the advantages of both. Loosely speaking, it starts the optimization with a high perplexity for a small number of steps to capture global structure, then continues with a lower perplexity to refine the local structure.
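If it helps, that two-stage run looks roughly like this. This is a sketch from memory of openTSNE's advanced-usage docs; the perplexity values and iteration counts are just illustrative, so double-check the exact argument names against the openTSNE documentation:

```python
import openTSNE

# X is the feature matrix -- placeholder for your own data

# Stage 1: high perplexity for a short run to capture global structure
affinities = openTSNE.affinity.PerplexityBasedNN(
    X, perplexity=500, n_jobs=32, random_state=0
)
init = openTSNE.initialization.pca(X, random_state=0)
embedding = openTSNE.TSNEEmbedding(
    init, affinities, negative_gradient_method="fft", n_jobs=32
)
embedding.optimize(n_iter=250, exaggeration=12, momentum=0.5, inplace=True)

# Stage 2: drop to a standard perplexity and refine the local structure
affinities.set_perplexity(50)
embedding.optimize(n_iter=750, momentum=0.8, inplace=True)
```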
I used this implementation on a dataset of 1.5 million records with ~200 dimensions. The data came from the same source system as the first dataset I mentioned. I didn't play with the perplexity values here because the total runtime on a 32-CPU VM was several hours, but the clusters looked remarkably similar to the ones on the smaller dataset (it recreated the same distinct, binary-classification-like clusters), so I was happy.