I'm using scikit-learn and numpy and I want to set the global seed so that my work is reproducible.
Should I use `numpy.random.seed` or `random.seed`?
From the link in the comments, I understand that they are different, and that the numpy version is not thread-safe. I want to know specifically which one to use to create IPython notebooks for data analysis. Some of the algorithms from scikit-learn involve generating random numbers, and I want to be sure that the notebook shows the same results on every run.
That depends on whether in your code you are using numpy's random number generator or the one in `random`.

The random number generators in `numpy.random` and `random` have totally separate internal states, so `numpy.random.seed()` will not affect the random sequences produced by `random.random()`, and likewise `random.seed()` will not affect `numpy.random.randn()` etc. If you are using both `random` and `numpy.random` in your code then you will need to separately set the seeds for both.
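For example, a minimal sketch showing that each generator keeps its own state and must be seeded on its own:

```python
import random
import numpy as np

# The two generators have completely independent internal states:
# seeding one has no effect on the sequence produced by the other.
np.random.seed(42)   # seeds numpy's global RandomState
random.seed(42)      # seeds the stdlib Mersenne Twister

print(np.random.randn())  # reproducible because of np.random.seed(42)
print(random.random())    # reproducible because of random.seed(42)
```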
**Update**

Your question seems to be specifically about scikit-learn's random number generators. As far as I can tell, scikit-learn uses `numpy.random` throughout, so you should use `np.random.seed()` rather than `random.seed()`.
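So for a notebook, calling `np.random.seed()` once in the first cell is enough to make the numpy-backed parts repeatable. A minimal sketch, using `KMeans` purely as an illustrative estimator that falls back to numpy's global RNG when no `random_state` is given:

```python
import numpy as np
from sklearn.cluster import KMeans

np.random.seed(0)  # set once at the top of the notebook

X = np.random.rand(100, 2)        # same data on every run
km = KMeans(n_clusters=3).fit(X)  # same cluster assignments on every run
print(km.labels_[:10])
```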
One important caveat is that `np.random` is not thread-safe - if you set a global seed, then launch several subprocesses and generate random numbers within them using `np.random`, each subprocess will inherit the RNG state from its parent, meaning that you will get identical random variates in each subprocess. The usual way around this problem is to pass a different seed (or `numpy.random.RandomState` instance) to each subprocess, such that each one has a separate local RNG state.

Since some parts of scikit-learn can run in parallel using joblib, you will see that some classes and functions have an option to pass either a seed or an `np.random.RandomState` instance (e.g. the `random_state=` parameter to `sklearn.decomposition.MiniBatchSparsePCA`). I tend to use a single global seed for a script, then generate new random seeds based on the global seed for any parallel functions, as in the sketch below.
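A minimal sketch of that single-global-seed pattern; the data and hyperparameters here are placeholders:

```python
import numpy as np
from sklearn.decomposition import MiniBatchSparsePCA

GLOBAL_SEED = 0
rng = np.random.RandomState(GLOBAL_SEED)  # one global RNG for the script

X = rng.randn(50, 20)  # placeholder data

# Derive a fresh, reproducible seed for each parallel-capable call,
# so every call gets its own local RNG state.
seeds = rng.randint(0, 2**32 - 1, size=3)

models = [
    MiniBatchSparsePCA(n_components=5, random_state=int(seed)).fit(X)
    for seed in seeds
]
```

Passing an `np.random.RandomState` instance as `random_state` works the same way, but an integer seed is easier to log and reuse across runs.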