I am creating one figure with around one hundred subplots/axes, each with a few thousand data points. Currently, I am looping through each subplot and using plt.scatter to place the points. However, this is quite slow. Is it possible to use multiple CPUs to speed up the plotting, either by assigning one core per subplot or by splitting up the data points within a single subplot?
So far, I have tried using joblib to create the subplots in parallel processes, but rather than creating new subplots within the same figure, it spawns a new figure for each subplot. I have tried the backends PDF, Qt5Agg, and Agg. Here is a simplified example of my code.
import matplotlib as mpl
mpl.use('PDF')
import seaborn as sns
import matplotlib.pyplot as plt
from joblib import Parallel, delayed

def plotter(name, df, ax):
    ax.scatter(df['petal_length'], df['sepal_length'])

iris = sns.load_dataset('iris')
fig, axes = plt.subplots(3, 1)

Parallel(n_jobs=2)(
    delayed(plotter)(species_name, species_df, ax)
    for (species_name, species_df), ax in zip(iris.groupby('species'), axes.ravel()))
fig.savefig('test.pdf')
Setting n_jobs=1 works: all points are then plotted within the same figure. However, increasing it above one creates four figures: one that I initiate with plt.subplots and then one for each time ax.scatter is called.
Since I am passing the axes from the first figure to plotter, I am not sure how or why the additional figures are created. Is there some fallback in matplotlib that causes new figures to be created automatically if the specified figure is "locked" by another plotting process?
Any advice on how to improve my current approach, or on achieving the speedup through an alternative approach, is appreciated.
Joblib's Parallel uses the multiprocessing module for spawning processes, so each job runs in a different process. That is why you get a new figure for each job: unlike threads, processes do not share memory, so the workers have no access to the original figure.
You could try using threads instead, but it is questionable whether you would gain any speed, because of the global interpreter lock (GIL).
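To make the difference concrete, here is a minimal sketch contrasting the two cases. It assumes that arguments sent to a worker process are pickled, so the first part simply round-trips an axes through pickle in the same process to mimic what a worker would receive. The second part uses ThreadPoolExecutor from the standard library rather than joblib, and synthetic random data rather than the iris set; both are stand-ins chosen for illustration. Matplotlib is not documented as thread-safe, so treat the threaded part as an experiment, not a recommendation.

```python
import pickle
from concurrent.futures import ThreadPoolExecutor

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no window needed
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in data: three (x, y) sets of 1000 points each.
data = [rng.normal(size=(2, 1000)) for _ in range(3)]

fig, axes = plt.subplots(3, 1)

# 1) Mimic what a worker process receives: a pickled copy of the axes,
#    which drags along a copy of its whole figure.
ax_copy = pickle.loads(pickle.dumps(axes[0]))
ax_copy.scatter(data[0][0], data[0][1])
copy_is_separate = ax_copy.figure is not fig        # the copy lives in another figure
original_untouched = len(axes[0].collections) == 0  # nothing landed on the original

# 2) Threads share memory, so they draw on the original axes.
def plotter(xy, ax):
    ax.scatter(xy[0], xy[1])

with ThreadPoolExecutor(max_workers=2) as pool:
    # list() forces completion and re-raises any exception from a worker
    list(pool.map(plotter, data, axes.ravel()))

# Number of scatter collections now attached to each original axes.
points_per_axes = [len(ax.collections) for ax in axes.ravel()]
print(copy_is_separate, original_untouched, points_per_axes)
```

Because of the GIL, the threaded version is unlikely to be much faster than a plain loop; its only advantage here is that the points end up in the figure you created.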
To speed up the plotting, you could try to avoid pyplot. It adds some overhead and, in interactive mode, redraws the plot after each plotting command. This is mostly geared toward making, for example, IPython feel more like Matlab, but it is bad for speed. If you use only matplotlib, without pyplot, you can choose to draw the plot only once you have finished it, which will probably save considerable time.
Note: @Faultier mentioned in a comment that you can enable and disable interactive plotting with pyplot.ion() and pyplot.ioff().
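As a rough sketch of what plotting without pyplot can look like, the example below builds the figure directly from matplotlib.figure.Figure, attaches an Agg canvas, and renders only once when saving. The grid size, point count, and random data are hypothetical placeholders, and whether this is noticeably faster for your figure is something you would have to measure.

```python
import io

import numpy as np
from matplotlib.backends.backend_agg import FigureCanvasAgg
from matplotlib.figure import Figure

# Hypothetical sizes for illustration; scale these up to your real grid.
n_rows, n_cols, n_points = 5, 5, 2000
rng = np.random.default_rng(0)

fig = Figure(figsize=(10, 10))  # created directly, not registered with pyplot
FigureCanvasAgg(fig)            # attach a non-interactive Agg canvas
axes = fig.subplots(n_rows, n_cols)

for ax in axes.ravel():
    x, y = rng.normal(size=(2, n_points))  # synthetic stand-in data
    ax.scatter(x, y, s=2)

# Nothing has been rendered so far; the figure is drawn once, here.
buf = io.BytesIO()
fig.savefig(buf, format="png")
png_bytes = buf.getvalue()
```

To write a file instead of an in-memory buffer, pass a filename to fig.savefig.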