How many threads does Clojure's pmap function spawn for URL-fetching operations?

6.6k views Asked by At

The documentation on the pmap function leaves me wondering how efficient it would be for something like fetching a collection of XML feeds over the web. I have no idea how many concurrent fetch operations pmap would spawn and what the maximum would be.

4

There are 4 answers

4
Alex Miller On BEST ANSWER

If you check the source you see:

> (use 'clojure.repl)
> (source pmap)
(defn pmap
  "Like map, except f is applied in parallel. Semi-lazy in that the
  parallel computation stays ahead of the consumption, but doesn't
  realize the entire result unless required. Only useful for
  computationally intensive functions where the time of f dominates
  the coordination overhead."
  {:added "1.0"}
  ([f coll]
   (let [n (+ 2 (.. Runtime getRuntime availableProcessors))
         rets (map #(future (f %)) coll)
         step (fn step [[x & xs :as vs] fs]
                (lazy-seq
                 (if-let [s (seq fs)]
                   (cons (deref x) (step xs (rest s)))
                   (map deref vs))))]
     (step rets (drop n rets))))
  ([f coll & colls]
   (let [step (fn step [cs]
                (lazy-seq
                 (let [ss (map seq cs)]
                   (when (every? identity ss)
                     (cons (map first ss) (step (map rest ss)))))))]
     (pmap #(apply f %) (step (cons coll colls))))))

The (+ 2 (.. Runtime getRuntime availableProcessors)) is a big clue there. pmap will grab the first (+ 2 processors) pieces of work and run them asynchronously via future. So if you have 2 cores, it's going to launch 4 pieces of work at a time, trying to keep a bit ahead of you but the max should be 2+n.

future ultimately uses the agent I/O thread pool which supports an unbounded number of threads. It will grow as work is thrown at it and shrink if threads are unused.

2
mikera On

Building on Alex's excellent answer that explains how pmap works, here's my suggestion for your situation:

(doall
  (map
    #(future (my-web-fetch-function %))
    list-of-xml-feeds-to-fetch))

Rationale:

  • You want as many pieces of work in-flight as you can, since most will block on network IO.
  • Future will fire off an asynchronous piece of work for each request, to be handled in a thread pool. You can let Clojure take care of that intelligently.
  • The doall on the map will force the evaluation of the full sequence (i.e. the launch of all the requests).
  • Your main thread can start dereferencing the futures right away, and can therefore continue making progress as the individual results come back
0
John Lawrence Aspden On

No time to write a long response, but there's a clojure.contrib http-agent which creates each get/post request as its own agent. So you can fire off a thousand requests and they'll all run in parallel and complete as the results come in.

0
user1181316 On

Looking the operation of pmap, it seems to go 32 threads at a time no mater what number of processors you have, the issue is that map will go ahead of the computation by 32 and the futures are started in their own. (SAMPLE) (defn samplef [n] (println "starting " n) (Thread/sleep 10000) n) (def result (pmap samplef (range 0 100)))

; you will wait for 10 seconds and see 32 prints then when you take the 33rd an other 32 ; prints this mins that you are doing 32 concurrent threads at a time ; to me this is not perfect ; SALUDOS Felipe