How to utilize memory/performance when processing a big file in Clojure


How can I manage memory and performance when processing a large time series data set?

Size: ~3.2 GB

Lines: ~54 million

First few lines of the dataset:

{:ts 20200601040025269 :bid 107.526000 :ask 107.529000}
{:ts 20200601040025370 :bid 107.525000 :ask 107.529000}
{:ts 20200601040026421 :bid 107.525000 :ask 107.528000}
{:ts 20200601040026724 :bid 107.524000 :ask 107.528000}
{:ts 20200601040027424 :bid 107.524000 :ask 107.528000}
{:ts 20200601040033535 :bid 107.524000 :ask 107.527000}
{:ts 20200601040034230 :bid 107.523000 :ask 107.526000}

Helper functions

(require '[clojure.java.io :as io])

(defn lines [n filename]
  ;; Eagerly read the first n lines of the file.
  (with-open [rdr (io/reader filename)]
    (doall (take n (line-seq rdr)))))

(def dataset (into [] (lines 2000 "./data/rawdata.map")))

For best performance, I would like to load as much of the data into memory as possible. However, my notebook has only 16 GB of RAM, and when I load more data, CPU and memory usage reach almost 95%.

  1. Can I manage memory better when working with a large dataset in Clojure?
  2. Can I reserve a memory buffer to hold the data set?
  3. Since this is time series data and memory is limited, can I process the first batch and then retrieve the next batch with line-seq? (A rough sketch of what I mean is below.)
  4. What data structure should I use to implement this?
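
For point 3, something along these lines is what I have in mind (just a rough sketch; the batch size and the process-batch function are placeholders):

(require '[clojure.java.io :as io]
         '[clojure.edn :as edn])

(defn process-in-batches [filename batch-size process-batch]
  ;; Read lazily and only realize one batch of parsed records at a time.
  (with-open [rdr (io/reader filename)]
    (doseq [batch (partition-all batch-size (line-seq rdr))]
      (process-batch (mapv edn/read-string batch)))))

;; e.g. (process-in-batches "./data/rawdata.map" 10000 #(println (count %)))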

Please feel free to comment.

Thanks


There are 2 answers

Rulle (accepted answer)

Because the dataset consists of only 54,000,000 lines, you can fit it into memory if you pack the records tightly. Assuming that is what you want to do, e.g. for the convenience of random access, here is one approach.

The reason you are not able to fit it into memory is likely the overhead of all the objects used to represent each record read from the file. But if you flatten the values out into, for example, a byte buffer, the amount of space needed to store them is not that great. You could represent the timestamp as one byte per digit, and the amounts using a fixed-point representation. Here is a quick and dirty solution.

(require '[clojure.java.io :as io]
         '[clojure.edn :as edn])
(import '(java.nio ByteBuffer))

;; Amounts are stored as fixed-point ints: value * 1000.
(def fixed-pt-factor 1000)
;; One record = 17 timestamp digits + 4 bytes bid + 4 bytes ask.
(def record-size (+ 17 4 4))
(def max-count 54000000)

(defn put-amount [^ByteBuffer dst amount]
  (let [x (* fixed-pt-factor amount)]
    (.putInt dst (int x))))

(defn push-record [^ByteBuffer dst m]
  ;; Timestamp: convert to string and push one byte per digit.
  (doseq [c (str (:ts m))]
    (.put dst (byte c)))
  (put-amount dst (:bid m))
  (put-amount dst (:ask m))
  dst)

(defn get-amount [^ByteBuffer src pos]
  (/ (BigDecimal. (.getInt src (int pos)))
     fixed-pt-factor))

(defn record-count [^ByteBuffer dataset]
  (quot (.position dataset) record-size))

(defn nth-record [^ByteBuffer dataset n]
  (let [offset (* n record-size)
        ts-str (apply str (map (fn [i] (char (.get dataset (int (+ offset i)))))
                               (range 17)))]
    {:ts (edn/read-string ts-str)
     :bid (get-amount dataset (+ offset 17))
     :ask (get-amount dataset (+ offset 17 4))}))

(defn load-dataset [filename]
  (let [result (ByteBuffer/allocate (* record-size max-count))]
    (with-open [rdr (io/reader filename)]
      (transduce (map edn/read-string) (completing push-record) result (line-seq rdr)))
    result))

You can then use load-dataset to load the dataset, record-count to get the number of records, and nth-record to get the nth record:

(def dataset (load-dataset filename))

(record-count dataset)
;; => 7

(nth-record dataset 2)
;; => {:ts 20200601040026421, :bid 107.525M, :ask 107.528M}

Exactly how you choose to represent the values in the byte buffer is up to you; I did not optimize it particularly. The loaded dataset in this example will only require about 54000000 * 25 bytes = 1.35 GB, which fits in memory (although you may have to raise the maximum heap size with the JVM's -Xmx flag).

In case you need to load larger files than this, consider putting the data into a memory-mapped file instead of an in-memory byte buffer, along the lines of the sketch below.
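
A minimal sketch of that memory-mapped alternative (an illustration, not part of the original answer; the file name and the choice to map the whole buffer up front are assumptions). The mapped buffer is a ByteBuffer, so the push-record and nth-record functions above work on it unchanged:

(import '(java.io RandomAccessFile)
        '(java.nio.channels FileChannel$MapMode))

(defn map-dataset-file
  "Memory-maps size bytes of filename and returns the MappedByteBuffer.
   The mapping stays valid after the underlying file is closed."
  [filename size]
  (with-open [raf (RandomAccessFile. ^String filename "rw")]
    (.map (.getChannel raf) FileChannel$MapMode/READ_WRITE 0 size)))

;; e.g. (def dataset (map-dataset-file "./data/packed.bin" (* record-size max-count)))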

pete23

Use deftype to create a type with a long ts and doubles for bid and ask. If you parse your line strings into instances of this type, you will find that a 54-million-row dataset should fit in memory easily: 24 bytes of data, plus ~8 bytes of object header, plus ~8 bytes of reference in the array, makes about 40 bytes per record, so around 2 GB of heap.
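
A rough sketch of what that deftype approach could look like (illustrative only, not the answer's own code; the type and helper names are made up):

(require '[clojure.java.io :as io]
         '[clojure.edn :as edn])

;; One record: a primitive long timestamp plus two primitive doubles.
(deftype Tick [^long ts ^double bid ^double ask])

(defn parse-tick
  "Parses one EDN line like {:ts ... :bid ... :ask ...} into a Tick."
  [line]
  (let [{:keys [ts bid ask]} (edn/read-string line)]
    (Tick. ts bid ask)))

(defn load-ticks
  "Eagerly reads every line of filename into a vector of Ticks."
  [filename]
  (with-open [rdr (io/reader filename)]
    (into [] (map parse-tick) (line-seq rdr))))

;; Field access: (.-bid ^Tick (first (load-ticks "./data/rawdata.map")))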

More exotic solutions (primitive arrays for a column store, or flyweights to access packed byte buffers) are possible but unneeded for your stated problem parameters.

Example code to follow; I only have my phone to hand.