How can I manage memory and performance when processing a large time-series dataset?
Size: ~3.2 GB
Lines: ~54 million
First few lines of the dataset:
{:ts 20200601040025269 :bid 107.526000 :ask 107.529000}
{:ts 20200601040025370 :bid 107.525000 :ask 107.529000}
{:ts 20200601040026421 :bid 107.525000 :ask 107.528000}
{:ts 20200601040026724 :bid 107.524000 :ask 107.528000}
{:ts 20200601040027424 :bid 107.524000 :ask 107.528000}
{:ts 20200601040033535 :bid 107.524000 :ask 107.527000}
{:ts 20200601040034230 :bid 107.523000 :ask 107.526000}
Helper functions
(require '[clojure.java.io :as io])

(defn lines [n filename]
  (with-open [rdr (io/reader filename)]
    (doall (take n (line-seq rdr)))))

(def dataset (into [] (lines 2000 "./data/rawdata.map")))
For best performance, I would like to keep as much of the data in memory as possible. However, my notebook has only 16 GB of RAM, and when I load more data, CPU/memory utilization reaches almost 95%.
- Can I manage memory better with a large dataset in Clojure?
- Can I reserve a memory buffer to hold the dataset?
- Since this is time-series data and memory is limited, can I process it in batches: once the first batch has been processed, retrieve the next batch with line-seq (roughly as in the sketch after this list)?
- Which data structure would you suggest for implementing this?
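For reference, this is roughly what I have in mind for the batch idea (an untested sketch; process-in-batches is a made-up name and the EDN parsing of each line is an assumption):

(require '[clojure.java.io :as io]
         '[clojure.edn :as edn])

;; Read the file lazily and hand it to process-batch! one chunk at a time,
;; so only one batch is realized in memory at once.
(defn process-in-batches [filename batch-size process-batch!]
  (with-open [rdr (io/reader filename)]
    (doseq [batch (partition-all batch-size (line-seq rdr))]
      (process-batch! (mapv edn/read-string batch)))))

;; e.g. (process-in-batches "./data/rawdata.map" 100000 #(println (count %)))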
Please feel free to comment.
Thanks
Because the dataset consists of only 54 million lines, you can fit it into memory if you pack the data tightly. Assuming that is what you want, e.g. for the sake of convenient random access, here is one approach.
The reason you are not able to fit it into memory is likely the overhead of all the objects used to represent each record read from the file. But if you flatten the values out into, for example, a byte buffer, the amount of space needed to store them is not that great. You could represent the timestamp simply as one byte per digit, and the prices using some fixed-point representation. Here is a quick and dirty solution.
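Something along these lines (a sketch; the exact layout of 17 timestamp digits plus two 4-byte fixed-point prices, 25 bytes per record, is an assumption chosen to match the sample data and the size estimate below):

(ns dataset.pack
  (:require [clojure.java.io :as io])
  (:import [java.nio ByteBuffer]))

(def ^:const ts-digits 17)                   ;; e.g. 20200601040025269
(def ^:const record-size (+ ts-digits 4 4))  ;; 25 bytes per record
(def ^:const price-scale 1000000)            ;; prices stored as micro-units

(defn- parse-line
  "Pull [ts-string bid ask] out of a line like
   {:ts 20200601040025269 :bid 107.526000 :ask 107.529000}"
  [line]
  (let [[_ ts bid ask] (re-find #":ts (\d+) :bid ([\d.]+) :ask ([\d.]+)" line)]
    [ts (Double/parseDouble bid) (Double/parseDouble ask)]))

(defn- put-record! [^ByteBuffer buf [ts bid ask]]
  (dotimes [i ts-digits]                      ;; timestamp: one byte per digit
    (.put buf (byte (- (int (.charAt ^String ts i)) (int \0)))))
  (.putInt buf (int (Math/round (double (* bid price-scale)))))   ;; fixed-point bid
  (.putInt buf (int (Math/round (double (* ask price-scale))))))  ;; fixed-point ask

(defn load-dataset
  "Pack up to n-records lines of filename into a single ByteBuffer.
   The record count (or an upper bound) must be supplied up front,
   since ByteBuffer/allocate needs a fixed size."
  [filename n-records]
  (let [buf (ByteBuffer/allocate (int (* n-records record-size)))]
    (with-open [rdr (io/reader filename)]
      (doseq [line (take n-records (line-seq rdr))]
        (put-record! buf (parse-line line))))
    (.flip buf)
    buf))

(defn record-count [^ByteBuffer buf]
  (quot (.limit buf) record-size))

(defn nth-record
  "Decode the nth record back into {:ts <string> :bid <double> :ask <double>}."
  [^ByteBuffer buf n]
  (let [offset (* n record-size)
        ts     (apply str (map #(char (+ (int \0) (.get buf (int (+ offset %)))))
                               (range ts-digits)))
        bid    (/ (.getInt buf (int (+ offset ts-digits)))   (double price-scale))
        ask    (/ (.getInt buf (int (+ offset ts-digits 4))) (double price-scale))]
    {:ts ts :bid bid :ask ask}))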
You can then use load-dataset to load the dataset, record-count to get the number of records, and nth-record to get the nth record, for example:
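;; Usage sketch (the total record count is assumed to be known up front):
(def ds (load-dataset "./data/rawdata.map" 54000000))
(record-count ds)   ;; => 54000000
(nth-record ds 0)   ;; => {:ts "20200601040025269", :bid 107.526, :ask 107.529}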
Exactly how you choose to represent the values in the byte buffer is up to you; I did not optimize it particularly. The loaded dataset in this example will only require about 54000000 * 25 bytes = 1.35 GB, which fits in memory (you may have to tweak some JVM flags, such as the maximum heap size, though...).
In case you need to load larger files than this, you may consider putting the data into a memory-mapped file instead of an in-memory byte buffer.
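A minimal sketch of that alternative, assuming the records have already been packed into a binary file with the same 25-byte layout (mmap-dataset and the filename are just illustrative names):

(import '[java.io RandomAccessFile]
        '[java.nio.channels FileChannel$MapMode])

;; Memory-map an already-packed binary file; the returned MappedByteBuffer
;; is a ByteBuffer, so record-count and nth-record above work on it unchanged.
;; Note that a single mapping is limited to about 2 GB, so very large files
;; would need to be split across several mappings.
(defn mmap-dataset [packed-filename]
  (let [raf  (RandomAccessFile. packed-filename "r")
        chan (.getChannel raf)]
    (.map chan FileChannel$MapMode/READ_ONLY 0 (.size chan))))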