I have to process a csv file with millions of "value, datetime" lines: I need to find whether the value shifted (a drop or a rise) by GAP_VALUE (first parameter) within a given time period TIME_GAP (second parameter).
I am not able to find an efficient algorithm to perform this analysis.
I tried to analyze the csv file line by line:
Choose (TIME_GAP, GAP_VALUE) parameters for the analysis.
Creating the data subset
- Get the current datetime, the UPPER_TIME_BOUND
- Subtract the TIME_GAP from it to find the LOWER_TIME_BOUND
- Extract all data >= LOWER_TIME_BOUND && <= UPPER_TIME_BOUND
- Detect the drop or the rise within the data subset (not sure about my method):
- Find MAX_VALUE and MIN_VALUE in the extracted subset (not sure about this because of the high fluctuation of the data)
- Subtract MIN_VALUE from MAX_VALUE to find the VALUE_SHIFT (the datetime order is important to detect whether it is a drop or a rise)
- If VALUE_SHIFT >= GAP_VALUE, output the datetime when it occurred and the shift value.
Do the same for the next line in the csv file.
If the (TIME_GAP, GAP_VALUE) parameters change, redo the whole analysis. A simplified sketch of this per-line approach is shown below.
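A minimal sketch of what I am doing, assuming the lines are already sorted by datetime (the Entry struct and the naive_scan name are just placeholders for illustration):

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

struct Entry {
    double value;
    double time;  // lines are already sorted by datetime
};

void naive_scan(const std::vector<Entry>& entries,
                double TIME_GAP, double GAP_VALUE) {
    for (std::size_t i = 0; i < entries.size(); ++i) {
        // UPPER_TIME_BOUND is the current line's datetime,
        // LOWER_TIME_BOUND is TIME_GAP before it.
        double lower = entries[i].time - TIME_GAP;
        double min_v = entries[i].value;
        double max_v = entries[i].value;

        // Re-scan the whole subset for every single line: this is the costly part.
        for (std::size_t j = i; j > 0 && entries[j - 1].time >= lower; --j) {
            min_v = std::min(min_v, entries[j - 1].value);
            max_v = std::max(max_v, entries[j - 1].value);
        }

        double shift = max_v - min_v;  // VALUE_SHIFT
        if (shift >= GAP_VALUE)
            std::printf("shift of %g at %g\n", shift, entries[i].time);
    }
}
```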
My algorithm is highly resource-consuming. Could you help me find a more efficient one, please?
Thank you, BR.
You need a sorted list of counting buckets of values within the current sliding window (e.g. a sorted_map<value, counter> of buckets). I assume you already know how to walk a sliding window over an already sorted set of entry lines in O(n), so I'm only going to describe the actions you need to take on the corresponding window updates.

The requirement for a sorted list of buckets gives a runtime of O(n * log(m)) and a space complexity of O(m), with n being the total number of entry lines and m being the maximum number of unique values within the window.
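As a rough illustration of those window updates (not a full solution; the Entry struct and detect_shifts name below are placeholders, std::map stands in for the sorted_map, and the entries are assumed to be pre-sorted by datetime):

```cpp
#include <cstdio>
#include <map>
#include <vector>

struct Entry {
    double value;
    double time;  // entries are assumed sorted ascending by this field
};

void detect_shifts(const std::vector<Entry>& entries,
                   double TIME_GAP, double GAP_VALUE) {
    std::map<double, int> buckets;  // sorted counting buckets: value -> count in window
    std::size_t left = 0;

    for (std::size_t right = 0; right < entries.size(); ++right) {
        // Window grows: count the new value, O(log m).
        ++buckets[entries[right].value];

        // Window shrinks: drop entries older than UPPER_TIME_BOUND - TIME_GAP,
        // O(log m) per removed entry.
        while (entries[left].time < entries[right].time - TIME_GAP) {
            auto it = buckets.find(entries[left].value);
            if (--(it->second) == 0) buckets.erase(it);
            ++left;
        }

        // Smallest and largest values in the window are the first and last keys.
        double shift = buckets.rbegin()->first - buckets.begin()->first;
        if (shift >= GAP_VALUE)
            std::printf("shift of %g at %g\n", shift, entries[right].time);
    }
}
```

Each entry is inserted into and removed from the bucket map exactly once, which is where the O(n * log(m)) total comes from.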