Summarized huge data: how to handle it with R?


I am working with EBS Forex market Limit Order Book (LOB) data; here is an example of the LOB in a 100-millisecond time slice:

datetime | side (0=Bid, 1=Ask) | distance (1=best price, 2=2nd best, etc.) | price
2008/01/28,09:11:28.000,0,1,1.6066
2008/01/28,09:11:28.000,0,2,1.6065
2008/01/28,09:11:28.000,0,3,1.6064
2008/01/28,09:11:28.000,0,4,1.6063
2008/01/28,09:11:28.000,0,5,1.6062

2008/01/28,09:11:28.000,1,1,1.6067
2008/01/28,09:11:28.000,1,2,1.6068
2008/01/28,09:11:28.000,1,3,1.6069
2008/01/28,09:11:28.000,1,4,1.6070
2008/01/28,09:11:28.000,1,5,1.6071

2008/01/28,09:11:28.500,0,1,1.6065 (I skip the rest)

To summarize the data, they apply two rules (I have changed them a bit for simplicity):

  1. If there is no change in the LOB on the Bid or Ask side, they do not record that side. Look at the last line of the data: the millisecond field was 000 and is now 500, which means there was no change in the LOB on either side at 100, 200, 300 and 400 milliseconds (but that information is important for any calculation).

  2. When the last price (only the last) is removed from a given side of the order book, they write a single record with nothing in the price field. Again, there will be no record of the whole LOB at that time.

    Example: 2008/01/28,09:11:28.800,0,1,

I want to calculate minAsk - maxBid (1.6067 - 1.6066) or the weighted average price (using the sizes at all distances as weights; there is a size column in my real data), and I want to do this for my whole data set. But as you can see, the data has been summarized, so this is not routine. I have written code to reproduce the whole data (not just the summary). This is fine for a small data set, but for a large one I end up creating a huge file. I was wondering if you have any tips on how to handle the data: how can I fill the gaps efficiently?
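
For reference, a minimal sketch of reading the sample above into R (the column names are my own; note that date and time arrive as two comma-separated fields, and a rule-2 removal record's empty price field comes in as NA):

txt <- "2008/01/28,09:11:28.000,0,1,1.6066
2008/01/28,09:11:28.000,0,2,1.6065
2008/01/28,09:11:28.000,1,1,1.6067
2008/01/28,09:11:28.500,0,1,1.6065
2008/01/28,09:11:28.800,0,1,"
data <- read.csv(text = txt, header = FALSE,
                 col.names = c("date", "time", "side", "distance", "price"))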

1 Answer

Answer from flodel:

You did not give a great reproducible example, so this will be pseudo/untested code. Read the docs carefully and make adjustments as needed.

I'd suggest you first filter and split your data into two data.frames:

# keep only the top-of-book (distance == 1) rows for each side
best.bid <- subset(data, side == 0 & distance == 1)
best.ask <- subset(data, side == 1 & distance == 1)

Then, for each of these two data.frames, use findInterval to look up the prevailing (most recent) best ask or best bid from the other side:

# for each bid time, findInterval returns the index of the latest ask record
# at or before that time (and vice versa), i.e. last observation carried forward
best.bid$ask <- best.ask$price[findInterval(best.bid$time, best.ask$time)]
best.ask$bid <- best.bid$price[findInterval(best.ask$time, best.bid$time)]

(For this to work you may have to transform date/time into a linear measure, e.g. time in seconds since market opening. Also note that findInterval returns 0 for any time earlier than the first record on the other side, so you may need to drop or special-case those leading rows.)
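
For example, one way to do that conversion (a sketch, assuming date and time were read into separate character columns as in the question; %OS parses the fractional seconds):

data$time <- as.numeric(as.POSIXct(paste(data$date, data$time),
                                   format = "%Y/%m/%d %H:%M:%OS"))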

Then it should be easy:

min.spread <- min(c(best.bid$ask - best.bid$price,    # spread at bid updates
                    best.ask$price - best.ask$bid))   # spread = ask - bid in both cases

I'm not sure I understand the end-of-day particularity, but I bet you could just compute the spread at market close and add it to the final min call.


For the weighted average prices, use the same idea, but instead of the two best.bid and best.ask data.frames, start from two weighted.avg.bid and weighted.avg.ask data.frames, as sketched below.
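
A minimal sketch of building those, assuming a size column as described in the question (removal records with an NA price would need to be dropped or handled first):

wavg.side <- function(d) {
  # collapse all distances at one timestamp into a single size-weighted price
  s <- split(d, d$time)
  data.frame(time  = as.numeric(names(s)),
             price = sapply(s, function(x) sum(x$price * x$size) / sum(x$size)))
}
weighted.avg.bid <- wavg.side(subset(data, side == 0))
weighted.avg.ask <- wavg.side(subset(data, side == 1))
# then reuse the findInterval trick above on these two data.frames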