Decompress and read Dukascopy .bi5 tick files

To cut a long story short, I need to open .bi5 files and read their contents. The problem: I have tens of thousands of .bi5 files containing time-series data that I need to decompress and process (read, then dump into pandas).

I ended up installing Python 3 (I normally use 2.7) specifically for the lzma library, after running into compilation nightmares with the lzma back-ports for Python 2.7, but still had no success. The problems are too numerous to list here, and no one reads long questions!

I have included one of the .bi5 files below; if someone could manage to get it into a pandas DataFrame and show me how they did it, that would be ideal.

P.S. The file is only a few KB, so it will download in a second. Thanks very much in advance.

(The file) http://www.filedropper.com/13hticks

There are 5 answers

Answer by ptrj (best answer)

The code below should do the trick. First it opens a file and decompresses it with lzma, then it uses struct to unpack the binary data.

import lzma
import struct
import pandas as pd


def bi5_to_df(filename, fmt):
    chunk_size = struct.calcsize(fmt)        # size of one tick record in bytes
    data = []
    with lzma.open(filename) as f:           # transparently decompresses the LZMA stream
        while True:
            chunk = f.read(chunk_size)       # read one record at a time
            if chunk:
                data.append(struct.unpack(fmt, chunk))
            else:
                break
    df = pd.DataFrame(data)
    return df

The most important thing is to know the right format. I googled around and tried a few guesses, and '>3i2f' (or '>3I2f') works quite well. (It's big-endian: 3 ints, 2 floats. The format you suggest, 'i4f', doesn't produce sensible floats, whether big- or little-endian.) For struct and format syntax see the docs.
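As a quick sanity check, struct.calcsize confirms that this format describes a 20-byte record, i.e. five 4-byte fields per tick:

import struct

struct.calcsize('>3i2f')   # -> 20: 3 x 4-byte int + 2 x 4-byte float per tick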

df = bi5_to_df('13h_ticks.bi5', '>3i2f')
df.head()
Out[177]: 
      0       1       2     3     4
0   210  110218  110216  1.87  1.12
1   362  110219  110216  1.00  5.85
2   875  110220  110217  1.00  1.12
3  1408  110220  110218  1.50  1.00
4  1884  110221  110219  3.94  1.00

Update

To compare the output of bi5_to_df with https://github.com/ninety47/dukascopy, I compiled and ran test_read_bi5 from there. The first lines of its output are:

time, bid, bid_vol, ask, ask_vol
2012-Dec-03 01:00:03.581000, 131.945, 1.5, 131.966, 1.5
2012-Dec-03 01:00:05.142000, 131.943, 1.5, 131.964, 1.5
2012-Dec-03 01:00:05.202000, 131.943, 1.5, 131.964, 2.25
2012-Dec-03 01:00:05.321000, 131.944, 1.5, 131.964, 1.5
2012-Dec-03 01:00:05.441000, 131.944, 1.5, 131.964, 1.5

And bi5_to_df on the same input file gives:

bi5_to_df('01h_ticks.bi5', '>3I2f').head()
Out[295]: 
      0       1       2     3    4
0  3581  131966  131945  1.50  1.5
1  5142  131964  131943  1.50  1.5
2  5202  131964  131943  2.25  1.5
3  5321  131964  131944  1.50  1.5
4  5441  131964  131944  1.50  1.5

So everything seems to be fine (ninety47's code reorders columns).

Also, it's probably more accurate to use '>3I2f' instead of '>3i2f' (i.e. unsigned int instead of int).
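To go from those raw columns to something directly usable, a small post-processing step can rebuild timestamps and prices. The sketch below is one way to do it, under a few assumptions: the column order is the one shown above (ms offset, ask, bid, ask volume, bid volume), hour_start is the date and hour encoded in the file's path, and the point divisor depends on the instrument (1e5 for EURUSD-like pairs, 1e3 for the JPY-quoted file in the comparison above).

import pandas as pd

def normalize_ticks(df, hour_start, point=1e5):
    # df is the output of bi5_to_df(..., '>3I2f'):
    # column 0 = ms since the file's starting hour,
    # columns 1/2 = ask/bid in integer points, columns 3/4 = ask/bid volume
    return pd.DataFrame({
        'time': pd.to_datetime(hour_start) + pd.to_timedelta(df[0], unit='ms'),
        'ask': df[1] / point,
        'bid': df[2] / point,
        'ask_vol': df[3],
        'bid_vol': df[4],
    })

# e.g. for the 01h file above (a JPY-quoted pair, hence point=1e3):
ticks = normalize_ticks(bi5_to_df('01h_ticks.bi5', '>3I2f'), '2012-12-03 01:00', point=1e3)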

Answer by dsapandora

Did you try using numpy to parse the data before transferring it to pandas? It may be a more roundabout solution, but it would let you manipulate and clean the data before doing the analysis in pandas, and the integration between the two is pretty straightforward.
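For what it's worth, here is a minimal sketch of that idea: np.frombuffer with a big-endian structured dtype parses the decompressed bytes in one call, and the result converts straight into a DataFrame. The field names and layout follow the comparison in the accepted answer, and '13h_ticks.bi5' is the file from the question.

import lzma
import numpy as np
import pandas as pd

# one tick record, big-endian: 3 x uint32 (ms offset and two quotes) + 2 x float32 (volumes)
tick_dtype = np.dtype([('ms', '>u4'), ('ask', '>u4'), ('bid', '>u4'),
                       ('ask_vol', '>f4'), ('bid_vol', '>f4')])

with lzma.open('13h_ticks.bi5') as f:
    raw = f.read()

arr = np.frombuffer(raw, dtype=tick_dtype)
# cast to native byte order so pandas handles the columns without complaint
df = pd.DataFrame(arr.astype(tick_dtype.newbyteorder('=')))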

Answer by user3920015

In case someone uses the endpoint https://datafeed.dukascopy.com/datafeed/EURUSD/2022/11/06/BID_candles_min_1.bi5, the format is '>IIIIIf' (big-endian: 5 unsigned ints, 1 float) and the columns are Seconds, O, H, L, C, V.
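A minimal sketch of reading such a candle file with that format, following this answer's column interpretation (the URL and column names are taken from the answer; scaling the O/H/L/C values to prices is instrument-dependent and not applied here):

import lzma
import struct
import requests
import pandas as pd

url = "https://datafeed.dukascopy.com/datafeed/EURUSD/2022/11/06/BID_candles_min_1.bi5"
raw = lzma.decompress(requests.get(url).content)

fmt = '>IIIIIf'                       # big-endian: 5 x uint32 + 1 x float32 = 24 bytes per candle
size = struct.calcsize(fmt)
rows = [struct.unpack(fmt, raw[i:i + size]) for i in range(0, len(raw), size)]
df = pd.DataFrame(rows, columns=['Seconds', 'O', 'H', 'L', 'C', 'V'])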

Answer by bitbang
import requests
import struct
from lzma import LZMADecompressor, FORMAT_AUTO

# for download compressed EURUSD 2020/06/15/10h_ticks.bi5 file
res = requests.get("https://www.dukascopy.com/datafeed/EURUSD/2020/06/15/10h_ticks.bi5", stream=True)
print(res.headers.get('content-type'))

rawdata = res.content

decomp = LZMADecompressor(FORMAT_AUTO, None, None)
decompresseddata = decomp.decompress(rawdata)

firstrow = struct.unpack('!IIIff', decompresseddata[0: 20])
print("firstrow:", firstrow)
# firstrow: (436, 114271, 114268, 0.9399999976158142, 0.75)
# time = 2020/06/15/10h + 1 month (the URL month is zero-based) + 436 milliseconds

secondrow = struct.unpack('!IIIff', decompresseddata[20: 40])
print("secondrow:", secondrow)
# secondrow: (537, 114271, 114267, 4.309999942779541, 2.25)

# time = 2020/06/15/10h + 1 month (the URL month is zero-based) + 537 milliseconds
# ask = 114271 / 100000 = 1.14271
# bid = 114267 / 100000 = 1.14267
# askvolume = 4.31
# bidvolume = 2.25

# note that the month in the URL is zero-based: 00 -> January
# "https://www.dukascopy.com/datafeed/EURUSD/2020/00/15/10h_ticks.bi5" for January
# "https://www.dukascopy.com/datafeed/EURUSD/2020/01/15/10h_ticks.bi5" for February

#  iterating
print(len(decompresseddata), int(len(decompresseddata) / 20))
for i in range(0, int(len(decompresseddata) / 20)):
    print(struct.unpack('!IIIff', decompresseddata[i * 20: (i + 1) * 20]))
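Building on the zero-based month note in the comments above, a small helper for constructing these URLs from an ordinary calendar date might look like this (tick_url is just an illustrative name):

from datetime import datetime

def tick_url(symbol, dt):
    # Dukascopy's URL path uses zero-based months: 00 = January, 11 = December
    return (f"https://www.dukascopy.com/datafeed/{symbol}/"
            f"{dt.year}/{dt.month - 1:02d}/{dt.day:02d}/{dt.hour:02d}h_ticks.bi5")

print(tick_url("EURUSD", datetime(2020, 7, 15, 10)))
# -> https://www.dukascopy.com/datafeed/EURUSD/2020/06/15/10h_ticks.bi5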
Answer by tomas.r

I know I'm re-opening an old thread, but if someone needs to know what the structure of a .bi5 file is, here it is. I hope that even after all these years somebody finds it helpful.

The bi5 structure background

The size of one tick record in a .bi5 file is 5 × 4 bytes. These bytes carry the timestamp, bid and ask quotes, and tick volume for both quote sides. The structure of a bi5 record is the following:

1st 4 bytes → The time part of the timestamp

2nd 4 bytes → Bid

3rd 4 bytes → Ask

4th 4 bytes → Bid Volume

5th 4 bytes → Ask Volume

The timestamp is not a date and time; it is actually a count of milliseconds from the file's starting hour. The data must also be decompressed (LZMA) first, and because the fields are stored big-endian, their byte order has to be reversed when decoding on little-endian machines.
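A minimal sketch of that description: decompress the file, decode each 20-byte big-endian record, and rebuild absolute timestamps by adding the millisecond offset to the file's starting hour. The two quote and two volume fields are left with generic names here, since the bid/ask ordering is labelled differently across the answers above.

import lzma
import struct
from datetime import datetime, timedelta

def read_hour_file(path, hour_start):
    # hour_start is the date and hour encoded in the file's path, e.g. datetime(2012, 12, 3, 1)
    fmt = '>3I2f'                          # 5 x 4-byte big-endian fields per record
    size = struct.calcsize(fmt)
    with lzma.open(path) as f:             # the records are LZMA-compressed on disk
        raw = f.read()
    for i in range(0, len(raw), size):
        ms, quote1, quote2, vol1, vol2 = struct.unpack(fmt, raw[i:i + size])
        # ms is an offset from the file's starting hour, not an absolute time
        yield hour_start + timedelta(milliseconds=ms), quote1, quote2, vol1, vol2

ticks = list(read_hour_file('01h_ticks.bi5', datetime(2012, 12, 3, 1)))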