Big data with very fast access

1.2k views Asked by At

I am facing to a problem: database for process plants. There are up to 50,000 sensors at sampling rate of 50 ms. All measured values need to be stored at least 3 years and must support real-time queries (i.e. users can see historical data with delay less than 1 second). I recently read an article about Time-series Database, many options are on hand: OpenTSDB, KairosDB, InfluxDB, ...

I am confused which one would be proper for the purpose? Any one know about this please help me!

UPDATE 15.06.25

Today I run a test based on OpenTSDB. I used Virtual Box to create a cluster of 3 CentOS x64 VMs (1 master, 2 slaves). The host configuration is 8 GB RAM, core i5. The master VM configuration is 3 GB RAM, and the slaves configuration is 1.5 GB RAM. I write a python program to send data to OpenTSDB as below:

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("192.168.10.55", 4242))

start_time = time.time()
start_epoch = 1434192418;

for x in range(0, 1000000):
    curr_epoch = start_epoch + x

    tag1 = "put TAG_1 %d 12.9 stt=good\n" % (curr_epoch)
    tag2 = "put TAG_2 %d 12.9 stt=good\n" % (curr_epoch)
    tag3 = "put TAG_3 %d 12.9 stt=good\n" % (curr_epoch)
    tag4 = "put TAG_4 %d 12.9 stt=good\n" % (curr_epoch)
    tag5 = "put TAG_5 %d 12.9 stt=good\n" % (curr_epoch)
    tag6 = "put TAG_6 %d 12.9 stt=good\n" % (curr_epoch)
    tag7 = "put TAG_7 %d 12.9 stt=good\n" % (curr_epoch)
    tag8 = "put TAG_8 %d 12.9 stt=good\n" % (curr_epoch)
    tag9 = "put TAG_9 %d 12.9 stt=good\n" % (curr_epoch)
    tag10 = "put TAG_10 %d 12.9 stt=good\n" % (curr_epoch)
    str = tag1 + tag2 + tag3 + tag4 + tag5 + tag6 + tag7 + tag8 + tag9 + tag10

    s.send(str)

print("--- %s seconds ---" % (time.time() - start_time))

I run the python on host, and the work completes after ~220 seconds. So, I got an avg. speed of ~45000 records per second.

UPDATE 15.06.29

This time I used only 1 VM (5 GB RAM, 3 cores, CentOS x64, pseudo-distributed Hadoop). I run 2 python processes on Windows 7 host to send 2 halves of data to the OpenTSDB. The avg. speed of putting data was ~100,000 records per second.

4

There are 4 answers

4
Thefonkleurrr On

To handle the million writes per seconds, you will need to put some serious engineering in place.

Not all databases will be able to store that amount of data in a compact form.

For example ATSD uses 5 to 10 bytes per sample (float data type), depending on observed variance.

There is a type of distributed (clustered) databases built on HBase that will be able to handle this kind of load.

For example, you can try looking at openTSDB and ATSD.

Update 1.

We have run the following test for your particular use case:

30.000 analog sensors writing float type data, resulting in 540.000.000 records

20.000 digital sensors writing short type data (zeros and ones), resulting in 552.000.000 records

The data took up 3.68 gigabytes. The compression was lossless.

Resulting in an average 3.37 bytes per record.

This was a storage efficiency test.

Full disclosure, I work for the company that develops ATSD.

1
beckettsean On

InfluxDB won't handle a sustained million writes per second right now, but that's within the performance target for later this year. The bigger challenge I see is the sheer volume of data you want to store. If you need to keep three years' worth at full resolution with no downsampling, that's hundreds of terabytes of data. If that's not all on SSD, the query performance will not be good. If that is all on SSD, that's an extremely expensive database. Also, with that much raw data, it would be very easy to craft a query that explodes the RAM no matter how much is installed.

I would say check back with the InfluxDB team in 8-12 weeks and we might have a better idea how to handle your problem. My advice, though, is to find a way to split that up. If you really are sampling 50k machines at 50ms intervals, it's a huge amount of data, network traffic, etc.

0
Nikita Shamgunov On

MemSQL can handle 1M writes a second. I would recommend to use Kafka and MemSQL column stores for that. Depending on the type of queries subsecond response is very possible.

1
Loic On

Very interresting use case.

I cannot speak for others since I only strongly evaluated and used KairosDB & Cassandra but I can speak of my own experience.

KairosDB + Cassandra would stand the throughput but for 1M writes per second you would need a cluster with several front-ends (I'd recommend at least 10 nodes, but you have to test) and back-end servers (they can be colocated if needed).

Anyway with a throughput of 200 samples per second for each sensor... Retrieving historical data could be a problem for long queries (1s for historical data is interesting but you have to determine the amount of samples of the query duration).

As far as I can tell storage efficiency will not be as goot as ATSD (probably twice the size, which is already good).

What I personally like with KairosDB & Cassandra and why we adopted it: it's simple, really distributed, KairosDB doesn't mess with the data, the design is elegant and doesn't involve complex systems, Cassandra manages the data and kairosDB offers the TSDB services, and performances and efficiency are not bad at all... Moreover Cassandra is much easier to manage as a cluster than its counterparts.