I am facing a problem: choosing a database for process plants. There are up to 50,000 sensors, each sampled every 50 ms. All measured values need to be stored for at least 3 years, and the database must support real-time queries (i.e. users can view historical data with a delay of less than 1 second). I recently read an article about time-series databases, and there are many options: OpenTSDB, KairosDB, InfluxDB, ...
Which one would be suitable for this purpose? If anyone has experience with this, please help!
UPDATE 15.06.25
Today I ran a test with OpenTSDB. I used VirtualBox to create a cluster of 3 CentOS x64 VMs (1 master, 2 slaves). The host has 8 GB RAM and a Core i5 CPU. The master VM is configured with 3 GB RAM, and the slave VMs with 1.5 GB RAM. I wrote a Python program to send data to OpenTSDB, as below:
    import socket
    import time

    # Connect to the OpenTSDB listener.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.connect(("192.168.10.55", 4242))
    start_time = time.time()
    start_epoch = 1434192418
    for x in range(0, 1000000):
        curr_epoch = start_epoch + x
        # One "put" line per tag (TAG_1 .. TAG_10), all with the same timestamp.
        lines = ["put TAG_%d %d 12.9 stt=good\n" % (i, curr_epoch) for i in range(1, 11)]
        # sendall() keeps writing until the whole buffer is sent; send() may send only part of it.
        s.sendall("".join(lines).encode("ascii"))
    print("--- %s seconds ---" % (time.time() - start_time))
    s.close()
I ran the Python program on the host, and it completed after ~220 seconds, so I got an average speed of ~45,000 records per second.
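As a quick sanity check on that figure, here is a small sketch that recomputes the measured rate from the numbers above and compares it with the rate implied by 50,000 sensors at 50 ms. The variable names are only illustrative.

```python
# Figures from the test: 1,000,000 loop iterations, 10 "put" lines each, ~220 s.
iterations = 1_000_000
tags_per_iteration = 10
elapsed_seconds = 220.0

records_sent = iterations * tags_per_iteration
measured_rate = records_sent / elapsed_seconds

# Target from the question: 50,000 sensors, one sample every 50 ms.
sensors = 50_000
samples_per_sensor_per_second = 1000 // 50  # 20 samples per second
required_rate = sensors * samples_per_sensor_per_second

print("measured: %.0f records/s" % measured_rate)
print("required: %d records/s" % required_rate)
print("shortfall factor: %.1fx" % (required_rate / measured_rate))
```

So this small VM setup is roughly a factor of 20 below the target rate of 1,000,000 records per second.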
UPDATE 15.06.29
This time I used only 1 VM (5 GB RAM, 3 cores, CentOS x64, pseudo-distributed Hadoop). I ran 2 Python processes on the Windows 7 host, each sending one half of the data to OpenTSDB. The average write speed was ~100,000 records per second.
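A minimal sketch of how the data could be divided between sender processes, assuming the split is by contiguous timestamp range (the exact split used in this test is not shown). The helper names `split_range`, `build_batch` and `send_chunk` are hypothetical, and the "put" line format is the same as in the first test.

```python
import socket


def split_range(start, count, workers):
    """Split `count` consecutive epochs into `workers` contiguous (start, count) chunks."""
    base, extra = divmod(count, workers)
    chunks = []
    cursor = start
    for i in range(workers):
        # The first `extra` workers take one additional epoch each.
        size = base + (1 if i < extra else 0)
        chunks.append((cursor, size))
        cursor += size
    return chunks


def build_batch(epoch, tag_count=10):
    """Build the "put" lines for one timestamp, one line per tag."""
    lines = ["put TAG_%d %d 12.9 stt=good\n" % (i, epoch) for i in range(1, tag_count + 1)]
    return "".join(lines).encode("ascii")


def send_chunk(host, port, start, count):
    """Send one chunk of epochs over its own TCP connection."""
    with socket.create_connection((host, port)) as s:
        for epoch in range(start, start + count):
            s.sendall(build_batch(epoch))
```

Each chunk returned by `split_range(1434192418, 1000000, 2)` could then be passed to `send_chunk` in a separate process, for example via `multiprocessing.Process`, so that each process holds its own connection.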
To handle a million writes per second, you will need to put some serious engineering in place.
Not all databases can store that amount of data in a compact form.
For example ATSD uses 5 to 10 bytes per sample (float data type), depending on observed variance.
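To put that range in perspective, here is a rough back-of-the-envelope estimate of the raw storage needed for 3 years. It assumes 365-day years, decimal terabytes, and a constant 1,000,000 samples per second; replication and index overhead are not included.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: 365-day years

samples_per_second = 50_000 * (1000 // 50)  # 50,000 sensors at 20 samples/s each
total_samples = samples_per_second * 3 * SECONDS_PER_YEAR

for bytes_per_sample in (5, 10):
    terabytes = total_samples * bytes_per_sample / 1e12  # decimal TB
    print("%2d bytes/sample -> %.0f TB over 3 years" % (bytes_per_sample, terabytes))
```

That works out to roughly 470 to 950 TB before any replication, so bytes per sample matters a great deal at this scale.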
Distributed (clustered) databases built on HBase are designed to handle this kind of load.
For example, you can look at OpenTSDB and ATSD.
Update 1.
We have run the following test for your particular use case:
30,000 analog sensors writing float type data, resulting in 540,000,000 records
20,000 digital sensors writing short type data (zeros and ones), resulting in 552,000,000 records
The data took up 3.68 gigabytes. The compression was lossless.
This results in an average of 3.37 bytes per record.
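The average can be reproduced from the figures above with a short calculation, assuming "gigabytes" means decimal gigabytes (10^9 bytes):

```python
analog_records = 540_000_000
digital_records = 552_000_000
total_records = analog_records + digital_records

stored_bytes = 3.68e9  # assumption: decimal gigabytes (10**9 bytes)
bytes_per_record = stored_bytes / total_records

print("%d records, %.2f bytes per record" % (total_records, bytes_per_record))
```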
This was a storage efficiency test.
Full disclosure: I work for the company that develops ATSD.