I am creating a system which polls devices for data on varying metrics such as CPU utilisation, disk utilisation, temperature etc. at (probably) 5 minute intervals using SNMP. The ultimate goal is to provide visualisations to a user of the system in the form of time-series graphs.
I have looked at using RRDTool in the past, but rejected it as storing the captured data indefinitely is important to my project, and I want higher level and more flexible access to the captured data. So my question is really:
What is better, a relational database (such as MySQL or PostgreSQL) or a non-relational or NoSQL database (such as MongoDB or Redis) with regard to performance when querying data for graphing.
Relational
Given a relational database, I would use a data_instances
table, in which would be stored every instance of data captured for every metric being measured for all devices, with the following fields:
Fields: id
fk_to_device
fk_to_metric
metric_value
timestamp
When I want to draw a graph for a particular metric on a particular device, I must query this singular table filtering out the other devices, and the other metrics being analysed for this device:
SELECT metric_value, timestamp FROM data_instances
WHERE fk_to_device=1 AND fk_to_metric=2
The number of rows in this table would be:
d * m_d * f * t
where d
is the number of devices, m_d
is the accumulative number of metrics being recorded for all devices, f
is the frequency at which data is polled for and t
is the total amount of time the system has been collecting data.
For a user recording 10 metrics for 3 devices every 5 minutes for a year, we would have just under 5 million records.
Indexes
Without indexes on fk_to_device
and fk_to_metric
scanning this continuously expanding table would take too much time. So indexing the aforementioned fields and also timestamp
(for creating graphs with localised periods) is a requirement.
Non-Relational (NoSQL)
MongoDB has the concept of a collection, unlike tables these can be created programmatically without setup. With these I could partition the storage of data for each device, or even each metric recorded for each device.
I have no experience with NoSQL and do not know if they provide any query performance enhancing features such as indexing, however the previous paragraph proposes doing most of the traditional relational query work in the structure by which the data is stored under NoSQL.
Undecided
Would a relational solution with correct indexing reduce to a crawl within the year? Or does the collection based structure of NoSQL approaches (which matches my mental model of the stored data) provide a noticeable benefit?
Definitely Relational. Unlimited flexibility and expansion.
Two corrections, both in concept and application, followed by an elevation.
Correction
It is not "filtering out the un-needed data"; it is selecting only the needed data. Yes, of course, if you have an Index to support the columns identified in the WHERE clause, it is very fast, and the query does not depend on the size of the table (grabbing 1,000 rows from a 16 billion row table is instantaneous).
Your table has one serious impediment. Given your description, the actual PK is (Device, Metric, DateTime). (Please don't call it TimeStamp, that means something else, but that is a minor issue.) The uniqueness of the row is identified by:
The
Id
column does nothing, it is totally and completely redundant.Id
column is never a Key (duplicate rows, which are prohibited in a Relational database, must be prevented by other means).The
Id
column requires an additional Index, which obviously impedes the speed ofINSERT/DELETE
, and adds to the disk space used.You can get rid of it. Please.
Elevation
Now that you have removed the impediment, you may not have recognised it, but your table is in Sixth Normal Form. Very high speed, with just one Index on the PK. For understanding, read this answer from the What is Sixth Normal Form ? heading onwards.
(I have one index only, not three; on the Non-SQLs you may need three indices).
I have the exact same table (without the
Id
"key", of course). I have an additional columnServer
. I support multiple customers remotely.(Server, Device, Metric, DateTime)
The table can be used to Pivot the data (ie.
Devices
across the top andMetrics
down the side, or pivoted) using exactly the same SQL code (yes, switch the cells). I use the table to erect an unlimited variety of graphs and charts for customers re their server performance.Monitor Statistics Data Model.
(Too large for inline; some browsers cannot load inline; click the link. Also that is the obsolete demo version, for obvious reasons, I cannot show you commercial product DM.)
It allows me to produce Charts Like This, six keystrokes after receiving a raw monitoring stats file from the customer, using a single SELECT command. Notice the mix-and-match; OS and server on the same chart; a variety of Pivots. Of course, there is no limit to the number of stats matrices, and thus the charts. (Used with the customer's kind permission.)
Readers who are unfamiliar with the Standard for Modelling Relational Databases may find the IDEF1X Notation helpful.
One More Thing
Last but not least, SQL is a IEC/ISO/ANSI Standard. The freeware is actually Non-SQL; it is fraudulent to use the term SQL if they do not provide the Standard. They may provide "extras", but they are absent the basics.