I'm trying to find an appropriate data quality framework for very large amounts of time series data in a narrow format.
Imagine billions of rows of data that look roughly like this:
Sensor | Timestamp | Value |
---|---|---|
A | 12251 | 12 |
B | 12262 | "A" |
A | 12261 | 13 |
A | 12271 | 13 |
C | 12273 | 5.4545 |
There are hundreds of thousands of sensors, but for each timestamp only a very small percentage send values.
I'm building Data Quality Monitoring for this data that checks a set of expectations about the values (e.g. whether the value falls within the expected range for a given sensor; there are tens of thousands of such expectations). Due to the size of the data and the existing infrastructure, the solution has to run on Spark. I would like to build it on top of an (ideally open-source) data quality framework, but I cannot find anything appropriate.
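To make the check semantics concrete, this is roughly what a single range expectation looks like if I write it directly in PySpark. The `expectations` DataFrame (per-sensor `min_value`/`max_value`) is just illustrative of how the expectations are stored:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Narrow readings: one row per (sensor, timestamp, value); values arrive as strings
readings = spark.createDataFrame(
    [("A", 12251, "12"), ("B", 12262, "A"), ("A", 12261, "13"), ("C", 12273, "5.4545")],
    ["sensor", "timestamp", "value"],
)

# Illustrative per-sensor expectation table: allowed numeric range per sensor
expectations = spark.createDataFrame(
    [("A", 0.0, 100.0), ("B", 0.0, 1.0), ("C", 0.0, 10.0)],
    ["sensor", "min_value", "max_value"],
)

# Join the expectations onto the readings and flag out-of-range (or non-numeric) values
checked = (
    readings.join(expectations, "sensor", "left")
    .withColumn("value_num", F.col("value").cast("double"))
    .withColumn(
        "in_range",
        F.col("value_num").isNotNull()
        & F.col("value_num").between(F.col("min_value"), F.col("max_value")),
    )
)

# Pass rate per sensor
checked.groupBy("sensor").agg(
    F.avg(F.col("in_range").cast("int")).alias("pass_rate")
).show()
```

I can of course keep writing checks by hand like this, but with tens of thousands of expectations I'd rather use a framework that manages definitions, execution and reporting.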
I've looked into Great Expectations and Deequ, but these seem to be fundamentally built for "wide" data, where expectations are defined per column. I could theoretically reshape (pivot) my data into that format, but it would be a very expensive operation and would result in an extremely sparse table that is awkward to work with (or would require sampling along the time axis, and thus a loss of information).
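For reference, the pivot I mean is something like this (continuing from the `readings` DataFrame above):

```python
from pyspark.sql import functions as F

# One output column per distinct sensor: with hundreds of thousands of sensors
# this is an expensive shuffle and yields a table that is almost entirely null.
# (If I remember correctly, Spark also caps the number of pivot columns via
# spark.sql.pivotMaxValues, which defaults to 10000.)
wide = (
    readings.groupBy("timestamp")
    .pivot("sensor")
    .agg(F.first("value"))
)
```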
Does anyone know of an existing (Spark-compatible) framework for time series data in this narrow format? Or can anyone point me to best practices for applying Deequ/Great Expectations in such a setting?
Have you tried github.com/canimus/cuallee?
It is an open-source framework that uses Spark's Observation API, which makes testing on billions of records fast and less resource-hungry than pydeequ. It is intuitive and easy to use.
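A rough sketch of what a check looks like (written from memory of the README, so please verify the exact method names and signatures against the cuallee docs):

```python
from cuallee import Check, CheckLevel

# df is your narrow Spark DataFrame with columns sensor / timestamp / value.
# cuallee collects its metrics via Spark's Observation API, so the checks are
# evaluated in a single pass over the data.
check = Check(CheckLevel.WARNING, "sensor_values")

results = (
    check
    .is_complete("value")                     # no nulls in the value column
    .is_between("timestamp", (12000, 13000))  # range check; confirm this method name in the docs
    .validate(df)                             # returns a DataFrame with one row per check
)

results.show()
```

For your per-sensor expectations you could group or filter the DataFrame by sensor and attach the matching checks, since checks are defined against whatever DataFrame you pass to `validate`.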