Spark-Compatible Data Quality Framework for Narrow Data


I'm trying to find an appropriate data quality framework for very large amounts of time series data in a narrow format.

Imagine billions of rows of data that look roughly like this:

    Sensor  Timestamp  Value
    A       12251      12
    B       12262      "A"
    A       12261      13
    A       12271      13
    C       12273      5.4545

There are hundreds of thousands of sensors, but for each timestamp only a very small percentage send values.

I'm building Data Quality Monitoring for this data that checks certain expectations about the values (e.g. whether a value falls within the expected range for a given sensor; there are tens of thousands of different expectations). Due to the size of the data and the existing infrastructure, the solution has to run on Spark. I would like to build it on an (ideally open-source) data quality framework, but I cannot find anything appropriate.

I've looked into Great Expectations and Deequ, but these fundamentally seem to be built for "wide" data, where expectations are defined per column. I could theoretically reshape (pivot) my data into that format, but it would be a very expensive operation and would result in an extremely sparse table that is awkward to work with (or it would require sampling over time, and thus a loss of information).
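
To illustrate what I mean, here is a minimal sketch of the kind of per-sensor range check I want to run directly on the narrow data (plain PySpark; the rule table and column names are made up for illustration):

    # Minimal sketch in plain PySpark; the rule table and names are illustrative.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Narrow readings: one row per (sensor, timestamp) observation.
    readings = spark.createDataFrame(
        [("A", 12251, "12"), ("B", 12262, "A"), ("A", 12261, "13")],
        ["sensor", "timestamp", "value"],
    )

    # One expectation per sensor, here an allowed numeric range.
    rules = spark.createDataFrame(
        [("A", 0.0, 100.0), ("B", 0.0, 1.0)],
        ["sensor", "min_value", "max_value"],
    )

    # Join the (small) rule table to the narrow data and flag violations
    # row by row, without ever pivoting to a wide format.
    checked = (
        readings.join(F.broadcast(rules), "sensor", "left")
        .withColumn("value_num", F.col("value").cast("double"))
        .withColumn(
            "in_range",
            F.coalesce(
                F.col("value_num").between(F.col("min_value"), F.col("max_value")),
                F.lit(False),  # non-numeric or missing values count as failures
            ),
        )
    )

    # Aggregate to a per-sensor pass rate for monitoring.
    checked.groupBy("sensor").agg(
        F.avg(F.col("in_range").cast("int")).alias("pass_rate")
    ).show()

The real version would need tens of thousands of such rules and more expectation types, which is why I'd rather build on an existing framework than hand-roll all of this.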

Does anyone know of an existing (Spark-compatible) framework for such time series data in narrow format? Or can anyone point me to best practices for applying Deequ/Great Expectations in such a setting?


1 Answer

Answered by Canimus:

Have you tried github.com/canimus/cuallee? It is an open-source framework that supports the Observation API, which makes testing on billions of records very fast and less resource-hungry than pydeequ. It is intuitive and easy to use.
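
For example, a couple of checks on a Spark DataFrame could look roughly like this (class and method names are taken from cuallee's README, so double-check them against the current docs):

    # Rough sketch based on cuallee's README; verify the API against the docs.
    from cuallee import Check, CheckLevel

    # `df` is assumed to be the narrow Spark DataFrame with columns
    # sensor, timestamp and value.
    check = Check(CheckLevel.WARNING, "sensor_value_checks")

    results = (
        check
        .is_complete("value")                     # no missing readings
        .is_between("timestamp", (12000, 13000))  # timestamps in an expected window
        .validate(df)                             # returns a DataFrame of results
    )
    results.show()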