Data Quality Framework in AWS

I am trying to implement a data quality framework for an application that ingests data from various systems (batch, near real time, real time). A few items I want to highlight:

  • The data pipelines vary widely and ingest very high volumes of data. They are built with Spark, Python, EMR clusters, Kafka, and Kinesis streams.
  • Any new system we onboard should be able to plug into the data quality checks with minimal coding, so some sort of metadata-driven framework might help, for example storing the business rules in DynamoDB so that the checks run automatically against the different feeders and any newly created pipeline (see the first sketch after this list).
  • Our tech stack includes AWS, Python, Spark, and Java, so kindly advise on related services and libraries (AWS Glue DataBrew, PyDeequ, Great Expectations, and various Lambda-based event-driven services are some I want to focus on).
  • I am also looking for some sort of audit, balance, and control (ABC) mechanism: auditing the source data, balancing the number of records between two points, and having an automated mechanism to remediate (control) any discrepancies (see the balance sketch at the end of this post).
  • I am looking for testing frameworks for the different data pipelines.
  • For data profiling, kindly advise on tools and libraries; AWS Glue DataBrew and pandas are some I am exploring (see the profiling sketch after this list).
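To make the metadata-driven idea more concrete, here is a minimal sketch of what I have in mind, assuming a hypothetical DynamoDB table `dq_rules` whose items look like `{"dataset": "orders", "column": "order_id", "check_type": "not_null"}`. The table name, attribute names, check types, Spark version, and S3 path are all placeholders; the PyDeequ calls follow its documented usage:

```python
import os
# SPARK_VERSION must be set before importing pydeequ so it can resolve the matching Deequ jar.
os.environ["SPARK_VERSION"] = "3.3"  # placeholder, match your EMR Spark version

import boto3
import pydeequ
from pyspark.sql import SparkSession
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationSuite, VerificationResult

spark = (SparkSession.builder
         .config("spark.jars.packages", pydeequ.deequ_maven_coord)
         .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
         .getOrCreate())

def load_rules(dataset_name):
    """Fetch the business rules for one dataset from the hypothetical dq_rules table."""
    table = boto3.resource("dynamodb").Table("dq_rules")
    items = table.scan()["Items"]  # the rule table is small, so a scan is acceptable here
    return [r for r in items if r["dataset"] == dataset_name]

def build_check(rules):
    """Translate rule metadata into a single PyDeequ Check."""
    check = Check(spark, CheckLevel.Error, "metadata-driven checks")
    for rule in rules:
        if rule["check_type"] == "not_null":
            check = check.isComplete(rule["column"])
        elif rule["check_type"] == "unique":
            check = check.isUnique(rule["column"])
    return check

df = spark.read.parquet("s3://my-bucket/orders/")  # placeholder input
result = (VerificationSuite(spark)
          .onData(df)
          .addCheck(build_check(load_rules("orders")))
          .run())

VerificationResult.checkResultsAsDataFrame(spark, result).show(truncate=False)
```

The appeal of this shape is that onboarding a new feed would mostly mean adding rows to `dq_rules` rather than writing new check code.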
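For a first pass at profiling (before bringing in something like a DataBrew profile job), even pandas summary statistics over a sampled extract might be enough. The path below is a placeholder, and reading Parquet from S3 assumes pyarrow/s3fs are available:

```python
import pandas as pd

# Profile a sampled extract rather than the full feed; the S3 path is a placeholder.
sample = pd.read_parquet("s3://my-bucket/orders/sample/")

print(sample.describe(include="all"))  # basic statistics per column
print(sample.isnull().sum())           # null counts per column
print(sample.nunique())                # distinct value counts per column
```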

I know there won't be one specific solution, so I appreciate any and all ideas. A flow diagram covering audit, balance, and control with an automated data validation and testing mechanism for the data pipelines would be very helpful.
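To show roughly what I mean by the balance step, here is a sketch that compares record counts at two points of a pipeline and writes the outcome to a hypothetical `dq_audit` DynamoDB table, so that a Lambda or similar could react to mismatches. The table name, attributes, run id, and counts are placeholders:

```python
import boto3
from datetime import datetime, timezone

def record_balance(pipeline, run_id, source_count, target_count):
    """Write a balance record; downstream automation can react to status == 'MISMATCH'."""
    status = "MATCH" if source_count == target_count else "MISMATCH"
    boto3.resource("dynamodb").Table("dq_audit").put_item(Item={
        "pipeline": pipeline,
        "run_id": run_id,
        "source_count": source_count,
        "target_count": target_count,
        "status": status,
        "checked_at": datetime.now(timezone.utc).isoformat(),
    })
    return status

# Counts could come from Spark df.count(), Glue job metrics, Kinesis metrics, etc.
if record_balance("orders", "2024-05-01-run-1", 10_000, 9_998) == "MISMATCH":
    print("trigger remediation: replay the missing batch or raise an alert")
```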

Thanks!!!
