Is it possible to run Deequ anomaly detection on multiple partitions separately in parallel?

Question

1k views · Asked by Sifang on 01 February 2021 at 21:03

We have Spark DataFrames partitioned on multiple columns. For example, we have a partner column that can be Google, Facebook, or Bing, and a channel column that can be PLA or Text. We would like to run anomaly detection on Google-PLA, Google-Text, Facebook-Text, etc. separately, because they follow different patterns. So far I've figured out that I can configure AnomalyCheckConfig with a different filter description for each combination and use the same filter when checking the result. But I first need to filter the data for each partition combination and then run the anomaly check with its associated filter, one by one in serial. Is there a way to run them in parallel? Can I call addAnomalyCheck() multiple times with different AnomalyCheckConfigs on the whole DataFrame and get the verification result in one run?

There is 1 answer

**Philipp** · Answer 1 · 2021-02-09T14:11:55+00:00

If the partitioning column is present in your Spark DataFrame, you can add multiple anomaly checks to a single VerificationSuite by specifying a where condition on each quality metric you want to run anomaly detection on. For example, to compute the Completeness of a column c1 for a single partition only, you can pass where = Some("partition = 'GOOGL'") to the analyzer.

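```scala
import com.amazon.deequ.{AnomalyCheckConfig, VerificationSuite}
import com.amazon.deequ.analyzers.Completeness
import com.amazon.deequ.anomalydetection.AbsoluteChangeStrategy
import com.amazon.deequ.checks.CheckLevel

val verificationResults = VerificationSuite()
  .onData(df)
  ...
  // First check: Completeness of c1, restricted to rows where c0 <= 5
  .addAnomalyCheck(
    AbsoluteChangeStrategy(Some(-17.0), Some(7.0)),
    Completeness("c1", where = Some("c0 <= 5")),
    Some(AnomalyCheckConfig(CheckLevel.Error, "First Anomaly check",
      Map.empty, Some(0), Some(11)))
  )
  // Second check: same metric and strategy, restricted to rows where c0 > 5
  .addAnomalyCheck(
    AbsoluteChangeStrategy(Some(-17.0), Some(7.0)),
    Completeness("c1", where = Some("c0 > 5")),
    Some(AnomalyCheckConfig(CheckLevel.Error, "Second Anomaly check",
      Map.empty, Some(0), Some(11)))
  )
  .run()
```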
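
For the partner/channel case from the question, the same idea can probably be generated in a loop rather than written out by hand. The following is only a sketch under assumptions: the column names partner and channel and their values are taken from the question, c1 and the AbsoluteChangeStrategy bounds are copied from the snippet above, and the object name, the helper names (segmentFilters, run, statusBySegment), and the repository and key parameters are hypothetical. It also assumes a metrics repository is attached with useRepository before addAnomalyCheck is called, since anomaly checks compare against previously stored metrics.

```scala
import com.amazon.deequ.{AnomalyCheckConfig, VerificationResult, VerificationSuite}
import com.amazon.deequ.analyzers.Completeness
import com.amazon.deequ.anomalydetection.AbsoluteChangeStrategy
import com.amazon.deequ.checks.{CheckLevel, CheckStatus}
import com.amazon.deequ.repository.{MetricsRepository, ResultKey}
import org.apache.spark.sql.DataFrame

object PerSegmentAnomalyChecks {

  // Build one (label, SQL filter) pair per partner/channel combination.
  // Assumption: values contain no single quotes and match the casing in the data.
  def segmentFilters(partners: Seq[String], channels: Seq[String]): Seq[(String, String)] =
    for (p <- partners; c <- channels)
      yield (s"$p-$c", s"partner = '$p' AND channel = '$c'")

  // Register one anomaly check per segment on the whole DataFrame, then run once.
  def run(df: DataFrame, repository: MetricsRepository, key: ResultKey): VerificationResult = {
    val base = VerificationSuite()
      .onData(df)
      .useRepository(repository) // history the anomaly strategy compares against
      .saveOrAppendResult(key)   // store this run's metrics for future runs

    val segments = segmentFilters(Seq("Google", "Facebook", "Bing"), Seq("PLA", "Text"))

    segments
      .foldLeft(base) { case (builder, (label, filter)) =>
        builder.addAnomalyCheck(
          AbsoluteChangeStrategy(Some(-17.0), Some(7.0)), // bounds copied from the answer
          Completeness("c1", where = Some(filter)),       // metric restricted to one segment
          Some(AnomalyCheckConfig(CheckLevel.Error, s"Completeness anomaly: $label"))
        )
      }
      .run()
  }

  // Map each check description (which carries the segment label) to its status.
  def statusBySegment(result: VerificationResult): Map[String, CheckStatus.Value] =
    result.checkResults.map { case (check, checkResult) =>
      check.description -> checkResult.status
    }
}
```

Because each check carries its segment label in the description, the per-segment outcome can be looked up from the single VerificationResult instead of filtering the DataFrame six times. As far as I understand, Deequ tries to compute scan-shareable metrics such as Completeness together in a shared pass over the data, so this should avoid the serial filter-then-check loop, but that is worth verifying in the Spark UI for your version.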
