Is it possible to run Deequ anomaly detection on multiple partitions separately in parallel

1k views

We have Spark DataFrames partitioned on multiple columns. For example, we have a partner column that can be Google, Facebook, or Bing, and a channel column that can be PLA or Text. We would like to run anomaly detection on Google-PLA, Google-Text, Facebook-Text, etc. separately, because they follow different patterns. So far I've figured out that I can configure each AnomalyCheckConfig with a different filter description and use the same filter when checking the result. But first I need to filter the data for each partition combination and then run the anomaly test with its associated filter, one combination at a time in serial. Is there a way to run them in parallel? Can I call addAnomalyCheck() with different AnomalyCheckConfigs multiple times on the whole DataFrame and get the verification result in one run?
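For context, the serial approach described above can be sketched roughly like this. The partner/channel values are the ones from the example; the actual Spark filtering and Deequ check calls are only indicated in comments, since they depend on the job's setup:

```scala
// Enumerate every partner/channel combination, then filter the DataFrame
// to each slice and run its anomaly check, one combination at a time.
val partners = Seq("Google", "Facebook", "Bing")
val channels = Seq("PLA", "Text")

val combos = for {
  partner <- partners
  channel <- channels
} yield (partner, channel)

combos.foreach { case (partner, channel) =>
  // In the real job (elided here):
  //   val slice = df.filter(col("partner") === partner && col("channel") === channel)
  //   ... run the anomaly check on `slice` with filter description s"$partner-$channel"
  println(s"checking partition $partner-$channel")
}
```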

1 Answer

Philipp

If you have the partitioning columns in your Spark DataFrame, you can instantiate multiple anomaly checks in a single VerificationSuite by specifying where conditions on the quality metric you want to run anomaly detection on. Assuming you want to compute the Completeness of a column c1, you can restrict the check to one partition with where = Some("partition = 'GOOGL'"), for example.

import com.amazon.deequ.{AnomalyCheckConfig, VerificationSuite}
import com.amazon.deequ.analyzers.Completeness
import com.amazon.deequ.anomalydetection.AbsoluteChangeStrategy
import com.amazon.deequ.checks.CheckLevel

val verificationResults = VerificationSuite()
  .onData(df)
  ...  // anomaly checks need a metrics repository, e.g. via .useRepository(...)
  // First partition: rows with c0 <= 5
  .addAnomalyCheck(
    AbsoluteChangeStrategy(Some(-17.0), Some(7.0)),
    Completeness("c1", where = Some("c0 <= 5")),
    Some(AnomalyCheckConfig(CheckLevel.Error, "First Anomaly check",
      Map.empty, Some(0), Some(11)))
  )
  // Second partition: rows with c0 > 5
  .addAnomalyCheck(
    AbsoluteChangeStrategy(Some(-17.0), Some(7.0)),
    Completeness("c1", where = Some("c0 > 5")),
    Some(AnomalyCheckConfig(CheckLevel.Error, "Second Anomaly check",
      Map.empty, Some(0), Some(11)))
  )
  .run()