Using Deequ on AWS Glue

1.1k views Asked by At

I am using Deequ on AWS GLUE, surprisingly when I was to run the hasMaxLength which is listed under Checks for the verificationSuite. I get the following error, can someone help? All other checks are passing/running. It says the check hasMaxLength is not a member of amazon.deequ.checks

  download: s3://stg-dev-ire-library/jars/deequ-1.0.1.jar to ./jar0.jar

   SCRIPT_URL = /tmp/g-6aa13d15270ba0853894d7d6f2d26459f810d2ab- 
    4863242765668262538/script_2021-02-04-08-26-12.scala

   Compilation result: /tmp/g-6aa13d15270ba0853894d7d6f2d26459f810d2ab-4863242765668262538/script_2021-02-04-08-26-12.scala:15: error: object KLLParameters is not a member of package com.amazon.deequ.analyzers import com.amazon.deequ.analyzers.{Analyzer, Histogram, Patterns, State, KLLParameters} ^ /tmp/g-6aa13d15270ba0853894d7d6f2d26459f810d2ab-4863242765668262538/script_2021-02-04-08-26-12.scala:56: error: value hasMaxLength is not a member of com.amazon.deequ.checks.Check possible cause: maybe a semicolon is missing before `value hasMaxLength'? .hasMaxLength("* External Number", _==40) ^ two errors found Compilation failed.

here's the code:

import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import com.amazon.deequ.analyzers.runners.{AnalysisRunner, AnalyzerContext}
import com.amazon.deequ.analyzers.runners.AnalyzerContext.successMetricsAsDataFrame
import com.amazon.deequ.{VerificationSuite, VerificationResult}
import com.amazon.deequ.VerificationResult.checkResultsAsDataFrame
import com.amazon.deequ.checks.{Check, CheckLevel}
import com.amazon.deequ.constraints.{ConstrainableDataTypes}
import org.apache.spark.sql.functions.{length, max}





 object Deequ {


  def main(args: Array[String]) {
  val conf = new SparkConf().setAppName("dq")
  val spark = SparkSession.builder().appName("dq").getOrCreate()

 val dataset = spark.read.option("header",true).option("delimiter",",").csv("s3://ct-ire- 
        fin-stg-data-dev-raw-gib/templates /Contracts_And_Coverages/FPSL- 
   CONTRACTS=VALIDATIONS-v2 - Sheet1.csv") 
   val verificationResult: VerificationResult = { VerificationSuite()
 // data to run the verification on
    .onData(dataset)
 // define a data quality check
  .addCheck(
      Check(CheckLevel.Error, "Template Validations") 



  .hasDataType("* External Number", ConstrainableDataTypes.String)
  .hasMaxLength("* External Number", _==40)
  .isComplete("* External Number")


  .hasDataType("* Business Record Date",ConstrainableDataTypes.Integral )
  .hasMax("* Business Record Date", _ == 8)
  .isComplete("* Business Record Date")
  
  .hasDataType("Organizational Unit (Owner)",ConstrainableDataTypes.String )
  .hasMax("Organizational Unit (Owner)", _ == 10)
  .isComplete("Organizational Unit (Owner)")
  //failing
  .isContainedIn("Organizational Unit (Owner)", Array("50000252","50000256","50000257"))
  
  .hasDataType("Object Status",ConstrainableDataTypes.Integral )
  .hasMax("Object Status", _ == 3)
  .isComplete("Object Status")
  .isContainedIn("Object Status", Array("0","1"))
  
  .hasDataType("Delivery Package",ConstrainableDataTypes.String )
  .hasMax("Delivery Package", _ == 20)
  .isComplete("Delivery Package")
  
  
  
  .hasDataType("Product Code",ConstrainableDataTypes.Integral)
  .hasMax("Product Code", _ == 10)
  .isComplete("Product Code")
  
  
  
  
  .hasDataType("Source System Basic Data",ConstrainableDataTypes.String )
  .hasMax("Source System Basic Data", _ == 10)
  .isComplete("Source System Basic Data")
  //LFST, CLPB, CLCB, CLHR, CCLU
  //.isContainedIn("Source System Basic Data", Array("LSFT", "CLPB","CLCB","CLHR","CCLU"))
  .isContainedIn("Source System Basic Data", Array("LSFT"))
  
  .hasDataType("Production Control",ConstrainableDataTypes.String )
  .hasMax("Production Control", _ == 20)
  .isComplete("Production Control")
  .isContainedIn("Production Control", Array("Z_UL_PR1"))

  
  .hasDataType("Date of Start of Term",ConstrainableDataTypes.Integral )
  .hasMax("Date of Start of Term", _ == 8)
  .isComplete("Date of Start of Term")
  
  .hasDataType("Date of End of Term",ConstrainableDataTypes.Integral )
  .hasMax("Date of End of Term", _ == 8)
  .isComplete("Date of End of Term")
  
  .hasDataType("Legal Entity",ConstrainableDataTypes.Integral )
  .hasMax("Legal Entity", _ == 10)
  .isComplete("Legal Entity")
  //2001
  .isContainedIn("Legal Entity", Array("2001"))
  
  .hasDataType("IFRS17 Category",ConstrainableDataTypes.Integral )
  .hasMax("IFRS17 Category", _ == 80)
  .isComplete("IFRS17 Category")
  //To Change 
  .isContainedIn("IFRS17 Category", Array("1","2","3","4"))
  
  .hasDataType("IFRS17 Portfolio",ConstrainableDataTypes.String )
  .hasMax("IFRS17 Portfolio", _ == 40)
  .isComplete("IFRS17 Portfolio")


 )
  


   // compute metrics and verify check conditions
   .run()
     }
      //val metrics1 = successMetricsAsDataFrame(spark, analysisResult1)
        val resultDataFrame = checkResultsAsDataFrame(spark, verificationResult)
        resultDataFrame.write.mode("overwrite").parquet("s3://ct-ire-fin-stg-data-dev-raw- 
        gib/template_validations/template_validations_lifestyle/")
  
       }
      }
1

There are 1 answers

2
tomwollnik On

Your code looks good, I don't see an obvious reason for the error. I suggest you verify the following:

  • You should be using the latest version of deequ
  • You need to use the deequ release that matches your spark version

You might also want to check where the other import error is coming from (error: object KLLParameters is not a member of package com.amazon.deequ.analyzers import com.amazon.deequ.analyzers.{Analyzer, Histogram, Patterns, State, KLLParameters}). Solving this error might also help you with the hasMaxLength issue.