PyDeequ hasPattern fails with 'NoneType' object has no attribute '_Check'


I'm trying to run the sample code for the pattern check hasPattern() with PyDeequ, and it fails with an exception.

The code:

import pydeequ

from pyspark.sql import SparkSession, Row

spark = (SparkSession
         .builder
         .config("spark.jars.packages", pydeequ.deequ_maven_coord)
         .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
         .getOrCreate())

df = spark.sparkContext.parallelize([
    Row(a="foo", creditCard="5130566665286573", email="foo@example.com", ssn="123-45-6789",
        URL="http://userid@example.com:8080"),
    Row(a="bar", creditCard="4532677117740914", email="bar@example.com", ssn="123456789",
        URL="http://example.com/(something)?after=parens"),
    Row(a="baz", creditCard="3401453245217421", email="foo@baz.com", ssn="000-00-0000",
        URL="http://userid@example.com:8080")]).toDF()

from pydeequ.checks import *
from pydeequ.verification import *

check = Check(spark, CheckLevel.Error, "Integrity checks")

checkResult = VerificationSuite(spark) \
    .onData(df) \
    .addCheck(
    check.hasPattern(column='email',
                     pattern=r".*@baz.com",
                     assertion=lambda x: x == 1 / 3)) \
    .run()

checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
checkResult_df.show()

After running it I receive:

AttributeError: 'NoneType' object has no attribute '_Check'

on line

    check.hasPattern(column='email',
                     pattern=r".*@baz.com",
                     assertion=lambda x: x == 1 / 3)

PyDeequ version: 1.0.1. Python version: 3.7.9.


There are 2 answers

E. Ducateme (accepted answer):

At the moment, the code in the pydeequ repository doesn't actually have the function definition fleshed out. It has a docstring that describes the desired behavior, but no accompanying code to do the actual work.

Without any code to perform the test, the function always returns None (the default return value of a Python function with no return statement).

The expected behavior for the check methods in pydeequ is to return the Check object itself (the self parameter), which is what allows the user to daisy-chain multiple checks in sequence.
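
For illustration, here is a minimal sketch (a toy Check class, not pydeequ's actual code) of how a method whose body is only a docstring produces exactly this kind of error downstream:

class Check:
    def isComplete(self, column):
        # ... build the constraint ...
        return self  # returning self is what makes chaining work

    def hasPattern(self, column, pattern):
        """A docstring alone is a complete function body,
        so calling this method returns None."""

check = Check().isComplete("a")       # a Check instance, as expected
result = check.hasPattern("a", ".*")  # silently returns None
result._Check                         # AttributeError: 'NoneType' object has no attribute '_Check'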

For comparison, here is a snippet of the hasPattern method (not yet implemented; it contains only the docstring) alongside the containsCreditCardNumber method, which appears to be fully implemented.

hasPattern

def hasPattern(self, column, pattern, assertion=None, name=None, hint=None):
    """
    Checks for pattern compliance. Given a column name and a regular expression, defines a
    Check on the average compliance of the column's values to the regular expression.
    :param str column: Column in DataFrame to be checked
    :param Regex pattern: A name that summarizes the current check and the
            metrics for the analysis being done.
    :param lambda assertion: A function with an int or float parameter.
    :param str name: A name for the pattern constraint.
    :param str hint: A hint that states why a constraint could have failed.
    :return: hasPattern self: A Check object that runs the condition on the column.
    """

containsCreditCardNumber

def containsCreditCardNumber(self, column, assertion=None, hint=None):
    """
    Check to run against the compliance of a column against a Credit Card pattern.
    :param str column: Column in DataFrame to be checked. The column is expected to be a string type.
    :param lambda assertion: A function with an int or float parameter.
    :param hint hint: A hint that states why a constraint could have failed.
    :return: containsCreditCardNumber self: A Check object that runs the compliance on the column.
    """
    assertion = (
        ScalaFunction1(self._spark_session.sparkContext._gateway, assertion)
        if assertion
        else getattr(self._Check, "containsCreditCardNumber$default$2")()
    )
    hint = self._jvm.scala.Option.apply(hint)
    self._Check = self._Check.containsCreditCardNumber(column, assertion, hint)
    return self
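
Until hasPattern is available, one possible workaround is to express the pattern as a Spark SQL RLIKE condition via the satisfies check, which does appear to be implemented in pydeequ 1.0.1. A sketch, reusing the DataFrame and assertion from the question:

check = Check(spark, CheckLevel.Error, "Integrity checks")

checkResult = VerificationSuite(spark) \
    .onData(df) \
    .addCheck(
        # satisfies() takes a SQL boolean expression, a constraint name,
        # and the same kind of assertion lambda hasPattern would take
        check.satisfies("email RLIKE '.*@baz.com'",
                        "email matches @baz.com",
                        lambda x: x == 1 / 3)) \
    .run()

VerificationResult.checkResultsAsDataFrame(spark, checkResult).show()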
b-j:

I am still facing the same error, even though, following the link above, it looks like the method has been implemented and merged into master. The implementation is:

def hasPattern(self, column, pattern, assertion=None, name=None, hint=None):
    """
    Checks for pattern compliance. Given a column name and a regular expression, defines a
    Check on the average compliance of the column's values to the regular expression.

    :param str column: Column in DataFrame to be checked
    :param Regex pattern: A name that summarizes the current check and the
            metrics for the analysis being done.
    :param lambda assertion: A function with an int or float parameter.
    :param str name: A name for the pattern constraint.
    :param str hint: A hint that states why a constraint could have failed.
    :return: hasPattern self: A Check object that runs the condition on the column.
    """
    assertion_func = ScalaFunction1(self._spark_session.sparkContext._gateway, assertion) if assertion \
        else getattr(self._Check, "hasPattern$default$2")()
    name = self._jvm.scala.Option.apply(name)
    hint = self._jvm.scala.Option.apply(hint)
    pattern_regex = self._jvm.scala.util.matching.Regex(pattern, None)
    self._Check = self._Check.hasPattern(column, pattern_regex, assertion_func, name, hint)
    return self

However, this isn't included in release 1.1.0; it will have to wait for another release.
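
In the meantime, a possible stopgap is to monkey-patch the merged implementation onto Check at runtime. This is only a sketch: it assumes the internals used above (ScalaFunction1 from pydeequ.scala_utils, plus the _Check, _jvm, and _spark_session attributes) are unchanged in your installed version:

from pydeequ.checks import Check
from pydeequ.scala_utils import ScalaFunction1  # assumed module path

def _hasPattern(self, column, pattern, assertion=None, name=None, hint=None):
    # Body copied verbatim from the merged implementation above
    assertion_func = ScalaFunction1(self._spark_session.sparkContext._gateway, assertion) \
        if assertion else getattr(self._Check, "hasPattern$default$2")()
    name = self._jvm.scala.Option.apply(name)
    hint = self._jvm.scala.Option.apply(hint)
    pattern_regex = self._jvm.scala.util.matching.Regex(pattern, None)
    self._Check = self._Check.hasPattern(column, pattern_regex, assertion_func, name, hint)
    return self

# Attach at class level so every Check instance picks it up
Check.hasPattern = _hasPattern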