Pandas dataframe data validation methods

143 views Asked by At

Currently using a dataframe to store information on data we've collected. Prior to submitting the data, we need to validate the data based off a list of rules. Trying to set up these validations in python, and part of the problem is readability vs. performance.

For example: (Each item is a column in the data frame) Rule: Dir_Through_Lanes must exist where F_System = 1 and Facility_Type in (1;2;6) and RoadEventCollectionMethods Pav_Rep_Method = 2

Two methods of validating the data:

  1. Using an apply
def sjpm201a(self):
        def sjpm201_check(row):
            print(row)
            if row['F_SYSTEM'] == 1 and row['FACILITY_TYPE'] in [1,2,6] and row['PAVE_REP_METHOD'] == 2:
                if row['DIR_THROUGH_LANES'] == np.nan:
                    row['SJPM201'] = False
                    return row
            row['SJPM201a'] = True
            return row
        self.df = self.df.apply(sjpm201_check, axis=1)
  1. Using boolean indexing
    def sjpm201b(self):
        df = self.df
        self.df['SJPM201b'] = ((df['DIR_THROUGH_LANES'].notna()) | ((df['F_SYSTEM'] != 1) | (~df['FACILITY_TYPE'].isin([1,2,6])) | (df['PAVE_REP_METHOD'] != 2)))

The problem is that the first method is easier to read/write/update whereas the second method is a bit harder just due to the grouping of ands/ors especially on longer rules. However the second method runs in about .01 seconds vs the first method taking about 22 seconds to run. This is a problem as we have hundreds of rules that need to be ran.

Is there a way to achieve the same performance as the second method, but using if/else if blocks to filter the data?

1

There are 1 answers

5
Arothas On BEST ANSWER

Instead of checking for (!condition1 or !condition2) in your second example, you could use (condition1 and condition2) as in the first example in separated lines. E.g.:

def sjpm201b(self):
    self.df = self.df[self.df['DIR_THROUGH_LANES'].notna()]
    self.df = self.df[self.df['F_SYSTEM'] == 1]
    self.df = self.df[self.df['FACILITY_TYPE'].isin([1,2,6])]
    self.df = self.df[self.df['PAVE_REP_METHOD'] == 2]

Edit: I am slightly confused. Do you want to retain rows where df['F_SYSTEM']==1 or do you only want to keep everything that is != 1? The above function would only retain values that are within your specification and delete the rest.

Edit2: The following will return a dataframe that will list which rows conform to your specifications but have na-values in the DIR_THROUGH_LANES column.

def sjpm201b(self):
    df = self.df
    df = df[df['F_SYSTEM'] == 1]
    df = df[df['FACILITY_TYPE'].isin([1,2,6])]
    df = df[df['PAVE_REP_METHOD'] == 2]
    df = df[df['DIR_THROUGH_LANES'].isna()]
    return df

In case you want a simple True/False result of your validation you could use:

return df['DIR_THROUGH_LANES'].isna().any()

Edit3: Improve readability by adding linebreaks:

def sjpm201b(self):
    self.df = self.df[
        df['DIR_THROUGH_LANES'].notna()
        | (df['F_SYSTEM'] != 1)
        | (~df['FACILITY_TYPE'].isin([1,2,6]))
        | (df['PAVE_REP_METHOD'] != 2))
    ]