Determining the type of error and its place in the text

54 views Asked by At

I have an Excel file with thousands of rows containing information about the fulfilment of a contract. The data is loaded into the system using a template. But sometimes the template is filled out incorrectly and the row with the fulfilment information is filled out with errors.

The correct template looks like this: Object_№_(number)_Serviced_(dd.mm.YYYY)_Fulfilment_of_obligations_under_agr._№_90/11/122_dated_20.10.2010,_VAT_exempt.

Now I need to check all the thousands of contract descriptions and write in an additional column what error is made and at what place (or after which word) it is contained.

For example,

original information text_verification
Object № 1001 Serviced 30.11.2023 Fulfilment of obligations under agr. № 90/11/122 dated 20.10.2010, VAT exempt. OK
Object № 10023__Serviced 30.11.2023 Fulfilment of obligations under agr. №90/11/122 dated 20.10.2010, VAT exempt. double space after 10023, no space after "№" sign
Object № 100221 Serviced 30.11.2023 Fulfilment of obligations under agr. № 90/11/122 dated 20.10.2010, VAT exempt No full stop at the end

So far, I've only been able to determine if the template is correctly populated to output information or not. Using regular expressions.

template = 'Object № \d+ Serviced (0[1-9]|[12][0-9]|3[01])\.(0[1-9]|1[0-2])\.(20\d\d) Fulfilment of obligations under agr. № 90\/11\/122 dated 20.10.2010, VAT exempt.'

new_col = []
for index, row in df.iterrows():
    if re.match(template, row['original information']):
        new_col.append('OK')
    else:
        new_col.append('not OK')

df = df.assign(text_verification=new_col)
original information text_verification
Object № 1001 Serviced 30.11.2023 Fulfilment of obligations under agr. № 90/11/122 dated 20.10.2010, VAT exempt. OK
Object № 10023__Serviced 30.11.2023 Fulfilment of obligations under agr. №90/11/122 dated 20.10.2010, VAT exempt. not OK
Object № 100221 Serviced 30.11.2023 Fulfilment of obligations under agr. № 90/11/122 dated 20.10.2010, VAT exempt not OK

Could you please tell me how to more correctly and concisely determine the type of error in the text, as well as its location?

2

There are 2 answers

0
Jan On

Somewhat elaborating on my own comment: you could write yourself a parser and see those entries as a domain specific language (dsa). I very much like the parsimonious library:

from parsimonious.grammar import Grammar
from parsimonious.nodes import Node, NodeVisitor
from parsimonious.exceptions import ParseError

entries = [
    "Object № 1001 Serviced 30.11.2023 Fulfilment of obligations under agr. № 90/11/122 dated 20.10.2010, VAT exempt.",
    "Object № 10023  Serviced 30.11.2023 Fulfilment of obligations under agr. №90/11/122 dated 20.10.2010, VAT exempt.",
    "Object № 100221 Serviced 30.11.2023 Fulfilment of obligations under agr. № 90/11/122 dated 20.10.2010, VAT exempt"
]

class TemplateVisitor(NodeVisitor):
    grammar = Grammar(
        r"""
        entry           = prefix ws object_no ws serviced ws date ws fulfilment ws obligation_no ws dated ws date vat

        prefix          = "Object №" 
        serviced        = "Serviced"
        fulfilment      = "Fulfilment of obligations under agr. №"
        dated           = "dated"
        vat             = ", VAT exempt."

        object_no       = ~r"\d+"
        date            = ~r"\b\d+\.\d+\.\d+\b"
        obligation_no   = ~r"\b\d+/\d+/\d+\b"
        ws              = ~r"\s"
    """
    )


    def visit_entry(self, node, visited_children):
        """ Makes a dict of the section (as key) and the key/value pairs. """
        return visited_children
    
    def generic_visit(self, node, visited_children):
        """ The generic visit method. """
        return visited_children or node
    


tv = TemplateVisitor()

for entry in entries:
    try:
        ok = tv.parse(entry)
        print("Entry is valid.")
        # everything ok then
    except ParseError as error:
        print("Not ok: ", error)

This would yield

Entry is valid.
Not ok:  Rule 'serviced' didn't match at ' Serviced 30.11.2023' (line 1, column 16).
Not ok:  Rule 'vat' didn't match at ', VAT exempt' (line 1, column 102).

Obviously, you'd need to check for the most common errors beforehand.

0
Georgina Skibinski On

For starters I'd break down your pattern to meaningful components, then you can flag which component is wrong, and further drill down, if you wish.

import re

templates = [
    'Object № \d+\s*',
    'Serviced (0[1-9]|[12][0-9]|3[01])\.(0[1-9]|1[0-2])\.(20\d\d)\s*',
    'Fulfilment of obligations under agr. № 90\/11\/122 dated 20.10.2010, VAT exempt.'
]
template = ''.join(templates)
new_col = []
err_desc_all = []
for index, row in df.iterrows():
    err_desc = ''
    if re.match(template, row['original information']):
        new_col.append('OK')
    else:
        new_col.append('not OK')
        val_text = row['original information']
        for pattern in templates:
            m = re.match(pattern, val_text)
            if m:
                val_text = val_text[m.end(0):]
            else:
                err_desc = f'The following pattern failed: {pattern}.'
                break
    err_desc_all.append(err_desc)
           

df = df.assign(text_verification=new_col)
df = df.assign(text_verification_err_desc=err_desc_all)

Notes:

  1. Above code assumes 0+ spaces as delimiters for components - sheer speculation on my part.

  2. It will only flag first encountered issue - I didn't know, if the order is important - hence to avoid false positives chose the safest option (however it's trivial to adjust).