Text parsing - date recogniser


Does anyone know if there's a Python text parser that recognises embedded dates? For instance, given a sentence

"bla bla bla bla 12 Jan 14 bla bla bla 01/04/15 bla bla bla"

the parser could pick out the two date occurrences. I know of some Java tools, but are there Python ones? Would NLTK be overkill?

Thanks

1 answer

Answer by fferri

Here is an attempt to nondeterministically (read: exhaustively) solve the problem of finding where the dates are in the tokenized text. It enumerates all ways of partitioning the sentence (given as a list of tokens) into chunks of between minps and maxps tokens.

Each partitioning is fed to the parser, which outputs a list of parsed dates together with the token range each one was parsed from.

Each parser output is scored by the sum of the squared token-range lengths, so as to prefer one date parsed from 4 tokens (score 4² = 16) over two dates parsed from 2 tokens each (score 2² + 2² = 8).

Finally, it finds and outputs the parse with the best score.

The three building blocks of the algorithm:

from dateutil.parser import parse as parsedate

def partition(lst, minps, maxps, i=0):
    # Yield every way of splitting lst into consecutive chunks of
    # minps..maxps tokens, as lists of (startIndex, chunk) pairs.
    if lst == []:
        yield []
    else:
        for l in range(minps, maxps+1):
            if l > len(lst): continue
            for z in partition(lst[l:], minps, maxps, i+l):
                yield [(i, lst[:l])] + z

def parsedates(p):
    # Try to parse each chunk of a partitioning as a date.
    for i, pi in p:
        try:
            d = parsedate(' '.join(pi))
            # output: (startIndex, endIndex, parsedDate)
            yield i, i+len(pi), d
        except (ValueError, OverflowError):
            pass  # chunk is not a date

def score(p):
    # Sum of squared token-range lengths: longer parsed spans win.
    total = 0
    for pi in p:
        total += (pi[1]-pi[0])**2
    return total
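
To make the enumeration concrete, this is what partition yields for a toy three-token list with chunk sizes between 1 and 2 (just an illustration, not part of the algorithm itself):

for p in partition(['a', 'b', 'c'], 1, 2):
    print(p)

[(0, ['a']), (1, ['b']), (2, ['c'])]
[(0, ['a']), (1, ['b', 'c'])]
[(0, ['a', 'b']), (2, ['c'])]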

Finding the parse with the best score:

def bestparse(toks, maxps=3):
    # Try every partitioning and keep the one whose parsed dates score highest.
    bestscore = 0
    bestparse = None
    for ps in partition(toks, 1, maxps):
        l = list(parsedates(ps))
        s = score(l)
        if s > bestscore:
            bestscore = s
            bestparse = l
    return bestparse

Some tests:

l=['bla', 'bla', 'bla', '12', 'Jan', '14', 'bla', 'bla', 'bla', '01/04/15', 'bla', 'bla']
for bpi in bestparse(l):
    print('found date %s at tokens %s' % (bpi[2], ','.join(map(str, range(*bpi[:2])))))

found date 2014-01-12 00:00:00 at tokens 3,4,5

found date 2015-01-04 00:00:00 at tokens 9

l=['Fred', 'was', 'born', 'on', '23/1/99', 'at', '23:30']
for bpi in bestparse(l, 5):
    print('found date %s at tokens %s' % (bpi[2], ','.join(map(str, range(*bpi[:2])))))

found date 1999-01-23 23:30:00 at tokens 3,4,5,6

Beware that this can be very computationally expensive, so you may want to run it on single short phrases rather than on the whole document. You may even want to split long phrases into chunks, as sketched below.
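
One way to do that (a rough sketch; dates_in_document and chunksize are illustrative names, not part of the answer above) is to split the text on sentence punctuation and feed bestparse windows of a bounded number of tokens:

import re

def dates_in_document(text, chunksize=10):
    # Split on sentence punctuation, tokenize on whitespace, and run
    # bestparse() on windows of at most `chunksize` tokens each.
    for sentence in re.split(r'[.!?]+\s+', text):
        toks = sentence.split()
        for start in range(0, len(toks), chunksize):
            found = bestparse(toks[start:start + chunksize])
            if found:
                for startIdx, endIdx, d in found:
                    yield d

A fixed window can of course cut a date in half at a chunk boundary, so overlapping windows would be safer if dates may span several tokens.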

Another point for improvement is the partitioning function. If you have prior information, such as the maximum number of dates that can appear in a single sentence, the number of partitionings to enumerate can be greatly reduced; one possible variant is sketched below.
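
As a rough sketch of that idea (partition_limited and its maxlong parameter are my own illustrative names, not part of the answer above), the enumeration can be restricted to at most maxlong chunks longer than one token; single-token chunks, including single-token dates such as '01/04/15', are always allowed:

def partition_limited(lst, minps, maxps, maxlong, i=0):
    # Like partition(), but allow at most `maxlong` chunks longer than one token.
    if lst == []:
        yield []
    else:
        for l in range(minps, maxps+1):
            if l > len(lst):
                continue
            if l > 1 and maxlong == 0:
                continue  # multi-token budget already used up
            budget = maxlong - 1 if l > 1 else maxlong
            for z in partition_limited(lst[l:], minps, maxps, budget, i+l):
                yield [(i, lst[:l])] + z

Setting maxlong to the expected maximum number of multi-token dates in a sentence prunes a large share of the partitionings while still finding single-token dates.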