Does anyone know if there's a Python text parser that recognises embedded dates? For instance, given a sentence
"bla bla bla bla 12 Jan 14 bla bla bla 01/04/15 bla bla bla"
the parser could pick out the two date occurrences. I know of some Java tools, but are there Python ones? Would NLTK be overkill?
Thanks
Here is an attempt to nondeterministically (read: exhaustively) solve the problem of finding where the dates are in the tokenized text. It enumerates all ways of partitioning the sentence (as a list of tokens), with partition sizes from minps to maxps. Each partitioning is run through the parser, which outputs a list of parsed dates and the token range from which each date was parsed.
Each parser output is scored by the sum of the squared lengths of its token ranges, so that one date parsed from 4 tokens (score 16) is preferred over two dates parsed from 2 tokens each (score 8).
Finally, it finds and outputs the parse with the best score.
The three building blocks of the algorithm:
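A minimal sketch of what three such building blocks (partitioning, parsing, scoring) might look like. To keep it dependency-free it assumes the standard library's `datetime.strptime` with a small, hypothetical `FORMATS` list; the function names `partitions`, `parse_date`, `parse` and `score` are illustrative choices, and reading `01/04/15` as day-first is also an assumption.

```python
from datetime import datetime

# Hypothetical list of accepted formats; extend it for the styles you expect.
# Day-first order for slash dates is an assumption, not something the text fixes.
FORMATS = ["%d %b %y", "%d %b %Y", "%d/%m/%y", "%d/%m/%Y"]


def partitions(n, minps, maxps, start=0):
    """Yield every split of range(start, n) into consecutive (start, end)
    chunks whose sizes lie between minps and maxps."""
    if start == n:
        yield []
        return
    for size in range(minps, min(maxps, n - start) + 1):
        for rest in partitions(n, minps, maxps, start + size):
            yield [(start, start + size)] + rest


def parse_date(text):
    """Return a date if text matches one of FORMATS exactly, else None."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None


def parse(tokens, partition):
    """Try to parse each chunk; return a list of (date, (start, end))."""
    found = []
    for start, end in partition:
        parsed = parse_date(" ".join(tokens[start:end]))
        if parsed is not None:
            found.append((parsed, (start, end)))
    return found


def score(parsed):
    """Sum of squared token-range lengths, favouring longer matches."""
    return sum((end - start) ** 2 for _, (start, end) in parsed)


tokens = "bla bla 12 Jan 14 bla 01/04/15 bla".split()
one_partition = [(0, 2), (2, 5), (5, 6), (6, 7), (7, 8)]
print(parse(tokens, one_partition))
```

A strict parser like this rejects fragments such as a lone `12`, which keeps the number of spurious matches down; a more lenient parser would accept more date styles at the cost of more false positives for the scoring step to sort out.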
Finding the parse with best score:
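One way this step could be written, as a sketch: the partition generator, the parser and the scoring function are passed in as callables (their signatures here are assumptions), so the selection logic stands on its own. The `two_way_splits`, `toy_parse` and `toy_score` helpers are toy stand-ins that exist only to make the example runnable; they are not a real date parser.

```python
def best_parse(tokens, partitions, parse, score):
    """Run the parser on every candidate partitioning and return the
    highest-scoring output (an empty list if there are no candidates)."""
    candidates = (parse(tokens, p) for p in partitions(len(tokens)))
    return max(candidates, key=score, default=[])


# Toy stand-ins, only for demonstration.
def two_way_splits(n):
    # Every split of range(n) into one left and one right chunk.
    return [[(0, k), (k, n)] for k in range(1, n)]


def toy_parse(tokens, partition):
    # "Parses" a chunk only if every token in it is a digit string.
    return [(tokens[s:e], (s, e)) for s, e in partition
            if all(t.isdigit() for t in tokens[s:e])]


def toy_score(parsed):
    # Sum of squared token-range lengths.
    return sum((e - s) ** 2 for _, (s, e) in parsed)


print(best_parse(["on", "12", "01", "14"], two_way_splits, toy_parse, toy_score))
```

When several parses tie, `max` returns the first one it encounters, so the enumeration order of the partitions acts as the tie-breaker.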
Some tests:
Beware that this can be very computationally expensive, so you may want to run it on single short phrases rather than on the whole document. You may even want to split long phrases into chunks.
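As an illustration of the chunking idea, here is a small sketch that cuts a token list into overlapping windows. The `chunks` helper and its `size`/`overlap` parameters are hypothetical; the assumption is that choosing an overlap of at least the longest expected date minus one token keeps a date from being cut in half at a window boundary.

```python
def chunks(tokens, size, overlap):
    """Yield (offset, window) pairs covering tokens with windows of at most
    `size` tokens, where consecutive windows share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    # Stop once a new window would add no token beyond the previous one.
    for start in range(0, max(len(tokens) - overlap, 1), step):
        yield start, tokens[start:start + size]


tokens = "a b c d e f g h i j".split()
for offset, window in chunks(tokens, size=4, overlap=2):
    print(offset, window)
```

The offset lets you map token ranges found inside a window back to positions in the full phrase. Dates that fall in an overlap region will be found twice, so the results need de-duplicating.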
Another point for improvement is the partitioning function. If you have prior information, such as the maximum number of dates that can appear in a single sentence, the number of ways of partitioning it can be greatly reduced.
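To give a feel for the possible saving, here is a sketch of one alternative, which is an assumption rather than the approach described above: instead of partitioning the whole sentence, enumerate only sets of at most `max_dates` non-overlapping candidate spans. The names `count_partitions`, `candidate_spans` and `span_sets` are illustrative.

```python
from itertools import combinations


def count_partitions(n, minps, maxps):
    """Count the splits of n tokens into consecutive chunks of size
    minps..maxps, without enumerating them."""
    ways = [1] + [0] * n
    for i in range(1, n + 1):
        ways[i] = sum(ways[i - size]
                      for size in range(minps, maxps + 1) if size <= i)
    return ways[n]


def candidate_spans(n, minps, maxps):
    """All (start, end) spans of allowed size, ordered by start position."""
    return [(s, s + size) for s in range(n)
            for size in range(minps, maxps + 1) if s + size <= n]


def span_sets(n, minps, maxps, max_dates):
    """Yield every set of 1..max_dates non-overlapping candidate spans."""
    spans = candidate_spans(n, minps, maxps)
    for k in range(1, max_dates + 1):
        for combo in combinations(spans, k):
            # Spans are ordered by start, so checking neighbours is enough.
            if all(a[1] <= b[0] for a, b in zip(combo, combo[1:])):
                yield list(combo)


n = 20
print("full partitionings:", count_partitions(n, 1, 3))
print("span sets, at most 2 dates:", sum(1 for _ in span_sets(n, 1, 3, 2)))
```

Tokens outside the chosen spans are simply never handed to the parser, which has the same effect as chunks that fail to parse in the full partitioning.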