Metacharacters python extracting dates


I want to extract dates written as either Day Month Year or Month Day Year.

For example: 14 January, 2005 or Feb 29 1982

The code I'm using:

date = re.findall(r'\d{1,3} Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec|January|February|March|April|May|June|July|August|September|October|November|December \d{1,3}[, ]\d{4}',line)

Python interprets this as "1-3 digits followed by Jan", or any one of the other month names on its own, because | splits the whole pattern into alternatives. So it matches only "Feb" or "12 Jan", but not the rest of the date.

How do I group ONLY the months, so that | applies to the month names but not to the rest of the expression?

There is 1 answer

Answered by soyapencil

To answer your question directly: wrap the month alternatives in parentheses so that | applies only to them. You can then build two regexps, one for the "Day Month Year" format and one for "Month Day Year", and check them separately.

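import datetime

# Build month names with list comprehensions (names follow the current locale)
months_shrt = [datetime.date(1, m, 1).strftime('%b') for m in range(1, 13)]
months_long = [datetime.date(1, m, 1).strftime('%B') for m in range(1, 13)]

# Join into one alternation; long names first so 'January' is tried before 'Jan'
months = months_long + months_shrt
# Non-capturing group, so re.findall returns the whole match, not just the month
months_or = f'(?:{"|".join(months)})'

# Raw strings keep \d from being treated as a string escape
expr_dmy = r'\d{1,2},? ' + months_or + r',? \d{4}'
expr_mdy = months_or + r',? \d{1,2},? \d{4}'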

You can try both and see which one matches. However, you'll still need to parse the matched text and convert it to your preferred date representation.

Instead, I would advise not using regexps at all and simply trying different date formats.
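As a rough sketch of how the two patterns might be used with re.findall: the snippet below hardcodes the English month names (so it does not depend on the locale) and wraps them in a non-capturing group, which keeps | confined to the months. The names MONTHS, EXPR_DMY, EXPR_MDY, find_dates and the sample text are made up for illustration.

```python
import re

# Month names hardcoded for illustration; long names come first so that
# "January" is tried before "Jan".
MONTHS = (
    "January|February|March|April|May|June|July|August|"
    "September|October|November|December|"
    "Jan|Feb|Mar|Apr|Jun|Jul|Aug|Sep|Oct|Nov|Dec"
)

# (?:...) groups the alternation without capturing, so findall returns
# the whole match rather than just the month name.
EXPR_DMY = re.compile(rf"\b\d{{1,2}},? (?:{MONTHS}),? \d{{4}}\b")
EXPR_MDY = re.compile(rf"\b(?:{MONTHS}),? \d{{1,2}},? \d{{4}}\b")


def find_dates(line):
    """Return all Day Month Year matches, then all Month Day Year matches."""
    return EXPR_DMY.findall(line) + EXPR_MDY.findall(line)


sample = "Born 14 January, 2005; moved on Feb 29 1982."
print(find_dates(sample))
```

Note that the regexps only check the shape of the text: "Feb 29 1982" is matched even though 1982 was not a leap year.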

import datetime
from itertools import product

# Each of the first two fields may or may not be followed by a comma
seps = ('', ',')

base_fmts = [('%d', '%b', '%Y'),
             ('%d', '%B', '%Y'),
             ('%b', '%d', '%Y'),
             ('%B', '%d', '%Y')]

def my_formatter(s):
    """Return a datetime for s, or None if no format matches."""
    for (a, b, c), sep1, sep2 in product(base_fmts, seps, seps):
        # Build a candidate format such as '%b %d, %Y'
        fmt = f'{a}{sep1} {b}{sep2} {c}'
        try:
            return datetime.datetime.strptime(s, fmt)
        except ValueError:
            continue
    return None

The function above takes a string and returns a datetime.datetime object, or None if none of the formats match. You can use the standard datetime.datetime attributes to get your day, month and year back.

>>> d = my_formatter('Jan 15, 2009')
>>> (d.month, d.day, d.year)
(1, 15, 2009)
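One side effect of going through strptime is that it also validates the calendar date, which a regexp alone does not. The sketch below uses a hypothetical, trimmed-down helper named parse_date with a short illustrative format list (not the full set above) and assumes English month names in the current locale. It returns None for "Feb 29 1982", since 1982 was not a leap year.

```python
from datetime import datetime

# Hypothetical, shortened format list for illustration only.
FORMATS = ("%d %B, %Y", "%d %B %Y", "%b %d %Y", "%b %d, %Y")


def parse_date(text):
    """Return a datetime if text matches a known format and is a real date."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            # Wrong layout or an impossible calendar date; try the next format.
            continue
    return None


print(parse_date("14 January, 2005"))
print(parse_date("Feb 29 1982"))
```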