Python make sure address matches specific format


I have been playing around with regular expressions, but haven't had any luck yet. I need to add some address validation: a user-defined address must match this format:

"717 N 2ND ST, MANKATO, MN 56001"

or possibly this one too:

"717 N 2ND ST, MANKATO, MN, 56001"

Everything else should be thrown out, with an alert to the user that the format is improper. I have been looking at the documentation and have tried many regular expression patterns without success. I have tried this (and many variations) without any luck:

pat = r'\d{1,6}(\w+),\s(w+),\s[A-Za-z]{2}\s{1,6}'

This one works, but it allows too much junk because it is only making sure it starts with a house number and ends with a zip code (I think):

pat = r'\d{1,6}( \w+){1,6}'

The comma placement is crucial because I split the input string by comma: the first item is the street address, the second is the city, and the last holds the state and ZIP, which are split by a space (here I would like to use a second regex in case there is a comma between state and ZIP).

Essentially I would like to do this:

import re

# check for this format "717 N 2ND ST, MANKATO, MN 56001"
pat_1 = 'regex to match above pattern'
# check for this format "717 N 2ND ST, MANKATO, MN, 56001"
pat_2 = 'regex to match above format'

if re.match(pat_1, addr, re.IGNORECASE):
    pass  # extract address
elif re.match(pat_2, addr, re.IGNORECASE):
    pass  # extract address
else:
    raise ValueError('"{}" must match this format: "717 N 2ND ST, MANKATO, MN 56001"'.format(addr))

# do stuff with address

If anyone could help me with forming a regex to make sure there is a pattern match, I would greatly appreciate it!

4

There are 4 answers

7
Robᵩ On BEST ANSWER

Here's one that might help. Whenever possible, I prefer to use verbose regular expressions with embedded comments, for maintainability.

Also note the use of (?P<name>pattern). This helps to document the intent of the match, and also provides a useful mechanism to extract the data, if your needs go beyond simple regex validation.

import re

# Goal:  '717 N 2ND ST, MANKATO, MN 56001',
# Goal:  '717 N 2ND ST, MANKATO, MN, 56001',
regex = r'''
    (?P<HouseNumber>\d+)\s+        # Matches '717 '
    (?P<Direction>[news])\s+       # Matches 'N '
    (?P<StreetName>\w+)\s+         # Matches '2ND '
    (?P<StreetDesignator>\w+),\s+  # Matches 'ST, '
    (?P<TownName>.*),\s+           # Matches 'MANKATO, '
    (?P<State>[A-Z]{2}),?\s+       # Matches 'MN ' and 'MN, '
    (?P<ZIP>\d{5})                 # Matches '56001'
'''

# re.VERBOSE: verbose regular expression; re.IGNORECASE: ignore case.
# Inline (?x)(?i) flags must sit at the very start of the pattern;
# Python 3.11+ rejects them after the leading whitespace used here.
regex = re.compile(regex, re.VERBOSE | re.IGNORECASE)

for item in (
    '717 N 2ND ST, MANKATO, MN 56001',
    '717 N 2ND ST, MANKATO, MN, 56001',
    '717 N 2ND, Makata, 56001',   # Should reject this one
    '1234 N D AVE, East Boston, MA, 02134',
    ):
    match = regex.match(item)
    print(item)
    if match:
        print("    House is on {Direction} side of {TownName}".format(**match.groupdict()))
    else:
        print("    invalid entry")
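Note that match() only anchors at the start of the string, so trailing junk after the ZIP would still pass. Here is a minimal sketch of a stricter check. It assumes Python 3.4+ (for fullmatch), and the names ADDRESS_RE and parse_address are made up for illustration; the pattern is the one above, with the flags passed to re.compile.

```python
import re

# Same pattern as above; flags are passed to re.compile instead of inline.
ADDRESS_RE = re.compile(r'''
    (?P<HouseNumber>\d+)\s+
    (?P<Direction>[news])\s+
    (?P<StreetName>\w+)\s+
    (?P<StreetDesignator>\w+),\s+
    (?P<TownName>.*),\s+
    (?P<State>[A-Z]{2}),?\s+
    (?P<ZIP>\d{5})
''', re.VERBOSE | re.IGNORECASE)


def parse_address(addr):
    """Hypothetical helper: return the named fields, or raise ValueError."""
    match = ADDRESS_RE.fullmatch(addr.strip())  # the whole string must match
    if match is None:
        raise ValueError(
            '"{}" must match this format: '
            '"717 N 2ND ST, MANKATO, MN 56001"'.format(addr))
    return match.groupdict()


fields = parse_address('717 N 2ND ST, MANKATO, MN, 56001')
print(fields['TownName'], fields['State'], fields['ZIP'])
```

With match() instead of fullmatch(), an input such as '717 N 2ND ST, MANKATO, MN 56001 extra' would be accepted.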

To make certain fields optional, we replace + with *, since + means ONE-or-more, and * means ZERO-or-more. Here is a version that matches the new requirements in the comments:

import re

# Goal:  '717 N 2ND ST, MANKATO, MN 56001',
# Goal:  '717 N 2ND ST, MANKATO, MN, 56001',
# Goal:  '717 N 2ND ST NE, MANKATO, MN, 56001',
# Goal:  '717 N 2ND, MANKATO, MN, 56001',
regex = r'''
    (?P<HouseNumber>\d+)\s+         # Matches '717 '
    (?P<Direction>[news])\s+        # Matches 'N '
    (?P<StreetName>\w+)\s*          # Matches '2ND ', with optional trailing space
    (?P<StreetDesignator>\w*)\s*    # Optionally Matches 'ST '
    (?P<StreetDirection>[news]*)\s* # Optionally Matches 'NE'
    ,\s+                            # Force a comma after the street
    (?P<TownName>.*),\s+            # Matches 'MANKATO, '
    (?P<State>[A-Z]{2}),?\s+        # Matches 'MN ' and 'MN, '
    (?P<ZIP>\d{5})                  # Matches '56001'
'''

# re.VERBOSE: verbose regular expression; re.IGNORECASE: ignore case.
# Inline (?x)(?i) flags must sit at the very start of the pattern;
# Python 3.11+ rejects them after the leading whitespace used here.
regex = re.compile(regex, re.VERBOSE | re.IGNORECASE)

for item in (
    '717 N 2ND ST, MANKATO, MN 56001',
    '717 N 2ND ST, MANKATO, MN, 56001',
    '717 N 2ND, Makata, 56001',   # Should reject this one
    '1234 N D AVE, East Boston, MA, 02134',
    '717 N 2ND ST NE, MANKATO, MN, 56001',
    '717 N 2ND, MANKATO, MN, 56001',
    ):
    match = regex.match(item)
    print(item)
    if match:
        print("    House is on {Direction} side of {TownName}".format(**match.groupdict()))
    else:
        print("    invalid entry")
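The difference between + and * can also be checked in isolation. This is a small sketch, assuming Python 3.4+ for fullmatch; the sample strings and the name street are only illustrations.

```python
import re

# '+' requires at least one character; '*' also accepts none.
plus_on_text = bool(re.fullmatch(r'\w+', 'ST'))
plus_on_empty = bool(re.fullmatch(r'\w+', ''))
star_on_empty = bool(re.fullmatch(r'\w*', ''))
print(plus_on_text, plus_on_empty, star_on_empty)

# Making the designator optional: '\w+\s*\w*' accepts '2ND ST' and '2ND'.
street = re.compile(r'(?P<StreetName>\w+)\s*(?P<StreetDesignator>\w*)')
with_designator = street.fullmatch('2ND ST').groupdict()
without_designator = street.fullmatch('2ND').groupdict()
print(with_designator)
print(without_designator)
```

A field made optional with * comes back as an empty string rather than None, so a plain truth test such as `if fields['StreetDesignator']:` is enough to see whether it was present.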

Next, consider the OR operator, |, and the non-capturing group operator, (?:pattern). Together, they can describe complex alternatives in the input format. This version matches the new requirement that some addresses have the direction before the street name, and some have the direction after the street name, but no address has the direction in both places.

import re

# Goal:  '717 N 2ND ST, MANKATO, MN 56001',
# Goal:  '717 N 2ND ST, MANKATO, MN, 56001',
# Goal:  '717 2ND ST NE, MANKATO, MN, 56001',
# Goal:  '717 N 2ND, MANKATO, MN, 56001',
regex = r'''
    (?: # Matches any sort of street address
        (?: # Matches '717 N 2ND ST' or '717 N 2ND'
            (?P<HouseNumber>\d+)\s+      # Matches '717 '
            (?P<Direction>[news])\s+     # Matches 'N '
            (?P<StreetName>\w+)\s*       # Matches '2ND ', with optional trailing space
            (?P<StreetDesignator>\w*)\s* # Optionally Matches 'ST '
        )
        | # OR
        (?:  # Matches '717 2ND ST NE' or '717 2ND NE'
            (?P<HouseNumber2>\d+)\s+      # Matches '717 '
            (?P<StreetName2>\w+)\s+       # Matches '2ND '
            (?P<StreetDesignator2>\w*)\s* # Optionally Matches 'ST '
            (?P<Direction2>[news]+)       # Matches 'NE'
        )
    )
    ,\s+                             # Force a comma after the street
    (?P<TownName>.*),\s+             # Matches 'MANKATO, '
    (?P<State>[A-Z]{2}),?\s+         # Matches 'MN ' and 'MN, '
    (?P<ZIP>\d{5})                   # Matches '56001'
'''

# re.VERBOSE: verbose regular expression; re.IGNORECASE: ignore case.
# Inline (?x)(?i) flags must sit at the very start of the pattern;
# Python 3.11+ rejects them after the leading whitespace used here.
regex = re.compile(regex, re.VERBOSE | re.IGNORECASE)

for item in (
    '717 N 2ND ST, MANKATO, MN 56001',
    '717 N 2ND ST, MANKATO, MN, 56001',
    '717 N 2ND, Makata, 56001',   # Should reject this one
    '1234 N D AVE, East Boston, MA, 02134',
    '717 2ND ST NE, MANKATO, MN, 56001',
    '717 N 2ND, MANKATO, MN, 56001',
    ):
    match = regex.match(item)
    print(item)
    if match:
        d = match.groupdict()
        print("    House is on {0} side of {1}".format(
            d['Direction'] or d['Direction2'],
            d['TownName']))
    else:
        print("    invalid entry")
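Python's re module does not allow the same group name twice in one pattern, which is why the second alternative uses HouseNumber2, StreetName2 and so on. If the "2" suffix gets in the way downstream, the two sets of names can be folded back together. This is only a sketch: merge_alternatives is a hypothetical helper, and the sample dictionary is an assumed, hand-written shape for match.groupdict(), not captured output.

```python
def merge_alternatives(groups, suffix='2'):
    """Hypothetical helper: fold 'Name2' entries into 'Name'.

    Groups in the alternative that did not match are None, so for each
    field we keep whichever of 'Name' / 'Name2' is not None.
    """
    merged = {}
    for name, value in groups.items():
        base = name[:-len(suffix)] if name.endswith(suffix) else name
        if merged.get(base) is None:
            merged[base] = value
    return merged


# Assumed shape of match.groupdict() for '717 2ND ST NE, MANKATO, MN, 56001'
sample = {
    'HouseNumber': None, 'Direction': None,
    'StreetName': None, 'StreetDesignator': None,
    'HouseNumber2': '717', 'StreetName2': '2ND',
    'StreetDesignator2': 'ST', 'Direction2': 'NE',
    'TownName': 'MANKATO', 'State': 'MN', 'ZIP': '56001',
}
merged = merge_alternatives(sample)
print(merged)
```

The helper assumes that no real field name ends in the suffix; that holds for the group names used above.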
0
Buzz On

You could use this:

\d{1,6}(\s\w+)+,(\s\w+)+,\s[A-Z]{2},?\s\d{1,6}

It will match a string that starts with a house number, then any number of words followed by a comma. Then it looks for a city name that consists of at least one word followed by a comma. Next it looks for exactly two capital letters followed by an optional comma, then a ZIP code of one to six digits.
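A quick check of this pattern against the question's examples, as a sketch. Two things here are assumptions and not part of the answer: fullmatch (Python 3.4+) is used so that nothing may follow the ZIP, and the sample list is made up for the test.

```python
import re

pat = re.compile(r'\d{1,6}(\s\w+)+,(\s\w+)+,\s[A-Z]{2},?\s\d{1,6}')

samples = [
    '717 N 2ND ST, MANKATO, MN 56001',
    '717 N 2ND ST, MANKATO, MN, 56001',
    '717 N 2ND, Makata, 56001',          # no state
    '717 N 2ND ST, MANKATO, mn 56001',   # lowercase state
]
results = [bool(pat.fullmatch(addr)) for addr in samples]
for addr, ok in zip(samples, results):
    print(addr, '->', ok)
```

The first two samples are accepted and the last two rejected. To accept a lowercase state, pass re.IGNORECASE to re.compile; to insist on a five-digit ZIP, replace the final \d{1,6} with \d{5}.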

0
AudioBubble On

How about this:

((\w|\s)+),((\w|\s)+),\s*(\w{2})\s*,?\s*(\d{5}).*

You can also use it to extract the street, city, state and zip in \1, \3, \5 and \6 respectively. Groups \2 and \4 only capture the last character of the street and the city, but this doesn't affect the validity.
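The group numbering can be checked with a short sketch. The variable names are only illustrative, and strip() is added here because the city group keeps the space that follows the first comma.

```python
import re

pat = re.compile(r'((\w|\s)+),((\w|\s)+),\s*(\w{2})\s*,?\s*(\d{5}).*')

m = pat.match('717 N 2ND ST, MANKATO, MN, 56001')
street, city, state, zip_code = (m.group(n).strip() for n in (1, 3, 5, 6))
print(street, city, state, zip_code, sep=' | ')

# \w{2} also accepts digits, so a missing state can slip through
# when the last field happens to be long enough:
loose = pat.match('717 N 2ND, Makata, 5600123')
print(loose.group(5), loose.group(6))
```

The second call shows one trade-off of this pattern: '56' is taken as the state and '00123' as the ZIP. Using [A-Za-z]{2} for the state would close that gap.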

0
pchmn On
\d{1,6}\s\w+\s\w+\s[A-Za-z]{2},\s([A-Za-z]+),\s[A-Za-z]{2}(,\s\d{1,6}|\s\d{1,6})

You can test the regex at this link: https://regex101.com/r/yN7hU9/1
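The same check can be run locally. This is a sketch; the use of fullmatch (Python 3.4+) and the third sample are assumptions added here. The pattern expects exactly a house number, two words, a two-letter designator and a one-word city, so it is stricter than the other answers.

```python
import re

pat = re.compile(
    r'\d{1,6}\s\w+\s\w+\s[A-Za-z]{2},\s([A-Za-z]+),\s[A-Za-z]{2}'
    r'(,\s\d{1,6}|\s\d{1,6})')

samples = (
    '717 N 2ND ST, MANKATO, MN 56001',
    '717 N 2ND ST, MANKATO, MN, 56001',
    '1234 N D AVE, East Boston, MA, 02134',  # 3-letter designator, 2-word city
)
results = [bool(pat.fullmatch(addr)) for addr in samples]
for addr, ok in zip(samples, results):
    print(addr, '->', ok)
```

Both formats from the question are accepted, while the 'AVE' / 'East Boston' address is rejected.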