Text Scraping using Python: Regex

I have dynamic text that looks something like this:

my_text = """address ae fae daq ad, 1231 asdas  landline 213121233 -123    mobile 513121233 cell (132) -142-3127  
           email [email protected] , sdasd [email protected] - [email protected]"""

The text starts with 'address'. As soon as we see 'address', we need to scrape everything from there until 'landline', 'mobile', or 'cell' appears. From there on, we want to scrape all the phone text (without altering the spaces in between): we start from the first occurrence of 'landline', 'mobile', or 'cell' and stop as soon as 'email' appears. Finally, we scrape the email part (again without altering the spaces in between).

'landline'/'mobile'/'cell' can appear in any order and sometimes some may not appear. For example, the text could have looked like this as well.
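One way to picture the sectioning described above is a single re.split() call with a capturing group, which keeps the keywords in the result so each chunk can be paired with the keyword in front of it. This is only a sketch: split_sections and KEYWORDS are names made up for illustration, the sample addresses are placeholders, and it assumes the keywords never occur as whole words inside the data itself.

```python
import re

# Hypothetical keyword list, taken from the question's description.
KEYWORDS = ("address", "landline", "mobile", "cell", "email")

def split_sections(text):
    """Return (keyword, chunk) pairs in the order they appear in text."""
    # The capturing group makes re.split keep the keywords in its result.
    pattern = r"\b(" + "|".join(KEYWORDS) + r")\b"
    parts = re.split(pattern, text)
    # parts[0] is whatever precedes the first keyword; after that,
    # keywords and their chunks alternate.
    return [(parts[i], parts[i + 1].strip()) for i in range(1, len(parts), 2)]

# Made-up sample in the shape of the question's text.
sample = ("address ae fae, 12 asdas  cell (132) -142-3127 "
          "landline 2131 -123  email a@example.com , b@example.com")
for kind, chunk in split_sections(sample):
    print(kind, repr(chunk))
```

Because the chunks are sliced straight out of the original string, the spaces inside each chunk are left untouched; only the leading and trailing whitespace is stripped. Keywords that are missing from the text simply produce no pair.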

my_text = """address ae fae daq ad, 1231 asdas  
           cell (132) -142-3127 landline 213121233 -123     
           email [email protected] , sdasd [email protected] - [email protected]"""

There's a little more engineering that needs to be done to form arrays of subtext contained in address, phones and email text. Subtexts of addresses are always separated with commas (,). Subtexts of emails can be separated with commas (,) or hyphens (-).
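The connector requirement for emails could be handled along these lines. It is a rough sketch rather than a definitive solution: split_emails is a hypothetical helper, the addresses are placeholders, and it assumes a connector always has whitespace on both sides (as in the sample text), so that a hyphen inside an address is not mistaken for a separator.

```python
import re

def split_emails(chunk):
    """Split an email chunk on ',' or '-' and record each connector."""
    # The capturing group keeps the connectors in the split result.
    # Assumption: connectors are surrounded by whitespace.
    parts = re.split(r"\s+([,-])\s+", chunk.strip())
    emails = [{"email": parts[0], "connector": ""}]
    # After the first email, connectors and emails alternate.
    for i in range(1, len(parts), 2):
        emails.append({"email": parts[i + 1], "connector": parts[i]})
    return emails

print(split_emails("a@example.com , sdasd b@example.com - c-d@example.com"))
```

Each connector is stored with the email that follows it, which matches the layout of the desired output. Addresses need no connector, so a plain chunk.split(',') followed by strip() on each piece should be enough for them.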

My output should be a JSON dictionary which looks something like this:

resultant_dict = {
    "addresses": [
        {"address": "ae fae daq ad"},
        {"address": "1231 asdas"}
    ],
    "phones": [
        {"number": "213121233 -123", "kind": "landline"},
        {"number": "513121233", "kind": "mobile"},
        {"number": "(132) -142-3127", "kind": "cell"}
    ],
    "emails": [
        {"email": "[email protected]", "connector": ""},
        {"email": "sdasd [email protected]", "connector": ","},
        {"email": "[email protected]", "connector": "-"}
    ]
}

I am trying to achieve this using regular expressions, or any other way, in Python. I can't figure out how to write it, as I am a novice programmer.

There are 2 answers

Cody Bouche On BEST ANSWER

This will work as long as there are no spaces in your emails.

import re
my_text = 'address ae fae daq ad, 1231 asdas  landline 213121233 -123    mobile 513121233 cell (132) -142-3127  email [email protected] , [email protected] - [email protected]'

split_words = ['address', 'landline', 'mobile', 'cell', 'email']
resultant_dict = {'addresses': [], 'phones': [], 'emails': []}

for sw in split_words:

    # Skip keywords that do not occur in this particular text.
    if sw not in my_text:
        continue

    # Take the text that follows the keyword
    # (list() is needed on Python 3, where filter() returns an iterator).
    text = list(filter(None, my_text.split(sw)))
    text = text[0].strip() if len(text) < 2 else text[1].strip()

    # Cut the text off at the next keyword, if there is one.
    next_split = [x.strip() for x in text.split() if x.strip() in split_words]

    if next_split:
        text = text.split(next_split[0])[0].strip()

    if sw in ['address']:
        text = text.split(',')
        for t in text:
            resultant_dict['addresses'].append({'address': t.strip()})

    elif sw in ['landline', 'mobile', 'cell']:
        resultant_dict['phones'].append({'number': text, 'kind': sw})

    elif sw in ['email']:

        connectors = [',', '-']
        emails = re.split('|'.join(connectors), text)
        text = list(filter(None, [x.strip() for x in text.split()]))

        for email in emails:

            email = email.strip()
            connector = ''
            # The connector is the token just before the email, if any.
            index = text.index(email) if email in text else 0

            if index > 0:
                connector = text[index - 1]

            resultant_dict['emails'].append({'email': email, 'connector': connector})

print(resultant_dict)
Jerry101 On

This is not a good job for regular expressions since the components you want to parse out of the input can appear in any order and any number.

Consider using a lexing and parsing library such as the pyPEG parsing expression grammar.

Another approach would use str.split() or re.split() to split the input text into tokens. Then scan through those tokens looking for your keywords like 'address', 'cell', and ',', accumulating the following tokens until the next keyword. This approach lets split() do the first part of the tokenizing work, leaving you to do the rest of the lexical work (by recognizing keywords) and the parsing work manually.

The manual approach is more instructive but more verbose and less flexible. It goes like this:

text = """address ae fae daq ad, 1231 asdas  
           cell (132) -142-3127 landline 213121233 -123     
           email [email protected] , sdasd [email protected] - [email protected]"""

class Scraper:
    def __init__(self):
        self.current = []
        self.current_type = None

    def emit(self):
        if self.current:
            # TODO: Add the new item to a dictionary.
            # Later, translate the dictionary to JSON format.
            print(self.current_type, self.current)

    def scrape(self, input_text):
        tokens = input_text.split()
        for token in tokens:
            if token in ('address', 'cell', 'landline', 'mobile', 'email'):
                self.emit()
                self.current = []
                self.current_type = token
            else:
                self.current.append(token)
        self.emit()

s = Scraper()
s.scrape(text)

This emits:

address ['ae', 'fae', 'daq', 'ad,', '1231', 'asdas']
cell ['(132)', '-142-3127']
landline ['213121233', '-123']
email ['[email protected]', ',', 'sdasd', '[email protected]', '-', '[email protected]']

You'll want to use re.split() to make it split 'ad,' into ['ad', ','], add code to handle tokens like ',', and use a library such as the standard json module to convert the dictionary to JSON format.
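Those remaining steps might look roughly like the sketch below. Treat it as one possible completion under stated assumptions, not the finished answer: tokenize, sections, split_on and scrape are hypothetical helpers, the email addresses are placeholders, and joining tokens with a single space collapses runs of whitespace, so it does not preserve the original spacing exactly.

```python
import json
import re

KEYWORDS = {"address", "landline", "mobile", "cell", "email"}

def tokenize(text):
    # Split on whitespace and also peel commas off into their own tokens,
    # so 'ad,' becomes 'ad' and ','. Empty strings and None are dropped.
    return [t for t in re.split(r"\s+|(,)", text) if t]

def sections(tokens):
    """Group tokens under the keyword that precedes them."""
    grouped = []
    for token in tokens:
        if token in KEYWORDS:
            grouped.append((token, []))
        elif grouped:
            grouped[-1][1].append(token)
    return grouped

def split_on(tokens, separators):
    """Split a token list into (connector, text) runs at separator tokens."""
    runs = [("", [])]
    for token in tokens:
        if token in separators:
            runs.append((token, []))
        else:
            runs[-1][1].append(token)
    return [(sep, " ".join(words)) for sep, words in runs if words]

def scrape(text):
    result = {"addresses": [], "phones": [], "emails": []}
    for kind, tokens in sections(tokenize(text)):
        if kind == "address":
            for _, item in split_on(tokens, {","}):
                result["addresses"].append({"address": item})
        elif kind == "email":
            for sep, item in split_on(tokens, {",", "-"}):
                result["emails"].append({"email": item, "connector": sep})
        else:
            # landline, mobile or cell: keep the whole run as one number.
            result["phones"].append({"number": " ".join(tokens), "kind": kind})
    return result

sample = """address ae fae daq ad, 1231 asdas
            cell (132) -142-3127 landline 213121233 -123
            email a@example.com , sdasd b@example.com - c@example.com"""
print(json.dumps(scrape(sample), indent=2))
```

Only a standalone '-' token counts as an email connector here, so a token such as '-123' in a phone number is left alone.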