Yellow Pages Scraper in Python stopped working


I am trying to scrape data from Yellow Pages. I have used this scraper successfully several times, but it has recently stopped working. I noticed a recent change on the Yellow Pages website where they have added a Sponsored Links table that contains three results. Since this change, the only thing my scraper picks up is the advertisement below this Sponsored Links table. It does not retrieve any of the results.

Where am I going wrong on this?

I have included my code below. As an example, it shows a search for 7 Eleven locations in Wisconsin.

import requests
from bs4 import BeautifulSoup
import csv

my_url = "https://www.yellowpages.com/search?search_terms=7-eleven&geo_location_terms=WI&page={}"
for link in [my_url.format(page) for page in range(1,20)]:
  res = requests.get(link)
  soup = BeautifulSoup(res.text, "lxml")

placeHolder = []
for item in soup.select(".info"):
  try:
    name = item.select("[itemprop='name']")[0].text
  except Exception:
    name = ""
  try:
    streetAddress = item.select("[itemprop='streetAddress']")[0].text
  except Exception:
    streetAddress = ""
  try:
    addressLocality = item.select("[itemprop='addressLocality']")[0].text
  except Exception:
    addressLocality = ""
  try:
    addressRegion = item.select("[itemprop='addressRegion']")[0].text
  except Exception:
    addressRegion = ""
  try:
    postalCode = item.select("[itemprop='postalCode']")[0].text
  except Exception:
    postalCode = ""
  try:
    phone = item.select("[itemprop='telephone']")[0].text
  except Exception:
    phone = ""

  with open('yp-7-eleven-wi.csv', 'a') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow([name, streetAddress, addressLocality, addressRegion, postalCode, phone])

There are 2 answers

Dmitriy Khaykin

The Scraping Life... the struggle is real!

When a site changes its layout, element structures and class names often change with it. You want to carefully inspect the updates and fix anything in your scraper that relies on hard-coded values tied to page elements, class names, etc., which may have changed.

A quick inspection of the page shows that the information you're scraping is housed in a different structure:

<div class="v-card">
    <div class="media-thumbnail"><a class="media-thumbnail-wrapper chain-img" href="/milwaukee-wi/mip/7-eleven-471900245?lid=471900245"
            data-analytics="{&quot;click_id&quot;:509}" data-impressed="1"><img class="lazy" alt="7-Eleven" src="//i2.ypcdn.com/blob/c625613c07118f48908d08ec3c5f5f9a9f813850_40.png?074020d"
                data-original="//i2.ypcdn.com/blob/c625613c07118f48908d08ec3c5f5f9a9f813850_40.png?074020d" width="40"
                height="40" style="display: block;"><noscript><img alt="7-Eleven" src="//i2.ypcdn.com/blob/c625613c07118f48908d08ec3c5f5f9a9f813850_40.png?074020d"
                    width="40" height="40"></noscript></a></div>
    <div class="info">
        <h2 class="n">2.&nbsp;<a class="business-name" href="/milwaukee-wi/mip/7-eleven-471900245?lid=471900245"
                data-analytics="{&quot;target&quot;:&quot;name&quot;,&quot;feature_click&quot;:&quot;&quot;}" rel=""
                data-impressed="1"><span>7-Eleven</span></a></h2>
        <div class="info-section info-primary">
            <div class="ratings" data-israteable="true"></div>
            <p class="adr"><span class="street-address">1624 W Wells St</span><span class="locality">Milwaukee,&nbsp;</span><span>WI</span>&nbsp;<span>53233</span></p>
            <div class="phones phone primary">(414) 342-9710</div>
        </div>
        <div class="info-section info-secondary">
            <div class="categories"><a href="/wi/convenience-stores" data-analytics="{&quot;click_id&quot;:1171,&quot;adclick&quot;:false,&quot;listing_features&quot;:&quot;category&quot;,&quot;events&quot;:&quot;&quot;}"
                    data-impressed="1">Convenience Stores</a></div>
            <div class="links"><a class="track-visit-website" href="https://www.7-eleven.com/locations/wi/milwaukee/1624-w-wells-st-35836?yext=35836"
                    rel="nofollow" target="_blank" data-analytics="{&quot;click_id&quot;:6,&quot;act&quot;:2,&quot;dku&quot;:&quot;https://www.7-eleven.com/locations/wi/milwaukee/1624-w-wells-st-35836?yext=35836&quot;,&quot;FL&quot;:&quot;url&quot;,&quot;target&quot;:&quot;website&quot;,&quot;LOC&quot;:&quot;https://www.7-eleven.com/locations/wi/milwaukee/1624-w-wells-st-35836?yext=35836&quot;,&quot;adclick&quot;:true}"
                    data-impressed="1">Website</a></div>
        </div>
        <div class="preferred-listing-features"></div>
        <div class="snippet">
            <p class="body"><span>From Business: At 7-Eleven, our doors are always open, and our friendly store teams
                    are ready to serve you. Our fresh, fast and convenient hot foods appeal to any craving, so yo…</span></p>
        </div>
    </div>
</div>

For example, for the street address, rather than [itemprop='streetAddress'] you'd need the class selector .street-address, and so on.

For the nested locality/state/zip example, use BeautifulSoup's built-in select method, which mimics CSS-style selectors:

try:
  adr = item.select(".adr")[0]          # <p class="adr"> wraps all the address spans
  addressLocality = adr.select(".locality")[0].text
  state_zip = adr.find_all("span")      # returns a list of the <span> elements
  state = state_zip[-2].text
  zip_code = state_zip[-1].text
  # Might want to add some checks if the state or zip is missing, etc.
except Exception:
  addressLocality = ""

In summary:

Fix those hard-coded values to match the new class names and you should be back in business.
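As a concrete illustration, here is a minimal sketch of the old-to-new selector mapping, run against a hand-trimmed copy of the markup shown above (verify the class names against the live page, since they may change again):

```python
from bs4 import BeautifulSoup

# Hand-reduced sample of the new listing markup from above
html = """
<div class="info">
  <h2 class="n"><a class="business-name" href="#"><span>7-Eleven</span></a></h2>
  <p class="adr"><span class="street-address">1624 W Wells St</span><span class="locality">Milwaukee,</span><span>WI</span><span>53233</span></p>
  <div class="phones phone primary">(414) 342-9710</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
item = soup.select_one(".info")

name = item.select_one("a.business-name span").text   # was [itemprop='name']
street = item.select_one(".street-address").text      # was [itemprop='streetAddress']
locality = item.select_one(".locality").text          # was [itemprop='addressLocality']
phone = item.select_one(".phones").text               # was [itemprop='telephone']

print(name, street, locality, phone)
```

Each line swaps one old itemprop attribute selector for the class that now carries the same data.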

SIM

There are several issues in your existing script. You created a for loop that is supposed to traverse 19 different pages, but the parsing logic sits outside the loop, so only the content of the last page ever gets scraped. The selectors you defined no longer match those elements. Moreover, you repeated the try/except block several times, which gives your scraper a really messy look. You can define a small helper function to get rid of the IndexError or AttributeError issues. Finally, you can make use of csv.DictWriter() to write the scraped items to a CSV file.

Give it a shot:

import requests
import csv
from bs4 import BeautifulSoup

def get_text(item, path):
    # Return the element's text, or "" if the selector matches nothing
    return item.select_one(path).text if item.select_one(path) else ""

placeHolder = []

urls = ["https://www.yellowpages.com/search?search_terms=Coffee%20Shops&geo_location_terms=WI&page={}".format(page) for page in range(1,5)]
for url in urls:
    res = requests.get(url)
    soup = BeautifulSoup(res.text, "lxml")

    for item in soup.select(".info"):
        d = {}
        d['name'] = get_text(item, "a.business-name span")
        d['streetAddress'] = get_text(item, ".street-address")
        d['addressLocality'] = get_text(item, ".locality")
        d['addressRegion'] = get_text(item, ".locality + span")
        d['postalCode'] = get_text(item, ".locality + span + span")
        d['phone'] = get_text(item, ".phones")
        placeHolder.append(d)

with open("yellowpageInfo.csv", "w", newline="") as infile:
    writer = csv.DictWriter(infile, ['name','streetAddress','addressLocality','addressRegion','postalCode','phone'])
    writer.writeheader()
    writer.writerows(placeHolder)
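The .locality + span patterns above rely on the CSS adjacent-sibling combinator, which BeautifulSoup's select supports; a quick sketch against a hand-reduced copy of the address markup shows how they resolve:

```python
from bs4 import BeautifulSoup

# Hand-reduced address markup from the listing HTML in the other answer
html = ('<p class="adr"><span class="street-address">1624 W Wells St</span>'
        '<span class="locality">Milwaukee,</span><span>WI</span><span>53233</span></p>')
soup = BeautifulSoup(html, "html.parser")

# "+" matches the element immediately following its sibling
region = soup.select_one(".locality + span").text          # the span right after .locality
postal = soup.select_one(".locality + span + span").text   # the next span after that

print(region, postal)
```

This works because the state and zip spans carry no class of their own, so position relative to .locality is the only reliable hook.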