How can I get data from a specific class of an HTML tag using BeautifulSoup?


I want to extract the data (name, city and address) located in a div tag from an HTML file like this:

<div class="mainInfoWrapper">
    <h4 itemprop="name">name</h4>
    <div>
        <a href="/Wiki/Province/Tehran"></a>
         city
        <a href="/Wiki/City/Tehran"></a>
         Address
    </div>
</div>

I don't know how to get the data I want from that specific tag. I'm using Python with the BeautifulSoup library.


There are 3 answers

4
mhawke On BEST ANSWER

There are several <h4> tags in the source HTML, but only one <h4> with the itemprop="name" attribute, so you can search for that first. Then access the remaining values from there. Note that the following HTML is correctly reproduced from the source page, whereas the HTML in the question was not:

from bs4 import BeautifulSoup

html = '''<div class="mainInfoWrapper">
    <h4 itemprop="name">            
        NAME
        &nbsp;                          

    </h4>                           
    <div>                           
        <a href="/Wiki/Province/Tehran">PROVINCE</a> - <a href="/Wiki/City/Tehran">CITY</a> ADDRESS
    </div>                          
</div>'''

soup = BeautifulSoup(html, 'html.parser')
name_tag = soup.find('h4', itemprop='name')
addr_div = name_tag.find_next_sibling('div')
province_tag, city_tag = addr_div.find_all('a')

name, province, city = [t.text.strip() for t in (name_tag, province_tag, city_tag)]
address = city_tag.next_sibling.strip()

When run against the URL that you provided:

import requests
from bs4 import BeautifulSoup

r = requests.get('http://goo.gl/sCXNp2')
soup = BeautifulSoup(r.content, 'html.parser')
name_tag = soup.find('h4', itemprop='name')
addr_div = name_tag.find_next_sibling('div')
province_tag, city_tag = addr_div.find_all('a')

name, province, city = [t.text.strip() for t in (name_tag, province_tag, city_tag)]
address = city_tag.next_sibling.strip()

>>> print(name)
بیمارستان حضرت فاطمه (س)
>>> print(province)
تهران
>>> print(city)
تهران
>>> print(address)
یوسف آباد، خیابان بیست و یکم، جنب پارک شفق، بیمارستان ترمیمی پلاستیک فک و صورت

I'm not sure that the output renders correctly on my terminal; however, this code should produce the correct text on a properly configured terminal.
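If the page layout might vary, the same lookup can be wrapped in a function that returns None instead of raising when a tag is missing. This is only a sketch: the name extract_info and the choice of the built-in 'html.parser' are my own assumptions, and it runs against the sample HTML from this answer rather than the live page.

```python
from bs4 import BeautifulSoup

SAMPLE_HTML = """<div class="mainInfoWrapper">
    <h4 itemprop="name">
        NAME
        &nbsp;
    </h4>
    <div>
        <a href="/Wiki/Province/Tehran">PROVINCE</a> - <a href="/Wiki/City/Tehran">CITY</a> ADDRESS
    </div>
</div>"""


def extract_info(html):
    """Hypothetical helper: return a dict of the four fields, or None if the markup differs."""
    soup = BeautifulSoup(html, "html.parser")
    name_tag = soup.find("h4", itemprop="name")
    if name_tag is None:
        return None
    addr_div = name_tag.find_next_sibling("div")
    if addr_div is None:
        return None
    links = addr_div.find_all("a")
    if len(links) != 2:
        return None
    province_tag, city_tag = links
    # The address is the bare text node that follows the city link.
    address = city_tag.next_sibling
    return {
        "name": name_tag.get_text(strip=True),
        "province": province_tag.get_text(strip=True),
        "city": city_tag.get_text(strip=True),
        "address": address.strip() if isinstance(address, str) else "",
    }


info = extract_info(SAMPLE_HTML)
print(info)
```

Checking for None before unpacking avoids the AttributeError or ValueError you would otherwise get when the h4 or one of the two links is absent.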

2
Mazdak On

You can do it with the lxml.html module (lxml is a third-party package, not part of the standard library):

>>> s="""<div class="mainInfoWrapper">
...     <h4 itemprop="name">name</h4>
...     <div>
...         <a href="/Wiki/Province/Tehran"></a>
...          city
...         <a href="/Wiki/City/Tehran"></a>
...          Address
...     </div>
... </div>"""
>>> 
>>> import lxml.html
>>> document = lxml.html.document_fromstring(s)
>>> print(document.text_content().split())
['name', 'city', 'Address']

And with BeautifulSoup to get the text between your tags:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(s, 'html.parser')
>>> print(soup.text)

To get the text from a specific tag, use soup.find_all:

soup = BeautifulSoup(your_HTML_source, 'html.parser')
for line in soup.find_all('div', attrs={"class": "mainInfoWrapper"}):
    print(line.text)
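
If you want the individual values rather than one block of text, a possible variation is to look the div up by class with the class_ keyword and iterate over its stripped_strings. This is a sketch against the HTML from the question; the variable names wrapper and fragments and the 'html.parser' choice are my own assumptions.

```python
from bs4 import BeautifulSoup

html = """<div class="mainInfoWrapper">
    <h4 itemprop="name">name</h4>
    <div>
        <a href="/Wiki/Province/Tehran"></a>
         city
        <a href="/Wiki/City/Tehran"></a>
         Address
    </div>
</div>"""

soup = BeautifulSoup(html, "html.parser")
# class_ (with a trailing underscore) filters on the CSS class attribute.
wrapper = soup.find("div", class_="mainInfoWrapper")
# stripped_strings yields each non-empty text fragment with whitespace trimmed.
fragments = list(wrapper.stripped_strings)
print(fragments)
```

This relies on the fragments appearing in a fixed order, so it is only as reliable as the page layout.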
0
Vikas Ojha On

If h4 is used only once, you can do this:

name = soup.find('h4', attrs={'itemprop': 'name'})
print(name.text)
parentdiv = name.find_parent('div', class_='mainInfoWrapper')
cityaddressdiv = name.find_next_sibling('div')
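
From cityaddressdiv, the city and address could then be read as the text nodes that follow each link. The following is a sketch that assumes the markup exactly as posted in the question (two empty links, each followed by its value) and uses the built-in 'html.parser'; the variables city and address are names I introduced.

```python
from bs4 import BeautifulSoup

html = """<div class="mainInfoWrapper">
    <h4 itemprop="name">name</h4>
    <div>
        <a href="/Wiki/Province/Tehran"></a>
         city
        <a href="/Wiki/City/Tehran"></a>
         Address
    </div>
</div>"""

soup = BeautifulSoup(html, "html.parser")
name = soup.find("h4", attrs={"itemprop": "name"})
cityaddressdiv = name.find_next_sibling("div")
# Each value is the text node immediately after one of the two (empty) links.
city, address = [a.next_sibling.strip() for a in cityaddressdiv.find_all("a")]
print(name.text, city, address)
```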