Beautifulsoup - Extract text from next div sub tag based on previous div sub tag


I'm trying to extract the text in the span of the next div, based on the span text of the previous div. Below is the HTML content:

<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:37px; top:161px; width:38px; height:13px;"><span style="font-family: b'Times-Bold'; font-size:13px">Name
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:85px; top:161px; width:58px; height:13px;"><span style="font-family: b'Helvetica'; font-size:13px">Ven
    <br></span></div>

I'm trying to find the text using:

n_field = soup.find('span', text="Name\n")

I then try to get the text from the next sibling using:

n_field.next_sibling()

However, due to the "\n" in the field, I'm unable to find the span and then extract the next sibling's text.

In short, I'm trying to build a dict in the format below:

{"Name": "Ven"}

Any help or idea on this is appreciated.


There are 2 answers

5
dudko On BEST ANSWER

You could use re instead of bs4.

import re

html = """
    <div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:37px; top:161px; width:38px; height:13px;">
        <span style="font-family: b'Times-Bold'; font-size:13px">Name
            <br>
        </span>
    </div>
    <div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:85px; top:161px; width:58px; height:13px;">
        <span style="font-family: b'Helvetica'; font-size:13px">Ven
            <br>
        </span>
    </div>
    """

mo = re.search(r'(Name).*?<span.*?13px">(.*?)\n', html, re.DOTALL)
print(mo.groups())

# for consecutive cases use re.finditer or re.findall
html *= 5
mo = re.finditer(r'(Name).*?<span.*?13px">(.*?)\n', html, re.DOTALL)

for match in mo:
    print(match.groups())

for (key, value) in re.findall(r'(Name).*?<span.*?13px">(.*?)\n', html, re.DOTALL):
    print(key, value)
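
To get from the matched pairs to the dict the question asks for, something like the following might work. It is only a sketch: it assumes the spans strictly alternate label, value, label, value, and the shortened `html` string and the names `span_texts` and `pairs` are illustrative, not from the original post.

```python
import re

# Shortened, hypothetical version of the markup from the question.
html = (
    '<div style="left:37px;"><span style="font-size:13px">Name\n<br></span></div>'
    '<div style="left:85px;"><span style="font-size:13px">Ven\n    <br></span></div>'
)

# Capture the text that follows each opening <span ...> tag, up to the next tag,
# and strip the trailing newline and indentation.
span_texts = [t.strip() for t in re.findall(r"<span[^>]*>([^<]*)", html)]

# Assumption: spans alternate label, value, label, value, ...
pairs = dict(zip(span_texts[0::2], span_texts[1::2]))
print(pairs)
```

If the labels and values do not alternate cleanly, the pairing step would need a different rule, for example matching on the `left`/`top` coordinates in the style attribute.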
0
jnvilo On

I had a go at this and, for some reason, even after removing the \n I could not get nextSibling() to work, so I tried a different tactic, shown below:

from bs4 import BeautifulSoup

# Let's get rid of the \n
html = """<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:37px; top:161px; width:38px; height:13px;"><span style="font-family: b'Times-Bold'; font-size:13px">Name<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:85px; top:161px; width:58px; height:13px;"><span style="font-family: b'Helvetica'; font-size:13px">Ven<br></span></div>""".replace("\n", "")
# Name the parser explicitly to avoid the "no parser was explicitly specified" warning
soup = BeautifulSoup(html, "html.parser")
span_list = soup.find_all("span")
# The first span holds the key, the second the value; strip stray whitespace
result = {span_list[0].text.strip(): span_list[1].text.strip()}

This gives the result:

{'Name': 'Ven'}
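
One possible reason the sibling lookup fails: the "Name" span and the "Ven" span live in different divs, so they are not siblings of each other; only the enclosing divs are. A rough sketch of navigating that way without stripping the newlines first is below. It assumes a bs4 version that supports the `string=` argument, and the helper name `value_after` and the shortened `html` are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the markup from the question (newlines kept).
html = (
    '<div style="left:37px; top:161px;"><span style="font-size:13px">Name\n'
    '<br></span></div>'
    '<div style="left:85px; top:161px;"><span style="font-size:13px">Ven\n'
    '    <br></span></div>'
)

soup = BeautifulSoup(html, "html.parser")


def value_after(soup, label):
    """Return the text of the div that follows the div containing `label`."""
    # Match the text node itself, tolerating surrounding whitespace such as "\n".
    pattern = re.compile(r"^\s*" + re.escape(label) + r"\s*$")
    text_node = soup.find(string=pattern)
    if text_node is None:
        return None
    # Climb to the enclosing div, then step across to the next div sibling.
    label_div = text_node.find_parent("div")
    value_div = label_div.find_next_sibling("div") if label_div else None
    return value_div.get_text(strip=True) if value_div else None


result = {"Name": value_after(soup, "Name")}
print(result)
```

Searching for the text node rather than the span sidesteps the fact that the span also contains a `<br>`, which leaves its `.string` as None.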