Regex findall in BeautifulSoup (Python 3)


I need to get the name, the value, and the contextref of every ix:nonfraction tag. Each tag looks like this:

<ix:nonfraction name="uk-gaap:TangibleFixedAssets" contextref="FY1.END" unitref="GBP" xmlns:uk-gaap="http://www.xbrl.org/uk/gaap/core/2009-09-01" decimals="0" format="ixt:numcommadot">238,011</ix:nonfraction>.

The output I need is:

TangibleFixedAssets, FY1.END, 238,011

The string that the regex has to search contains many of these tags, so is there a way to keep all three outputs together (concatenated, or at the same index of a list)?


There is 1 answer

宏杰李 (best answer)
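import bs4

# Sample input: a single ix:nonfraction tag from the question
html = '''<ix:nonfraction name="uk-gaap:TangibleFixedAssets" contextref="FY1.END" unitref="GBP" xmlns:uk-gaap="http://www.xbrl.org/uk/gaap/core/2009-09-01" decimals="0" format="ixt:numcommadot">238,011</ix:nonfraction>'''

# Parse with the lxml parser (requires the lxml package to be installed)
soup = bs4.BeautifulSoup(html, 'lxml')

# find_all returns every matching tag, so this also works when the string contains many of them
ixs = soup.find_all('ix:nonfraction')
for ix in ixs:
    # Drop the namespace prefix: "uk-gaap:TangibleFixedAssets" -> "TangibleFixedAssets"
    name = ix['name'].split(':')[-1]
    contextref = ix['contextref']
    # The tag's text content is the reported value
    text = ix.text
    # Keep the three fields together in one list per tag
    output = [name, contextref, text]
    print(output)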

Output:

['TangibleFixedAssets', 'FY1.END', '238,011']
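
Since the question asked specifically about regex findall, here is a rough sketch of that route using only the standard library. When a pattern has several capture groups, re.findall returns one tuple per match, so the three values stay together at the same index. This assumes name comes before contextref inside the opening tag and that attribute values are double-quoted, as in the sample; the second tag (CurrentAssets, 12,500) is made up just to show more than one match. For real documents a parser like the one above is usually the more robust option.

```python
import re

# Two tags in one string; the second one is an invented example, not from the question.
html = (
    '<ix:nonfraction name="uk-gaap:TangibleFixedAssets" contextref="FY1.END" '
    'unitref="GBP" xmlns:uk-gaap="http://www.xbrl.org/uk/gaap/core/2009-09-01" '
    'decimals="0" format="ixt:numcommadot">238,011</ix:nonfraction>'
    '<ix:nonfraction name="uk-gaap:CurrentAssets" contextref="FY1.END" '
    'unitref="GBP" decimals="0" format="ixt:numcommadot">12,500</ix:nonfraction>'
)

# Assumption: name appears before contextref and values are double-quoted.
# Group 1: name without its namespace prefix, group 2: contextref, group 3: tag text.
pattern = re.compile(
    r'<ix:nonfraction\b[^>]*?\bname="(?:[^":]*:)?([^"]+)"'
    r'[^>]*?\bcontextref="([^"]+)"[^>]*>'
    r'(.*?)</ix:nonfraction>',
    re.IGNORECASE | re.DOTALL,
)

# With three groups, findall returns a list of 3-tuples, one per tag.
rows = pattern.findall(html)

for row in rows:
    # Concatenate the three fields into a single line per tag.
    print(', '.join(row))
```

Each element of rows is a (name, contextref, value) tuple, so rows[0] holds all three fields for the first tag.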