Unable to extract web content (href attributes) with Python 3.7

I am unable to scrape the href attributes from "https://www.theaic.co.uk/aic/analyse-investment-companies". I'm using Python 3.7 and have tried Scrapy with Splash, and also Selenium, without success.

There is 1 answer

Answer by Andrej Kesely

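The table you see on the page is inside an <iframe>, so its links are not part of the HTML returned for the page URL itself. You have to load the source of the iframe first and select the links from that document: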

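import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = 'https://www.theaic.co.uk/aic/analyse-investment-companies'

# Fetch the outer page and locate the <iframe> that holds the table.
page = BeautifulSoup(requests.get(url).content, 'html.parser')
# The iframe src is protocol-relative, so resolve it against the page URL.
iframe_url = urljoin(url, page.article.iframe['src'])

# Fetch the iframe document and print the link of every fund name.
soup = BeautifulSoup(requests.get(iframe_url).content, 'html.parser')
for a in soup.select('.gridFundName a'):
    print(a['href'])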

Prints:

http://www.theaic.co.uk/3IN
http://www.theaic.co.uk/AAIF
http://www.theaic.co.uk/ADIG
http://www.theaic.co.uk/AEMC
http://www.theaic.co.uk/AJIT
http://www.theaic.co.uk/ALAI
http://www.theaic.co.uk/ABD
http://www.theaic.co.uk/ANII
http://www.theaic.co.uk/ANW
http://www.theaic.co.uk/ASCI
http://www.theaic.co.uk/AASC
http://www.theaic.co.uk/AAS
http://www.theaic.co.uk/ASEI
http://www.theaic.co.uk/ASLI
http://www.theaic.co.uk/ASL
http://www.theaic.co.uk/ASIT
http://www.theaic.co.uk/ASIZ
http://www.theaic.co.uk/AIF
http://www.theaic.co.uk/AIFZ
http://www.theaic.co.uk/AEWU
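
To try the iframe step without hitting the site, here is a minimal, dependency-free sketch of the same idea: find each <iframe> in a page and turn its src into an absolute URL. It uses only the standard library rather than BeautifulSoup, and the markup, the example.com URLs, and the names IframeSrcParser and iframe_urls are hypothetical, made up for illustration rather than taken from the real page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class IframeSrcParser(HTMLParser):
    """Collect the src attribute of every <iframe> start tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def iframe_urls(page_url, html):
    """Return absolute URLs for all iframes found in html."""
    parser = IframeSrcParser()
    parser.feed(html)
    # urljoin handles protocol-relative ("//host/path") and
    # root-relative ("/path") values by resolving them against page_url.
    return [urljoin(page_url, src) for src in parser.sources]


# Hypothetical markup standing in for the outer page.
SAMPLE_HTML = """
<article>
  <iframe src="//example.com/embedded/table"></iframe>
  <iframe src="/local/frame"></iframe>
</article>
"""

urls = iframe_urls("https://example.com/some/page", SAMPLE_HTML)
for u in urls:
    print(u)
```

Each resolved URL can then be fetched with requests.get and parsed separately, as in the answer above. Resolving with urljoin instead of prepending 'https:' keeps the code working if the site ever switches the iframe src to a root-relative or absolute URL.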