Trying to scrape a dynamic website in Python with requests_html


When I try to scrape this site, I run into an issue and can't figure out what's wrong. I tried using HTMLSession, but Python told me to use AsyncHTMLSession because the former can't run inside an existing event loop. When using AsyncHTMLSession, I keep running into this problem.

url = "https://www.sec.gov/ix?doc=/Archives/edgar/data/0000789019/000095017023035122/msft-20230630.htm"
session = AsyncHTMLSession()
response = session.get(url)
await response.html.arender()
await session.close() 

print(response.html)
print(response.html.html)

This is the error I get:

AttributeError                            Traceback (most recent call last)
Cell In [12], line 4
      2 session = AsyncHTMLSession()
      3 response = session.get(url)
----> 4 await response.html.arender()
      5 await session.close() 
      7 print(response.html)

AttributeError: '_asyncio.Future' object has no attribute 'html'

Any help would be greatly appreciated.

I've added await to the render call, tried passing a sleep value to it, and also tried adding await asession.close(). All of these yielded the same error.
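For reference, the traceback says that session.get(url) handed back a future rather than a response, because the call itself was never awaited. The sketch below reproduces that behaviour with the standard library only. FakeResponse and FakeAsyncSession are hypothetical stand-ins, not requests_html classes; they only assume that get() schedules blocking work on an executor and returns the resulting future.

```python
import asyncio


class FakeResponse:
    """Hypothetical stand-in for a response object that has an `html` attribute."""
    html = "<html>rendered page</html>"


class FakeAsyncSession:
    """Hypothetical stand-in for an async session whose get() returns a Future."""

    def get(self, url):
        loop = asyncio.get_running_loop()
        # Run the (pretend) blocking request in a thread; a Future comes back at once.
        return loop.run_in_executor(None, FakeResponse)


async def main():
    session = FakeAsyncSession()
    error = None

    pending = session.get("https://example.com")  # no await: this is a Future
    try:
        pending.html  # same mistake as in the question
    except AttributeError as exc:
        error = str(exc)

    response = await pending  # awaiting the Future yields the response
    return error, response.html


error, html = asyncio.run(main())
print(error)
print(html)
```

Applied to the snippet in the question, this suggests writing response = await session.get(url) before calling response.html.arender().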

There is 1 answer

Andrej Kesely

Use a different URL to load the HTML (not the Ajax-y one), for example:

from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

# original_url = 'https://www.sec.gov/ix?doc=/Archives/edgar/data/0000789019/000095017023035122/msft-20230630.htm'
new_url = "https://www.sec.gov/Archives/edgar/data/0000789019/000095017023035122/msft-20230630.htm"

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:124.0) Gecko/20100101 Firefox/124.0"
}
soup = BeautifulSoup(requests.get(new_url, headers=headers).content, "html.parser")

balance_sheets = soup.select_one("#balance_sheets ~ table")

# for example, load the table into dataframe:
df = pd.read_html(StringIO(str(balance_sheets)))[0].fillna("")
print(df)

Prints:

                                                                                           0 1     2       3  4 5     6       7  8 
0                                                                                                                                 
1                                                                              (In millions)                                      
2                                                                                                                                 
3                                                                                                                                  
4                                                                                   June 30,    2023    2023       2022    2022    
5                                                                                                                                 
6                                                                                     Assets                                       
7                                                                            Current assets:                                       
8                                                                  Cash and cash equivalents       $   34704          $   13931   
9                                                                     Short-term investments           76558              90826    
10                                                                                                                                 
11                                                                                                                                 
12                                  Total cash, cash equivalents, and short-term investments          111262             104757   
13              Accounts receivable, net of allowance for doubtful accounts of $650 and $633           48688              44261   
14                                                                               Inventories            2500               3742   
15                                                                      Other current assets           21807              16924   
16                                                                                                                                 
17                                                                                                                                 
18                                                                      Total current assets          184257             169684   
19            Property and equipment, net of accumulated depreciation of $68,251 and $59,660           95641              74398   
20                                                       Operating lease right-of-use assets           14346              13148   
21                                                                        Equity investments            9879               6891   
22                                                                                  Goodwill           67886              67524   
23                                                                    Intangible assets, net            9366              11298   
24                                                                    Other long-term assets           30601              21897   
25                                                                                                                                 
26                                                                                                                                 
27                                                                              Total assets       $  411976          $  364840   

...
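
The two URLs above differ only in the /ix?doc= viewer prefix, so the plain document URL can be derived from the viewer URL rather than typed by hand. A minimal sketch, assuming the document path is always carried in the doc query parameter; raw_document_url is a hypothetical helper name, not part of any library:

```python
from urllib.parse import parse_qs, urlsplit


def raw_document_url(viewer_url):
    """Turn an inline-viewer URL (/ix?doc=...) into the plain document URL.

    Hypothetical helper: assumes the document path sits in the `doc` query parameter.
    """
    parts = urlsplit(viewer_url)
    doc_paths = parse_qs(parts.query).get("doc")
    if parts.path != "/ix" or not doc_paths:
        return viewer_url  # not a viewer URL, so leave it unchanged
    return f"{parts.scheme}://{parts.netloc}{doc_paths[0]}"


viewer = "https://www.sec.gov/ix?doc=/Archives/edgar/data/0000789019/000095017023035122/msft-20230630.htm"
print(raw_document_url(viewer))
```

The result can then be passed to requests.get in place of a hard-coded new_url.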