I am trying to extract the entire text of a filing from a URL such as the one below. I have many URLs, so I need to automate this. I tried every code sample posted here, but they all raise errors, e.g. AttributeError: 'NoneType' object has no attribute 'find_next'. Perhaps the open-source libraries have changed versions since those answers were posted, which affects the results.
Here is one link: url = r"https://www.sec.gov/Archives/edgar/data/1166036/000110465904027382/0001104659-04-027382.txt" Can anyone share working Python code? The code should output the entire text, preferably starting from PART I, or otherwise from Item 1A, all the way to the end.
Here is one example that doesn't run: Extracting text section from (Edgar 10-K filings) HTML
Update: I ran the following on the SEC data:
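import unicodedata
import requests
from bs4 import BeautifulSoup as bs
# Assumption: page was fetched with requests; the User-Agent value is a placeholder
page = requests.get(url, headers={"User-Agent": "Your Name your.email@example.com"})
html = bs(page.content, "lxml")
text = html.get_text()
# Normalize Unicode, then drop any characters that have no ASCII equivalent
text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
text = " ".join(text.split("\n"))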
I got the text, but also some junk like the sample below, which may be coming from the tables. Is there a way to filter it out?
<div style=""font-family: 'Times New Roman', Times, serif; font-size: 10pt;""><div style=""text-align: justify; line-height: 11.4pt; font-family: 'Times New Roman', Times, serif; font-size: 10pt; font-weight: bold;"">
<div style=""text-align: justify; line-height: 11.4pt; font-family: 'Times New Roman', Times, serif; font-size: 10pt; font-weight: bold;"">(4) MORTGAGE NOTES PAYABLE, BANK LINES OF CREDIT AND OTHER LOANS<div style=""line-height: 11.4pt;""><br style=""line-height: 11.4pt;"" /><div style=""text-align: justify; line-height: 11.4pt;""><font style=""font-size: 10pt; font-family: 'Times New Roman', Times, serif;"">At October 31, 2018, the Company has mortgage notes payable and other loans that are due in installments over various periods to fiscal 2031. The mortgage loans bear interest rates ranging from 3.5% to 6.6% and are collateralized by real estate investments having a net carrying value of approximately $558.2 million.<div style=""line-height: 11.4pt;""><br style=""line-height: 11.4pt;"" /><div style=""text-align: justify; line-height: 11.4pt; font-family: 'Times New Roman', Times, serif; font-size: 10pt;"">Combined aggregate principal maturities of mortgage notes payable during the next five years and thereafter are as follows (in thousands):<div style=""line-height: 11.4pt;""><br style=""line-height: 11.4pt;"" /><table align=""center"" border=""0"" cellpadding=""0"" cellspacing=""0"" style=""width: 80%; font-family: 'Times New Roman', Times, serif; font-size: 10pt;""><td valign=""bottom"" style=""vertical-align: top; padding-bottom: 2px;""> <td colspan=""1"" valign=""bottom"" style=""vertical-align: bottom; padding-bottom: 2px;""> <td colspan=""2"" valign=""bottom"" style=""vertical-align: top; border-bottom: #000000 solid 2px;""><div style=""text-align: center; line-height: 11.4pt;""><font style=""font-size: 10pt; font-family: 'Times New Roman', Times, serif;"">Principal<div style=""text-align: center; line-height: 11.4pt;""><font style=""font-size: 10pt; font-family: 'Times New Roman', Times, serif;"">Repayments<td colspan=""1"" nowrap=""nowrap"" valign=""bottom"" style=""text-align: left; vertical-align: bottom; padding-bottom: 2px;""> <td colspan=""1"" 
valign=""bottom"" style=""vertical-align: bottom; padding-bottom: 2px;""> <td colspan=""2"" valign=""bottom"" style=""vertical-align: top; border-bottom: #000000 solid 2px;""><div style=""text-align: center; line-height: 11.4pt;""><font style=""font-size: 10pt; font-family: 'Times New
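One possible way to drop table markup like the sample above is to parse the HTML and skip everything inside `<table>` elements while collecting the remaining text. The following is only a minimal sketch using the standard library's `html.parser`; the class name `TextWithoutTables`, the helper `strip_tables`, and the sample string are hypothetical and not part of the original post. It assumes every `<table>` is properly closed: a fragment with an unclosed table, like the truncated sample above, would lose all text after that tag.

```python
from html.parser import HTMLParser


class TextWithoutTables(HTMLParser):
    """Collect visible text while skipping everything inside <table> elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.table_depth = 0  # greater than 0 while inside (possibly nested) tables
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.table_depth += 1

    def handle_endtag(self, tag):
        if tag == "table" and self.table_depth > 0:
            self.table_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside of any table
        if self.table_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def strip_tables(html):
    """Return the text of an HTML string with all table content removed."""
    parser = TextWithoutTables()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical, shortened fragment modeled on the junk shown above
sample = (
    '<div style="font-size: 10pt;">(4) MORTGAGE NOTES PAYABLE'
    "<table><tr><td>Principal</td><td>Repayments</td></tr></table>"
    "<div>At October 31, 2018, the Company has mortgage notes payable.</div></div>"
)
print(strip_tables(sample))
```

With Beautiful Soup, a comparable effect could probably be achieved by calling `decompose()` on each tag returned by `find_all("table")` before calling `get_text()`.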
Your URL points to an amended 8-K filing (i.e., an 8-K/A), not a 10-K. 8-K filings are structured differently from 10-Ks: Item 1A does not exist in 8-Ks, and neither do the other 10-K items from 1 to 15. I added complete lists of 10-K and 8-K items below for comparison. In other words, even if you manage to get a 10-K extraction algorithm working, it won't work on 8-Ks.
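Since the form type decides which items can exist at all, it may help to check it before attempting any extraction. EDGAR full-submission `.txt` files start with a header that contains a `CONFORMED SUBMISSION TYPE:` line. The sketch below reads that line; the function names `get_form_type` and `is_10k` are hypothetical, and the sample header is a shortened illustration rather than the real header of the filing in question.

```python
import re

# Matches the form-type line in the header of an EDGAR full-submission text file
FORM_TYPE_RE = re.compile(
    r"^[ \t]*CONFORMED SUBMISSION TYPE:[ \t]*(.+?)[ \t]*$", re.MULTILINE
)


def get_form_type(submission_text):
    """Return the form type (e.g. '10-K', '8-K/A'), or None if no header line is found."""
    match = FORM_TYPE_RE.search(submission_text)
    return match.group(1).strip() if match else None


def is_10k(submission_text):
    """True for 10-K filings and their variants, such as 10-K/A."""
    form_type = get_form_type(submission_text)
    return form_type is not None and form_type.startswith("10-K")


# Illustrative header only; a real header has many more fields
sample_header = (
    "ACCESSION NUMBER:\t\t0001104659-04-027382\n"
    "CONFORMED SUBMISSION TYPE:\t8-K/A\n"
)
print(get_form_type(sample_header))
print(is_10k(sample_header))
```

Filtering the URL list this way would let 8-K and 10-K filings be routed to separate extraction logic instead of failing on missing items.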
I actually had to solve the same problem, extracting sections from 10-Ks, 10-Qs and 8-Ks, and developed an extraction algorithm that covers about 99% of all edge cases. The algorithm is a behemoth and uses many natural language processing strategies.
Python example
Here is an example illustrating how to extract item 1A and item 7 from Tesla's 10-K filing. It works for all other items too.
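To illustrate the general idea of item extraction, here is a deliberately naive sketch that works on plain text. It is not the algorithm described above, only a regex-based approximation: it assumes item headings start a line as `Item <number>`, takes the last such heading (to skip the table of contents), and stops at the next item's heading. The function name `extract_item` and the sample text are hypothetical, and real filings will break these assumptions in many ways, for example when a heading is split across lines or an item is cross-referenced at the start of a line.

```python
import re


def extract_item(text, item, next_item):
    """Return the text from the last 'Item <item>' heading up to the following
    'Item <next_item>' heading, or None if the start heading is not found."""
    flags = re.IGNORECASE | re.MULTILINE
    start_pat = re.compile(rf"^\s*item\s+{re.escape(item)}\b", flags)
    end_pat = re.compile(rf"^\s*item\s+{re.escape(next_item)}\b", flags)

    starts = list(start_pat.finditer(text))
    if not starts:
        return None
    start = starts[-1]  # last match, so a table-of-contents entry is skipped
    end = end_pat.search(text, start.end())
    stop = end.start() if end else len(text)
    return text[start.start():stop].strip()


# Made-up miniature filing, for illustration only
sample = """\
TABLE OF CONTENTS
Item 1A. Risk Factors
Item 1B. Unresolved Staff Comments

PART I
Item 1A. Risk Factors
Our business is subject to risks.
Item 1B. Unresolved Staff Comments
None.
"""
print(extract_item(sample, "1A", "1B"))
```

For Item 7 the call would be `extract_item(text, "7", "7A")`, again assuming both headings appear at the start of a line.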
Output
includes:
List of 10-K items:
List of 8-K items: