Extract entire textual data from EDGAR 10-K filings using Python

I am trying to extract the entire text of a filing from a URL such as the one below. I have many URLs, so I want to automate this. I tried every code snippet posted here, but they all raise errors, e.g. AttributeError: 'NoneType' object has no attribute 'find_next'. Perhaps the open-source libraries have changed since those answers were posted, which affects the results.

Here is one link: url = r"https://www.sec.gov/Archives/edgar/data/1166036/000110465904027382/0001104659-04-027382.txt" Can anyone share working Python code? The output should contain the entire text of the filing, preferably starting from PART I, or otherwise from Item 1A, all the way to the end.

Here is one example that doesn't run: Extracting text section from (Edgar 10-K filings) HTML

Update: I ran the following on the SEC data:

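    import unicodedata
    from bs4 import BeautifulSoup as bs

    # `page` is the HTTP response for the filing URL (e.g. from requests.get)
    html = bs(page.content, "lxml")
    text = html.get_text()
    # fold compatibility characters (e.g. non-breaking spaces), then drop non-ASCII
    text = unicodedata.normalize("NFKD", text).encode('ascii', 'ignore').decode('utf8')
    # collapse the text onto a single line
    text = " ".join(text.split("\n"))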

I got the text, but also some junk like the sample below, which might be coming from the tables. Is there a way to filter it out?

<div style=""font-family: 'Times New Roman', Times, serif; font-size: 10pt;""><div style=""text-align: justify; line-height: 11.4pt; font-family: 'Times New Roman', Times, serif; font-size: 10pt; font-weight: bold;"">
<div style=""text-align: justify; line-height: 11.4pt; font-family: 'Times New Roman', Times, serif; font-size: 10pt; font-weight: bold;"">(4) MORTGAGE NOTES PAYABLE, BANK LINES OF CREDIT AND OTHER LOANS<div style=""line-height: 11.4pt;""><br style=""line-height: 11.4pt;"" /><div style=""text-align: justify; line-height: 11.4pt;""><font style=""font-size: 10pt; font-family: 'Times New Roman', Times, serif;"">At October 31, 2018, the Company has mortgage notes payable and other loans that are due in installments over various periods to fiscal 2031.  The mortgage loans bear interest rates ranging from 3.5% to 6.6% and are collateralized by real estate investments having a net carrying value of approximately $558.2 million.<div style=""line-height: 11.4pt;""><br style=""line-height: 11.4pt;"" /><div style=""text-align: justify; line-height: 11.4pt; font-family: 'Times New Roman', Times, serif; font-size: 10pt;"">Combined aggregate principal maturities of mortgage notes payable during the next five years and thereafter are as follows (in thousands):<div style=""line-height: 11.4pt;""><br style=""line-height: 11.4pt;"" /><table align=""center"" border=""0"" cellpadding=""0"" cellspacing=""0"" style=""width: 80%; font-family: 'Times New Roman', Times, serif; font-size: 10pt;""><td valign=""bottom"" style=""vertical-align: top; padding-bottom: 2px;""> <td colspan=""1"" valign=""bottom"" style=""vertical-align: bottom; padding-bottom: 2px;""> <td colspan=""2"" valign=""bottom"" style=""vertical-align: top; border-bottom: #000000 solid 2px;""><div style=""text-align: center; line-height: 11.4pt;""><font style=""font-size: 10pt; font-family: 'Times New Roman', Times, serif;"">Principal<div style=""text-align: center; line-height: 11.4pt;""><font style=""font-size: 10pt; font-family: 'Times New Roman', Times, serif;"">Repayments<td colspan=""1"" nowrap=""nowrap"" valign=""bottom"" style=""text-align: left; vertical-align: bottom; padding-bottom: 2px;""> <td colspan=""1"" 
valign=""bottom"" style=""vertical-align: bottom; padding-bottom: 2px;""> <td colspan=""2"" valign=""bottom"" style=""vertical-align: top; border-bottom: #000000 solid 2px;""><div style=""text-align: center; line-height: 11.4pt;""><font style=""font-size: 10pt; font-family: 'Times New
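
One possible way to filter this kind of residue is to strip tags in a second pass and drop everything nested inside table elements. The sketch below uses only the standard library's html.parser rather than BeautifulSoup. It assumes the junk is HTML that was entity-escaped inside the filing, so that a single get_text() call leaves literal tags behind; TextOutsideTables and strip_markup are hypothetical names, not part of any library.

```python
from html.parser import HTMLParser


class TextOutsideTables(HTMLParser):
    """Collect text nodes, skipping anything nested inside <table>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.table_depth = 0  # > 0 while we are inside at least one <table>
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.table_depth += 1

    def handle_endtag(self, tag):
        if tag == "table" and self.table_depth > 0:
            self.table_depth -= 1

    def handle_data(self, data):
        # keep only non-empty text that is not part of a table
        if self.table_depth == 0 and data.strip():
            self.parts.append(data.strip())


def strip_markup(text):
    """Return the text content of `text` with tags and table contents removed."""
    parser = TextOutsideTables()
    parser.feed(text)
    parser.close()  # flush any buffered trailing text
    return " ".join(parser.parts)
```

Running strip_markup on the text produced above would be the extra pass. Note that an unclosed table tag makes this sketch skip everything after it, so real filings may need a more forgiving rule.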

There is 1 answer

Jay

Your URL points to an amended 8-K filing (i.e. an 8-K/A), not a 10-K. 8-K filings are structured differently from 10-Ks: Item 1A does not exist in 8-Ks, and neither do the other 10-K items from 1 to 15. I added complete lists of 10-K and 8-K items below for comparison. In other words, even if you manage to get a 10-K extraction algorithm working, it won't work on 8-Ks.
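
When working through many URLs, it may help to check the form type before attempting any item extraction. A minimal sketch, assuming the URL points to a full-submission .txt file whose SEC header contains a "CONFORMED SUBMISSION TYPE:" line; form_type and is_annual_report are hypothetical helper names, and the header values in the usage comment are made up for illustration.

```python
import re

# The SEC header of a full-submission text file carries a line such as
# "CONFORMED SUBMISSION TYPE:	8-K/A".
FORM_TYPE_RE = re.compile(r"^CONFORMED SUBMISSION TYPE:(.*)$", re.MULTILINE)


def form_type(submission_text):
    """Return the form type from the SEC header, or None if it is missing."""
    match = FORM_TYPE_RE.search(submission_text)
    return match.group(1).strip() if match else None


def is_annual_report(submission_text):
    """True for 10-K and its variants (10-K/A, 10-K405, ...), False otherwise."""
    found = form_type(submission_text)
    return found is not None and found.upper().startswith("10-K")


# Usage (illustrative header, not taken from a real filing):
# header = "CONFORMED SUBMISSION TYPE:\t8-K/A\n"
# form_type(header)         -> "8-K/A"
# is_annual_report(header)  -> False
```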

I actually had to solve the same problem, extracting sections from 10-Ks, 10-Qs and 8-Ks, and developed an extraction algorithm covering about 99% of all edge cases. The algorithm is a behemoth and uses many natural language processing strategies.
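
To give a sense of why the edge cases pile up, here is a naive regex baseline. It is not the algorithm described above, only a sketch under simple assumptions: item headings start a line and look like "Item 1A", and the table of contents is avoided by keeping the longest span between a start heading and the next end heading. heading_re and naive_item_section are hypothetical names.

```python
import re


def heading_re(item):
    """Regex for a line that starts with e.g. 'Item 1A' (case-insensitive)."""
    return re.compile(
        r"^[ \t]*item\s+" + re.escape(item) + r"\b",
        re.IGNORECASE | re.MULTILINE,
    )


def naive_item_section(text, start_item, end_item):
    """Return the longest span from a start_item heading to the next end_item heading."""
    starts = [m.start() for m in heading_re(start_item).finditer(text)]
    ends = [m.start() for m in heading_re(end_item).finditer(text)]
    best = ""
    for start in starts:
        following = [end for end in ends if end > start]
        if not following:
            continue
        candidate = text[start:following[0]]
        # table-of-contents entries give very short spans; prefer the longest
        if len(candidate) > len(best):
            best = candidate
    return best.strip()
```

This breaks as soon as headings are split across lines, embedded in tables, or missing entirely, which is common in older filings.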

Python example

Here is an example illustrating how to extract item 1A and item 7 from Tesla's 10-K filing. It works for all other items too.

    from sec_api import ExtractorApi  # https://pypi.org/project/sec-api/

    extractorApi = ExtractorApi("YOUR_API_KEY")

    # Tesla 10-K filing
    filing_url = "https://www.sec.gov/Archives/edgar/data/1318605/000156459021004599/tsla-10k_20201231.htm"

    # get the standardized and cleaned text of section 1A "Risk Factors"
    section_text = extractorApi.get_section(filing_url, "1A", "text")

    # get the original HTML of section 7
    # "Management’s Discussion and Analysis of Financial Condition and Results of Operations"
    section_html = extractorApi.get_section(filing_url, "7", "html")

Output

section_text[0:1000]

includes:

ITEM 1A. RISK FACTORS\n\nYou should carefully consider the risks described below together with the other information set forth in this report, which could materially affect our business, financial condition and future results. The risks described below are not the only risks facing our company. Risks and uncertainties not currently known to us or that we currently deem to be immaterial also may materially adversely affect our business, financial condition and operating results. \n\nRisks Related to Our Ability to Grow Our Business\n\nWe may be impacted by macroeconomic conditions resulting from the global COVID-19 pandemic.\n\nSince the first quarter of 2020, there has been a worldwide impact from the COVID-19 pandemic. Government regulations and shifting social behaviors have limited or closed non-essential transportation, government functions, business activities and person-to-person interactions. In some cases, the relaxation of such trends has recently been followed by actual or...
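
Since the question mentions many URLs, the same call can be wrapped in a loop that records failures instead of stopping at the first one. This is a sketch: extract_many is a hypothetical helper, and it takes the section getter as an argument (for example extractorApi.get_section from the example above) so it carries no assumptions about the API beyond the (url, item, type) call shown there.

```python
def extract_many(get_section, filing_urls, item="1A", fmt="text"):
    """Call get_section(url, item, fmt) for each URL.

    Returns (results, failures): two dicts keyed by URL, holding the
    extracted section or a description of the error respectively.
    """
    results, failures = {}, {}
    for url in filing_urls:
        try:
            results[url] = get_section(url, item, fmt)
        except Exception as exc:  # keep going and record what went wrong
            failures[url] = repr(exc)
    return results, failures


# Usage, assuming extractorApi was created as in the example above:
# results, failures = extract_many(extractorApi.get_section, [filing_url], item="7")
```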

List of 10-K items:

  • 1 - Business
  • 1A - Risk Factors
  • 1B - Unresolved Staff Comments
  • 2 - Properties
  • 3 - Legal Proceedings
  • 4 - Mine Safety Disclosures
  • 5 - Market for Registrant’s Common Equity, Related Stockholder Matters and Issuer Purchases of Equity Securities
  • 6 - Selected Financial Data (prior to February 2021)
  • 7 - Management’s Discussion and Analysis of Financial Condition and Results of Operations
  • 7A - Quantitative and Qualitative Disclosures about Market Risk
  • 8 - Financial Statements and Supplementary Data
  • 9 - Changes in and Disagreements with Accountants on Accounting and Financial Disclosure
  • 9A - Controls and Procedures
  • 9B - Other Information
  • 10 - Directors, Executive Officers and Corporate Governance
  • 11 - Executive Compensation
  • 12 - Security Ownership of Certain Beneficial Owners and Management and Related Stockholder Matters
  • 13 - Certain Relationships and Related Transactions, and Director Independence
  • 14 - Principal Accountant Fees and Services
  • 15 - Exhibits and Financial Statement Schedules

List of 8-K items:

  • 1.01: Entry into a Material Definitive Agreement
  • 1.02: Termination of a Material Definitive Agreement
  • 1.03: Bankruptcy or Receivership
  • 1.04: Mine Safety - Reporting of Shutdowns and Patterns of Violations
  • 2.01: Completion of Acquisition or Disposition of Assets
  • 2.02: Results of Operations and Financial Condition
  • 2.03: Creation of a Direct Financial Obligation or an Obligation under an Off-Balance Sheet Arrangement of a Registrant
  • 2.04: Triggering Events That Accelerate or Increase a Direct Financial Obligation or an Obligation under an Off-Balance Sheet Arrangement
  • 2.05: Costs Associated with Exit or Disposal Activities
  • 2.06: Material Impairments
  • 3.01: Notice of Delisting or Failure to Satisfy a Continued Listing Rule or Standard; Transfer of Listing
  • 3.02: Unregistered Sales of Equity Securities
  • 3.03: Material Modifications to Rights of Security Holders
  • 4.01: Changes in Registrant's Certifying Accountant
  • 4.02: Non-Reliance on Previously Issued Financial Statements or a Related Audit Report or Completed Interim Review
  • 5.01: Changes in Control of Registrant
  • 5.02: Departure of Directors or Certain Officers; Election of Directors; Appointment of Certain Officers; Compensatory Arrangements of Certain Officers
  • 5.03: Amendments to Articles of Incorporation or Bylaws; Change in Fiscal Year
  • 5.04: Temporary Suspension of Trading Under Registrant's Employee Benefit Plans
  • 5.05: Amendments to the Registrant's Code of Ethics, or Waiver of a Provision of the Code of Ethics
  • 5.06: Change in Shell Company Status
  • 5.07: Submission of Matters to a Vote of Security Holders
  • 5.08: Shareholder Nominations Pursuant to Exchange Act Rule 14a-11
  • 6.01: ABS Informational and Computational Material
  • 6.02: Change of Servicer or Trustee
  • 6.03: Change in Credit Enhancement or Other External Support
  • 6.04: Failure to Make a Required Distribution
  • 6.05: Securities Act Updating Disclosure
  • 6.06: Static Pool
  • 6.10: Alternative Filings of Asset-Backed Issuers
  • 7.01: Regulation FD Disclosure
  • 8.01: Other Events
  • 9.01: Financial Statements and Exhibits