Extracting table of holdings from (Edgar 13-F filings) TXT (pre-2013) with python

1.3k views Asked by At

I am working on extracting a table of holdings from 13-F form on EDGAR. Before 2013 holdings were given in a txt file (see example). The output I am aiming for is a pd.DataFrame with same shape as the "Form 13F Information Table" in txt file (10 columns and each line in a separate row).

I have tried to use BeautifulSoup, which turned table into a tag object, but I can't figure out how to format this to go into the dataframe as described above.

This is my code attempt:

soup2 = BeautifulSoup(requests.get(filing_url_13f).content, 'lxml')
holdings = soup2.find_all('table')

#This is my attempt to turn the content into a list:
lixt=[]
for x in soup2.find_all(['c','c','c','c','c','c','c','c','c']  ):
    for line in x:
        lixt.append(line)
x=lixt[1]
l=[]
for string in x.strings:
    l.append(repr(string))
el=l[7]

And this is where I get stuck as el returns the below. I can't split it by \n, as there are often \n within company name (AMERICAN\n EXPRESS CO).

\nAMERICAN\n EXPRESS CO COM 025816109 112,209 1,952,142 Shared-Defined 4 1,952,142 - -\nAMERICAN\n EXPRESS CO COM 025816109 990,116 17,225,400 Shared-Defined 4, 5 17,225,400 - -\nAMERICAN\n EXPRESS CO COM 025816109 48,274 839,832 Shared-Defined 4, 7 839,832 - -\nAMERICAN\n EXPRESS CO COM 025816109 111,689 1,943,100 Shared-Defined 4, 8, 11 1,943,100 - -\nAMERICAN\n EXPRESS CO COM 025816109 459,532 7,994,634 Shared-Defined 4, 10 7,994,634 - -\nAMERICAN\n EXPRESS CO COM 025816109 6,912,308 120,255,879 Shared-Defined 4, 11 120,255,879 - -\nAMERICAN\n EXPRESS CO COM 025816109 80,456 1,399,713 Shared-Defined 4, 13 1,399,713 - -\nARCHER DANIELS\n MIDLAND CO COM 039483102 163,151 5,956,600 Shared-Defined 4, 5 5,956,600 - -\nBANK OF NEW\n YORK MELLON\n CORP COM 064058100 206,661 8,041,300 Shared-Defined 4 8,041,300 - -\nBANK OF NEW\n YORK MELLON\n CORP COM 064058100 46,104 1,793,915 Shared-Defined 2, 4, 11 1,793,915 - -\nBANK OF NEW\n YORK MELLON\n CORP COM 064058100 251,827 9,798,700 Shared-Defined 4, 8, 11 9,798,700 - -\nCOCA COLA CO COM 191216100 29,000 800,000 Shared-Defined 4 800,000 - -\n

I will be very grateful for any suggestions.

1

There are 1 answers

1
Jack Fleeting On BEST ANSWER

Yes, these old EDGAR filings are terrible (not that the new ones are much better). This one is particularly bad, since longer entry lines were split into separate lines in order to get them to fit the page.

So the following should get you close enough to what you want:

import pandas as pd
from bs4 import BeautifulSoup as bs
import requests

req = requests.get('https://www.sec.gov/Archives/edgar/data/1067983/000119312512470800/d434976d13fhr.txt')

#next is a helper function to put back those longer entires
def lst_bunch(l,lenth=4):
    i=0
    while i < len(l):
        if len(l[i])<lenth:
            l[i] += l.pop(i+1)
        i += 1
    for item in l:
        if len(item)<lenth:
            lst_bunch(l,lenth)
    else:
        return l

tabs = req.text.replace('<TABLE>','xxx<TABLE>').split('xxx')
for tab in tabs[2:]:
    soup = bs(tab,'lxml')
    table = soup.select_one('table')
    lines = table.text.splitlines()
    lst_bunch(lines,30)
    for line in lines:
        print(line.strip())

Output:
Name of Issuer  Class  CUSIP     (In Thousands)   Amount      Discretion   Managers     Sole      Shared  None
AMERICAN  EXPRESS CO    COM    025816109      110,999     1,952,142 Shared-Defined 4           1,952,142       -   -
AMERICAN  EXPRESS CO    COM    025816109      979,436    17,225,400 Shared-Defined 4, 5       17,225,400       -   -

etc.