I saw the question on pre-2013 13-F filings, but noticed that filings from before 2012 use yet another format. This is the original question: Extracting table of holdings from (Edgar 13-F filings) TXT (pre-2013) with python
Example in the pre-2013 format, filed in 2012:
https://www.sec.gov/Archives/edgar/data/1067983/000119312512470800/d434976d13fhr.txt
Pre-2012 example:
https://www.sec.gov/Archives/edgar/data/1067983/000095012905008251/0000950129-05-008251.txt
Before 2012, the company name, title of class and CUSIP number were not filled in on every row. Rows with blank fields therefore end up with their remaining columns shifted to the left. (See the pre-2012 format in the picture.)
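To make the shift concrete, here is a minimal sketch on made-up rows (not copied from a real filing). It assumes the table has a ruler line with `<S>` and `<C>` markers at the column starts, which I have not verified for every filing. The names `FMT`, `SAMPLE`, `column_starts` and `parse_fixed_width` are hypothetical and only serve the illustration.

```python
import re

# Made-up sample: column widths and values are assumptions for illustration.
FMT = "{:<17}{:<9}{:<12}{:<9}{:<9}"
SAMPLE = "\n".join(FMT.format(*cells).rstrip() for cells in [
    ("<S>", "<C>", "<C>", "<C>", "<C>"),                      # ruler line
    ("Example Corp", "Com", "000000AA1", "1,000", "20,000"),
    ("", "", "", "500", "10,000"),                            # blank name/class/CUSIP
    ("Sample Inc", "Com", "111111BB2", "2,500", "40,000"),
])

def column_starts(ruler):
    """Return the character offset of every <S>/<C> marker in the ruler line."""
    return [m.start() for m in re.finditer(r"<[SC]>", ruler)]

def parse_fixed_width(text, fill_columns=3):
    """Slice rows at the ruler offsets and copy blank leading cells from the row above."""
    lines = text.splitlines()
    # Assumption: the first line containing <S> is the ruler.
    ruler_idx = next(i for i, line in enumerate(lines) if "<S>" in line)
    starts = column_starts(lines[ruler_idx])
    bounds = list(zip(starts, starts[1:] + [None]))
    rows, previous = [], None
    for line in lines[ruler_idx + 1:]:
        if not line.strip():
            continue
        cells = [line[a:b].strip() for a, b in bounds]
        if previous is not None:
            for i in range(fill_columns):
                if not cells[i]:
                    cells[i] = previous[i]
        rows.append(cells)
        previous = cells
    return rows

# Splitting on whitespace loses the blank cells, so the values shift left:
print(SAMPLE.splitlines()[2].split())   # ['500', '10,000']
# Slicing at fixed offsets keeps them in place and fills the blanks:
for row in parse_fixed_width(SAMPLE):
    print(row)
```

Whether copying the blank cells from the row above is the right reading of the filing is itself an assumption; the sketch only shows why a plain whitespace split is not enough.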
Adapting the code from NoobFin and Jack Fleeting's question gives me this:
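Code:

    import requests
    from bs4 import BeautifulSoup as bs

    endpoint = r"https://www.sec.gov/Archives/edgar/data/1067983/000095012905008251/0000950129-05-008251.txt"
    headers = {'User-Agent': 'Mozilla/5.0'}
    response = requests.get(url=endpoint, headers=headers)

    def lst_bunch(l, lenth=4):
        # Merge each line shorter than `lenth` with the line that follows it.
        i = 0
        while i < len(l):
            if len(l[i]) < lenth:
                l[i] += l.pop(i + 1)
            i += 1
        # Re-run the merge for any short line that is still left;
        # return at the first line that is long enough.
        for item in l:
            if len(item) < lenth:
                lst_bunch(l, lenth)
            else:
                return l

    # Split the document so that every chunk starts at a <TABLE> tag.
    tabs = response.text.replace('<TABLE>', 'xxx<TABLE>').split('xxx')
    for tab in tabs[1:]:
        soup = bs(tab, 'html')
        table = soup.select_one('table')
        lines = table.text.splitlines()
        lst_bunch(lines, 50)
        for line in lines:
            print(line.strip())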
What I am looking for is a DataFrame which I can export to CSV (or SQL or whatever) that looks like this:
I was thinking of preparing one good example and feeding it to some ML model, but maybe I am missing something.
Thanks!