I want to extract US ZIP codes from SEC 10-K HTML files using Python.

I have tried this code:

import re
s="https://www.sec.gov/Archives/edgar/data/20/000095012310024631/c97665e10vk.htm"

zipcode = re.findall(r'\b[0-9]{5}(?:-[0-9]{4})?\b', s)
print zipcode

The output is [], whereas I need 08071-0888.

2 Answers

Yusufsn On Best Solutions

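Try this. The regex in the question runs against the URL string itself, not the page behind it, so it finds nothing. First, download the HTML with requests and parse it with BeautifulSoup. Then find all td tags with align="center" and extract the ZIP code from each one using a regex.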

from bs4 import BeautifulSoup
import requests, re

url = "https://www.sec.gov/Archives/edgar/data/20/000095012310024631/c97665e10vk.htm"

page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")

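# Only centered <td> cells are searched here.
for s in soup.find_all("td", attrs={"align": "center"}):
    # Raw string, so \d reaches the regex engine unchanged.
    zipcode = re.findall(r"(\d{5}-\d{4})", str(s))  # you can also use your own regex if you want
    if zipcode:
        print(zipcode)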

Output:

['08071-0888']
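
To see why the attempt in the question printed [], here is a minimal offline sketch. It applies the question's own regex first to the URL string and then to a piece of HTML text. The sample_html value and the find_zipcodes helper are made up for illustration; they are not taken from the filing. Note that this pattern also matches any standalone five-digit number, so on a full 10-K it may return figures that are not ZIP codes.

```python
import re

# Pattern from the question: five digits with an optional ZIP+4 suffix.
ZIP_RE = re.compile(r"\b[0-9]{5}(?:-[0-9]{4})?\b")

def find_zipcodes(text):
    """Return every ZIP-like token found in text (hypothetical helper)."""
    return ZIP_RE.findall(text)

url = "https://www.sec.gov/Archives/edgar/data/20/000095012310024631/c97665e10vk.htm"

# Hypothetical fragment standing in for the downloaded page content.
sample_html = '<td align="center">Anytown, New Jersey 08071-0888</td>'

print(find_zipcodes(url))          # the URL holds no standalone 5-digit token
print(find_zipcodes(sample_html))  # the page text is what has to be searched
```

The regex itself was fine; the fix is to pass it the downloaded page text (for example page.text from requests) instead of the URL.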
Muhammad Abbas On

Thanks for your help. I have to extract ZIP code and city information from a number of files in a folder. My code is like below, but I will change it according to your regex. The next step is to extract the city information and save it to a CSV file.
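
One possible way to approach the folder-to-CSV step is sketched below, using only the standard library. Everything here is an assumption for illustration: the function names, the "*.htm*" file pattern, the CSV column names, and above all the address layout "City, State ZIP" that the regex expects. Real filings often split the address across tags or use entities such as &nbsp;, so the pattern will likely need adjusting, or the text should first be extracted with BeautifulSoup as in the answer above.

```python
import csv
import re
from pathlib import Path

# Assumed layout: "City, State 12345" or "City, State 12345-6789".
CITY_ZIP_RE = re.compile(
    r"([A-Z][A-Za-z.\- ]*?),\s*([A-Z][A-Za-z ]*?)\s+(\d{5}(?:-\d{4})?)\b"
)

def extract_city_zip(text):
    """Return (city, zipcode) pairs found in text (hypothetical helper)."""
    return [(city.strip(), zipcode)
            for city, _state, zipcode in CITY_ZIP_RE.findall(text)]

def folder_to_csv(folder, csv_path):
    """Scan every .htm/.html file in folder and write matches to csv_path."""
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["file", "city", "zipcode"])
        for path in sorted(Path(folder).glob("*.htm*")):
            # Ignore undecodable bytes; old filings are not always clean UTF-8.
            text = path.read_text(encoding="utf-8", errors="ignore")
            for city, zipcode in extract_city_zip(text):
                writer.writerow([path.name, city, zipcode])
```

It could be called as folder_to_csv("filings", "zipcodes.csv"), where both names are placeholders for your own folder and output file.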