Is there a way in python to delete several rows in an csv file?

194 views Asked by At

I'm currently working on the download of the form.idx file from sec.gov for the first quarter of 2016. Since I'm only interested in the 10-Ks, I wanted to download the file as a .csv file and delete the useless rows. I tried to filter by the form type but that didn't work out.

My code so far is the following:

import requests
import os

years = [2016]

quarters = ['QTR1']

base_path = '/Users/xyz/Desktop'

current_dirs = os.listdir(path=base_path)

for yr in years:
    if str(yr) not in current_dirs:
        os.mkdir('/'.join([base_path, str(yr)]))
    
    current_files = os.listdir('/'.join([base_path, str(yr)]))
    
    for qtr in quarters:
        local_filename =  f'{yr}-{qtr}.csv'
        
    
        local_file_path = '/'.join([base_path, str(yr), local_filename])
        
        if local_filename in current_files:
            print(f'Skipping file for {yr}, {qtr} because it is already saved.')
            continue
        
        url = f'https://www.sec.gov/Archives/edgar/full-index/{yr}/{qtr}/form.idx'
        
        r = requests.get(url, stream=True)
        with open(local_file_path, 'wb') as f:
            for chunk in r.iter_content(chunk_size=128):
                f.write(chunk)

r2 = pd.read_csv('/Users/xyz/Desktop/2016-QTR1.csv', sep=";", encoding="utf-8")
r2.head()
filt = (r2 ['Form Type'] == '10-K')
r2_10K = r2.loc[filt]
r2_10K.head()
r2_10K.to_csv('/Users/xyz/Desktop/modified.csv')

The Error message I get is:
Traceback (most recent call last):

  File "<ipython-input-5-f84e3f81f3d1>", line 61, in <module>
    filt = (r2 ['Form Type'] == '10-K')

  File "/Users/xyz/opt/anaconda3/envs/spyder-4.1.5_1/lib/python3.8/site-packages/pandas/core/frame.py", line 2906, in __getitem__
    indexer = self.columns.get_loc(key)

  File "/Users/xyz/opt/anaconda3/envs/spyder-4.1.5_1/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 2897, in get_loc
    raise KeyError(key) from err

KeyError: 'Form Type'

Maybe there's a way to just delete the rows I don't need in the file? Otherwise, I'm also thankful for any kind of help on that problem.

Many thanks in advance.

Kind regards, Elena

2

There are 2 answers

5
A l w a y s S u n n y On BEST ANSWER

This is the full working code for you, The main issue was on your csv format that you're getting from online, Full code: https://rextester.com/QUGF24653

What I did:

  1. I did skip first 10 rows
  2. Set column names after using a 3 space separator
  3. Split last column to 2 new columns
  4. Filter Form Type with "10-K"
import requests
import os
import pandas as pd

years = [2016]
quarters = ['QTR1']
base_path = '/Users/xyz/Desktop'
current_dirs = os.listdir(path=base_path)

for yr in years:
    if str(yr) not in current_dirs:
        os.mkdir('/'.join([base_path, str(yr)]))

    current_files = os.listdir('/'.join([base_path, str(yr)]))

    for qtr in quarters:
        local_filename = f'{yr}-{qtr}.csv'

        local_file_path = '/'.join([base_path, str(yr), local_filename])

        if local_filename in current_files:
            print(f'Skipping file for {yr}, {qtr} because it is already saved.')
            continue

        url = f'https://www.sec.gov/Archives/edgar/full-index/{yr}/{qtr}/form.idx'

        r = requests.get(url, stream=True)
        with open(local_file_path, 'wb') as f:
            for chunk in r.iter_content(chunk_size=128):
                f.write(chunk)

colnames=['Form Type', 'Company Name', 'CIK', 'Date Filed','File Name']
r2 = pd.read_csv('/Users/xyz/Desktop/2016-QTR1.csv', sep=r'\s{3,}', skiprows=10, encoding="utf-8", names=colnames,header=None)
r2[['Date Filed','File Name']] = r2['Date Filed'].str.split(expand=True)
filtered = (r2['Form Type'] == '10-K')
r2_10K = r2.loc[filtered]
print(r2_10K.head())

Output:

   Form Type                            Company Name      CIK  Date Filed                                    File Name
2181      10-K                       1347 Capital Corp  1606163  2016-03-21  edgar/data/1606163/0001144204-16-089184.txt
2182      10-K  1347 Property Insurance Holdings, Inc.  1591890  2016-03-17  edgar/data/1591890/0001387131-16-004603.txt
2183      10-K                1ST CONSTITUTION BANCORP  1141807  2016-03-22  edgar/data/1141807/0001141807-16-000010.txt
2184      10-K                         1ST SOURCE CORP    34782  2016-02-19    edgar/data/34782/0000034782-16-000102.txt
2185      10-K            1st Century Bancshares, Inc.  1420525  2016-03-04  edgar/data/1420525/0001437749-16-026765.txt
0
zravii On

There are various ways by which you can delete rows from a csv files. The Pandas library in Python has any number of functions by which you can alter your data from csv file. First of all import the Pandas library by the following code:

import pandas as pd

Read your csv file by the following code :

df = pd.read_csv("filename.csv")

For example if you have a data field named df which contains your csv file. You can drop rows by the indexes by the following code:

df1 = df.drop([df.index[1], df.index[2]])

There are any number of ways by which you can drop rows from a csv using Pandas. For example: by row value, by null values, by data type and what not!