Read CSV, if text match, open a html file with matching file name, and copy in text


Alright, I think I'm just missing the connectors; I'm pretty new to Python.

Goal: Read a CSV
Read all filenames in a directory
If a ROW at index(x) = a filename in the directory then
open the HTML file, and replace text at index(x) with the text from the HTML file

Code so far:

import fileinput
import csv
import os
import sys
import glob
from bs4 import BeautifulSoup

htmlfiles_path = "c:\\somedirectory\\" #path to directory containing the html files
filename_search = glob.glob("c:\\somedirectory\\*.HTM") #get list of filenames

#open csv

with open ('content.csv', mode='rt') as content_file:
    reader = csv.reader (content_file, delimiter=',')
    for row in reader:
        for field in row:
            if filename_search(some matching logic i am stuck on):
                for htmlcontentfile in glob.glob(os.path.join(path, ".HTM")):
                    markup(htmlcontentfile)
                    soup = BeatifulSoup(open(markup, "r").read())
                        content_file.write(soup.get_text())
                #i think something else goes here

I got the CSV reader to work and glob to pull a list of filenames, but I'm having trouble connecting the two. Any help would be fantastic.

I looked up other questions, and some of this code is based on them, but I didn't find anything in Python for this challenge. If there is something, point me in the right direction!

EDIT1: I'm using "wt" in the CSV open in my actual code, but that's not where it's getting stuck.

I have a folder full of files. Example:

content/d100.htm
content/d101q.htm
content/d102s.htm

As well as a CSV:
CSV File:

Title | Name | Location
President | California | d100.html

Goal: Open csv, look for a match under Location for any file from the folder "content"
If it finds a match, open the corresponding HTM file, parse just the text
Replace the field in the csv with the text content of the file

Does that make sense?

There is 1 answer

bhappyman On BEST ANSWER

Answer:

1) @barny I wouldn't be posting here if I didn't have code running. I apologize for misconstruing what I was looking for.

Anyway, I figured it out by changing my problem statement a little and using Excel to finish it up.

Original ask:

CSV with

Text | Answer | Target file content

some text | Refer to file 001.htm |
some other text | Refer to file 002.htm |

Find the file, and parse the content to the column next to it.
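For reference, the matching step of this original ask could be sketched roughly as below. This is only a sketch under assumptions, not what I ended up running: the helper names `index_folder` and `match_file` and the file name pattern are made up for illustration, and matching on the name without its extension is an assumption so that a CSV entry like d100.html can still find d100.htm on disk.

```python
import os
import re

# Hypothetical pattern: any token that ends in .htm or .html
FILENAME_RE = re.compile(r"[\w.-]+\.html?", re.IGNORECASE)


def index_folder(folder):
    """Map lower-cased file stem -> full path for each .htm/.html file in folder."""
    index = {}
    for name in os.listdir(folder):
        stem, ext = os.path.splitext(name)
        if ext.lower() in (".htm", ".html"):
            index[stem.lower()] = os.path.join(folder, name)
    return index


def match_file(cell, index):
    """Return the path of the file mentioned in a CSV cell, or None if there is none."""
    found = FILENAME_RE.search(cell)
    if not found:
        return None
    # Compare on the stem only, so .htm and .html are treated alike (an assumption)
    stem = os.path.splitext(found.group(0))[0].lower()
    return index.get(stem)
```

With a helper like this, each CSV row's Answer cell can be passed to `match_file`, and any path that comes back can be opened and parsed into the next column.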

Slightly changed ask:

Parse all htm files to a CSV, and list their respective file names. Then use Excel to match up the content.

Instead of having BSoup or Python do the matching work, I used Excel, which already has a function combination, INDEX(MATCH()), that can do the second part of my request. So I had Python and BSoup open each HTML file and put its content in the CSV. I also carried along the name of the file in another column, like so:

Files:
content/001.htm

content/002.htm

content/003.htm

Expected Format of CSV output:

Content of HTML file | File Name

Code:

import csv
import glob
import os

from bs4 import BeautifulSoup

path = "<the path>"  # folder containing the html files


def main():
    # open the output CSV once, in append mode, instead of once per html file
    out_path = os.path.join(path, 'htmlpages.csv')
    with open(out_path, 'a', encoding='utf-8', newline='') as content_file:
        writer = csv.writer(content_file, delimiter=',')  # start writer for file content to CSV
        for filepath in glob.glob(os.path.join(path, '*.HTM')):  # find the html files in the folder
            with open(filepath) as f:
                contentstuff = f.read()  # read one html file
            soup = BeautifulSoup(contentstuff, "html.parser")  # parse the html out
            fp = os.path.basename(filepath)  # keep just the file name (was filepath[-12:])
            for body_tag in soup.find_all('body'):
                # strip tabs and newlines so the text sits in a single Excel cell
                bodye = body_tag.text.replace("\t", "").replace("\n", "")
                print(bodye)  # show me the work
                writer.writerow([bodye, fp])  # do the actual writing


if __name__ == "__main__":
    main()

After the content was in the CSV, I used INDEX(MATCH()) to pair up the file names from my core file with the new CSV.
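
If you would rather skip Excel, the same lookup can probably be done in Python with a dictionary. A minimal sketch, assuming the new CSV has the content in the first column and the file name in the second (as in the expected format above), and that the core file holds the exact file name in one column. The function names `load_lookup` and `fill_column` and the column positions are hypothetical.

```python
import csv


def load_lookup(pages_csv):
    """Build {lower-cased file name: content} from the 'content, file name' CSV."""
    lookup = {}
    with open(pages_csv, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            if len(row) >= 2:
                content, filename = row[0], row[1]
                lookup[filename.strip().lower()] = content
    return lookup


def fill_column(core_rows, lookup, key_col, target_col):
    """Copy the looked-up content into target_col of each row whose key_col matches."""
    filled = []
    for row in core_rows:
        row = list(row)
        # Pad short rows so target_col always exists
        while len(row) <= target_col:
            row.append("")
        key = row[key_col].strip().lower()
        if key in lookup:
            row[target_col] = lookup[key]
        filled.append(row)
    return filled
```

Like an exact-match INDEX(MATCH()), this leaves the target cell empty when the file name has no entry in the lookup; the only difference assumed here is that the comparison ignores case and surrounding spaces.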