Alright, I think I'm just missing the connectors; I'm pretty new to Python.
Goal: Read a CSV
Read all filenames in a directory
If a ROW at index(x) = a filename in the directory then
open the HTML file, and replace text at index(x) with the text from the HTML file
Code so far:
import fileinput
import csv
import os
import sys
import glob
from bs4 import BeautifulSoup

htmlfiles_path = "c:\\somedirectory\\"  # path to directory containing the html files
filename_search = glob.glob(os.path.join(htmlfiles_path, "*.HTM"))  # get list of filenames

# open csv
with open('content.csv', mode='rt') as content_file:
    reader = csv.reader(content_file, delimiter=',')
    for row in reader:
        for field in row:
            if filename_search(some matching logic i am stuck on):
                for htmlcontentfile in glob.glob(os.path.join(htmlfiles_path, "*.HTM")):
                    # read the HTML file and keep only its text
                    with open(htmlcontentfile, "r") as markup:
                        soup = BeautifulSoup(markup.read(), "html.parser")
                    content_file.write(soup.get_text())
                    # i think something else goes here
I got the csv reader to work, and the glob call pulls a list of filenames, but I'm having trouble connecting the two. Any help would be fantastic.
I looked up other questions, and some of this code is based on them, but I didn't find anything in Python for this challenge. If there is something, point me in the right direction!
EDIT1: I'm using "wt" in the csv open in my code, but that's not where it's getting stuck.
I have a folder full of files. Example:
content/d100.htm
content/d101q.htm
content/d102s.htm
As well as a CSV:
CSV File:
Title | Name | Location
President | California | d100.html
Goal: Open csv, look for a match under Location for any file from the folder "content"
If it finds a match, open the corresponding HTM file, parse just the text
Replace the field in the csv with the text content of the file
Does that make sense?
Answer:
1) @barny I wouldn't be posting here if I didn't have code running. I apologize for misconstruing what I was looking for.
Anyway, I figured it out by changing my problem statement a little and using Excel to finish it up.
Original ask:
CSV with
Text | Answer | Target file content
some text | Refer to file 001.htm |
some other text | Refer to file 002.htm |
Find the file, and parse the content to the column next to it.
Slightly changed ask:
Parse all htm files to a csv, and list their respective file name. Then use Excel to match up the content.
Instead of having BeautifulSoup or Python do the matching work, I used Excel, which already has a formula pattern, INDEX(MATCH()), that can do the second part of my request. So I had Python and BeautifulSoup open each HTML file and put its text in the CSV. I also carried along the name of the file in another column, like so:
Files:
content/001.htm
content/002.htm
content/003.htm
Expected Format of CSV output:
Content of HTML file | File Name
Code:
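A minimal sketch of this step, not the exact script from the post. It assumes the files sit in one folder and are UTF-8 encoded; the names TextExtractor, html_to_text and html_folder_to_csv are made up for illustration. To stay free of third-party dependencies it swaps in the standard library's html.parser for the text extraction; with BeautifulSoup installed, BeautifulSoup(markup, "html.parser").get_text() would fill roughly the same role.

```python
import csv
import glob
import os
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text of an HTML document, skipping script/style bodies."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(markup):
    """Return the visible text of an HTML string on a single line."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    # Collapse runs of whitespace so each file fits in one tidy CSV cell.
    return " ".join("".join(parser.parts).split())


def html_folder_to_csv(folder, out_csv):
    """Write one row per .htm file: [text content, file name]."""
    # Accept both .htm and .HTM; sort for a stable row order.
    paths = sorted(p for p in glob.glob(os.path.join(folder, "*"))
                   if p.lower().endswith(".htm"))
    with open(out_csv, "w", newline="", encoding="utf-8") as out_file:
        writer = csv.writer(out_file)
        writer.writerow(["Content of HTML file", "File Name"])
        for path in paths:
            with open(path, encoding="utf-8", errors="replace") as html_file:
                text = html_to_text(html_file.read())
            writer.writerow([text, os.path.basename(path)])
    return len(paths)
```

Calling html_folder_to_csv("content", "parsed.csv") would then produce a CSV in the format shown above; "parsed.csv" is only a placeholder name.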
After the content was in the CSV, I used an INDEX(MATCH()) formula in Excel to pair up the file names from my core file with the rows of the new CSV.
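For anyone who would rather stay in Python, the pairing step can be sketched with a dictionary lookup, which plays the role of INDEX(MATCH()). This is an illustration only: it assumes the core rows look like the "Text | Answer | Target file content" example, that the Answer column mentions a file name such as 001.htm, and that the names FILE_NAME and fill_target_content are hypothetical.

```python
import re

# Assumed pattern: a file name ending in .htm or .html somewhere in the Answer text.
FILE_NAME = re.compile(r"[\w-]+\.html?", re.IGNORECASE)


def fill_target_content(core_rows, parsed_rows):
    """core_rows: [text, answer, ...]; parsed_rows: [content, file name]."""
    # MATCH: index the parsed rows by lower-cased file name.
    lookup = {name.lower(): content for content, name in parsed_rows}
    filled = []
    for text, answer, *_ in core_rows:
        found = FILE_NAME.search(answer)
        # INDEX: fetch the content for the matched name, or leave the cell empty.
        content = lookup.get(found.group(0).lower(), "") if found else ""
        filled.append([text, answer, content])
    return filled
```

Both inputs are plain lists of rows, so they can come straight from csv.reader (minus any header row), and the result can be written back with csv.writer. The lookup is exact apart from letter case, so a reference to 001.html will not match a file named 001.htm.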