Add column with filenames on a dataframe with Pandas

I created a document-term matrix from multiple txt files. The result is a dataframe with each column being a word, and each row being a file (my final goal is to visualize the document-term matrix with matplotlib).

My dataframe also has an index, but I would rather have a column with the name of each file, since each filename is a year (for example, "1905.txt", "1906.txt", etc.). The dataframe looks something like this:

   Hello  I  am
0      1  2   1
1      1  1   1
2      0  1   2

And I want something like this:

          Hello  I  am
1905.txt      1  2   1
1906.txt      1  1   1
1907.txt      0  1   2

It would be even better without the ".txt" extension.

How can I proceed?

Here's my current code:

from sklearn.feature_extraction.text import CountVectorizer
from pathlib import Path
import pandas as pd
import numpy as np
import re

# create a list for all txt files
corpus =[]

# with pathlib, get all files in the corpus list
for fichier in Path("/Users/MyPath/files").rglob("*.txt"):
    corpus.append(fichier)

corpus.sort()

all_documents = []
for fichier_txt in corpus:
    with open(fichier_txt) as f:
        fichier_txt_chaine = f.read()
        fichier_txt_chaine = re.sub('[^A-Za-z]', ' ', fichier_txt_chaine) 
    all_documents.append(fichier_txt_chaine)

# here i am using sklearn, but this part is not relevant for my question
coun_vect = CountVectorizer(stop_words= "english")
count_matrix = coun_vect.fit_transform(all_documents)

count_array = count_matrix.toarray()
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
allDataframe = pd.DataFrame(data=count_array, columns=coun_vect.get_feature_names_out())
print(allDataframe)
allDataframe.to_csv("Matrice_doc_term.csv")

I suppose my problem is similar to the question "Python Pandas add Filename Column CSV", but I don't know how to adapt its answer to my code.

There are 2 answers

mozway (best answer)

You most likely just need to pass the index to the DataFrame constructor:

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
pd.DataFrame(data=count_array, columns=coun_vect.get_feature_names_out(),
             index=corpus)

Or, since you have Path objects in corpus and just want the filename:

pd.DataFrame(data=count_array, columns=coun_vect.get_feature_names_out(),
             index=[f.name for f in corpus])

Or for just the stem:

pd.DataFrame(data=count_array, columns=coun_vect.get_feature_names_out(),
             index=[f.stem for f in corpus])
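
To see what each of these attributes returns, here is a small standard-library sketch. The three paths are made up to match the question's example, and PurePosixPath is used only so that nothing has to exist on disk; the Path objects returned by rglob expose the same name and stem attributes. The conversion to int is an optional extra, assuming every stem really is a year.

```python
from pathlib import PurePosixPath

# Hypothetical paths mirroring the question's example; no files are read.
corpus = [
    PurePosixPath("/Users/MyPath/files/1905.txt"),
    PurePosixPath("/Users/MyPath/files/1906.txt"),
    PurePosixPath("/Users/MyPath/files/1907.txt"),
]

names = [f.name for f in corpus]  # file name including its extension
stems = [f.stem for f in corpus]  # file name without the final extension

# Assumption: every stem consists of digits only. Integer years can be
# more convenient than strings as a numeric axis when plotting.
years = [int(s) for s in stems]

print(names)
print(stems)
print(years)
```

Any of the three lists can be passed as index= to the DataFrame constructor, as long as it has one entry per row of count_array.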

Moaybe

To modify your DataFrame so that it includes a column with the filename (without the ".txt" extension) instead of the current index, you can follow these steps:

First, extract the filenames from your corpus list, remove the ".txt" extension, and use the results as the index of your DataFrame. Then reset the index so that the filenames become a regular column.

Here's how you can modify your code to achieve this:

from sklearn.feature_extraction.text import CountVectorizer
from pathlib import Path
import pandas as pd
import numpy as np
import re

# create a list for all txt files
corpus = []

# with pathlib, get all files in the corpus list
for fichier in Path("/Users/MyPath/files").rglob("*.txt"):
    corpus.append(fichier)

corpus.sort()

all_documents = []
file_names = [] # List to store file names without .txt extension
for fichier_txt in corpus:
    with open(fichier_txt) as f:
        fichier_txt_chaine = f.read()
        fichier_txt_chaine = re.sub('[^A-Za-z]', ' ', fichier_txt_chaine)
    all_documents.append(fichier_txt_chaine)
    
    # Extract the file name without .txt extension
    file_name = fichier_txt.stem
    file_names.append(file_name)

# Using sklearn (irrelevant for the current modification)
coun_vect = CountVectorizer(stop_words="english")
count_matrix = coun_vect.fit_transform(all_documents)

count_array = count_matrix.toarray()
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
allDataframe = pd.DataFrame(data=count_array, columns=coun_vect.get_feature_names_out())

# Set the file names as the index
allDataframe.index = file_names

# Reset the index to make file names a column
allDataframe.reset_index(inplace=True)
allDataframe.rename(columns={'index': 'Year'}, inplace=True)

print(allDataframe)
# index=False keeps the leftover 0..n-1 integer index out of the CSV
allDataframe.to_csv("Matrice_doc_term.csv", index=False)
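
The final steps can also be tried on a tiny hand-written matrix, without the files or scikit-learn. The sketch below is only an illustration: the counts and column names are copied from the question's example table, the stems are made up, and DataFrame.insert is swapped in as an alternative to the set-index, reset-index, rename sequence. One reason to consider it: if the vocabulary happens to contain the word "index", reset_index() names its new column "level_0" instead, and the rename would then hit the word column rather than the filenames.

```python
import io

import pandas as pd

# Counts copied from the question's example table. In the real script they
# come from count_matrix.toarray() and the vectorizer's feature names.
count_array = [[1, 2, 1], [1, 1, 1], [0, 1, 2]]
columns = ["Hello", "I", "am"]
file_names = ["1905", "1906", "1907"]  # hypothetical stems, one per row

df = pd.DataFrame(data=count_array, columns=columns)

# Put the file names in a regular first column named "Year".
df.insert(0, "Year", file_names)

# Write to an in-memory buffer instead of a file so the result can be
# inspected; index=False leaves the default integer index out of the CSV.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

print(df)
print(csv_text)
```

If the years are wanted as the index instead (for example as a plot axis), passing index=file_names to the constructor, as in the answer above, is the shorter route.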