Add column with filenames on a dataframe with Pandas

Question


68 views Asked by Crazy chicken At 28 November 2023 at 12:57

I created a document-term matrix from multiple txt files. The result is a dataframe with each column being a word, and each row being a file (my final goal is to visualize the document-term matrix with matplotlib).

My dataframe also has an index, but I would rather have a column with the name of each file, since each filename is a year (for example, "1905.txt", "1906.txt", etc.). The dataframe looks something like this:

	Hello	I	am
0	1	2	1
1	1	1	1
2	0	1	2

And I want something like this:

	Hello	I	am
1905.txt	1	2	1
1906.txt	1	1	1
1907.txt	0	1	2

It would be even better without the ".txt".

How can I proceed?

Here's my current code:

from sklearn.feature_extraction.text import CountVectorizer
from pathlib import Path
import pandas as pd
import numpy as np
import re

# create a list for all txt files
corpus =[]

# with pathlib, get all files in the corpus list 
for fichier in Path("/Users/MyPath/files").rglob("*.txt"):
    corpus.append(fichier)

corpus.sort()

all_documents = []
for fichier_txt in corpus:
    with open(fichier_txt) as f:
        fichier_txt_chaine = f.read()
        fichier_txt_chaine = re.sub('[^A-Za-z]', ' ', fichier_txt_chaine) 
    all_documents.append(fichier_txt_chaine)

# here i am using sklearn, but this part is not relevant for my question
coun_vect = CountVectorizer(stop_words= "english")
count_matrix = coun_vect.fit_transform(all_documents)

count_array = count_matrix.toarray()
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
allDataframe = pd.DataFrame(data=count_array, columns=coun_vect.get_feature_names_out())
print(allDataframe)
allDataframe.to_csv("Matrice_doc_term.csv")

I suppose my problem is similar to this one, but I don't know how to adapt the answer to my code: Python Pandas add Filename Column CSV


There are 2 answers

Moaybe On 28 November 2023 at 13:20

To modify your DataFrame so that it includes a column with the filename (without the ".txt" extension) instead of the current index, you can follow these steps:

Extract the filenames from your corpus list, remove the ".txt" extension, and then use these filenames as the index of your DataFrame. Reset the index so that these filenames become a regular column. Here's how you can modify your code to achieve this:

from sklearn.feature_extraction.text import CountVectorizer
from pathlib import Path
import pandas as pd
import numpy as np
import re

# create a list for all txt files
corpus = []

# with pathlib, get all files in the corpus list 
for fichier in Path("/Users/MyPath/files").rglob("*.txt"):
    corpus.append(fichier)

corpus.sort()

all_documents = []
file_names = [] # List to store file names without .txt extension
for fichier_txt in corpus:
    with open(fichier_txt) as f:
        fichier_txt_chaine = f.read()
        fichier_txt_chaine = re.sub('[^A-Za-z]', ' ', fichier_txt_chaine)
    all_documents.append(fichier_txt_chaine)
    
    # Extract the file name without .txt extension
    file_name = fichier_txt.stem
    file_names.append(file_name)

# Using sklearn (irrelevant for the current modification)
coun_vect = CountVectorizer(stop_words="english")
count_matrix = coun_vect.fit_transform(all_documents)

count_array = count_matrix.toarray()
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
allDataframe = pd.DataFrame(data=count_array, columns=coun_vect.get_feature_names_out())

# Set the file names as the index
allDataframe.index = file_names

# Reset the index to make file names a column
allDataframe.reset_index(inplace=True)
allDataframe.rename(columns={'index': 'Year'}, inplace=True)

print(allDataframe)
allDataframe.to_csv("Matrice_doc_term.csv")
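
As a variation on the set-index-then-reset steps above, the filename column can also be added directly with `DataFrame.insert`. The following is only a minimal sketch on made-up data: the paths, counts, and word list are hypothetical stand-ins for `corpus`, `count_array`, and the vectorizer's feature names. It relies on the rows being in the same order as `corpus`, which holds in the code above because the documents are read in sorted `corpus` order.

```python
from pathlib import Path

import pandas as pd

# Hypothetical stand-ins for the real corpus and count matrix
corpus = [Path("files/1905.txt"), Path("files/1906.txt"), Path("files/1907.txt")]
count_array = [[1, 2, 1], [1, 1, 1], [0, 1, 2]]
words = ["Hello", "I", "am"]

df = pd.DataFrame(count_array, columns=words)

# Insert the file stems (names without ".txt") as the first column
df.insert(0, "Year", [p.stem for p in corpus])

# With no path argument, to_csv returns the CSV as a string;
# index=False keeps the default 0..n-1 index out of the output
csv_text = df.to_csv(index=False)
print(csv_text)
```

Passing `index=False` to `to_csv` is optional, but without it the leftover integer index is written as an extra unnamed column next to "Year".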

**mozway** · Accepted Answer · 2023-11-28T13:18:47+00:00

You most likely just need to pass the index to the DataFrame constructor:

# get_feature_names_out() replaces get_feature_names(), removed in scikit-learn 1.2
pd.DataFrame(data=count_array, columns=coun_vect.get_feature_names_out(),
             index=corpus)

Or, since you have Path objects in corpus and just want the filename:

pd.DataFrame(data=count_array, columns=coun_vect.get_feature_names_out(),
             index=[f.name for f in corpus])

Or for just the stem:

pd.DataFrame(data=count_array, columns=coun_vect.get_feature_names_out(),
             index=[f.stem for f in corpus])
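
To see the difference between the `name` and `stem` variants without running the vectorizer, here is a small self-contained sketch. The paths, counts, and word list are invented for illustration and simply mirror the example table in the question; in the real code they would come from `corpus`, `count_array`, and the vectorizer.

```python
from pathlib import Path

import pandas as pd

# Hypothetical paths and counts standing in for the real corpus
corpus = [Path("files/1905.txt"), Path("files/1906.txt"), Path("files/1907.txt")]
count_array = [[1, 2, 1], [1, 1, 1], [0, 1, 2]]
words = ["Hello", "I", "am"]

names = [f.name for f in corpus]  # filename with extension
stems = [f.stem for f in corpus]  # filename without the ".txt" suffix

# Row i of count_array must correspond to corpus[i] for the labels to be right
df = pd.DataFrame(data=count_array, columns=words, index=stems)
print(df)

# Rows can now be looked up by year label (a string here, not an int)
row_1906 = df.loc["1906"]
```

The index labels stay strings; if numeric years are needed later (for example on a plot axis), something like `df.index = df.index.astype(int)` should work as long as every stem is a plain year.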
