I have a directory containing corpus text files. I want to create a table of word counts in which each row is a document, each column is a unique word, and each cell holds the number of times that word occurs in that document. Everything should be done in Python. Please help, thank you.
The table should look like this:
        word1   word2   word3   ...
doc1    14      5       45
doc2    6       1       0
.
.
.
import nltk
import collections
import os.path

def cleanDoc(doc):
    # Tokenize, drop English stop words and tokens of 2 characters or fewer, then stem.
    stopset = set(nltk.corpus.stopwords.words('english'))
    stemmer = nltk.PorterStemmer()
    tokens = nltk.WordPunctTokenizer().tokenize(doc)
    clean = [token.lower() for token in tokens if token.lower() not in stopset and len(token) > 2]
    final = [stemmer.stem(word) for word in clean]
    return final

path = "c://Users/Desktop/corpus files"
i=0
for file in os.listdir(path) :
    # Files are assumed to be named file0.txt, file1.txt, ...
    f = open("c://Users/Desktop/corpus files/file%d.txt" %i,'r')
    data= f.read()
    words = cleanDoc(data)
    fw = open("c://Users/Desktop/words/words%d.txt" %i,'w')
    fd = collections.Counter(words)
    #fd = nltk.FreqDist(words)
    #plot(fd)
    row_format = "{:>15}" * (len(words) + 1)
    print(row_format.format("document %d" %i, *words))
    #for
    fw.write(str(fd))
    fw.write(str(words))
    fw.close()
    i=i+1
    f.close()
I think this is fairly close to, if not exactly, what you want. In case it isn't, I tried to make things easy to change.
To produce the desired table, processing is done in two phases. In the first, the unique words in each document file of the form file<document-number>.txt are found and saved in a corresponding words<document-number>.txt file, and they are also added to a set comprising all the unique words seen among all the document files. This set is needed to produce table columns that consist of all the unique words in all the files, which is why two phases of processing are required.

In the second phase, the word files are read back in and turned back into dictionaries, which are used to fill in the corresponding columns of the table being printed.
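
As a rough illustration of the two-phase idea, here is a minimal sketch. It makes several assumptions: it keeps the per-document counts in memory instead of writing words<document-number>.txt files, it takes the document texts as a list of strings rather than reading them from disk, and it uses a simple regex tokenizer as a stand-in for cleanDoc. The names simple_tokens, count_words and format_table are made up for this example.

```python
import collections
import re


def simple_tokens(text):
    # Stand-in tokenizer (an assumption): lowercase alphabetic runs only.
    # Pass cleanDoc from the question instead to get stop-word removal and stemming.
    return re.findall(r"[a-z]+", text.lower())


def count_words(docs, tokenize=simple_tokens):
    """Phase 1: build one Counter per document and collect every unique word."""
    counts = []
    vocabulary = set()
    for text in docs:
        counter = collections.Counter(tokenize(text))
        counts.append(counter)
        vocabulary.update(counter)
    # Sort so the table columns come out in a stable order.
    return counts, sorted(vocabulary)


def format_table(counts, vocabulary, width=10):
    """Phase 2: a header row of words, then one row of counts per document."""
    row_format = ("{:>%d}" % width) * (len(vocabulary) + 1)
    lines = [row_format.format("", *vocabulary)]
    for number, counter in enumerate(counts, 1):
        # A Counter returns 0 for words the document never used.
        row = [counter[word] for word in vocabulary]
        lines.append(row_format.format("doc%d" % number, *row))
    return "\n".join(lines)


docs = ["the cat sat on the mat", "the dog sat"]
counts, vocabulary = count_words(docs)
print(format_table(counts, vocabulary))
```

The full vocabulary has to be known before any row can be printed, because a document needs a 0 in the column of every word it does not contain. That is the reason for the separate first pass.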