Lexical dispersion plot with nltk is not working right

I have been using NLTK to make a lexical dispersion plot, as you can see in the code below (please excuse the messy imports). I took four local PDFs, extracted their text, and tokenized it with word_tokenize. I have also tried the other option, building an nltk.Text object.

import PyPDF2 as pypdf
import matplotlib
import matplotlib.pyplot as plt
import csv
import nltk
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
from PyPDF2 import PdfReader
nltk.download('universal_tagset')
import textract
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import pandas as pd
import numpy as np
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet as wn
from nltk import pos_tag
import collections
from collections import defaultdict
from nltk.probability import FreqDist
from nltk.draw.dispersion import dispersion_plot
from nltk.text import Text


df = pd.read_excel(r"C:\Users\Kenneth\Desktop\linksspeech.xlsx", header=None)

textz = ""

for i in df[0]:
    file = open(i, "rb")
    read = pypdf.PdfReader(file)
    pages = len(read.pages)
    count = 0
    while count < pages:
        pagen = read.pages[count]
        count = count + 1
        textz += pagen.extract_text()

taxtf = nltk.Text(textz)

tokes = nltk.word_tokenize(textz, language = "english")

mains = ["education", "poor", "health", "poverty", "zxyzxc"]

nltk.draw.dispersion.dispersion_plot(tokes,words = mains,ignore_case=False, title="hey")

plt.show()

taxtf.dispersion_plot(words = mains)

plt.show()

Question 1: With word_tokenize, tokes is a list of strings. It works perfectly well, and a dispersion plot comes up that looks all right. But just to test it, I fed it a made-up word, "zxyzxc", and the lexical dispersion plot shows dispersion for this word as well, which cannot be right because the word is nowhere in the text. Is the lexical dispersion plot working incorrectly, or am I doing something wrong? Please help with this.

Question 2: I made a lexical dispersion plot with taxtf, which is of type nltk.text.Text. Here the word offset (the x-axis of the dispersion plot) is completely wrong: the values it shows are -0.4, -0.2, 0 and 2. I understand that we have to feed it a list of strings in this case, but word_tokenize doesn't work on this, and neither did taxtf.tokens. Please help with this.

There is 1 answer

Beatrice Alex

For Q1, does the dispersion plot show a continuous bar for the nonsense word? If so, I had the same issue. It was fixed after installing nltk and various other Python packages through Anaconda. I had installed them individually beforehand, and something must have been missing, which caused unknown words to show up continuously across the corpus. The word labels on the y-axis were also in reverse order, so they no longer matched the data plotted next to them; this might explain your inconsistency. Once I re-installed everything via Anaconda, the plot was generated correctly.
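
One way to tell whether the plot or the data is at fault is to compute the word offsets yourself and compare them with what the plot shows. The snippet below is only a rough sketch: word_offsets is a hypothetical helper (not part of NLTK), the sample sentence is made up, and a plain whitespace split stands in for nltk.word_tokenize so that it runs without any downloads. It also hints at what probably goes wrong in Q2: as far as I can tell, nltk.Text expects a list of tokens and simply iterates over whatever it is given, so a raw string ends up as one token per character.

```python
def word_offsets(tokens, word, ignore_case=False):
    """Return the positions at which `word` occurs in `tokens`.

    Hypothetical helper: it mirrors what a dispersion plot is meant to
    display, namely one mark per position where the token equals the word.
    """
    if ignore_case:
        word = word.lower()
        return [i for i, tok in enumerate(tokens) if tok.lower() == word]
    return [i for i, tok in enumerate(tokens) if tok == word]


# Made-up stand-in for the text extracted from the PDFs.
raw = "poor health and poor education keep people in poverty"

# A whitespace split stands in for nltk.word_tokenize in this sketch.
tokens = raw.split()

print(word_offsets(tokens, "poor"))    # [0, 3]
print(word_offsets(tokens, "zxyzxc"))  # [] -> nothing should be plotted

# Iterating over a raw string yields single characters, not words,
# so no "token" can ever equal a whole word.
chars = list(raw)
print(len(tokens), len(chars))         # 9 53
print(word_offsets(chars, "poor"))     # []
```

If word_offsets(tokes, "zxyzxc") comes back empty for your real tokens but the plot still shows marks for that word, the problem is most likely on the plotting side (the installation or the NLTK version) rather than in your text. For Q2, building the object from the token list, nltk.Text(tokes) instead of nltk.Text(textz), should give you word offsets on the x-axis.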