Lexical dispersion plot with nltk is not working right

Question

272 views Asked by Kenneth Gomes At 07 May 2023 at 11:44

I have been using NLTK to make a lexical dispersion plot, as shown in the code below (please excuse the messy imports). I took 4 local PDFs, extracted their text, and tokenized it with word_tokenize. I have also tried the other option, using nltk.Text.

import PyPDF2 as pypdf
import matplotlib
import matplotlib.pyplot as plt
import csv
import nltk
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
from PyPDF2 import PdfReader
nltk.download('universal_tagset')
import textract
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import pandas as pd
import numpy as np
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet as wn
from nltk import pos_tag
import collections
from collections import defaultdict
from nltk.probability import FreqDist
from nltk.draw.dispersion import dispersion_plot
from nltk.text import Text


df = pd.read_excel(r"C:\Users\Kenneth\Desktop\linksspeech.xlsx", header=None)

textz = ""

for i in df[0]:
 file = open(i,"rb")
 read = pypdf.PdfReader(file)
 pages = len(read.pages)
 count = 0
 while count < pages:
    pagen = read.pages[count]
    count = count + 1
    textz += pagen.extract_text()

taxtf = nltk.Text(textz)

tokes = nltk.word_tokenize(textz, language = "english")

mains = ["education", "poor", "health", "poverty", "zxyzxc"]

nltk.draw.dispersion.dispersion_plot(tokes,words = mains,ignore_case=False, title="hey")

plt.show()

taxtf.dispersion_plot(words = mains)

plt.show()

Question 1: With word_tokenize, tokes is a list of strings. It works well, and a dispersion plot comes up that looks all right. But to test it, I fed it a made-up word, "zxyzxc", and the lexical dispersion plot shows a dispersion for this word as well, which cannot be right because the word is nowhere in the text. Is the lexical dispersion plot working incorrectly, or am I doing something wrong? Please help with this.
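One quick check, independent of the plotting code, is to count how often each target word occurs in the token list. The sketch below uses a short made-up token list in place of `tokes`, and `target_counts` is a hypothetical helper name. If the count for the made-up word is zero, any marks drawn beside its label come from the plot (for example, mismatched labels) and not from the text.

```python
from collections import Counter

# Stand-in for `tokes`; in the real script this would be the word_tokenize output.
sample_tokens = ["education", "reduces", "poverty", "and", "improves", "health",
                 "for", "the", "poor", "education", "matters"]
mains = ["education", "poor", "health", "poverty", "zxyzxc"]

def target_counts(tokens, words):
    """Return {word: number of exact (case-sensitive) matches in tokens}."""
    counts = Counter(tokens)
    return {word: counts[word] for word in words}

counts = target_counts(sample_tokens, mains)
for word, n in counts.items():
    print(f"{word}: {n}")
```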

Question 2: I made a lexical dispersion plot with taxtf, which is of type nltk.text.Text. Here the word offset (the x-axis of the dispersion plot) is completely wrong: the values it shows are -0.4, -0.2, 0 and 2. I understand that it has to be given a list of strings in this case, but word_tokenize doesn't work on this, and neither did taxtf.tokens. Please help with this.
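A likely cause, offered here as an assumption and not a confirmed diagnosis: `nltk.Text` expects a sequence of tokens, and a Python `str` is a sequence of characters, so `nltk.Text(textz)` would hold single characters and no target word could ever match. An empty plot would then fall back to default axis limits, which may explain the odd x-axis values. The sketch below shows the difference using only the standard library; the regex tokenizer is a simple stand-in for `word_tokenize`.

```python
import re

raw = "Education reduces poverty. Poor health limits education."

# Iterating over a string yields characters, which is what a token
# container sees when it is handed the raw string.
as_characters = list(raw)

# Simple stand-in tokenizer (an assumption; the original uses nltk.word_tokenize).
as_tokens = re.findall(r"\w+|[^\w\s]", raw)

print(as_characters[:5])
print(as_tokens[:5])
print("education" in as_characters, "education" in as_tokens)
```

In the original script, the corresponding change would be to build the text from the token list, `taxtf = nltk.Text(tokes)`, and then call `taxtf.dispersion_plot(mains)`.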

Original Q&A

There is 1 answer

**Beatrice Alex** · Answer 1 · 2023-08-18T13:57:17+00:00

For Q1, does the dispersion plot show a continuous bar for the nonsense word? If so, I had the same issue. It was fixed after installing NLTK and various other Python packages through Anaconda. I had installed them individually beforehand, and something must have been missing, causing unknown words to show up continuously across the corpus. The word labels on the y-axis were also in reversed order, so they no longer matched the data visualised in the plot; this might explain your inconsistency. Once I reinstalled everything via Anaconda, the plot was generated correctly.
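If reinstalling is not an option, one way to cross-check the library's output is to draw the plot by hand, so that each row's label comes from the same list that assigns its position. This is a minimal sketch, not NLTK's implementation; `dispersion_points` and `plot_dispersion` are hypothetical helper names, and matplotlib is imported only inside the plotting function.

```python
def dispersion_points(tokens, words):
    """Return (xs, ys): token offsets and row indices for each match.

    Row index i corresponds to words[i], so labels and data share one order.
    """
    row_of = {word: i for i, word in enumerate(words)}
    xs, ys = [], []
    for offset, token in enumerate(tokens):
        row = row_of.get(token)
        if row is not None:
            xs.append(offset)
            ys.append(row)
    return xs, ys

def plot_dispersion(tokens, words, title="Lexical Dispersion Plot"):
    """Draw a dispersion plot with matplotlib and return the Axes."""
    import matplotlib.pyplot as plt  # imported lazily; needed only for plotting

    xs, ys = dispersion_points(tokens, words)
    _, ax = plt.subplots()
    ax.plot(xs, ys, "|")
    ax.set_yticks(range(len(words)))
    ax.set_yticklabels(words)  # same list, same order as the row indices
    ax.set_ylim(-1, len(words))
    ax.set_xlabel("Word Offset")
    ax.set_title(title)
    return ax

# Small made-up token list standing in for `tokes`.
sample_tokens = ["education", "for", "the", "poor", "improves", "health"]
mains = ["education", "poor", "health", "poverty", "zxyzxc"]
xs, ys = dispersion_points(sample_tokens, mains)
print(list(zip(xs, ys)))
```

Calling `plot_dispersion(tokes, mains)` and then showing the figure should leave the row for `zxyzxc` empty if the word is absent from the text.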
