I am conducting a textual content analysis of several web blogs and am now focusing on finding emerging trends. To do so for one blog, I coded a multi-step process:
- loop over all the posts and find the top 5 keywords in each post
- add them to a list if they are not already in it
- calculate the term frequency of every term in the list for every single post
- create a list of dictionaries in which, for each post, I save the date of the post and the tf for every single word
- create a data frame from this list of dictionaries and plot it
This all works fine, except that I get 1000 graphs in one plot, while I only care about those that peak over a certain threshold. Those should get a different set of colours or be easily recognizable in some other way, and they should appear in the legend; the rest should not. Any ideas how to do that?
Here is the code I use now, which produces an unreadable plot:
from pattern.db import Database
import pandas as pd
import matplotlib.pyplot as plt

def plot_trends(keywords_and_date_list):
    d1 = pd.DataFrame(keywords_and_date_list)
    d1.sort(inplace=True)
    grouped = d1.groupby(pd.Grouper(freq='1M', key="date")).mean()
    plt.style.use("ggplot")
    fig = plt.figure(figsize=(25,6))
    for i in d1.columns:
        if i == 'date':
            continue
        plt.plot(grouped.index, grouped[i], lw=2, label="monthly average " + i)
    plt.ylim(0,0.015)
    plt.legend(prop={'size':7})
    plt.title("Occurrence of various words in blogs")
    plt.xlabel("Post publication date")
    plt.ylabel("Term Frequency")
    plt.show()
Do any of you have ideas on how to differentiate the graphs that peak over, let's say, 0.004, and assign them a different colour set and labels?
I played with a small data set in order to achieve this with pandas' max function, but I can't get it to work.
import pandas as pd
import numpy as np
from pattern.db import date
import matplotlib.pyplot as plt
l = [dict(date=date('2015-01-02'), one=0.1, two=0.2)]
l.append(dict(date=date('2014-01-01'), one=0.2, two=0.5))
l.append(dict(date=date('2014-02-01'), one=0.5, two=0.6))
l.append(dict(date=date('2014-03-01'), one=0.1, two=0.7))
d1 = pd.DataFrame(l)
d2 = d1.set_index('date')
plt.style.use("ggplot")
fig = plt.figure(figsize=(10,6))
for i in d1.columns:
    if d1.max() >= 0.6:
        plt.plot(d1.index, d1[i], lw=2, label="monthly average " + i)
else:
    plt.plot(d1.index, d1[i], lw=2)
plt.ylim(0,1)
plt.legend(prop={'size':10})
plt.title("Occurrence of various words in Naoki's blog")
plt.xlabel("Post publication date")
plt.ylabel("Term Frequency")
plt.show()
What I would like to see as a result is one graph with a label and one without. I tried different syntaxes, but I get either two labelled graphs, a value error, or an error saying that a date type and a float are not comparable.
Your small data set script is largely correct, but it has some minor errors:
- You are missing the if i=='date': continue line. (This is the source of your 'not comparable' error.)
- Your else line is mis-indented.
- You should call plt.hold(True) to prevent the creation of new graphs.
Here is my modified version of your script. I swapped pattern.db.date for pd.to_datetime, added more random columns, and de-emphasized the low-valued data lines using alpha and linewidth.
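For reference, a minimal sketch of this approach might look like the following. It is an illustration under stated assumptions, not necessarily identical to the modified script: the helper names split_by_peak and plot_peaking_terms, the noise columns, and the grey colour for the background lines are my own hypothetical choices. Two points are worth noting. First, df.max() returns one maximum per column, so the threshold test has to look up the value for the current column; testing the whole result in an if statement is what raises the value error. Second, plt.hold was deprecated in Matplotlib 2.0 and removed in 3.0, and repeated plot calls on the same axes now overlay by default, so the sketch does not call it.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def split_by_peak(df, threshold):
    """Return (peaking, quiet) column lists based on each column's maximum."""
    peaks = df.max()  # Series holding one maximum per column
    peaking = [c for c in df.columns if peaks[c] >= threshold]
    quiet = [c for c in df.columns if peaks[c] < threshold]
    return peaking, quiet


def plot_peaking_terms(df, threshold, ax=None):
    """Plot every column; label and emphasise only those peaking over threshold."""
    if ax is None:
        fig, ax = plt.subplots(figsize=(10, 6))
    peaking, quiet = split_by_peak(df, threshold)
    for col in quiet:
        # Background lines: thin, grey, translucent, and unlabelled,
        # so they stay out of the legend.
        ax.plot(df.index, df[col], lw=1, alpha=0.3, color="grey")
    for col in peaking:
        # Highlighted lines: thicker, default colour cycle, with a legend label.
        ax.plot(df.index, df[col], lw=2, label="monthly average " + col)
    ax.set_ylim(0, 1)
    if peaking:
        ax.legend(prop={"size": 10})
    ax.set_xlabel("Post publication date")
    ax.set_ylabel("Term Frequency")
    return ax


# Toy data from the question, with dates parsed by pandas instead of pattern.db.
rows = [
    dict(date="2015-01-02", one=0.1, two=0.2),
    dict(date="2014-01-01", one=0.2, two=0.5),
    dict(date="2014-02-01", one=0.5, two=0.6),
    dict(date="2014-03-01", one=0.1, two=0.7),
]
d1 = pd.DataFrame(rows)
d1["date"] = pd.to_datetime(d1["date"])
# With the date as a sorted index, no 'date' column is left to trip the loop.
d2 = d1.set_index("date").sort_index()

# Hypothetical extra columns of low random values, to mimic many quiet terms.
rng = np.random.default_rng(0)
for k in range(5):
    d2["noise%d" % k] = rng.uniform(0.0, 0.3, size=len(d2))

ax = plot_peaking_terms(d2, threshold=0.6)
```

Call plt.show() afterwards to display the figure. For the full data set, the same function could be applied to the monthly grouped frame with threshold=0.004 and a matching y-limit.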