I've done some for-fun Twitter mining. I used Twitter's streaming API and recorded tweets before, during and after a football match. Now I want to make a ggplot2 graph that shows the frequency of tweets about the football match.
In the original dataframe I have one row per tweet and a variable "created_at" which contains a date like this: Sat Dec 13 13:04:34 +0000 2014
Then I changed the time format like this:
tweets$format <- as.POSIXct(tweets$created_at, format = "%a %b %d %H:%M:%S %z %Y", tz = "")
and got this: 2014-12-13 14:04:34 CET
Because I don't need the date, I thought I could get rid of it:
tweets$Uhrzeit <- sub(".* ", "", tweets$format)
With this I have only the time left: 14:04:34.
My problem is that I want to analyse the tweet frequency with an accuracy of tweets per minute. How do I aggregate the tweets per minute? As I said earlier, every row is a tweet. I made a dataframe with just the time and a second variable containing "1". Now I want to count (aggregate, sum) the second variable for every minute. I tried to find a solution and read about the zoo library and the chron library, but they left me confused.
I hope somebody can help me.
EDIT: Reproducible data
The dataframe is a subset of this:
names(tweets)
[1] "X" "text" "retweet_count"
[4] "favorited" "truncated" "id_str"
[7] "in_reply_to_screen_name" "source" "retweeted"
[10] "created_at" "in_reply_to_status_id_str" "in_reply_to_user_id_str"
[13] "lang" "listed_count" "verified"
[16] "location" "user_id_str" "description"
[19] "geo_enabled" "user_created_at" "statuses_count"
[22] "followers_count" "favourites_count" "protected"
[25] "user_url" "name" "time_zone"
[28] "user_lang" "utc_offset" "friends_count"
[31] "screen_name" "country_code" "country"
[34] "place_type" "full_name" "place_name"
[37] "place_id" "place_lat" "place_lon"
[40] "lat" "lon" "expanded_url"
[43] "url" "timeformat"
I transformed the "created_at" variable to the "timeformat" variable, which looks like this:
tweets.df<-as.data.frame(cbind(c("2014-12-13 14:04:34 CET","2014-12-13 14:04:37 CET","2014-12-13 14:04:45 CET","2014-12-13 14:05:23 CET","2014-12-13 14:05:53 CET","2014-12-13 14:05:58 CET","2014-12-13 14:06:33 CET","2014-12-13 14:06:38 CET","2014-12-13 14:06:59 CET","2014-12-13 14:08:16 CET","2014-12-13 14:09:12 CET","2014-12-13 14:09:34 CET","2014-12-13 14:10:05 CET","2014-12-13 14:10:16 CET","2014-12-13 14:10:17 CET","2014-12-13 14:11:13 CET","2014-12-13 14:11:16 CET","2014-12-13 14:12:01 CET","2014-12-13 14:12:30 CET","2014-12-13 14:14:02 CET","2014-12-13 14:14:02 CET","2014-12-13 14:14:02 CET","2014-12-13 14:14:03 CET","2014-12-13 14:14:03 CET","2014-12-13 14:14:03 CET","2014-12-13 14:14:03 CET","2014-12-13 14:14:03 CET","2014-12-13 14:14:05 CET","2014-12-13 14:14:05 CET","2014-12-13 14:14:07 CET","2014-12-13 14:14:07 CET","2014-12-13 14:14:08 CET","2014-12-13 14:14:08 CET","2014-12-13 14:14:08 CET","2014-12-13 14:14:08 CET","2014-12-13 14:14:11 CET","2014-12-13 14:14:11 CET","2014-12-13 14:14:22 CET","2014-12-13 14:14:48 CET","2014-12-13 14:15:02 CET","2014-12-13 14:15:03 CET","2014-12-13 14:16:20 CET","2014-12-13 14:16:26 CET","2014-12-13 14:17:14 CET","2014-12-13 14:17:24 CET","2014-12-13 14:17:45 CET","2014-12-13 14:17:49 CET","2014-12-13 14:18:05 CET","2014-12-13 14:18:30 CET","2014-12-13 14:19:38 CET"),1))
colnames(tweets.df)<-c("time","freq")
I just plotted the data with stat="bin", which defaults to bins of 1/30 of the range of the data. It would be nicer to have one bin per minute.
ggplot(tweets,aes(x=timeformat)) + geom_line(stat="bin")
Given your example dataset: first, your time column as it stands contains text strings, and you want POSIXct objects:
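Something along these lines should do it (a minimal sketch, assuming the tweets.df data frame built above; the trailing "CET" in the strings is simply ignored once the format has been matched):
# parse the text timestamps into POSIXct; as.character() also covers the
# case where the column was read in as a factor
tweets.df$time <- as.POSIXct(as.character(tweets.df$time),
                             format = "%Y-%m-%d %H:%M:%S", tz = "CET")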
Then, binning by minutes is done using the function cut.POSIXt:
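For example (the name by.mins is assumed here so that it matches the table() call below):
# cut() dispatches to cut.POSIXt for POSIXct input and creates one
# factor level per minute covered by the data
by.mins <- cut(tweets.df$time, breaks = "mins")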
Then you want to split your dataframe using this and sum the column freq on the subsets:
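One way to do the split-and-sum in a single step is tapply() (a sketch, again using the example data):
# freq was built via cbind(), so it comes in as text; make it numeric first
tweets.df$freq <- as.numeric(as.character(tweets.df$freq))
# sum freq within each minute bin; minutes without any tweets come out as NA
counts <- tapply(tweets.df$freq, by.mins, sum)
counts[is.na(counts)] <- 0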
In this case, since freq is always equal to 1, this is equivalent to using table(by.mins).
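To get the per-minute ggplot2 graph you asked for, one possible follow-up (the column names minute and n are just illustrative) is to put the counts into a data frame and plot them:
library(ggplot2)
# the names of counts are the minute labels produced by cut()
plot.df <- data.frame(minute = as.POSIXct(names(counts), tz = "CET"),
                      n = as.vector(counts))
ggplot(plot.df, aes(x = minute, y = n)) + geom_line()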