I have network traffic data for each hour of a ten-day period in an R dataset, in the following form:
Day Hour Volume Category
0 00 100 P2P
0 00 50 email
0 00 200 gaming
0 00 200 video
0 00 150 web
0 00 120 P2P
0 00 180 web
0 00 80 email
....
0 01 150 P2P
0 01 200 P2P
0 01 50 Web
...
...
10 23 100 web
10 23 200 email
10 23 300 gaming
10 23 300 gaming
As seen, a Category can appear more than once within a single hour. I need to calculate the volatility and the peak-hour-to-average-hour ratio of these different application categories.
Volatility: Standard deviation of hourly volumes divided by the hourly average.
Peak hour to avg. hour ratio: Ratio of the volume of the maximum hour to the volume of the average hour for that application.
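For a single category's hourly totals, the two definitions could be sketched in base R like this (the vector v below is invented sample data, not from the question):

```r
# Hypothetical hourly totals for one category (invented data)
v <- c(100, 200, 300, 400)

volatility <- sd(v) / mean(v)   # std. dev. of hourly volumes / hourly mean
pa_ratio   <- max(v) / mean(v)  # peak hour volume / average hour volume
```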
So how do I aggregate and calculate these two statistics for each category? I am new to R and don't know much about how to aggregate and compute the averages mentioned.
So, the final result would look something like this, where the volume for each category is first aggregated into a single 24-hour period by summing, and then the two statistics are calculated:
Category Volatility Peak to Avg. Ratio
Web 0.55 1.5
P2P 0.30 2.1
email 0.6 1.7
gaming 0.4 2.9
Edit: plyr got me as far as this.
stats = ddply(
    .data = my_data
    , .variables = .(Hour, Category)
    , .fun = function(x) {
        to_return = data.frame(
            volatility = sd(x$Volume) / mean(x$Volume)
            , pa_ratio = max(x$Volume) / mean(x$Volume)
        )
        return(to_return)
    }
)
But this is not what I was hoping for. I want the statistics per Category, where the volumes are first summed across all the days into a single 24-hour profile, and then the volatility and PA ratio are calculated. Any suggestions for improvement?
You'd need to do it in two stages (using the plyr package). First, as you pointed out, there can be multiple Day-Hour combinations for the same category, so for each category we aggregate its totals within each Hour, regardless of the day. Then you compute your stats from those hourly totals.
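The two stages could be sketched as follows. The small my_data frame below is a stand-in for your real dataset (same columns, invented values), and the grouping/summarise pattern is one plyr-based way to do it, not the only one:

```r
library(plyr)

# Small stand-in for your data frame (replace with your real my_data)
my_data <- data.frame(
    Day      = c(0, 0, 0, 1, 1),
    Hour     = c(0, 0, 1, 0, 1),
    Volume   = c(100, 120, 150, 80, 200),
    Category = c("P2P", "P2P", "P2P", "P2P", "P2P")
)

# Stage 1: sum volumes across days, giving one total per Category-Hour
hourly <- ddply(my_data, .(Category, Hour), summarise,
                Volume = sum(Volume))

# Stage 2: the two statistics per Category, over its hourly totals
stats <- ddply(hourly, .(Category), summarise,
               volatility = sd(Volume) / mean(Volume),
               pa_ratio   = max(Volume) / mean(Volume))
```

Stage 1 collapses the repeated Category rows and the Day dimension in one step; stage 2 then sees exactly one row per Category-Hour, so sd, mean, and max operate on the 24 hourly totals you described.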