Java/Scala Metrics - Codahale - Cluster/Multinode & Graphite Reporter


When using CodaHale Metrics in Java or Scala code in a clustered environment, what are the gotchas when reporting to Graphite?

If I have multiple instances of my application running and creating different metrics can Graphite cope - i.e. is reporting cumulative?

For example, suppose I have AppInstances A and B. If A has a gauge reporting 1.2 and B has the same gauge reporting 1.3, what would be the result in Graphite? Will it be an average, or will one override the other?

Are counters cumulative?

Are timers cumulative?

Or should I somehow give each instance some tag to differentiate different JVM instances?


There are 3 answers

kamaradclimber On BEST ANSWER

You can find the default behavior for the case where Graphite receives several points during the same aggregation period in aggregation-rules.conf. I think Graphite's default is to take the last point received in the aggregation period.
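For reference, carbon-aggregator rules follow the pattern `output_template (frequency) = method input_pattern`. A minimal sketch of a rule that sums per-instance counters into a cluster-wide series (the metric paths here are hypothetical, not from the question):

```
# aggregation-rules.conf
# Every 60s, sum the per-instance request counters into one "all" series.
<env>.applications.<app>.all.requests (60) = sum <env>.applications.<app>.*.requests
```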

If you might be interested in metric detail by process instance (and you probably will be at some point), you should tag instances in some way and use that tag in the metric path. Graphite is extremely useful for aggregation at request time, so finding a way to aggregate individual metrics (sum, avg, max, or more complex) won't be difficult.

One thing that might make you reluctant to have different metrics per sender process is a very volatile environment where instances change all the time (thus creating many transient metrics). Otherwise, just use ip+pid and you'll be fine.
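A minimal sketch of building such a per-instance prefix in plain Java, assuming you then pass it to Dropwizard's `GraphiteReporter.forRegistry(...).prefixedWith(prefix)` (the `apps.myapp` base path is a made-up example):

```java
import java.lang.management.ManagementFactory;
import java.net.InetAddress;
import java.net.UnknownHostException;

public class InstancePrefix {
    // Builds a per-instance Graphite prefix such as "apps.myapp.myhost.12345".
    // Dots in the hostname are replaced so they don't create extra path segments.
    public static String buildPrefix(String base) {
        String host;
        try {
            host = InetAddress.getLocalHost().getHostName().replace('.', '-');
        } catch (UnknownHostException e) {
            host = "unknown-host";
        }
        // On HotSpot JVMs, RuntimeMXBean.getName() conventionally returns "pid@hostname".
        String pid = ManagementFactory.getRuntimeMXBean().getName().split("@")[0];
        return base + "." + host + "." + pid;
    }

    public static void main(String[] args) {
        String prefix = buildPrefix("apps.myapp");
        System.out.println(prefix);
        // The prefix would then be handed to the reporter, e.g.:
        // GraphiteReporter.forRegistry(registry).prefixedWith(prefix).build(graphite);
    }
}
```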

Kevin Stewart On

We've found that the easiest way to handle this is by using per instance metrics. This way you can see how each instance is behaving independently. If you want an overall view of the cluster it's also easy to look at the sumSeries of a set of metrics by using wildcards in the metric name.

The caveat to this approach is that you are keeping track of more metrics in your graphite instance, so if you're using a hosted solution this does cost more.
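To illustrate the query side (with a hypothetical metric layout of `apps.myapp.<instance>.requests.count`), the per-instance and cluster-wide views would look something like:

```
# One instance's series:
apps.myapp.host-a.requests.count

# Cluster-wide view, aggregated at query time via a wildcard:
sumSeries(apps.myapp.*.requests.count)
```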

Jack Parsons On

I added a 'count' field to every set of metrics that I knew went in at the same time. Then I aggregated all of the values, including the counts, as 'sum'. This let me find the average, sum, and count for all metrics in a set. (Yes, Graphite's default is to take the most recent sample for a time period; you need to use the carbon-aggregator front end.)
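The arithmetic behind this approach can be sketched as follows: each instance submits its value together with `count = 1`, both are aggregated as sums, and the true cross-instance average is recovered by dividing (using the two gauge values from the question as sample data):

```java
import java.util.List;

public class CountAggregation {
    // Each instance submits (value, count = 1). If the aggregator sums both fields,
    // the stored series are sum(values) and sum(counts), and the cross-instance
    // average can be recovered at query time as sum / count.
    public static double averageFromSums(List<double[]> samples) {
        double valueSum = 0.0;
        double countSum = 0.0;
        for (double[] s : samples) {
            valueSum += s[0]; // the metric value
            countSum += s[1]; // the accompanying 'count' field, 1 per submission
        }
        return valueSum / countSum;
    }

    public static void main(String[] args) {
        // Two instances report 1.2 and 1.3 in the same period; this averages to 1.25.
        List<double[]> samples = List.of(new double[]{1.2, 1.0}, new double[]{1.3, 1.0});
        System.out.println(averageFromSums(samples));
    }
}
```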

Adding the IP address to the metric name lets you calculate relative speeds for different servers. If they're all the same type and some are 4x as fast as others, you have a problem. (I've seen this.) As noted above, adding a transitory value like an IP address creates a dead-metric problem. If you care about history you could create a special IP for 'old' and collect dead metrics there, then remove the dead entries. In fact, the number of machines in any time period would be a very useful metric.