I am evaluating various system monitoring tools to find one for monitoring my Hadoop cluster. One tool that has impressed me is collectl. I have been playing around with it for a couple of days.
I am struggling to find out how to aggregate the metrics captured by collectl when using colmux.
Say, I have 10 nodes in my hadoop cluster each running collectl as a service. Using colmux I can see the performance metrics of each node in a single view (in single and multi-line formats). Great!
But what if I want an aggregate of CPU, I/O, etc. across all the nodes in the cluster? That is, I want to find out how my cluster as a whole is performing by combining the performance metrics from each node into a single set of numbers, giving me cluster-level metrics instead of node-level ones.
Any help is greatly appreciated. Thanks!
I had already answered this on the mailing list, but for the benefit of those not on it I'll repeat myself here.
That's a cool idea. So if I understand you correctly, you'd like to see some sort of total line at the bottom? I can always add it to my wish list, but no promises. I think I may also have a solution, if you don't mind doing a little extra work on your own ;) btw - can I assume you've installed readkey so you can change sort columns with the arrow keys?
If you run colmux with --noesc, it will take it out of full screen mode and simply print everything as scrolling output. If you then also include "--lines 99999" (or some big number), it will print all the output from all the remote systems so you don't miss anything. Finally, you can pipe the output through perl, python, bash, or whatever your favorite scripting tool might be and do the totals yourself: add up each column as the lines come in, and whenever you see a new header fly by, print the totals and reset the counters to 0. You could even add timestamps and maybe ultimately make it your own open source project. I bet others would find it useful too.
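To make the piping idea concrete, here is a rough Python sketch of the "sum until the next header, then print and reset" loop. It rests on assumptions about the output layout that you should check against what your colmux actually prints: that header lines start with "#", and that every data line is a hostname followed by whitespace-separated numeric columns. The names aggregate, to_number and main are made up for this example.

```python
import sys


def to_number(token):
    """Return token as a float, or None if it is not a plain number."""
    try:
        return float(token)
    except ValueError:
        return None


def aggregate(lines):
    """Yield one list of per-column totals for each block of data lines.

    Assumptions (verify against your own colmux output):
      - header lines start with '#'
      - data lines look like: <hostname> <col1> <col2> ...
    """
    totals = []
    seen_data = False
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # A new header: emit the totals gathered so far, then reset.
            if seen_data:
                yield totals
            totals = []
            seen_data = False
            continue
        fields = line.split()[1:]  # drop the hostname column
        if len(totals) < len(fields):
            totals.extend([0.0] * (len(fields) - len(totals)))
        for i, field in enumerate(fields):
            value = to_number(field)
            if value is not None:  # non-numeric columns are skipped
                totals[i] += value
        seen_data = True
    if seen_data:
        yield totals  # last block, when the input ends


def main():
    # Reads the scrolling colmux output from standard input.
    for totals in aggregate(sys.stdin):
        print("TOTAL", " ".join(f"{t:g}" for t in totals), flush=True)
```

To use it, call main() at the bottom of the script and pipe the colmux output (run with --noesc and --lines 99999 as described above) into it. Note that this sums every column blindly, so a percentage such as CPU comes out as the sum across nodes; divide by the node count if you want an average instead.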
-mark