We're in the process of adding monitoring to various servers and processes on our network, and currently the various monitors email my development group if something seems amiss - no customer payments on the website in X minutes, unresponsive web services that support a process, a failed daily automated FTP to a vendor, etc. Some of these are informational and just need to be addressed soon (tomorrow or Monday is fine, for example), but others are critical and reflect actual customer outages, so they need to be resolved as soon as possible.
The problem is that there are so many emails that people are getting desensitized to them and beginning to ignore even the critical ones. Even though we have a point person who rotates each week, I still find that critical alerts will sit there, unclaimed and unanswered, sometimes for hours.
What are other people doing to better handle these kinds of monitoring and alerting situations? Should I have a dashboard or a summary email covering everything from the day? And for critical items - is a group email still the best way to go? I'm curious what others are doing to make sure things get addressed quickly without overwhelming developers into inaction.
In RHQ ( http://rhq-project.org/ ) we have alert dampening - meaning that, for example, an email is only sent once per every 5 alert occurrences.
It is also possible to have an alert disable further notifications after it fires, and then define a second, so-called recovery alert that (once the error situation goes away) re-enables notifications, so the next error situation that shows up will alert again.
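To make those two ideas concrete, here is a minimal sketch in Python. It is not RHQ's actual API - the class, method names, and parameters are made up for illustration - it just shows dampening (notify only on every Nth consecutive failure) combined with a recovery step that re-arms the alert once the error clears.

    class DampenedAlert:
        """Illustrative sketch of dampening plus a recovery alert.

        Dampening: only notify on every `every_n`-th consecutive failure.
        Recovery:  after notifying once, stay quiet until the error clears;
                   recovered() re-arms the alert for the next incident.
        """

        def __init__(self, name, notify, every_n=5):
            self.name = name
            self.notify = notify        # callable that actually sends the email/page
            self.every_n = every_n
            self.failures = 0
            self.armed = True           # False once we've alerted for this incident

        def error_detected(self):
            """Call this each time the monitored check fails."""
            if not self.armed:
                return                  # already alerted; wait for recovery
            self.failures += 1
            if self.failures % self.every_n == 0:
                self.notify(f"[ALERT] {self.name}: {self.failures} consecutive failures")
                self.armed = False

        def recovered(self):
            """Call this when the check passes again (the 'recovery alert')."""
            self.failures = 0
            self.armed = True


    # Hypothetical usage from a monitoring loop:
    payments = DampenedAlert("payments webservice", notify=print, every_n=5)
    for check_ok in (False, False, False, False, False, False, True, False):
        if check_ok:
            payments.recovered()
        else:
            payments.error_detected()

With that flow, the six consecutive failures produce exactly one email (at the fifth), and the single failure after the recovery stays quiet until failures accumulate again.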
See http://www.rhq-project.org/display/JOPR2/Alerts for more info.