When using a runbook to pull traffic out of SCOM 2012 to pass on to a different system or to take some sort of automated action, the logical way to start one of those runbooks would be with the ‘Monitor Alert’ activity. A very simple runbook could look something like this:
In this case, the ‘Update Alert’ activity simply sets the resolution state to processed (not really useful, but it demonstrates the point). In order to test this, I have a rule set to create an alert every 15 seconds from a single machine. When we look in the Runbook Designer and SCOM after a few minutes, here’s what we see:
This all looks good. However, let’s not get ahead of ourselves. What happens during a patch cycle? Does this system recover automatically on its own? Do I get all of my alert traffic?
In order to test, let’s go ahead and reboot my SCOrch machine. Post reboot, it looks like the runbook recovered just fine in the Runbook Designer:
However, from the SCOM 2012 console:
The alert that was raised during the reboot was never processed. Rather than a reboot, let’s shut the SCOrch machine down for a few minutes to simulate a little more downtime and then bring it back up. Maybe this was just an anomaly…
Nope. So, the question becomes: what if we have an additional runbook server to take over the work? During this time of transition, one of my runbook servers in my lab is dedicated to SCOM 2007 and the other to 2012 (due to the console requirements). Time to add a 3rd runbook server with the SCOM 2012 console and IP (Integration Pack) so that this server can automatically take over the work.
Ok, my SCORCH and SCORCH3 machines are configured for primary and standby roles for the Monitor Alert runbook. Let’s see if there are any gaps in processing. We have alerts coming in every 15 seconds, so, as long as SCOrch picks up that the SCORCH machine is missing, the runbook execution should move over to SCORCH3 and there should be no gap.
Gap. This isn’t good. According to Runbook Designer, there was a gap as well that mostly lines up with what we see above.
So, how do we make sure we don’t miss any traffic? Easy…don’t use ‘Monitor Alert’ unless you are willing to potentially have gaps.
Configure ‘Monitor Date/Time’ to run every 10-15 seconds (or longer; you now have that control). Configure ‘Get Alert’ just as you did ‘Monitor Alert’ and voila! Now you have a runbook that grabs the traffic at the frequency you want AND goes back in time to grab any alerts still sitting in a New state. This also opens the door for reprocessing alerts. Say there’s a hiccup on the backend system and you know you need to resend all open alerts from the period of the downtime. Set their resolution state back to New…done.
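To make the difference concrete, here is a rough sketch in Python of the two patterns. It is only an in-memory simulation, not the SCOM or Orchestrator API: the AlertStore class, the simulate function, and the PROCESSED state ID of 15 are all made up for illustration. It also assumes the event-style monitor only sees alerts raised while it is running, which matches what the tests above showed but says nothing about how ‘Monitor Alert’ is actually implemented.

```python
from dataclasses import dataclass

NEW = 0         # SCOM's built-in "New" resolution state
PROCESSED = 15  # hypothetical custom resolution state ID, for illustration only


@dataclass
class Alert:
    alert_id: int
    raised_at: int                 # seconds since the start of the simulation
    resolution_state: int = NEW


class AlertStore:
    """In-memory stand-in for the alert database (not a real SCOM API)."""

    def __init__(self):
        self.alerts = []

    def raise_alert(self, now):
        alert = Alert(alert_id=len(self.alerts) + 1, raised_at=now)
        self.alerts.append(alert)
        return alert

    def get_alerts(self, resolution_state):
        return [a for a in self.alerts if a.resolution_state == resolution_state]


def simulate(poll_by_state, duration=120, interval=15, downtime=(30, 75)):
    """Raise one alert per interval and return the IDs left unprocessed.

    poll_by_state=False mimics an event-style monitor that only handles the
    alert raised while it is listening. poll_by_state=True mimics a timed
    trigger followed by a query for everything still in the New state.
    """
    store = AlertStore()
    for now in range(0, duration + 1, interval):
        alert = store.raise_alert(now)
        server_up = not (downtime[0] <= now < downtime[1])
        if not server_up:
            continue  # runbook server is down, so nothing is processed
        batch = store.get_alerts(NEW) if poll_by_state else [alert]
        for a in batch:
            a.resolution_state = PROCESSED
    return [a.alert_id for a in store.get_alerts(NEW)]


missed_by_monitor = simulate(poll_by_state=False)
missed_by_polling = simulate(poll_by_state=True)
print("Missed with event-style monitor:", missed_by_monitor)
print("Missed with timed poll on New state:", missed_by_polling)
```

In this toy run the event-style version leaves the alerts raised during the downtime window untouched, while the polling version sweeps them up on its first pass after the server comes back. Because the poll keys on resolution state rather than on when the alert was raised, flipping an alert back to New is enough to get it reprocessed.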
With the new runbook fired up, we see that even the alerts missed during the previous tests have now been processed:
Done!