SCOM 2012 – SDK Service Failover

If your SCOrch solutions connect to SCOM, they depend on the SDK service of one SCOM management server. To keep them highly available, you need a way to fail over to another SDK service when that management server becomes unavailable. The right way to do this is a network-side solution that load balances the SDK services. Sometimes, however, you need a quick and dirty hack to test things out. Maybe you are out of IP addresses, or your company has layers of process to navigate before an enterprise solution can be deployed. If you can do this the right way, please do. If not, here goes…

NOTE: this could be used to manage failover for any platform SCOrch is communicating with.  Again, this is not meant to be a permanent solution but something that could be used in a lab environment where resources are tight.

Step 1: Modify the hosts File

Yep – how about that blast from the past?  Identify which SCOM management server (MS) you want to be primary, and grab its IP address.  Create an alias for your SCOM SDK and map it to the IP address of the primary MS:

image

Step 2: Create Your SCOM SDK Variables

To keep the example simple, I am using just two SCOM servers: a primary and a secondary.

image 

image

Lastly, we need a counter to manage the failover. 

image

Step 3: Create the Failover Runbook

image

Monitor Counter – we need to decide how many failures we are OK with before we fail over.  I like to start with 3 and adjust from there if needed.

image

Run .Net Script – we need to determine which SCOM SDK is currently active and then which one to fail over to.

image

And of course, publish the IP we want to make active to the data bus ($Output).

image

The part of the replacement text that runs off the screen reads “SCOM_SDK”, the alias that was put into the hosts file in step 1.
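Since the screenshots of the script are not reproduced here, the swap logic can be sketched as follows. This is a minimal illustration in Python, not the script from the runbook (which runs in a Run .Net Script activity). The function name `failover_hosts` and the 10.0.0.x addresses are made up for the example; it assumes the alias is mapped on a single hosts-file line and drops any trailing comment on that line.

```python
def failover_hosts(hosts_text, alias, primary_ip, secondary_ip):
    """Point `alias` at whichever of the two IPs is not currently active.

    Returns (new_hosts_text, new_active_ip). Hypothetical helper for
    illustration only; the real runbook does this in a Run .Net Script.
    """
    new_lines = []
    new_ip = None
    for line in hosts_text.splitlines():
        # Ignore anything after '#', then split into "IP name [name ...]".
        fields = line.split("#", 1)[0].split()
        names = [name.lower() for name in fields[1:]]
        if len(fields) >= 2 and alias.lower() in names:
            current_ip = fields[0]
            # If the primary is active, fail over to the secondary, and vice versa.
            new_ip = secondary_ip if current_ip == primary_ip else primary_ip
            new_lines.append("\t".join([new_ip] + fields[1:]))
        else:
            new_lines.append(line)  # leave every other entry untouched
    if new_ip is None:
        raise ValueError("alias %r not found in hosts file" % alias)
    return "\n".join(new_lines) + "\n", new_ip


# Example with placeholder addresses:
sample = "127.0.0.1 localhost\n10.0.0.1 SCOM_SDK\n"
updated, active_ip = failover_hosts(sample, "SCOM_SDK", "10.0.0.1", "10.0.0.2")
```

The returned IP corresponds to the value the runbook publishes to the data bus, and the returned text is what would be written back to the hosts file.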

Modify Counter – since we have now failed over to what we assume is an active SDK service, we need to reset the counter to 0.

image

That’s it.  We now have a fully functional, complete hack of a failover solution.  From here on, when you register SDK connections for the SCOM IP (Integration Pack), use the alias specified in the hosts file rather than the actual name of any of your SCOM management servers.
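Before registering the connection, it can be worth confirming that the alias actually resolves on the runbook server. A minimal sketch in Python, assuming the alias is “SCOM_SDK” as in step 1; `resolve_alias` is a hypothetical helper name:

```python
import socket


def resolve_alias(alias):
    """Return the IPv4 address the alias resolves to, or None if it does not resolve."""
    try:
        return socket.gethostbyname(alias)
    except OSError:  # socket.gaierror is a subclass of OSError
        return None


# On the runbook server, resolve_alias("SCOM_SDK") should return the IP
# currently mapped in the hosts file, or None if the entry is missing.
```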

Let’s test.  First, create the connection for the IP:

image

Second, modify or create a runbook that uses SCOM 2012 IP activities.  Change each activity to use the new alias-driven connection for the IP.  For each SCOM 2012 activity, also attach two ‘Modify Counter’ activities: one for success and one for failure.

image

Create Alert – attempts to create an alert in SCOM 2012.

Fail – increments the SCOMSDK_FailCount counter created in step 2.

Success – resets the SCOMSDK_FailCount counter created in step 2.
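The way the counter ties these activities to the failover runbook can be sketched as follows. This is an illustrative Python model, not Orchestrator code: the `FailCounter` class is hypothetical, and the threshold of 3 is the starting value suggested in step 3.

```python
class FailCounter:
    """Models SCOMSDK_FailCount: failures increment it, a success resets it."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.count = 0

    def record(self, succeeded):
        """Record one activity result; return True when failover should run."""
        if succeeded:
            self.count = 0          # the Success branch resets the counter
            return False
        self.count += 1             # the Fail branch increments the counter
        if self.count >= self.threshold:
            self.count = 0          # the failover runbook resets the counter
            return True
        return False


counter = FailCounter(threshold=3)
# Two failures, a success, then three consecutive failures.
results = [counter.record(ok) for ok in (False, False, True, False, False, False)]
```

In this model only the last call returns True, because the success in the middle resets the count; failover triggers on consecutive failures only.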

With the SDK running, let’s fire up the runbook and make sure an alert gets created since this uses our newly created SCOM 2012 connection utilizing the alias in the hosts file:

image

There it is.  Now, I am going to stop the SDK service on the current MS server and then run the Create Alert runbook again.

image

Perfect.  After two more failures, the hosts file should be rewritten with the IP address of the secondary MS server.  In the middle of the test, however, the SDK service I had stopped was restarted.  So I stopped and disabled the service, and then produced 3 consecutive failures.

image

We have:

image

There it is.  We now have the IP address of the secondary SDK service in the hosts file.  Failover complete!
