To ensure high availability for SCOrch solutions that connect to SCOM, you need to manage failover of the SDK service in case the SCOM management server that SCOrch connects to becomes unavailable. The right way to do this is a network-side solution that load balances the SDK services. Sometimes, however, you need a quick and dirty hack to test things out. Maybe you are out of IP addresses, or your company has layers of process to navigate before an enterprise solution can be deployed. If you can accomplish this the right way, please do. If not, here goes…
NOTE: this could be used to manage failover for any platform SCOrch is communicating with. Again, this is not meant to be a permanent solution but something that could be used in a lab environment where resources are tight.
Step 1: Modify the hosts File
Yep – how about that blast from the past? Identify which SCOM MS server you want to be primary, and grab its IP address. Then create an alias for your SCOM SDK in the hosts file and map it to the IP address of the primary MS.
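As a rough illustration of what that mapping looks like, here is a small Python sketch with a sample hosts entry and a lookup that confirms which address the alias points to. The address 192.0.2.10 is a documentation-range placeholder and `resolve_alias` is a hypothetical helper, not part of SCOrch. The alias name SCOM_SDK is the one used later in this post.

```python
# Hypothetical hosts file content. 192.0.2.10 is a placeholder address
# (documentation range), not a real management server.
SAMPLE_HOSTS = """\
127.0.0.1       localhost
192.0.2.10      SCOM_SDK    # primary SCOM management server
"""


def resolve_alias(hosts_text, alias):
    """Return the IP mapped to `alias` in hosts-format text, or None."""
    for line in hosts_text.splitlines():
        # Drop trailing comments, then split into: IP, name, [more names...]
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and alias.lower() in (f.lower() for f in fields[1:]):
            return fields[0]
    return None


print(resolve_alias(SAMPLE_HOSTS, "SCOM_SDK"))  # 192.0.2.10
```

On a Windows server the real file normally lives at C:\Windows\System32\drivers\etc\hosts, and editing it requires administrative rights.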
Step 2: Create Your SCOM SDK Variables
To keep the example simple, I am going to use just two SCOM servers: a primary and a secondary.
Lastly, we need a counter to manage the failover.
Step 3: Create the Failover Runbook
Monitor Counter – we need to decide how many failures we are OK with before we fail over. I like to start with 3 and adjust from there if needed.
Run .Net Script – we need to determine which SCOM SDK is currently active and then which one to fail over to.
And of course, publish the IP we want to make active to the data bus ($Output).
The replacement text, cut off in the screenshot, reads “SCOM_SDK”, the alias that was put into the hosts file in step 1.
Modify Counter – since we have now failed over to what we assume is an active SDK service, we need to reset the counter to 0.
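The decision and the hosts rewrite can be sketched as follows. This is only an illustration of the logic in Python, not the script from the Run .Net Script activity. The function names and the 192.0.2.x addresses are assumptions made up for the example, and it assumes the alias sits on its own line in the hosts file.

```python
def choose_failover_ip(active_ip, primary_ip, secondary_ip):
    """Return the IP to switch to: whichever server is not currently active."""
    return secondary_ip if active_ip == primary_ip else primary_ip


def repoint_alias(hosts_text, alias, new_ip):
    """Swap the IP on the line that maps `alias`; leave other lines untouched."""
    out = []
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and alias.lower() in (f.lower() for f in fields[1:]):
            # The IP is the first field, so replace its first occurrence only.
            line = line.replace(fields[0], new_ip, 1)
        out.append(line)
    return "\n".join(out) + "\n"


# Placeholder addresses standing in for the two SCOM SDK variables.
PRIMARY, SECONDARY = "192.0.2.10", "192.0.2.11"
hosts = "127.0.0.1       localhost\n192.0.2.10      SCOM_SDK\n"

target = choose_failover_ip(PRIMARY, PRIMARY, SECONDARY)
print(repoint_alias(hosts, "SCOM_SDK", target))
```

With only two servers the choice is a simple toggle, so running the same logic again after a later failure would point the alias back at the primary.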
That’s it. We now have a fully functional, complete hack of a failover solution. Now, when registering the SDK connections for the SCOM integration pack (IP) to use, specify the alias from the hosts file rather than the actual name of any of your SCOM MS servers.
Let’s test. First, create the connection for the IP:
Second, modify or create a runbook that uses SCOM 2012 IP activities. Change each activity to use the new alias-driven connection for the IP. For each SCOM 2012 activity, also associate two ‘Modify Counter’ activities: one for success and one for failure.
Create Alert – attempts to create an alert in SCOM 2012.
Fail – increments the SCOMSDK_FailCount counter created in step 2.
Success – resets the SCOMSDK_FailCount counter created in step 2.
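To make the counter behaviour concrete, here is a minimal Python simulation of the Fail and Success branches working against a threshold of 3. The `FailCounter` class is a hypothetical stand-in for the SCOMSDK_FailCount counter, and the sequence of results is invented for the example.

```python
FAIL_THRESHOLD = 3  # the starting value suggested in step 3


class FailCounter:
    """Stand-in for the SCOMSDK_FailCount Orchestrator counter."""

    def __init__(self):
        self.value = 0

    def record(self, succeeded):
        """Reset on success, increment on failure.

        Returns True when the threshold is reached and failover is due.
        """
        if succeeded:
            self.value = 0
            return False
        self.value += 1
        if self.value >= FAIL_THRESHOLD:
            self.value = 0  # mirrors the reset done after the failover
            return True
        return False


counter = FailCounter()
# One failure, one success (which resets the count), then three failures.
results = [False, True, False, False, False]
triggers = [counter.record(ok) for ok in results]
print(triggers)  # [False, False, False, False, True]
```

Because any success resets the count, only consecutive failures add up to a failover, which keeps a single transient error from rewriting the hosts file.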
With the SDK running, let’s fire up the runbook and make sure an alert gets created since this uses our newly created SCOM 2012 connection utilizing the alias in the hosts file:
There it is. Now, I am going to stop the SDK service on the current MS server and then run the Create Alert runbook again.
Perfect. After 2 more failures, the hosts file should be rewritten with the IP address of the secondary MS server. In the middle of the test, however, the SDK service I had stopped on the MS server was restarted. So I stopped and disabled the service, and then produced 3 consecutive failures.
We have:
There it is. We now have the IP address of the secondary SDK service in the hosts file. Failover complete!