Tuesday, April 3

Slaying the SCOM Auditability Dragon

In the library of boring topics, software configuration auditing must surely rank close to the top.  While end user applications such as messaging and database servers spark swarms of such products, SCOM and its System Center siblings have yet to attract major interest.  The reasons for this are open to debate, but there is no doubt that Microsoft’s infrastructure products are rife with configuration interfaces and rich APIs that make it all too easy for configuration drift to rear its ugly head.  A typical SCOM deployment will have over 10,000 rules and another 10,000 monitors.  In fact, the job boards have recently been thick with openings for SCOM engineers (a search for “SCOM” on Dice returns 253 postings), many of which are to rein in existing deployments sinking under their own clumsy weight.

So how much should you worry about “what changed” in your SCOM deployment?  Well, the truth is that, in the absence of tight change controls, SCOM deployments evolve like any other complex system in your data center:  instead of being a ready tool in helping detect and solve problems, it becomes part of the noise overwhelming administrators.  It is another example that even the monitor needs monitoring, and IT managers pay a price for ignoring that reality.

Enter The Dude

These questions of SCOM configuration controls were the subject of a recent conversation I had with a very capable IT engineer, whom I enjoy calling the “Dude” on account of his unshakable confidence and periodic exhibits of great flair.  The specific point of our debate was the use of naming conventions for SCOM authoring. (Don’t say you weren’t forewarned that some dull topics were afoot!)  Dude had inherited a SCOM deployment that was perfectly devoid of documentation, courtesy of his predecessors, all reputedly SCOM experts.  Dude had earned his SCOM spurs in this school of hard knocks:  deciphering the meaning of each SCOM alert, and investigating how to ensure SCOM alerted when serious problems occurred.

From his first tentative overrides to his eventual mastery of the Authoring pane, he carried on his company’s tradition of documenting “nada”.  And “nada” means no run guide, no change log, heck not even a lazy description sprinkled here and there.  His disregard for documentation in any form was an article of simple faith:  why waste time on documentation when you will always remember the changes you made, and if somehow your memory fails you, then it’s no problem to figure it out when the need arises.  Such self-assurance was so charming – ah, how good it is to work in IT when you’re outside the scope of any quality controls, audits or best practices reviews. 

Some more context would be helpful.  Dude was responsible for only one SCOM management group, so the question of code consistency across more than one system never crossed his mind.  Further, since Dude was the only SCOM administrator, it was actually possible, assuming a heroic memory, that he might remember all his overrides and customizations.  Lastly, since it was not a customer-facing application, Dude’s managers accepted the frequent problems with SCOM as inevitable annoyances.  They never challenged Dude to set any quality goals or continuous improvement plan.  It was ‘the best of times, it was the worst of times.’

Under the Covers

Microsoft’s Management Pack Authoring Guide for SCOM clearly explains why the key attribute of every element in an MP is the ID field.  As with SCSM, SCOM constructs a class hierarchy ordering all the elements in every imported MP, and maintaining this in memory is one of the key roles of the RMS.  When your primary tool for creating new rules and monitors is the Ops console, SCOM conceals the ID field, automatically constructing one on the fly from the element’s type and a GUID-like string of hexadecimal digits.  The only things the author controls in the console are the Display Name and the Description.  Nice and easy, but not the best design for configuration auditing.

The uniqueness function provided by a SCOM element’s ID is similar to the function of a hostname in a DNS namespace, in that the hostname must be unique within a DNS zone.  Further, the ID of the MP anchors the namespace in SCOM as the name of the zone does in DNS.  The display name, on the other hand, is like the comment field in that, just as you can give many hosts in a zone the same comment (or no comment at all), you can give the same display name to any number of SCOM elements – think “VIP Application Service Down Monitor.”
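To see how freely display names can repeat while IDs stay unique, here is a minimal sketch that lists every display name shared by more than one element in an exported MP.  It assumes the export keeps its display strings in DisplayString elements, each with an ElementID attribute and a Name child, and that the tags carry no XML namespace.  The function name shared_display_names is made up for this illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def shared_display_names(mp_xml):
    """Return {display name: [element IDs]} for names used by 2+ elements.

    Assumes un-namespaced <DisplayString ElementID="..."><Name>...</Name>
    entries, as found in an exported management pack.
    """
    root = ET.fromstring(mp_xml)
    ids_by_name = defaultdict(list)
    for display_string in root.iter("DisplayString"):
        element_id = display_string.get("ElementID")
        name = (display_string.findtext("Name") or "").strip()
        # Skip incomplete entries, and do not count the same ID twice
        # (it can reappear under a second language pack).
        if element_id and name and element_id not in ids_by_name[name]:
            ids_by_name[name].append(element_id)
    return {name: ids for name, ids in ids_by_name.items() if len(ids) > 1}
```

Any name that comes back from this function is a "comment field" that tells you nothing about which element you are looking at; only the ID does.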

Restoring Order

The Authoring Console is actually the intended tool for extending SCOM with custom classes in enterprise deployments.  The Authoring Guide recommends that you standardize your IDs just as Microsoft does in its own MPs.  And how does that work?  Well, as I explained to Dude, the basic process is:

1.   Create a concisely named MP, using the model evident in all the SCOM system MPs and most of Microsoft’s application MPs
2.   Create the element in the Ops console as usual and host it in the well-named MP
3.   Export the MP to an XML file
4.   Open the file in an editor and locate the console-generated ID
5.   Replace all occurrences of that ID with a new ID composed of the following parts, each delimited by a period
a.   the ID of the MP
b.   the element type (such as “Group”, “Rule”, or “Monitor”)
c.   one or more descriptors that indicate the essence of the element (such as AppLog.Error1102 for a rule that alerts when an Error event occurs in the Application log with an ID of 1102)
6.   Re-import the MP
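If you would rather script steps 4 and 5 than do them by hand, the following is a minimal sketch in Python.  The helper names build_element_id and replace_element_id are invented for this example and are not part of any SCOM tooling.  The sketch treats the exported MP as plain text, just as an editor's find-and-replace would, but it only replaces the old ID where it stands alone, so that a longer console-generated ID that merely contains it (an override ID, for instance) is left untouched.

```python
import re


def build_element_id(mp_id, element_type, *descriptors):
    """Join the MP ID, the element type and the descriptors with periods."""
    parts = [mp_id, element_type, *descriptors]
    return ".".join(part.strip(".") for part in parts if part)


def replace_element_id(xml_text, old_id, new_id):
    """Replace stand-alone occurrences of old_id in the exported MP text.

    Returns (new_text, number_of_replacements).  An occurrence is skipped
    when it is directly preceded or followed by a letter, digit, underscore
    or period, i.e. when it is only a fragment of a longer ID.
    """
    pattern = re.compile(r"(?<![\w.])" + re.escape(old_id) + r"(?![\w.])")
    # A function replacement keeps new_id literal (no backslash escapes).
    return pattern.subn(lambda _match: new_id, xml_text)
```

A typical use would be to read the exported file into a string, call replace_element_id with the console-generated ID and the result of build_element_id("MyCompany.Messaging.Monitoring", "Rule", "AppLog", "Error1102"), check that the replacement count is what you expect, and write the text back out before re-importing.  Work on a copy of the export until you trust the result.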

My favorite editor for this purpose is Notepad++, which is freely available at http://notepad-plus-plus.org/.

And what have you accomplished?  Your new IDs now reflect a path in a clean namespace from a root or branch to a leaf node in terms that are easy to follow.  These IDs will reinforce the design integrity of the deployment and protect the relevance of the documentation.  When you export your MPs to a common area, you can search across multiple files for elements targeting the same class or object, which can answer many questions such as identifying redundancies.  If you ever run the Alert report, your IDs will no longer read like the outputs from a runaway random generator.
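As a rough illustration of that cross-file search, the sketch below walks a folder of exported MPs and groups every element that has both an ID and a Target attribute by its target.  The function name elements_by_target is made up for this example, and the sketch assumes each export is a well-formed XML file with a .xml extension.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from pathlib import Path


def elements_by_target(export_dir):
    """Map each Target value to the (file, tag, ID) of elements that use it."""
    index = defaultdict(list)
    for path in sorted(Path(export_dir).glob("*.xml")):
        root = ET.parse(path).getroot()
        for element in root.iter():
            target = element.get("Target")
            element_id = element.get("ID")
            # Only elements that name both an ID and a target are of interest.
            if target and element_id:
                index[target].append((path.name, element.tag, element_id))
    return dict(index)
```

One caveat: a target is usually written with the alias of the MP that defines the class, and that alias is chosen per MP, so the same class can show up under two different spellings.  Treat the grouping as a starting point for a redundancy review, not a final answer.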

Y.A.M. -- Young Admin Myopia – and the Upside of Planning

At this point, Dude was rolling his eyes and groaning in disbelief.  “Why all this fuss and bother if it produces no visible benefit in the console?”  He might well have added, “And why spend my good time assisting future administrators if it’s not in the job description?”  Well, a good retort might follow the logic Robert Duvall gave Sean Penn in “Colors” on the question of running down a hill vs. walking down, but your mileage may vary.

Well, returning to the question we started with, if you ever hope to implement change control audits for your SCOM configuration, especially if you have multiple administrators who make frequent changes, it’s simply unrealistic to try to do it with manual tools such as screenshots and spreadsheets.  There are just too many elements to track without an automated tool.

In the meantime, hopefully this article has persuaded you to make the extra effort to convert your GUID-like IDs to a more meaningful namespace style.  Here are some explicit examples of the two styles we’ve been discussing.

Class ID:
Before:  UINameSpace172861e718614744a224992d5237de31.Group
After:  MyCompany.Messaging.Monitoring.MyAppServers.Group

Rule ID:
Before:  MomUIGeneratedRule3d232ca92a3a4e9e9c53e70c6838439b
After:  MyCompany.Messaging.Monitoring.Alert.AppLog.Error1102

Rule Property Override ID:
Before:  OverrideForRuleMomUIGeneratedRule083aa6fe88f34aedb0c871e3da8843a1ForContextUINameSpace59a7638242334435824fcf4ebbf3450bGroup75fba1242f0d4f038b8590e169a123d0
After:  MyCompany.Messaging.Monitoring.Alert.AppLog.Error1102.Override.Interval.MyAppServers
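To find out how many elements still carry a console-generated ID, one simple heuristic is to look for the 32-character hexadecimal run that appears in each of the Before examples above.  This is a pattern inferred from those examples, not a documented SCOM rule, and the function name looks_console_generated is invented for this sketch.

```python
import re

# 32 consecutive hexadecimal digits: a GUID written without dashes.
_GUID_RUN = re.compile(r"[0-9a-f]{32}", re.IGNORECASE)


def looks_console_generated(element_id):
    """Return True if the ID contains a GUID-like run of 32 hex digits."""
    return _GUID_RUN.search(element_id) is not None
```

Run it over the IDs in your exported MPs and you have a rough to-do list for the renaming exercise.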

SCOM has been around for a while, but good practices are never too late to start.  I hope this article has shown you that these default IDs are not beyond your control, and they’re worth controlling.