Slaying the SCOM Auditability Dragon
In the library of boring topics, software configuration auditing must surely rank
close to the top. While end-user applications such as messaging and database
servers have spawned swarms of auditing products, SCOM and its System Center
siblings have yet to attract major interest. The reasons for this are open
to debate, but there can be no doubt as to the fact that Microsoft’s
infrastructure products are rife with configuration interfaces and rich APIs
that make it all too easy for configuration drift to rear its ugly head. A
typical SCOM deployment will have over 10,000 rules and another 10,000 monitors. In fact, the job boards over the recent past have
been thick with openings for SCOM engineers (a search for “SCOM” on Dice
returns 253 postings), many of which are to rein in existing deployments
sinking under their own clumsy weight.
So how much should you worry about “what changed” in your SCOM deployment? Well, the truth is that, in the absence of
tight change controls, a SCOM deployment evolves like any other complex system in
your data center: instead of being a ready tool for detecting and solving problems,
it becomes part of the noise overwhelming administrators. It is another example of
the fact that even the monitor needs monitoring, and IT managers pay a
price for ignoring that reality.
Enter The Dude
These
questions of SCOM configuration controls were the subject of a recent conversation
I had with a very capable IT engineer, whom I enjoy calling the “Dude” on
account of his unshakable confidence and periodic exhibits of great flair. The specific point of our debate was the use of
naming conventions for SCOM authoring. (Don’t say you weren’t forewarned some
dull topics were afoot!) Dude had
inherited a SCOM deployment that was perfectly devoid of documentation,
courtesy of his predecessors, all reputedly SCOM experts. Dude had earned his SCOM spurs in this school
of hard knocks: deciphering the meaning
of each SCOM alert, and investigating how to ensure SCOM alerted when serious problems occurred.
From
his first tentative overrides to his eventual mastery of the Authoring pane, he
carried on his company’s tradition of documenting “nada”. And “nada” means no run guide, no change log,
heck, not even a lazy description sprinkled here and there. His disregard for documentation in any form was
an article of simple faith: why waste
time on documentation when you will always remember the changes you made, and
if somehow your memory fails you, then it’s no problem to figure it out when
the need arises. Such self-assurance was
so charming – ah, how good it is to work in IT when you’re outside the scope of
any quality controls, audits or best practices reviews.
Some
more context would be helpful. Dude was
responsible for only one SCOM management group, so the question of code consistency across
more than one system never crossed his mind.
Further, since Dude was the only SCOM administrator, it was actually
possible, assuming a heroic memory, that he might remember all his overrides
and customizations. Lastly, since it was
not a customer-facing application, Dude’s managers accepted the frequent
problems with SCOM as inevitable annoyances.
They never challenged Dude to set any quality goals or continuous
improvement plan. It was “the best of times, it was the worst of times.”
Under the Covers
Microsoft’s
Management Pack Authoring Guide for SCOM gives a clear explanation of why the
key attribute of every element in an MP is the ID field. As with SCSM, SCOM constructs a class
hierarchy ordering all the elements in every imported MP, and maintaining this
in memory is one of the key roles of the RMS.
When your primary tool for creating new rules and monitors is the Ops
console, SCOM conceals the ID field, automatically constructing one on the fly from
the element’s type and a GUID-like string of numbers. All the author controls in the console are
the Display Name and the Description. Nice
and easy, but not the best design for configuration auditing.
The
uniqueness function provided by a SCOM element’s ID is similar to the function
of a hostname in a DNS namespace, in that the hostname must be unique within a
DNS zone. Further, the ID of the MP anchors
the namespace in SCOM as the name of the zone does in DNS. The display name, on the other hand, is like
the comment field in that, just as you can give many hosts in a zone the same
comment (or no comment at all), you can give the same display name to any
number of SCOM elements – think “VIP Application Service Down Monitor.”
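To make the analogy concrete, here is a minimal Python sketch of the two rules: an ID may appear only once, while a display name may be shared freely. The function name and the (ID, display name) pairs are hypothetical and purely illustrative; nothing here calls SCOM itself.

```python
from collections import defaultdict

def ids_sharing_display_name(elements):
    """Group element IDs by display name.

    elements is an iterable of (element_id, display_name) pairs.
    A repeated ID is an error, just as a repeated hostname would be in a zone;
    a repeated display name is allowed and simply reported.
    """
    seen_ids = set()
    by_name = defaultdict(list)
    for element_id, display_name in elements:
        if element_id in seen_ids:
            raise ValueError(f"duplicate ID: {element_id}")
        seen_ids.add(element_id)
        by_name[display_name].append(element_id)
    # Keep only the display names used by more than one element.
    return {name: ids for name, ids in by_name.items() if len(ids) > 1}
```

Given a handful of monitors all named “VIP Application Service Down Monitor”, the function returns that one display name mapped to every ID that carries it. That is exactly the situation in which the ID is the only reliable handle.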
Restoring Order
The
Authoring Console is actually the intended tool for extending SCOM with custom
classes in enterprise deployments. The
Authoring Guide recommends that you standardize your IDs just as Microsoft does
in its own MPs. And how does that
work? Well, as I explained to Dude, the
basic process is:
1. Create a concisely named MP, using the model evident in all the SCOM system MPs and most of Microsoft’s application MPs.
2. Create the element in the Ops console as usual and host it in the well-named MP.
3. Export the MP to an XML file.
4. Open the file in an editor and locate the console-generated ID.
5. Replace all occurrences of that ID with a new ID composed of the following parts, each delimited by a period:
   a. the ID of the MP
   b. the element type (such as “Group”, “Rule”, or “Monitor”)
   c. one or more descriptors that indicate the essence of the element (such as AppLog.Error1102 for a rule that alerts when an Error event occurs in the Application log with an ID of 1102)
6. Re-import the MP.
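If you would rather script the edit than do it by hand, steps 3 through 5 might be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the export is assumed to be UTF-8 text, and the character check is a conservative guess rather than the MP schema's exact rule. It performs the same plain replace-all an editor would.

```python
import re

# Assumption: letters, digits, and underscores separated by periods.
# This is a conservative check, not the MP schema's exact rule.
_ID_PART = re.compile(r"[A-Za-z0-9_]+(\.[A-Za-z0-9_]+)*")

def build_element_id(mp_id, element_type, *descriptors):
    """Join the MP ID, the element type, and the descriptors with periods."""
    parts = [mp_id, element_type, *descriptors]
    for part in parts:
        if not _ID_PART.fullmatch(part):
            raise ValueError(f"unsuitable ID part: {part!r}")
    return ".".join(parts)

def replace_id(mp_xml, old_id, new_id):
    """Replace every occurrence of old_id in the exported MP text.

    Returns the new text and the number of replacements made.
    """
    count = mp_xml.count(old_id)
    if count == 0:
        raise ValueError(f"{old_id} not found in the export")
    return mp_xml.replace(old_id, new_id), count

def rename_in_file(path, old_id, new_id, encoding="utf-8"):
    """Rewrite an exported MP file in place, saving a .bak copy first."""
    # newline="" keeps the file's original line endings untouched.
    with open(path, encoding=encoding, newline="") as f:
        text = f.read()
    new_text, count = replace_id(text, old_id, new_id)
    with open(f"{path}.bak", "w", encoding=encoding, newline="") as f:
        f.write(text)
    with open(path, "w", encoding=encoding, newline="") as f:
        f.write(new_text)
    return count
```

For example, build_element_id("MyCompany.Messaging.Monitoring", "Rule", "AppLog", "Error1102") produces the new ID, and rename_in_file then swaps it in wherever the console-generated ID appears. Because this is a plain text replacement, any longer ID that embeds the old one (a console-generated override ID, for instance) is changed as well, so review the result before importing it.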
My
favorite editor for this purpose is Notepad++, which is freely available at http://notepad-plus-plus.org/.
And
what have you accomplished? Your new IDs
now reflect a path in a clean namespace from a root or branch to a leaf node in
terms that are easy to follow. These IDs
will reinforce the design integrity of the deployment and protect the relevance
of the documentation. When you export
your MPs to a common area, you can search across multiple files for elements
targeting the same class or object, which can answer many questions such as
identifying redundancies. If you ever
run the Alert report, your IDs will no longer read like the outputs from a
runaway random generator.
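As a rough illustration of that cross-file search, the Python sketch below walks a folder of exported MPs and groups elements by what they target. It assumes that the elements of interest carry both an ID and a Target attribute, as rules and monitors do in exported MP XML. The function name and folder layout are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from pathlib import Path

def elements_by_target(export_dir):
    """Map each Target value to the (file name, tag, ID) of elements using it."""
    found = defaultdict(list)
    # Sort the files so the report comes out in a stable order.
    for path in sorted(Path(export_dir).glob("*.xml")):
        root = ET.parse(path).getroot()
        for elem in root.iter():
            target, elem_id = elem.get("Target"), elem.get("ID")
            # Only elements that have both attributes are recorded.
            if target and elem_id:
                found[target].append((path.name, elem.tag, elem_id))
    return dict(found)
```

Any target that appears with a long list of rules and monitors from different files is a good place to start looking for redundancies. With namespace-style IDs, the list is readable at a glance.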
Y.A.M. -- Young Admin Myopia -- and the Upside of Planning
At
this point, Dude was rolling his eyes and groaning in disbelief. “Why all this fuss and bother if it produces
no visible benefit in the console?” He
might well have added, “And why spend my good time assisting future
administrators if it’s not in the job description?” Well, a good retort might follow the logic Robert
Duvall gave Sean Penn in “Colors” on the question of running down a hill vs.
walking down, but your mileage may vary.
Returning to the question we started with, if you ever hope to implement change
control audits for your SCOM configuration, especially if you have multiple
administrators who make frequent changes, it’s simply unrealistic to try to do it
with manual tools such as screenshots and spreadsheets. There are just too many elements to track without
an automated tool.
In the meantime, hopefully this article has persuaded you to make the extra
effort to convert your GUID-like IDs to a more meaningful namespace style. Here are some explicit examples of the two
styles we’ve been discussing.
Class ID:
Before: UINameSpace172861e718614744a224992d5237de31.Group
After: MyCompany.Messaging.Monitoring.MyAppServers.Group
Rule ID:
Before: MomUIGeneratedRule3d232ca92a3a4e9e9c53e70c6838439b
After: MyCompany.Messaging.Monitoring.Alert.AppLog.Error1102
Rule Property Override ID:
Before: OverrideForRuleMomUIGeneratedRule083aa6fe88f34aedb0c871e3da8843a1ForContextUINameSpace59a7638242334435824fcf4ebbf3450bGroup75fba1242f0d4f038b8590e169a123d0
After: MyCompany.Messaging.Monitoring.Alert.AppLog.Error1102.Override.Interval.MyAppServers
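The “Before” samples share one giveaway: a GUID written as an unbroken run of 32 hex digits. A small Python check along those lines can flag the IDs still waiting to be converted. This is a heuristic sketch based only on the samples above, not an official rule, and the function names are hypothetical.

```python
import re

# Assumption: console-generated IDs embed a GUID as 32 consecutive hex digits,
# as in the "Before" samples. Hand-written IDs are unlikely to contain one.
_GUID_RUN = re.compile(r"[0-9a-fA-F]{32}")

def looks_console_generated(element_id):
    """Return True if the ID contains a 32-digit hex run."""
    return _GUID_RUN.search(element_id) is not None

def remaining_generated_ids(element_ids):
    """Return, in their original order, the IDs that still look console-generated."""
    return [i for i in element_ids if looks_console_generated(i)]
```

Run against the IDs collected from your exported MPs, an empty result suggests that the conversion is complete.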
SCOM
has been around for a while, but good practices are never too late to start. I hope this article has shown you that these
default IDs are not beyond your control, and they’re worth controlling.