Friday, September 14

IT Services Management With Zenoss Core 4

 The latest development in IT Services Management tools may well be the release of Zenoss Core 4 (see this Press Release).

Zenoss is a component-level IT monitoring tool with two versions:  Zenoss Core, a free and Open Source product, and the commercial version, Zenoss Enterprise, which is sold by Zenoss Inc., the corporate sponsor of Zenoss Core.  Large distributed IT departments will probably consider Zenoss Enterprise as it  extends the Core with multiple collectors.  As always, all components of the Zenoss 4 architecture are built from Open Source tools.

Major improvements in the new release include:
  • enhanced ZenPing suppression (essentially, the Python code in ZenPing was rewritten to include better layer 3 link attributes)
  • NMAP for ICMP packet generation
  • a new auto-deploy script, making Zenoss Core 4 easier to install than past releases
The new release is initially available as the Core version.   Zenoss Service Dynamics version 4.2.2, which provides additional analytics and resource management capabilities, is set to be released later this year.

Today's dynamic data center relies crucially on virtualization and cloud services, and Zenoss continues to advance in that direction.  For an overview of the features of the Enterprise product, check out the Zenoss Solution Overview Brochure.

I have found Zenoss to be a near perfect tool to help IT departments meet their monitoring needs in a more cost-effective manner.  While my focus today is on Microsoft SCOM, Orchestrator and Service Manager, I encourage you to consider Zenoss as a long-term, thoroughly professional alternative.

Tuesday, April 3

Slaying the SCOM Auditability Dragon

In the library of boring topics, software configuration auditing must surely rank close to the top.  While end user applications such as messaging and database servers spark swarms of such products, SCOM and its System Center siblings have yet to attract major interest.  The reasons for this are open to debate, but there can be no doubt that Microsoft’s infrastructure products are rife with configuration interfaces and rich APIs that make it all too easy for configuration drift to rear its ugly head.  A typical SCOM deployment will have over 10,000 rules and another 10,000 monitors.  In fact, the job boards over the recent past have been thick with openings for SCOM engineers (a search for “SCOM” on Dice returns 253 postings), many of which are to rein in existing deployments sinking under their own clumsy weight.

So how much should you worry about “what changed” in your SCOM deployment?  Well, the truth is that, in the absence of tight change controls, SCOM deployments evolve like any other complex system in your data center:  instead of being a ready tool in helping detect and solve problems, it becomes part of the noise overwhelming administrators.  It is another example that even the monitor needs monitoring, and IT managers pay a price for ignoring that reality.

Enter The Dude

These questions of SCOM configuration controls were the subject of a recent conversation I had with a very capable IT engineer, whom I enjoy calling the “Dude” on account of his unshakable confidence and periodic exhibits of great flair.  The specific point of our debate was the use of naming conventions for SCOM authoring. (Don’t say you weren’t forewarned some dull topics were afoot!)  Dude had inherited a SCOM deployment that was perfectly devoid of documentation, courtesy of his predecessors, all reputedly SCOM experts.  Dude had earned his SCOM spurs in this school of hard knocks:  deciphering the meaning of each SCOM alert, and investigating how to ensure SCOM alerted when serious problems occurred.

From his first tentative overrides to his eventual mastery of the Authoring pane, he carried on his company’s tradition of documenting “nada”.  And “nada” means no run guide, no change log, heck not even a lazy description sprinkled here and there.  His disregard for documentation in any form was an article of simple faith:  why waste time on documentation when you will always remember the changes you made, and if somehow your memory fails you, then it’s no problem to figure it out when the need arises.  Such self-assurance was so charming – ah, how good it is to work in IT when you’re outside the scope of any quality controls, audits or best practices reviews. 

Some more context would be helpful.  Dude was responsible for only one SCOM management group, so the question of code consistency across more than one system never crossed his mind.  Further, since Dude was the only SCOM administrator, it was actually possible, assuming a heroic memory, that he might remember all his overrides and customizations.  Lastly, since it was not a customer-facing application, Dude’s managers accepted the frequent problems with SCOM as inevitable annoyances.  They never challenged Dude to set any quality goals or continuous improvement plan.  It was “the best of times, it was the worst of times.”

Under the Covers

Microsoft’s Management Pack Authoring Guide for SCOM gives a clear explanation why the key attribute of every element in a MP is the ID field.  As with SCSM, SCOM constructs a class hierarchy ordering all the elements in every imported MP, and maintaining this in memory is one of the key roles of the RMS.  When your primary tool for creating new rules and monitors is the Ops console, SCOM conceals the ID field, automatically constructing one on the fly from the element’s type and a GUID-like string of numbers.  All the author controls in the console are the Display Name and the Description.  Nice and easy, but not the best design for configuration auditing.

The uniqueness function provided by a SCOM element’s ID is similar to the function of a hostname in a DNS namespace, in that the hostname must be unique within a DNS zone.  Further, the ID of the MP anchors the namespace in SCOM as the name of the zone does in DNS.  The display name, on the other hand, is like the comment field in that, just as you can give many hosts in a zone the same comment (or no comment at all), you can give the same display name to any number of SCOM elements – think “VIP Application Service Down Monitor.”

Restoring Order

The Authoring Console is actually the intended tool for extending SCOM with custom classes in enterprise deployments.  The Authoring Guide recommends that you standardize your IDs just as Microsoft does in its own MPs.  And how does that work?  Well, as I explained to Dude, the basic process is:

1.   Create a concisely named MP, following the model evident in all the SCOM system MPs and most of Microsoft’s application MPs
2.   Create the element in the Ops console as usual and host it in the well-named MP
3.   Export the MP to an XML file
4.   Open the file in an editor and locate the console-generated ID
5.   Replace all occurrences of that ID with a new ID composed of the following parts, each delimited by a period
a.   the ID of the MP
b.   the element type (such as “Group”, “Rule”, or “Monitor”)
c.   one or more descriptors that indicate the essence of the element (such as AppLog.Error1102 for a rule that alerts when an Error event occurs in the Application log with an ID of 1102)
6.   Re-import the MP
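For those who would rather script the edit than do it by hand, steps 3 through 5 can be sketched in a few lines of Python.  This is only a sketch under stated assumptions:  the function names are my own inventions, the exported file is assumed to be UTF-8 text, and the replacement is a plain text substitution (exactly what you would do with search-and-replace in an editor), not anything SCOM itself provides.

```python
from pathlib import Path
from typing import Tuple


def rename_element_id(xml_text: str, old_id: str, new_id: str) -> Tuple[str, int]:
    """Replace every occurrence of old_id with new_id in exported MP text.

    Returns the new text and the number of replacements made.
    Hypothetical helper for illustration; not part of any SCOM tooling.
    """
    # Refuse to create a collision: IDs must stay unique within the MP.
    if new_id in xml_text:
        raise ValueError(f"{new_id!r} already appears in this management pack")
    count = xml_text.count(old_id)
    return xml_text.replace(old_id, new_id), count


def rename_in_file(path: str, old_id: str, new_id: str) -> int:
    """Apply rename_element_id to a file, keeping a .bak copy of the original."""
    mp_file = Path(path)
    original = mp_file.read_text(encoding="utf-8")  # assumption: UTF-8 export
    updated, count = rename_element_id(original, old_id, new_id)
    if count:
        backup = mp_file.with_suffix(mp_file.suffix + ".bak")
        backup.write_text(original, encoding="utf-8")
        mp_file.write_text(updated, encoding="utf-8")
    return count
```

Replacing every occurrence, not just the ID attribute, is deliberate:  the same ID also shows up wherever the element is referenced (display strings, overrides), and all of those references need to change together.  Keep the backup until the MP re-imports cleanly.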

My favorite editor for this purpose is Notepad++, which is freely available at

And what have you accomplished?  Your new IDs now reflect a path in a clean namespace from a root or branch to a leaf node in terms that are easy to follow.  These IDs will reinforce the design integrity of the deployment and protect the relevance of the documentation.  When you export your MPs to a common area, you can search across multiple files for elements targeting the same class or object, which can answer many questions such as identifying redundancies.  If you ever run the Alert report, your IDs will no longer read like the outputs from a runaway random generator.
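The cross-file search mentioned above is easy to approximate with a short script.  The sketch below assumes your exported MPs sit in one folder as .xml files and that targeted elements (rules, monitors and the like) carry both an ID and a Target attribute;  the function names are hypothetical, and you should check the assumption against your own exports before trusting the output.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from pathlib import Path


def elements_with_target(root):
    """Return (target, tag, element_id) for every element under root
    that has both an ID and a Target attribute (an assumption about
    how exported MPs describe rules and monitors)."""
    found = []
    for elem in root.iter():
        if "ID" in elem.attrib and "Target" in elem.attrib:
            found.append((elem.attrib["Target"], elem.tag, elem.attrib["ID"]))
    return found


def group_by_target(export_dir):
    """Group the elements of every exported MP in export_dir by target.

    Returns {target: [(file_name, tag, element_id), ...]}.
    """
    groups = defaultdict(list)
    for path in sorted(Path(export_dir).glob("*.xml")):
        root = ET.parse(path).getroot()
        for target, tag, element_id in elements_with_target(root):
            groups[target].append((path.name, tag, element_id))
    return dict(groups)
```

With namespace-style IDs, a target that lists two rules ending in AppLog.Error1102 from two different files is an obvious redundancy.  With console-generated IDs, the same listing tells you next to nothing.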

Y.A.M. -- Young Admin Myopia – and the Upside of Planning

At this point, Dude was rolling his eyes and groaning in disbelief.  “Why all this fuss and bother if it produces no visible benefit in the console?”  He might well have added, “And why spend my good time assisting future administrators if it’s not in the job description?”  Well, a good retort might follow the logic Robert Duvall gave Sean Penn in “Colors” on the question of running down a hill vs. walking down, but your mileage may vary.

Well, returning to the question we started with, if you ever hope to implement change control audits for your SCOM configuration, especially if you have multiple administrators who make frequent changes, it’s simply unrealistic to try to do it with manual tools such as screenshots and spreadsheets.  There are just too many elements to track without an automated tool.

In the meantime, I hope this article has persuaded you to make the extra effort to convert your GUID-like IDs to a more meaningful namespace style.  Here are some explicit examples of the two styles we’ve been discussing.

Class ID:
Before:  UINameSpace172861e718614744a224992d5237de31.Group
After:  MyCompany.Messaging.Monitoring.MyAppServers.Group

Rule ID:
Before:  MomUIGeneratedRule3d232ca92a3a4e9e9c53e70c6838439b
After:  MyCompany.Messaging.Monitoring.Alert.AppLog.Error1102

Rule Property Override ID:
Before:  OverrideForRuleMomUIGeneratedRule083aa6fe88f34aedb0c871e3da8843a1ForContextUINameSpace59a7638242334435824fcf4ebbf3450bGroup75fba1242f0d4f038b8590e169a123d0
After:  MyCompany.Messaging.Monitoring.Alert.AppLog.Error1102.Override.Interval.MyAppServers
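If several administrators will be coining these IDs, a small helper keeps everyone on the same convention.  What follows is a minimal sketch and not a Microsoft utility:  both function names are hypothetical, the allowed character set (letters, digits, underscores and periods) is my assumption, and the "console-generated" test simply looks for the 32 hexadecimal characters visible in the Before examples above.

```python
import re

# Assumption: each period-delimited segment uses only letters, digits, underscores.
_SEGMENTS = re.compile(r"^[A-Za-z0-9_]+(\.[A-Za-z0-9_]+)*$")
# The console-generated IDs shown above embed a 32-character hex string.
_HEX_RUN = re.compile(r"[0-9a-f]{32}", re.IGNORECASE)


def build_element_id(mp_id, element_type, *descriptors):
    """Join the MP ID, the element type and one or more descriptors with periods."""
    if not descriptors:
        raise ValueError("at least one descriptor is required")
    parts = [mp_id, element_type, *descriptors]
    for part in parts:
        if not _SEGMENTS.match(part):
            raise ValueError(f"invalid ID part: {part!r}")
    return ".".join(parts)


def looks_console_generated(element_id):
    """True if the ID contains a GUID-like run of 32 hex characters."""
    return _HEX_RUN.search(element_id) is not None


rule_id = build_element_id("MyCompany.Messaging.Monitoring", "Alert", "AppLog", "Error1102")
print(rule_id)
print(looks_console_generated("MomUIGeneratedRule3d232ca92a3a4e9e9c53e70c6838439b"))
```

Running looks_console_generated over every ID in your exported MPs gives you a quick to-do list of elements that still need renaming.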

SCOM has been around for a while, but good practices are never too late to start.  I hope this article has shown you that these default IDs are not beyond your control, and they’re worth controlling.

Friday, February 26

IT Security Best Practices and Why Users Could Care Less

In the November issue of  Communications of the ACM, Butler Lampson, a Technical Fellow at Microsoft Research, offers an incisive analysis of the sad state of affairs in Security Management.  I'm not a security practitioner, but I've been around long enough to have witnessed many an IT department be brought to its knees for hours and days while battling a security breach.  Lampson's simple argument is that security experts have set perfection as the goal, and both vendors and customers have bought into this assumption.  He reasons that perfection is missing the point because security management is essentially "risk management:  balancing the loss from breaches against the costs of security.  Unfortunately, both are difficult to measure."

That the costs are difficult to measure is generally obvious to anyone in IT, which typically doesn't even take the time to quantify the impact of component or application outages per hour [Numerous blog postings to follow!].  From the users' perspective, access and authentication interfaces become mere hindrances to doing productive work, so their universal response is to just say yes to any security question--no understanding or sense of ownership required.  Lampson sums up the ramifications of this linkage between economic uncertainty and user indifference with an implicit rebuke of security vendors:

The root cause of the problem is economics: we don’t
know the costs either of getting security
or of not having it, so users quite
rationally don’t care much about it.
Therefore, vendors have no incentive
to make security usable.

I hope this encourages you to download the whole article for yourself.  I will be watching how Security managers and vendors solve these self-limiting practices in the future.

Thursday, February 25

Notes on Bill Powell's March 2009 Presentation -- Impact of Economic Uncertainty on Service Management Plans

In March 2009, Bill Powell of IBM presented a super draft presentation to the NY LIG (Local Interest Group) of ITSMF USA that couldn't have been more interesting.  I wrote up my notes on his talk here, but I encourage you to download the podcast and his slides from the final presentation.  Here's the summary text from the ITSMF conferences posting just to give you an overview.

Amid the global financial turmoil and toughening business conditions, businesses continue to look to IT to provide leadership in responding to challenges and emerging opportunities. This presentation covers the implications and recommendations for leadership in an uncertain economy based on a recently completed IBM research of over 400 IT organizations. This session will focus on the US results, how Service Management is transforming from an IT to a business discipline, and provide practical advice on how best to weather the storm.