Systems Management Blog: Presentation to NetIQ Executive Briefing in NYC, January 13, 2004

The text of the presentation is available as a Microsoft Word file here. The text of the presentation follows:

Good morning, everybody. My name is John MacLeod, and I've worked with two AppManager clients here in Manhattan. I'm going to discuss them with you as CUSTOMER CASE STUDIES 1 AND 2.

I. CUSTOMER CASE STUDY (1)

This customer's Systems Management project initially involved only the messaging department's migration to Exchange. In the initial migration design there were about 80 Exchange 5.x servers - these were grouped into a single backbone site and about a dozen regional sites. Eventually, the messaging agents numbered closer to 250 as we added dedicated public folder servers, a second backbone for the Asia Pacific region, Blackberry servers, Exchange 2K, IMS/World Secure servers for Compliance, and KVS servers for archiving. A little over two years into our deployment we merged with the infrastructure department, which added responsibility for about 150 NT4/W2K PDC/BDC/WINS and software distribution servers.

For messaging statistics we relied on a 3rd party SQL server solution which didn't scale very well-the calculations took longer and longer as we added more users, and finally were taking more than 24 hours to complete-so after a certain point in our migration we had to start extrapolating the totals from the data provided by a subset of the servers. This approach eventually was replaced with a custom solution developed in house from Perl scripts, and today it is AppAnalyzer.

The following are the five main challenges confronted by customer 1:

A. For our core OS/HW monitoring, we needed to replace an existing monitoring product, Sentry (Mission Critical) that was globally deployed to every Windows server in the firm. Sentry had the unfortunate habit of flooding the Windows event logs with meaningless information such as "Sentry is detecting an event" and "Sentry is escalating an event". Either the customer had not configured it effectively, or it was just too noisy a product for our busy environment. So we needed to find a product that was right-sized for our global deployment.

B. We needed to find a product to monitor our Exchange system thoroughly. We knew that we didn't want any product that would require server-specific configuration files as was necessary with Microsoft's native tools (link-monitor and server-monitor). As you know the connections in an Exchange site are a mesh of two-way links, so the number of queues is 2(n-1) factorial. Since we were anticipating sites with about a dozen queues, this would have been a nightmare to administer manually. So we needed to find a product that could see new servers and queues dynamically.

C. We were expected to provide summary status reports to upper management. Because Exchange was fairly new to us at that time, we didn't know exactly what reports would be useful and necessary, so we wanted a product that provided a good range of application reports to get us going.

D. We had to satisfy our SQL Server DBA team that our application supported Windows authentication and would not require 'SA' or 'Local Admin' privileges. Not every company will have as many DBA constraints as these, but fortunately AM was able to run with these modifications.

E. We knew we needed a product that was open enough that we could extend it fairly easily. What we wanted was the functionality that the RunDOS script provided, as it allowed us to distribute local tasks easily and see the results of the tasks in the console.

But we also wanted to customize the monitoring with new tasks that were specific to our environment. This capability, of course, was provided through the developer's console, which allowed us to do virtually anything that fit the model of a scripted job. As it turned out, we were able to script a crucial software distribution job-namely, the rollout of new anti-virus pattern files from TrendMicro-as a totally automated system, which in fact pre-dated the AM module for ScanMail.

The AM deployment: We initially deployed four QDB's, one on each of four SQL servers. These were deployed in our regional data centers. Each SQL server doubled as the MS for the region. The version of AppManager we started with was 2.0, and we finished with 4.3, so we went through two major upgrades and a few minor upgrades. The customer is today at 5.0.1. We trained each of the local Exchange administrators in handling the AM console and understanding the events. It was widely deemed a successful system within the firm.

II. CUSTOMER CASE STUDY (2)

The second customer was again a messaging department and again it involved a migration to a new Exchange system, but it was a simpler environment since we were monitoring only the Exchange 2000 application, and a separate department was monitoring the OS/HW with BMC Patrol. Also, Exchange 2000 is bit easier to administer than Exchange 5.x because the Global Address List is no longer hosted on each Exchange server, so the database health and replication monitoring were moved to the department that maintained the AD. Like the first customer, the messaging system also supported Blackberry servers and was planning to use an anti-virus product, but the decision between TrendMicro and Sybari had not been made.

We cut our teeth with an initial pilot that consisted of only 12 Exchange 2000 servers in four routing groups, distributed to four regional IT centers. In production this deployment grew to 60 servers with clustered mailbox servers, two dedicated routing groups for backbones, and dedicated front-end servers for OWA users. For Exchange statistics this firm had already selected AppAnalyzer.

The following are the five main challenges confronted by customer 2:

A. We had to do a head-to-head comparison of AM with MOM, but also consider other Exchange 2000 monitoring tools such as Quest, Bindview, Microsoft's native tools, etc. The reason a comparison to MOM specifically was required was that the firm was converting their OS/HW level monitoring from BMC Patrol to MOM. We were given 30 days for this comparison.

B. The monitoring and reporting needed to be ready for Day One of the pilot deployment, which was scheduled for 30 days after the comparison project ended. The assumption was that servers from the evaluation would be reusable for the pilot. As for the reporting requirement at this customer, they clearly expected the reporting would be useful, flexible and entirely web-based. As it turned out, since we went with AM 5.x, this was actually not a great problem. Had we gone with MOM, we would have been writing reports with the Microsoft Access report designer and having to schedule them with static batch files.

C. We needed a two-way link to Micromuse NetCOOL, which was the firm's Manager of Managers. As you are all aware, AM provides numerous connectors to other monitoring programs including NetCOOL. It turned out that we installed this connector in about 90 minutes and it ran successfully for the duration of the pilot.

D. Easy to use and extend. A key requirement for extensibility was in automating the weekly reboots of our Exchange clusters. The reboot had to be performed with complete control of the logical application so that at no time was a node rebooted if the application was not running on the other node.

E. We also needed the main components of the monitoring to be redundant, so that we could tolerate an outage in a data center without losing our monitoring. As you all know, today Business Contingency Planning is a must-have on all projects.

The AM deployment: The production deployment consisted of five servers for AM, all located in the NY/NJ data centers: a clustered (active/passive) SQL Server for the QDB and the AppAnalyzer databases; one reporting agent that also served as the web console server, and three management servers, each one dedicated to a continent. There was also one OLAP server for AppAnalyzer - this did not have any redundancy, but this was acceptable since it was only for reporting.

I just want to mention that our choice of AM over MOM hinged primarily on three technical merits:
- Existing modules to support Blackberry and either TrendMicro or Sybari AV
- Better integration of reports, especially the AM reports portal
- Roughly equivalent coverage of the core monitoring requirements but with fewer discreet tasks

III. Best Practices - Four Lessons Learned

A. Architectural planning is key. The more complex your environment is, whether in terms of the number of QDB's you've deployed, redundancy, the impact on the agent of monitoring jobs, or other factors, the more important it is to get your architecture right. I'm sure I'm preaching to the choir on this point.

A corollary point is that the more complex your environment is, the more likely you'll need to customize it. I'll speak more about customization in a minute when I talk about scripting.

A second corollary is to maintain a current lab setup. I think too often Systems Management is not considered high enough a priority when budgets are tight to justify the extra expense of a lab, but without it you're really never sure when a new job will have harmful side-affects.

B. Documenting the environment is critical. Be rigorous in your documentation of installations/upgrades. I suggest the best documentation is pre- and post-installation snapshots of your servers' configurations, including every file and registry setting and, in the case of the QDB, every object in the database. There are sophisticated and expensive tools to collect these snapshots, but you can really use fairly simple ones as well. To record changes to the database you can even use SQL Server's native database scripting tool.

This is obviously required NOT for every machine but for every configuration-i.e., at least one cluster if you're monitoring any clusters, at least one MS, at least one agent for every server type, and of course the SQL server.

A corollary rule is to keep on top of the changes made to your environment. While it's standard practice for every IT shop to announce all changes in advance, I'm suggesting you need to tie in your snapshot procedures with these change plans so that your snapshots are as close to the before and after picture as possible.

C. The AM online forum is a big help. It seems everyone who gets help with a problem tries to help someone else, and the NetIQ moderators are excellent. I also find the forum's search tool very helpful. Another great source of online help, especially for newcomers to AM, is the KS depot.

D. Validate that your monitoring system is doing what you expect. By this I mean you have to be diligent about tracking down the cause of any anomalous behavior from any monitoring component. The value of this rule is multiplied for larger deployments where small problems can multiple very quickly.

If you have a new agent installation that is failing, you should rectify that immediately before rolling out any other agents. You need to work with tech support on these problems as soon as possible. Simply stated, you need to maintain the following standards:

- Every agent should run every job reliably, 24 x 7.

- Every policy should be reflected on every agent promptly.

- Every report should always run correctly.

A corollary note is that one of the core dependencies of AM is also one of the hardest components to troubleshoot when it fails. That's the RPC services between the agents and the MS. If an agent has problems with its RPC, it may not be on account of any change in your AM configuration-in other words it can be the result of a configuration change made by another application-but it will halt your monitoring dead in its tracks all the same, and could drive you a bit crazy in the process. This is an excellent time to take a new configuration snapshot to see what's changed since the last known good configuration.

E. I've found it very helpful to have a good tool for file distribution that is independent of the monitoring system. By this I mean a program that lets you easily push or pull files to or from every server in your system. One ad hoc example when this is handy is during a virus outbreak and you're given a list of possible places to look for signs of the virus. By pushing out a simple command file that looks for it, you enable your agents to scan for it with the easy RunDOS KS-which is exactly what we did at Customer 1 for both the Code Red and the I-Love-U viruses. With the pull facility you can collect AM-generated files from the agents to use as inputs to a report or to confirm the consistency of your agent deployment.

IV. Thoughts on scripting

I want to discuss scripting for Systems Management because I think many administrators are still reluctant to customize their solutions for fear that they will be unable to upgrade their product or that they will break certain Report dependencies. Obviously, the standard practice with regard to the first concern is to rename your KS to a proprietary naming convention that will not conflict with the product upgrade, and in the latter concern you can usually find and fix the Reports that use hard-coded KS names.

A. On the Windows platform, Microsoft has made it easier and easier to access HW/OS configuration and status information in scripts - something which was always taken more or less for granted in the Unix platform.

The main advantage that more pervasive scripting offers to Systems Management is that more KS can be self-sufficient in terms of querying their environment. By that I mean that you can now accomplish more tasks within the KS compiler without having to shell out to the system. The disadvantage of shelling out to the system to call an external program is that it adds a layer of overhead and error checking. Overhead is a bad thing when a system is stressed, and too many sources of error is a bad thing for the developers who have to write and maintain code.

B. Another thought on scripting is that every vendor now uses XML files for their application interfaces, and I think we all have seen the power of XML for simplifying development. One place XML formatting can be immediately useful to Systems Management administrators is in the area of reports. Whereas in the past IT reports were formatted in hard-coded text layouts or comma-delimited records for display in a spreadsheet, today we want most of our reports in HTML. By generating report data in XML files, one has the option of presenting it in any HTML page. HTML pages generated from XML data files can support sorting and filtering within the browser instead of requiring a round trip back to the web server to execute a CGI or ASP script. I assume that most of you are already using XML techniques, but I wanted to mention it for those who maybe have not yet taken this direction.

V. Integration of AppManager with other solutions

I'm going to discuss three categories of products that cohabit AppManager's monitoring space: M-O-M, other NetIQ products, and other non-NetIQ products.

A. Manager of Managers

1. I have heard fairly negative feedback about most M-O-M products except for two: Micromuse NetCOOL and Managed Objects. I think what differentiates these two from the M-O-M offerings of the usual suspects (Tivoli, CA and HP) is that they were designed from their inception to be M-O-M's rather than a component monitor with certain M-O-M features bolted on. Of the two, I'm most interested in Managed Objects for deployments where business unit managers want to see their service status.

2. NetIQ may add an event correlation engine to AM 6. This should be very exciting if it's done with an API that allows users and third parties to define their own relation end-points. On the other hand, for a multi-platform enterprise, which typically already has several point solutions in place, the correlation determination will need to be made at the level of a M-O-M.

B. Other NetIQ products

Obviously, if you are monitoring Exchange 5 or 2K or 2K3, NetIQ AppAnalyzer for Exchange is a very cool product. It was one of the first systems management products to use OLAP services, and it was also one of the first to use the .Net Framework. So it has a history of being cutting edge.

I found the combination of AppManager agents with AppAnalyzer KS to run the local Exchange data gathering tasks is a very efficient combination, and one that scales very well. I believe it's the IRS that uses AppAnalyzer to report on something like 75,000 mailboxes.

Secondly, the Diagnostics console for Windows and for SQL Server are extremely useful for real-time graphs and performance snapshots. They're priced very reasonably, and if you don't already have a product in-house that delivers this real-time perspective then you really should take look at them.

C. Other Non-NetIQ products

Only one - Netuitive Analytics. This is an excellent performance bench-marker that can't function by itself - it needs either AM or BMC Patrol. It may integrate with other monitoring solutions in the future.

Netuitive's primary focus is on dynamic alarm thresholds that are generated uniquely for each machine based on its past performance over time. These dynamic thresholds are not derived from simple moving average calculations of a single performance metric. The data from multiple performance counters that are collected by the AM agent are used as inputs to a patented neural network that can predict the servers key performance indicators up to two hours in the future.

NA also provides excellent value for its intuitive baselining capabilities. Some scenarios where these baselines would be useful are

– Your most critical servers, since you want to know you've tuned their performance as much as possible;
– Your heaviest used servers, since you want to know what limits to set, if any, for your static thresholds, and
– Your lab servers, since you want to compare performance from multiple configurations.

Another note on NA: the company's latest release uses some very exciting Open Systems technology from the Apache project that, I think, will attract a growing number of customers. Previously the reporting interface required Microsoft IIS exclusively, but its new interface is written on top of Jakarta and TomCat, which are Java/XML projects of the Apache organization - that's www.apache.org. This allows the user interface to run identically on any web server that supports Java and XML.

I'd like to thank NetIQ and Bekim Protopapa in particular for inviting me to speak today. I hope my comments were helpful, as I think AppManager is a great product that still has a lot to offer. I'd be happy to answer any of your questions after the briefing.

Systems Management Blog

Tuesday, January 13

Presentation to NetIQ Executive Briefing in NYC, January 13, 2004

No comments:

Post a Comment

Blog Archive