Monday, July 19

Updates to AppManager 6.0 Review

Last month I posted a review of the beta version of AppManager 6.0. This month (July 15, 2004) I assisted NetIQ by demonstrating some of the new features of the product in a Placeware audiocast conducted by NetIQ.

There are some features of the product that were discussed in the audiocast but were not covered in the review, and they include the following:

1. Action_RunKS -- this knowledge script allows the administrator to launch up to three knowledge scripts dynamically from a job action
2. Ability to resize the values tab of a knowledge script
3. Knowledge scripts for collecting data for the Diagnostic Console. This integration with Diagnostic Console includes both Exchange and NT server core metrics.
4. Action Severity Configuration -- this refers to the ability within many Action knowledge scripts to fire only when the severity of the triggering event is within a range defined as part of the Action knowledge script.
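To make item 4 concrete, here is a minimal sketch of severity-range filtering, written in Python purely for illustration. It is not AppManager's own scripting interface. The names `SeverityRange` and `run_action_if_in_range` and the numeric bounds are assumptions of mine, not anything taken from the product.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class SeverityRange:
    """Inclusive severity window for an action (bounds are illustrative)."""
    low: int
    high: int

    def contains(self, severity: int) -> bool:
        return self.low <= severity <= self.high


def run_action_if_in_range(
    event_severity: int,
    window: SeverityRange,
    action: Callable[[int], None],
) -> bool:
    """Call `action` only when the event severity falls inside the window."""
    if window.contains(event_severity):
        action(event_severity)
        return True
    return False
```

The point of the design is that the filter lives with the action rather than with the monitoring job, so one job can feed several actions that each respond to a different severity band.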

As I think of other features I've left out, I'll post them here.

Tuesday, March 2

2003 Market Research on SLAs from Veritas

In a September 2003 survey commissioned by Veritas, data-center managers and their non-IT counterparts at 604 organizations with at least 500 employees responded to questions on the nature of their usage of SLAs. Veritas sponsored the survey as part of its "Utility Computing" product focus.

I've culled some of their statistics here, but you can download the original report for the full analysis.

1. Over 59% of the respondents do use SLAs.

2. Among organizations that have SLAs in place, the following key IT performance areas are covered:

- Processing performance (37 percent)
- System availability and uptime (35 percent)
- Restoration times following an outage (29 percent)
- None of the above (39 percent)

3. In twenty-five percent of all cases, respondents said the SLAs were crafted without the involvement of the department heads.

4. With respect to the IT reports generated in support of SLAs, non-IT managers used them as follows:

- to make department operations more efficient (34 percent)
- to work with the IT department to lower costs (28 percent)
- not used successfully (39 percent)

The research covered a wide array of companies in the United States, nine countries throughout Europe, the Middle East and South Africa. It was conducted by Dynamic Markets of the UK.

Thursday, February 12

Insights on IT organizations and their use of metrics

A recent posting by Dave Morgan, "Common Mistakes in Selecting and Implementing Analytics Systems," focuses on the application of metrics to web site analysis. However, this article also contains useful observations for the issue of metrics for Systems Management. Dave is a highly successful consultant and entrepreneur in the media field, having founded Real Media in the 1990s and Tacoda Systems in 2001. This article will convince you that opinions like his are an essential counterweight to the hype and exaggeration offered by most vendors.

Tuesday, January 27

Microsoft SQL Server 2000 Reporting Services Launch

As described in a previous post (see correction below), the Reporting Services for SQL Server 2000 has now been released. The New York launch event today was very well attended, although technical difficulties caused frequent freeze-ups of the web broadcast portion of the event.

From a Systems Management perspective, this product has great potential. Good reports are one of the core values of a Systems Management (SM) product. Reporting Services holds out the promise of maximizing the reporting potential of an SM tool.

I am working with Reporting Services against NetIQ AppManager and Microsoft MOM databases to see how much effort is involved in producing effective reports. Results will be posted to the Sample Reports page of the JMACINC.com site.

Correction to post Microsoft Reporting Products (Friday, January 16): Because Reporting Services can read from OLE DB data sources, it can generate reports from OLAP databases such as Microsoft Analysis Services.

Friday, January 16

Microsoft Reporting Products

Microsoft's upcoming release of Reporting Services, now in Beta 2, is an extension to SQL Server that could become a prominent tool for Systems Management deployments. In addition to custom reports for MOM 2004, reports can be created from any SQL Server or Oracle database, such as AppManager or MOM 2000. What IT managers will probably most like about Reporting Services is that it allows users to subscribe to reports on their own schedules.

It seems it can't report against OLAP databases, but this may be an incorrect assumption. For this type of enterprise data reporting, the best alternative may be Crystal Analysis Pro, which also offers managed subscriptions and web-based authoring. Crystal Decisions was acquired in 2003 by Business Objects.

One Systems Management vendor that is planning to release a reporting product that uses Reporting Services is NetIQ. More on this to follow.

Tuesday, January 13

Presentation to NetIQ Executive Briefing in NYC, January 13, 2004

The presentation is available as a Microsoft Word file here. Its text follows:

Good morning, everybody. My name is John MacLeod, and I've worked with two AppManager clients here in Manhattan. I'm going to discuss them with you as CUSTOMER CASE STUDIES 1 AND 2.

I. CUSTOMER CASE STUDY (1)

This customer's Systems Management project initially involved only the messaging department's migration to Exchange. In the initial migration design there were about 80 Exchange 5.x servers - these were grouped into a single backbone site and about a dozen regional sites. Eventually, the messaging agents numbered closer to 250 as we added dedicated public folder servers, a second backbone for the Asia Pacific region, Blackberry servers, Exchange 2K, IMS/World Secure servers for Compliance, and KVS servers for archiving. A little over two years into our deployment we merged with the infrastructure department, which added responsibility for about 150 NT4/W2K PDC/BDC/WINS and software distribution servers.

For messaging statistics we relied on a third-party SQL Server solution that didn't scale very well. The calculations took longer and longer as we added more users, and finally were taking more than 24 hours to complete, so after a certain point in our migration we had to start extrapolating the totals from the data provided by a subset of the servers. This approach was eventually replaced with a custom solution developed in house from Perl scripts, and today the statistics come from AppAnalyzer.

The following are the five main challenges confronted by customer 1:

A. For our core OS/HW monitoring, we needed to replace an existing monitoring product, Sentry (Mission Critical) that was globally deployed to every Windows server in the firm. Sentry had the unfortunate habit of flooding the Windows event logs with meaningless information such as "Sentry is detecting an event" and "Sentry is escalating an event". Either the customer had not configured it effectively, or it was just too noisy a product for our busy environment. So we needed to find a product that was right-sized for our global deployment.

B. We needed to find a product to monitor our Exchange system thoroughly. We knew that we didn't want any product that would require server-specific configuration files as was necessary with Microsoft's native tools (link-monitor and server-monitor). As you know, the connections in an Exchange site are a mesh of two-way links, so a site of n servers has n(n-1) queues. Since we were anticipating sites with about a dozen queues, this would have been a nightmare to administer manually. So we needed to find a product that could see new servers and queues dynamically.

C. We were expected to provide summary status reports to upper management. Because Exchange was fairly new to us at that time, we didn't know exactly what reports would be useful and necessary, so we wanted a product that provided a good range of application reports to get us going.

D. We had to satisfy our SQL Server DBA team that our application supported Windows authentication and would not require 'SA' or 'Local Admin' privileges. Not every company will have as many DBA constraints as these, but fortunately AM was able to run with these modifications.

E. We knew we needed a product that was open enough that we could extend it fairly easily. What we wanted was the functionality that the RunDOS script provided, as it allowed us to distribute local tasks easily and see the results of the tasks in the console.

But we also wanted to customize the monitoring with new tasks that were specific to our environment. This capability, of course, was provided through the developer's console, which allowed us to do virtually anything that fit the model of a scripted job. As it turned out, we were able to script a crucial software distribution job (namely, the rollout of new anti-virus pattern files from TrendMicro) as a totally automated system, which in fact pre-dated the AM module for ScanMail.
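To put numbers on the queue growth mentioned under challenge B, here is a small worked calculation in Python. It assumes a full mesh in which every server keeps one outbound queue to each of the other servers in its site. The function name is mine, and the site sizes are only examples.

```python
def directed_queue_count(n_servers: int) -> int:
    """One-way queues in a full mesh: each server has a queue to every other server."""
    if n_servers < 0:
        raise ValueError("n_servers must be non-negative")
    return n_servers * (n_servers - 1)


if __name__ == "__main__":
    for n in (4, 6, 12):
        print(f"{n} servers -> {directed_queue_count(n)} queues")
```

Under that assumption a site of 4 servers already has 12 queues, and a site of 12 servers has 132, which is why per-queue configuration files do not scale.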

The AM deployment: We initially deployed four QDB's, one on each of four SQL servers. These were deployed in our regional data centers. Each SQL server doubled as the MS for the region. The version of AppManager we started with was 2.0, and we finished with 4.3, so we went through two major upgrades and a few minor upgrades. The customer is today at 5.0.1. We trained each of the local Exchange administrators in handling the AM console and understanding the events. It was widely deemed a successful system within the firm.

II. CUSTOMER CASE STUDY (2)

The second customer was again a messaging department and again it involved a migration to a new Exchange system, but it was a simpler environment since we were monitoring only the Exchange 2000 application, and a separate department was monitoring the OS/HW with BMC Patrol. Also, Exchange 2000 is a bit easier to administer than Exchange 5.x because the Global Address List is no longer hosted on each Exchange server, so the database health and replication monitoring were moved to the department that maintained the AD. Like the first customer, the messaging system also supported Blackberry servers and was planning to use an anti-virus product, but the decision between TrendMicro and Sybari had not been made.

We cut our teeth with an initial pilot that consisted of only 12 Exchange 2000 servers in four routing groups, distributed to four regional IT centers. In production this deployment grew to 60 servers with clustered mailbox servers, two dedicated routing groups for backbones, and dedicated front-end servers for OWA users. For Exchange statistics this firm had already selected AppAnalyzer.

The following are the five main challenges confronted by customer 2:

A. We had to do a head-to-head comparison of AM with MOM, but also consider other Exchange 2000 monitoring tools such as Quest, Bindview, Microsoft's native tools, etc. The reason a comparison to MOM specifically was required was that the firm was converting their OS/HW level monitoring from BMC Patrol to MOM. We were given 30 days for this comparison.

B. The monitoring and reporting needed to be ready for Day One of the pilot deployment, which was scheduled for 30 days after the comparison project ended. The assumption was that servers from the evaluation would be reusable for the pilot. As for the reporting requirement at this customer, they clearly expected the reporting would be useful, flexible and entirely web-based. As it turned out, since we went with AM 5.x, this was actually not a great problem. Had we gone with MOM, we would have been writing reports with the Microsoft Access report designer and having to schedule them with static batch files.

C. We needed a two-way link to Micromuse NetCOOL, which was the firm's Manager of Managers. As you are all aware, AM provides numerous connectors to other monitoring programs including NetCOOL. It turned out that we installed this connector in about 90 minutes and it ran successfully for the duration of the pilot.

D. Easy to use and extend. A key requirement for extensibility was in automating the weekly reboots of our Exchange clusters. The reboot had to be performed with complete control of the logical application so that at no time was a node rebooted if the application was not running on the other node.

E. We also needed the main components of the monitoring to be redundant, so that we could tolerate an outage in a data center without losing our monitoring. As you all know, today Business Contingency Planning is a must-have on all projects.
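The cluster reboot rule in item D can be sketched roughly as follows. This is Python for illustration only, not the script we ran. The four callables are hypothetical hooks that stand in for whatever cluster tooling is available, and the sketch assumes a two-node cluster.

```python
from typing import Callable, List, Tuple


def safe_rolling_reboot(
    nodes: Tuple[str, str],
    app_running_on: Callable[[str], bool],
    move_app_to: Callable[[str], None],
    reboot_and_wait: Callable[[str], None],
) -> List[str]:
    """Reboot both nodes of a two-node cluster, one at a time.

    A node is rebooted only after the application is confirmed to be
    running on the other node. Returns the nodes rebooted, in order.
    """
    rebooted: List[str] = []
    for node, other in (nodes, nodes[::-1]):
        if not app_running_on(other):
            move_app_to(other)  # hypothetical failover hook
        if not app_running_on(other):
            # Stop rather than risk taking the application down.
            raise RuntimeError(f"application not running on {other}; not rebooting {node}")
        reboot_and_wait(node)  # hypothetical: returns once the node is back up
        rebooted.append(node)
    return rebooted
```

The second check after the failover is deliberate: the reboot proceeds on evidence that the application is up elsewhere, not on the assumption that the move succeeded.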

The AM deployment: The production deployment consisted of five servers for AM, all located in the NY/NJ data centers: a clustered (active/passive) SQL Server for the QDB and the AppAnalyzer databases; one reporting agent that also served as the web console server, and three management servers, each one dedicated to a continent. There was also one OLAP server for AppAnalyzer - this did not have any redundancy, but this was acceptable since it was only for reporting.

I just want to mention that our choice of AM over MOM hinged primarily on three technical merits:
- Existing modules to support Blackberry and either TrendMicro or Sybari AV
- Better integration of reports, especially the AM reports portal
- Roughly equivalent coverage of the core monitoring requirements but with fewer discrete tasks

III. Best Practices - Five Lessons Learned

A. Architectural planning is key. The more complex your environment is, whether in terms of the number of QDB's you've deployed, redundancy, the impact on the agent of monitoring jobs, or other factors, the more important it is to get your architecture right. I'm sure I'm preaching to the choir on this point.

A corollary point is that the more complex your environment is, the more likely you'll need to customize it. I'll speak more about customization in a minute when I talk about scripting.

A second corollary is to maintain a current lab setup. I think too often Systems Management is not considered high enough a priority when budgets are tight to justify the extra expense of a lab, but without it you're really never sure when a new job will have harmful side effects.

B. Documenting the environment is critical. Be rigorous in your documentation of installations/upgrades. I suggest the best documentation is pre- and post-installation snapshots of your servers' configurations, including every file and registry setting and, in the case of the QDB, every object in the database. There are sophisticated and expensive tools to collect these snapshots, but you can really use fairly simple ones as well. To record changes to the database you can even use SQL Server's native database scripting tool.

This is obviously required NOT for every machine but for every configuration: at least one cluster if you're monitoring any clusters, at least one MS, at least one agent for every server type, and of course the SQL server.

A corollary rule is to keep on top of the changes made to your environment. While it's standard practice for every IT shop to announce all changes in advance, I'm suggesting you need to tie in your snapshot procedures with these change plans so that your snapshots are as close to the before and after picture as possible.
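As an example of the "fairly simple" end of the tooling range, the file portion of a before-and-after snapshot could be sketched as below. This is Python for illustration and covers files only. Registry settings and database objects would need their own collectors, and the function names are my own.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def snapshot(root: str) -> Dict[str, str]:
    """Map each file path (relative to root) to the SHA-256 of its contents."""
    base = Path(root)
    result: Dict[str, str] = {}
    for path in sorted(base.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            result[path.relative_to(base).as_posix()] = digest
    return result


def diff_snapshots(before: Dict[str, str], after: Dict[str, str]) -> Dict[str, List[str]]:
    """Report files added, removed, or changed between two snapshots."""
    common = set(before) & set(after)
    return {
        "added": sorted(set(after) - set(before)),
        "removed": sorted(set(before) - set(after)),
        "changed": sorted(name for name in common if before[name] != after[name]),
    }
```

Taking one snapshot just before the change window and one just after, then saving the diff with the change record, gives the before-and-after picture described above.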

C. The AM online forum is a big help. It seems everyone who gets help with a problem tries to help someone else, and the NetIQ moderators are excellent. I also find the forum's search tool very helpful. Another great source of online help, especially for newcomers to AM, is the KS depot.

D. Validate that your monitoring system is doing what you expect. By this I mean you have to be diligent about tracking down the cause of any anomalous behavior from any monitoring component. The value of this rule is multiplied for larger deployments where small problems can multiply very quickly.

If you have a new agent installation that is failing, you should rectify that immediately before rolling out any other agents. You need to work with tech support on these problems as soon as possible. Simply stated, you need to maintain the following standards:

- Every agent should run every job reliably, 24 x 7.

- Every policy should be reflected on every agent promptly.

- Every report should always run correctly.

A corollary note is that one of the core dependencies of AM is also one of the hardest components to troubleshoot when it fails: the RPC services between the agents and the MS. If an agent has problems with its RPC, it may not be on account of any change in your AM configuration (in other words, it can be the result of a configuration change made by another application), but it will halt your monitoring dead in its tracks all the same, and could drive you a bit crazy in the process. This is an excellent time to take a new configuration snapshot to see what's changed since the last known good configuration.

E. I've found it very helpful to have a good tool for file distribution that is independent of the monitoring system. By this I mean a program that lets you easily push or pull files to or from every server in your system. One ad hoc case where this is handy is a virus outbreak, when you're given a list of possible places to look for signs of the virus. By pushing out a simple command file that looks for it, you enable your agents to scan for it with the easy RunDOS KS, which is exactly what we did at Customer 1 for both the Code Red and the I-Love-U viruses. With the pull facility you can collect AM-generated files from the agents to use as inputs to a report or to confirm the consistency of your agent deployment.

IV. Thoughts on scripting

I want to discuss scripting for Systems Management because I think many administrators are still reluctant to customize their solutions for fear that they will be unable to upgrade their product or that they will break certain Report dependencies. Obviously, the standard practice with regard to the first concern is to rename your KS to a proprietary naming convention that will not conflict with the product upgrade, and for the second concern you can usually find and fix the Reports that use hard-coded KS names.

A. On the Windows platform, Microsoft has made it easier and easier to access HW/OS configuration and status information in scripts - something which was always taken more or less for granted on the Unix platform.

The main advantage that more pervasive scripting offers to Systems Management is that more KS can be self-sufficient in terms of querying their environment. By that I mean that you can now accomplish more tasks within the KS compiler without having to shell out to the system. The disadvantage of shelling out to the system to call an external program is that it adds a layer of overhead and error checking. Overhead is a bad thing when a system is stressed, and too many sources of error are a bad thing for the developers who have to write and maintain code.
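The difference in error checking is easy to see side by side. The sketch below is Python rather than a knowledge script, and it is only an illustration. To stay portable, the "external program" is just a second Python interpreter standing in for a command-line utility. The function names and the timeout value are my own assumptions.

```python
import shutil
import subprocess
import sys


def free_bytes_in_process(path: str) -> int:
    """Query free disk space directly through the standard library."""
    return shutil.disk_usage(path).free


def free_bytes_shelling_out(path: str, timeout_s: float = 30.0) -> int:
    """Get the same number by launching a child process and parsing its output.

    The child is another Python interpreter, used only as a portable
    stand-in for an external utility.
    """
    child_code = "import shutil, sys; print(shutil.disk_usage(sys.argv[1]).free)"
    try:
        completed = subprocess.run(
            [sys.executable, "-c", child_code, path],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        raise RuntimeError(f"could not run child process: {exc}") from exc
    if completed.returncode != 0:
        raise RuntimeError(f"child failed: {completed.stderr.strip()}")
    try:
        return int(completed.stdout.strip())
    except ValueError as exc:
        raise RuntimeError(f"unexpected output: {completed.stdout!r}") from exc
```

The in-process version has one failure mode, an exception from the call itself. The shell-out version has to handle a failed launch, a timeout, a non-zero exit code, and unparseable output, and it pays for a process start each time it runs.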

B. Another thought on scripting is that every vendor now uses XML files for their application interfaces, and I think we all have seen the power of XML for simplifying development. One place XML formatting can be immediately useful to Systems Management administrators is in the area of reports. Whereas in the past IT reports were formatted in hard-coded text layouts or comma-delimited records for display in a spreadsheet, today we want most of our reports in HTML. By generating report data in XML files, one has the option of presenting it in any HTML page. HTML pages generated from XML data files can support sorting and filtering within the browser instead of requiring a round trip back to the web server to execute a CGI or ASP script. I assume that most of you are already using XML techniques, but I wanted to mention it for those who maybe have not yet taken this direction.
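For those who have not yet gone this way, here is a minimal sketch of the idea in Python, using only the standard library. The XML layout (a `report` element containing `row` elements with attributes) and the sample server names are invented for illustration; they are not a format produced by any product mentioned here. The sketch renders a static table only. Sorting and filtering in the browser would need client-side script on top of it.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical report data; the element and attribute names are assumptions.
SAMPLE_XML = """\
<report title="Mailbox store size">
  <row server="EXCH01" store="Mailbox Store 1" size_mb="2048"/>
  <row server="EXCH02" store="Mailbox Store 1" size_mb="5120"/>
</report>
"""


def xml_report_to_html(xml_text: str) -> str:
    """Render the <row> elements of a report as a plain HTML table."""
    root = ET.fromstring(xml_text)
    rows = root.findall("row")
    # Column order follows the first appearance of each attribute name.
    columns = list(dict.fromkeys(name for row in rows for name in row.attrib))
    lines = [f"<h1>{escape(root.get('title', 'Report'))}</h1>", "<table>"]
    lines.append("<tr>" + "".join(f"<th>{escape(c)}</th>" for c in columns) + "</tr>")
    for row in rows:
        cells = "".join(f"<td>{escape(row.get(c, ''))}</td>" for c in columns)
        lines.append(f"<tr>{cells}</tr>")
    lines.append("</table>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(xml_report_to_html(SAMPLE_XML))
```

Because the data file carries no layout, the same XML can feed a different page, a spreadsheet import, or another report without regenerating it.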

V. Integration of AppManager with other solutions

I'm going to discuss three categories of products that cohabit AppManager's monitoring space: M-O-M, other NetIQ products, and other non-NetIQ products.

A. Manager of Managers

1. I have heard fairly negative feedback about most M-O-M products except for two: Micromuse NetCOOL and Managed Objects. I think what differentiates these two from the M-O-M offerings of the usual suspects (Tivoli, CA and HP) is that they were designed from their inception to be M-O-M's rather than a component monitor with certain M-O-M features bolted on. Of the two, I'm most interested in Managed Objects for deployments where business unit managers want to see their service status.

2. NetIQ may add an event correlation engine to AM 6. This should be very exciting if it's done with an API that allows users and third parties to define their own relation end-points. On the other hand, for a multi-platform enterprise, which typically already has several point solutions in place, the correlation determination will need to be made at the level of a M-O-M.

B. Other NetIQ products

Obviously, if you are monitoring Exchange 5 or 2K or 2K3, NetIQ AppAnalyzer for Exchange is a very cool product. It was one of the first systems management products to use OLAP services, and it was also one of the first to use the .Net Framework. So it has a history of being cutting edge.

I found that using AppManager agents with AppAnalyzer KS to run the local Exchange data gathering tasks is a very efficient combination, and one that scales very well. I believe it's the IRS that uses AppAnalyzer to report on something like 75,000 mailboxes.

Secondly, the Diagnostics consoles for Windows and for SQL Server are extremely useful for real-time graphs and performance snapshots. They're priced very reasonably, and if you don't already have a product in-house that delivers this real-time perspective then you really should take a look at them.

C. Other Non-NetIQ products

Only one - Netuitive Analytics. This is an excellent performance bench-marker that can't function by itself - it needs either AM or BMC Patrol. It may integrate with other monitoring solutions in the future.

Netuitive's primary focus is on dynamic alarm thresholds that are generated uniquely for each machine based on its past performance over time. These dynamic thresholds are not derived from simple moving average calculations of a single performance metric. The data from multiple performance counters that are collected by the AM agent are used as inputs to a patented neural network that can predict the server's key performance indicators up to two hours in the future.

NA also provides excellent value for its intuitive baselining capabilities. Some scenarios where these baselines would be useful are

– Your most critical servers, since you want to know you've tuned their performance as much as possible;
– Your most heavily used servers, since you want to know what limits to set, if any, for your static thresholds, and
– Your lab servers, since you want to compare performance from multiple configurations.

Another note on NA: the company's latest release uses some very exciting Open Systems technology from the Apache project that, I think, will attract a growing number of customers. Previously the reporting interface required Microsoft IIS exclusively, but its new interface is written on top of Jakarta and Tomcat, which are Java/XML projects of the Apache organization - that's www.apache.org. This allows the user interface to run identically on any web server that supports Java and XML.

I'd like to thank NetIQ and Bekim Protopapa in particular for inviting me to speak today. I hope my comments were helpful, as I think AppManager is a great product that still has a lot to offer. I'd be happy to answer any of your questions after the briefing.


Saturday, January 10

Refer to www.JMACINC.com for the rest of the story.