Proposed rewrite of the DICE monitoring system

The current Nagios monitoring system works well and we should continue to use it. It is not easily maintainable, however, and a project proposal exists for a future rewrite, the aim of which would be to produce a much simpler system. What follows are some preliminary thoughts on the problems of the current system, and on possible alternatives.

Problems with the current system

  1. The internal design of the system is complex and undocumented.

  2. The system consists of dense, object-oriented Perl code. The design and implementation presumably made sense to the original author, but in the absence of design documentation it is not obvious why the code takes the shape it does; procedural code (still modularised, of course) might have produced a much clearer outcome.

    Question: What are the essential 'objects' in this system?

  3. The system is designed to support monitoring systems other than Nagios (how easy it would be to integrate such systems is another matter), and therefore has more complexity than a system designed purely to integrate with Nagios.

  4. The so-called 'translators' used by the system are a complex answer to a question which, in many cases, doesn't even arise. If a machine wants to be monitored for ssh, say, then that fact just needs to be declared to the monitoring server as a simple boolean: 'please monitor me for ssh'. A 'translator' is entirely unnecessary. (A sketch of this idea follows this list.)

    Likewise, there is no point in having code in, say, the Kerberos 'translator' which checks whether or not the requesting machine is 'entitled' to ask for monitoring by virtue of its being a KDC. We can trust the managers of any machine to request appropriate monitoring; and if, for example, a non-KDC machine does (inappropriately) request monitoring of its (non-existent) KDC service, then the monitoring system itself will report that as an error soon enough.

    My suspicion is that the idea of the translators came about from the potentially complex requirements of the lcfg-apacheconf component - but, if so, we can ask whether there are alternatives to the monitoring which the current apacheconf translator configures.

    Conclusion: the configuration of the current system is overcomplicated.

  5. The Nagios-specific translators are now bundled with our components, and are distributed as part of our formal LCFG distribution. This seems completely inappropriate for something which is a very specific and optional add-on; in any case, it also seems at odds with point 3 above.

  6. The current system takes a non-standard approach to LCFG in several respects: it deals with machine profiles in a non-standard way; it doesn't use the standard templating mechanisms; and it doesn't respect LCFG naming conventions (cf. lcfg-nagios, but om nagios_server start). Some of these are trivial objections, but any variation from the standard approach implies future maintenance problems.

    The pull mechanism invented for checking for machine profile updates - where the Nagios server fetches the profiles of all monitored machines from the LCFG server every 90 seconds or so to test for changes - seems over the top. Better would be a push mechanism based on spanning maps.

    I have been told that the need for non-standard handling of machine profiles arises from deficiencies in the current implementation of LCFG spanning maps - namely, that 'spanning maps can't handle complex arrays'. If that's the case, then fine: don't try to span arrays! Rather, pass the necessary data some other way. The first step, of course, is to tie down the data we really do need to share. Again, I suspect that the potentially complex requirements of the lcfg-apacheconf component may have influenced the design thinking here.
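
To make item 4 above concrete: here is a minimal sketch, in Python, of how a bare boolean declaration could be mapped directly to a Nagios service stanza on the server, with no per-service 'translator' involved. The resource names, the service-to-check-command table, and the hostnames are illustrative assumptions for the purpose of the example, not a description of the current system's API.

  # Sketch only: map a machine's declared boolean resources straight to
  # Nagios 'define service' stanzas. The CHECKS table and the resource
  # format are assumptions made for the purpose of the example.

  CHECKS = {
      "ssh": "check_ssh",     # standard Nagios plugin
      "kdc": "check_tcp!88",  # hypothetical check for a KDC
  }

  def service_stanzas(hostname, resources):
      """Yield a Nagios service definition for each service which the
      machine has declared 'monitor me' for (resource value 'true')."""
      for service, check in CHECKS.items():
          if resources.get(service) == "true":
              yield ("define service {\n"
                     f"    host_name           {hostname}\n"
                     f"    service_description {service}\n"
                     f"    check_command       {check}\n"
                     "}\n")

  # A machine which has declared 'please monitor me for ssh':
  print("".join(service_stanzas("example.inf.ed.ac.uk", {"ssh": "true"})))

Note that the server-side logic here is a simple table lookup: the 'entitlement' checking which the current translators perform has no counterpart, for the reasons given in item 4.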

Initial ideas for a simpler system

  1. Parts of the current system - e.g. the Jabber reporting mechanism - should be retained. What we would like to simplify is the construction of the Nagios configuration files themselves: producing these files is what is currently done via the profile pull mechanism and the 'translators'.

  2. The Nagios configuration files are themselves unfortunately complex but, for our system, the current partitioning into a single file per machine (/etc/nagios/machines/<machinename>, which encapsulates that machine's monitoring requirements) seems fine. So the main thing to be worked out is the production of the files bound for the /etc/nagios/machines directory. That set of files will change as machines enter and leave the monitoring system, but most of the other configuration files will be much less dynamic.

  3. It should be enough to decree that all services on any one machine report to the same set of usernames/Jabber IDs/email addresses. It seems unnecessary to allow a finer-grained split.

  4. All of the above could be declared in a single 'live' header file which enumerates the machines to be monitored and the services of interest on each. That would work, but may be too crude an approach; better would be to use spanning maps. For a service as simple as ssh, it is enough that a machine declares 'monitor me for ssh' - that is, it is enough to span the single resource nagios_client.ssh, set to true.

    For a complete system, the set of resources spanned from the profile of any machine could comprise one value per potentially-monitored service (where a service will typically map onto a component), plus the targets of any alerts.

    That is:

      nagios_client.alerts-to
      nagios_client.<component-1>
      nagios_client.<component-2>
      ...
      nagios_client.<component-n>

    For many service configurations, a true or false setting will be enough, simply declaring 'monitor (or not) this service'. For more complex service configurations, both client and server will have to agree on the formats and data values to be embedded in the resources. The server will need logic to disentangle these requirements (a sketch of such logic follows this list), but there should be no need for logic in the client components: all values should be declarable in DICE-level header files.

    Example: nagios_client.apacheconf might contain http://website-1;http://website-2;...;https://website-n, meaning that the server should monitor all of the listed sites.

    Actual service monitoring configurations etc. can be configured in some less dynamic way on the server. These definitions and configurations don't change frequently: they should only need to be updated when we introduce a new potentially-monitorable service.

  5. Whether this simpler approach will work needs to be tested. It is likely that we will have to sacrifice some of the configurability which exists in the current system - but I wonder whether the current complex set of tests for, say, apache on cigar is actually necessary or useful. The test case services will probably be AFS and apache: if we can make a simple approach work for those, then other services should also be okay.
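
To illustrate items 2 and 4 above: a minimal sketch, in Python, of how the per-machine file in /etc/nagios/machines might be assembled on the server from the spanned nagios_client.* resources. The resource value formats, the check commands (check_website in particular), and the function names are assumptions made for the purpose of the example, not a description of any existing code.

  # Sketch only: assemble /etc/nagios/machines/<machinename> from the
  # resources spanned out of that machine's profile. Resource names,
  # value formats, and check commands are illustrative assumptions.

  def stanza(hostname, description, check, contacts):
      """One Nagios service definition."""
      return ("define service {\n"
              f"    host_name           {hostname}\n"
              f"    service_description {description}\n"
              f"    check_command       {check}\n"
              f"    contacts            {contacts}\n"
              "}\n")

  def machine_config(hostname, resources):
      """The text of the per-machine Nagios configuration file, given
      the spanned nagios_client.* resources for that machine."""
      contacts = resources.get("alerts-to", "")
      stanzas = []

      # Simple services: a value of 'true' just means 'monitor me'.
      if resources.get("ssh") == "true":
          stanzas.append(stanza(hostname, "ssh", "check_ssh", contacts))

      # Complex services embed structured data: for apacheconf, assume
      # a ';'-separated list of URLs, each of which is to be monitored.
      for url in filter(None, resources.get("apacheconf", "").split(";")):
          stanzas.append(stanza(hostname, f"web: {url}",
                                f"check_website!{url}", contacts))

      return "".join(stanzas)

  # Example: regenerate the file for one machine when its spanned
  # resources change (the push model suggested above).
  resources = {"alerts-to": "fred,jim",
               "ssh": "true",
               "apacheconf": "http://website-1;http://website-2"}
  with open("/etc/nagios/machines/example.inf.ed.ac.uk", "w") as fh:
      fh.write(machine_config("example.inf.ed.ac.uk", resources))

The point of the sketch is that all of the per-service intelligence lives in one place on the server, keyed off simple resource values; the clients declare, and the server assembles.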

-- IanDurkacz - 08 May 2012
