Centreon Engine 1.4 – Configuration reloading on the fly

After several months of development and testing, we are pleased to announce the official release of Centreon Engine 1.4.
You will finally be able to use the long-awaited feature: configuration reloading on the fly :-)

It looks simple (especially compared with all the software we use every day), but this feature was really hard to implement because the base product was not designed for it.

When we look at the “Nagios” ecosystem, we notice that only Centreon Engine offers such a feature.
Nagios, Icinga, Shinken, Naemon: none of these products has it! Only Naemon, the latest fork of Nagios, has announced that it is considering integrating it.

So, what’s so difficult about it?

1. Project History

Most of the projects mentioned above derive from Nagios, which emerged in the early 2000s out of the need to monitor small, critical technical perimeters. At that time, the number of monitored devices was not that large.
A scheduler of “check” commands was more than enough for the need. Systems did not change every day, and the “good” old hand-written configuration file was plenty.

Over the years, monitoring has become important for IT departments. It is no longer a mere convenience: it is now mandatory! IT departments are now service centers that tend to manage several hundred devices; on top of that, they have SLA obligations. As a result, taking measurements every minute or every 5 minutes has become common practice.

However, the configuration file is no longer suited to today’s environments: multiple people edit these files, and there are many configuration changes per day because IT is constantly changing to improve its operations and adapt to new demands…

But with its “Nagios” heritage, the engine was never designed to support a feature such as “on-the-fly reloading”.

Our developers ran into this wall as soon as they started development.

More than 135 days of development were necessary to achieve our goals, not counting all the tests we ran to guarantee stability; between 20,000 and 30,000 lines of code were changed or added. The developers also added unit tests that validate both the existing code and the newly implemented code, in order to improve quality.

2. Open source

As indicated above, this change was time-consuming for our teams.
Only a product such as Centreon, developed by a company, can make such an investment.
Deadlines and quality control are possible, but they are harder to manage in community projects because the developers are not dedicated to the project full time.
Imagine if we had to develop this feature only during very limited time periods, as in some open source projects (evenings, weekends, holidays, lunch breaks…): the development would probably not be finished by now. That is why this feature took 14 years to appear in this software (2000 -> 2014).
It certainly was not our top priority to develop this feature. However, over the last 2 or 3 years, and especially since we started talking about implementing it in Centreon Engine, it has become essential to all users.

3. The constraints of “Monitoring”

Working with configuration files: monitoring needs a high level of stability and, as we all know, a configuration file is not safe. We write the configuration file, test it and, once it is validated, apply the changes to the application. Unless the configuration file is modified, the engine’s behavior doesn’t change.

Some people did try to move things forward in the Nagios world. A few years ago, patches and Nagios modules were created to store the configuration in an LDAP tree.
That was convenient: when a host was added to the tree, the monitoring engine received the information and applied the modification right away.
However, the request to integrate the feature was rejected. The Nagios core developers explained that monitoring would go down if the LDAP server ever failed. And it is also Nagios’ job to monitor LDAP… a vicious circle.
Another proposal was submitted on the Nagios mailing list: storing the configuration in a MySQL database… which leads to the same problem.
Only a configuration file edited with vi (or emacs) is valid.

We understand that the Nagios developers want to limit the number of software dependencies in order to ensure the stability of Nagios. In these cases MySQL or LDAP would be a SPOF (single point of failure) and thus too risky.
As far as we know, a configuration file cannot literally crash while Nagios is running :)
Only users can cause problems with hand-edited configuration files, but then, there is not much we can do about that.

4. How does it work in Centreon Engine?

Simple: with configuration files!
OK, but what does version 1.4 bring?

The on-the-fly reload is triggered when a SIGHUP signal is received.
On receiving this signal, Centreon Engine knows that the configuration has changed. It then computes a “diff” between the new configuration and the one it holds in memory. Because memory access is quite fast, the modifications are applied quickly. During the process, monitoring is interrupted for 1 to 2 seconds, which is far better than the standard reload / restart commands, which could cause 20 to 25 minutes of service downtime on larger environments.
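
To make the idea more concrete, here is a minimal sketch in Python of a SIGHUP-triggered reload that only touches what differs. It is not Centreon Engine's code: the function names (on_sighup, diff_config, apply_reload) and the dictionary layout are assumptions made for this illustration only.

```python
import signal

# Flag set by the signal handler; a main loop would check it between checks.
reload_requested = False


def on_sighup(signum, frame):
    """Only record the request; the real work happens outside the handler."""
    global reload_requested
    reload_requested = True


def install_reload_handler():
    """Register the handler for SIGHUP (Unix only, main thread only)."""
    signal.signal(signal.SIGHUP, on_sighup)


def diff_config(running, new):
    """Compare two {object_name: settings} dicts (hypothetical layout).

    Returns (added, removed, changed) as sorted lists of object names.
    """
    added = sorted(new.keys() - running.keys())
    removed = sorted(running.keys() - new.keys())
    changed = sorted(
        name for name in running.keys() & new.keys()
        if running[name] != new[name]
    )
    return added, removed, changed


def apply_reload(running, new):
    """Update `running` in place, touching only the objects that differ."""
    added, removed, changed = diff_config(running, new)
    for name in removed:
        del running[name]
    for name in added + changed:
        running[name] = new[name]
    return added, removed, changed


# Illustrative data: one host modified, one removed, one added.
running = {"web01": {"check_interval": 5}, "db01": {"check_interval": 1}}
new = {"web01": {"check_interval": 1}, "mail01": {"check_interval": 5}}
print(apply_reload(running, new))
```

Objects that are identical in both configurations are never touched, which is why such an approach can stay within a second or two where a full restart has to rebuild everything.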

Obviously, you don’t need to send the signal yourself, as the init script does it for you (/etc/init.d/centengine reload). This will also be possible through the web interface in Centreon version 3.
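
For readers curious about what such a reload action boils down to, the sketch below sends SIGHUP to a daemon whose PID is stored in a file. This is only an illustration in Python: the request_reload helper and the PID file path are hypothetical, and the actual init script may locate the process differently.

```python
import os
import signal


def request_reload(pid_file, send=os.kill):
    """Read a daemon's PID from `pid_file` and send it SIGHUP.

    `send` defaults to os.kill; it is a parameter only so the function
    can be exercised without signalling a real process.
    """
    with open(pid_file) as handle:
        pid = int(handle.read().strip())
    send(pid, signal.SIGHUP)
    return pid


# Hypothetical usage; the path below is a placeholder, not a documented location:
# request_reload("/path/to/centengine.pid")
```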

5. What do we really gain from that?

Reloading on the fly brings quite a lot of advantages:

  1. No long interruption of Centreon Engine
    1. no interruption of monitoring service
    2. no interruption of passive data reception (centengine.cmd no longer becomes unavailable)
    3. no gap in event scheduling due to an interruption
  2. No need to recompute the scheduling queue after each restart
  3. No need to send all the in-memory data of Centreon Engine to the main poller through Centreon Broker. Only changed data is sent (for instance a threshold).
  4. No need to send the “initial_state” statuses after each restart, which prevents the log files from growing too much.
  5. You will no longer see the monitoring console go empty while the broker receives data after a restart. Operators will therefore not be interrupted in their tasks while using the monitoring console :-)

This version will soon be available for download from our website or directly from your Centreon yum repository. We hope that you appreciate our effort, as well as the overall benefit it brings to Centreon users.

Cheers!
