The long road out of Nagios – Part 1: A retrospective roadmap of Centreon Engine

Centreon’s Coming of Age

Centreon’s early history is closely tied to Nagios. Indeed, Centreon Web started out as a way to make Nagios less complex and easier to use while keeping the powerful flexibility that made it an industry favorite.

Centreon has grown up considerably since then, and while it has remained true to its roots, we found that we needed to make substantial changes to Nagios ourselves. This need led to the creation of Centreon Engine, the subject of this article.

Most users will never see this core software. Today, Centreon’s software suite, known as CES, is a complete monitoring platform built on three main bricks: Centreon Web, Centreon Engine and Centreon Broker, together with all the tools needed to monitor any kind of IT system. CES has already been downloaded and used by thousands and has proven its worth. The engine’s relative obscurity can lead to a lack of interest, but just as an engine is essential to a car, Centreon Engine is essential to CES.

The best way to understand this is to begin with the granddaddy of Centreon Engine: Nagios.

Nagios

What is Nagios? Answering this question would need far more than a single article and take us far from our subject. For the sake of brevity, we will give a short description of this software.

Nagios is an open-source monitoring engine. It is generally considered the granddaddy of all open-source monitoring solutions, for good reason: its first release dates back to 1999! It has become the de facto standard for many enterprises. Although it has changed since its inception, it is best described as centralized, active, pre-configured monitoring software 1.

Nagios is based on the idea of a small centralized daemon configured to actively check the availability of objects. Those checks range from a simple ping to increasingly complex commands. Their results confirm that a remote server is reachable, that it has enough disk space, that its network is working, etc.
Nagios alone only centralizes the results of those commands. It generally passes that information on to a second piece of software, called a broker, which then saves it in a database for further use.

Nagios has proven its usefulness and has become a tried-and-true solution with a large user base, from small businesses to big enterprises. It was the natural choice of monitoring engine for Centreon when it was first created.

Unfortunately, as time passed, we discovered more and more pitfalls in its design. Those shortcomings weren’t important when it was first conceived, but they became an obstacle in an increasingly complex world of distributed networks, high redundancy, QoS, and cloud computing. Those issues were the primary motivation for creating Centreon Engine, a fork of Nagios that has diverged considerably from its parent.

Centreon Engine

So why improve Nagios? For too many reasons to list here. The best insight into this question comes from reviewing some metrics on the changes made since the fork, measured on the latest stable branch (1.5).2

Total number of files: 1,057
Total number of lines: 142,063
Total number of commits: 1,499

+---------------------+---------+---------+-------+--------------------+
| name                | loc     | commits | files | distribution       |
+---------------------+---------+---------+-------+--------------------+
| Dorian Guillois     | 114,991 | 743     | 596   | 80.9 / 49.6 / 56.4 |
| Matthieu Kermagoret | 26,221  | 621     | 393   | 18.5 / 41.4 / 37.2 |
| Alexandre Fouillé   | 538     | 49      | 35    | 0.4 / 3.3 / 3.3    |
| Alexandre Fouille   | 192     | 4       | 12    | 0.1 / 0.3 / 1.1    |
| ageric              | 104     | 67      | 9     | 0.1 / 4.5 / 0.9    |
| Antoine Nguyen      | 16      | 3       | 2     | 0.0 / 0.2 / 0.2    |
| tonvoon             | 1       | 8       | 1     | 0.0 / 0.5 / 0.1    |
| Michael Friedrich   | 0       | 1       | 0     | 0.0 / 0.1 / 0.0    |
| egalstad            | 0       | 2       | 0     | 0.0 / 0.1 / 0.0    |
| root                | 0       | 1       | 0     | 0.0 / 0.1 / 0.0    |
+---------------------+---------+---------+-------+--------------------+

That is a lot of changes! They include code refinement, new features, bug fixes, and performance and scalability improvements. But before detailing some of them, here are the stats of the unstable branch for the next centreon-engine milestone (2.0):

Total number of files: 762
Total number of lines: 105,734
Total number of commits: 1,659

+---------------------+---------+---------+-------+--------------------+
| name                | loc     | commits | files | distribution       |
+---------------------+---------+---------+-------+--------------------+
| Dorian Guillois     | 76,875  | 743     | 660   | 72.7 / 44.8 / 86.6 |
| Matthieu Kermagoret | 27,810  | 716     | 535   | 26.3 / 43.2 / 70.2 |
| Alexandre Fouillé   | 739     | 114     | 78    | 0.7 / 6.9 / 10.2   |
| Alexandre Fouille   | 186     | 3       | 11    | 0.2 / 0.2 / 1.4    |
| Antoine Nguyen      | 67      | 3       | 3     | 0.1 / 0.2 / 0.4    |
| ageric              | 56      | 67      | 9     | 0.1 / 4.0 / 1.2    |
| tonvoon             | 1       | 8       | 1     | 0.0 / 0.5 / 0.1    |
| afouille            | 0       | 1       | 0     | 0.0 / 0.1 / 0.0    |
| Michael Friedrich   | 0       | 1       | 0     | 0.0 / 0.1 / 0.0    |
| egalstad            | 0       | 2       | 0     | 0.0 / 0.1 / 0.0    |
| root                | 0       | 1       | 0     | 0.0 / 0.1 / 0.0    |
+---------------------+---------+---------+-------+--------------------+

Did we lose about 40,000 lines of code from one version to the next? No, in fact we lost more! These statistics only cover lines that were added in Centreon Engine, so the difference between the two totals only reflects Centreon Engine lines that were later removed. A lot of Nagios lines were removed in the subsequent version as well. A simple line count of the whole Centreon Engine/Nagios codebase gives 191,639 lines for branch 1.5 and 110,979 lines for branch 2.0.

We lost more than 80 000 lines of code between this stable version and the next, unstable version! What happened? For the answer to this question, we first need to re-trace our steps.

Centreon Engine Then: Better, Faster, Stronger.

One of the primary goals of the fork was to provide high, scalable performance for all kinds of loads. Here at Centreon we typically manage tens of thousands of different services on a single server, and hundreds of thousands across several machines is not unheard of. Managing those loads would have been impossible if Centreon Engine had not been almost entirely rewritten to offer strong performance from the start.

This is not as easy as it seems. Optimizing any piece of software is a difficult road, filled with premature optimizations and other traps for the unwary developer. This is why we took our time to identify every point in Nagios where performance degraded badly and needed fixing. To give you an idea, here are three examples of performance improvements, although there were many more.

Macro

Centreon Engine, like Nagios, offers a huge number of macros (variables replaced at run time) to customize a command. Traditionally, macro replacement was a serious performance bottleneck. Several mechanisms were devised to make it as fast as possible: an aggressive caching strategy, pre-compilation of the macro hashtable, and no memory allocations in the main replacement path. The result: time spent in macro replacement is now negligible compared to the rest.
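To make the hashtable idea concrete, here is a minimal Python sketch. It is not Centreon Engine’s code: the macro table, its values and the expand() helper are assumptions made for illustration, and Python cannot show the allocation-free part. It only shows how a table built once at configuration time turns each $NAME$ replacement into a single lookup.

```python
import re

# Hypothetical macro table; in a real engine the names and values would be
# derived from the monitoring configuration.
MACROS = {
    "HOSTNAME": "web01",
    "HOSTADDRESS": "192.0.2.10",
    "USER1": "/usr/lib/nagios/plugins",
}

# The pattern is compiled once, not on every check execution.
MACRO_PATTERN = re.compile(r"\$([A-Z0-9_]+)\$")


def expand(command_line, macros=MACROS):
    """Replace every $NAME$ token with its value using one hash lookup each."""
    def lookup(match):
        # Unknown macros are left untouched (a choice made for this sketch).
        return macros.get(match.group(1), match.group(0))
    return MACRO_PATTERN.sub(lookup, command_line)


print(expand("$USER1$/check_ping -H $HOSTADDRESS$"))
```

The command line is scanned once, and each macro costs a dictionary lookup rather than a search through a list of known macros.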

In-memory check results

Nagios’ check results were written to a directory to be read later by the scheduler. Those unnecessary round trips to the filesystem incurred significant overhead. One of the first tasks of Centreon Engine was to ensure check results were processed in memory by the daemon, speeding up check result processing by an order of magnitude.
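The difference can be pictured with a short Python sketch. The CheckResult fields and the two function names are hypothetical and do not come from Centreon Engine; the point is only that a finished check hands its result to a queue held in memory instead of writing a file that the scheduler has to find and parse later.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class CheckResult:
    # Hypothetical fields, chosen for illustration only.
    host: str
    service: str
    return_code: int
    output: str


# In-memory queue standing in for the spool directory.
pending_results = deque()


def on_check_finished(result):
    """Called when a check completes: no file is written."""
    pending_results.append(result)


def reap_results():
    """Called by the scheduler: drain everything that has accumulated."""
    reaped = []
    while pending_results:
        reaped.append(pending_results.popleft())
    return reaped


on_check_finished(CheckResult("web01", "ping", 0, "PING OK"))
for result in reap_results():
    print(result.host, result.service, result.return_code)
```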

Vfork

This is a relatively complicated subject, so bear with me or jump directly to the next point.
On Unix, a new process (typically one that executes a new command) is created with the fork() system call. The fork() system call copies all the memory of the current process, effectively cloning it. The new process then executes the new command.

Centreon Engine can make tens of thousands of fork() system calls every second. Historically, fork() has been a bottleneck in Unix, as copying memory is inherently slow. Fortunately, Linux implements a copy-on-write mechanism for fork() that, through some computer magic3, never copies the memory until it is really needed. This should be perfect for us, right?

In fact, no. The copy-on-write mechanism is still relatively slow. Even if the memory itself is not copied, the page tables still need to be copied, along with some kernel structures. Centreon’s team found that even with copy-on-write, a large amount of time was spent in fork() under high load. Process spawning was therefore entirely rewritten to use the more difficult but faster vfork()4.

The sorry state of the restart

A recurring complaint concerned restarting Nagios. Users with big configurations reported that starting or restarting Nagios could take ten minutes or more while the new configuration was processed. This was as bad as you can imagine5.

A serious investigation of the daemon was carried out and several problems were discovered.

The first is a classic algorithmic complexity problem. New checks were added in a way that scaled very poorly6: a large part of the startup sequence was spent scheduling several thousand checks. The algorithm used to schedule new checks was therefore changed to behave more sanely7 when dealing with large batches.

The second problem was linked to the broker used by Nagios at the time. At each startup, this broker made unoptimized reads and writes to a database, which slowed Nagios down. This, and many other problems, motivated the creation of Centreon’s own broker, simply named Centreon Broker. As it is a central piece of Centreon’s distributed architecture, we won’t discuss Centreon Broker here. It will be the subject of a future article.

With those tweaks and many others, we were finally able to reduce a start/restart sequence to a few seconds at worst. Unfortunately, even this was found to be unacceptable for some loads. There was a demand for a true ‘live’ configuration reload, without stopping the monitoring for even a second.
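The article does not say which algorithm was used before or after the change, so the following Python sketch is only one way the jump from O(n²) to O(n) mentioned in footnotes 6 and 7 can arise. Both function names and the (time, name) tuples are assumptions made for illustration. Inserting checks one at a time into a sorted list costs O(n) per insertion, hence O(n²) for a batch, whereas building a heap from the whole batch at once is O(n).

```python
import bisect
import heapq


def schedule_one_by_one(checks):
    """Insert each check into a sorted list: O(n^2) for a batch of n checks."""
    queue = []
    for check in checks:
        # Each insertion may shift every element already in the list.
        bisect.insort(queue, check)
    return queue


def schedule_in_bulk(checks):
    """Build a min-heap from the whole batch in a single O(n) pass."""
    queue = list(checks)
    heapq.heapify(queue)
    return queue


# (next run time in seconds, check name): hypothetical sample data.
checks = [(30, "disk"), (10, "ping"), (20, "http")]

print(schedule_one_by_one(checks)[0])
print(schedule_in_bulk(checks)[0])
```

In both cases the first element of the queue is the next check due, but only the bulk version stays cheap when several thousand checks are loaded at startup.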

This was the major feature of the 1.4 branch. With it, we could finally enjoy seamless, instantaneous configuration loading.

Compatibility at all costs

Unfortunately, while implementing new features, we found ourselves bumping more and more against Nagios’ numerous limitations. One of the decisions made at the start was to keep binary compatibility with Nagios’ modules and brokers. This was sound when Centreon Engine was still a new player, unsure of its place in Centreon’s ecosystem, but that is no longer the case today.

Binary compatibility greatly restricts the modifications we can make to the daemon. For instance, the live reloading was a wonderful piece of code that seemed, at times, more like mental gymnastics than software development.

Nagios’ codebase is old. There are only so many improvements that can be made without breaking compatibility. Additionally, several paradigm shifts occurred in Centreon and in the monitoring world in general. Moreover, the advent of Centreon Broker allowed for high-performance distributed computing with top-level correlation, relieving Centreon Engine of a lot of its traditional duties.

Clearly, Centreon Engine needed to evolve. In my next article, I will share what this meant for us.

Alexandre Fouillé

---------------------------------------------------------------

1 Readers interested in the subject can read its history.
2 Courtesy of git-fame, a wonderful little tool.
3 Not really.
4 vfork() does not copy memory; it assumes the new process will behave itself for a moment while sharing the memory of its parent.
5 Worse, honestly.
6 O(n²) to be precise.
7 O(n).
8 Statuses are unimportant; metrics are what matter. But this goes far beyond the scope of this article.
