As announced in the June newsletter, we are going back to basics with a series of articles about installing and customizing your Centreon platform.
You have just completed the installation of your monitoring platform with Centreon Enterprise Server via the official online documentation and now you are ready to configure it. This guide will explain how to quickly get your system up and running.
We will go over all the objects and their respective parameters. The aim of this article is to familiarize you with the concepts of monitoring configuration with Centreon.
Before we start with the actual configuration, let’s review some basic definitions:
- Host: a host is any piece of IT equipment with an IP address. For example, a host can be a server, a firewall, a router, a temperature sensor, …
- Service: a service is a point of measurement (indicator) attached to a host. It could be anything, for instance:
– CPU load
– memory usage
– state of a process on a server
– bandwidth usage of an interface
– keyword parsing on Syslog
– SNMP trap event
- Plugin: a plugin is a binary executable or a script that can be run from the command line (shell) with arguments, in order to retrieve the value of an indicator. The output of the plugin provides the value of the service, and its return code determines the associated status.
- Command: a command can be of several types: “check command”, “notification command” or “processing command”. It defines the path to a binary or script, together with the various arguments used to run it.
- Contact: the definition of a user who can be attached to hosts and/or services in order to be notified of their status changes.
- Group: a group gathers objects such as hosts, services or contacts to make them easier to manage.
- Status: the status of a host or a service is determined by the return value of the monitoring plugin. For a host, the possible statuses are:
- UP: the equipment responds on the network (e.g. it answers an ICMP request);
- DOWN: the equipment does not respond on the network;
- UNREACHABLE: the status of the equipment cannot be determined (it sits behind an equipment that is “DOWN”);
- PENDING: the equipment has just been configured but has not yet been checked.
For a service, the possible statuses are:
- OK: the value polled by the monitoring plugin is nominal;
- WARNING: the polled value exceeds the first threshold (the warning threshold);
- CRITICAL: the polled value exceeds the second threshold (the critical threshold);
- UNKNOWN: the polling failed and the status of the indicator cannot be determined, most often because of a plugin execution problem (incorrect or missing arguments);
- PENDING: the indicator has just been configured but has not yet been checked;
- State: the state defines whether a status is confirmed (“HARD”) or not (“SOFT”). This applies to both monitoring indicators (services) and devices (hosts). Note that only confirmed (“HARD”) statuses trigger notifications (e-mails, SMS, etc.) to contacts;
- Protocol: network protocol (ICMP, SNMP, SSH …) that is used by a plugin to poll the value of an indicator. Example: percentage of CPU usage is polled via SNMP protocol.
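To make the relationship between plugins and statuses concrete, here is a minimal sketch in shell of the return-code convention that Nagios-style plugins (and therefore Centreon) rely on. This is an illustrative toy function, not a real Centreon plugin; the threshold values and the function name are made up for the example.

```shell
#!/bin/sh
# Illustrative sketch of the Nagios plugin status convention used by Centreon:
# return code 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.

# check_threshold VALUE WARN CRIT
# Prints the status name and returns the matching exit code.
check_threshold() {
  value=$1; warn=$2; crit=$3
  if [ -z "$value" ] || [ -z "$warn" ] || [ -z "$crit" ]; then
    echo "UNKNOWN"; return 3        # plugin execution problem: missing arguments
  elif [ "$value" -ge "$crit" ]; then
    echo "CRITICAL"; return 2
  elif [ "$value" -ge "$warn" ]; then
    echo "WARNING"; return 1
  else
    echo "OK"; return 0
  fi
}

# Demo: the same thresholds, three different polled values.
check_threshold 42 80 90           # prints OK
check_threshold 85 80 90 || true   # prints WARNING (non-zero return ignored here)
check_threshold 91 80 90 || true   # prints CRITICAL
```

The scheduler does not read the text a plugin prints; it reads the return code to set the status, while the text is what you see in the web interface.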
Now that we have the basic vocabulary in place, let us look at a few key preparation steps.
1. Take an inventory of your IT infrastructure.
Before starting with the configuration of your hosts and services, it is necessary to determine the scope of your inventory.
Will you monitor all or only part of your IT infrastructure? What equipment is present? Is it redundant? Which operating systems run on your servers? What applications do they host?
To answer these questions, it is necessary to create a document referencing the manufacturer / model / version [/ OS version] / IP or DNS address of each of these objects.
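As an illustration, such an inventory can be a simple spreadsheet or CSV file; every name and address below is made up for the example:

```
manufacturer,model,version,os_version,address
Dell,PowerEdge R620,-,CentOS 6.4,srv-web-002.example.com
Cisco,Catalyst 2960,IOS 15.0,-,192.168.1.254
APC,Smart-UPS 3000,-,-,192.168.1.10
```

Keeping this document up to date (or extracting it from a CMDB, as mentioned below) will save a lot of time when you start declaring hosts.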
2. Identify indicators to monitor.
Once your inventory is defined or extracted from a CMDB, it is time to determine which indicators are to be polled. These indicators are classified according to different profiles:
- Equipment: reference all indicators describing the health of the hardware (server status, temperature, fan status, RAID controller status, physical disk integrity, …);
- Operating system: reference all the main indicators, such as:
- Percentage of CPU usage;
- Percentage of RAM and virtual memory usage (swap, swap file, …);
- Percentage of partition/volume usage;
- Percentage of network interface usage;
- Status of the main system processes (event log status, …);
- Applicative: reference the hosted applications and processes (status of the httpd service, access to the URL of a web application, connection to a database instance, …).
Note: choosing the right set of indicators is tricky. Defining too many indicators can drown the end user in data, whereas too few may let a real problem go undetected. Review the monitoring configuration periodically to remove unnecessary indicators and add new ones that would catch problems that previously went unnoticed.
3. Define the thresholds
Your list of equipment and related indicators is now complete. The next step is to define the alert thresholds.
For this, two questions are relevant: “Above what value does a polled indicator require an intervention to prevent a potential problem (proactive approach)?” and “Above what value is the indicator considered critical, with the service considered unavailable?”.
Here is an example: CPU usage on the srv-web-002 server is at 91%, and users are experiencing latency when accessing some of the applications hosted on this web server. The application service is degraded but not down. How should the thresholds be set? After several load tests, latency appears to occur above 85% of CPU usage. The thresholds could therefore be: “WARNING” if usage is above 80% (a small safety margin) and “CRITICAL” if usage is above 90% (again a small margin, as waiting until 95% could leave too little time to react).
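Applied to the srv-web-002 example, the classification logic looks like this (a sketch: the values are taken from the example above, and the strictly-greater-than comparison mirrors the way plugins typically handle their warning/critical arguments):

```shell
#!/bin/sh
# Classify the srv-web-002 reading (91% CPU) against the chosen thresholds:
# WARNING above 80%, CRITICAL above 90%.
USAGE=91
WARN=80
CRIT=90

if [ "$USAGE" -gt "$CRIT" ]; then
  STATUS=CRITICAL
elif [ "$USAGE" -gt "$WARN" ]; then
  STATUS=WARNING
else
  STATUS=OK
fi

echo "CPU $STATUS: ${USAGE}% used (warning > ${WARN}%, critical > ${CRIT}%)"
# prints: CPU CRITICAL: 91% used (warning > 80%, critical > 90%)
```

In practice you will not write this yourself: you will pass the two thresholds as arguments to the check command, and the plugin will apply exactly this kind of comparison.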
4. In case of failure, who to inform and how?
You have just set the alert thresholds for your indicators; however, your inventory is not yet complete.
Who should be notified when an incident or malfunction is detected? Are they the same people regardless of the criticality of the problem? Regardless of the indicator?
Most likely, the answer is “no”. If your GNU/Linux server hosts a database server, there is a good chance that hardware and system problems should be handled by the system team, and database problems by the DBA team. Other questions are also crucial: “What will be the means of notification: e-mail? SMS? Something else?” and “During what time period?”.
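To give a concrete idea of where these answers end up, here is a sketch of the kind of Nagios-style contact definition that Centreon generates from its web interface. The names (the dba-team contact, the e-mail address, the workhours time period) are hypothetical examples, not values from your platform; the notification commands shown are the classic Nagios e-mail samples.

```
define contact {
    contact_name                   dba-team
    alias                          Database administrators
    email                          dba-team@example.com
    ; notify during working hours only (a "workhours" time period must exist)
    host_notification_period       workhours
    service_notification_period    workhours
    ; which status changes trigger a notification:
    ; d=down, u=unreachable, r=recovery / w=warning, c=critical
    host_notification_options      d,u,r
    service_notification_options   w,c,r
    ; how to notify (here, by e-mail)
    host_notification_commands     notify-host-by-email
    service_notification_commands  notify-service-by-email
}
```

You will never edit this file by hand with Centreon; the web interface exposes the same fields and exports them for you.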
That’s all for today.
Stay tuned. Next week: configuration!