When the Customer’s applications are deployed, EMENCIA puts in place a set of probes that are specific to these applications. For example, when a Java application server is installed, these probes check that the number of processes and threads, as well as the memory and processor usage of the application server, fall within an acceptable range. For an SQL server, a specific probe regularly verifies that a connection to the server can be established and that a basic query completes within a reasonable timeframe.
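As an illustration only, an SQL probe of the kind described above could be sketched as follows. The function name `probe_sql`, the `SELECT 1` query and the two-second limit are assumptions made for this example, not EMENCIA's actual probe; any DB-API style connection factory can be passed in.

```python
import sqlite3
import time
from typing import Any, Callable, Tuple


def probe_sql(connect: Callable[[], Any],
              query: str = "SELECT 1",
              max_seconds: float = 2.0) -> Tuple[bool, str]:
    """Return (ok, detail). ok is True only if connecting and running
    `query` both succeed within `max_seconds` (hypothetical threshold)."""
    started = time.monotonic()
    try:
        conn = connect()
        try:
            cur = conn.cursor()
            cur.execute(query)
            cur.fetchone()
        finally:
            conn.close()
    except Exception as exc:
        # Any connection or query error counts as a failed probe.
        return False, f"connection or query failed: {exc}"
    elapsed = time.monotonic() - started
    if elapsed > max_seconds:
        return False, f"query too slow: {elapsed:.3f}s"
    return True, f"ok in {elapsed:.3f}s"


# Example run against an in-memory SQLite database (a stand-in for a real server).
ok, detail = probe_sql(lambda: sqlite3.connect(":memory:"))
```

In a real deployment the connection factory would point at the Customer's database server, and the result would be reported to the monitoring system rather than simply returned.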
At the time of implementation, EMENCIA deploys a set of probes that constantly check:
All application components are monitored in real time by the monit system, which warns of any impact on the service. If an incident is identified, an email is automatically sent to EMENCIA’s NOC and to the Customer (if requested). For some types of incident, monit may take corrective action to restore proper service operation. If resource consumption (processor or memory) exceeds critical thresholds, monit can stop or restart certain services in order to protect the main elements. For example, if one of the sites is considered secondary, it can be shut down when resources are saturated in order to maintain optimal service on the main sites.
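The load-shedding rule described above can be sketched in a few lines. This is not monit's configuration syntax or implementation; it is a minimal Python illustration of the decision, in which the `Service` class, the function name and the 90% thresholds are all assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Service:
    name: str
    secondary: bool       # True if the service may be stopped under pressure
    running: bool = True


def services_to_stop(services: List[Service],
                     cpu_percent: float,
                     mem_percent: float,
                     cpu_critical: float = 90.0,   # hypothetical threshold
                     mem_critical: float = 90.0    # hypothetical threshold
                     ) -> List[str]:
    """Return the names of running secondary services to stop when
    processor or memory usage exceeds its critical threshold."""
    if cpu_percent <= cpu_critical and mem_percent <= mem_critical:
        return []
    return [s.name for s in services if s.secondary and s.running]


sites = [Service("main-site", secondary=False),
         Service("archive-site", secondary=True)]
to_stop = services_to_stop(sites, cpu_percent=97.0, mem_percent=60.0)
```

Main services are never selected, so only sites declared as secondary are sacrificed when resources are saturated.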
In addition to these monitoring and self-healing features, monit offers a general overview of the services (see the screenshot below).
MUNIN is an open source system and network monitoring tool, released under the GNU General Public License, that uses RRDtool (a data logging and graphing system). Its framework is written in Perl.
It records everything it observes on the network and presents this information as graphs available via a web interface. A sample screenshot of the server’s weekly memory usage is shown below.
This tool allows for easy monitoring of the performance of the system, network and applications. It helps determine the time at which a performance issue arises.
Emencia offers the configuration of a PINGDOM probe on the site to be monitored, with a text (SMS) alert. Pingdom provides an application monitoring service and helps measure the availability of applications and services from a number of geographic points.
It offers:
Emencia deploys a proven open source monitoring solution, Supervisord, which is used by many companies and organizations. Our monitoring system is hosted on a dedicated server. This system allows for the monitoring of servers and related services.
The system that verifies the proper operation of a service or server can be configured to respond automatically when a problem occurs. Scripts can be used to trigger an action on a service, for example to automatically restart a defective web server.
In addition to the onboard monit service, Emencia offers monitoring with Nagios. Nagios is an application for system and network monitoring. It monitors the specified hosts and services, issuing alerts when systems malfunction and when they recover. It is free, GPL-licensed software.
This is a modular program that is broken down into three parts:
EMENCIA’s monitoring queries the Customer’s web platform every two to five minutes to ensure that it is operating properly. Once a problem is identified, a warning is recorded and the verification is then repeated every minute. After three unsuccessful verifications, an alert is sent by email and text message to the contacts the Customer has declared, as well as to EMENCIA’s technical team, which intervenes to find a solution.
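The check-and-retry cycle described above can be sketched as a small state machine. This is an illustrative sketch, not the actual Nagios configuration: the names are hypothetical, the normal interval is fixed at five minutes (the upper end of the stated range), and the first failed check is assumed to count as the first of the three unsuccessful verifications.

```python
from dataclasses import dataclass
from typing import Tuple

NORMAL_INTERVAL_S = 300   # assumption: upper end of "every two to five minutes"
RETRY_INTERVAL_S = 60     # after a failure, re-check every minute
MAX_FAILURES = 3          # alert after three unsuccessful verifications


@dataclass
class CheckState:
    failures: int = 0
    alerted: bool = False


def record_result(state: CheckState, ok: bool) -> Tuple[int, bool]:
    """Update `state` with one check result.

    Returns (seconds until the next check, whether to send an alert now)."""
    if ok:
        # Recovery: reset counters and return to the normal schedule.
        state.failures = 0
        state.alerted = False
        return NORMAL_INTERVAL_S, False
    state.failures += 1
    send_alert = state.failures >= MAX_FAILURES and not state.alerted
    if send_alert:
        state.alerted = True   # avoid repeating the alert on every retry
    return RETRY_INTERVAL_S, send_alert
```

A scheduler would call `record_result` after each query to the web platform, sleep for the returned interval, and send the email and text message when the second value is true.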
In the example above, this system allows for the hardware standby team (authorized to work on the network) to be alerted without unnecessarily involving the software standby team (authorized to work on the web server).
When the Nagios system identifies an anomaly affecting a critical service that has not been corrected by monit’s self-healing procedures, an alert is sent by email and text message to EMENCIA’s team (as well as to the Customer, to inform it of the change in status). EMENCIA’s technicians then initiate the actions required to restore proper operation of the systems. These actions may range from a simple manual restart of a service to the execution of more complex procedures.
Some interventions may impact services other than the one at fault. For example, if a disk is full, temporary files may need to be erased or a database may need to be cleaned. If this occurs, EMENCIA informs the Customer for its approval or additional instructions before going further.
If the technical team does not resolve the alert, an escalation is triggered to alert EMENCIA’s Level 2, as well as the emergency contacts designated by the Customer. The Customer may indicate, department by department, whether its teams should be contacted outside of normal working hours (so as to avoid needlessly mobilizing the Customer’s staff for a problem that is not urgent).
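The per-department contact rule can be illustrated with a short sketch. Everything named here is hypothetical: the `Department` structure, the `emencia-level2` recipient label, and the assumed working hours (Monday to Friday, 09:00 to 18:00) are placeholders chosen for the example.

```python
from dataclasses import dataclass
from datetime import datetime, time
from typing import List


@dataclass
class Department:
    name: str
    contact: str
    notify_out_of_hours: bool   # the Customer's choice for this department


def in_working_hours(now: datetime,
                     start: time = time(9, 0),    # assumed start of day
                     end: time = time(18, 0)      # assumed end of day
                     ) -> bool:
    """True on weekdays between `start` (inclusive) and `end` (exclusive)."""
    return now.weekday() < 5 and start <= now.time() < end


def escalation_recipients(departments: List[Department],
                          now: datetime) -> List[str]:
    """Level 2 is always alerted; a Customer department is added only
    during working hours or if it accepted out-of-hours contact."""
    recipients = ["emencia-level2"]
    working = in_working_hours(now)
    for dept in departments:
        if working or dept.notify_out_of_hours:
            recipients.append(dept.contact)
    return recipients
```

With this rule, a department that declined out-of-hours contact is simply skipped during nights and weekends, while EMENCIA's Level 2 is alerted in every case.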