Welcome to CERN IT Monitoring

Welcome to the CERN IT Monitoring Documentation site. Here, you'll find comprehensive guidance on integrating and using the services provided by the CERN IT Monitoring team. Our monitoring solutions are designed to support CERN's infrastructure and applications by offering metrics, logging, alerting, and visualization tools essential for maintaining system performance and operational reliability.

CERN IT Monitoring Services

Below you’ll find a list of the available services, each with a short description to help you decide whether it is the right one for you. If you have any doubts, do not hesitate to contact us via a SNOW ticket.

Please note that the minimum requirement for integrating with the MONIT infrastructure is to get a tenant.

Metrics

Metrics services enable detailed monitoring of system and application performance. You’ll find instructions here on setting up metrics collection, configuring data access, and integrating different metrics formats to suit your operational needs.

Currently we support four main entrypoints for metrics:

Puppet Managed Nodes (Also known as Puppet Monit Agent)
Kubernetes Clusters
Generic Metrics posted via API
Metrics Pulled from ActiveMQ

Once the metrics are within our infrastructure, we put them in a dedicated Kafka topic. This allows you to transform them as desired.

Logs

Our logging services collect and centralize log data from various CERN IT systems, allowing for comprehensive monitoring, troubleshooting, and auditing. This documentation covers the setup and integration of logs, data formats supported (e.g., OTLP, JSON), and best practices for log management.

Currently we support three main entrypoints for logs:

Once the logs are within our infrastructure, we put them in a dedicated Kafka topic. This allows you to transform them as desired.

Alerts

The CERN IT Monitoring team provides a range of alerting services to help you stay informed of critical events in real time. By using Collectd and Grafana alerts, you can set up custom notifications, configure alert thresholds, and receive timely insights to maintain system health and responsiveness.

Currently we support three main entrypoints for alerts:

Alerts can also be sent to different endpoints, such as:

Mattermost
Generic Notification Interface (GNI)
Email

Grafana (monit-grafana.cern.ch)

Grafana is our central visualization platform, where you can view and analyze data from various sources through interactive dashboards. Learn about access types, user roles, and how to build and manage dashboards to visualize key metrics and alerts. This section also covers integration with CERN’s e-groups for organizational access control.

Public interest organizations

Several organizations are hosted in monit-grafana. Here you will find quick links to some of the most visited ones:

MONIT: CERN-only accounts, general dashboards for host/service monitoring
OPEN: Set of dashboards to be shared publicly
WLCG: Dashboards for the WLCG collaboration
ATLAS: Dashboards for ATLAS experiment
CMS: Dashboards for CMS experiment
LHCb: Dashboards for LHCb experiment
ALICE: Dashboards for ALICE experiment

Service Availability Metrics (SLS)

Service availability is a special flow where service managers can report JSON-formatted metrics that will be added to the service availability endpoints.
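As a rough illustration of what reporting a JSON-formatted availability metric could involve, the sketch below builds such a document in Python. Every field name and status value here (`service_id`, `status`, `info`, `timestamp`, and the set of allowed statuses) is a hypothetical placeholder, not the schema MONIT expects; check the service availability documentation for the real format and for where to send it.

```python
import json
import time

# Hypothetical status values, for illustration only.
ALLOWED_STATUSES = {"available", "degraded", "unavailable"}


def build_availability_document(service_id, status, info="", now=None):
    """Return a JSON string describing the availability of one service.

    `now` is a Unix timestamp in seconds; it defaults to the current time.
    """
    if status not in ALLOWED_STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    timestamp_ms = int((time.time() if now is None else now) * 1000)
    document = {
        "service_id": service_id,   # hypothetical field name
        "status": status,           # hypothetical field name
        "info": info,               # free-text explanation of the status
        "timestamp": timestamp_ms,  # milliseconds since the Unix epoch
    }
    return json.dumps(document, sort_keys=True)
```

Validating the status before serializing keeps malformed documents from leaving the producer, which is usually easier to debug than a rejection on the receiving side.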

Remote Probes

Remote probes allow you to perform black-box testing against internal and external resources, both from the CERN internal network and from an external one. The workflow is based on Prometheus and the Blackbox exporter, which probes endpoints over a variety of protocols: HTTP, HTTPS, TCP, DNS, and ICMP. This is useful, for example, if you want to receive an alert when a website is not available or when a set of machines does not answer ICMP calls.
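To make the idea of a black-box check concrete, here is a minimal Python sketch of a TCP probe: it only observes whether a target accepts a connection from the outside, without any knowledge of the service internals. This is not how the MONIT remote probes are implemented (they rely on Prometheus and the Blackbox exporter); the function names `tcp_probe` and `failing_targets` are illustrative.

```python
import socket
import time


def tcp_probe(host, port, timeout=2.0):
    """Try to open a TCP connection; return (success, elapsed_seconds)."""
    start = time.monotonic()
    try:
        # The connection is closed again as soon as it is established.
        with socket.create_connection((host, port), timeout=timeout):
            success = True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        success = False
    return success, time.monotonic() - start


def failing_targets(targets, timeout=2.0):
    """Return the (host, port) pairs that did not accept a connection."""
    return [t for t in targets if not tcp_probe(t[0], t[1], timeout)[0]]
```

An alerting rule would then fire whenever `failing_targets` returns a non-empty list for several consecutive runs, which avoids notifications for a single transient failure.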

Service Performance Metrics

The IT Monitoring Service provides infrastructure to store and display the performance metrics of your services over long time intervals. This infrastructure has also been designed to reduce toil for Service Managers by pulling the required data from various data sources based only on configuration and data queries. All the collected data is stored in a central InfluxDB instance and can be accessed in Grafana.

WLCG Monitoring

WLCG (Worldwide LHC Computing Grid) Monitoring provides tools and resources for processing data from the LHC’s distributed computing infrastructure. This section covers supported Spark jobs for data processing, and guides for requesting new monitoring use cases to support WLCG operations.

How to get your monitoring tenant

To interact with the central monitoring infrastructure you need a tenant (user). Without one, only the centrally managed metrics and logs will reach us, and they will do so in a default, shared tenant with significant limitations.

We use the tenant to identify you as a valid producer of your monitoring data, and also for isolation across the full monitoring infrastructure.

There are two ways to start the creation of your tenant:

I want to send documents from Puppet managed nodes: In this case, the first step is for you to create a teigi service where we can put your credentials:
```
$ ai-pwn set service <toplevel_hostgroup>_monit --owners monit-support --hg <toplevel_hostgroup>
```
I don't want to send documents from Puppet managed nodes: In this case you might not be able to use a teigi service linked to a hostgroup, so we will provide the service for you.

In either case, once you are certain that you need a tenant, contact us via a SNOW ticket and provide the following information:

Tenant Name: The required tenant name
Teigi Service Name: The name of your teigi service, if you were able to create one
Prometheus TSDB stats: If you already have a Prometheus instance to forward metrics from, you can get these from its /tsdb-status endpoint
A responsible e-group: It will be used mainly for notifications and for sharing credentials if needed
An approximation of the expected daily data volume and rate
How you plan to access your data: HDFS files, Kafka stream, OpenSearch/Grafana dashboards, etc.
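The Prometheus TSDB stats and the expected data rate requested above can be derived together. The sketch below assumes a Prometheus version that exposes the `/api/v1/status/tsdb` HTTP API with a `headStats` section (the data behind the /tsdb-status page), and it assumes every series is scraped at the same interval, so treat the result as a rough estimate. The helper names `fetch_tsdb_status` and `summarize` are illustrative.

```python
import json
import urllib.request

SECONDS_PER_DAY = 86_400


def fetch_tsdb_status(base_url, timeout=10):
    """Fetch TSDB statistics from a Prometheus server's HTTP API."""
    url = base_url.rstrip("/") + "/api/v1/status/tsdb"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def summarize(tsdb_status, scrape_interval_s, top=5):
    """Estimate the sample rate from the number of active series.

    Assumes all series share one scrape interval (in seconds).
    """
    data = tsdb_status["data"]
    num_series = data["headStats"]["numSeries"]
    samples_per_second = num_series / scrape_interval_s
    # Metrics with the most series are usually the main volume drivers.
    by_metric = sorted(
        data.get("seriesCountByMetricName", []),
        key=lambda entry: entry["value"],
        reverse=True,
    )[:top]
    return {
        "num_series": num_series,
        "samples_per_second": samples_per_second,
        "samples_per_day": samples_per_second * SECONDS_PER_DAY,
        "top_metrics": [(e["name"], e["value"]) for e in by_metric],
    }
```

For example, `summarize(fetch_tsdb_status("http://localhost:9090"), scrape_interval_s=60)` would summarize a local Prometheus; the URL is a placeholder for your own instance.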

Once your tenant is ready, you will be notified and will be able to access your credentials using teigi:

$ tbag showkeys --service <service_name>