
# Puppet Managed Nodes Metrics

This section provides guidance on working with the metrics collection capabilities available on Puppet-managed nodes through the Monit Agent. You’ll learn how to get started with the default metrics, customize the agent to collect metrics from your own applications, and access the data in CERN's monitoring infrastructure.

## Getting started

The Monit Agent on Puppet-managed nodes is pre-configured to gather essential metrics out of the box, including CPU, memory, disk usage, and network statistics. This default setup helps you monitor the performance and health of your node without any additional configuration.

To start monitoring your Puppet-managed node:

  1. **Verify Installation**: The Monit Agent is installed by default on all Puppet-managed nodes. You can check whether it is active by running `systemctl status fluent-bit@monit-agent.service`, and you can find its configuration files in `/etc/fluent-bit/monit-agent/`.

  2. **Explore Default Metrics**: The agent uses the Prometheus Node Exporter to collect standard metrics. You can view these metrics locally through the Prometheus-style endpoint exposed by Node Exporter, typically available at `http://localhost:9100/metrics` on the node (a short sample of this output is shown after this list).

  3. **Central Monitoring**: All metrics are forwarded to CERN’s centralized monitoring endpoints, where you can set up dashboards, create alerts, and analyze trends across your infrastructure.

    An important aspect of the internal monitoring infrastructure is its multi-tenancy model: all data in the central monitoring system, both reads and writes, is organized by tenant. For Puppet-managed nodes, base metrics are assigned to a tenant corresponding to the node's top-level host group (although multiple host groups can be mapped to a single tenant if requested). In practice, this means you will need to open a SNOW ticket to the monitoring team to have your tenant enabled.
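As mentioned in step 2, the Node Exporter endpoint returns plain text in the Prometheus exposition format. The following trimmed sample is purely illustrative (the exact metric names, labels, and values will vary from node to node):

```
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 74532.12
node_cpu_seconds_total{cpu="0",mode="user"} 1234.56
# HELP node_memory_MemAvailable_bytes Memory information field MemAvailable_bytes.
# TYPE node_memory_MemAvailable_bytes gauge
node_memory_MemAvailable_bytes 8.123456e+09
```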

## How do I send metrics from my application?

If you want to monitor specific metrics from your own applications, you can customize the Monit Agent to include these in the data it sends to the centralized monitoring system. Here’s a general approach:

  1. **Expose Prometheus-Compatible Metrics**: Ensure your application exposes metrics in Prometheus format, which is the standard format used by the Monit Agent. Many application frameworks and libraries support this format natively (for example, `prometheus_client` in Python or `prometheus-exporter` in Java); a minimal Python sketch is shown after this list.

  2. **Configure the Monit Agent to Scrape your Endpoint**: Update Fluent Bit to scrape your application’s metrics endpoint. You can add a custom scrape configuration in Fluent Bit’s settings to collect data from the specific port where your application exposes its metrics. This flexibility lets you extend monitoring to application-level metrics beyond the default system metrics.

    While the examples below focus on scraping endpoints hosted on localhost, this configuration can also be used for remote endpoints. Simply replace localhost with the remote host's address or domain name.

    Here’s an example of configuring the Monit Agent via Hiera to scrape metrics from two different endpoints, both hosted on localhost. In this setup, the agent is configured to collect metrics from two separate ports, 8080 and 8081, each with its own scrape interval and path.

    ```yaml
    monitoring::monit_agent::user_metrics_scrape_targets:
      localhost:
        8080:
          path: /metrics
          interval: 30s
        8081:
          path: /v1/prom/metrics
          interval: 60s
    ```

    - **First example (8080)**: The Monit Agent will scrape metrics from `http://localhost:8080/metrics` every 30 seconds.

    - **Second example (8081)**: The Monit Agent will scrape metrics from `http://localhost:8081/v1/prom/metrics` every 60 seconds. Here, a different path (`/v1/prom/metrics`) and interval (60 seconds) are used.

    The Hiera wrapper is meant for simple integration of exposed Prometheus metrics. If you need advanced settings (such as adding or removing labels, or multiple scrape paths for the same endpoint:port), please use the provided pipeline wrapper instead.

    When running a high-availability setup for your application or services, it is essential to include labels that capture information about the cluster and replica. This enables accurate identification and correlation of metrics originating from different nodes in the cluster, plus deduplication in our storage backend. Here's an example of how to configure the Monit Agent to include cluster and replica labels:

    ```puppet
    monitoring::monit_agent::forwarders::fluentbit::pipeline { 'pipe-one':
      inputs                       => [
        {
          scrape_port         => 8080,
          scrape_endpoint     => 'my-cluster-node',
          scrape_interval     => '30s',
          scrape_metrics_path => '/metrics',
        },
      ],
      add_labels                   => {
        '__cluster__' => 'prod-cluster',
        '__replica__' => 'replica-1',
      },
      out_tenant_password_teigikey => 'teigi_key_with_your_tenant_password',
    }
    ```

  3. **Verify and Test**: After configuring the Monit Agent, check that your custom metrics are being collected. First, verify that the expected configuration appears on your nodes after a Puppet run: new inputs should show up in `/etc/fluent-bit/monit-agent/`. Then check Fluent Bit's logs (`/var/log/fluent-bit/monit-agent/fluent-bit.log`). Finally, you can check in the central Grafana instance (https://monit-grafana.cern.ch); more on how to connect in the next section.
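As referenced in step 1 above, here is a minimal sketch of an application exposing Prometheus-compatible metrics with the Python `prometheus_client` library. The port (8080), the metric names, and the update loop are purely illustrative assumptions; the only requirement is that the port and path match the scrape target configured for the Monit Agent.

```python
import random
import time

from prometheus_client import Counter, Gauge, start_http_server

# Illustrative metric names; adapt them to your application.
REQUESTS_TOTAL = Counter(
    'myapp_requests_total', 'Total number of requests handled by the application.'
)
QUEUE_SIZE = Gauge(
    'myapp_queue_size', 'Current number of items waiting in the work queue.'
)

if __name__ == '__main__':
    # Serve metrics in Prometheus format on http://localhost:8080/metrics,
    # matching the scrape target from the Hiera example above.
    start_http_server(8080)
    while True:
        # Simulated work: in a real service these would be updated from
        # your request handlers or worker loops.
        REQUESTS_TOTAL.inc()
        QUEUE_SIZE.set(random.randint(0, 10))
        time.sleep(5)
```

Running a script like this and pointing the Monit Agent at port 8080, as in the Hiera example above, should be enough for the `myapp_*` series to start flowing to your tenant.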

## How can I see my metrics?

Once the Monit Agent has been set up on your node, all collected metrics are forwarded to CERN's centralized monitoring infrastructure, where they are stored and can be visualized. Currently, all metrics sent are stored within a tenant in Mimir (our backend for Prometheus metrics). To visualize your metrics, please refer to our Grafana documentation. The steps you will need to follow are:

- **Access Grafana**: Refer to the [Grafana Access section](/grafana/accessing) of the docs.
- **Create a Data Source for your metrics**: Refer to the [Grafana Data Source](/grafana/organizations) section of the docs. You will need the following data:
    - **Type**: Prometheus Data Source
    - **URL**: `https://monit-prom-lts.cern.ch/prometheus`
    - **Auth**: Basic
    - **Username**: your tenant name
    - **Password**: your tenant password
    - **Alerting**: Uncheck the "Manage alerts via Alerting UI" option (if it is checked).
- **Create a dashboard to visualize your metrics**: After configuring your data source in Grafana, the next step is to build a dashboard to visualize your metrics. Grafana allows you to create highly customizable dashboards by adding panels for different types of visualizations, such as time series graphs, gauges, and tables. Each panel can display specific metrics or sets of metrics, enabling you to monitor trends, set thresholds, and gain insights into your system’s performance.

    At the moment metrics are stored in a Prometheus-like backend, which means you can query the data using PromQL. More information about how to use PromQL is available [here](https://prometheus.io/docs/prometheus/latest/querying/basics/). Here’s a simple example of a query you might use in a Grafana panel to monitor CPU usage metrics collected by the Monit Agent (assuming a Prometheus data source):

    ```promql
    avg(rate(node_cpu_seconds_total{mode="user"}[5m])) * 100
    ```

    In this example:
    - `node_cpu_seconds_total`: the metric representing the total time the CPUs spend in the various modes (e.g., user, system, idle).
    - `{mode="user"}`: filters the metric to show only the CPU time spent in user mode. Each metric also carries additional labels you can explore, such as the submitter environment or availability zone.
    - `rate(...[5m])`: calculates the rate of CPU usage over a 5-minute interval, giving a smooth average to reduce noise.
    - `avg(...) * 100`: takes the average of the rate across all CPU cores and converts it to a percentage.

    This query would produce a graph panel in Grafana showing the average CPU usage in user mode over time, making it easier to track system load and identify any CPU-related performance issues. Once created, you can add additional panels to the same dashboard for other metrics, like memory usage or network traffic, providing a comprehensive view of your system’s health.
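
    As an example of such an additional panel, a query along the following lines (again assuming the standard Node Exporter metrics collected by the Monit Agent) would show the percentage of memory currently in use:

    ```promql
    (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
    ```

    Here `node_memory_MemTotal_bytes` and `node_memory_MemAvailable_bytes` are standard Node Exporter memory metrics; one minus their ratio gives the fraction of memory in use, multiplied by 100 to express it as a percentage.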