NoContact Service

One of the purposes of the remote probes infrastructure is to keep service managers informed of the state of their endpoints. For this we generate a so-called NoContact alarm, which can be based either on the remote probes or on heartbeats produced by the node.

Remote Probe Based Alarms

The pull model of this infrastructure (remote probe) can be configured to generate NoContact-style alarms for the different endpoints probed. Please refer to the specific documentation.

Heartbeat Based Alarms

The push model of this infrastructure (heartbeat) is used to provide the NoContact capability for the datacentre nodes. By default all Unix Puppetised nodes will register in the infrastructure and produce a periodic heartbeat to announce that the server is running (NoContact) and a second one to announce that the Monitoring stack of the server is running (NoMonitoring).

The infrastructure is configured so that if a server enters NoContact, its NoMonitoring alarms are inhibited and therefore do not raise an alert.
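The inhibition rule described above can be illustrated with a minimal sketch. This is not the infrastructure's actual implementation: the function names and the staleness threshold `STALE_AFTER` are assumptions made for illustration only, since the document does not state how long a heartbeat may be missing before an alarm is raised.

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumption: a heartbeat older than this counts as missing. The real
# threshold used by the infrastructure is not stated in this document.
STALE_AFTER = timedelta(minutes=30)


def is_stale(last_seen: Optional[datetime], now: datetime) -> bool:
    """Return True if no heartbeat was ever seen, or the last one is too old."""
    return last_seen is None or now - last_seen > STALE_AFTER


def active_alarm(last_contact: Optional[datetime],
                 last_monitoring: Optional[datetime],
                 now: datetime) -> Optional[str]:
    """Return the alarm to raise for a node, or None if both heartbeats are fresh."""
    if is_stale(last_contact, now):
        # NoContact takes precedence: any NoMonitoring alarm is inhibited.
        return "NoContact"
    if is_stale(last_monitoring, now):
        return "NoMonitoring"
    return None
```

With this ordering, a node whose two heartbeats have both stopped yields only a NoContact alarm, which matches the behaviour described in the text.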

NoContact alarms

NoContact alarms are a special type of alarm generated outside of the host, using heartbeats produced by the host. When heartbeats stop being produced, a NoContact alarm is generated. These alarms are evaluated every 10 minutes and create SNOW tickets using information from the Puppet facts below. If required, the default values of these facts can be overridden for NoContact alarms.

cerncollectd::nc_override::fename: "My new fe"
cerncollectd::nc_override::troubleshooting: "My troubleshooting URL"
cerncollectd::nc_override::snow_assignment_level: "3"
cerncollectd::nc_override::snow_grouping: "1"
cerncollectd::nc_override::egroup_name: "foo@cern.ch,bar@cern.ch"

All SNOW tickets created for NoContact alarms have the following additional features:

They are automatically closed when an OK notification is received (heartbeats are being produced again).
A periodic reminder is sent every 24 hours while the NoContact alarm is still active.
No ticket is created if the "roger_alarmed" flag is set to false (see the Roger documentation). In the case of NoContact, this flag is "nc_alarmed".
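The ticket behaviour listed above can be summarised as a small decision function. This is only a sketch under stated assumptions: `ticket_action` and its parameters are hypothetical names, and the case where a node is masked after a ticket has already been opened is not covered by the document, so it is treated here like any other open ticket.

```python
REMINDER_INTERVAL_HOURS = 24  # from the text: a reminder every 24 hours


def ticket_action(alarm_active: bool,
                  ticket_open: bool,
                  nc_alarmed: bool,
                  hours_since_last_notification: float) -> str:
    """Return one of "create", "close", "remind" or "none"."""
    if not alarm_active:
        # OK notification: heartbeats are back, so close any open ticket.
        return "close" if ticket_open else "none"
    if not ticket_open:
        # New alarm: only create a ticket when the nc_alarmed flag is true.
        return "create" if nc_alarmed else "none"
    if hours_since_last_notification >= REMINDER_INTERVAL_HOURS:
        return "remind"
    return "none"
```

For example, `ticket_action(True, False, False, 0)` returns "none", because the node is masked through "nc_alarmed".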

Why is my node in NoContact?

As mentioned before, NoContact is a special flow that relies on the heartbeat generated by the nc_heartbeat "service" (systemd) on Puppet-managed nodes, and on the "remote probes" service for Windows infrastructure.

If you are receiving NoContact alarms for a node managed with Puppet, here is a list of things to check before contacting MONIT.

Service is active: systemctl status nc_heartbeat already gives a good indication of the state.
Check /var/log/messages: the NC_heartbeat script logs there any issue it encounters while running.
Ping monit-remote.cern.ch: your node may have connectivity in general but still fail to reach the "NoContact" infrastructure.
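The checklist above could be scripted roughly as follows. This is a hedged sketch, not a MONIT-provided tool: the helper names are hypothetical, the use of `systemctl is-active --quiet` (instead of `systemctl status`) and `ping -c 3` is a choice made here to get a plain exit code, and matching log lines on the string "nc_heartbeat" assumes that the script identifies itself that way in /var/log/messages.

```python
import subprocess
from typing import Callable, Dict, Iterable, List, Sequence

# Commands derived from the checklist; the exact flags are assumptions.
CHECKS: Dict[str, List[str]] = {
    "nc_heartbeat service active": ["systemctl", "is-active", "--quiet", "nc_heartbeat"],
    "monit-remote.cern.ch reachable": ["ping", "-c", "3", "monit-remote.cern.ch"],
}


def _exit_code(cmd: Sequence[str]) -> int:
    """Run a command quietly and return its exit code."""
    return subprocess.run(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    ).returncode


def run_checks(run: Callable[[Sequence[str]], int] = _exit_code) -> Dict[str, bool]:
    """Run every check; a check passes when its command exits with 0."""
    results = {}
    for name, cmd in CHECKS.items():
        try:
            results[name] = run(cmd) == 0
        except OSError:
            # The command itself is missing (for example, no systemctl here).
            results[name] = False
    return results


def heartbeat_log_lines(lines: Iterable[str]) -> List[str]:
    """Keep only the log lines that mention nc_heartbeat (case-insensitive)."""
    return [line.rstrip("\n") for line in lines if "nc_heartbeat" in line.lower()]
```

On an affected node, `run_checks()` reports which of the two checks fail, and `heartbeat_log_lines(open("/var/log/messages"))` narrows the system log down to the heartbeat entries. Reading that file usually requires elevated privileges.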

For Windows nodes that are handled by the remote probes service, the alarm can be triggered by different factors, such as lack of connectivity or timeouts. You can find more details in the available dashboard.

If a Windows machine has been removed but is still generating NoContact alarms, please make sure it has been properly deleted from Foreman, as Foreman is the source of truth for our PuppetDB service discovery.

Prevent your node from registering for NoContact at all

If you have a very specific kind of workflow in which your nodes are expected to live for a very short time (a few hours), it makes sense not to register these nodes in the NoContact infrastructure at all, which avoids extra load on both sides.

Please note that this is different from masking your NoContact in Roger, as masking still allows the node to register, and "silent" NoContacts might be triggered until the cleanup on our side runs.

If this applies to your workflow, you can override the following variable in Hiera to disable the registration:

cerncollectd::enable_nc_registration: false

NoMonitoring alarms

NoMonitoring alarms, like NoContact ones, are a special type of alarm. They are generated using the Monitoring heartbeats produced by Collectd on the host. When Collectd stops producing heartbeats, or when it detects an issue with the MONIT agent, a NoMonitoring alarm is generated.

The default values for these alarms can be overridden using the plugin configuration.

cerncollectd_contrib::plugin::heartbeat::snow_functional_element: 'ignore'
...

As mentioned, there are two types of "NoMonitoring" alarms:

NoCollectd: Collectd is not able to send heartbeats. This may be caused by a stuck plugin or by the service not running at all. Please check the "collectd" service and the logs in "/var/log/collectd.log" to better understand the reason.

If you are receiving these new alarms and the reason is not clear, please contact the MONIT team via SNOW.

If you would just like to stop receiving these alarms, please set the "snow_functional_element" parameter to 'ignore' when configuring the plugin in Hiera.

To avoid receiving NoMonitoring alarms when draining or removing nodes, the best option is to mask the node in Roger by setting the "app_alarmed" flag to "false". Please note that this stops any alarm other than "Operating system", "Hardware" or "NoContact" ones from being raised as a SNOW ticket.