Nagios Monitoring at NT1

This page describes Nagios monitoring of NeIC Tier 1 resources. We also provide monitoring for other Nordic grid resources. The page is aimed at site operators of the respective resources.

Monitored Storage Services

dCache Head Nodes

On head nodes, we monitor the basic health of the system, including:

  • Disk usage, CPU load, memory, and process count via Ganglia.
  • A probe to check the RAID health.
  • Availability of the SSH port.
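
As an illustration of how a reading such as disk usage or CPU load turns into a Nagios state, here is a minimal sketch of a threshold check. It relies only on the standard Nagios plugin exit codes; the function name check_threshold, the label, and the thresholds are hypothetical and are not taken from the probes deployed here.

```shell
# Map a numeric reading to a Nagios plugin status.
# Nagios plugin exit codes: 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.
# Usage: check_threshold LABEL VALUE WARN CRIT   (non-negative integers)
# Hypothetical helper for illustration only.
check_threshold() {
    ct_label=$1 ct_value=$2 ct_warn=$3 ct_crit=$4
    # A missing or non-numeric reading cannot be classified.
    case "$ct_value" in
        ''|*[!0-9]*)
            echo "UNKNOWN - $ct_label: no numeric reading"
            return 3 ;;
    esac
    if [ "$ct_value" -ge "$ct_crit" ]; then
        echo "CRITICAL - $ct_label is $ct_value (critical at $ct_crit)"
        return 2
    elif [ "$ct_value" -ge "$ct_warn" ]; then
        echo "WARNING - $ct_label is $ct_value (warning at $ct_warn)"
        return 1
    fi
    echo "OK - $ct_label is $ct_value"
    return 0
}
```

For example, check_threshold disk_used_percent 95 80 90 prints a CRITICAL line and returns 2.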

dCache Pools

dCache provides information about pools under https://chaperon.ndgf.org:2288/. The probe uses an internally accessible version of the /info sub-URL, which contains machine-readable information that is parsed by the "gather dCache" service on localhost. That service publishes the information to the dCache Pool Services.

In addition, read and write availability of each site is reported as services on localhost, for lack of a better place.

dCache Pool Groups

Monitored Compute Services

ARC Job Submission

The list of CEs currently tested can be seen in the ARC CEs host group on chaperon.

For each ARC test we run on a host, there is an active probe to submit jobs and a passive probe to report the results. There are three such pairs: one for testing plain job submission and retrieval, and two for testing jobs with data staging:

  • ARCCE Job Submit (active) and ARCCE Job Result (passive)
  • ARCCE SRM and GridFTP Submit (active) and ARCCE SRM and GridFTP Result (passive)
  • ARCCE LFC Submit (active) and ARCCE LFC Result (passive)

Fetching the jobs and reporting to the passive services is done by a single service, ARCCE Monitor, associated with localhost.
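
For readers unfamiliar with passive services: Nagios accepts passive results as PROCESS_SERVICE_CHECK_RESULT lines written to its external command file. The sketch below only shows that generic line format. The helper name passive_result, the host ce.example.org, and the command-file path are hypothetical, and this is not necessarily how ARCCE Monitor itself is implemented.

```shell
# Format one passive service check result as a Nagios external command.
# Usage: passive_result TIMESTAMP HOST SERVICE RETURN_CODE OUTPUT
# RETURN_CODE follows the plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).
# Hypothetical helper for illustration only.
passive_result() {
    printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n' \
        "$1" "$2" "$3" "$4" "$5"
}

# Example (not run here): report a finished job for a hypothetical CE.
# The command-file location is an assumption; it is whatever "command_file"
# is set to in nagios.cfg.
# passive_result "$(date +%s)" ce.example.org "ARCCE Job Result" 0 "Job finished" \
#     >> /path/to/nagios.cmd
```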

The configuration of these probes is found in /opt/nagios/etc/ndgf/arcnagios.cfg and /opt/nagios/etc/ndgf/arcnagios.ini, and the state information can be found under /var/opt/nagios/plugins/arcce/None.

The ENV/PROXY Runtime Environment

The check of the IGTF CA certificates requires the ENV/PROXY run-time environment. From [1]:

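#!/bin/bash
# ENV/PROXY runtime environment: makes the job's proxy and the CA
# certificates available inside the job's session directory.

x509_cert_dir="/etc/grid-security/certificates"

case "$1" in
  0) # Stage 0: on the front-end, before the job is passed to the batch system.
     mkdir -pv "$joboption_directory/arc/certificates/"
     cp -rv "$x509_cert_dir/" "$joboption_directory/arc"
     cat "${joboption_controldir}/job.${joboption_gridid}.proxy" >"$joboption_directory/user.proxy"
     ;;
  1) # Stage 1: on the worker node, just before the job starts.
     export X509_USER_PROXY="$RUNTIME_JOB_DIR/user.proxy"
     export X509_USER_CERT="$RUNTIME_JOB_DIR/user.proxy"
     export X509_CERT_DIR="$RUNTIME_JOB_DIR/arc/certificates"
     ;;
  2) # Stage 2: after the job has finished; nothing to do.
     :
     ;;
esac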

For general information about installing RTEs, see the documentation links at http://pulse.fgi.csc.fi/gridrer/htdocs/index.phtml.

ARC Cache Index

The ARC Cache Index service checks which CEs cache srm://srm.ndgf.org/ops/nagios-chaperon/testfile. If ACIX is set up on a CE, this file should be cached, since the ARC probes use it. Each time the file is seen for a CE, the service publishes a result to the passive ARC Cache service associated with that CE. This passive service fails if nothing has been published for an extended period of time, so it is sufficient that the cache entry is seen now and then.

If the ARC Cache Index service for a CE fails, the site operator should check

  • that ACIX is configured,
  • that the port is accessible to scooter.ndgf.org, and
  • that the cache hasn't filled up.

It may help to restart ACIX and wait for Nagios to send the next SRM staging check, which should happen within an hour.

The probe checks the following URL: https://cacheindex.ndgf.org:6443/data/index?url=srm://srm.ndgf.org/ops/nagios-chaperon/testfile
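
To repeat the lookup by hand, the query URL can be assembled from the index endpoint and the file URL, as in this small sketch. The helper name acix_query_url is hypothetical, and the curl options in the comment are an assumption about what a typical grid host needs in order to verify the server certificate.

```shell
# Build the ACIX index query URL for a given file URL.
# Usage: acix_query_url INDEX_BASE_URL FILE_URL
# A trailing slash on INDEX_BASE_URL is tolerated.
# Hypothetical helper for illustration only.
acix_query_url() {
    printf '%s/data/index?url=%s\n' "${1%/}" "$2"
}

# Example (not run here): query the index; the response should indicate
# which CEs currently cache the test file.
# curl --silent --capath /etc/grid-security/certificates \
#     "$(acix_query_url https://cacheindex.ndgf.org:6443 \
#         srm://srm.ndgf.org/ops/nagios-chaperon/testfile)"
```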

Site BDII

The BDII/GLUEINFO Service

This service is monitored on BDII servers. It checks that a certain number of Tier 1 CEs are published in the GLUE tree (mds-vo-name=NDGF-T1,o=grid). If we add or decommission CEs, we should update the expected number. Our Tier 1 CEs can be seen in MyEGI or GOCDB and are, at the time of this writing:

  • abisko-ce.hpc2n.umu.se
  • arc-ce.smokerings.nsc.liu.se
  • ce01.grid.uio.no
  • gateway01.dcsc.ku.dk
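
To see by hand what a BDII publishes under this base, one can query it with ldapsearch and count the entries. The sketch below counts distinct values of one attribute in LDIF output. The helper name count_ldif_values, the host bdii.example.org, and the choice of the GlueCluster object class are assumptions; the actual probe may use a different filter.

```shell
# Count distinct values of one attribute in LDIF read from standard input.
# Usage: count_ldif_values ATTRIBUTE < ldif
# Only the first whitespace-separated token of each value is compared,
# which is enough for host names. Attribute names match case-insensitively.
# Hypothetical helper for illustration only.
count_ldif_values() {
    awk -v attr="$1" '
        BEGIN { key = tolower(attr) ":" }
        tolower($1) == key && !seen[$2]++ { n++ }
        END { print n + 0 }
    '
}

# Example (not run here): count clusters published in the Tier 1 GLUE tree.
# 2170 is the conventional BDII port; the host name is a placeholder.
# ldapsearch -x -LLL -o ldif-wrap=no -H ldap://bdii.example.org:2170 \
#     -b mds-vo-name=NDGF-T1,o=grid '(objectClass=GlueCluster)' GlueClusterUniqueID \
#     | count_ldif_values GlueClusterUniqueID
```

The resulting number can then be compared with the expected number of CEs listed above.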

Monitoring of Other Services

GGUS Tickets

Two Nagios services keep track of NGI_NDGF and ops.ndgf.org tickets, respectively, and publish updates to the next weekly page. These probes don't trigger alerts unless they fail to fetch the data or fail to update the wiki. However, if an alarm ticket is found in state new or assigned, it is published to a third, passive service, "GGUS NGI_NDGF Alarm Tickets".

Scheduling Downtime

Downtime for a particular host can be scheduled from the host view in Nagios, where two separate commands are available:

  • Schedule downtime for this host.
  • Schedule downtime for all services on this host.

Downtime for a group of services can be scheduled from the service group summary.
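
The web interface actions correspond to Nagios external commands, so downtime can in principle also be scripted. The following is a minimal sketch that formats such a command for fixed downtime. The helper name downtime_command, the host name, the author, and the command-file path are hypothetical, and whether scripted downtime is appropriate on this installation is for the Nagios administrators to decide.

```shell
# Print a Nagios external command that schedules fixed downtime.
# Usage: downtime_command COMMAND TARGET START END AUTHOR COMMENT [NOW]
# START, END and NOW are Unix timestamps; NOW defaults to the current time.
# COMMAND is one of:
#   SCHEDULE_HOST_DOWNTIME              (the host itself)
#   SCHEDULE_HOST_SVC_DOWNTIME          (all services on the host)
#   SCHEDULE_SERVICEGROUP_SVC_DOWNTIME  (all services in a service group)
# Hypothetical helper for illustration only.
downtime_command() {
    dc_now=${7:-$(date +%s)}
    # Fields: target;start;end;fixed=1;trigger_id=0;duration;author;comment
    printf '[%s] %s;%s;%s;%s;1;0;%s;%s;%s\n' \
        "$dc_now" "$1" "$2" "$3" "$4" "$(( $4 - $3 ))" "$5" "$6"
}

# Example (not run here): one hour of downtime for a hypothetical host.
# downtime_command SCHEDULE_HOST_DOWNTIME ce.example.org \
#     1700000000 1700003600 operator "Kernel upgrade" >> /path/to/nagios.cmd
```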