Nagios Monitoring at NT1
This page describes the Nagios monitoring of NeIC Tier 1 resources. We also provide monitoring for other Nordic grid resources. The page is directed towards site operators of the respective resources.
Monitored Storage Services
dCache Head Nodes
On head nodes, we monitor the basic health of the system, including:
- Disk usage, CPU load, memory, and process count via Ganglia.
- A probe to check the RAID health.
- Availability of the SSH port.
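A port availability check of the kind listed above could be sketched as follows. This is only an illustration, not the probe actually deployed: the function name check_tcp_port and the timeout default are assumptions, and the exit codes follow the usual Nagios plugin convention (0 for OK, 2 for CRITICAL).

```shell
#!/bin/bash
# Hypothetical probe: report whether a TCP port accepts connections.
# Returns 0 (OK) or 2 (CRITICAL), following the Nagios plugin convention.
check_tcp_port() {
    local host=$1 port=$2 wait_seconds=${3:-5}
    # bash opens a TCP connection when redirecting to /dev/tcp/HOST/PORT;
    # timeout(1) bounds the attempt so a filtered port cannot hang the probe.
    if timeout "$wait_seconds" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null; then
        echo "OK - $host:$port is accepting connections"
        return 0
    else
        echo "CRITICAL - $host:$port is not reachable"
        return 2
    fi
}
```

For example, `check_tcp_port headnode.example.org 22` would test the SSH port of a head node (the hostname is a placeholder).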
dCache Pools
dCache provides information about pools at https://chaperon.ndgf.org:2288/. The probe uses an internally accessible version of the /info sub-URL, which contains machine-readable information that is parsed by the "gather dCache" service on localhost. That service publishes the information to the dCache Pool Services.
In addition, the read and write availability of each site is reported as services on localhost, for lack of a better place.
dCache Pool Groups
Monitored Compute Services
ARC Job Submission
The list of CEs currently tested can be seen in the ARC CEs host group on chaperon.
For each ARC test we run on a host, there is an active probe to submit jobs and a passive probe to report the results. There are three such pairs: one for testing plain job submission and retrieval, and two for testing jobs with data staging:
- ARCCE Job Submit (active) and ARCCE Job Result (passive)
- ARCCE SRM and GridFTP Submit (active) and ARCCE SRM and GridFTP Result (passive)
- ARCCE LFC Submit (active) and ARCCE LFC Result (passive)
Fetching of the jobs and reporting to the passive services is done by a single service, ARCCE Monitor, associated with localhost.
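To illustrate how a result reaches a passive service, the sketch below formats a PROCESS_SERVICE_CHECK_RESULT line in the Nagios external command syntax and appends it to the command file. It is not taken from the ARCCE Monitor itself; the function names are hypothetical, and the command file location is an assumption that depends on the command_file directive in nagios.cfg.

```shell
# Format one passive service check result in the Nagios external command syntax:
#   [<unix time>] PROCESS_SERVICE_CHECK_RESULT;<host>;<service>;<return code>;<output>
format_passive_result() {
    local now=$1 host=$2 service=$3 code=$4 output=$5
    printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n' \
        "$now" "$host" "$service" "$code" "$output"
}

# Append a result, stamped with the current time, to the Nagios command file.
# The path passed as the first argument is an assumption (see command_file
# in nagios.cfg); the remaining arguments are host, service, code and output.
submit_passive_result() {
    local command_file=$1
    shift
    format_passive_result "$(date +%s)" "$@" >> "$command_file"
}
```

A call such as `submit_passive_result /path/to/nagios.cmd ce01.example.org 'ARCCE Job Result' 0 'Job succeeded'` would then mark the passive service as OK; both the path and the hostname are placeholders.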
The configuration of these probes is found in /opt/nagios/etc/ndgf/arcnagios.cfg and /opt/nagios/etc/ndgf/arcnagios.ini and the state information can be found under /var/opt/nagios/plugins/arcce/None.
The ENV/PROXY Runtime Environment
The check of the IGTF CA certificates requires the ENV/PROXY run-time environment. From [1]:
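#!/bin/bash
# ENV/PROXY: ship the CA certificates and the user proxy along with the job.
x509_cert_dir="/etc/grid-security/certificates"
case "$1" in
    0)  # Stage 0: on the front end, before the job is submitted to the batch system.
        mkdir -pv "$joboption_directory/arc/certificates/"
        cp -rv "$x509_cert_dir/" "$joboption_directory/arc"
        cat "${joboption_controldir}/job.${joboption_gridid}.proxy" >"$joboption_directory/user.proxy"
        ;;
    1)  # Stage 1: on the worker node, before the job starts.
        export X509_USER_PROXY="$RUNTIME_JOB_DIR/user.proxy"
        export X509_USER_CERT="$RUNTIME_JOB_DIR/user.proxy"
        export X509_CERT_DIR="$RUNTIME_JOB_DIR/arc/certificates"
        ;;
    2)  # Stage 2: after the job has finished; nothing to clean up.
        :
        ;;
esac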
For general information about installing RTEs, see the documentation links at http://pulse.fgi.csc.fi/gridrer/htdocs/index.phtml.
ARC Cache Index
The ARC Cache Index service checks which CEs cache srm://srm.ndgf.org/ops/nagios-chaperon/testfile. If ACIX is set up on a CE, this file should be cached, since the ARC probes use it. Each time the file is seen for a CE, the service publishes to the passive ARC Cache service associated with that CE. This passive service will fail if nothing has been published for an extended period of time. Thus it is sufficient that the cache entry is seen now and then.
If the ARC Cache Index service for a CE fails, the site operator may check
- that ACIX is configured,
- that the port is accessible to scooter.ndgf.org, and
- that the cache hasn't filled up.
It may help to restart ACIX and wait for Nagios to send the next SRM staging check, which should happen within an hour.
The probe checks the following URL: https://cacheindex.ndgf.org:6443/data/index?url=srm://srm.ndgf.org/ops/nagios-chaperon/testfile
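When debugging, the same query can be made by hand. The sketch below builds the query URL in the form shown above and tests whether a response mentions a given CE. The function names are hypothetical, and the check rests on an assumption: that the index response lists the hostnames of the caches holding the file. The actual response format should be verified against the index server.

```shell
# Build the index query URL for a file, in the same form as the URL above.
acix_query_url() {
    local index_base=$1 file_url=$2
    printf '%s/data/index?url=%s\n' "$index_base" "$file_url"
}

# Read an index response on stdin and succeed if it mentions the given CE.
# Assumption: the response lists the hostnames of caches holding the file.
response_lists_ce() {
    grep -qF -- "$1"
}
```

A manual check could then look like `curl -s "$(acix_query_url https://cacheindex.ndgf.org:6443 srm://srm.ndgf.org/ops/nagios-chaperon/testfile)" | response_lists_ce ce01.grid.uio.no`. Depending on the local setup, curl may need additional options to trust the server certificate.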
Site BDII
The BDII/GLUEINFO Service
This service is monitored on BDII servers. It checks that a certain number of Tier 1 CEs are published in the GLUE tree (mds-vo-name=NDGF-T1,o=grid). If we add or decommission CEs, we should update the expected number. Our Tier 1 CEs can be seen in MyEGI or GOCDB and are, at the time of this writing:
- abisko-ce.hpc2n.umu.se
- arc-ce.smokerings.nsc.liu.se
- ce01.grid.uio.no
- gateway01.dcsc.ku.dk
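The idea behind this check can be sketched as a filter over LDIF output. This is an illustration rather than the deployed probe: the function names are hypothetical, the input is assumed to contain GLUE 1.3 attributes of the form "GlueCEUniqueID: host:port/queue", and the comparison is assumed to be "at least the expected number".

```shell
# Count distinct CE hosts in LDIF read from stdin.
# Assumes attributes of the form "GlueCEUniqueID: host:port/queue".
count_published_ces() {
    sed -n 's|^GlueCEUniqueID: *\([^:/]*\).*|\1|p' | sort -u | wc -l | tr -d ' '
}

# Compare the count against the expected number of CEs (first argument).
# Assumption: the check passes when at least that many CEs are published.
check_ce_count() {
    local expected=$1 found
    found=$(count_published_ces)
    if [ "$found" -ge "$expected" ]; then
        echo "OK - $found CEs published (expected at least $expected)"
        return 0
    fi
    echo "CRITICAL - only $found CEs published (expected at least $expected)"
    return 2
}
```

The input could come from a query such as `ldapsearch -x -LLL -H ldap://bdii.example.org:2170 -b 'mds-vo-name=NDGF-T1,o=grid' '(objectClass=GlueCE)' GlueCEUniqueID | check_ce_count 4`, where the BDII hostname and port are placeholders.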
Monitoring of Other Services
GGUS Tickets
Two Nagios services keep track of NGI_NDGF and ops.ndgf.org tickets, respectively, and publish updates to the next weekly page. These probes don't trigger alerts unless they fail to fetch the data or fail to update the wiki. However, if an alarm ticket is found in state new or assigned, it is published to a third, passive service, "GGUS NGI_NDGF Alarm Tickets".
Scheduling Downtime
Downtime for a particular host can be scheduled from the host view in Nagios, which offers two separate commands:
- Schedule downtime for this host.
- Schedule downtime for all services on this host.
Downtime for a group of services can be scheduled from the service group summary.
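The two host-level actions correspond to the Nagios external commands SCHEDULE_HOST_DOWNTIME and SCHEDULE_HOST_SVC_DOWNTIME, so downtime can also be scripted. The following is a minimal sketch, assuming fixed downtime with no trigger; the function name is hypothetical, and the output would need to be appended to the Nagios command file, whose location depends on the local configuration.

```shell
# Emit the external commands that schedule fixed downtime for a host and,
# separately, for all services on that host. Times are Unix timestamps.
format_host_downtime() {
    local now=$1 host=$2 start=$3 end=$4 author=$5 comment=$6
    local duration=$((end - start))
    local cmd
    for cmd in SCHEDULE_HOST_DOWNTIME SCHEDULE_HOST_SVC_DOWNTIME; do
        # Fields: host;start;end;fixed(1);trigger_id(0);duration;author;comment
        printf '[%s] %s;%s;%s;%s;1;0;%s;%s;%s\n' \
            "$now" "$cmd" "$host" "$start" "$end" "$duration" "$author" "$comment"
    done
}
```

For a group of services, Nagios provides the analogous command SCHEDULE_SERVICEGROUP_SVC_DOWNTIME, which takes a service group name in place of the host name.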