DCache Pool installation


This page provides a guide for local site admins on how to configure an NDGF Tier 1 dCache pool.


Pool operators guide

This is the guide for site admins to set up the requirements for an NDGF dCache pool. The actual installation of dCache is handled by NDGF operators and is described on Operation Procedures dCache.

Centrally managed dCache pools

Starting in fall 2017, dCache on the pools was gradually converted to NDGF-managed mode using Ansible and an unprivileged user account (no root access). The working name is "tarpool", since dCache is now installed from a tar file distribution instead of a locally installed deb/rpm package. Ganglia configuration, file system setup, TCP and OS tuning and more are still managed by the local site admins: basically everything that the "dcache" user cannot do. NDGF will install and upgrade Java and dCache and do the required configuration of dCache.

Pool procurement

There are separate pages for pool hardware sizing and some completed hardware procurements.

dCache user account

dCache runs as a regular user and does not need root privileges. Requirements (a minimal account setup sketch follows this list):

  • A non-privileged user account, usually and preferably called "dcache" for all new installations. Exceptions can be negotiated if needed. In all the following instructions we will use "dcache" as the user name; adjust accordingly if your site uses a different name.
  • A home directory for the "dcache" user where the dCache software and logs will be stored. Give it some space: 10 GB is good, 5 GB will probably work.
  • The dCache user should have bash as its shell, and bash should be installed as /bin/bash.
  • All or most dot-files in the "dcache" home directory will be overwritten by the tarpool installation ansible scripts that NDGF runs.
  • Lingering must be enabled for the account (loginctl enable-linger dcache). This is currently only needed for the node exporter when not using a locally managed Prometheus instance, but it will likely also be needed for the dcache service in the future.
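As an illustration, a minimal account setup along these lines could look as follows. This is a sketch assuming a typical Linux distribution with useradd and systemd; adjust the home directory location and UID policy to your site.

# Create the unprivileged dCache user with a bash shell and a home directory
useradd --create-home --shell /bin/bash --comment "NDGF dCache pool user" dcache

# Enable lingering so that user-level services (currently the node exporter)
# keep running without an active login session
loginctl enable-linger dcache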

Some basic requirements for pool machines

  • Space for pools should be ready for use, with an empty top-level directory owned by the "dcache" user.
  • dCache should not be installed. If existing, the /usr/local/share/dcache/ directory should be empty (TODO: move this check to the tarpool ansible script).

For UiO: The Ansible role dcache-tar-pool-boostrap has been created to automate these steps. See Example about applying dcache-tar-pool-boostrap on UiO

Logrotate

dCache logs to ~/dcache/log/. The logrotate package should be installed; the dCache log files will then be rotated automatically.

Cron

The tarpool installation uses cron to automatically start dCache when the pool machine boots, so cron has to be installed, running and usable by the dCache user. Please verify that the local dCache user is allowed to run `crontab -e`; a quick check is sketched below. It is possible that this will be replaced by some systemd gizmo in the future.
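A quick way to verify this, assuming root access on the pool node (listing an empty crontab is fine, a "not allowed" error is not):

# Check that the cron daemon is running (unit name is cron or crond depending on distribution)
systemctl is-active cron || systemctl is-active crond

# Check that the dcache user may use crontab
su - dcache -c 'crontab -l'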

If for some reason the machine needs multiple reboots the cron job can be temporarily disabled to prevent automatic restarts of dCache.

NDGF SSH management access

For NDGF to be able to manage the tarpool setup we need public key SSH access as the non-privileged dCache user. Here is the initial authorized_keys. It will be replaced during the tarpool installation. See also the firewall settings.

If for some reason the ~dcache/.ssh/authorized_keys is not used by sshd on the pool node, symlink it from the correct place.
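As an example, the initial key file could be installed like this (run as root; the source file name is just a placeholder for wherever you saved the key from this page, and the primary group is assumed to also be called "dcache"):

# Create ~dcache/.ssh with correct ownership and permissions, then install the key
install -d -m 700 -o dcache -g dcache ~dcache/.ssh
install -m 600 -o dcache -g dcache /tmp/ndgf-initial-authorized_keys ~dcache/.ssh/authorized_keys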

Recommended extra packages

For operation of the tarpool there are a number of packages that are nice to have but not strictly required:

  • iperf3 (optionally open port 5201 in the firewall, but the dCache port range can be used as well)
  • emacs-nox (for nice log reading)

Network and OS tuning

For all tuning concerns, please see Operations Tuning Linux. There are some critical bits there. (TODO: check for tuning in the tarpool ansible scripts)

Adjust the limits for open files

dCache pools can use a lot of simultaneous network connections and open files on disk, so the limits for open files need to be adjusted. We also allow more threads.

In the tarpool distribution there is a file for this. This file can be symlinked to /etc/security/limits.d/:

ln -s ~dcache/dcache/config/92-dcache.conf /etc/security/limits.d/92-dcache.conf

The file will appear during the tarpool installation but can be symlinked in advance.

Or you can copy the file (once it's installed):

cp ~dcache/dcache/config/92-dcache.conf /etc/security/limits.d/92-dcache.conf

Or you can create /etc/security/limits.d/92-dcache.conf manually with the content:

#
# limits configuration for dcache
#

# number of processes/threads
dcache soft nproc unlimited

# number of open files and connections
dcache soft nofile 65535
dcache hard nofile 65535

Beware that the file needs to be copied and modified if you don't use "dcache" as your dCache pool user. (TODO: check for this in the tarpool ansible scripts)
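To check that the new limits are actually picked up at login (the exact values depend on the file contents and your PAM configuration):

# Should report the raised nofile and nproc limits for the dcache user
su - dcache -c 'ulimit -n -u'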

Raid volumes

TODO: Add content here about the size of raid volumes, choice of RAID solution and some basic optimizations to do.

File system

We recommend XFS as the file system for dCache pools. Other possible candidates are ext4, OpenZFS and GPFS.

dCache pools contain two directories: the meta/ directory is used for storing metadata and the data/ directory contains the data. The meta/ directory will contain a set of Berkeley DB (Java Edition) files, which might take a couple of hundred megabytes for a moderately large pool, growing to several gigabytes for pools with tens of TB. Access patterns to data/ and meta/ may be quite different, and for this reason we often recommend moving meta/ to different spindles or SSD, e.g. a mirrored system disk.

Access patterns to data/ will be (near) sequential writes, (near) sequential reads and, for ALICE, random reads with small blocks. In particular for ALICE it is important not to ignore read IOPS and to take the small block size into account; ZFS for ALICE should thus have a lower record size than the default of 128 kB.

Make some tests of file creation and deletion speed (a rough test is sketched below). Certain combinations of file system and underlying RAID system have turned out to have severe performance issues, with file deletion rates as low as 1-2 files per second.
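One simple way to test is to time writing and removing a batch of small files in the pool's data/ directory (the path below is just an example; run as the dcache user):

# Create 1000 files of 1 MB each, then delete them, and note the elapsed times
cd /pool1/data
time sh -c 'for i in $(seq 1 1000); do dd if=/dev/zero of=speedtest.$i bs=1M count=1 status=none; done'
time rm -f speedtest.*

# For ZFS pools serving ALICE, a smaller record size than the 128 kB default may help,
# e.g. (pool/dataset name is an example): zfs set recordsize=32K tank/alice-pool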

Automatic pool reboots

The file /tmp/dcache_is_shut_down tells the site that dCache has been shut down by the NDGF pool operators and that the pool machine can be rebooted.

The reboot can either be done automatically by cron or by manual intervention by the site. Benefit of automatic reboots: less interaction needed between site and central operators, resulting in very short downtimes. Drawback: if the pool does not reboot cleanly we urgently need site intervention, so some interaction is still needed.

The suggested implementation is for sites to automate machine reboot using a cron job or systemd timer. Remember to cater for:

  • Only reboot during office/staffed hours
  • Only reboot if the machine really needs a reboot (i.e. Debian/Ubuntu: presence of /var/run/reboot-required, RHEL/CentOS: needs-restarting -r exits non-zero)
  • Don't reboot if someone is logged in (systemd loginctl list-sessions can help determine this)

When the pool machine restarts it will automatically start dCache and remove this file.

As of now, 2021-10-08, this section is optional. This might change in the future.
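For illustration, a hypothetical reboot helper combining the checks above could look like the sketch below (Debian/Ubuntu style reboot detection; on RHEL/CentOS use needs-restarting -r instead). Scheduling it from cron or a systemd timer only during staffed hours covers the first point.

#!/bin/bash
# Reboot the pool node if NDGF has shut dCache down, the OS needs a reboot,
# and nobody is logged in. Run from cron during office hours only, e.g.:
#   0 9-15 * * 1-4  root  /usr/local/sbin/reboot-if-needed

# Only act when NDGF operators have shut dCache down
[ -f /tmp/dcache_is_shut_down ] || exit 0

# Only reboot if the machine actually needs it (Debian/Ubuntu check)
[ -f /var/run/reboot-required ] || exit 0

# Don't reboot if someone is logged in
[ "$(loginctl list-sessions --no-legend | wc -l)" -eq 0 ] || exit 0

/sbin/shutdown -r +5 "Rebooting for pending updates while dCache is shut down by NDGF"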

Firewall configuration

The machine firewall should:

  • Allow all outbound connections (TCP and UDP).
  • Allow inbound TCP connections on ports from dcache.net.*.port.min to dcache.net.*.port.max (from dcache.conf) for data transfers. This differs between sites, but defaults to 20000-25000.
  • Allow inbound TCP connections from prometheus.ndgf.org to port 19100 (used by Prometheus) or port 9100 if you give us access to your own Prometheus.
  • Allow inbound TCP connections from chaperon.ndgf.org on port 8649 (used by Ganglia).
  • Allow inbound TCP connections for SSH/port 22 from NDGF for pool management:
109.105.124.128/25
2001:948:40:2::/64

If there is an external network firewall, the same traffic must be allowed there as well.
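As a sketch, the SSH and data-transfer rules could be expressed with firewalld roughly like this (assumes the default zone and the default 20000-25000 port range; the Prometheus and Ganglia source hosts have to be entered as IP addresses in similar rich rules):

firewall-cmd --permanent --add-port=20000-25000/tcp
firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="109.105.124.128/25" port port="22" protocol="tcp" accept'
firewall-cmd --permanent --add-rich-rule='rule family="ipv6" source address="2001:948:40:2::/64" port port="22" protocol="tcp" accept'
firewall-cmd --reload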

Java

Java should not be installed unless needed by other site local services.

dCache uses the JAVA or JAVA_HOME environment variables, if set, to locate the Java installation directory. These would conflict with the Java installed by the tarpool setup, so they should preferably not be defined system-wide.

User account

An NDGF Tier-1 dCache pool is recommended to run as a non-root user. The FHS-compliant version of the dCache packages automatically creates the user account 'dcache' during installation, and dCache will switch to this account during startup.

The pool directories need to be writable to that user.

The dCache packages come with a /etc/security/limits.d/92-dcache.conf file that sets decent defaults (at least the .deb and .rpm do). This gives a good default for open files, provided dCache runs as the user dcache; if you run it as another account, you will have to make sure similar limits are set for that account. Once dCache is started, please verify that the elevated limits have been applied by checking /proc/PID/limits.
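A quick way to check this, assuming the pool runs as the "dcache" user and its command line contains "dcache":

# Show the applied limits of the running dCache java process
cat /proc/$(pgrep -u dcache -f dcache | head -1)/limits | grep -E 'open files|processes'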

Networking

The Tier 1 pools MUST be on the LHCOPN under normal conditions, but must be able to switch over to the general network in case of an interruption that lasts longer than just a few hours. The exception is the Slovenian pools, which are connected to the LHCONE network. It is recommended to use proper routing fail-over with BGP, but it is also acceptable for this to be a manual procedure that can be carried out by the next working day, but no later than that.

Host name and IP numbers

Internal transfers between dCache pools rely on the receiving pool initiating the transfer, and it does so using the IP address found by looking up the hostname. This means that hostname(1) must return a name that incoming transfers can contact.

The pools must have both a public IPv4 and an IPv6 address. No NAT.
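A simple sanity check of the name and address setup:

# The hostname must resolve to the pool's public IPv4 and IPv6 addresses
hostname
getent ahostsv4 "$(hostname)"
getent ahostsv6 "$(hostname)"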

Network addresses, routing and DNS can be discussed with NDGF operators.

Host certificates, CA certificates and CRLs

For third-party copy with davs, the pools should have the certificate revocation lists (CRLs) and the certification authority (CA) certificates locally installed and updated regularly (just like the ARC services).

The IGTF CA packages should be installed and automatically updated, and the fetch-crl-cron service should be enabled so that CRLs are updated automatically.

CentOS 7 example installing additional CA packages:

yum install -y epel-release
yum install -y https://repository.egi.eu/sw/production/cas/1/current/repo-files/egi-trustanchors.repo
yum install -y ca-policy-egi-core fetch-crl  
systemctl enable fetch-crl-cron.service
fetch-crl -v 

EL9 probably needs this for updating CRLs:

systemctl enable fetch-crl.timer

For updating CA packages automatically something like this will do:

[user@pool ~]# cat /etc/cron.d/update-ca-certificates 
# Update ca-certificate packages every day at midnight
0 0 * * * root /usr/bin/dnf update -qy ca-certificates ca-policy-egi-core ca-policy-lcg

Currently pools do not require host certificates, but this may change in the future.

Special EL9 instruction:

Red Hat disabled SHA-1 crypto in OpenSSL in EL9. It is still needed for some IGTF Root CAs, so please run:

update-crypto-policies --set DEFAULT:SHA1

Eventually these Root CAs will be replaced, at which point we can remove this again.

Timekeeping

dCache requires synchronised clocks, so make sure that NTP or similar is both running and working.
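A quick check on systemd-based systems (chronyc tracking gives more detail if chrony is used):

# "System clock synchronized" should be "yes"
timedatectl | grep -i synchronized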

Local pool monitoring

The site is expected to do basic monitoring of the machines, like Nagios-style checks of RAID controllers and disks.

(TODO: This sections needs to be expanded)

Metrics collection

To ease debugging and day to day operation of dCache, we need some rudimentary metrics collection of all pools. We have chosen Ganglia as our preferred solution for dCache pools but are currently investigating Prometheus as a Ganglia replacement.

Ganglia

Ganglia consists of three components:

  • The monitoring daemon, gmond
  • The meta daemon, gmetad
  • The web interface, ganglia-webfrontend

It is only necessary to run the monitoring daemon, gmond, on a dCache pool. Inside a site gmond can either exchange data via UDP multicasting or direct UDP unicast to a few selected instances. This data is periodically queried by the meta daemon running in Ørestaden, Copenhagen, and presented on the web at https://chaperon.ndgf.org/ganglia/.

In the configuration file you should configure send_metadata_interval to something between 60 seconds and a few minutes and give your site ("cluster" in Ganglia terminology) a good name (this should be discussed with the NDGF admins).

Remember to open your firewall such that chaperon.ndgf.org can talk to the monitoring daemon on port 8649, unless you explicitly configured Ganglia to operate on another port.

UDP channels

Even if you only run a single gmond on a single host, the UDP receive and send channels have to be set up: gmond consists of two separate parts (collecting and publishing) and these only communicate through UDP unicast or multicast. If you have configured a tcp_accept_channel that accepts localhost, you can check that data collection works properly by telnetting to port 8649 and checking that the required attributes are present.

A note on UDP buffer sizes on collectors/aggregators

Traffic from hosts to collectors is bursty, and for larger setups it often happens that the default UDP receive buffers are too small to handle these bursts. If you have hosts behaving strangely, with some or all metrics missing at times, this is quite likely the cause.

On Linux you can check /proc/net/udp for drops; look for the sockets owned by the Ganglia runtime uid. If the drops column is non-zero, your default UDP buffer is too small.
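A quick way to look at this, assuming the gmond runtime user is called "ganglia" (adjust to your distribution):

# Print the local address and the drops column for gmond's UDP sockets;
# non-zero drops means the receive buffer is too small
awk -v uid="$(id -u ganglia)" '$8 == uid {print "local", $2, "drops", $NF}' /proc/net/udp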

Remedy this by adding the following to all udp_recv_channel definitions:

  buffer = 4194304

4 MB has proven to be enough on setups with hundreds of nodes, but it can be increased further.

You can see the buffer size chosen by manually starting gmond with the --debug=11 debug flag.

Monitoring multiple hosts

If you want to monitor multiple local hosts, run the Ganglia monitoring daemon on each of them. By default, they will locate each other by UDP multicasting on channel 239.2.11.71 and port 8649 (this channel is administratively scoped and will not leave your site; if you have not configured multicast routing, it will not even leave the subnet). It is often better to set up one or two dedicated hosts, "collectors", and use UDP unicast from the other hosts.

If you have configured a firewall on the hosts, make sure to open it for this internal communication; see Firewall configuration. If you have multiple NICs, make sure that Ganglia traffic is routed over the correct NIC (i.e. add a route for 224.0.0.0/4 to that NIC if using multicast). Only one or two hosts need to communicate with chaperon.ndgf.org, though.

This is an example of a site configuration using direct UDP messages instead of multicast.

globals {
  ...
  send_metadata_interval = 120
}

cluster {
  # Name of the cluster that is displayed in the ganglia web gui
  name = "XXXXXXX"
  ...
}

udp_send_channel {
  # This should be your local "collector"
  # Add more udp_send_channels if you have more than
  # one collector for the cluster. Configure this on
  # all hosts.
  host = foo.example.org
  port = 8649
}

# The udp_recv_channel and tcp_recv_channel is only needed on the
# dedicated collectors. udp is used internally for the data 
# collection and tcp for the central gmetad to collect the data.
udp_recv_channel {
  family = inet4
  port = 8649
  # Consider defining the udp buffer size if you experience UDP packet drops, see /proc/net/udp
  #buffer = 4194304
  acl {
    default = "deny"
    access {
      # Your local network goes here
      ip = XXX.XXX.XXX.XXX
      mask = XX
      action = "allow"
    }
  }
}

udp_recv_channel {
  family = inet6
  port = 8649
  # Consider defining the udp buffer size if you experience UDP packet drops, see /proc/net/udp
  #buffer = 4194304
  acl {
    default = "deny"
    access {
      # Your local network goes here
      ip = XXXX:::::
      mask = 64
      action = "allow"
    }
  }
}

tcp_accept_channel {
  port = 8649
  family = inet4
  acl {
    default = "deny"
    access {
      ip = 127.0.0.1
      mask = 32
      action = "allow"
    }
    access {
      # Current IP of chaperon
      ip = 109.105.124.161
      mask = 32
      action = "allow"
    }
  }
}

tcp_accept_channel {
  port = 8649
  family = inet6
  acl {
    default = "deny"
    access {
      ip = ::1
      mask = 128
      action = "allow"
    }
    access {
      # Current IP of chaperon
      ip = 2001:948:40:2::161
      mask = 128
      action = "allow"
    }
  }
}