DCache TSM interface
Efficient Northern Dcache Interface to TSM
ENDIT is the dCache-TSM interface developed during DC4 in April 2006 at HPC2N.
This is in use for most of the tape pools within NDGF, since most sites are running TSM (IBM Spectrum Protect).
Concept
ENDIT connects the dCache pools with TSM (IBM Spectrum Protect, previously known as Tivoli Storage Manager). From dCache's point of view tape storage is a plugin that is run from the tape-connected pools with either "get", "put", or "remove". To get efficient transfers ENDIT uses a couple of staging areas on the same filesystems as the pools. From the dCache point of view, the tape storage is partitioned into "hsminstance"s, where pools that can read and write to the same tape backend namespace share the same hsminstance. In NDGF the hsminstance is made up of the storage component and the domain name, for instance "atlas.hpc2n.umu.se" or "alice.bccs.uib.no". The hsminstance is set by the dCache administrators through dCache's administrative interface, not by the tape pool admin.
Filesystem layout
In NDGF's deployment, each tape connection needs a separate read and write pool. These can be on the same host or on different hosts. For ideal performance they should be independent, but it is quite possible to have a tape pool on just one filesystem. From a data availability point of view it is also good to have two independent tape pools; reconfiguring a write pool to be a read pool is usually much easier than setting up a brand new read pool.
A little bit of extra space is good: if dCache gets restarted at the "wrong" moment, files staged in might be forgotten about, and the same might possibly happen on the write pool. So allow wider margins for space than on normal pools. If you find lots of old files in "in" or "out", please contact NDGF operations and we'll figure out why and help clean up.
Also note that the path and UID/GIDs need to be the same across all pools for the same hsminstance; in our example the "/grid/pool/out" on the write pool needs to be matched by a "/grid/pool/in" on the read pool. This is the path that the files are stored under in the TSM node, so once it is chosen it is fixed and not easy to change.
Example layout
This is a sample layout from HPC2N's ATLAS pools, where the basedir is chosen to be "/grid/pool". Note that the "in" directory must be in the same filesystem as the read pool, and "out" must be in the same filesystem as the write pool; this is so that rename (reading) and hardlinks (archiving to tape) work in ENDIT.
Read pool (contents of /grid/pool):
atlas_tape_read  in  out  request  requestlists  trash
Write pool (contents of /grid/pool):
atlas_tape_write  in  out  request  requestlists  trash
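For a new pool host this layout can be created up front. A minimal sketch, assuming the /grid/pool base directory from the example above and a hypothetical runtime user "dcache" that the pool and the ENDIT daemons run as:
# create the ENDIT staging directories next to the pool directory (hypothetical user/base dir)
mkdir -p /grid/pool/{in,out,request,requestlists,trash}
chown -R dcache:dcache /grid/pool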
OS Performance tuning
You MUST apply OS tunings to achieve the desired performance of a dCache tape pool. This should already be done if you have followed DCache Pool installation, but if you haven't, please see Operations Tuning Linux.
Verifying filesystem IO bandwidth
Verifying fs IO bandwidth prior to deployment is trivial as the IO pattern of a tape pool is single/few-stream sequential IO. It's essential that the filesystem can handle the load, and it's no fun finding the bottleneck later on when the system is in production.
The procedure can look like this:
- Start a program to monitor IO bandwidth; dstat 4 is known to behave well.
- Test write bandwidth:
  - Write a few large files (at least triple your RAM size) with dd if=/dev/zero of=fileN bs=256k and verify that dd reports approximately the same bandwidth as your monitoring program.
    - NOTE: If you are using a file system that does compression or detects zero-filled segments as sparse files, this will yield unrealistic performance numbers. You'll need to use something like fio that supports writing pseudo-random data at speed (see the sketch after this list).
  - Record the steady-state write bandwidth from your monitoring tool.
  - The write bandwidth should be at least on par with the NIC speed.
- Test read bandwidth:
  - Read previously written files with dd if=fileN of=/dev/null bs=256k and verify that dd reports approximately the same bandwidth as your monitoring program.
  - Record the read bandwidth.
  - The read bandwidth should be at least on par with the NIC speed.
- Test simultaneous read-and-write bandwidth/balance:
  - Start writing a new file with dd as in the write test. Leave it running.
  - Start reading a previously written file (not the same one you read last in the read bandwidth test).
  - Look at your IO monitor; when the numbers have stabilized, record the simultaneous read-and-write bandwidths.
  - Ideally the read and write IO bandwidth are identical (ie. balanced) and at the same level as the separate write and read bandwidths.
  - If read/write is severely unbalanced (more than a factor of 2 off), you need to apply tuning to get them in balance. Commonly this means adjusting the RAID controller write cache and readahead settings in combination with OS readahead tuning.
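If your filesystem compresses or sparsifies zero blocks, fio can be used instead of dd. A minimal sketch, assuming fio is installed and /grid/pool is the pool filesystem; the file name, job names and size below are examples only (scale the size to at least triple your RAM):
# sequential single-stream write of incompressible (pseudo-random) data
fio --name=write-test --filename=/grid/pool/fio-testfile --rw=write --bs=256k --size=100g --end_fsync=1
# corresponding sequential read of the file written above
fio --name=read-test --filename=/grid/pool/fio-testfile --rw=read --bs=256k --size=100g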
TSM configuration
This documentation uses the acronym TSM to refer to IBM Spectrum Protect (previously known as Tivoli Storage Manager).
IBM documentation is available at:
- v7 and older: https://www.ibm.com/docs/en/tsm
- v8 and newer: https://www.ibm.com/docs/en/spectrum-protect
TSM server setup
Please ensure that your TSM server is a supported configuration. One of the common mistakes is to use an unsupported Linux OS, for example CentOS. IBM will refuse to give support unless you're using one of the supported Linux operating systems (specific versions of RedHat and SUSE when this was written); this is no fun to find out when the world has collapsed and you really need that support...
The recommended way of setting up TSM is:
- Create a TSM domain for each storage group you support (example: NDGFT1ATLAS and/or NDGFT1ALICE).
- For each TSM domain you then set up a storage hierarchy.
- Set up archive data (ARCHIVE COPYGROUP) to go directly to tape without landing in a diskpool on the TSM server. ENDIT batches data writing, so a TSM diskpool provides no gain and is more likely a bottleneck.
  - Specifically verify that Retain Version is set to No Limit.
- Add a target node to hold all data (usually with the same name as the domain). This node will be used as a proxynode target (no direct access so set a long unguessable random password not known to anyone).
- Then set up dedicated ENDIT user pool nodes as agents for the relevant target proxynode (see the command sketch after this list).
- Note that the proxynode (target) and the user pool nodes (agents) can belong to different TSM domains.
- The proxynodes should correspond to the dCache hsminstances (ie ATLAS or ALICE).
- Avoid sharing multiple storage groups/hsminstances in the same TSM domain/storage hierarchy.
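As a rough sketch of the corresponding TSM administrative commands (run in dsmadmc), using hypothetical names (NDGFT1ATLAS domain, ATLAS target node, ATLAS_RD_POOL agent node) that you should replace with your own; see the IBM documentation for the complete set of parameters:
/* hypothetical example: target proxynode holding the data, not used for direct access */
register node ATLAS <long-random-password> domain=NDGFT1ATLAS
/* dedicated ENDIT pool node, acting as agent for the target */
register node ATLAS_RD_POOL <pool-node-password> domain=NDGFT1ATLAS
grant proxynode target=ATLAS agent=ATLAS_RD_POOL
/* check that the archive copygroup shows "Retain Version: No Limit" */
query copygroup NDGFT1ATLAS type=archive format=detailed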
Remember to modify node parameters for the proxynode (and pool nodes for clarity/corner cases):
- MAXNUMMP - set greater than the sum of concurrent dsmc sessions that might use this proxy node. See #Endit configuration retriever_maxworkers for read pools and archiver_threshold2_dsmcopts resourceutilization for write pools. We recommend adding some extra slack due to TSM having a tendency to also include recently released volumes when summing up the resource usage.
- SPLITLARGEObjects - set to No to optimize for tape.
- See the IBM documentation on node parameters for details: https://www.ibm.com/docs/en/spectrum-protect/8.1.12?topic=commands-register-node-register-node
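Continuing the hypothetical example above, these parameters can also be adjusted after registration with UPDATE NODE (the value 8 is only an example; size MAXNUMMP as described above):
/* hypothetical example - adjust node names and MAXNUMMP to your setup */
update node ATLAS maxnummp=8 splitlargeobjects=no
update node ATLAS_RD_POOL maxnummp=8 splitlargeobjects=no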
If you haven't done so already, please review the IBM guidelines on Performance Tuning: https://www.ibm.com/docs/en/spectrum-protect/8.1.12?topic=performance-tuning-components
At a minimum, apply the recommended tunings for high performance tape drives found at https://www.ibm.com/docs/en/spectrum-protect/8.1.12?topic=tuning-high-performance-tape-drives
TSM client setup
For performance you want a really fast RAID setup for the disk area used on the dCache tape pool. See DCache Pool Hardware for current sizing guidelines. From a tape efficiency point of view you want to be able to stream at full tape speed even when there are lots of incoming transfers going to disk at the same time.
ENDIT uses dsmc archive and dsmc retrieve to transfer files to/from tape. To make it possible to have several hosts retrieving and archiving to the same namespace, the proxy node setup is needed, as well as a common path to the "out" directory among all the nodes in the same hsminstance. To make this setup work when there might be different mount points for filesystems, we strongly recommend having a virtual mount point for the "out" directory.
Note that ENDIT runs as a non-root user, and the best practice is to use a dedicated TSM node for this. See https://www.ibm.com/docs/en/spectrum-protect/8.1.12?topic=cspc-enable-non-root-users-manage-their-own-data for detailed setup instructions.
For tarpools (ie. NDGF-managed dCache instances), place the DSM_CONFIG environment variable setup in the .bash_site_config file in the runtime user home directory; this file is sourced by the NDGF-managed shell initialization files.
To specify the location of the dsmc error log, use the -errorlogname=/path/to/file dsmc command-line option.
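A minimal sketch of what this could look like, assuming the TSM client is installed in its default location and that the error log should live under the runtime user's home directory (both paths are examples only):
# example ~/.bash_site_config for the ENDIT runtime user (hypothetical paths)
export DSM_CONFIG=/opt/tivoli/tsm/client/ba/bin/dsm.opt
# example of pointing dsmc at a dedicated error log, e.g. via ENDIT's dsmcopts:
#   -errorlogname=$HOME/endit/dsmerror.log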
dsm.sys
Ensure the following is present in your dsm.sys:
* TXNBYTELIMIT controls the maximum transaction size, we want
* it big enough to minimize the impact of buffer flushes.
TXNBYTELIMIT      10G
VIRTUALMountpoint /grid/pool/out
DISKBUFFSIZE      256
The TXNBYTELIMIT tuning optimizes for tape write; increasing the transaction size reduces the number of time-consuming buffer flushes.
We strongly recommend always specifying the VIRTUALMountpoint option. This defines the /grid/pool/out directory as a separate filesystem to TSM.
NOTE: There must be no trailing slash on the VIRTUALMountpoint option, or exclude parsing will behave strangely.
The DISKBUFFSIZE option sets the IO block size in kB and should be used in conjunction with the read-ahead tuning on Operations_Tuning_Linux#dCache_tape_pools.
Remember to set the MAXNUMMP node option high enough, see #TSM server setup for details.
dsm.opt
Ensure the following is present in your dsm.opt config file:
SKIPACL YES
SKIPACL yes avoids storing ACL information. ACL info is filesystem/architecture specific, so even if you don't plan to migrate to another OS this will reduce the chance of ACL data causing issues in the future. Examples of TSM ACL issues are files archived on Linux with ACL info that aren't restorable using Solaris, or Linux XFS/EXT4 ACLs differing so that files aren't restorable onto a different file system.
include-exclude
Remember to exclude the pool filesystems from the backup, use something similar to this in your include-exclude file:
exclude.dir /grid/pool*/.../*
exclude.fs /grid/pool*/.../*
Checking TSM client tunables
To check TSM client tuning options:
dsmc q opt
To check TSM client include-exclude list:
dsmc q inclexcl
Endit configuration
dCache plugin installation is not required on NDGF tarpool setups, but is required in traditional site-managed setups: get the ENDIT dCache plugin from the NeIC GitHub page (https://github.com/neicnordic/dcache-endit-provider/releases/latest) and install it in dCache, typically by unpacking the tarball in /usr/local/share/dcache/plugins/ so that you get a dcache-endit-provider-NN subdir there.
Get the latest version of the ENDIT daemons from the NeIC GitHub page (https://github.com/neicnordic/endit), preferably by cloning the master branch of the Git repository (recommended for NDGF Tier1 sites) or a release tag. Using a git checkout is preferred as it makes upgrading, testing bugfixes etc. easier. There is also the option to download and unpack a release archive file. Follow the installation/configuration instructions in the README (https://github.com/neicnordic/endit#endit-daemons).
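A rough sketch of a site-managed installation; version numbers and target directories below are examples only, adjust them to your environment:
# dCache ENDIT provider plugin (skip on NDGF tarpool setups); <version> is whatever release you downloaded
cd /usr/local/share/dcache/plugins/
tar xzf /path/to/dcache-endit-provider-<version>.tar.gz
# ENDIT daemons, from the master branch of the git repository (hypothetical target directory)
git clone https://github.com/neicnordic/endit.git /opt/endit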
On NDGF, we recommend tuning the following endit.conf items. This tuning assumes that there are multiple tape drives available for peak usage:
desc-short
- Set this to match the dCache pool.name set by the NDGF dCache admin. This makes it easy to collect and analyze the ENDIT logs.
currstatsdir
- This directory (default /run/endit) must exist in order to deliver Prometheus data, so either arrange for the directory to be created before the endit daemon starts or set it to a pre-created directory on a persistent file system, preferably $HOME/endit/stats.
archiver_thresholdN_usage
- Set these to make sense with your combination of tape drive speed and tape pool storage space. You want to reach 10 Gigabit (1+ GB/s) migration speed to tape with a sizable margin of storage space left for transfer bursts. N corresponds to the number of single-drive worker sessions spawned, for example archiver_threshold2_usage sets the limit for when to use 2 tape drives.
- archiver_threshold1_usage is usually tuned to be 20-30 minutes or more of tape activity (remember to check your worst-case performance as well). This plays well with shared tape libraries. For dedicated tape libraries this can be done differently depending on the TSM server Mount Retention device class setting.
  - A typical setting for a 10T tape pool with 400 MB/s class tape drives is 500 (GB).
- archiver_threshold2_usage is used to trigger usage of two tape drives if one drive can't keep up. We don't want to scatter files on multiple tapes if we can avoid it, so don't set it too low. 30-60 minutes of tape activity using two tape drives is a good starting point, but no more than 30% of the tape pool storage space.
  - A typical setting for a 10T tape pool with 400 MB/s class tape drives is 2000 (GB).
- archiver_threshold3_usage, archiver_threshold4_usage and so on: add more thresholds to achieve a ramp-up to 1+ GB/s migration speed to tape. The highest threshold should hit when there is still ample space left for transfer bursts, no more than 50% of the tape pool storage space.
  - A tape pool with 400 MB/s class tape drives needs three of them to achieve 1+ GB/s. A typical config for such a pool with 10T storage space is setting archiver_threshold3_usage to 4000 (GB).
retriever_maxworkers
- Setting this to 3 enables 3 concurrent retriever sessions, and thus the use of 3 tape drives. This number might vary between sites, depending on pledge/allocation and TSM server and tape pool IO capability.
- The goal is to achieve at least 1 GB/s transfer speed from tape.
retriever_hintfile
- The use of a tape hint file is required on NDGF.
- The hint file should be refreshed daily.
- See the ENDIT daemons README for the various ways to create/refresh the hint file.
retriever_remountdelay
- On small/SSD-based tape pools that don't have space to fit multiple tapes, set this to 1800 (instead of the default 7200 seconds). The reason for this is that the current dCache/ENDIT plugin implementation only issues recalls that fit in the pool, so we'll see remounts to process an entire tape.
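Put together, an endit.conf for a hypothetical 10T tape pool with 400 MB/s class tape drives could contain something like the following. The key names are the ones discussed above; the values and paths are purely illustrative, and the exact file syntax and defaults are documented in endit.conf.sample in the ENDIT repository:
# hypothetical example values - adjust to your pool and check endit.conf.sample for syntax/defaults
desc-short: atlas_tape_read
currstatsdir: /home/endituser/endit/stats
archiver_threshold1_usage: 500
archiver_threshold2_usage: 2000
archiver_threshold3_usage: 4000
retriever_maxworkers: 3
retriever_hintfile: /grid/pool/tapehints/atlas.hpc2n.umu.se.hints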
dCache configuration on the pool node
These days dCache needs no special configuration to set up the pools as tape frontends; just make sure there is a little bit of extra room for forgotten staged files etc.
Central dCache configuration
The dCache operator needs to remember the following settings (and remember to save):
cd the_pool_name_NNN
jtm set timeout -lastAccess=3600
jtm set timeout -total=86400
st set timeout 86400
rh set timeout 345600
set breakeven 0
# This setting should be defined in nt1-ansible dcache_poolinfo for the pool, max_active_movers_regular_queue: "12"
mover set max active 12
# Only do checksums once. On transfer is always whatever the setting. If you have ssd storage you could possibly consider -onrestore=on. YMMV.
csm set policy -onwrite=off -onflush=off -onrestore=off
# Create hsm, number of threads must be > max file restore rate in Hz. Use the same hsm instance for all tape pools at the same site.
hsm create osm atlas.hpc2n.umu.se endit-polling -threads=200 -directory=/dcache/pool/
# Tune queue for the storage class to do continuous flushing, instead of waiting for all flushes
# to finish before issuing more stores.
# If you don't have a queue defined for a storage class, it'll create one dynamically with
# defaults. So make sure you get the storage class right for the files on this pool. Usually `atlas:default` or `alice:tape`.
queue define class osm atlas:default -expire=1 -pending=1 -total=1 -open
save
Deployment
Create the filesystems, pools, etc. and configure ENDIT as described above. Then either grab an NDGF dCache admin in the chat or send an email with the full path to the directory together with information on which are the read and write pools (if you care). Ensure that the ENDIT daemons (tsmarchiver.pl, tsmretriever.pl and tsmdeleter.pl) are running and are started on reboot. The NDGF dCache admin will then run a set of tests on the pools. After successful testing, the pools will be taken into production use.
To configure our monitoring systems to pick up the ENDIT metrics run:
ansible-playbook plays/prometheus.yml -t prometheus -i environments/production -l prometheus_server,tarpool-group-or-host
Debugging
In the log directory there will be one logfile for each of the background tasks, and there is also information in the pool log file. Not all of these are easy to read and understand, but they contain the information needed to debug problems. Some statistics/usage data is logged, so expect these files to grow. If the log file name matches /var/log/dcache/*.log then log rotation is usually handled by the /etc/logrotate.d/dcache logrotate configuration file; otherwise you need to manage log rotation in some other manner.
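If your ENDIT logs live outside /var/log/dcache, a logrotate snippet along these lines can be used; the log path and retention below are assumptions, adjust them to where your ENDIT logs actually are:
# example /etc/logrotate.d/endit (hypothetical log path)
/var/log/endit/*.log {
    weekly
    rotate 8
    compress
    missingok
    notifempty
    copytruncate
}
Using copytruncate avoids having to restart the ENDIT daemons when the logs are rotated.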
Data availability
The WLCG MoU roughly says that data availability problems need to be fixed by the next business day. This means it is a good idea to have a secondary machine that can take over reading of data in case the read pool breaks totally; reconfiguring the write pool to do this is one option.
Writing is less critical as we have redundancy there (new incoming writes can go to another tape instance within NDGF), but of course once you have accepted a file it is very important that you don't have a failure mode that will lose that data.