NDGF dCache tape pool validation
Page status
THIS PAGE IS A WORK IN PROGRESS AND A BASIS FOR DISCUSSION
Introduction, purpose and goal
Validating that a dCache tape pool is production ready is hard. Besides checking that the tape pool is correctly installed, the server and the entire tape environment need to perform as required. Experience has shown that the only way to verify everything is to run tests that resemble production load.
This page aims to document the procedure and commands needed to assess the function and performance of a tape pool.
Methodology
Automation would be helpful, but the current process is fully manual. The benefit is that it teaches the tester about all the bits and pieces involved.
Logging
Since we need to touch multiple systems and run long-running operations, it is hard to keep track of everything. Keep a simple date/time-stamped log. It doesn't have to be advanced. An example:
2019-04-01 12:00 Functional write test performed, 12 cat pictures successfully stored on tape
2019-04-01 13:00 Functional read test performed, 12 cat pictures read from tape, matches original data
2019-04-01 14:00 Performance write test started, 900 GB, network performance approx 900 MB/s
2019-04-01 14:20 Performance write test finished, 900 GB, tape migration speed approx 360 MB/s with small dips
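Keeping the log format consistent is easier with a tiny helper. The following is a minimal sketch, not part of any NDGF tooling; the function names and the log file path are hypothetical, and the line format simply mirrors the example entries above.

```python
from datetime import datetime


def format_log_entry(message, when=None):
    """Return one log line formatted as 'YYYY-MM-DD HH:MM message'."""
    when = when or datetime.now()
    return f"{when:%Y-%m-%d %H:%M} {message}"


def append_log(path, message, when=None):
    """Append one time-stamped entry to the plain-text log file at path."""
    line = format_log_entry(message, when)
    with open(path, "a", encoding="utf-8") as log:
        log.write(line + "\n")
    return line
```

A call such as append_log("validation.log", "Functional write test performed") then adds one line stamped with the current local time.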
Functional tests
The purpose of these tests is to verify basic functionality. This catches basic setup issues like file permissions, TSM client misconfiguration, etc.
Write
Migrate a single tape file from another site (a cached copy in the read or write pool) with -tmode=precious and let it flush to tape. Once the flush has completed, try to restore the file (rh restore) and verify that the read pool can read the stored file.
Read
Read a file with "rh restore". If it works, you can also choose 1k files and repeat.
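To confirm that restored data matches the original, comparing checksums is usually enough. The sketch below assumes you have local copies of both the original and the restored file, and uses Adler-32 because that is the checksum type dCache commonly records; the function names are hypothetical.

```python
import zlib


def adler32_of_file(path, chunk_size=1024 * 1024):
    """Return the Adler-32 checksum of a file as 8 lowercase hex digits."""
    checksum = 1  # Adler-32 is defined to start at 1
    with open(path, "rb") as handle:
        # Read in chunks so large test files do not need to fit in memory.
        while chunk := handle.read(chunk_size):
            checksum = zlib.adler32(chunk, checksum)
    return f"{checksum & 0xFFFFFFFF:08x}"


def files_match(original_path, restored_path):
    """True if both files have the same Adler-32 checksum."""
    return adler32_of_file(original_path) == adler32_of_file(restored_path)
```

The hex string can also be compared by hand against the checksum dCache has stored for the file.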
Delete
Performance tests
Here we test the performance, as in data transfer bandwidth, of the tape pool and underlying tape system.
Prereqs
- Access to the dCache admin instance
- Access to tape pool instance (tarpool)
- File set(s)?
- Machine to transfer from? Or transfer locally on node?
- Access to ENDIT configuration file
- Permission/script to stop/start ENDIT
Tape write without network transfer traffic
- Stop ENDIT
- Write data, amount just below endit.conf archiver_threshold2_usage
- Start ENDIT
- Monitor tape transfer performance (the network graph is easiest) and record the performance, number of tape mountpoints used, etc.
Repeat for each ENDIT archiver threshold configured.
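For the "write data just below the threshold" step, it helps to work out in advance how many test files to write. The sketch below is only an illustration: it assumes the threshold value and the file size are given in the same unit (GiB here, which is an assumption about how the value in endit.conf is expressed), and the function name and the safety margin are hypothetical.

```python
def files_below_threshold(threshold_gib, file_size_gib, margin=0.05):
    """Number of equally sized files whose total size stays below the threshold.

    margin is the fraction of the threshold to leave unused (assumed value).
    """
    if threshold_gib <= 0 or file_size_gib <= 0:
        raise ValueError("threshold and file size must be positive")
    if not 0 <= margin < 1:
        raise ValueError("margin must be in the range [0, 1)")
    target_gib = threshold_gib * (1 - margin)
    count = int(target_gib // file_size_gib)
    # Stay strictly below the threshold, even when margin is 0.
    while count > 0 and count * file_size_gib >= threshold_gib:
        count -= 1
    return count
```

For example, with a hypothetical threshold of 1000 GiB and 7 GiB files, a 10 % margin gives 128 files (896 GiB in total).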
Tape write with network transfer traffic
Repeat the above tests, but once the tape transfer has started and reached steady state, start an incoming transfer and monitor the performance impact. An experienced tester can record the bandwidths for both idle-network and active-network transfers during this test.
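The impact of network traffic can be recorded as a simple relative drop in tape bandwidth. A minimal sketch, assuming you note the bytes moved and the elapsed seconds yourself (for example from the monitoring graphs); the function names are hypothetical and MB means 10^6 bytes.

```python
def throughput_mb_per_s(bytes_moved, seconds):
    """Average throughput in MB/s (1 MB = 10**6 bytes)."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return bytes_moved / seconds / 1e6


def relative_drop(idle_rate, loaded_rate):
    """Fraction of tape bandwidth lost while network transfers are active."""
    if idle_rate <= 0:
        raise ValueError("idle_rate must be positive")
    return (idle_rate - loaded_rate) / idle_rate
```

For instance, an illustrative 400 MB/s with an idle network and 300 MB/s with active transfers corresponds to relative_drop(400, 300) == 0.25, i.e. a 25 % drop. These figures are made up for the example, not measurements.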
Tape read without network transfer traffic
Tape read with network transfer traffic
Endurance tests
This test is the real shakedown: can the system cope with sustained high load, or are there shortcomings that only show up after a while (e.g. the site TSM system becoming swamped/overloaded and thus affecting non-WLCG operations)?