NDGF dCache tape pool validation

Page status

THIS PAGE IS WORK IN PROGRESS AND A BASE FOR DISCUSSION

Introduction, purpose and goal

Validating that a dCache tape pool is production ready is hard. In addition to validating that the tape pool is correctly installed, the server and the entire tape environment need to perform as required. Experience has shown that the only way to verify everything is to run tests that resemble production load.

This page aims to document the procedure and commands needed to assess the function and performance of a tape pool.

Methodology

While automation would be helpful, the current process is fully manual. The benefit is that the tester learns all the bits and pieces involved.

Logging

Since we need to touch multiple systems and run long-running operations, it is hard to keep track of everything. Keep a simple date/time-stamped log. It doesn't have to be advanced. An example:

2019-04-01 12:00 Functional write test performed, 12 cat pictures successfully stored on tape
2019-04-01 13:00 Functional read test performed, 12 cat pictures read from tape, matches original data
2019-04-01 14:00 Performance write test started, 900 GB, network performance approx 900 MB/s
2019-04-01 14:20 Performance write test finished, 900 GB, tape migration speed approx 360 MB/s with small dips
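
Entries in this format can also be produced by a small helper. The following Python sketch is only an illustration: the function names log_line and append_log, and whatever log file path is passed in, are assumptions and not part of any existing NDGF tooling.

```python
from datetime import datetime


def log_line(message, when=None):
    """Return one log entry formatted as 'YYYY-MM-DD HH:MM message'."""
    when = when or datetime.now()
    return f"{when:%Y-%m-%d %H:%M} {message}"


def append_log(path, message, when=None):
    """Append one entry to the log file at path (hypothetical location)."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(log_line(message, when) + "\n")
```

Calling append_log("validation.log", "Functional write test performed") adds one line stamped with the current local time; a plain text editor works just as well.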

Functional tests

The purpose of these tests is to verify basic functionality. This catches basic setup issues like file permissions, TSM client misconfiguration, etc.

Write

Migrate a single tape file from another site (a cached copy in the read or write pool) with -tmode=precious and let it flush to tape. After the flush has completed, restore the file (rh restore) and verify that the read pool can read the stored file.

Read

Read a file with "rh restore". If it works, you can also choose 1k files and repeat.
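
To confirm that data read back from tape matches the original, the source file and the retrieved copy can be compared by checksum. This is a minimal sketch that assumes both files are available as local paths on the test machine; the function names are illustrative, and the comparison is independent of any checksums dCache keeps itself.

```python
import hashlib


def file_digest(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read chunk by chunk so large test files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_match(original_path, restored_path):
    """Return True if both files have identical content according to SHA-256."""
    return file_digest(original_path) == file_digest(restored_path)
```

For a larger file set, the same check can be run in a loop over pairs of original and restored paths, logging any mismatch.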

Delete

Performance tests

These tests measure the performance, meaning data transfer bandwidth, of the tape pool and the underlying tape system.

Prereqs

  • Access to the dCache admin instance
  • Access to tape pool instance (tarpool)
  • File set(s)?
  • Machine to transfer from? Or transfer locally on node?
  • Access to ENDIT configuration file
  • Permission/script to stop/start ENDIT

Tape write without network transfer traffic

  • Stop ENDIT
  • Write data to the pool, with a total amount just below archiver_threshold2_usage in endit.conf
  • Start ENDIT
  • Monitor tape transfer performance (the network graph is easiest) and record the performance, the number of tape mount points used, etc.

Repeat for each ENDIT archiver threshold configured.
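
Choosing how much data to write for each threshold is simple arithmetic. The sketch below shows one way to pick a file count that stays just below a threshold. It assumes the threshold and the file size are given in the same unit, and the function name and the default 5 % safety margin are illustrative choices, not ENDIT settings.

```python
def plan_file_count(threshold, file_size, margin_percent=5):
    """Return how many equally sized files keep the total below the threshold.

    threshold and file_size must use the same unit (for example GiB).
    margin_percent is the safety margin kept below the threshold.
    """
    if file_size <= 0:
        raise ValueError("file_size must be positive")
    if not 0 < margin_percent < 100:
        raise ValueError("margin_percent must be between 0 and 100")
    # Usable amount after subtracting the safety margin from the threshold.
    usable = threshold * (100 - margin_percent) / 100
    return int(usable // file_size)
```

As a made-up example, a threshold of 1000 with files of size 10 gives 95 files, for a total of 950, which is below the threshold.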

Tape write with network transfer traffic

Repeat the above tests, but once the tape transfer has started and reached steady state, start an incoming network transfer and monitor the performance impact. An experienced tester can record the tape bandwidth both with the network idle and with network transfers active during the same test run.
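
The bandwidths recorded with and without network traffic can be turned into comparable numbers with a short calculation. This is a sketch under the assumption that MB means 10^6 bytes, matching the MB/s figures in the log example; the function names are hypothetical.

```python
def rate_mb_per_s(bytes_moved, seconds):
    """Average transfer rate in MB/s (1 MB = 10**6 bytes)."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return bytes_moved / seconds / 1e6


def impact_percent(idle_rate, loaded_rate):
    """Relative drop in tape bandwidth when network transfers are active."""
    if idle_rate <= 0:
        raise ValueError("idle_rate must be positive")
    return (idle_rate - loaded_rate) / idle_rate * 100
```

With made-up numbers, 900 GB migrated in 2500 seconds corresponds to 360 MB/s, and a drop from 360 MB/s to 270 MB/s under network load is a 25 % impact.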

Tape read without network transfer traffic

Tape read with network transfer traffic

Endurance tests

This test is the real shakedown: can the system cope with sustained high load, or are there shortcomings that only show up after a while (e.g. the site TSM system becoming swamped or overloaded and thus affecting non-WLCG operations)?
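
During a long run it helps to flag the moments when throughput drops, so they can be correlated with events on the tape system afterwards. The following sketch assumes the tester has collected periodic rate samples (for example read off the network graph); the function name and the 80 % cut-off are arbitrary illustrative choices.

```python
from statistics import median


def find_dips(rates, fraction=0.8):
    """Return indexes of samples that fall below fraction * median rate."""
    if not rates:
        return []
    cutoff = fraction * median(rates)
    return [index for index, rate in enumerate(rates) if rate < cutoff]
```

The median is used as the baseline so that a few dips do not pull the reference level down the way an average would.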

Final checks