TSM Administration Hints

This page aims to collect useful hints for site operators of tape systems using the IBM product Storage Protect (formerly known as Spectrum Protect, Tivoli Storage Manager (TSM), ADSM).

For instructions on how to set up dCache tape pools within NDGF, see DCache TSM interface.

Handling tapes with damaged files

Tape volumes with damaged files (objects) need to be handled promptly. Files are marked as damaged when a read attempt fails. These errors are often spurious, commonly caused by a tape drive that needs cleaning but has been kept busy by a long-running operation, or simply by a damaged drive causing tape read issues.

Typical symptoms of a damaged tape are files to restore that persist on the dashboard, and messages such as ANS4035W File '/path/0000C2XYZ' currently unavailable on server in the tsmretriever.log.

To handle tapes with damaged files the TSM administrator needs to perform the following in the administrative interface (dsmadmc):

  • Identify affected tapes
    • The fastest method is to use the undocumented command SHOW DAMAGED STORAGEPOOLNAME
    • The alternative is to do QUERY CONTENT TAPENAME DAMAGED=YES for each tape in the storage pool.
  • Audit the affected tapes
    • NOTE: It's important to run the audit command with the FIX=NO option. The "fix" deletes damaged objects which is a bad idea for a spurious issue.
    • Run the audit command as: AUDIT VOLUME TAPENAME FIX=NO
    • Note the process number and wait for the process to finish. This can take quite a while for large tapes.
    • Check the result with QUERY ACTLOG BEGINDATE=-2 SEARCH="PROCESS: YOUR-PROCESS-NUMBER" or QUERY ACTLOG BEGINDATE=-2 SEARCH=ANR4133I
    • If the audit failed, retry at least once, since the tape may read fine in another tape drive.
    • If the audit can't be made to succeed, report the damaged files as lost to the NDGF OoD, either using the list logged by the audit command or by listing the files with the QUERY CONTENT command above.
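
The audit steps above can be turned into a small helper that prints the commands to paste into dsmadmc. This is only a sketch, not part of the procedure itself: the function names and the example volume name are hypothetical, and it emits nothing beyond the commands already listed above. It does not talk to the server.

```python
def audit_command(tape: str) -> str:
    """Return the non-destructive audit command for one tape volume."""
    # FIX=NO is hard-coded on purpose: the "fix" would delete damaged objects.
    return f"AUDIT VOLUME {tape} FIX=NO"


def actlog_query(process_number: int, days_back: int = 2) -> str:
    """Return the activity log query for the result of one audit process."""
    return f'QUERY ACTLOG BEGINDATE=-{days_back} SEARCH="PROCESS: {process_number}"'


def audit_plan(tapes: list[str]) -> list[str]:
    """Return one audit command per tape, skipping blanks and duplicates."""
    seen = set()
    plan = []
    for tape in tapes:
        name = tape.strip()
        if name and name not in seen:
            seen.add(name)
            plan.append(audit_command(name))
    return plan


if __name__ == "__main__":
    # TAPE01 and process number 123 are made-up values for illustration.
    for line in audit_plan(["TAPE01"]):
        print(line)
    print(actlog_query(123))
```

Generating the commands rather than typing them makes it harder to leave out FIX=NO by accident when several tapes are affected.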

Handling unavailable tapes

TSM flags a volume as unavailable (ACCESS=UNAVAILABLE) when a volume mount fails. This prevents TSM from wasting time retrying broken volumes.

However, a mount quite often fails because of a temporary path or tape drive issue.

Tape-specific checks

Sometimes, when tape drives act up, volumes are not unmounted correctly, and the tape library then has to move the tape out of the way by putting it into an unused SCSI element address. Since TSM addresses tapes and drives by their SCSI element number, these tapes end up misplaced from TSM's point of view.

To view where TSM thinks the tape should be located, do a query libvolume and read the Home Element column:

 q libv YOURLIBNAME YOURVOLNAME

IBM TS4500

  • Navigate the TS4500 web interface to the Cartridges view.
  • Enter the beginning of the name of the tape in question in the Filter entry at the top of the list, for example HN0720
  • Verify that the Element Address matches the TSM point of view. If not shown, click the checkbox icon top-right of the list and mark Element Address to be shown.

Since the TS4500 has no functionality to change/move an element address for a tape, a SCSI element mismatch is cumbersome to fix: an audit library must be run in TSM to correct it.

NOTE: The library must be idle, with no activity occurring during the audit, or the inventory can end up incomplete! If your setup has scheduled administrative tasks, your safest bet is to start the server in maintenance mode. See the IBM TSM documentation for details about maintenance mode and running the AUDIT LIBRARY command. A typical invocation is along the lines of AUDIT LIBRARY YOURLIBNAME CHECKLABEL=BARCODE.

Check the actlog for any information/error messages, for example with something like:

 QUERY ACTLOG SEARCH="PROCESS: process-number"

After the audit process has completed, recheck the SCSI element address(es) of the volume(s) to verify that the audit fixed the problem.

IBM TS3500

  • Navigate the TS3500 web interface to view information about the affected volume.
  • Compare the volume's SCSI Element Address in the TS3500 to what TSM thinks it should be.
  • If they don't match, do a TS3500 Move operation to change the element address to match what's registered in TSM.
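
For either library model, the check comes down to comparing two numbers per volume: the Home Element reported by TSM and the Element Address shown by the library. When several volumes are involved, the comparison can be sketched as below. This is an illustration only: the function name is hypothetical, both mappings are assumed to have been collected by hand from the q libv output and the library web interface, and the volume names and element numbers are made up.

```python
def find_misplaced(
    tsm_home: dict[str, int], library_element: dict[str, int]
) -> dict[str, tuple[int, int]]:
    """Return {volume: (tsm_home_element, library_element_address)} for mismatches."""
    misplaced = {}
    for volume, expected in tsm_home.items():
        actual = library_element.get(volume)
        # Volumes missing from the library listing are skipped here;
        # they need a manual look rather than an automatic verdict.
        if actual is not None and actual != expected:
            misplaced[volume] = (expected, actual)
    return misplaced


if __name__ == "__main__":
    # Made-up example data: VOL002 sits in a different element than TSM expects.
    tsm = {"VOL001": 1025, "VOL002": 1030}
    lib = {"VOL001": 1025, "VOL002": 1047}
    for vol, (home, actual) in find_misplaced(tsm, lib).items():
        print(f"{vol}: TSM home element {home}, library reports {actual}")
```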

Generic checks

List unavailable volumes with:

 Q VOL ACC=UNAVAIL

Investigate the cause of the mount failure in the actlog. Start with a rough search to narrow down the date/time, then review the entire actlog around that time.

 Q ACTLOG BEGIND=-1 SEARCH=YOURVOLNAME
 Q ACTLOG BEGIND=-1 BEGINT=01:23
 ...

If you suspect that the volume is in fact OK (this is usually the case when you see multiple volumes failing in the same tape drive), set the access state to read-only:

 UPDATE VOL YOURVOLNAME ACC=READONLY

Verify the volume as described in Handling tapes with damaged files.
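
The reasoning above, that several volumes failing in the same drive points at the drive rather than the tapes, can be sketched as follows. This is a rough illustration under stated assumptions: the (volume, drive) pairs are assumed to have been noted by hand while reviewing the actlog, the function names and the threshold of two volumes are hypothetical choices, and the volume and drive names are made up.

```python
from collections import defaultdict


def suspect_drives(
    failures: list[tuple[str, str]], threshold: int = 2
) -> dict[str, list[str]]:
    """Map each drive with at least `threshold` distinct failed volumes to those volumes."""
    by_drive = defaultdict(set)
    for volume, drive in failures:
        by_drive[drive].add(volume)
    return {
        drive: sorted(volumes)
        for drive, volumes in by_drive.items()
        if len(volumes) >= threshold
    }


def readonly_commands(volumes: list[str]) -> list[str]:
    """Return the command that sets each volume back to read-only access."""
    return [f"UPDATE VOL {volume} ACC=READONLY" for volume in volumes]


if __name__ == "__main__":
    # Made-up observations: two volumes failed in DRIVE3, one in DRIVE1.
    noted = [("VOL001", "DRIVE3"), ("VOL002", "DRIVE3"), ("VOL003", "DRIVE1")]
    for drive, volumes in suspect_drives(noted).items():
        print(f"Suspect drive {drive}:")
        for command in readonly_commands(volumes):
            print(" ", command)
```

A volume that failed alone in an otherwise healthy drive is deliberately left out of the output, since it is more likely to be a genuinely bad tape.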

Deleting a broken tape

If a tape is completely broken, you will end up needing to delete it. Before you do, try to rescue any readable files and grab a list of the files that will be lost.

  • Move readable data
    • Flag tape as READONLY if it's UNAVAILABLE.
    • Try moving data to another tape in the same storage pool: MOVE DATA YOURVOLNAME
  • Grab a list of any files remaining on the tape:
    • QUERY CONTENT YOURVOLNAME > /tmp/YOURVOLNAME.txt
  • Contact the NDGF OoD and report that the files in the list are lost.
  • Forcibly delete the tape
    • DELETE VOLUME YOURVOLNAME DISCARDDATA=YES
  • Check out the broken tape volume from the library
    • To drop it into the I/O slot without mounting it: CHECKOUT LIBVOLUME YOURLIBNAME YOURVOLNAME REMOVE=BULK CHECKLABEL=NO
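
Because the order of these steps matters (rescue first, delete last), it can help to generate the whole sequence for a given tape up front. The sketch below does only that: it builds the list of commands from the steps above and prints them for review, without executing anything. The function name, its parameters, and the example library and volume names are hypothetical. The manual step of reporting the lost files to the NDGF OoD sits between the QUERY CONTENT and DELETE VOLUME commands and is not represented in the output.

```python
def removal_plan(
    library: str, volume: str, list_file: str, unavailable: bool = False
) -> list[str]:
    """Return the ordered commands for retiring a broken tape volume."""
    steps = []
    if unavailable:
        # An UNAVAILABLE volume must be readable before data can be moved off it.
        steps.append(f"UPDATE VOL {volume} ACC=READONLY")
    # Rescue whatever is still readable to another tape in the same storage pool.
    steps.append(f"MOVE DATA {volume}")
    # Save the list of files that remain, and are therefore lost, to a file.
    steps.append(f"QUERY CONTENT {volume} > {list_file}")
    # Destructive: only run this after the lost files have been reported.
    steps.append(f"DELETE VOLUME {volume} DISCARDDATA=YES")
    # Eject the cartridge to the I/O slot without mounting it.
    steps.append(f"CHECKOUT LIBVOLUME {library} {volume} REMOVE=BULK CHECKLABEL=NO")
    return steps


if __name__ == "__main__":
    # LIB1 and VOL001 are made-up names for illustration.
    for command in removal_plan("LIB1", "VOL001", "/tmp/VOL001.txt", unavailable=True):
        print(command)
```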

Server upgrades

No downtime is needed for tape server/system interventions that are finished within a normal work day. Longer interventions require a downtime as per Site Admin Operations basics.