DCache Useful Stuff

From neicext
Jump to navigation Jump to search

dCache admin

Hints: Tab expansion usually works. The help command is context sensitive, for example \sn help and \sp help yields different output.

Set pools RO and RW

\c PoolManager
psu set pool POOLNAME rdonly
psu set pool POOLNAME notrdonly

Drain a door

cd LoginBroker

disable <door>

Where <door> is the cell name

FIXME: This needs to be fixed. The LoginBroker is gone and you should probably stop publishing in the door instead.

Converting pnfsid to filename and back

\c PnfsManager
pathfinder PNFSID
pnfsidof FILENAME
storageinfoof FILENAME|PNFSID

OR

\sn pathfinder PNFSID
\sn pnfsidof FILENAME
\sn storageinfoof FILENAME|PNFSID

File replica information

Where is a file stored (ie. replica location):

\c PnfsManager
cacheinfoof FILENAME|PNFSID

OR

\sn cacheinfoof FILENAME|PNFSID

Change pool size

\c YOUR_POOLNAME
set max diskspace 18500G
save

Restart a domain

\c System@YOUR_Domain
kill System

List files on a pool through the dCache interface

Log in to dCache and run:

\c POOLNAME
rep ls

List stuck files

Sometimes a file request gets stuck for different reasons. List them with:

\c PoolManager
rc ls

Use retry to release them:

rc retry PNFSID

Upgrade cached files to sticky

migration copy -tmode=cached+system -id=rescue2 -target=pool -- uppmax_uu_se_16

Watch p2p transfer rate

\s YOUR_POOLNAME p2p ls

Database

When doing read-only queries, do them on the secondary if you have a HA setup. This avoids unintended slowdown of primary DB if you make mistakes.

AVOID writing/changing the database when possible, it's error prone and might be really hard to correct if a big mistake is made.

FIXME: List relevant databases

psql - preparation

  • Log in to the appropriate host (secondary for read-only queries).
  • Switch to the postgres user: su - postgres
  • Start psql on the appropriate database, example for chimera (namespace): psql chimera

psql - output to file

Copy output of SQL statement to file (TAB separated):

\copy (SQL statement) to /tmp/myfile.txt

Typical recipe for mangling a file to dCache admin commands using bash: FIXME: Either set separator to TAB, or point out that avoiding columns with white spaces (ie path/file names) is a good idea.

while read pnfsid badlocation location path; do echo "\s $badlocation rep rm $pnfsid"; done < /tmp/badlocation-list.txt > /tmp/badlocation-remove.dcache

Then use that file directly as input to the dCache admin interface:

ssh -p 22224 admin@dcache.example.com < /tmp/badlocation-remove.dcache

dCache - creating lists from pgroups

When dealing with cleaning up after broken pools or migration job mistakes, it's convenient to use the pgroup definitions from dCache instead of manually entering/matching pool names.

This is one way of doing this, there might be others that are better :-)

To create a regular list, for use with shell scripting etc:

ssh -p 22224 admin@dcache.example.com '\sp psu ls pgroup -l default'|awk '/pgroups=[0-9]/ {print $1}'

To create an IN(...)-list for use with postgres/SQL:

ssh -p 22224 admin@dcache.example.com '\sp psu ls pgroup -l default' | perl -e 'print "IN('"'"'"; @l = map({/^\s*(\S+).*pgroups=\d/} <>); print join("'"', '"'", @l), "'"'"')\n"'

When dealing with multiple groups, it's hard to keep track of the generated lists. Name them carefully so you know which pgroup it's generated from.

Also, remember that pgroup:s can change. A list you made yesterday might well be obsolete if you have colleagues doing actual work :-)

List files on a failed pool

See also #List files from a broken pool that exists on other pools

To list ALL files on the failed pool hpc2n_umu_se_sXX, in psql chimera:

SELECT count(t_locationinfo) FROM t_locationinfo WHERE ilocation='hpc2n_umu_se_sXX';
SELECT inumber2pnfsid(inumber) AS PnfsId, inumber2path(inumber) AS Path FROM t_locationinfo WHERE ilocation='hpc2n_umu_se_sXX';

To ONLY list files on the failed pool hpc2n_umu_se_sXX that have copies on other pools, in psql chimera:

SELECT li.ilocation,count(*) FROM t_locationinfo AS li, 
  (select inumber,ilocation FROM t_locationinfo WHERE ilocation='hpc2n_umu_se_sXX') AS sub 
  WHERE li.inumber=sub.inumber AND li.ilocation!=sub.ilocation GROUP BY li.ilocation ORDER BY li.ilocation;

SELECT li.ilocation AS Location, inumber2pnfsid(li.inumber) AS PnfsId, inumber2path(li.inumber) AS Path 
  FROM t_locationinfo AS li, 
  (select inumber,ilocation FROM t_locationinfo WHERE ilocation='hpc2n_umu_se_sXX') AS sub 
  WHERE li.inumber=sub.inumber AND li.ilocation!=sub.ilocation ORDER BY li.ilocation;

To ONLY list files WITHOUT copies on other pools, in psql chimera:

SELECT COUNT(*) FROM t_locationinfo li
  WHERE li.ilocation='hpc2n_umu_se_sXX' AND 
    NOT EXISTS (SELECT FROM t_locationinfo sli WHERE li.inumber=sli.inumber AND li.ilocation!=sli.ilocation);

SELECT inumber2pnfsid(li.inumber) AS PnfsId, inumber2path(li.inumber) AS Path
  FROM t_locationinfo li
  WHERE li.ilocation='hpc2n_umu_se_sXX' AND
    NOT EXISTS (SELECT FROM t_locationinfo sli WHERE li.inumber=sli.inumber AND li.ilocation!=sli.ilocation);


For single entries, removal of stale locations can be done manually in the dCache admin interface:

\c PnfsManager
clear file cache location 000073CDC2BB18E042718F1CF3B728FFF167 hpc2n_umu_se_sXX

OR

\sn clear file cache location 000073CDC2BB18E042718F1CF3B728FFF167 hpc2n_umu_se_sXX

For multiple entries, see #psql - output to file on how to output the SQL query to file and mangle it to use with the admin interface.

Suggested workflow is:

  • Save the lists of affected files.
  • First handle the files with accessible replicas:
    • Remove stale locations.
    • Re-replicate files.
  • Redo queries and save lists of files left, if you've done the previous step correctly you should only have files that only existed on the dead pool left.
    • Remove stale locations
    • Tell the relevant users which files were lost.

List files with locations on non-existant pools

Start with creating an IN(...) with the pools in the default pgroup, see #dCache - creating lists from pgroups. It will be a hefty chunk to cut&paste into the SQL snippets below, replacing the IN(...) section.

To get a summary of locations on non-existant pools:

SELECT count(inumber) AS Count,ilocation FROM t_locationinfo WHERE ilocation NOT IN('existingpool1','...','existingpoolN') GROUP BY ilocation ORDER BY ilocation;

List files by PnfsID/Path with locations on non-existant pools:

SELECT inumber2pnfsid(inumber) AS PnfsId,inumber2path(inumber) AS Path,ilocation FROM t_locationinfo WHERE ilocation NOT IN('existingpool1','...','existingpoolN') ORDER BY ilocation,inumber;

Check for non-zero sized files without location

Summary:

SELECT COUNT(*) FROM t_inodes ti WHERE ti.isize>0 AND ti.itype=32768 AND NOT EXISTS (SELECT FROM t_locationinfo tl WHERE ti.inumber=tl.inumber);

List:

SELECT ipnfsid, inumber2path(inumber) as Path FROM t_inodes ti WHERE ti.isize>0 AND ti.itype=32768 AND NOT EXISTS (SELECT FROM t_locationinfo tl WHERE ti.inumber=tl.inumber);

List/clear stale trash table entries

FIXME: This entire section needs to be revisited since the cleaner cell is gone after that code got refactored.

FIXME: Use the dCache admin commands instead, from the dcache mailing list:

The cleaner service has its own database. 
In the admin interface you can instruct a cleaner to forget a pool: 
  
\s cleaner forget pool <pool name>
 
Cleaner should also blacklist such pools and not try to delete from them as
often as ones it is able to contact. You can query which pools are currently
blacklisted by running: 
  
\s cleaner ls blacklist 

The trash table contains all files that have been deleted in the namespace but hasn't been deleted on the actual pools. If a pool is destroyed (or decommissioned the wrong way) there can be leftovers.

Start with creating an IN(...) with the pools in the default pgroup, see #dCache - creating lists from pgroups. It will be a hefty chunk to cut&paste into the SQL snippets below, replacing the IN(...) section.

To get a summary of stale trash table entries:

SELECT count(ipnfsid),ilocation FROM t_locationinfo_trash WHERE ilocation!='' AND ilocation NOT IN('existingpool1','...','existingpoolN') GROUP BY ilocation ORDER BY ilocation;

To clear stale trash table entries (needs to be done on the primary/read-write postgres host):

 BEGIN;
 DELETE FROM t_locationinfo_trash where ilocation!='' AND ilocation NOT IN('existingpool1','...','existingpoolN');
 # (check that the expected number of rows were affected)
 COMMIT; # If OK
 ROLLBACK; # If not OK
 

TODO

Or rather, more like FIXME ;)

Make a cache copy on another site

FIXME: This depends on a NDGF util script?

./bin/dcache_pool "^somesite_0(0[5,6,9]|1[0-4])" -id=migtest migration cache -exclude=somesite -target=pgroup atlas_disk

Some misc SQL commands

FIXME: Document what this is and why it's useful.

  $ time psql chimera -c "ALTER TABLE t_access_latency ALTER COLUMN ipnfsid TYPE char(36);"