DCache Useful Stuff
dCache admin
Hints: Tab expansion usually works. The help command is context sensitive, for example \sn help and \sp help yields different output.
Set pools RO and RW
\c PoolManager psu set pool POOLNAME rdonly psu set pool POOLNAME notrdonly
Drain a door
cd LoginBroker
disable <door>
Where <door> is the cell name
FIXME: This needs to be fixed. The LoginBroker is gone and you should probably stop publishing in the door instead.
Converting pnfsid to filename and back
\c PnfsManager pathfinder PNFSID pnfsidof FILENAME storageinfoof FILENAME|PNFSID
OR
\sn pathfinder PNFSID \sn pnfsidof FILENAME \sn storageinfoof FILENAME|PNFSID
File replica information
Where is a file stored (ie. replica location):
\c PnfsManager cacheinfoof FILENAME|PNFSID
OR
\sn cacheinfoof FILENAME|PNFSID
Change pool size
\c YOUR_POOLNAME set max diskspace 18500G save
Restart a domain
\c System@YOUR_Domain kill System
List files on a pool through the dCache interface
Log in to dCache and run:
\c POOLNAME rep ls
List stuck files
Sometimes a file request gets stuck for different reasons. List them with:
\c PoolManager rc ls
Use retry to release them:
rc retry PNFSID
Upgrade cached files to sticky
migration copy -tmode=cached+system -id=rescue2 -target=pool -- uppmax_uu_se_16
Watch p2p transfer rate
\s YOUR_POOLNAME p2p ls
Database
When doing read-only queries, do them on the secondary if you have a HA setup. This avoids unintended slowdown of primary DB if you make mistakes.
AVOID writing/changing the database when possible, it's error prone and might be really hard to correct if a big mistake is made.
FIXME: List relevant databases
psql - preparation
- Log in to the appropriate host (secondary for read-only queries).
- Switch to the postgres user:
su - postgres - Start psql on the appropriate database, example for chimera (namespace):
psql chimera
psql - output to file
Copy output of SQL statement to file (TAB separated):
\copy (SQL statement) to /tmp/myfile.txt
Typical recipe for mangling a file to dCache admin commands using bash: FIXME: Either set separator to TAB, or point out that avoiding columns with white spaces (ie path/file names) is a good idea.
while read pnfsid badlocation location path; do echo "\s $badlocation rep rm $pnfsid"; done < /tmp/badlocation-list.txt > /tmp/badlocation-remove.dcache
Then use that file directly as input to the dCache admin interface:
ssh -p 22224 admin@dcache.example.com < /tmp/badlocation-remove.dcache
dCache - creating lists from pgroups
When dealing with cleaning up after broken pools or migration job mistakes, it's convenient to use the pgroup definitions from dCache instead of manually entering/matching pool names.
This is one way of doing this, there might be others that are better :-)
To create a regular list, for use with shell scripting etc:
ssh -p 22224 admin@dcache.example.com '\sp psu ls pgroup -l default'|awk '/pgroups=[0-9]/ {print $1}'
To create an IN(...)-list for use with postgres/SQL:
ssh -p 22224 admin@dcache.example.com '\sp psu ls pgroup -l default' | perl -e 'print "IN('"'"'"; @l = map({/^\s*(\S+).*pgroups=\d/} <>); print join("'"', '"'", @l), "'"'"')\n"'
When dealing with multiple groups, it's hard to keep track of the generated lists. Name them carefully so you know which pgroup it's generated from.
Also, remember that pgroup:s can change. A list you made yesterday might well be obsolete if you have colleagues doing actual work :-)
List files on a failed pool
See also #List files from a broken pool that exists on other pools
To list ALL files on the failed pool hpc2n_umu_se_sXX, in psql chimera:
SELECT count(t_locationinfo) FROM t_locationinfo WHERE ilocation='hpc2n_umu_se_sXX'; SELECT inumber2pnfsid(inumber) AS PnfsId, inumber2path(inumber) AS Path FROM t_locationinfo WHERE ilocation='hpc2n_umu_se_sXX';
To ONLY list files on the failed pool hpc2n_umu_se_sXX that have copies on other pools, in psql chimera:
SELECT li.ilocation,count(*) FROM t_locationinfo AS li, (select inumber,ilocation FROM t_locationinfo WHERE ilocation='hpc2n_umu_se_sXX') AS sub WHERE li.inumber=sub.inumber AND li.ilocation!=sub.ilocation GROUP BY li.ilocation ORDER BY li.ilocation; SELECT li.ilocation AS Location, inumber2pnfsid(li.inumber) AS PnfsId, inumber2path(li.inumber) AS Path FROM t_locationinfo AS li, (select inumber,ilocation FROM t_locationinfo WHERE ilocation='hpc2n_umu_se_sXX') AS sub WHERE li.inumber=sub.inumber AND li.ilocation!=sub.ilocation ORDER BY li.ilocation;
To ONLY list files WITHOUT copies on other pools, in psql chimera:
SELECT COUNT(*) FROM t_locationinfo li
WHERE li.ilocation='hpc2n_umu_se_sXX' AND
NOT EXISTS (SELECT FROM t_locationinfo sli WHERE li.inumber=sli.inumber AND li.ilocation!=sli.ilocation);
SELECT inumber2pnfsid(li.inumber) AS PnfsId, inumber2path(li.inumber) AS Path
FROM t_locationinfo li
WHERE li.ilocation='hpc2n_umu_se_sXX' AND
NOT EXISTS (SELECT FROM t_locationinfo sli WHERE li.inumber=sli.inumber AND li.ilocation!=sli.ilocation);
For single entries, removal of stale locations can be done manually in the dCache admin interface:
\c PnfsManager clear file cache location 000073CDC2BB18E042718F1CF3B728FFF167 hpc2n_umu_se_sXX
OR
\sn clear file cache location 000073CDC2BB18E042718F1CF3B728FFF167 hpc2n_umu_se_sXX
For multiple entries, see #psql - output to file on how to output the SQL query to file and mangle it to use with the admin interface.
Suggested workflow is:
- Save the lists of affected files.
- First handle the files with accessible replicas:
- Remove stale locations.
- Re-replicate files.
- Redo queries and save lists of files left, if you've done the previous step correctly you should only have files that only existed on the dead pool left.
- Remove stale locations
- Tell the relevant users which files were lost.
List files with locations on non-existant pools
Start with creating an IN(...) with the pools in the default pgroup, see #dCache - creating lists from pgroups. It will be a hefty chunk to cut&paste into the SQL snippets below, replacing the IN(...) section.
To get a summary of locations on non-existant pools:
SELECT count(inumber) AS Count,ilocation FROM t_locationinfo WHERE ilocation NOT IN('existingpool1','...','existingpoolN') GROUP BY ilocation ORDER BY ilocation;
List files by PnfsID/Path with locations on non-existant pools:
SELECT inumber2pnfsid(inumber) AS PnfsId,inumber2path(inumber) AS Path,ilocation FROM t_locationinfo WHERE ilocation NOT IN('existingpool1','...','existingpoolN') ORDER BY ilocation,inumber;
Check for non-zero sized files without location
Summary:
SELECT COUNT(*) FROM t_inodes ti WHERE ti.isize>0 AND ti.itype=32768 AND NOT EXISTS (SELECT FROM t_locationinfo tl WHERE ti.inumber=tl.inumber);
List:
SELECT ipnfsid, inumber2path(inumber) as Path FROM t_inodes ti WHERE ti.isize>0 AND ti.itype=32768 AND NOT EXISTS (SELECT FROM t_locationinfo tl WHERE ti.inumber=tl.inumber);
List/clear stale trash table entries
FIXME: This entire section needs to be revisited since the cleaner cell is gone after that code got refactored.
FIXME: Use the dCache admin commands instead, from the dcache mailing list:
The cleaner service has its own database. In the admin interface you can instruct a cleaner to forget a pool: \s cleaner forget pool <pool name> Cleaner should also blacklist such pools and not try to delete from them as often as ones it is able to contact. You can query which pools are currently blacklisted by running: \s cleaner ls blacklist
The trash table contains all files that have been deleted in the namespace but hasn't been deleted on the actual pools. If a pool is destroyed (or decommissioned the wrong way) there can be leftovers.
Start with creating an IN(...) with the pools in the default pgroup, see #dCache - creating lists from pgroups. It will be a hefty chunk to cut&paste into the SQL snippets below, replacing the IN(...) section.
To get a summary of stale trash table entries:
SELECT count(ipnfsid),ilocation FROM t_locationinfo_trash WHERE ilocation!='' AND ilocation NOT IN('existingpool1','...','existingpoolN') GROUP BY ilocation ORDER BY ilocation;
To clear stale trash table entries (needs to be done on the primary/read-write postgres host):
BEGIN;
DELETE FROM t_locationinfo_trash where ilocation!='' AND ilocation NOT IN('existingpool1','...','existingpoolN');
# (check that the expected number of rows were affected)
COMMIT; # If OK
ROLLBACK; # If not OK
TODO
Or rather, more like FIXME ;)
Make a cache copy on another site
FIXME: This depends on a NDGF util script?
./bin/dcache_pool "^somesite_0(0[5,6,9]|1[0-4])" -id=migtest migration cache -exclude=somesite -target=pgroup atlas_disk
Some misc SQL commands
FIXME: Document what this is and why it's useful.
$ time psql chimera -c "ALTER TABLE t_access_latency ALTER COLUMN ipnfsid TYPE char(36);"