Running ARC CEs for NeIC Tier 1
Preparing new services/hardware
When you have a new service or hardware ready for production, it is important that NDGF is informed. It might be necessary to run performance tests or add information to GOCDB, and central monitoring will also have to be prepared for the service. To ensure that the procedure is tracked, the site admin should open a ticket for NDGF:
* Submit a ticket through GGUS (You must have a valid certificate)
* Add your email if it was not automatically parsed from your certificate
* Describe the request in subject and description
* Set the "Concerned VO" to: ops.ndgf.org
* Set the "Ticket category" to: Change Request
* Set the "Affected ROC/NGI" to: NGI_NDGF
* In the Routing Information section please set "Assign to Support unit" to: NGI_NDGF
* All the other boxes can be left as is
* Submit the ticket
If you encounter any problems, contact the NDGF operator on duty (OoD), either through the mailing list or on the chat.
Requirements for Monitoring
The T1 CEs are monitored by the EGI SAM infrastructure through https://ngi-sam.ndgf.org/nagios/, as well as our local Nagios at https://chaperon.ndgf.org/nagios/. The requirements related to the former are described on the NorduGrid Service Monitoring page.
- Enable job submission by the ops VO. Prioritize these jobs so that they always get to the front of the queue (one possible Slurm approach is sketched below, after the list).
- Install the ENV/PROXY runtime environment.
- Register the CE in GOCDB.
- Get an eScience host certificate.
- Enable the A-REX web service interface by having an [arex/ws] block (see the sketch below). Check that tcp/443 is open to the world (both IPv4 and IPv6).
- To ease monitoring, each ARC-CE has only one hostname. Not 2, 3 and absolutely not 4. 5 is right out.
- Don't NAT the cluster via the ARC-CE, not even on the same hardware if virtualized. In many cases the machine will have enough traffic of its own without the NATed traffic also passing through it.
- Provide a Ganglia group with the CE, cache servers, etc. for the NDGF central Ganglia.
The following block is required to satisfy the org.nordugrid.ARC-CE-ARIS metric in ARGO:
[infosys/nordugrid]
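Putting the arc.conf pieces mentioned above together, a minimal sketch could look like the following. This is not a complete configuration (LRMS, authorization and mapping blocks are omitted) and the hostname is a placeholder:
# one ARC-CE, one hostname
[common]
hostname=ce01.example.org

[infosys]

# required for the org.nordugrid.ARC-CE-ARIS metric
[infosys/nordugrid]

[arex]

# A-REX WS interface, served on tcp/443 by default
[arex/ws]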
- It may be possible to have your CE tested by the Kosice test team before going into production. See the NorduGrid Service Monitoring page.
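For the ops-job prioritization in the list above, one possible approach on a Slurm cluster is a dedicated high-priority QOS attached to the local account that the ops VO maps to. This is only a sketch with hypothetical names (opsprio, opsuser); partition priorities or job_submit logic are equally valid:
# create a high-priority QOS and make it the default for the mapped ops account
sacctmgr add qos opsprio priority=1000000
sacctmgr modify user where name=opsuser set qos+=opsprio defaultqos=opsprio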
For local monitoring you must also enable the ARC cache using cacheindex.ndgf.org:
- Install nordugrid-arc-acix-cache.
- Make sure port 5443 is open for incoming traffic, at least from cacheindex.ndgf.org, though wider access may be needed for cache stealing.
- Check that you allow the ops.ndgf.org VO. This is used to authenticate cacheindex.ndgf.org.
- Start the acix-cache service and add it to the init-scripts.
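A sketch of the last two steps with firewalld and systemd, using the service name as above (on ARC6 the unit is called arc-acix-scanner, see further down):
# open tcp/5443 (this opens it to everyone; restrict to cacheindex.ndgf.org if you prefer)
firewall-cmd --permanent --add-port=5443/tcp
firewall-cmd --reload
# start the cache scanner now and enable it at boot
systemctl enable --now acix-cache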
The ARC cache setup is described in 4.4.3 Enabling the Cache in the ARC Computing Element System Administrator Guide.
ARC exporter monitoring
The arc-exporter is available from here: https://source.coderefinery.org/nordugrid/arc-exporter
Instructions for installation are also found there.
You must decide whether you just want to open the firewall to the central prometheus.ndgf.org (109.105.124.160, 2001:948:40:2::160) towards port 11010, or whether you want the extra layer of TLS authentication. For the latter, you would have to set up e.g. nginx or something else that serves the exporter over TLS.
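If you go for the plain firewall opening, a sketch with iptables (addresses as above; adapt to your site's firewall tooling) could be:
# allow only the central Prometheus to reach the exporter on tcp/11010
iptables -A INPUT -p tcp --dport 11010 -s 109.105.124.160 -j ACCEPT
ip6tables -A INPUT -p tcp --dport 11010 -s 2001:948:40:2::160 -j ACCEPT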
Once the arc-exporter is installed and access to the exported metrics is set up, Petter needs to configure the central monitoring to fetch the metrics.
Authorizing Users
The authorization of T1 users is done through the use of Virtual Organizations (VOs). Check the ARC documentation linked below for general information on giving VOs access. You should authorize the relevant VOs on your cluster (ATLAS, ALICE, etc.). It is also very important that you enable the ops and ops.ndgf.org VOs; if you don't, monitoring will not work. You can check here for more information on setting up WLCG VOs: VOMS LSC information. Once you have configured the LSC information, add the following sources to your VO configuration in arc.conf:
ATLAS
source="vomss://voms2.cern.ch:8443/voms/atlas?/atlas"
source="vomss://lcg-voms2.cern.ch:8443/voms/atlas?/atlas"
Alice
source="vomss://voms2.cern.ch:8443/voms/alice?/alice"
source="vomss://lcg-voms2.cern.ch:8443/voms/alice?/alice"
CMS
source="vomss://voms2.cern.ch:8443/voms/cms?/cms"
source="vomss://lcg-voms2.cern.ch:8443/voms/cms?/cms"
Monitoring
source="vomss://voms2.cern.ch:8443/voms/ops"
source="vomss://lcg-voms2.cern.ch:8443/voms/ops"
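On ARC6 these sources typically go into [userlist] blocks, roughly as in the sketch below (block names are arbitrary, and only ATLAS and ops are shown); on ARC5 the equivalent [vo] blocks take the quoted form shown above:
[userlist:atlas]
source=vomss://voms2.cern.ch:8443/voms/atlas?/atlas
source=vomss://lcg-voms2.cern.ch:8443/voms/atlas?/atlas

[userlist:ops]
source=vomss://voms2.cern.ch:8443/voms/ops
source=vomss://lcg-voms2.cern.ch:8443/voms/ops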
Auth config ARC
Example configuration blocks for ATLAS - the authgroup name (here "atlas") should be changed to the authgroup name you are using. For ALICE, just replace "atlas" with "alice", etc.
voms and authtokens (and anything else) work in parallel.
[authtokens]
[authgroup:atlas]
authtokens=* https://atlas-auth.cern.ch * * *
authtokens=* https://atlas-auth.web.cern.ch * * *
voms=atlas * * *
The following authgroup (the authgroup name is up to you, here we use ops) is required for monitoring and operations (job submission must be allowed):
[authgroup:ops]
subject=/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=ddmadmin/CN=531497/CN=Robot: ATLAS Data Management
voms=atlas * lcgadmin *
voms=ops * * *
voms=dteam * * *
[arex/ws/jobs]
allowaccess = ops
For ARC 6:
[gridftpd/jobs]
allowaccess = ops
Running SLURM
Please take a look at this page for SLURM best practices: SlurmBestPractices
Running Singularity
Please take a look at this page for Singularity integration: ARC_and_Singularity
Or perhaps rather this (/jens)?
Job Requirements
The job requirements for the experiments can be found here: ATLAS, Alice, CMS
The most relevant information is that jobs need 2 GB of memory per core and that the upper limits on used scratch space are 20 GB per ATLAS or CMS job and 10 GB per ALICE job. Check the links for current information.
RTE for ATLAS memory
Be aware that ATLAS jobs oversubscribe the memory at the beginning of the job and you might need at least 2.5 GB per core during peaks. If you set a hard limit of 2 GB per core, you will have many job failures. To avoid this, the following RTE could be used to give all jobs e.g. 20% more RAM than requested (thanks Gianfranco). Another option is to use cgroups with the option AllowedRAMSpace=120 (a cgroup.conf sketch is shown after the arcctl commands below), or to use job_submit.lua to change the memory requirement.
$ grep runtimedir /var/run/arc/arcctl.runtime.conf
runtimedir=/....
$ cat ${runtimedir}/ENV/MEMORY
# description: scale up memory request by 20%
if [ "x$1" = "x0" ]; then
   # RTE Stage 0
   # You can do something on the ARC host at this stage.
   # Here is the right place to modify joboption_* variables.
   # Note that this code runs under the mapped user account already!
   # No root privileges!
   if [ ! -z "$joboption_memory" ] ; then
      joboption_memory="$(($joboption_memory*12/10))"
   fi
fi
Enable the RTE and make it the default:
arcctl rte default ENV/MEMORY
arcctl rte enable ENV/MEMORY
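The cgroup alternative mentioned above could, on a Slurm cluster with cgroup task containment enabled, look roughly like this cgroup.conf sketch:
# constrain jobs to cgroup memory limits, but allow 120% of the requested amount
ConstrainRAMSpace=yes
AllowedRAMSpace=120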
RTE for SAM tests
The SAM tests may require a dummy GLITE runtime environment. To create it, run:
arcctl rte enable -d ENV/GLITE
RTE for ATLAS jobs
This RTE is expected for ATLAS jobs:
arcctl rte enable --dummy APPS/HEP/ATLAS-SITE
Also, at least ATLAS (and maybe other experiments) expects this system RTE to be enabled and set as default:
arcctl rte enable ENV/PROXY
arcctl rte default ENV/PROXY
ARC Session dir
Sizing the session dir depends a lot on your specific set-up, the projects you run and the single core / multi core job mix. Please check the links in job requirements for current information. There are three main setups:
Linked cache files and node scratch
Most data will reside in the cache and on the node-local disk. Only the requested output files will take up space in your session dir. You should scale the session dir to handle peak load, when jobs finish. The expected average load is then around 200 GB + 2 GB per job. (Caveat emptor: this will depend on your ATLAS/ALICE job mix.)
Linked cache files and session dir as scratch
In this case you must size the session dir to take the job scratch space into account. This means a session dir of around 10 GB per ALICE job and 20 GB per ATLAS job. (This can have a huge spread if you are only running single-core jobs vs. only multi-core.) It can be less because of the sharing of cached files.
Cache files copied to session dir and session dir as scratch
In this case you are no longer sharing the cache and have to keep the full scratch space in the session dir. The session dir should then be sized as 200 GB + 20 GB per job if you run ATLAS jobs, i.e. you must take into account that you could be unlucky and get a full cluster load of only single-core ATLAS jobs. As a hypothetical worked example, a 1000-core cluster running only single-core ATLAS jobs would need roughly 200 GB + 1000 × 20 GB ≈ 20 TB.
ARC Cache
Sizing
ATLAS loads these days could require as much as 25 TB per 1000 cores, but the typical load seems lower. Sizing for 20 TB + 5 TB per 1000 cores seems reasonable, and is probably oversized for large clusters. Vega does fine with 500 TB for 200k cores.
Speed is important: the cache disks will see fast sequential writes from ARC, intensive random reads from compute nodes, and some small random writes too if the machines are also used as session directory servers. But since this scales well horizontally, a handful of 1U machines with 100G networking and decent NVMe drives should be fast enough.
N+1 redundancy might be useful, in case one of the cache servers breaks and repairs take a long time.
Special requirements
The cache cleaner removes unlocked files in LRU order. It relies on the atime timestamp; if you are using a shared file system, make sure it supports this.
Atime notes:
- Do not mount the cache dir with noatime. This includes NFS mounts etc. as well.
- Modern Linux defaults to relatime, which causes atime to be updated only every 24 hours or when the file is changed.
- The strictatime mount option forces atime to always be written/updated. This hurts ext* filesystems; xfs handles it OK.
- Newer Linux kernels (4.0 for ext4, 4.17 for xfs) support the lazytime mount option, which keeps in-memory atime values always updated while writing them to disk only sporadically. It is used in addition to strictatime.
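As an illustration, a hypothetical /etc/fstab entry for an xfs cache file system combining the two options could look like this (device and mount point are placeholders):
/dev/mapper/vg0-cache  /grid/cache  xfs  defaults,strictatime,lazytime  0 0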
FIXME: Needs to support many hard links to a single file. Panasas had issues with this; should probably be noted if there are new shared file systems showing up.
Opening the remote access to the local cache
NOT RECOMMENDED TO RUN ACIX ANY LONGER (NDGF-AH 2020)
This lets NDGF sites download the input files for ATLAS jobs from each other. See the configuration notes for ARC5 and ARC6 below.
When it is configured and running, write to support@ndgf.org that you have opened your cache for access; NDGF will make your acix service be pulled by cacheindex.ndgf.org, so the other CEs become aware of the files in your cache.
ARC5 configuration
The acix-cache service should be running.
In the [grid-manager] section:
- Enable A-REX WS interface. The download happens through it.
arex_mount_point="https://your.ndgf.host:443/arex"
- Let users with ATLAS VO proxies access ATLAS files from your cache. ATLAS files are rucio files.
cacheaccess="rucio://rucio-lb-prod.cern.ch/.* voms:vo atlas"
- Limit the number of concurrent downloads from your cache
max_data_transfer_requests="20"
In the [data-staging] section:
- Make your CE look up the input files in other remote caches that register to cacheindex.ndgf.org:
acix_endpoint="https://cacheindex.ndgf.org:6443/data/index"
- Adjust preferredpattern so that nearby CEs are used as a fallback, if available, before storage elements in foreign clouds. This example also prefers abisko-ce over srm.ndgf.org; use with caution.
preferredpattern="pandaserver.cern.ch$|www-f9.ijs.si$|abisko-ce.hpc2n.umu.se$|ndgf.org$|.se$|.dk$|.no$|.fi$|.si$"
ARC6 configuration
The arc-acix-scanner service should be running.
In the [arex/ws] subblock:
- Limit the number of concurrent downloads from your cache
max_data_transfer_requests=20
In the [arex/ws/cache] subblock:
- Let users with ATLAS VO proxies access ATLAS files from your cache. ATLAS files are rucio files.
cacheaccess=rucio://rucio-lb-prod.cern.ch/.* voms:vo atlas
In the [arex/data-staging] subblock:
- Make your CE look up the input files in other remote caches that register to cacheindex.ndgf.org:
use_remote_acix=https://cacheindex.ndgf.org:6443/data/index
- Adjust preferredpattern so that nearby CEs are used as a fallback, if available, before storage elements in foreign clouds. This example also prefers abisko-ce over srm.ndgf.org; use with caution.
preferredpattern=pandaserver.cern.ch$|www-f9.ijs.si$|abisko-ce.hpc2n.umu.se$|ndgf.org$|.se$|.dk$|.no$|.fi$|.si$
Make sure the [acix-scanner] block is present (it can be completely empty).
Investigating problems
If ATLAS jobs are acting weird, please contact ND cloud <atlas-support-cloud-nd@cern.ch> for ATLAS help with debugging.
If ALICE jobs are acting weird, please contact Erik Edelmann <erik.edelmann@csc.fi> for help with debugging.
As a fallback, contact support@ndgf.org for help in figuring out where the problem is. Please keep the other contacts on Cc: too for anything serious.
SGAS reporting
If you want to report jobs to SGAS, the following needs to be sent to somebody who can add it to SGAS, e.g. Erik Edelmann <erik.edelmann@csc.fi>:
- The certificate DN
- A HEPscore23 or HEP-SPEC06 value
Relevant Documentation
More documentation can be found on