Running ARC CEs for NeIC Tier 1


Preparing new services/hardware

When you have a new service or new hardware ready for production, it is important that NDGF is informed. It might be necessary to run performance tests or to add information to GOCDB, and central monitoring will also have to be prepared for the service. To ensure that the procedure is tracked, the site admin should open a ticket for NDGF:

* Submit a ticket through GGUS (You must have a valid certificate)
* Add your email if it was not automatically parsed from your certificate
* Describe the request in subject and description
* Set the "Concerned VO" to: ops.ndgf.org
* Set the "Ticket category" to: Change Request
* Set the "Affected ROC/NGI" to: NGI_NDGF
* In the Routing Information section please set "Assign to Support unit" to: NGI_NDGF
* All the other boxes can be left as is
* Submit the ticket

If you encounter any problems, contact the NDGF operator on duty (OoD), either through the mailing list or on the chat.

Requirements for Monitoring

The T1 CEs are monitored by the EGI SAM infrastructure through https://ngi-sam.ndgf.org/nagios/, as well as our local Nagios at https://chaperon.ndgf.org/nagios/. The requirements related to the former are described on the NorduGrid Service Monitoring page.

  • Enable job submission by the ops VO. Prioritize these jobs so that they always get to the front of the queue.
  • Install the ENV/PROXY runtime environment.
  • Register the CE in GOCDB.
  • Get an eScience host certificate.
  • Enable the A-REX web service interface by having an [arex/ws] block (see the sketch after this list). Check that tcp/443 is open to the world (both IPv4 and IPv6).
  • To ease monitoring, each ARC-CE has only one hostname. Not 2, 3 and absolutely not 4. 5 is right out.
  • Don't NAT the cluster via the ARC-CE, not even on the same hardware if virtualized. In many cases the machine will have enough traffic without the NATed traffic also passing through it.

  • Provide a Ganglia group with the CE, cache servers, etc. for the NDGF central Ganglia.
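
A minimal sketch of the web service block mentioned above. With no options, A-REX listens on the default https://<hostname>:443/arex URL; the commented wsurl line and hostname are only illustrative:

[arex/ws]
# wsurl=https://arc-ce.example.org:443/arex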



The following block is required to satisfy the org.nordugrid.ARC-CE-ARIS metric in ARGO:

[infosys/nordugrid]

  • It may be possible to have your CE tested by the Kosice test team before going into production. See the NorduGrid Service Monitoring page.

For local monitoring you must also enable the ARC cache using cacheindex.ndgf.org:

  • Install nordugrid-arc-acix-cache.
  • Make sure port 5443 is open for incoming traffic, at least from cacheindex.ndgf.org, though wider access may be needed for cache stealing.
  • Check that you allow the ops.ndgf.org VO. This is used to authenticate cacheindex.ndgf.org.
  • Start the acix-cache service and add it to the init-scripts.

The ARC cache setup is described in 4.4.3 Enabling the Cache in the ARC Computing Element System Administrator Guide.
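
On a systemd-based host, enabling and starting the service could look like the sketch below. The unit name depends on the ARC release (acix-cache in ARC5, arc-acix-scanner in ARC6); check which one your installation provides:

# enable at boot and start now; pick the unit name that matches your ARC version
systemctl enable --now acix-cache          # ARC5
# systemctl enable --now arc-acix-scanner  # ARC6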


ARC exporter monitoring

The arc-exporter is available from here: https://source.coderefinery.org/nordugrid/arc-exporter

Instructions for installation are also found there.

You must decide whether you just want to open the firewall to the central prometheus.ndgf.org (109.105.124.160, 2001:948:40:2::160) on port 11010, or whether you want an extra layer of TLS authentication. For the latter, you have to set up e.g. nginx or something else that serves the exporter over TLS.
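
For the firewall-only option, a hedged sketch using iptables (adapt to whatever firewall tooling your site actually uses):

# allow the central Prometheus host to scrape the exporter on port 11010
iptables  -A INPUT -p tcp -s 109.105.124.160    --dport 11010 -j ACCEPT
ip6tables -A INPUT -p tcp -s 2001:948:40:2::160 --dport 11010 -j ACCEPT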

Once the arc-exporter is installed and access to the exported metrics is set up, Petter needs to configure the central monitoring to fetch the metrics.


Authorizing Users

The authorization of T1 users is done through the use of Virtual Organizations (VOs). Check the ARC documentation linked below for general information on giving VOs access. You should authorize the relevant VOs on your cluster (ATLAS, ALICE, etc.), and it is also very important that you enable the ops and ops.ndgf.org VOs; if you don't, monitoring will not work. You can check here for more information on setting up WLCG VOs: VOMS LSC information. Once you have configured the LSC information, you should add the following sources to your VO configuration in arc.conf:

ATLAS

source="vomss://voms2.cern.ch:8443/voms/atlas?/atlas"

source="vomss://lcg-voms2.cern.ch:8443/voms/atlas?/atlas"

Alice

source="vomss://voms2.cern.ch:8443/voms/alice?/alice"

source="vomss://lcg-voms2.cern.ch:8443/voms/alice?/alice"

CMS

source="vomss://voms2.cern.ch:8443/voms/cms?/cms"

source="vomss://lcg-voms2.cern.ch:8443/voms/cms?/cms"

Monitoring

source="vomss://voms2.cern.ch:8443/voms/ops"

source="vomss://lcg-voms2.cern.ch:8443/voms/ops"


Auth config ARC

Example configuration blocks for ATLAS - the authgroup name (here "atlas") should be changed to the authgroup name you are using. For ALICE, just replace "atlas" with "alice", etc.

voms and authtokens (and anything else) work in parallel.

[authtokens]
[authgroup:atlas]
authtokens=* https://atlas-auth.cern.ch * * *
authtokens=* https://atlas-auth.web.cern.ch * * *
voms=atlas * * *


The following authgroup (the authgroup name is up to you; here we use ops) is required for monitoring and operations (job submission must be allowed):


[authgroup:ops]
subject=/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=ddmadmin/CN=531497/CN=Robot: ATLAS Data Management
voms=atlas * lcgadmin *
voms=ops * * *
voms=dteam * * *

[arex/ws/jobs]
allowaccess = ops


For ARC 6:

[gridftpd/jobs]
allowaccess = ops

Running SLURM

Please take a look at this page for SLURM best practices: SlurmBestPractices

Running Singularity

Please take a look at this page for Singularity integration: ARC_and_Singularity

Or perhaps rather this (/jens)?

https://twiki.cern.ch/twiki/bin/view/AtlasComputing/ADCContainersDeployment#Recommended_setup_to_use_singula

Job Requirements

The job requirements for the experiments can be found here: ATLAS, Alice, CMS

The most relevant information is that jobs need 2 GB of memory per core and that the upper limits on used scratch space are 20 GB per ATLAS or CMS job and 10 GB per ALICE job. Check the links for current information.
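As a hedged illustration using the numbers above: a typical 8-core ATLAS job would request 8 × 2 GB = 16 GB of memory and may use up to 20 GB of scratch space.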

RTE for ATLAS memory

Be aware that ATLAS jobs oversubscribe memory at the beginning of the job, and you might need at least 2.5 GB per core during peaks. If you set a hard limit of 2 GB per core, you will have many job failures. To avoid this, the following RTE could be used to give all jobs e.g. 20% more RAM than requested (thanks Gianfranco). Another option is to use cgroups with AllowedRAMSpace=120, or to use job_submit.lua to change the memory requirement.

   $ grep runtimedir /var/run/arc/arcctl.runtime.conf
   runtimedir=/....
   $ cat ${runtimedir}/ENV/MEMORY
   # description: scale up memory request by 20%
   if [ "x$1" = "x0" ]; then
      # RTE Stage 0
      # You can do something on the ARC host at this stage.
      # Here is the right place to modify joboption_* variables.
      # Note that this code runs under the mapped user account already!
      # No root privileges!
      if [ ! -z "$joboption_memory" ] ; then
        joboption_memory="$(($joboption_memory*12/10))"
      fi
   fi

Enable the RTE and make it the default:

arcctl rte default ENV/MEMORY

arcctl rte enable ENV/MEMORY

RTE for SAM tests

The SAM tests may require a dummy GLITE runtime environment. To create it, run

arcctl rte enable -d ENV/GLITE

RTE for ATLAS jobs

This RTE is expected for ATLAS jobs:

arcctl rte enable --dummy APPS/HEP/ATLAS-SITE

Also, at least ATLAS (and maybe other experiments) expects this system RTE to be enabled and set as default:

arcctl rte enable ENV/PROXY

arcctl rte default ENV/PROXY

ARC Session dir

Sizing the session dir depends a lot on your specific set-up, the projects you run, and the single-core / multi-core job mix. Please check the links in job requirements for current information. There are three main setups:

Linked cache files and node scratch

Most data will reside in the cache and on the node-local disk. Only the requested output files will take up space in your session dir. You should scale the session dir to handle the peak load when jobs finish. The expected average load is then around 200 GB + 2 GB per job. (Caveat emptor: this will depend on your ATLAS/ALICE job mix.)

Linked cache files and session dir as scratch

In this case you must size the session dir to take the job scratch space into account. This means a session dir of around 10 GB per ALICE job and 20 GB per ATLAS job. (This can have a huge spread depending on whether you run only single-core or only multi-core jobs.) It can be less because of sharing of cached files.

Cache files copied to session dir and session dir as scratch

In this case you are no longer sharing the cache and have to have the full scratch in the session dir. The session dir should then be sized as 200 GB + 20 GB per job if you run ATLAS jobs, i.e. you must take into account that you could be unlucky and get a full cluster load of only single-core ATLAS jobs.
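
As a hedged illustration of that worst case, assume a 1000-core cluster running only single-core ATLAS jobs with the session dir used as scratch: 200 GB + 1000 × 20 GB ≈ 20 TB of session dir space.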

ARC Cache

Sizing

ATLAS loads these days could require as much as 25 TB per 1k cores, but the typical load seems lower. Sizing for 20 TB + 5 TB per 1k cores seems reasonable, and is probably oversized for large clusters. Vega does fine with 500 TB for 200k cores.
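
As a hedged example of that rule of thumb, a 10k-core cluster would get 20 TB + 10 × 5 TB = 70 TB of cache.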

Speed is important: the cache disks will see fast sequential writes from ARC, intensive random reads from compute nodes, and some small random writes too if the machines also serve session directories. Since this scales well horizontally, a handful of 1U machines with 100G networking and decent NVMe drives should be fast enough.

N+1 redundancy might be useful, in case one of the cache servers breaks and repairs take a long time.

Special requirements

The cache cleaner removes unlocked files in LRU order. It relies on the atime stamp. If using a shared file system, make sure it supports this.

Atime notes:

  • Do not mount the cache dir with noatime. This applies to NFS mounts etc. as well.
  • Modern Linux defaults to relatime, which causes atime to be updated only every 24 hours or when the file is changed.
  • The strictatime mount option forces atime to always be written/updated. This hurts ext* filesystems; xfs handles it OK.
  • Newer Linux kernels (4.0 for ext4, 4.17 for xfs) support the lazytime mount option, which causes in-memory atime values to always be updated while writing them to disk only sporadically. This can be used in addition to strictatime.
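
As a hedged illustration of combining these options (device and mount point are made up; adapt to your setup):

# mount the cache filesystem with accurate atime for the cache cleaner;
# lazytime batches the on-disk atime updates
mount -o strictatime,lazytime /dev/sdb1 /var/spool/arc/cache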

FIXME: The filesystem needs to support many hard links to a single file. Panasas had issues with this; this should probably be noted if new shared filesystems show up.

Opening the remote access to the local cache

NOT RECOMMENDED TO RUN ACIX ANY LONGER (NDGF-AH 2020)

This lets NDGF sites download the input files for ATLAS jobs from each other. See the configuration notes for ARC5 and ARC6 below.

When it's configured and running, write to support@ndgf.org that you have opened your cache for access; NDGF will make your acix service be pulled by cacheindex.ndgf.org, so the other CEs become aware of the files in your cache.

ARC5 configuration

The acix-cache service should be running.

In the [grid-manager] section:

  • Enable the A-REX WS interface. The download happens through it.
  • Let users with ATLAS VO proxies access ATLAS files from your cache. ATLAS files are Rucio files:
    • cacheaccess="rucio://rucio-lb-prod.cern.ch/.* voms:vo atlas"
  • Limit the number of concurrent downloads from your cache:
    • max_data_transfer_requests="20"

In the [data-staging] section:

  • Make your CE look up the input files in other remote caches that register to cacheindex.ndgf.org.
  • Adjust preferredpattern so that nearby CEs are used as a fallback, if available, before storage elements in foreign clouds. This example also prefers abisko-ce over srm.ndgf.org; use with caution.
    • preferredpattern="pandaserver.cern.ch$|www-f9.ijs.si$|abisko-ce.hpc2n.umu.se$|ndgf.org$|.se$|.dk$|.no$|.fi$|.si$"

ARC6 configuration

The arc-acix-scanner service should be running.

In the [arex/ws] subblock:

  • Limit the number of concurrent downloads from your cache
    • max_data_transfer_requests=20

In the [arex/ws/cache] subblock:

  • Let users with ATLAS VO proxies access ATLAS files from your cache. ATLAS files are Rucio files:
    • cacheaccess=rucio://rucio-lb-prod.cern.ch/.* voms:vo atlas

In the [arex/data-staging] subblock:

  • Make your CE look up the input files in other remote caches that register to cacheindex.ndgf.org.
  • Adjust preferredpattern so that nearby CEs are used as a fallback, if available, before storage elements in foreign clouds. This example also prefers abisko-ce over srm.ndgf.org; use with caution.
    • preferredpattern=pandaserver.cern.ch$|www-f9.ijs.si$|abisko-ce.hpc2n.umu.se$|ndgf.org$|.se$|.dk$|.no$|.fi$|.si$

Make sure the [acix-scanner] block is present (it can be completely empty).
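
As a hedged sketch of how the ARC6 options above could be assembled in arc.conf (values copied from the bullets; the commented acix_endpoint line is an assumption for the cacheindex lookup and should be checked against the arc.conf reference):

[arex/ws]
max_data_transfer_requests=20

[arex/ws/cache]
cacheaccess=rucio://rucio-lb-prod.cern.ch/.* voms:vo atlas

[arex/data-staging]
preferredpattern=pandaserver.cern.ch$|www-f9.ijs.si$|abisko-ce.hpc2n.umu.se$|ndgf.org$|.se$|.dk$|.no$|.fi$|.si$
# assumed option and URL for the remote cache lookup; verify before use
# acix_endpoint=https://cacheindex.ndgf.org:6443/data/index

[acix-scanner]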


Investigating problems

If ATLAS jobs are acting weird, please contact ND cloud <atlas-support-cloud-nd@cern.ch> for ATLAS help with debugging.

If ALICE jobs are acting weird, please contact Erik Edelmann <erik.edelmann@csc.fi> for help with debugging.

As a fallback, contact support@ndgf.org for help in figuring out where the problem is. Please keep it on Cc: to the other contacts too for anything serious.

SGAS reporting

If you want to report jobs to SGAS, the following needs to be sent to somebody who can add it to SGAS, e.g. Erik Edelmann <erik.edelmann@csc.fi>:

  • The certificate DN
  • A HEPscore23 or HEPSPEC06 value

Relevant Documentation

More documentation can be found on