NDGF all hands 2016 notes


NDGF 2016 All Hands Meeting Notes

24-25 May 2016

Oslo, USIT

https://wiki.neic.no/wiki/NDGF_all_hands_2016

Day 1

Pre-lunch

  • Norway
    • UiB moving to 100Gb link
    • Norway is centralizing into two data centers within two years: Trondheim and Tromsø. Disk + computing, new hardware. So in a few years the Oslo site will be no more
  • Denmark
    • some T1 funding might be moved to a longer running program
    • no plans for national centralization of resources
    • ??
    • Profit
  • HIP site report:
    • No compute managed by CSC anymore - HIP runs it all. Some physical clusters, some in CSC's OpenStack - pouta.csc.fi (pending name change)
    • New hardware installed and in use: 8 x HP SL4510 Gen9 with 68 x 4TB disks each (3 servers for ALICE, 5 for T2_FI_HIP CMS)
    • CentOS 7 and journald corruption - the corruption rate decreased/stopped after setting Compress=no in /etc/systemd/journald.conf
    • ansible - using ansible to configure everything - except dCache itself
    • CMS will hopefully be writing more to T2_FI_HIP - which can and does fill the 10Gb LHCOPN pipe to Finland
  • Sweden
    • Linköping, Jens
      • Money found, extra months acquired. New tape frontend and grid cache servers: DL360 Gen9 with D6000 disk enclosures (looks like an SL4510 Gen9 but with no servers inside). Same as for the dCache pools.
      • decommissioned a few dCache pools
      • tapes
        • new tape pool held up nicely, tape backend did not (from TSM server to tape drives).
        • tape pool balancing should be looked at - idea is to increase tape bandwidth
        • how to group or bundle requests?
      • new grid cache not deployed yet. Old 8 x Dell 510 have been extremely stable but now 5-6 years old.
      • new director for NSC starting _now_
      • currently two computer rooms - one really small and one really big. They're keeping the old one and giving up half of the space in the new one. More external partners - they want a second room for redundancy. Paper archive.
      • moving to new SUNET connections in the fall (2x20Gb)
        • no BGP magic fall-back router solution deployed - perhaps not needed in new SUNET?
    • HPC2N
      • Dell X730xt? New dCache pools: 16 x 8TB drives, 100TB effective space, RAID6. https://wiki.neic.no/wiki/DCache_Pool_Hardware
        • Firmware downgrades itself sometimes.
        • Should we look at CPU requirements? A lot of idling CPUs on the pool hardware
        • new grid cache machine 64GB RAM
          • first procurement failed - too little disk space and cheated on the specs
          • bought in February, delivered in March, deployed late March. Last piece, a switch, was delivered in April - with the wrong airflow (contrary to spec)
        • new cluster in October: 28-core Intel, Fujitsu, with a GPU part. Might not be used for WLCG
          • no BGP magic fall-back router solution deployed - perhaps not needed in new SUNET?
          • will replace the ARC CE in the near future
          • tape library is maybe in the 2016 budget - thinking about jaguars
        • competence pooling - Swedish funding agency
  • Non-OPN networks at T1 sites - on dCache pools
    • If a pool routes some public IP traffic over a LAN (to the local compute nodes), then this will work. But probably nobody does this - and wouldn't it leave the internal interface unused anyway?
    • IP ranges on a slide - these will eventually be removed (not planned by .dk only?)
    • Reasons:
      • A lot easier to deal with computers which are not multi-homed
      • backup-link - for sites without BGP magic fallback
      • Tape Access is via a router at NSC and HPC2N
  • syntax load / load syntax - secret dCache run command on pool servers
  • 20Gb dedicated bandwidth at HPC2N - can use more than 100Gb with little effort. If there is contention, packets get dropped.
    • Maswan has nice and secret internal planning material
    • NUNOC is busy rolling out new SUNET
    • new hardware for Örestaden has not arrived yet

Post-Lunch

  • Logstash4dcache and Elasticsearch - some issues with the newer ES 2.x? NDGF is seeing workers getting stuck

dCache hardware update

Could this be updated: https://wiki.neic.no/wiki/DCache_Pool_Hardware

Sites could update this with some more recent procurements.

  • A dCache pool of 100TB is OK; 200TB is too many files, at least for ALICE
    • dCache 2.15 helps a lot - read-only quickly
    • it is hard to build small pools with the large disks these days.
    • ask sites for a benchmark of "time ls -l /pool1/pool" (a minimal timing sketch follows after this list)
      • could it make sense to make this part of the usual disk benchmarks?
      • make multiple directories in dCache?
  • xyz extrapolation?
  • future
    • shingled drives would be nice in dCache - but is it worth it?
    • SSD cache - for the T1 this probably fits better for ARC caches (but should be large)
    • future disks look like tape
    • future == tape if disks do not improve
    • tape does not have a technological limit like disks right now - to scale up
    • 3 parity disks for the future if RAIDs increase? - if we continue with the current technology choices we'll have pools with many spindles
      • disk metric: IOPS / TB
      • tiering - one use case: if the files can be read sequentially from drives that like sequential reads, they can then be copied over to a faster tier
      • flashes
      • Only PDC has managed to lose 3 drives in a RAID6
  • 40Gb Ethernet cards are much more expensive than Infiniband cards that support 40GbE
    • 100Gb Optics are expensive
    • 1Gbps per 10TB of disk space
      • Suggestion: 2x10GbE for >150TB?? (150TB at 1Gbps per 10TB is about 15Gbps)
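
A minimal sketch of the "time ls -l" metadata benchmark mentioned in the list above, assuming Python 3 on the pool node; the pool path is the example from the discussion and is site-specific:

 #!/usr/bin/env python3
 # Time how long it takes to stat every entry in a pool directory,
 # similar to "time ls -l /pool1/pool". Adjust the path to the local pool.
 import os
 import time

 POOL_DIR = "/pool1/pool"   # example path from the notes above

 start = time.monotonic()
 count = 0
 for entry in os.scandir(POOL_DIR):
     entry.stat(follow_symlinks=False)   # force the metadata lookup, like ls -l
     count += 1
 elapsed = time.monotonic() - start

 print(f"stat()ed {count} entries in {elapsed:.1f} s")
 if elapsed > 0:
     print(f"that is about {count / elapsed:.0f} entries/s")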

decommissioning the Norwegian T2

Not a WLCG T2

"Ulf takes care of everything"

It is not possible to rename or delete a site

nagios changes - argo.egi.eu

Another nagios / DN with an /ops proxy

Still nagios but a lot more complicated.

CMS will use check_mk ?

Sites do not have to do anything.

It's the same probes

There are nagios probes in here: https://github.com/ARGOeu

Coffee

NFSv4 on the T2_FI_HIP cms xrootd redirectors?

Post Coffee

Transfer Statistics Grinding

Why send to ELK if everything is already in SGAS?

Collect more information:

  • file age in cache?
  • files being read often?

JURA was fixed - it now stores all the things.

File transfers of jobs that read from the ARC cache do not show up

Ganglia ARC plugin should/could be made to include more metrics from the ARC caches

UiO should probably get more disk space for the ARC Cache

NSC recently bought 3 servers with 70x2TB disks - currently untested however

You trade disk for network

Some sacct/Slurm thing will be improved shortly.

Publishing of ARC logs and statistics in our ELK.

https://chaperon.ndgf.org/kibana

It's possible to use a proxy certificate with curl
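
For reference, the same thing can be done from Python with requests; the proxy path below is only an example, and whether the TLS stack accepts a full proxy chain this way should be verified:

 # Query the Kibana endpoint above with an X.509 proxy certificate,
 # roughly equivalent to "curl --cert <proxy> --key <proxy>".
 # The proxy path is an example; it depends on how the proxy was created.
 import requests

 PROXY = "/tmp/x509up_u1000"                  # grid proxy (cert + key in one file)
 CA_DIR = "/etc/grid-security/certificates"   # hashed CA directory (or a CA bundle file)

 resp = requests.get(
     "https://chaperon.ndgf.org/kibana",
     cert=PROXY,      # requests accepts a single file containing both cert and key
     verify=CA_DIR,
     timeout=30,
 )
 print(resp.status_code)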

Can this be done: average upload throughput to Slovenia?

Two methods discussed:

  • Would be nice to extract data from SGAS to ELK:
    • A parser is needed (see the sketch after this list)
  • Having sites send statistics should be possible too.
  • One can have rsyslog read the files and ship them to Elasticsearch
  • Perhaps one could configure ARC to ship the job statistics
  • transfer statistics
  • per job throughput
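
A rough sketch of the first method (the parser), assuming SGAS keeps its usage records in a local PostgreSQL database; the table and column names below are hypothetical placeholders that would have to be mapped to the real SGAS schema:

 # Pull recent job usage records out of the SGAS database and bulk-index
 # them into Elasticsearch. Table/column names are hypothetical.
 import psycopg2
 from elasticsearch import Elasticsearch, helpers

 es = Elasticsearch(["http://localhost:9200"])
 db = psycopg2.connect("dbname=sgas")   # assumed local SGAS PostgreSQL database

 def fetch_records():
     with db.cursor() as cur:
         cur.execute("""
             SELECT record_id, machine_name, vo_name, end_time,
                    wall_duration, cpu_duration, status
             FROM usagerecords                    -- hypothetical table
             WHERE end_time > now() - interval '1 day'
         """)
         cols = [c[0] for c in cur.description]
         for row in cur:
             doc = dict(zip(cols, row))
             yield {"_index": "sgas-jobs", "_id": doc["record_id"], "_source": doc}

 helpers.bulk(es, fetch_records())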

Having sites send statistics should be possible too.

Oslo can start testing it.

Action point production

One can set up a second billing service and change its output to JSON - then there is no need for grok, regexps and Logstash
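
A sketch of that idea, assuming the second billing service writes one JSON record per line; the file path and index name are made up for illustration:

 # Index dCache JSON billing records straight into Elasticsearch,
 # with no grok/regexp parsing in between.
 import json
 from elasticsearch import Elasticsearch

 es = Elasticsearch(["http://localhost:9200"])
 BILLING_FILE = "/var/lib/dcache/billing/billing-json.log"   # assumed location

 with open(BILLING_FILE) as fh:
     for line in fh:
         line = line.strip()
         if not line:
             continue
         es.index(index="dcache-billing", body=json.loads(line))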

dCache OS configuration

https://wiki.neic.no/wiki/DCache_Pool_installation

HPC2N: Local site resolvers

NORDUnet resolvers might be slow - and the queries might be rate-limited

Something that can be improved: file system tuning - it should be documented in https://wiki.neic.no/wiki/Operations_Tuning_Linux

If you made a change - were you able to test and reproduce it?

Often tape pools and grid caches need to be tuned

Network sysctl settings should be changed too - but these are not on the wiki

MTU:

  • all hosts in the broadcast domain must have the same MTU
  • path MTU discovery blocked by firewalls
  • only 9K
  • bad for page allocations?
  • sites have not touched this

bonus topic - dcache web alarms

This depends on dCache development work to set up authentication for the admin pages in dCache, which is not likely to happen

  • the alarm service
    • make nagios a consumer of alarm services
    • pool config change
    • setup the central alarm service
  • authentication

Day 2

https://en.wikipedia.org/wiki/Unconference

"An opportunity for sites to talk among eachother."

Unconference

The One Dashboard to Rule Them All

  • CPU
  • Storage
    • Disk
    • Tape
  • How much do we have/use?
  • Per VO/Site/Total should be the lowest common denominator
  • How much can we offer to ATLAS this or next year?
  • To cross-check the numbers with the ones reported on a CERN dashboard (REBUS, which only has pledges, not accounting). Not sure about the source.
  • A histogram. Time interval should be changeable.

Having dCache do allocation from a tool is difficult.

Storage accounting.

Maybe one can send the storage information to Elasticsearch?

An SQL query, and throw the result into Elasticsearch
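
One possible shape for that, assuming per-VO usage can be obtained with a single SQL query against the dCache database; the connection string, view and column names are illustrative only:

 # Aggregate used space per VO and index one summary document per VO,
 # so the dashboard can plot usage over time. Schema details are hypothetical.
 from datetime import datetime, timezone

 import psycopg2
 from elasticsearch import Elasticsearch

 es = Elasticsearch(["http://localhost:9200"])
 db = psycopg2.connect("dbname=chimera")   # assumed dCache namespace database

 with db.cursor() as cur:
     cur.execute("SELECT vo, sum(bytes_used) FROM storage_usage_per_vo GROUP BY vo")
     now = datetime.now(timezone.utc).isoformat()
     for vo, bytes_used in cur:
         es.index(index="storage-accounting",
                  body={"timestamp": now, "vo": vo, "bytes_used": int(bytes_used)})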

Two APs: one for XT + Maswan, another for someone to throw the data into Elasticsearch

ATLAS computing news

  • Cache registration to RUCIO
    • One objection was: "Why cannot the cache be counted towards the pledges?"
    • Can be fed to brokerage system - so jobs could be sent where the data already is.
    • Perhaps some arc.conf setting recommendations for the cache - like when the cleaner kicks in
  • Any site with < 400TB should stop investing in that SE - "to reduce complexity of data management system"
    • A smaller site could perhaps live with only the ARC cache
      • Take the T2 pledges and install in the T1?
      • reporting of the accounting is difficult. No obvious solution
  • efficiency has improved a lot
    • some test jobs that check resources of the cluster
    • then the real jobs come in with better parameters ( RAM, cores, etc)
  • xrootd cache
    • change the load pattern
    • still some problems with xrootd cache
    • it can take one funding cycle to adjust the T1 to handle a lot of random io
  • next release of ARC - support for S3 uploads
    • used by atlas for event service
    • all the job output after every event is written to an object store - the job is checkpointed and can be killed at any time
    • does NDGF have any plans to support S3?
    • the problem is all these tiny files - POSIX is not so good at that
    • the jobs already use the S3 protocol - it would be easier to continue using that (a rough sketch of this write pattern follows after this list)
    • not used much yet. What are the transfer rates?
      • BOINC, some cloud rounds?
    • written once and then read once
    • a stream of events is the current hype: Apache Spark & Storm & Ignite
    • DESY approaching 1kHz file access in dCache
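
For illustration, this is roughly what the event-service write pattern looks like over the S3 protocol (shown with boto3); the endpoint, bucket and credentials are placeholders, not an existing NDGF service:

 # Write one small object per processed event to an S3-compatible object store:
 # written once, read once later.
 import boto3

 s3 = boto3.client(
     "s3",
     endpoint_url="https://objectstore.example.org",   # hypothetical endpoint
     aws_access_key_id="ACCESS_KEY",
     aws_secret_access_key="SECRET_KEY",
 )

 for event_id, payload in [("evt-000001", b"..."), ("evt-000002", b"...")]:
     s3.put_object(Bucket="atlas-eventservice",
                   Key=f"job-1234/{event_id}", Body=payload)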

dCache pool upgrade

  • how to coordinate - how to take one pool down per site at a time?
  • ssh login to the arctic pools as the dcache user for NDGF staff (non-root)
    • ansible?
    • use a tar ball instead of rpm
    • Java - that can also be done with a tarball
  • sites would still do OS/firmware updates
  • sites getting shut out
    • mostly an issue for sites that run other dCache installs.
      • not an issue if the admin is also an OOD - a good thing
  • "not middleware but underwear" - Gerd
  • direct access
  • user traceability concern - if the dcache user is used, how to track ssh key access?
    • signed RPMs/debs with the ssh pubkeys
    • static master key or per user keys
    • site specific as sites probably want to use puppet or the like for this
  • endit and tape pools
  • need to get an OK from company security to allow externals to SSH in

AP:

  • a number of sub-action points per site to investigate locally
  • keep track of the above APs

hackathons

  • ARC CE and Ganglia
    • Old issue: All CE and cache servers should turn up in chaperon ganglia


  • ELK and SGAS
    • SGAS has succeeded and failed jobs
    • Easy to get those in there.
    • Future
      • State changes in graphs. How many jobs are starting and how many are stopping, failing.
      • rate of jobs
      • more real-time data
    • Put the ARC logs in there too

Possible action points:

  • do we want multiple events per single job?
    • if so, what would an Elasticsearch query/graph look like?
  • In general, what do we want to query?
  • data structure in es
  • one can update docs in Elasticsearch; this could be used to update a previously indexed event as a job moves through its stages (see the sketch at the end of this list)
  • how to get the ARC logs to es?
    • rsyslog would be nice. Which output plugin? One that is in EPEL or similar would be nice. Debian considerations needed?
  • Create an ARC feature request - JSON output
    • this should include a proposed format of the JSON
  • we guesstimated that around 10 ARC CEs could potentially send logs - need a retention period and a size estimate.
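
A small sketch of the update idea, assuming one Elasticsearch document per job keyed by job ID; the index name, job ID and fields are examples:

 # Keep one document per job and update it in place as the job changes state.
 from elasticsearch import Elasticsearch

 es = Elasticsearch(["http://localhost:9200"])

 def record_state(job_id, state, **extra):
     """Upsert the job document and overwrite its current state."""
     es.update(
         index="arc-jobs",
         id=job_id,
         body={
             "doc": {"state": state, **extra},
             "doc_as_upsert": True,   # create the document the first time it is seen
         },
     )

 record_state("abc123", "ACCEPTED")
 record_state("abc123", "FINISHED", exit_code=0)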