Nordic SLURM Workshop 2016

This page is used for planning and organizing a Nordic SLURM Workshop. All information displayed represents the current status, and is not meant to limit the scope of topics and audience.

Dates, Venue, Admission & Accommodation

  • to be decided

Purpose & Goals

  1. educating cluster administrators
  2. sharing experiences
  3. targeting problems/challenges

Topics of interest

Note: the topics are not ordered, and their assignment to categories (ACCOUNTING, SYSTEM DESIGN, ...) may be imperfect due to my limited knowledge.

  • T0 - SITE REPORTS
    • every site gives a very short status report covering its Slurm usage, most pressing issues, and expectations for the workshop
  • T1 - ACCOUNTING
    • discuss how best to use SLURM's accounting features to limit the number of CPU/memory hours projects can use in a given period. How much work would it be to modify SLURM to a) do this at the same time as using fair share priorities, and b) be more flexible wrt. the period start/end dates?
    • software accounting (which software is running on our clusters)
  • T2 - SYSTEM DESIGN
    • heterogeneous/mixed clusters (traditional CPUs, GPUs, KNL, ...?)
  • T3 - PARALLEL JOBS
    • learn more about how best to run Open MPI and Intel MPI jobs in SLURM (when using cons_res to hand out CPUs and memory), especially wrt. binding jobs/job steps to CPUs.
  • T4 - SCHEDULING
    • fairshare with SLURM in the context of multiple user groups accessing the same set of nodes
      • complex fairshare strategies based on time, cputime, I/O usage, ...
      • alternative schedulers
      • best solutions for jobs with different lengths/requirements: more queues? better scheduling algorithm?
    • Has anyone looked into what it would take to implement priorities on accounts?
  • T5 - ENFORCING RESOURCE LIMITS
    • prolog and epilog common use cases
      • how to secure processes running on nodes, e.g. how to prevent forked Unix processes from escaping the control of the batch system
      • how best to limit the memory and CPUs that jobs and job tasks have access to
      • learn more about how to configure SLURM wrt. sockets, cores, hyperthreads and memory (when using cons_res to hand out CPUs and memory)
  • T6 - JOB ENVIRONMENT
    • prolog and epilog common use cases
      • how to sanely initialize the environment in which the job runs: environment variables, temporary folders, where to store results
    • Are others using any custom Spank plugins and/or advanced job submit scripts?
  • T7 - SLURM DEPLOYMENT & MAINTENANCE
    • cleanup and maintenance of the usage database: backing up statistics, cleaning up to reduce storage usage, resetting job IDs
    • how do sites deploy SLURM?
    • is anyone running backup controllers? How is this done with slurmdbd and its MySQL database?
  • T8 - SLURM SOURCE CODE
    • SLURM code quality: Is it declining, with newer versions having more bugs? Are reported bugs getting fixed? Are reported bugs getting fixed if you have a support contract? Do others have local patches to Slurm? How are your patches maintained?
  • T9 - SLURM TOOLS
    • useful additional software: for example, the sview command does not come with the SLURM distribution, so I wonder whether anyone is keeping track of such additional tools
  • T10 - USER EXPERIENCE
    • tools to simplify the user experience with SLURM commands
  • T11 - SUPPORT
    • Does anyone have experience with buying a support contract or similar from SchedMD? (http://schedmd.com/#services) What does it cost, and what level of support do you get?

Proposed Talks & Session Chairs

Talks

  • T0 - SITE REPORTS
    • short presentations (5 min max each) by every site, including status, most pressing issues, and expectations for the workshop
  • T1 - ACCOUNTING
    • Jens Svalgaard Kohrt, Accounting at SDU, 5-10 min
  • T7 - SLURM DEPLOYMENT & MAINTENANCE
    • Johan Guldmyr, Deploying a SLURM cluster with ansible [1], 5-10 min
  • T10 - USER EXPERIENCE
    • N.N. (Aalto University), Aalto's 'slurm' utility - a user-friendly interface to the original SLURM commands, 10 min

Session Chairs

  • one of T4, T5 or T6: Florido Paganelli

Contact List of Potentially Interested People

Listed below are those who have confirmed that they are potentially interested. If you know someone who is missing, please send me (Thomas Röblitz) a note.

  • Ulf Tigerstedt, CSC, FI
  • Erik Edelmann, CSC, FI
  • Johan Guldmyr, CSC, FI
  • Luís Alves, CSC, FI
  • Janne Blomqvist, Aalto University, FI
  • Ivan Degtyarenko, Aalto University, FI
  • Mikko Hakala, Aalto University, FI
  • Simo Tuomisto, Aalto University, FI
  • Bjørn-Helge Mevik, USIT, NO
  • Dmytro Karpenko, USIT, NO
  • Steinar Trædal-Henden, UiT, NO
  • Magnus Jonsson, HPC2N, SE
  • Roger Oscarsson, HPC2N, SE
  • Robert Grabowski, Lunarc, SE
  • Tore Sundqvist, Lunarc, SE
  • Florido Paganelli, Lund, SE
  • Pär Lindfors, NSC, SE
  • Jens Svalgaard Kohrt, SDU, DK
  • Dejan Lesjak, IJS, SI