Dellingr Phase 2 kick-off meeting minutes

NeIC Dellingr Phase 2 - Kick-Off meeting, October 3, 2017, Reykjavik, Iceland

Attendance

Tomasz Malkiewicz, NeIC

Jørn Amundsen, Sigma2

Kurt Nielsen, DeiC

Hans Karlsson, SNIC

Kine Nordstokkå, NeIC

Gudmund Høst, NeIC

Jens Svalgaard Kohrt, DeiC

Ebba Þóra Hvannberg, RHnet

John White, NeIC

Hjörleifur Sveinbjörnsson, RHnet

Per-Olov Hammargren, SNIC

Vera Hansper, CSC

Known absences: Juha Fagerholm, Petri Nikunen, Bjørn Lindi, Rob Pennington

Goals of the meeting

  • Plan who does and delivers what, and when, during the first Phase 2 project year (or until the project's next annual meeting).
  • Ensure everyone can start working right away after the meeting
  • Engage research community in Iceland
  • Build chemistry within the project

Intended Audience

  • Dellingr team/staff.
  • Dellingr steering group.
  • Icelandic research community representatives.
  • NeIC director and administrative coordinator.


Agenda

Welcome (Tomasz)

Resource Sharing, Nordic eScience Action Plan and NeIC Strategy (Gudmund)

Dellingr Phase 1 outcomes (Tomasz)

Action: An executive summary of the Phase 1 DOs needs to be written for the SG. (Tomasz will do this, as he was present for the DO writing and discussions.)

  • Question about User Support in the Dellingr project: can the installation of software be counted toward the contracted FTEs? Can contacts with legal people also be counted toward the FTEs?
  • There has not been any contact with the legal contact person (Kristin Lyng at met NO), as the questions to be solved are not yet clear. (These come into some of the Phase 2 deliverables.) The Nordic Met Office collaboration managed to sort out the VAT issue (see later in the minutes).

NeIC project cycle, https://wiki.neic.no/wiki/Project_process 15' (Kine)

Questions:

Q: How are these health checks done?

A: They look at findings, not necessarily the deliverables. This is performed by the SG. There is a checklist available.

Q: Where is Dellingr now?

A: Dellingr is at DP4. For SE. The project plan is approved by SG. (Correct). This project is a bit different as it moves from Phase 1 to 2.

Q: How about NT1?

A: NT1 is a different case, but it tries to use the same PPT concepts. It has a project manager and a sort of SG (or reference group).

Q: How about for Dellingr... the midway report to the board?

A: This should be done in approx 1 year (Sept 2018?)

There will be a PPS training for the Steering Group on 11 January 2018 at Arlanda.

Dellingr Phase 2 overview & status (John)

(Notes needed?)

Community & NeIC partner presentations (people, vision for Dellingr Phase 2) (Ebba, Hans, Jørn, Kurt, Vera, Tomasz)

  • IS: Ebba: We would like the ability to provide resources transparently, with information on what resources are available alongside the national offerings. Also federated authentication (e.g. eduGAIN?) that would work throughout the Nordics. Strategies for technical support to scientists. Accounting of the usage of the resources is important.
  • SE: Hans: We have the new resources being offered through LUNARC. Determining the preconditions for sharing is also important. One basic question is: what is HPC, or computing in general? We would like to see regulations and routines documented.
  • NO: Jørn: Again, we see AAA as important. We are trying to find paths for solving obstacles. In HPC in the Nordics we are doing the same thing over and over in parallel; we should share effort. We are looking at technologies such as containers and queuing: using containers should make it easier to move software between HPC installations. Quotas are to be managed by national granting bodies.
  • DK: Kurt: In Denmark we are discussing a new HPC strategy for DeiC (discussed at the DeiC meeting last week), so we are currently not able to say what the Danish plans for the future are. We see Dellingr as a useful route to international cooperation.
  • FI: Vera: Agree with a number of points above and would like to add that this is also a good opportunity for Nordic sharing into the future. Would envisage a pool of resources provided by the partners which applicants can apply for. These could be assessed by a small committee from the partner resource allocation groups.
  • NeIC: Tomasz: NeIC sees Dellingr as a key part of the overall strategy. Also important for working with other projects on subjects such as resource sharing and policies.

Group work (everyone, facilitated by John and Tomasz)

Group work outcome discussion (everyone, facilitated by John and Tomasz)

Proof of concept through research driven pilots track

DO1:
  • Need to benchmark all the available resources that are used in the pilot. What is the most basic benchmark: HPL or HEPspec?
  • Produce a list of the types of compute resources:

- CPU (normal) ... amount of RAM/core.

- GPU different types.

- High-memory nodes.

Need to calculate the cost of GPU relative to CPU. How do we normalize the resource costs? They need to be normalized to the cost of a basic node, i.e. the nothing-special situation.

  • Storage...

- How much and how long?

- Back-up? Responsibility for backups on user?

- Local rules for a short-term project.

  • A timeline for the current call is to be set and made visible on the website. The current call will close on December 31st 2017. Allocated CPU hours are to be available for use until June 30th 2018.
  • Research data generated within the frame of the call will adhere to the local policies of the national providers. Data generated must be removed by Sept 2018 (or at least its retention is not guaranteed by the local provider after that).
  • User Support: this is provided by the guest provider. Pilot participants are to receive the same user support as a normal national user. Ticketing systems are also to be in place to calculate the effort spent on user support.
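
The normalization to a basic node raised under DO1 could be sketched roughly as follows. This is only an illustration, not an agreed method: it assumes that a per-node benchmark result (for example from HPL) is used as a stand-in for relative cost, and the node names and scores are hypothetical values chosen to make the arithmetic easy to follow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeType:
    """A class of compute resource offered in the pilot (illustrative only)."""
    name: str
    benchmark_score: float  # per-node benchmark result; units cancel out in the ratio


def normalization_factor(node: NodeType, basic: NodeType) -> float:
    """How many basic-node hours one hour on `node` is worth."""
    if basic.benchmark_score <= 0:
        raise ValueError("basic node needs a positive benchmark score")
    return node.benchmark_score / basic.benchmark_score


def to_basic_node_hours(hours: float, node: NodeType, basic: NodeType) -> float:
    """Convert hours used on `node` into equivalent basic-node hours."""
    return hours * normalization_factor(node, basic)


# Hypothetical scores, not measurements from any Nordic system.
basic = NodeType("cpu-normal", benchmark_score=100.0)
gpu = NodeType("gpu", benchmark_score=400.0)
high_mem = NodeType("high-memory", benchmark_score=150.0)

for node in (basic, gpu, high_mem):
    print(node.name, normalization_factor(node, basic))
```

With such factors, an allocation on any node type can be accounted for in a single unit (basic-node hours). Whether a benchmark ratio is an acceptable proxy for cost is exactly the open question noted above.
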
DO2: "Resource exchange analysis"
  • Question: Do we continue the pilot?

- VH: What do we want to get out of the pilot?

- Experience in admitting/registering/supporting users across borders?

- If we see there's a need then we should extend but there could be a gap.

- Need to consider this before the end of the pilot.

- For the pilot to continue we need to get more pledges of hours.

- An additional pilot may be launched following the evaluation of the current pilot.

  • The evaluation is important...

- Also the feedback from the users (part of DO2) (Should start on the evaluation as soon as possible)

- How about interviews with the applicants? (Also possible, probably before the computing ends though.)

DO3: "Pilot projects: NLPL and additional pilots"

- The NLPL is the NO participation in the pilot.

- An example of a similar "pilot" is the Nordic Met offices collaboration.

- What are other pilots?

- Physics...

- Glaciology.

- Climate Science.

- eSTICC (has applied to be a pilot)

- Still needs further investigation.

DO4: "Resource exchange implementation and agreement"
  • What are preconditions? (conditions to be in place before agreement?)
  • Input from PH (SE):

- Legal. What are the legal positions of each of the countries? (E.g. we (Dellingr pilot) have this issue with university vs. non-university.)

- Conditions from the national funders.

- We need a good timeline for the call.

- Do we need an allocation committee?

- Or do we form a committee for projects smaller than a certain size?

  • JW: Is this deliverable work more for the SG rather than PG?

- The SG is going to have to be more involved with this one as the issues are at a higher level than just technical.

  • Some considerations for an "allocation" committee (maybe we need a better name for this?)

- Do we judge the larger-scale projects on merit?

- Should we ask applicants why they want to run their code somewhere else?

- Scale of the new projects? Should be considered later on in the project.

- Could ask for larger scale next year?

- Should this deliverable be moved to June 2019? (Tentatively yes)

DO5: "User authentication, authorisation and accounting"

Maybe this needs to be moved earlier as this needs to be roughly in place for the second call. But the text states "Mid-term progress report in June 2018, expectation is that a system is in the testing phase."

DO6: "Nordic availability of shared resources"

Do we need more time?

High level planning and coordination track

DO7: Establishment of high-level information exchange mechanisms

Suggest 2 WP:

1. Benchmark all known HPC resources in the Nordics using the standard developed by Bencheit/DeiC (which includes spending and numbers on HPC resources). Includes 4 recent reports on HPC in the EU: the HYPERION report, IDC report, HIPEAC report and BDEC report. The benchmark includes manpower and can be used for analyzing cost savings.

Responsible: Kurt, timeframe: November 2017

2. Establish a forum of national HPC providers, led by the NeIC project manager. National HPC providers in the Nordics are to be invited to discuss sharing policies.

Members:

- High-level leadership, general terms: director of SNIC, DeiC, Sigma2, CSC and IS IT, NeIC (one might want to anchor it first with the board)

- Forum of experts, less formal internal meeting: people running the procurement exchanging their experiences

Subjects:

- Sharing policy for the forum

- Supplement to PRACE, exchange in Nordics

- Forum of experts, less formal internal meeting: people running the procurement exchanging their experiences

Responsible: John, timeframe: spring 2018

DO7 to be established before DO8.

DO8:

Proposal: to be reformulated:

New formulation: Investigate how other collaborations, e.g. Met Offices and WLCG, have solved cross-border procurement.

Met Offices

- Contact lawyers and get from them the project status from their perspective, especially regarding VAT. (John)

- Check with the Icelandic Met Office and get some info about what has been agreed. (Hjörleifur)

NT1

The pledge share is the "6% of ATLAS and 9% of ALICE Tier-1 resources (or ALICE's fair share percentage), split by author key"

- The MoU for NT1 is with the funding agencies of the individual countries (DK, FI, IS, NO, SE).


  • A long-term collaboration is already established, based on HEP specs. What can Dellingr Phase 2 add?
  • A model for exchanging CPU cycles for storage (tape, dCache)?


Planning for next 12 months (October 2017 - September 2018): What, Who, When (everyone, led by John)

DO1

HPL is to be used to benchmark the resources used in the pilot.

General note: we discussed the timeline for this.

The due date of March 2018 is quite tight.

John to compile the instructions based on the project group meeting.

DO2

General note: For this DO we discussed: What do we want to measure via the pilot? What do we want to know? What indicators do we provide to the users? I can’t remember the exact wording though.

DO3

DO4

The report which was written by E&Y in 2016 regarding VAT is to be followed up via a meeting between the project group and E&Y.

For DO4, three types of preconditions need to be investigated:

Policy preconditions: What is the opinion of national funders, ministries, et al. regarding resource sharing?

Legal preconditions: Limits placed on the usage of funds, sharing across borders, and VAT.

Practical preconditions: To facilitate sharing of resources across borders, a process from call to allocation decision needs to be established.

DO5

- Maybe this needs to be moved to earlier?

- DO5 needs to precede DO4 (they need to be switched in the project plan)

- What should be the date here (TBD by me?)

DO6

The current Dellingr pilot is open until 31.12.2017; computations are to be completed by 30.06.2018. A new pilot or continuation can run immediately after the current pilot ends, or a bit later. The pool of shared resources can be expanded further according to need.

A call for HPC resources is to be made in the same way as other calls for HPC resources, e.g. the national calls and PRACE DECI. It can be expanded in the future to a grand challenge call. The model of an open call with no resources allocated will not be used.

It is to be noted that DO6, i.e. continuous availability of Nordic shared resources, cannot be completed unless the preconditions of DO4 are met.

The date for this DO is to be moved to June 2019.

DO7

DO8

Set of current and proposed tools facilitating resource sharing

- Vera Hansper, Web Portal

- Bjørn Lindi, Metacenter Administration System slides

- Petri Nikunen, Haka slides

National thematic session: Iceland and resource sharing, engaging the research community

  • Jesús Zavala Franco, assistant professor, Centre for Astrophysics and Cosmology describes a use case/need
  • Helmut Neukirchen, professor of computer science, University of Iceland, describes a use case
  • Elvar Örn Jónsson, post-doctorate in chemistry, University of Iceland
  • Viðar Guðmundsson, professor in physics, University of Iceland
  • Egill Skúlason, professor in chemistry, University of Iceland

Group work: Developing Icelandic use cases for Dellingr

  • Jesús Zavala Franco, assistant professor, Centre for Astrophysics and Cosmology describes a use case/need (Vera, Hjörleifur)

- Runs can be snapshotted; the simulations cover long time scales.

- A 10^9-element simulation would need about 1000 cores, so about 10^6 elements per core. It would need to run for about 10^6 core hours.

- The issue is that current systems in Iceland are too small for their needs.

- Researchers can benefit by getting access to systems where *big* jobs can run more frequently.

- Could be a good candidate for a pilot run (say 100k to 200k core hours).

- Is a good use case for longer-term Dellingr as a service.


  • Elvar Örn Jónsson, Per-Olov, John.

- In general “Yes”. 50k CPU (core) hours and some tens of GB of data.

- The processes are interruptible i.e., checkpointed at each step.

- Once in production can run for weeks (this will happen later on).

- Open-source software: GPAW with standard libraries. Please apply.

- This would be very useful for the testing and development stages of this project.

- Currently they have to wait days or a week to run a test.

- Access to resources will speed up the development process.

- Can apply to later pilot also.

  • Viðar Guðmundsson, GPU needs

- Communication by e-mail.

- GPUs available from Denmark.


  • Egill Skúlason: (Jørn, Jens)

- General HPC use case, which should fit in on most HPC installations.

- Can do snapshots and run only one node or several nodes.

- When using Gardar, his group applied for 500k CPU hours per half-year. The group runs a large number of one-node jobs, which are interruptible and last for 1-14 days.

  • Helmut Neukirchen:

- Climatology simulation, using software that is only installed at CSC; has already applied for 200 000 core-hours.

- The project is sponsored by NordForsk.

- Presentation: [1]

- Code needs to run on CSC's Sisu (the only cluster where the proprietary HiDEM discrete element model is available).

- Time frame might be a problem: resources may not be available soon.

- Less than the 200 000 CPU h applied for would also be fine!

- In the context of NordForsk-funded project eSTICC => NordForsk also has to keep funding of CPU hours in mind!
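
The figures quoted for the astrophysics use case can be checked with a few lines of arithmetic. The sketch below uses only the numbers from the notes above; the variable names are illustrative, and the wall-clock estimate assumes that all 1000 cores are kept busy for the whole run, which real jobs will not fully achieve.

```python
# Figures taken from the use-case notes; variable names are illustrative.
ELEMENTS = 10**9                        # size of the simulation
CORES = 1_000                           # cores the simulation would need
TOTAL_CORE_HOURS = 10**6                # estimated cost of the full run
PILOT_CORE_HOURS = (100_000, 200_000)   # suggested size of a pilot run


def wall_clock_hours(core_hours: float, cores: int) -> float:
    """Wall-clock time, assuming every core is busy for the whole run."""
    return core_hours / cores


elements_per_core = ELEMENTS // CORES
full_run_hours = wall_clock_hours(TOTAL_CORE_HOURS, CORES)
full_run_days = full_run_hours / 24
pilot_share = tuple(p / TOTAL_CORE_HOURS for p in PILOT_CORE_HOURS)

print(elements_per_core)                 # elements handled by each core
print(full_run_hours, round(full_run_days, 1))
print(pilot_share)                       # pilot size as a fraction of the full run
```

Under these assumptions the full run occupies 1000 cores for about 1000 hours (roughly six weeks), and a pilot of 100k to 200k core hours covers 10 to 20 percent of it.
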

Wrap-up

  • Good meeting
  • Updating Project Plan and sending to the SG
  • Approval via e-mail
  • Annual meeting for Dellingr SG + PG
  • To be added to the project budget

Actions

  • PM to update the project plan deliverables according to the discussions above.
  • Tomasz to write executive summary of the Phase 1 deliverables.