Glenna2/Team-Meeting-2017-11-15



Glenna2 team meeting

Minutes

Meeting Nov-15 2017 at 10:30-11:30 CET

Present: Ola Spjuth, Matteo Carone, Anders Larsson <anders.larsson@icm.uu.se>, Marco Capuccini <Marco.Capuccini@farmbio.uu.se>, Olli Tourunen, Risto Laurikainen, Aleksi Kallio, Michaela Barth

Channels: Google Hangouts: https://plus.google.com/hangouts/_/g4g5nyl5glc66wqgbqk4yjvb4ua

1. Review of meeting and News

Focus call on Kubernetes Spjuth Use Case

2. Action points - progress & issues

  • Arrange focus call on OpenShift (AP on Dan) with Staffan and Jonathan


3. Today's topics - issues to discuss:

  • Ola gave a presentation of KubeNow and Phenomenal

Why cloud in the life sciences?

  • Access to resources
    • Flexible configurations
    • On-demand
    • Cost-efficient?
  • Collaborate on international level
    • Publish/federate data
    • E.g. large sequencing initiatives, “move compute to the data”
  • New types of analysis environments
    • Hadoop/Spark/Flink etc.
    • Microservices, Docker, Kubernetes, Mesos

Microservices

  • Decompose functionality into smaller, loosely coupled, on-demand services communicating via an API (“Do one thing and do it well”)
    • Services are easy to replace, language-agnostic
    • Minimize risk, maximize agility
    • Suitable for loosely coupled teams
    • Portable - easy to scale
    • Multiple services can be chained into larger tasks

Software containers (e.g. Docker) are ideal for microservices!

PhenoMeNal

  • Horizon 2020 project, 2015-2018
    • Virtual Research Environments (VRE), Microservices, Workflows
    • Towards interoperable and scalable Metabolomics data analysis
    • Private environments for sensitive data

Project site: http://phenomenal-h2020.eu/

[Slide figure labels: DockerHub, Virtual Infrastructure, GitHub]


M. Capuccini:

  • EasyMapReduce has been developed for scientific applications.
  • High-throughput methods produce massive datasets, which are typically analyzed with frameworks like Spark and Hadoop.
  • Sometimes research groups cannot sustain the effort of reimplementing scientific tools in Spark or Hadoop.

EasyMapReduce aims to provide the means to run existing serial tools in MapReduce fashion. Many of the available scientific tools are trivially parallelizable, so MapReduce can be used to parallelize the computation.
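The idea of running an unmodified serial tool in MapReduce fashion can be sketched in plain Python. This is not the EasyMapReduce API: the "tool" below is a hypothetical stand-in (a one-line script that counts G/C characters on stdin), and the names `run_tool`, `split` and `map_reduce` are made up for illustration. Each partition is handed to a separate run of the tool (map), and the partial results are then combined (reduce).

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Hypothetical stand-in for an existing serial tool: it reads records on
# stdin and prints how many "G" or "C" characters it saw.
SERIAL_TOOL = [
    sys.executable, "-c",
    "import sys; print(sum(c in 'GC' for c in sys.stdin.read()))",
]


def run_tool(partition):
    """Map step: feed one partition to the unmodified serial tool."""
    result = subprocess.run(
        SERIAL_TOOL,
        input="\n".join(partition),
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


def split(records, n_partitions):
    """Cut the record list into independent partitions."""
    return [records[i::n_partitions] for i in range(n_partitions)]


def map_reduce(records, n_partitions=2):
    """Run the tool once per partition, then sum the partial counts."""
    partitions = [p for p in split(records, n_partitions) if p]
    with ThreadPoolExecutor() as pool:
        partial_counts = list(pool.map(run_tool, partitions))
    # Reduce step: combine the per-partition results.
    return reduce(lambda a, b: a + b, partial_counts, 0)
```

This only works because the tool treats every record independently, which is the "trivially parallelizable" property mentioned above.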

  • It is difficult for system administrators to maintain software that may need to be installed on every node of the cluster, in multiple versions.
  • Instead of running commands straight on the compute nodes, EasyMapReduce starts a user-provided Docker image that wraps a specific tool and all of its dependencies, and it runs the command inside the Docker container.
  • The data goes from Spark through the Docker container, and back to Spark after being processed, via Unix files.
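The file-based round trip described in the bullets above can be sketched as follows. This is a rough illustration, not EasyMapReduce's actual implementation: the mount point `/data`, the file names `input` and `output`, and the helpers `docker_command` and `process_partition` are all assumptions made for the example. The partition is written to a temporary directory, the directory is mounted into the container, the command runs inside the container, and the output file is read back.

```python
import os
import subprocess
import tempfile


def docker_command(image, command, workdir):
    """Build a `docker run` call that mounts workdir at /data (assumed path)."""
    return [
        "docker", "run", "--rm",
        "-v", f"{workdir}:/data",
        image,
        "sh", "-c", command,
    ]


def process_partition(records, image, command, runner=subprocess.run):
    """Send one partition through a containerized tool via Unix files.

    `runner` is injectable so the flow can be exercised without Docker.
    """
    # tempfile honours the TMPDIR environment variable.
    with tempfile.TemporaryDirectory() as workdir:
        in_path = os.path.join(workdir, "input")
        out_path = os.path.join(workdir, "output")
        # 1. Data leaves the driver process as a plain file.
        with open(in_path, "w") as handle:
            handle.write("\n".join(records) + "\n")
        # 2. The tool runs inside the user-provided image.
        runner(docker_command(image, command, workdir), check=True)
        # 3. Results come back as a plain file.
        with open(out_path) as handle:
            return handle.read().splitlines()
```

With Docker available, a call such as `process_partition(lines, "ubuntu", "rev /data/input > /data/output")` would reverse every line, assuming the chosen image ships `rev`.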

If the TMPDIR environment variable on the worker nodes points to a tmpfs (an in-memory filesystem), very little overhead will occur.
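One way to check this on a worker is to look up which filesystem backs the temporary directory. The sketch below is Linux-only and assumes the mount table is readable at `/proc/mounts`; the function names are made up for illustration. Python's `tempfile.gettempdir()` consults TMPDIR first, so it reports the directory that temporary files would actually land in.

```python
import os
import tempfile


def filesystem_type(path, mounts_text):
    """Return the fs type of the longest mount point containing path."""
    path = os.path.realpath(path)
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip malformed lines
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        # Prefer the most specific (longest) matching mount point.
        if (path + "/").startswith(prefix) and len(mount_point) > len(best_mount):
            best_mount, best_type = mount_point, fs_type
    return best_type


def tmpdir_is_tmpfs():
    """True if the directory used for temporary files is backed by tmpfs."""
    with open("/proc/mounts") as handle:  # Linux-specific assumption
        return filesystem_type(tempfile.gettempdir(), handle.read()) == "tmpfs"
```

Running `tmpdir_is_tmpfs()` on each worker node before a job gives a quick indication of whether the intermediate files will stay in memory.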