E3DDS OperationalProfiling

Operational Profiling

This is an update of E3DOperationalProfiling (EISCAT_3D Operational Profiling).

Overview

“Operational profiling” of the EISCAT 3D e-infrastructure

There is a chicken-and-egg situation when it comes to describing the various operational, computational and storage needs of the e-infrastructure for the EISCAT 3D project (both for the Operational Centre and for the Data Centres): neither the antenna systems nor the hardware and software for controlling them and analysing the data exist yet, and nor do the systems providing users with access to the data and with resources for further analysis.

To prepare and plan the necessary computational and storage resources, foresee their usage, and produce RFT documents for sites that will host the Operational Centre and the (at least) two Data Centres, it is important to establish the best possible picture of the anticipated “usage profiles” for the centres, i.e. to estimate and describe as many as possible of the operational features and aspects of both the Operational Centre and the Data Centres.

This document lists topics considered important to understand as well as possible with respect to the SLA requirements and the usage of the envisaged computational and storage resources at the Operational Centre and the Data Centres. The list is far from exhaustive, and this document should be seen as a brainstorming exercise discussing some of the questions that need to be answered.

Total Cost of Ownership / CAPEX / OPEX

List of issues influencing TCO / CAPEX / OPEX over the lifetime of the project (an illustrative sketch of how these items combine is given after the list):

  • Tender work and planning
  • Periodic evaluation of delivery and performance
  • HW costs: initial plus a number of upgrades
  • SW licenses
  • Maintenance costs
  • Electricity
  • Cooling
  • Server room
  • Network rental / access
  • Manpower: Site computing and network maintenance, Operational Centre, Data Centres and EISCAT control centre.
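To illustrate how these items combine over the project lifetime, a minimal cost sketch is given below. All figures, category names and the planning horizon are hypothetical placeholders, not EISCAT estimates.

 # Hypothetical TCO sketch: one-off CAPEX items plus recurring annual OPEX,
 # summed over an assumed project lifetime. All numbers are placeholders.
 
 LIFETIME_YEARS = 10  # assumed planning horizon
 
 capex = {                      # one-off costs (EUR)
     "tender_and_planning": 200_000,
     "initial_hardware":    3_000_000,
     "hardware_upgrades":   2 * 1_500_000,  # assume two mid-life upgrades
 }
 
 opex_per_year = {              # recurring costs (EUR/year)
     "software_licenses": 100_000,
     "maintenance":       150_000,
     "electricity":       250_000,
     "cooling":           100_000,
     "server_room":        80_000,
     "network_access":     60_000,
     "manpower":          600_000,
 }
 
 tco = sum(capex.values()) + LIFETIME_YEARS * sum(opex_per_year.values())
 print(f"Illustrative TCO over {LIFETIME_YEARS} years: {tco / 1e6:.1f} MEUR")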

Many of the topics discussed above should be considered from the point of view of service descriptions and service level agreements:

  • Which service management framework will be used (hopefully FitSM), which service management processes will be taken into use, and who maintains the EISCAT ITSM database and tools?
  • What is the required service time (office hours or something else), and are there known exceptions to normal service hours?
  • What kind of reliability and availability are needed from each service component? (A numeric sketch follows after this list.)
  • Can normal office hours be used for maintenance work, and can we have regular service breaks?
  • What kind of durability (i.e. probability of not losing data) is required from the storage used in the various service components?
  • Data provenance requirements, i.e. do we have to be able to prove from which datasets, and how, data products have been created?
  • Recovery time objectives (RTO): if something catastrophic happens, how quickly should the service or service components be operational again?
  • IT security requirements
  • IT hardware change management: what kind of IT hardware changes should the service provider be prepared to make?
  • Who will own the IT hardware, and who will be responsible for asset management, including service contracts, insurance, etc.?
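To make the reliability, availability and durability questions above concrete, the sketch below shows how such targets translate into numbers. The component availabilities, loss probabilities and replica count are purely illustrative assumptions.

 # Illustrative reliability arithmetic for the SLA questions above.
 # All component figures are assumptions, not measured or required values.
 
 HOURS_PER_YEAR = 8766
 
 # Availability of a chain of serially dependent components
 # (the service is up only if every component is up).
 component_availability = [0.999, 0.995, 0.999]  # e.g. storage, compute, network
 service_availability = 1.0
 for a in component_availability:
     service_availability *= a
 
 downtime_hours = (1 - service_availability) * HOURS_PER_YEAR
 print(f"Serial availability: {service_availability:.4f} "
       f"(~{downtime_hours:.0f} h/year downtime)")
 
 # Durability: probability of not losing a data object kept as N independent
 # replicas, each with an assumed annual loss probability p.
 p_loss_single = 1e-3  # assumed annual loss probability per replica
 replicas = 2          # e.g. two Data Centres
 durability = 1 - p_loss_single ** replicas
 print(f"Durability with {replicas} replicas: {durability:.6f}")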

Organisation

  • Is it desirable to have a tender for a “complete e-infrastructure solution”, covering the Operational Centre and all Data Centres?
  • Or would bids for separate parts (the Operational Centre and each of the Data Centres) of the solution also be an option?
  • Is it desirable to have the “complete e-infrastructure solution” provided by a consortium or by individual institutions?
  • And what about “cross-border” versus “single-country” partnerships / consortia?


Site-local buffer

This will be described in D1.

  • A 20 PB file system is required
    • Is 20 PB of disk really required?
    • Where should this be located?
    • How will it be used, and what will it be used for?
    • A shared file system for all computational nodes?
    • I/O demands and profile
    • Bandwidth requirements (a first-order estimate is sketched below)
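The I/O and bandwidth questions above can be turned into first-order estimates, as in the sketch below. The buffer capacity is taken from the list; the retention period and the fill pattern are hypothetical assumptions.

 # First-order bandwidth estimate for the site-local buffer.
 # Retention period and fill pattern are assumptions for illustration only.
 
 PB = 1e15  # bytes
 
 capacity_bytes = 20 * PB  # buffer size under discussion
 retention_days = 90       # assumed time before data is overwritten
 
 # If the buffer is filled once per retention period, the sustained write
 # bandwidth needed is capacity / retention time.
 seconds = retention_days * 24 * 3600
 write_bw_gbit_s = capacity_bytes * 8 / seconds / 1e9
 print(f"Sustained write bandwidth: ~{write_bw_gbit_s:.0f} Gbit/s")
 
 # Simultaneous read-out for analysis roughly doubles the aggregate I/O
 # the file system must sustain.
 print(f"Aggregate I/O with concurrent read-out: ~{2 * write_bw_gbit_s:.0f} Gbit/s")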

Comments (Roy): The specifications should not refer to specific technologies such as disk or tape. They should rather specify classes of storage profiles and their respective access patterns (a possible encoding is sketched after the list):

  • Things that will affect pricing:
    • Bandwidth:
      • Write bandwidth
      • Read bandwidth
      • Does one need simultaneous sustained read and write performance?
    • Access patterns:
      • Streaming I/O (cheap, easy to implement)
      • Random I/O (expensive, hard to do in clustered environments)
      • Maximum wait time to access data.
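One way to express such technology-neutral requirements is as a small declarative structure per storage class, as sketched below. The class names, capacities, bandwidths and latencies are illustrative assumptions, not EISCAT requirements.

 # Illustrative, technology-neutral storage profile specification.
 # All field values are placeholders.
 
 from dataclasses import dataclass
 
 @dataclass
 class StorageProfile:
     name: str
     capacity_pb: float
     write_bw_gbit_s: float       # sustained write bandwidth
     read_bw_gbit_s: float        # sustained read bandwidth
     concurrent_rw: bool          # simultaneous sustained read + write needed?
     access_pattern: str          # "streaming" or "random"
     max_access_latency_s: float  # maximum wait time to access data
 
 profiles = [
     StorageProfile("fast-access",  5.0,  40.0,  40.0, True,  "random",       1.0),
     StorageProfile("archive",     15.0,  10.0,  10.0, False, "streaming", 3600.0),
     StorageProfile("scratch",      0.5, 100.0, 100.0, True,  "random",       0.1),
 ]
 
 for p in profiles:
     print(p)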

Experiments command and control

  • Instruments are controlled by EROS software.
  • Master Event List distributed by database.
  • Configurations distributed by database.

General operations

Experiments will be scheduled in a distributed Master Event List (MEL) at a resolution of seconds to minutes. See D1.
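As an illustration of what a distributed MEL entry might carry, a hypothetical record layout is sketched below. The field names and values are assumptions and do not reflect the actual EROS software or the MEL database schema.

 # Hypothetical Master Event List (MEL) entry, illustrating the kind of
 # information a distributed schedule record might carry. Field names are
 # assumptions, not the actual EROS or MEL database schema.
 
 from dataclasses import dataclass
 from datetime import datetime, timedelta
 
 @dataclass
 class MELEntry:
     experiment_id: str     # identifier of the experiment to run
     configuration_id: str  # reference to a configuration stored in the database
     start_time: datetime   # scheduled start (seconds-to-minutes resolution)
     duration: timedelta
     priority: int          # tie-breaking when entries overlap
 
 entry = MELEntry(
     experiment_id="exp-0001",
     configuration_id="cfg-0042",
     start_time=datetime(2026, 1, 1, 12, 0, 0),
     duration=timedelta(minutes=30),
     priority=1,
 )
 print(entry)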

Data Centres

For the Data Centres, a 50 TFLOPS compute resource and approximately 10 PB of initial storage (for the first three years) have been specified. The Data Centre should be (at least) duplicated to ensure data security in case of disasters. The number of users, concurrent and total, working with the data products is unknown. Various issues and questions to be answered:

  • After asking EISCAT, it is clear that 50 TFLOPS is more or less just a “nice number”, as their answer is:
    • If the demand is larger, users just have to wait in queue.
    • If the demand is less, there will be free resources (for others to use?).

So, is it really necessary to duplicate this computational resource? Probably not. This “uncertainty” is a good argument for incorporating these resources into an existing / larger HPC resource.

  • The 10 PB of initial storage must be duplicated, both meta-data and data. But must it be duplicated:
    • On equal HW systems?
    • With equal SW systems?
    • With equal and fully operational functionality?
    • With user access both places?
  • The total amount of storage will consist of three parts, X + Y + Z PB in total:
    • Fast access data: X PB
    • Archived data: Y PB
    • Scratch area: Z TB/PB

The numbers X, Y and Z, plus other details, must be specified further (an illustrative capacity calculation follows below).
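As a rough illustration of how the X + Y + Z split and the duplication requirement translate into raw capacity, a minimal sketch follows. The values for X, Y and Z, and the choice not to duplicate scratch space, are assumptions only.

 # Illustrative total-capacity arithmetic for the X + Y + Z split with full
 # duplication across two Data Centres. All values are placeholder assumptions.
 
 fast_access_pb = 3.0  # X: assumed
 archive_pb = 7.0      # Y: assumed
 scratch_pb = 0.5      # Z: assumed, and assumed not to be duplicated
 
 replicated = (fast_access_pb + archive_pb) * 2  # data (and meta-data) mirrored
 total_pb = replicated + 2 * scratch_pb          # scratch kept locally at each site
 
 print(f"Raw capacity across both Data Centres: ~{total_pb:.1f} PB")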

Comments (Roy):

The compute demand should be split in two: one part for the operations of the infrastructure (if needed) and one part for the scientific analysis of the data. One can expect that the operational part has instantaneous response demands (seconds? minutes?) and that most of the scientific part will be more ad hoc and can be operated in batch mode within the normal batch systems of current supercomputers. Being able to plan ahead will also be important for the Data Centres, as it makes it possible to reserve compute resources for specific scientific campaigns. An illustrative split of the compute budget is sketched below.
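The sketch below divides the 50 TFLOPS figure into an operational share with fast-response requirements and a batch share for scientific analysis. The split fraction and the job sizes are assumptions for illustration only.

 # Illustrative split of the 50 TFLOPS figure into an operational share and a
 # batch share. The split fraction and job sizes are placeholder assumptions.
 
 TOTAL_TFLOPS = 50.0
 operational_fraction = 0.2  # assumed share reserved for operational processing
 
 operational_tflops = operational_fraction * TOTAL_TFLOPS
 batch_tflops = TOTAL_TFLOPS - operational_tflops
 
 # If a typical analysis job needs 5 TFLOPS for 2 hours, the batch share
 # sustains roughly this many such jobs per day.
 job_tflops, job_hours = 5.0, 2.0
 jobs_per_day = batch_tflops / job_tflops * (24 / job_hours)
 
 print(f"Operational share: {operational_tflops:.0f} TFLOPS")
 print(f"Batch share: {batch_tflops:.0f} TFLOPS "
       f"(~{jobs_per_day:.0f} jobs/day of {job_tflops:.0f} TFLOPS x {job_hours:.0f} h)")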

Networking Questions