OOD Description

This document is not for production use yet. It documents the future OoD (Operator on Duty) service for NT1.

Service Description

The Operator on Duty (OoD) is a service offered by the Nordic WLCG Tier 1. The service oversees the daily operation of the distributed Tier 1 according to defined procedures. The OoD team is responsible for monitoring and solving problems within the Tier 1, coordinating between and supporting participating sites, and maintaining daily contact with CERN, WLCG, EGI, Tier 1 users, NORDUnet and other Nordic NRENs. The team responds to any operational request or alert, ensuring that problems and requests are properly recorded and progress in a timely fashion.

The team coordinates internal as well as external operational tasks, e.g. by ensuring that maintenance windows are scheduled at times convenient for all stakeholders, are announced and communicated, and are recorded in monitoring tools, and by tracking operational action points.

The team participates in regular operational virtual meetings within WLCG and EGI. The OoD service adds value by freeing participating sites from having to track and participate in these frequent operational meetings, relaying information to and from participating sites as needed.

Likewise, the team isolates the Tier 1 user communities from the internal distribution of the Tier 1, providing a uniform public face for the Tier 1.

The Operator on Duty service covers the EGI Regional Operator on Duty service in accordance with defined EGI procedures for the NDGF NGI, with similar responsibilities for EGI sites in the Nordic and Baltic area.

The team reports on the status of the Tier 1 during weekly NeIC NT 1 staff meetings.

The service is operated by NeIC NT 1 operations staff, manned in shifts during regular office hours, and on call during daytime on weekends and holidays. The role rotates among the team. At any other time, the NORDUnet-provided NUNOC service offers emergency response for the Nordic WLCG Tier 1.
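The coverage rule above can be sketched in a few lines of Python. This is only an illustration: the exact hours are not specified in this description, so the office-hour and daytime boundaries below, the function name `responder`, and the returned labels are all assumptions.

```python
from datetime import date, datetime, time
from typing import FrozenSet

# Assumed boundaries; the service description does not state exact hours.
OFFICE_START, OFFICE_END = time(9, 0), time(17, 0)
DAYTIME_START, DAYTIME_END = time(8, 0), time(20, 0)


def responder(when: datetime, holidays: FrozenSet[date] = frozenset()) -> str:
    """Return who covers the Tier 1 at the given local time (hypothetical helper)."""
    # Weekends (Saturday=5, Sunday=6) and listed holidays are treated alike.
    off_day = when.weekday() >= 5 or when.date() in holidays
    t = when.time()
    if not off_day and OFFICE_START <= t < OFFICE_END:
        return "OoD shift"
    if off_day and DAYTIME_START <= t < DAYTIME_END:
        return "OoD on call"
    # Nights, and weekday hours outside office time, fall back to NUNOC.
    return "NUNOC"
```

A call such as `responder(datetime(2024, 1, 6, 10, 0))` (a Saturday morning) would fall into the on-call branch under these assumed hours.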

Daily operation

Incomplete list of daily tasks performed by the OoD.

  • Respond to GGUS tickets. The OoD is the person who should check them for validity and change their state from "Assigned" to "In progress".
  • Make sure the site knows it has received a ticket, or direct the ticket to the person responsible for the service, and if possible help the site solve the issue.
  • Answer any generic questions sent to support@ndgf.org.
  • Check the list of open tickets daily.
  • Keep track of the EGI operations-portal ROD dashboard. It's not that useful anymore, but gives an aggregation of different Nagios instances.
  • Keep track of the Nagios on chaperon.
  • Keep track of the Nagios on ngi-sam.
  • Attend the WLCG operations meetings (Mon and Thu during LS1, normally daily) at 15:00 CE(S)T. Report any service interruptions to the storage and/or network.

Implementation plan

01/07
Job vacancy notice posted (announcement)
22/08
Deadline for job applications
01/09 - 02/09
Selected candidates are invited to the NT1 all-hands
10/09
Candidates have been selected
30/09
Contracts have been signed
06/10
Service start

The implementation plan depends on suitable applicants applying for the position.

Candidates are selected on the following criteria:

  • Technical skills
    • in particular generic Linux system administration, and
    • to a lesser extent knowledge of WLCG and NorduGrid operations and tools
  • Soft skills
    • Self-motivated
    • Comfortable working in virtual environments
    • Good communication skills in English, in particular writing skills
  • Affiliation
    • Since these are half-time positions, it is relevant to us what the other 50% is spent on. The two jobs should complement each other rather than compete for time.
    • We want to obtain a good coverage of our participating sites, thus placement at one of these is important (NBI, NSC, HPC2N, PDC, CSC, UiO, or UiB).
    • Must have a work-supplied phone capable of receiving SMSes.