NeIC Conference 2013: Report on "WLCG - quo vadis?"

"Plans for the Large Hadron Collider" (Ian Fisk)

Ian Fisk is the coordinator of CMS. The LHC is now down for a major upgrade and will be turned back on in 2015. Energy levels will be much closer to 14 TeV.

Using a 1 kHz collection and reconstruction rate.

About 6 times higher (what???)

‣ In 2015 about 40
‣ Non-linear increase of something
‣ Computing models are based on MONARC
‣ 600 Mb/s links
‣ Assumes poor networking
‣ Hierarchical structure

CMS is going for a full mesh, i.e. any Tier2 will talk to any other Tier2.

‣ Finally looking into cloud
‣ Flattening of the tier model
‣ It was not possible to use a small disk cache in front of a large tape storage
‣ Going to archive centres and access to large disk farms
‣ Disk centres (farms) can be located anywhere
‣ Reduction of the number of tape archives
‣ More than 1, fewer than 10
‣ The difference between T1 and T2 will be smaller
‣ Model less MONARC-like

‣ More reconstruction at T1 -> T1 and T0 will be mostly the same
‣ All experiments will have Wide Area Access (to data)
‣ Seems to talk about using xrootd mostly, but maybe more protocols can be used
‣ Treating the T1 storage as a "cloud"
‣ Move worker nodes into the OPN
‣ Falling back to xrootd instead of tape archive
‣ What happens with latency and speed?
‣ The big change that has allowed this is the change in networking
‣ 50 kB/s per core is sufficient
‣ Intelligent IO and high-capacity networks are the game changer
‣ Satellite facility in Budapest (100 Gb/s network)
‣ 35 ms ping time between the centres
‣ Breakdown of site boundaries
‣ Cannot tell the difference between Budapest and CERN
‣ 100 Gb links will be in production
‣ Resource provisioning
‣ Looking at "clouds" as a provisioning tool
‣ Using cloud instances instead of pilot jobs
‣ CMS and ATLAS are trying it out with OpenStack
‣ 3500-core test for CMS
‣ Already exceeded the network (what network?)
‣ Resources might be bought from a provider
‣ Breaking the boundaries between sites
‣ Less separation of functionality
‣ The system will be more capable of being treated like a single large facility, rather than a cluster of nodes
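
The 50 kB/s-per-core figure can be turned into a rough aggregate bandwidth estimate. The following is a minimal sketch, not something shown in the talk: the function names are hypothetical, kB is assumed to mean 1000 bytes, and every core is assumed to read at the full rate at the same time.

```python
def aggregate_bandwidth_gbps(cores: int, per_core_kb_s: float = 50.0) -> float:
    """Network demand in Gb/s if `cores` cores each read `per_core_kb_s` kB/s.

    Assumption: 1 kB = 1000 bytes and all cores read simultaneously.
    """
    bytes_per_second = cores * per_core_kb_s * 1000
    return bytes_per_second * 8 / 1e9


def cores_supported(link_gbps: float, per_core_kb_s: float = 50.0) -> int:
    """Number of cores a link of `link_gbps` Gb/s could feed at the same per-core rate."""
    bytes_per_second = link_gbps * 1e9 / 8
    return int(bytes_per_second // (per_core_kb_s * 1000))


if __name__ == "__main__":
    # The 3500-core CMS test mentioned in the notes.
    print(f"3500 cores need about {aggregate_bandwidth_gbps(3500):.1f} Gb/s")
    # A 100 Gb/s link such as the one to the Budapest facility.
    print(f"100 Gb/s feeds about {cores_supported(100.0)} cores")
```

Under these assumptions the 3500-core test corresponds to roughly 1.4 Gb/s, and a 100 Gb/s link could in principle feed about 250 000 cores, which is consistent with the claim that high-capacity networks are what makes the flatter model feasible.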


Jacko Koster: How would WLCG be redesigned based on today's technology?

There would be about the same number of sites (T0/T1?), but fewer tape archives, and it would be more inclusive. Amazon is too expensive for now (leasing a car is more expensive than buying it).
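
The leasing-versus-buying remark can be sketched as a break-even calculation. This is only an illustration: the function names and all prices are hypothetical placeholders, not figures from the talk, and the model ignores staffing, hosting, and depreciation.

```python
from typing import Optional


def breakeven_hours(purchase_cost: float,
                    owned_cost_per_hour: float,
                    rental_price_per_hour: float) -> Optional[float]:
    """Hours of use after which buying becomes cheaper than renting.

    Returns None when renting is never more expensive per hour than
    running owned hardware, so buying never pays off.
    """
    if rental_price_per_hour <= owned_cost_per_hour:
        return None
    return purchase_cost / (rental_price_per_hour - owned_cost_per_hour)


def cheaper_option(used_hours: float,
                   purchase_cost: float,
                   owned_cost_per_hour: float,
                   rental_price_per_hour: float) -> str:
    """Return "rent" or "buy" for the expected number of used hours."""
    hours = breakeven_hours(purchase_cost, owned_cost_per_hour,
                            rental_price_per_hour)
    if hours is None or used_hours < hours:
        return "rent"
    return "buy"


if __name__ == "__main__":
    # Hypothetical prices in arbitrary currency units per core.
    print(cheaper_option(500, 1000.0, 0.25, 0.5))   # short burst
    print(cheaper_option(8000, 1000.0, 0.25, 0.5))  # near-continuous use
```

With these placeholder prices the break-even point is 4000 hours of use: a short burst favours renting, while near-continuous use favours buying.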

"NDGF - lessons learned" (Michael Grønager)

Wild ideas, wanted to start a distributed T1 site
‣ ... using ARC, not gLite
‣ 2007 Q3: we had something up and running
‣ We could present a distributed T1
‣ with operations
‣ dCache for storage, and ARC for computing
‣ ALICE, ATLAS, ....
‣ 1st line support, 2nd line support, 3rd line support
‣ Two parallel grids: the EGEE grid and the ARC grid
‣ 2008 kickoff meeting in Stockholm (Oct)
‣ Merging the EGEE grid and the ARC grid
‣ Done in Jan 2009
‣ Setup of the TERENA Certificate Service
‣ Sped up certificate creation
‣ Pan-European certificate authority (TERENA CS)
‣ Operation 2009/10
‣ e-Infrastructure is built to support research
‣ e-Infrastructure is NOT research
‣ Hence the goal for the e-Infra is crucial
‣ Other e-sciences (than HEP)
‣ Getting more science into NDGF
  • Biogrid
‣ Still problems with high energy physics using the most
‣ We need the other sciences to really ask for something
‣ Not the same feeling of success for the other sciences
‣ Should have been a goal for NDGF/EGEE
‣ Hard to help when no help is asked for
‣ Measuring success of e-infrastructure
‣ The usual mission statements do not work
‣ Ask for acknowledgement in scientific papers and systematic indexes of these
‣ Should e-infrastructure be embedded into the actual research project?
‣ The real goals
‣ insight into the world around us (research)
‣ help the researchers doing so (e-Infra)
‣ educate excellent employees and entrepreneurs
‣ optimize the use of the funds you got for doing so!
‣ Questions:
  • Experience of being a director: how many levels of coordination do we need (local, national, Nordic, European)?
    ∘ What is a coordinator? If you cannot change anything, are you needed?
  • What were your struggles?
    ∘ Give us something to do! A solution in search of a problem

"ATLAS Computing: status and plans" (Oxana Smirnova)

I'm the CERN coordinator
‣ Liaison between NDGF and CERN
‣ ATLAS is NeIC (aff) resources
‣ ATLAS > CMS >> ALICE (mostly)
‣ Load even when LHC is shut down
‣ Crossing rate 40 MHz
‣ upgrade will
‣ Collision rate 10^7 - 10^9 Hz
‣ Sift out 1 event of 10 000 000 000 000 (hopefully the right number of zeroes)
‣ The ATLAS detector is not one detector, more like a factory
‣ Tier0 - (OSG, EGI, NDGF) (why is NDGF separated?)
‣ CSG? should it be CSC?
‣ ATLAS wants to use as many resources as it can
‣ Compare experiment with simulated model to see if we got something new (and/or interesting)
‣ CAF?
‣ Many different types of data: RAW -> ESD (dESD) -> AOD (dAOD -> NTUP) -> TAG
‣ Resource utilization 2012
‣ T1 is about 50% more than expected
‣ T2 is about 100% more than expected
‣ Disk usage for T1 was about 50% more
‣ Tape mostly within reason
‣ Simulation is the dominant CPU usage
‣ We were not negligible contributors
‣ T0 was above the tape pledge
‣ Running well in 2012, utilizing our resources in full
‣ 2013 - might be partial reprocessing of 2010-2012
‣ Monte Carlo simulations
‣ 2014 - MC
‣ Full dress rehearsal for "Run 2"
‣ 2015 - new data comes in
‣ Resource consumption until 2015
‣ > 2x CPU T0
‣ The T2 will have to become more (CPU, disk, tape?)
‣ Accommodate for peaks
‣ Using idle cycles at HPC centers
‣ Cloud resources (for short term crunching)
‣ CHALLENGES
‣ We need to optimize the usage: code and also actual usage
‣ Push towards fast simulation (what is fast simulation really?)
‣ Better usage of vectorization, GPUs, and other modern CPU technology
‣ Deal with increased event size
‣ Reduce number of copies
‣ New distributed data management system (Rucio) being developed
‣ WAN data access and data caching (file level, event level)
  • xrootd federation (?)
‣ New MC production system (JEDI+DEFT+)
‣ Application-driven usage of networks
‣ 2018 next upgrade (ultimate luminosity)
‣ L = 2x10^34 cm^-2s^-1
‣ 2022 ??? High-luminosity LHC
‣ L = 5x10^3
‣ Summary
‣ great success for LHC Run 1
‣ growing investment in software development (during shutdown)
‣ advanced scientific instruments need the help of large international communities
‣ the instruments produce data that cannot be accommodated at one site
‣ -> world wide scientific laboratory
‣ Questions:
  • How could LHC use more resources than it was pledged?
    ∘ The grid infrastructure really helped, increased flexibility (CPU)
    ∘ But even some extra funding helped
  • Will LHC ever throw away data? Will the amount of data be too excessive?
    ∘ NO! :-) Will archive
    ∘ Come back later to look at it

"EGI: Going beyond support for WLCG" (Steven Newhouse)

Plans and support for the future
‣ What can we learn from the past for the future
‣ e-Infrastructure a 1st class citizen in the research community
‣ 1990 WEB -> 1998 GRID -> 2010 EGI
‣ When e-infrastructure becomes usable, people depend on it
‣ 2012 we found the Higgs particle
‣ What's next
‣ EGI vision for 2020
‣ 3 areas
  • community & coordination
  • operational infrastructure
  • virtual research environments
‣ We offer a service
‣ The research community needs customization
‣ The e-infrastructure needs to be adjusted for each research field
‣ EGI has a staff of 26 people (that many?)
‣ Research communities
‣ High Energy Physics
‣ Health-care
‣ Life Science
‣ Extracting knowledge from the data deluge
‣ Requires a digital infrastructure to share services and tools personalised to individual research communities
‣ WLCG is a federation of federations
  • EGI in Europe
  • OSG in North America
‣ Services driven by two WLCG needs
  • scalable service management & operation
  • supporting the HEP computing models
‣ Needs to be sustainable for decades
  • going beyond the project cycle
‣ Sustainable change
‣ Who needs which services sustained?
  • Who uses the current services?
  • How long will this service be needed?
  • How can/should the continued operation be funded?
‣ Who has new requirements for new services?
  • What should the new/changed service do?
  • Who will use this service?
  • How can the innovation be developed?
  • How will /...?
‣ Centrally funded resources
  • traditional model allocated by peer review
‣ Pay for private/public resources used
  • UK: full economic costing for researchers
  • EC: Give funds (& control?) to researchers
    ∘ is this new or old policy?
‣ Dealing with service dependencies
  • both direct (visible) and indirect (invisible)
‣ EGI-InSPIRE (INtegrated Sustain ...) 4 year project 25M euro
  • project cost 7M ......
‣ Service portfolio (from EGI)
‣ Open compute and storage resources
  • EMI (, IGE, ...) -> community based SW
  • OpenStack, OpenNebula, CloudStack
‣ EGI & NGI services coordination
  • technical
  • community
  • human: outreach, consulta
‣ Driven by WLCG (too much?)
‣ Resource Centre Services
  • Why not use commercial? Too expensive
  • But can handle burst capacity
  • And other specialized needs
‣ Different NGI structures around Europe
  • different funding
  • different models
‣ On the European level
  • Generic vs. Domain-specific
  • we need different
‣ What are the critical services from EGI?
‣ What do we need to do?
‣ Retain what we have built, don't throw out the baby with the bathwater
‣ Identify key services and budget
  • balance operational & non-operational
  • budget: 1.6M cash fees & 0.8M euro in kind
‣ What next?
‣ Building our human networks and capital
‣ virtual teams
‣ operational infrastructure
‣ exporting our expertise
‣ establishing a European-wide fed. cloud
  • 1M cores, 1 EB storage & national support centres
‣ virtual research env.
  • easy to use
‣ EGI for H2020
  • EGI collaboration platform
  • EGI cloud infrastructure
  • EGI core infrastructure
  • Community platform (VRE)
‣ Reflections
  • one size does not fit all (researchers)
    ∘ need domain neutral manag. & structure
  • openness and innovation are essential
    ∘ flexible use of EGI through fed. clouds
  • sustainability: must be a core thinking
    ∘ prioritise and identify 'paying' consumers
    ∘ separate out recurring vs. investment funding
‣ GO TO TF2013 in Madrid
‣ Questions:
  • JK: Where will EGI be in 2020?
  • Will still be here.
  • talks t
  • JK: What will the value of a Nordic collaboration be for EGI?
  • An effective regional structure is as good as a national structure
  • Who speaks for the Nordics?
  • GH: How can we broaden the uptake of the services?
  • The champion project might be the solution. Regional centres of excellence might also be a solution.

Panel Discussion

Session Summary

Lessons Learned

  • Boundaries between WLCG sites are breaking up, with less separation of functionality. This will enable more flexible and efficient operations.
  • If you need compute for a short amount of time, rent it from a cloud provider. If you need it every day, you should buy your own compute.
  • The only possible metric for measuring the success of an e-Infrastructure is to count the mentions in acknowledgements in research papers.
  • e-Infrastructure efforts for other sciences failed because the ambition was set too low and no one was requesting the features at the time.
  • The real goals of an e-Infrastructure are insight into the world around us (research) and helping researchers achieve the same (e-Infrastructure). They are also to educate excellent employees and entrepreneurs, and to optimise the use of the funds you got for doing so.
  • ATLAS is a world wide laboratory. This is necessary in order to build instruments and tackle projects of this size and ambition.
  • Projects have to reconcile project lifetimes (1-3 yr) with research plan timeframes (20-30 yr), and plan for sustainable change:
    • Who uses the current service?
    • How long will this service be needed?
    • How can operation be funded?
    • Who has new services?
    • How can this innovation be implemented?
    • How can it be brought into production?

Future Directions

  • There is a need for human expertise:
    • Code optimisation, to make ATLAS code more efficient, for example using vectorization, better algorithms, GPUs, etcetera.
    • Deal with increased event size, reduce content, reduce data redundancy (which is a scary prospect).
    • Reduce memory footprint.