NeIC Conference 2013: Report on "Workshop: Data Services"



"Developing Global Data Infrastructures: Trends and Requirements" (Peter Wittenburg)

Peter Wittenburg, Scientific Coordinator EUDAT - Max Planck Institute for Psycholinguistics, Nijmegen, Netherlands

Scientific Data - the data fabric

raw data (massive, complex) -> Preprocessing -> Persistent ID -> permanent data store

Old way: trying testing scripting

Future: need a data fabric. Move data forward. Creating new data for all steps. Replicating into new IDs. Integrating data in a lifecycle of data.

Identifiers for all states of the data.
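One way to picture "identifiers for all states of the data" is a small registry in which every processing step registers its output under a new PID and records the PID it was derived from. The sketch below is illustrative only: the `PidRegistry` class, the `hdl:example` prefix and the counter-based suffixes are assumptions made for this example and do not describe any actual EUDAT PID service.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Record:
    pid: str
    checksum: str                # SHA-256 of the content
    derived_from: Optional[str]  # PID of the parent state, None for raw data


class PidRegistry:
    """Toy in-memory registry; prefix and numbering are hypothetical."""

    def __init__(self, prefix: str = "hdl:example") -> None:
        self._prefix = prefix
        self._records: Dict[str, Record] = {}

    def register(self, content: bytes, derived_from: Optional[str] = None) -> Record:
        # Every new state of the data gets its own identifier.
        if derived_from is not None and derived_from not in self._records:
            raise KeyError(f"unknown parent PID: {derived_from}")
        pid = f"{self._prefix}/{len(self._records) + 1:06d}"
        record = Record(pid, hashlib.sha256(content).hexdigest(), derived_from)
        self._records[pid] = record
        return record

    def lineage(self, pid: str) -> List[str]:
        """Return the chain of PIDs from the given state back to the raw data."""
        chain: List[str] = []
        current: Optional[str] = pid
        while current is not None:
            chain.append(current)
            current = self._records[current].derived_from
        return chain


registry = PidRegistry()
raw = registry.register(b"raw sensor output")
clean = registry.register(b"preprocessed output", derived_from=raw.pid)
result = registry.register(b"analysis result", derived_from=clean.pid)
print(registry.lineage(result.pid))
# ['hdl:example/000003', 'hdl:example/000002', 'hdl:example/000001']
```

Keeping the parent PID next to a checksum is one simple way to make the relationships between raw and derived data explicit.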

Challenges:

More raw data. Relationships more complex. Processing generates new data and new relationships.

Preservation/archiving and curation?

What are the access problems?

Enabling technologies

  • Discovery
  • Access ( reference resolution, protocols )
  • Interpretation (requires look into objects, use content and contextual information)
  • Reuse ( -"- )

Management problems:

  • Collections + Properties
  • Access
  • formalized policies - workflow engine
  • Assessment - Quality of processed data

"Do we have a repo system that allows to support PID, metadata for all object (and collections) and the execution of policy rules so that quality assessment can be done"

Role of a DI

Implement the required building blocks of a "data fabric". Need to support data discovery

Who is working on it?

ESFRI initiatives, EUDAT - common data services, OpenAIRE, Data conservancy, DataONE, SEAD, EGI

No persistence - time-limited initiatives. EUDAT refers to national data centers.

Who is standardizing?

W3C, IETF, CODATA, WDS, DONA, RDA

Summarize

Initiatives such as EUDAT are important for implementing and trying new ways, and thus shift boundaries in harmonizing data organization.

Moving towards a data infrastructure is an evolutionary process

PIDs are important for all stages of data management. Processed data will create new PIDs. Replicating into new IDs. The data infrastructures handle the entire data life-cycle.

Different layers - different players - different roles.

Will EUDAT survive? A global federation will come.

"EUDAT: Towards a European Collaborative Data Infrastructure" (Damien Lecarpentier)

Data trends

  • Where to store
  • How to find it
  • How to make the most of it

Create a picture of Europe that is more "interoperable"

EUDAT must identify Common Data Services

5 communities on board: EPOS, CLARIN, ENES, LifeWatch, VPH

Common challenges:

  • Reference models and architectures
  • Persistent data identifiers
  • Metadata management
  • distributed data sources
  • data interoperability

Identify needs from communities. Communities actively involved.

Services in EUDAT

Safe replication service

  • Guard against data loss in long-term archiving and preservation
  • Optimise access for users from different regions
  • bring data close to powerful computers for compute-intensive analysis

Data Staging Service

  • Support users transferring data from EUDAT to HPC facilities
  • Reliable, efficient and easy-to-use tools to manage data transfers

Simple Store Service

  • Users upload "long-tail" data

Metadata service

  • Easily find collections of scientific data

AAI - network of trust among authentication and authorization actors.

Need to reach out to additional communities

New services

  • Real-time data handling
  • Semantic annotation
  • Crowdsourcing
  • Web services
  • Memento

Working groups

  • Dynamic data
  • Semantic annotation
  • Scientific workflows
  • Data access and re-use policies

Sustaining a CDI requires

CDI is about providing solutions together in a federated environment

Research communities

Other service providers (TERENA, PRACE, EGI …). EUDAT can't and shouldn't address everything alone.

National infrastructures (more than 90% of research funding comes from Member States, while researchers are increasingly organized at a pan-European level).

"BBMRI requirements and use of the e-Infrastructure" (Roxana Martinez)

Today in Europe: disconnected national biobanks, no central management, hard to search.

Future: networked biobanks, trans-national collaboration.

BBMRI: Creation of a distributed research infrastructure. 315 biobanks registered, more than 20 million samples.

BBMRI Nordic -- collaborative network between the Nordic countries.

Data volume is increasing faster than Moore’s law -> need a lot of storage capacity. Needs long-term storage with personal data privacy protection. Unsolved issues around how to share sensitive data.

Wants a close collaboration with EUDAT projects.

"EISCAT requirements and use of the e-Infrastructure" (Mats Nylen)

EISCAT: large radar installations in the Arctic, ionospheric research

EISCAT_3D will produce large amounts of data, fully digitalized antennas. Startup 2018.

Single antenna will produce 5 TB/day

16000 antennas will produce 80 PB/day

Data reduction is needed already at the antennas. Still needs an efficient archive and data warehouse. Will require a 10 Gb/s network from all antenna sites.
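The figures above can be checked with a few lines of arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes), a 24-hour day, and that the 10 Gb/s figure means gigabits per second. The number of antenna sites is not given in these notes, so the `sites` argument in the last call is a purely hypothetical value.

```python
SECONDS_PER_DAY = 24 * 60 * 60
TB = 10**12  # decimal terabyte, in bytes (assumption)
PB = 10**15  # decimal petabyte, in bytes (assumption)


def gbit_per_s(bytes_per_day: float) -> float:
    """Convert a daily data volume into a sustained rate in Gbit/s."""
    return bytes_per_day * 8 / SECONDS_PER_DAY / 1e9


def reduction_factor(total_bytes_per_day: float, sites: int,
                     link_gbit_per_s: float = 10.0) -> float:
    """How much the raw stream must shrink to fit the combined site links."""
    return gbit_per_s(total_bytes_per_day) / (sites * link_gbit_per_s)


per_antenna = 5 * TB          # 5 TB/day per antenna, from the talk
antennas = 16_000             # from the talk
total = per_antenna * antennas

print(total / PB)                            # 80.0 (PB/day)
print(round(gbit_per_s(per_antenna), 3))     # 0.463 (Gbit/s per antenna)
print(round(gbit_per_s(total)))              # 7407 (Gbit/s in total)

# Hypothetical example: with 5 sites at 10 Gbit/s each, the raw stream
# would have to be reduced by roughly this factor before leaving the sites.
print(round(reduction_factor(total, sites=5), 1))  # 148.1
```

Even under generous assumptions the raw rate is orders of magnitude above what the site links can carry, which is consistent with the point that reduction has to happen at the antennas.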

Still work to be done on data formats


TTA, Finnish data services

Services

Data service -- IDA: preserve digital research data and increase re-use. Integrity.

Catalog searches

Long term storage -- PAS

NorStore

Current funding ends 31.12.2013, long term funding necessary.

Establish both storage volume and more high-level data services based on accepted standards.

New storage system acquired in 2013.

iRODS used for dataset management.

Access interface under development.

Research data initiatives in Sweden

SweStore -- Collaborative national infrastructure.

Focus on the needs from research communities rather than the needs of the data centers.

SweStore provides the backend for the more research specific user interfaces.

Funding is also coming from the research communities.

BILS -- Swedish node in ELIXIR. Collaboration between 6 universities.

Considerable potential for better Nordic collaboration on eInfrastructure for research data.

"Panel discussion" - Part 1 (Erwin Laure)

DISCLAIMER: I hope I interpreted all questions and answers correctly (Jonas L)

What is the line between common data services and community provided services?

EISCAT_3D: Project will define the products suitable for all scientific fields. EISCAT is an instrument.

BBMRI: Very dynamic field of research. Methods and tools change every year. Fast moving.

Can you converge to something?

BBMRI: Standardised formats and metadata. Transfer of data to compute resources is a big challenge.

Q (to BBMRI): Are the data proprietary? A: Many different formats. Raw data depends on the equipment. Processed data is more open. Open data formats.

PW: Put communities together -> Build common services. Services are the result of discussion between communities. There are exceptions for the extreme cases. Will fail for some cases.

EL: For large problems. Have a problem will be solved. How to leverage these solutions? How to create solutions that can be reused?

PW: How to handle all derived data. Can tend to be very complex.

PÖ: How do these efforts in research data relate to national archives and regulations? Focus is often on research sharing.

PW: Problem is solved when you have created a "working" federation. The biggest problem is national regulations. There is convergence on the federations.

MN: Researchers will not wait. National archives / federations have different timescales.

EUDAT: Is an attempt to break these barriers.

Datatype registry is important: Enabling sharing of data.

Where do you find solutions? Which level of solution?

GH: Where is the funding coming from?

RM: Logically European. There should be national contact.

EUDAT: Smaller project possible in cooperation with national centers.

EUDAT: Develop solutions that cover some communities

EL: Where are the librarians? Are we duplicating efforts? What do you mean by libraries?

Integrating expertise is important: from IT, researchers and library experts.

NeIC: Go beyond eInfrastructure to eCollaborate projects.

Data services - part 2

TTA – National Research Data Project in Finland (Jari Suhonen, CSC)

"The fourth paradigm"

"Data-intensive Science"

Modeling of information

Open Science - Open Access Data / Publishing / communicate

Research data map, information infrastructure plan, metadata

"Clear data policy supported by common e-services"

"Data resources generated by public funding should be easily available"

Organised in workgroups / Implementation projects

Services:

IDA - data storage (in production)

  • joint storage server
  • safe preservation of data and metadata
  • data preservation in intact and unchanged format by means of managing copies and their integrity

KATA - data catalog (2013)

  • Find available data
  • produces information about existence of data for funders
  • enables the joint terms of usage and rights

PAS - Long term preservation (2015)

  • Timespan 10-100 years
  • Handle file format changes in understandable form
  • Collaboration with National Digital Library

www.csc.fi/tta www.csc.fi/ida

NorStore – Managing Digital Research Data in Norway (Adil Hasan, UNINETT Sigma AS)

"..develop and operate a persistent, nationally coordinated infrastructure that provides non-trivial services…"

Long-term funding is seen as a requirement. Important to secure commitment from user communities.

Part of the national roadmap for infrastructures

Data life cycle

Project area / Data archive area

Complete data life cycle

Motivation

  • Data management is a concern for disciplines
  • Usually a lack of expertise for long-term storage
  • Value in long-term usage of datasets
  • Cross-discipline research (open access)
  • NorStore aims to fill a need

Approach

  • Long-term needs to be resilient to change
    • Hardware can change
    • Software can change
    • Terminology can change
  • Adopt standards
    • Metadata standards
  • Allow interchange with existing repos
    • OAI-PMH
    • PID
  • KIS KIM
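As an illustration of what interchange over OAI-PMH looks like from the harvesting side, the sketch below builds a `ListRecords` request URL. The verb and the `metadataPrefix`, `from` and `set` arguments are part of the OAI-PMH protocol, and `oai_dc` is the metadata format every OAI-PMH repository has to support. The base URL and the helper name `list_records_url` are hypothetical and do not refer to an actual NorStore endpoint.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(base_url: str,
                     metadata_prefix: str = "oai_dc",
                     from_date: Optional[str] = None,
                     set_spec: Optional[str] = None) -> str:
    """Build an OAI-PMH ListRecords request URL for a repository."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date   # selective harvesting by datestamp
    if set_spec:
        params["set"] = set_spec     # selective harvesting by set
    return f"{base_url}?{urlencode(params)}"


# Hypothetical repository address, used only for the example.
print(list_records_url("https://repo.example.org/oai", from_date="2013-01-01"))
# https://repo.example.org/oai?verb=ListRecords&metadataPrefix=oai_dc&from=2013-01-01
```

A harvester would fetch this URL, parse the returned XML, and follow any `resumptionToken` to retrieve further pages.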

Layout

  • DB/iRODS
  • Ingest / Web / Command line
  • Access / Web / Command line

  • 100 TB staging
  • 5 TB tape interface
  • 2 PB archive
  • Web UI 1 TB

Process

Submission - Request archival - Publish data set

Defines a data manager and rights holder. Important for long-term storage of data.

Dataset management

  • free from corruption
  • free from errors
  • right to access

Status

  • Metadata schema defined - database and XML
  • Web-based ingest interface alpha version
    • Manages the entire process
    • Populates metadata
  • Archive storage complete
    • Replication and checksum and repair rules
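The idea behind replication, checksum and repair rules can be sketched in plain Python: compare each replica against a recorded checksum and overwrite any damaged copy from an intact one. This is a generic illustration and not NorStore's actual rule set. The function names and file names are made up for the example, and SHA-256 is an assumed choice of checksum.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path
from typing import List


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_and_repair(replicas: List[Path], expected: str) -> List[Path]:
    """Restore every missing or corrupted replica from an intact copy."""
    good = [p for p in replicas if p.exists() and sha256_of(p) == expected]
    if not good:
        raise RuntimeError("no intact replica left; cannot repair")
    repaired = []
    for replica in replicas:
        if replica not in good:
            shutil.copyfile(good[0], replica)
            repaired.append(replica)
    return repaired


with tempfile.TemporaryDirectory() as tmp:
    site_a = Path(tmp) / "site_a.dat"
    site_b = Path(tmp) / "site_b.dat"
    site_a.write_bytes(b"dataset")
    site_b.write_bytes(b"dataset")
    expected = sha256_of(site_a)      # checksum recorded at ingest

    site_b.write_bytes(b"datXset")    # simulate silent corruption
    repaired = verify_and_repair([site_a, site_b], expected)
    intact_after = sha256_of(site_b) == expected

print([p.name for p in repaired])  # ['site_b.dat']
print(intact_after)                # True
```

The repair step only works while at least one replica still matches the recorded checksum, which is why the checksum is stored at ingest time rather than recomputed from a replica later.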

To be done

  • Update metadata scheme
  • Finalise web-based ingest interface
  • Develop access interface
  • First version by October

Research Data Initiatives in Sweden (Jacko Koster)

Much fragmentation in data initiatives

SweStore

Provide a collaborative national infrastructure for storage of Swedish research data

Deployment and services based on demonstrated and documented needs from users, projects and communities

Deployment of services based on agreements with users, projects and communities or funding agency

SNIC Storage

  • Center storage
    • Storage accessible from all local resources
  • Tape storage
    • 3x HPC2N, PDC, NSC
    • Backup for all sites
    • Tape backend for storage
  • SweStore
    • 6x: Nationally accessible storage
  • Discipline-specific solutions

Service areas

  • Project-specific services
  • Community services
  • Services needed by some
  • Services needed by all

Data infrastructure

Big groups: BILS, SND, ECDS. User communities are responsible for PID and metadata. SNIC does not provide the entire stack. Communities develop their own metadata. Define with the communities the division of services between communities and SNIC.

SNIC: Long-term, authentication, safe replication.

BILS

  • Distributed national research infrastructure
    • Nodes at each university
  • Swedish node in ELIXIR

Complicated distributed setup within Sweden

Some issues

Data infrastructures have many stakeholders. A distributed national infrastructure comes with various added overheads. International efforts add to that

Delivery of services must be based on concrete needs. Based on service descriptions.

Deployment of services rather than development.

Potential for better collaboration on eInfrastructures

What is required for this to happen?

Collaborate on a Nordic level / European level?

Long-term preservation of research data requires permanent infrastructure. Appropriate funding and governance needed.

Insufficient rewarding schemes to encourage documentation of data.

SNIC/SweStore need to define data policies and service descriptions. Looking at work done or in progress in other countries is important.

Summary

"Panel discussion" - Part 2 (Erwin Laure)

DISCLAIMER: I hope I interpreted all questions and answers correctly (Jonas L)

Q: What is the policy on long-term archive?

A (FI): Not ready yet. Will be offered to research communities.

A (NO): Depends on the research council. No policy set yet. Researchers have no real interest after they have used their data. No rewards for keeping data. Create incentives for storing data.

A (S): A data plan was developed. However, it has been withdrawn after criticism from research groups. Status is unclear. SNIC needs to look at other countries. Needs to implement more policies for data management and storage.

A (PÖ): Long-term quota is decided by universities. Guaranteed until 2017.

(PW): Max-Planck. No funding for archives. University policy. President guarantees 50 years availability. Split between data/archiving not possible. All data part of the research cycle. Why the separation between archive and data?

Q: Need to collaborate more. Learn from existing projects.

GH: Governance structure of SweStore, NorStore. (S) Work in progress. (FI) Controlled centrally as well as by funding agencies.

R: Policies regarding sensitive data. A (NO): Depends on the data producers. Data producers have the responsibility for the data. A (FI): Tools needed for access controls.

PW: No interfaces for solving all problems. Data collaboration using iRODS. Going from abstract data islands needs some scalability development.

EL: Too many communities as well as different technologies.

JK: Many in Europe collaborate in principle, but do not actually collaborate.

GH: NeIC is a facilitator. Collaborate and connect users and communities. Develop leading edge competence at the different centers. No ambition to coordinate collaboration on a Nordic/European level.

PW: Demands must come from the communities. We must not create more overhead structures. From EUDAT discussions. Need to move data to the computational resources. Important to interact between EU-projects. Not create new super-projects.

GH: Duplication is not good. There is overhead in coordination as well.

PW: Why the low activity by Nordic countries in the RDA working groups? Important to have the Nordic voices in the RDA working groups.

Discussion Points

Session Summary

  • BBMRI Nordic has data volumes increasing faster than Moore’s law, and thus needs a lot of long-term storage with personal data privacy protection.

Lessons Learned

  • Sharing sensitive data still has many unresolved issues, especially across borders.
  • Data reduction is needed already at the antennas, as 80 PB/day is a major obstacle. A lot of work needs to be done on formats.
  • Proprietary formats are an obstacle to knowledge extraction, indexability and sharing.

Future Directions

  • BBMRI wants a close collaboration with EUDAT projects, especially for indexing and standardization of non-sensitive metadata.
  • We must go beyond e-Infrastructure to e-Collaborate projects.