NeIC Conference 2013: Report on "Workshop: Data Services"
"Developing Global Data Infrastructures: Trends and Requirements" (Peter Wittenburg)
Peter Wittenburg, Scientific Coordinator EUDAT - Max Planck Institute for Psycholinguistics, Nijmegen, Netherlands
Scientific Data - the data fabric
Raw data (massive, complex) -> Preprocessing -> Persistent ID -> Permanent data store
Old way: trying, testing, scripting
Future: need a data fabric. Move data forward. Creating new data for all steps. Replicating into new IDs. Integrating data in a lifecycle of data.
Identifiers for all states of the data.
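The idea that every state of the data gets its own identifier, with derived data pointing back to its source, can be sketched as follows. This is only an illustration and not an EUDAT interface: the `PidRegistry` class, the `hdl:demo/` prefix and the use of random UUIDs as identifiers are all assumptions made for the example.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Record:
    pid: str
    step: str                      # e.g. "raw", "preprocessed", "replica"
    parent: Optional[str] = None   # PID of the object this one was derived from


@dataclass
class PidRegistry:
    """Hypothetical in-memory registry standing in for a real PID service."""
    prefix: str = "hdl:demo/"      # made-up prefix, for illustration only
    records: Dict[str, Record] = field(default_factory=dict)

    def mint(self, step: str, parent: Optional[str] = None) -> str:
        # Every new state of the data gets a new identifier.
        if parent is not None and parent not in self.records:
            raise KeyError(f"unknown parent PID: {parent}")
        pid = self.prefix + uuid.uuid4().hex
        self.records[pid] = Record(pid, step, parent)
        return pid

    def lineage(self, pid: str) -> List[str]:
        """Walk back from a PID to the raw data it was derived from."""
        chain = []
        current: Optional[str] = pid
        while current is not None:
            rec = self.records[current]
            chain.append(rec.step)
            current = rec.parent
        return chain


registry = PidRegistry()
raw = registry.mint("raw")
pre = registry.mint("preprocessed", parent=raw)
rep = registry.mint("replica", parent=pre)
print(registry.lineage(rep))
```

Storing the parent PID with each record is what lets a replica or a processed object be traced back through the whole lifecycle.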
Challenges:
More raw data. Relationships more complex. Processing generates new data and new relationships.
Preservation/archiving and curation?
What are the access problems?
Enabling technologies
- Discovery
- Access ( reference resolution, protocols )
- Interpretation (requires look into objects, use content and contextual information)
- Reuse ( -"- )
Management problems:
- Collections + Properties
- Access
- formalized policies - workflow engine
- Assessment - Quality of processed data
"Do we have a repo system that supports PIDs, metadata for all objects (and collections), and the execution of policy rules so that quality assessment can be done?"
Role of a DI
Implement the required building blocks of a "data fabric". Need to support data discovery
Who is working on it?
ESFRI initiatives, EUDAT - common data services, OpenAIRE, Data conservancy, DataONE, SEAD, EGI
No persistence: these are time-limited initiatives. EUDAT refers to national data centers.
Who is standardizing
W3C, IETF, CODATA, WDS, DONA, RDA
Summarize
Initiatives such as EUDAT are important for implementing and trying new ways, and thus shift boundaries in harmonizing data organization.
Moving towards a data infrastructure is an evolutionary process
PIDs are important for all stages of data management. Processed data will create new PIDs. Replicating into new IDs. The data infrastructures handle the entire data life-cycle.
Different layers - different players - different roles
Will EUDAT survive? A global federation will come.
"EUDAT: Towards a European Collaborative Data Infrastructure" (Damien Lecarpentier)
Data trends
- Where to store
- How to find it
- How to make the most of it
Create a picture of Europe that is more "interoperable"
EUDAT must identify Common Data Services
5 communities on board: EPOS, CLARIN, ENES, LifeWatch, VPH
Common challenges:
- Reference models and architectures
- Persistent data identifiers
- Metadata management
- distributed data sources
- data interoperability
Identify needs from communities. Communities actively involved.
Services in EUDAT
Safe replication service
- Guard against data loss in long-term archiving and preservation
- Optimise access for users from different regions
- bring data close to powerful computers for compute-intensive analysis
Data Staging Service
- Support users transferring data from EUDAT to HPC facilities
- Reliable, efficient and easy-to-use tools to manage data transfers
Simple Store Service
- Users upload "long-tail" data
Metadata service
- Easily find collections of scientific data
AAI - network of trust among authentication and authorization actors.
Need to reach out to additional communities
New services
- Real-time data handling
- Semantic annotation
- Crowd sourcing
- Web services
- Memento
Working groups
- Dynamic data
- Semantic Annotation
- Scientific Workflows
- Data Access and Re-Use policies
Sustaining a CDI requires
CDI is about providing solutions together in a federated environment
Research communities
Other service providers (TERENA, PRACE, EGI …). EUDAT can't and shouldn't address everything alone.
National infrastructures (more than 90% of research funding comes from Member States, while researchers are increasingly organized at a pan-European level).
"BBMRI requirements and use of the e-Infrastructure" (Roxana Martinez)
Today in Europe: disconnected national biobanks, no central management, hard to search
Future: networked biobanks, trans-national collaboration
BBMRI: Creation of a distributed research infrastructure. 315 biobanks registered, more than 20 million samples.
BBMRI Nordic -- collaborative network between the Nordic countries
Data volume increasing faster than Moore’s law -> need a lot of storage capacity. Needs long-term storage with personal data privacy protection. Unsolved issues around how to share sensitive data.
Wants a close collaboration with EUDAT projects.
"EISCAT requirements and use of the e-Infrastructure" (Mats Nylen)
EISCAT Large radar installations in the Arctic, ionospheric research
EISCAT_3D will produce large amounts of data, fully digitalized antennas. Startup 2018.
Single antenna will produce 5 TB/day
16000 antennas will produce 80 PB/day
Data reduction needed already on the antennas. Still needs an efficient archive and data warehouse. Will require a 10 Gb/s network from all antenna sites.
Still work to be done on data formats
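The figures above can be checked with a short calculation. This is a back-of-the-envelope sketch that assumes decimal units (1 TB = 10^12 bytes) and that the network figure in the notes means 10 gigabit per second; neither assumption is stated in the talk.

```python
TB = 10**12                  # bytes, decimal terabyte (assumption)
PB = 10**15                  # bytes, decimal petabyte (assumption)
SECONDS_PER_DAY = 24 * 60 * 60

per_antenna_bytes_per_day = 5 * TB
antennas = 16_000

# Aggregate raw volume across all antennas.
total_bytes_per_day = per_antenna_bytes_per_day * antennas
total_pb_per_day = total_bytes_per_day / PB

# Sustained bit rates implied by those daily volumes.
per_antenna_mbit_per_s = per_antenna_bytes_per_day * 8 / SECONDS_PER_DAY / 10**6
total_tbit_per_s = total_bytes_per_day * 8 / SECONDS_PER_DAY / 10**12

# What one 10 Gbit/s link can move in a day (assuming "gb" means gigabit).
link_tb_per_day = 10 * 10**9 / 8 * SECONDS_PER_DAY / TB

print(f"total raw volume: {total_pb_per_day:.0f} PB/day")
print(f"per antenna:      {per_antenna_mbit_per_s:.0f} Mbit/s")
print(f"all antennas:     {total_tbit_per_s:.1f} Tbit/s")
print(f"one 10 Gbit/s link: {link_tb_per_day:.0f} TB/day")
```

Under these assumptions 5 TB/day times 16000 antennas is indeed 80 PB/day, or roughly 7.4 Tbit/s sustained, while one 10 Gbit/s link carries about 108 TB/day. This is why data reduction at the antennas is needed before anything is shipped to an archive.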
TTA, Finnish data services
Services
Data service -- IDA: preserving digital research and increasing re-use; integrity
Catalog searches
Long term storage -- PAS
NorStore
Current funding ends 31.12.2013, long term funding necessary.
Establish both storage volume and more high-level data services based on accepted standards.
New storage system acquired in 2013.
iRODS used for dataset management.
Access interface under development.
Research data initiatives in Sweden
SweStore -- Collaborative national infrastructure.
Focus on the needs from research communities rather than the needs of the data centers.
SweStore provides the backend for the more research specific user interfaces.
Funding is also coming from the research communities.
BILS -- Swedish node in ELIXIR. Collaboration between 6 universities.
Considerable potential for better Nordic collaboration on eInfrastructure for research data.
"Panel discussion" - Part 1 (Erwin Laure)
DISCLAIMER: I hope I interpreted all questions and answers correctly (Jonas L)
What is the line between common data services and community provided services?
EISCAT_3D: Project will define the products suitable for all scientific fields. EISCAT is an instrument.
BBMRI: Very dynamic field of research. Methods and tools change every year. Fast moving.
Can you converge to something?
BBMRI: Standardised formats and metadata. Transfer of data to compute resources is a big challenge.
Q: BBMRI: Are the data proprietary? A: Many different formats. Raw data depends on equipment. Processed data is more open. Open data formats.
PW: Put communities together -> Build common services. Services are the result of discussion between communities. There are exceptions for the extreme cases. Will fail for some cases.
EL: Large problems will be solved. How to leverage these solutions? How to create solutions that can be reused?
PW: How to handle all derived data. Can tend to be very complex.
PÖ: How do these efforts in research data relate to national archives and regulations? Focus is often on research sharing.
PW: Problem is solved when you have created a "working" federation. The biggest problem is national regulations. There is convergence on the federations.
MN: Researchers will not wait. National archives / federations have different timescales.
EUDAT: Is an attempt to break these barriers.
Datatype registry is important: Enabling sharing of data.
Where do you find solutions? Which level of solution?
GH: Where is the funding coming from?
RM: Logically European. There should be national contact.
EUDAT: Smaller project possible in cooperation with national centers.
EUDAT: Develop solutions that cover some communities
EL: Where are the librarians? Are we duplicating efforts? What do you mean by libraries?
Integrating expertise is important: from IT, researchers and library experts.
NeIC: Go beyond eInfrastructure to eCollaborate projects.
Discussion Points
Session Summary
Lessons Learned
Future Directions
Data services - part 2
TTA – National Research Data Project in Finland (Jari Suhonen, CSC)
"The fourth paradigm"
"Data-intensive Science"
Modeling of information
Open Science - Open Access Data / Publishing / communicate
Research data map, information infrastructure plan, metadata
"Clear data policy supported by common e-services"
"Data resources generated by public funding should be easily available"
Organised in workgroups / Implementation projects
Services:
IDA - data storage (in production)
- joint storage server
- safe preservation of data and metadata
- data preservation in intact and unchanged format by means of managing copies and their integrity
KATA - data catalog (2013)
- Find available data
- produces information about existence of data for funders
- enables the joint terms of usage and rights
PAS - Long term preservation (2015)
- Timespan 10-100 years
- Handle file format changes in understandable form
- Collaboration with National Digital Library
www.csc.fi/tta www.csc.fi/ida
NorStore – Managing Digital Research Data in Norway (Adil Hasan, UNINETT Sigma AS)
"..develop and operate a persistent, nationally coordinated infrastructure that provides non-trivial services…"
Long-term funding is seen as a requirement. Important to secure commitment from user communities.
Part of the national roadmap for infrastructures
Data life cycle
Project area / Data archive area
Complete data life cycle
Motivation
- Data management is a concern for disciplines
- Usually a lack of expertise for long-term storage
- Value in long-term usage of datasets
- Cross-discipline research (open access)
- NorStore aims to fill a need
Approach
- Long-term needs to be resilient to change
- Hardware can change
- Software can change
- Terminology can change
- Adopt standards
- Metadata standards
- Allow interchange with existing repos
- OAI-PMH
- PID
- KIS KIM
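As an illustration of what interchange via OAI-PMH looks like in practice, the sketch below builds a harvesting request URL. The six verbs are the ones defined by OAI-PMH 2.0; the base URL `https://repository.example.org/oai` is a placeholder and not a NorStore endpoint, and the helper name `oai_request_url` is made up for this example.

```python
from urllib.parse import urlencode

# The six request types defined by the OAI-PMH 2.0 protocol.
OAI_VERBS = {
    "Identify",
    "ListMetadataFormats",
    "ListSets",
    "ListIdentifiers",
    "ListRecords",
    "GetRecord",
}


def oai_request_url(base_url, verb, arguments=None):
    """Build an OAI-PMH GET request URL (hypothetical helper)."""
    if verb not in OAI_VERBS:
        raise ValueError(f"not an OAI-PMH verb: {verb}")
    params = {"verb": verb}
    # Arguments are passed as a dict because "from" is a Python keyword.
    params.update(arguments or {})
    return base_url + "?" + urlencode(params)


# Placeholder endpoint; a real repository publishes its own base URL.
url = oai_request_url(
    "https://repository.example.org/oai",
    "ListRecords",
    {"metadataPrefix": "oai_dc", "from": "2013-01-01"},
)
print(url)
```

Any harvester that speaks the protocol can issue the same request, which is what makes a standard like this attractive for an archive that has to outlive its own software.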
Layout
- DB/iRODS
- Ingest / Web / Command line
- Access / Web / Command line
100 TB staging, 5 TB tape interface, 2 PB archive, Web UI 1 TB
Process
Submission - Request archival - Publish data set
Defines a data manager and rights holder. Important for long-term storage of data.
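The three-step process and the requirement for a named data manager and rights holder can be sketched as a small state machine. The class and state names below are invented for illustration and do not reflect the actual NorStore implementation.

```python
from enum import Enum


class State(Enum):
    SUBMITTED = "submitted"
    ARCHIVAL_REQUESTED = "archival requested"
    PUBLISHED = "published"


# Allowed transitions: submission -> request archival -> publish.
NEXT_STATE = {
    State.SUBMITTED: State.ARCHIVAL_REQUESTED,
    State.ARCHIVAL_REQUESTED: State.PUBLISHED,
}


class Dataset:
    """Hypothetical dataset record moving through the archival process."""

    def __init__(self, title, data_manager, rights_holder):
        # Both roles must be named before a dataset can enter the process.
        if not data_manager or not rights_holder:
            raise ValueError("a data manager and a rights holder are required")
        self.title = title
        self.data_manager = data_manager
        self.rights_holder = rights_holder
        self.state = State.SUBMITTED

    def advance(self):
        """Move to the next step; a published dataset cannot advance."""
        if self.state not in NEXT_STATE:
            raise RuntimeError("dataset is already published")
        self.state = NEXT_STATE[self.state]
        return self.state


ds = Dataset("Example survey", data_manager="A. Manager", rights_holder="Some University")
ds.advance()   # archival requested
ds.advance()   # published
print(ds.state.value)
```

Recording the two roles at submission time means there is always someone to contact about the data long after the original project has ended.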
Dataset management
- free from corruption
- free from errors
- right to access
Status
- Metadata schema defined - database and XML
- Web-based ingest interface alpha version
- Manages the entire process
- Populates metadata
- Archive storage complete
- Replication and checksum and repair rules
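The idea behind checksum and repair rules can be shown in plain Python: verify each replica against a stored checksum and overwrite any corrupted copy with an intact one. This is a conceptual sketch only. It is not iRODS rule language, and the replica names and the `repair_replicas` helper are assumptions for the example.

```python
import hashlib


def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def repair_replicas(replicas: dict, expected: str) -> list:
    """Replace corrupted replicas with a good copy; return the names repaired."""
    # Find any replica whose checksum still matches the stored value.
    good = next((d for d in replicas.values() if sha256(d) == expected), None)
    if good is None:
        raise RuntimeError("no intact replica left to repair from")
    repaired = []
    for name, data in list(replicas.items()):
        if sha256(data) != expected:
            replicas[name] = good
            repaired.append(name)
    return repaired


original = b"archived dataset contents"
checksum = sha256(original)          # stored at ingest time

# Three replicas, one of which has been silently corrupted.
replicas = {
    "site-a": original,
    "site-b": b"archived dataset c0ntents",
    "site-c": original,
}

fixed = repair_replicas(replicas, checksum)
print(fixed)
```

Repair only works while at least one replica is intact, which is one reason replication and checksumming are listed together.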
To be done
- Update metadata scheme
- Finalise web-based ingest interface
- Develop access interface
- First version by October
Research Data Initiatives in Sweden (Jacko Koster)
Much fragmentation in data initiatives
SweStore
Provide a collaborative national infrastructure for storage of Swedish research data
Deployment and services based on demonstrated and documented needs from users, projects and communities
Deployment of services based on agreements with users, projects and communities or funding agency
SNIC Storage
- Center storage
- Storage accessible from all local resources
- Tape storage
- 3x HPC2N, PDC, NSC
- Backup for all sites
- Tape backend for storage
- SweStore
- 6x: National accessible storage
- Discipline-specific solutions
Service areas
- Project specific services
- Community services
- Services needed by some
- Services needed by all
Data infrastructure
Big groups: BILS, SND, ECDS. User communities are responsible for PID and metadata. SNIC does not provide the entire stack. Communities develop their own metadata. Define with the communities the division of services between communities and SNIC.
SNIC: Long-term, authentication, safe replication.
BILS
- Distributed national research infrastructure
- Nodes at each university
- Swedish node in ELIXIR
Complicated distributed setup within Sweden
Some issues
Data infrastructures have many stakeholders. A distributed national infrastructure comes with various added overheads. International efforts add to that
Delivery of services must be based on concrete needs. Based on service descriptions.
Deployment of services rather than development.
Potential for better collaboration on eInfrastructures
What is required for this to happen?
Collaborate on a Nordic level / European level?
Long-term preservation of research data requires permanent infrastructure. Appropriate funding and governance needed.
Insufficient rewarding schemes to encourage documentation of data.
SNIC/SweStore need to define data policies and service descriptions. Looking at work done or in progress in other countries is important.
Summary
"Panel discussion" - Part 2 (Erwin Laure)
DISCLAIMER: I hope I interpreted all questions and answers correctly (Jonas L)
Q: What is the policy on long-term archive?
A (FI): Not ready yet. Will be offered to research communities.
A (NO): Depends on the research council. No policy set yet. Researchers have no real interest after they have used their data. No rewards for keeping data. Create incentives for storing data.
A (S): A data plan was developed. However, it has been withdrawn after criticism from research groups. Status is unclear. SNIC needs to look at other countries. Needs to implement more policies for data management and storage.
A (PÖ): Long-term quota is decided by universities. Guaranteed until 2017.
(PW): Max-Planck. No funding for archives. University policy. President guarantees 50 years availability. Split between data/archiving not possible. All data part of the research cycle. Why the separation between archive and data?
Q: Need to collaborate more. Learn from existing projects.
GH: Governance structure of SweStore, NorStore. (S) Work in progress. (F) controlled centrally as well as from funding agencies.
R: Policies regarding sensitive data. A (NO): Depends on the data producers. Data producers are responsible for the data. A (FI): Tools needed for access controls.
PW: No interfaces for solving all problems. Data collaboration using iRODS. Going from abstract data islands needs some scalability development.
EL: Too many communities as well as different technologies.
JK: Many Europeans collaborate in principle, but do not actually collaborate.
GH: NeIC is a facilitator. Collaborate and connect users and communities. Develop leading edge competence at the different centers. No ambition to coordinate collaboration on a Nordic/European level.
PW: Demands must come from the communities. We must not create more overhead structures. From EUDAT discussions. Need to move data to the computational resources. Important to interact between EU-projects. Not create new super-projects.
GH: Duplication is not good. There is overhead in coordination as well.
PW: Why the low activity by Nordic countries in the RDA working groups? Important to have the Nordic voices in the RDA working groups.
Discussion Points
Session Summary
- BBMRI Nordic has data volumes increasing faster than Moore’s law, and thus needs a lot of long-term storage with personal data privacy protection.
Lessons Learned
- Sharing sensitive data still has many unresolved issues, especially across borders.
- Data reduction needed already on the antennas, as 80 PB/day is a major obstacle. A lot of work needs to be done on formats.
- Proprietary formats are an obstacle to knowledge extraction, indexability and sharing.
Future Directions
- BBMRI wants a close collaboration with EUDAT projects, especially for indexing and standardization of non-sensitive metadata.
- We must go beyond e-Infrastructure to e-Collaborate projects.