A Survey on Publicly Available Open Datasets Derived From Electronic Health Records (EHRs) of Patients with Neuroblastoma

Background: Neuroblastoma is a rare pediatric cancer that affects thousands of children worldwide. Information stored in electronic health records can be a useful source of data for in silico scientific studies of this disease, carried out both by humans and by computational machines. Several open datasets derived from the electronic health records of anonymized patients diagnosed with neuroblastoma are available on the internet, but they were released on different websites or as supplementary information of peer-reviewed scientific publications, making them difficult to find.

Methods: To address this problem, we present a survey of five open public datasets derived from the electronic health records of patients diagnosed with neuroblastoma, all collected on a single website called the Neuroblastoma Electronic Health Records Open Data Repository.

Results: The five open datasets presented in this survey can be used by researchers worldwide who want to carry out scientific studies on neuroblastoma, including machine learning and computational statistics analyses.

Conclusions: We believe our survey and our open data resource can have a strong impact on oncology research, enabling new scientific discoveries that improve our understanding of neuroblastoma and therefore the conditions of patients. We release the five open datasets reviewed here publicly and freely on our Neuroblastoma Electronic Health Records Open Data Repository under the CC BY 4.0 license at:

https://davidechicco.github.io/neuroblastoma_EHRs_data or at

https://doi.org/10.5281/zenodo.6915403

Published on 2022-10-04 10:07:45

KadiStudio: FAIR Modelling of Scientific Research Processes

FAIR handling of scientific data plays a significant role in current efforts towards a more sustainable research culture and is a prerequisite for the fourth scientific paradigm, that is, data-driven research. To enforce the FAIR principles by ensuring the reproducibility of scientific data and comprehensibly tracking their provenance, research processes must be modelled as FAIR, automatable workflows. By providing reusable procedures that contain expert knowledge, such workflows contribute decisively to the quality and the acceleration of scientific research. In this work, the requirements for a system capable of modelling FAIR workflows are defined, and a generic concept for modelling research processes as workflows is developed. To this end, research processes are iteratively divided into impartible subprocesses at different levels of detail using the input-process-output model. The concrete software implementation of this universally applicable concept is finally presented in the form of the workflow editor KadiStudio of the Karlsruhe Data Infrastructure for Materials Science (Kadi4Mat).
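The abstract does not show KadiStudio's implementation; the following Python fragment is only a minimal sketch of the input-process-output decomposition it describes, in which the names Node and Workflow are illustrative assumptions rather than KadiStudio APIs.

# Hypothetical sketch of the input-process-output (IPO) model described
# above; Node and Workflow are illustrative names, not KadiStudio classes.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Node:
    """One impartible subprocess: named inputs -> process -> named outputs."""
    name: str
    process: Callable[[dict], dict]      # maps an input dict to an output dict
    inputs: dict = field(default_factory=dict)

    def run(self) -> dict:
        return self.process(self.inputs)

@dataclass
class Workflow:
    """A research process modelled as an ordered chain of IPO nodes."""
    nodes: list[Node] = field(default_factory=list)

    def run(self, initial: dict) -> dict:
        data = initial
        for node in self.nodes:          # each node's outputs feed the next node
            node.inputs = data
            data = node.run()
        return data

# A toy two-step 'experiment' workflow: acquire data, then analyse it.
wf = Workflow([
    Node("acquire", lambda d: {"raw": [x * 2 for x in d["samples"]]}),
    Node("analyse", lambda d: {"mean": sum(d["raw"]) / len(d["raw"])}),
])
print(wf.run({"samples": [1, 2, 3]}))    # -> {'mean': 4.0}

In a full workflow system, each node would additionally record its inputs, outputs, and parameters so that provenance can be traced and the run reproduced.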

Published on 2022-09-23 10:56:05

Machine Learning Applied for Spectra Classification in X-ray Free Electron Laser Sciences

Spectroscopy experiment techniques are widely used and produce a huge amount of data, especially in facilities with very high repetition rates. At the European XFEL, X-ray pulses can be generated with only 220 ns separation in time and a maximum of 27,000 pulses per second. In experiments at the different scientific instruments, spectral changes can indicate a change of the system under investigation and thus the progress of the experiment. Immediate feedback on the actual state (e.g., the time-resolved status of the sample) would be essential to quickly judge how to proceed with the experiment. Hence, we aim to capture two major spectral changes: the change of intensity distribution (e.g., drop or appearance) of peaks at certain locations, and the shift of peaks in the spectrum. Machine Learning (ML) opens up new avenues for data-driven analysis in spectroscopy by offering the possibility of quickly recognizing such specific changes and implementing an online feedback system that can be used in near real time during data collection.

On the other hand, ML requires large amounts of clearly annotated data. Hence, it is important that experimental data be managed according to the FAIR principles. In the case of XFEL experiments, we suggest introducing the NeXus glossary and the corresponding data format standards for future experiments.

An example is presented to demonstrate how Neural Network-based ML can be used for accurately classifying the state of an experiment if properly annotated data is provided.
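The paper's neural-network pipeline is not reproduced here; as a minimal sketch of the two spectral changes described above (an intensity drop and a peak shift), the following Python fragment uses generic SciPy peak finding on synthetic spectra, with all thresholds chosen arbitrarily for illustration.

# Illustrative sketch only: generic peak detection, not the paper's
# neural-network classifier; spectra and thresholds are synthetic.
import numpy as np
from scipy.signal import find_peaks

def peak_summary(spectrum, energy):
    """Return the positions and heights of the prominent peaks."""
    idx, _ = find_peaks(spectrum, prominence=0.1 * spectrum.max())
    return energy[idx], spectrum[idx]

# Two synthetic spectra: the current one has a shifted, weaker peak.
energy = np.linspace(0.0, 10.0, 1000)
reference = np.exp(-((energy - 4.0) ** 2) / 0.05)
current = 0.6 * np.exp(-((energy - 4.3) ** 2) / 0.05)

ref_pos, ref_height = peak_summary(reference, energy)
cur_pos, cur_height = peak_summary(current, energy)
shift = cur_pos[0] - ref_pos[0]              # peak shift in energy units
drop = 1.0 - cur_height[0] / ref_height[0]   # relative intensity drop
print(f"peak shift: {shift:.2f}, intensity drop: {drop:.0%}")

A classifier like the one the abstract describes would learn such change signatures directly from annotated spectra instead of relying on hand-tuned prominence thresholds.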

Published on 2022-08-09 12:12:58

A Critical Literature Review of Historic Scientific Analog Data: Uses, Successes, and Challenges

For years, scientists in fields from climate change to biodiversity to hydrology have used older data to address contemporary issues. Since the 1960s, researchers, recognizing the value of this data, have expressed concern about its management and potential for loss. No widespread solutions have emerged to address the myriad issues around its storage, access, and findability. This paper summarizes the observations and concerns of researchers in various disciplines who have articulated problems associated with analog data, and highlights examples of projects that have used historical data. The authors also examined selected papers to discover how researchers located historical data and how they used it. While many researchers are not producing huge amounts of analog data today, there are still large volumes of it that are at risk. To address this concern, the authors recommend the development of best practices for managing historic data. This will require communication across disciplines and the involvement of researchers, departments, institutions, and associations in the process.

Published on 2022-07-28 11:28:56

Development and Governance of FAIR Thresholds for a Data Federation

The FAIR (findable, accessible, interoperable, and reusable) principles and practice recommendations provide high-level guidance that is not research-domain specific in nature. There remains a gap in practice at the data-provider and domain-scientist level in demonstrating how the FAIR principles can be applied beyond a set of generalist guidelines to meet the needs of a specific domain community.

We present our insights from developing FAIR thresholds in a domain-specific context for self-governance by a community (agricultural research). ‘Minimum thresholds’ for FAIR data are required to align expectations for data delivered from providers’ distributed data stores through a community-governed federation (the Agricultural Research Federation, AgReFed).

Data providers were supported to make their data holdings more FAIR. They had a range of different FAIR starting points, organisational goals, end-user needs, solutions, and capabilities. This informed the distillation of a set of FAIR criteria ranging from ‘Minimum thresholds’ to ‘Stretch targets’. These were operationalised through consensus into a framework for governance and implementation by the agricultural research domain community.

Improving the FAIR maturity of data required resourcing and incentives, highlighting the challenge for data federations of generating value whilst reducing the costs of participation. Our experience showed a role for collective advocacy, relationship brokering, tailored support, and low-bar access to tooling, particularly across the areas of data structure, access, and semantics that were challenging for domain researchers. Active democratic participation, supported by a governance framework like AgReFed’s, will ensure that participants have a say in how federations deliver individual and collective benefits for members.

Published on 2022-06-09 12:14:54

Activities of the Polar Environment Data Science Center of ROIS-DS, Japan

The Polar Environment Data Science Center (PEDSC) is one of the centers of the Joint Support-Center for Data Science Research (DS) of the Research Organization of Information and Systems (ROIS), which was established in 2017. The purpose of the PEDSC is to promote the opening and sharing of the scientific data obtained by research activities in the polar region led by the National Institute of Polar Research (NIPR). The activities of the PEDSC have been carried out since 2017 along a five-year plan with the following seven specific tasks: (1) construction of an integrated database; (2) upgrade and interoperable use of the three existing database systems (NIPR Science Database, Arctic Data archive System (ADS), and Inter-university Upper atmosphere Global Observation NETwork system (IUGONET)); (3) processing of time-series digital data; (4) processing of sample data; (5) data publication in the Polar Data Journal; (6) collaboration with external communities; and (7) promotion of data science using the database and database systems.

Published on 2022-04-20 10:13:07

Persistent Identification for Conferences

Persistent identification of entities plays a major role in the progress of digitization in many fields. In the scholarly publishing realm, there are already persistent identifiers (PIDs) for papers (DOI), people (ORCID), organisations (GRID, ROR), and books (ISBN), but there is no generally accepted PID system for scholarly events such as conferences or workshops yet. This article describes the relevant use cases that motivate the introduction of persistent identifiers for conferences. The use cases were derived mainly from interviews, discussions with experts, and their previous work. Researchers, conference organizers, and data consumers were identified as the primary stakeholders involved in the typical conference event life cycle. The resulting list of use cases illustrates how PIDs for conference events will improve the current situation for these stakeholders and help with the problems they face today.

Published on 2022-04-05 10:21:34

When Your Data is My Grandparents Singing. Digitisation and Access for Cultural Records, the Pacific and Regional Archive for Digital Sources in Endangered Cultures (PARADISEC)

In this paper we discuss the Pacific and Regional Archive for Digital Sources in Endangered Cultures (PARADISEC), a research repository that explicitly aims to act as a conduit for research outputs to a range of audiences, both within and outside of academia. PARADISEC has been operating for 19 years and has grown to hold over 390,000 files, currently totaling 150 terabytes and representing 1,312 languages, many of them from Papua New Guinea and the Pacific. Our focus is on recordings and transcripts in the many small languages of the world, the songs and stories that are unique cultural expressions. While this research data is created for a particular project, it has huge value beyond academic research, as it is typically oral tradition recorded in places where little else has been recorded. There is an increasing focus in academia on reproducible research and research data management, and repositories are the key to successful data management. We discuss the importance for research practice of having discipline-specific repositories. The data in our work is also cultural material that has value to the people recorded and their descendants; it is their grandparents, and so we, as outsider researchers, have special responsibilities to treat the materials with respect and to ensure they are accessible to the people we have worked with.

Published on 2022-04-04 08:57:25

Quality Management Framework for Climate Datasets

Data from a variety of research programmes are increasingly used by policy makers, researchers, and the private sector to make data-driven decisions related to climate change and variability. Climate services are emerging as the link that narrows the gap between climate science and downstream users. The Global Framework for Climate Services (GFCS) of the World Meteorological Organization (WMO) offers an umbrella for the development of climate services and has identified quality assessment, along with its use in user guidance, as a key aspect of service provision. This offers an extra stimulus for discussing what type of quality information to focus on and how to present it to downstream users. Quality has become an important keyword for those working on data in both the private and public sectors, and significant resources are now devoted to the quality management of processes and products. Quality management guarantees the reliability and usability of the product served; it is a key element in building trust between consumers and suppliers. Untrustworthy data could lead to a negative economic impact at best and a safety hazard at worst. In a progressive commitment to establish this relation of trust, as well as to provide sufficient guidance for users, the Copernicus Climate Change Service (C3S) has made significant investments in the development of an Evaluation and Quality Control (EQC) function. This function offers a homogeneous, user-driven service for the quality of the C3S Climate Data Store (CDS). Here we focus on the EQC component targeting the assessment of the CDS datasets, which include satellite and in-situ observations, reanalysis, climate projections, and seasonal forecasts. The EQC function is characterised by a two-tier review system designed to guarantee the quality of the dataset information. While the need to assess the quality of climate data is well recognised, the methodologies, the metrics, the evaluation framework, and the presentation of all this information to users have never before been developed in an operational service encompassing all the main climate dataset categories. Building the underlying technical solutions poses unprecedented challenges and makes the C3S EQC approach unique. This paper describes the development and implementation of the operational EQC function, which provides an overarching quality management service for all CDS data.

Published on 2022-04-04 09:41:16

Global Community Guidelines for Documenting, Sharing, and Reusing Quality Information of Individual Digital Datasets

Open-source science builds on open and free resources that include data, metadata, software, and workflows. Informed decisions on whether and how to (re)use digital datasets depend on an understanding of the quality of the underpinning data and relevant information. However, quality information, being difficult to curate and often context specific, is currently not readily available for sharing within and across disciplines. To help address this challenge and to promote the creation and (re)use of freely and openly shared information about the quality of individual datasets, members of several groups around the world, collaborating with international domain experts, have undertaken an effort to develop international community guidelines with practical recommendations for the Earth science community. The guidelines were inspired by the guiding principles of being findable, accessible, interoperable, and reusable (FAIR). Use of the FAIR dataset quality information guidelines is intended to help stakeholders, such as scientific data centers, digital data repositories, and producers, publishers, stewards, and managers of data, to: i) capture, describe, and represent quality information about their datasets in a manner consistent with the FAIR Guiding Principles; ii) allow for the maximum discovery, trust, sharing, and reuse of their datasets; and iii) enable international access to and integration of dataset quality information. This article describes the process used to develop the guidelines, which are aligned with the FAIR principles, presents a generic quality assessment workflow, describes the guidelines for preparing and disseminating dataset quality information, and outlines a path forward to improve their disciplinary diversity.

Published on 2022-03-31 10:23:37