Looking Back to the Future: A Glimpse at Twenty Years of Data Science

This paper carries out a lightweight review exploring the potential of data science over the last two decades, with a particular focus on four essential components: data resources, technologies, data infrastructures, and data education. To account for the barriers facing data science, the analysis is mapped onto these four components, highlighting priorities and challenges across social and cultural, epistemological, scientific and technical, economic, legal, and ethical dimensions. The analysis suggests that the future development of data science will shift toward datafication, data technicity, infrastructuralism, and data literacy empowerment. At the macro level, the data ecosystem is also analyzed under the open science umbrella, providing a snapshot of where data science is headed.

Published on 2023-04-05 10:30:30

Attending to the Cultures of Data Science Work

This essay reflects on the shifting attention to the “social” and the “cultural” in data science communities. While the “social” and the “cultural” have recently been prioritized in data science discourse, the social and cultural concerns raised in data science are almost always outwardly focused, applying to the communities that data scientists seek to support rather than to the more computationally focused data science communities themselves. I argue that data science communities have a responsibility to attend not only to the cultures that orient the work of domain communities, but also to the cultures that orient their own work. I describe how ethnographic frameworks such as thick description can be enlisted to encourage more reflexive data science work, and I conclude with recommendations for documenting the cultural provenance of data policy and infrastructure.

Published on 2023-04-03 06:49:21

Scaling Identifiers and their Metadata to Gigascale: An Architecture to Tackle the Challenges of Volume and Variety

Persistent identifiers are applied to an ever-increasing variety of research objects, including software, samples, models, people, instruments, grants, and projects, and there is a growing need to apply identifiers at finer and finer granularity. Unfortunately, the systems developed over two decades ago to manage identifiers and the metadata describing the identified objects no longer scale. Communities working with physical samples have grappled for many years with the increasing volume, variety, and variability of identified objects. To address these challenges, the IGSN 2040 project explored how metadata and catalogues for physical samples could be shared at the scale of billions of samples across an ever-growing variety of users and disciplines. In this paper, we focus on how to scale identifiers and their describing metadata to billions of objects, and we identify the actors involved in this system. Our analysis of these requirements resulted in the definition of a minimum viable product and the design of an architecture that not only addresses the challenges of increasing volume and variety but, more importantly, is easy to implement because it reuses commonly used Web components. Our solution is based on a Web architectural model that utilises Schema.org, JSON-LD, and sitemaps. Applying these widely used Web architectural patterns allows us not only to handle increasing variety but also to improve compliance with the FAIR Guiding Principles.
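
As a rough illustration of the architectural pattern the abstract describes, the sketch below builds a Schema.org record for a single physical sample, serialises it as JSON-LD, and emits a one-entry sitemap pointing at the sample's landing page. The identifier value, URLs, and choice of properties are illustrative assumptions, not the IGSN metadata profile itself.

```python
import json

# A minimal Schema.org description of one physical sample, serialised as
# JSON-LD. Property choices, the identifier value, and the landing-page
# URL are hypothetical; the real IGSN profile defines its own fields.
sample_record = {
    "@context": "https://schema.org/",
    "@type": "Thing",
    "@id": "https://example.org/samples/XYZ123",  # hypothetical landing page
    "identifier": {
        "@type": "PropertyValue",
        "propertyID": "IGSN",
        "value": "XYZ123",  # hypothetical sample identifier
    },
    "name": "Basalt core section",
    "description": "Example description of an identified physical sample.",
}
print(json.dumps(sample_record, indent=2))

# A sitemap listing the landing page lets catalogues discover records by
# crawling, rather than by querying a central identifier service.
sitemap = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
    '  <url><loc>https://example.org/samples/XYZ123</loc></url>\n'
    '</urlset>'
)
print(sitemap)
```

The design choice the paper points to is that both pieces use components any web server already supports, so catalogue operators can harvest descriptions at scale without a bespoke API.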

Published on 2023-03-01 13:58:20

Correction: 39 Hints to Facilitate the Use of Semantics for Data on Agriculture and Nutrition

This article details a correction to the article: Caracciolo, C., Aubin, S., Jonquet, C., Amdouni, E., David, R., Garcia, L., Whitehead, B., Roussey, C., Stellato, A. and Villa, F., 2020. 39 Hints to Facilitate the Use of Semantics for Data on Agriculture and Nutrition. Data Science Journal, 19(1), p.47. DOI: http://doi.org/10.5334/dsj-2020-047

Published on 2023-02-14 10:19:51

What are Researchers’ Needs in Data Discovery? Analysis and Ranking of a Large-Scale Collection of Crowdsourced Use Cases

Data discovery is important for facilitating data re-use. To help frame the development and improvement of data discovery tools, we collected 101 use cases describing requirements and users’ wishes between 2019 and 2020. This paper presents an analysis of these use cases to examine data discovery requirements. We categorized the information across 12 ‘topics’ and eight types of users. While the availability of metadata was an expected topic of importance, users were also keen on receiving more information on data citation and a better overview of their field. We conducted and analysed a survey among data infrastructure specialists in a first attempt at ranking the requirements. Among these data professionals, the rankings differed widely, except for the availability of metadata and data quality assessment.

Published on 2023-02-09 09:42:15

Making Drone Data FAIR Through a Community-Developed Information Framework

Small Uncrewed Aircraft Systems (sUAS) are an increasingly common tool for data collection in many scientific fields. However, there are few standards or best practices guiding the collection, sharing, or publication of data collected with these tools, which makes collaboration, data quality control, and reproducibility challenging. To that end, we have used iterative rounds of data modeling and user engagement to develop a Minimum Information Framework (MIF) to guide sUAS users in collecting the metadata necessary to ensure that their data are trustworthy, shareable, and reusable. This paper briefly outlines our methods and the MIF itself, which includes 74 metadata terms in four classes that sUAS users should consider collecting for any given study. The MIF provides a foundation that can be used for developing standards and best practices.
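
As a purely hypothetical sketch of what a minimum-information record along these lines might look like, the snippet below groups a few made-up metadata terms into classes. The MIF's actual four classes and 74 terms are defined in the paper and are not reproduced here; every name below is an assumption for illustration only.

```python
from dataclasses import dataclass, asdict

# Hypothetical grouping of sUAS metadata terms into classes. The real MIF
# defines four classes and 74 terms; these names are invented examples.
@dataclass
class Platform:
    manufacturer: str
    model: str
    sensor: str

@dataclass
class Flight:
    date: str          # ISO 8601 date of the flight
    altitude_m: float  # flying altitude in metres
    operator: str

@dataclass
class FlightRecord:
    platform: Platform
    flight: Flight
    licence: str  # data-reuse licence, e.g. "CC-BY-4.0"

record = FlightRecord(
    platform=Platform("ExampleCo", "Quad-X", "RGB camera"),
    flight=Flight("2022-06-01", 80.0, "A. Pilot"),
    licence="CC-BY-4.0",
)
print(asdict(record))
```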

Published on 2023-01-25 09:44:50

Data Management Plans: Implications for Automated Analyses

Data management plans (DMPs) are an essential part of planning data-driven research projects and of ensuring long-term access to and use of research data and digital objects; however, as text-based documents, DMPs must be analyzed manually for conformance to funder requirements. This study presents a comparison of DMP evaluations for 21 funded projects using 1) an automated analysis that identifies elements aligned with best practices in support of open research initiatives and 2) a manually applied scorecard measuring these same elements. The automated analysis revealed that terms related to availability (90% of DMPs), metadata (86% of DMPs), and sharing (81% of DMPs) were reliably supplied. Manual analysis revealed that 86% (n = 18) of funded DMPs were adequate, with strong discussions of data management personnel (average score: 2 out of 2), data sharing (average score: 1.83 out of 2), and limitations to data sharing (average score: 1.65 out of 2). This study shows that the automated approach to DMP assessment yields less granular yet similar results compared with manual assessment, while being more efficient to produce. Additional observations and recommendations are also presented to make data management planning exercises and automated analysis even more useful going forward.
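
A minimal sketch of the kind of term-based scan such an automated analysis implies is shown below: it reports the fraction of DMP texts mentioning terms from each best-practice category. The category keyword lists are assumptions chosen for illustration, not the coding instrument used in the study.

```python
import re

# Illustrative keyword lists per best-practice category; the study's
# actual instrument is not reproduced here.
CATEGORIES = {
    "availability": ["available", "availability", "repository", "access"],
    "metadata": ["metadata", "data dictionary", "documentation"],
    "sharing": ["share", "sharing", "public release"],
}

def categories_present(dmp_text):
    """Return the set of categories whose terms appear in one DMP text."""
    text = dmp_text.lower()
    return {
        category
        for category, terms in CATEGORIES.items()
        if any(re.search(r"\b" + re.escape(term), text) for term in terms)
    }

def coverage(dmp_texts):
    """Fraction of DMPs mentioning each category at least once."""
    hits = dict.fromkeys(CATEGORIES, 0)
    for text in dmp_texts:
        for category in categories_present(text):
            hits[category] += 1
    return {category: count / len(dmp_texts) for category, count in hits.items()}

example_dmps = [
    "Data will be made available in a public repository with rich metadata.",
    "We will share data on request and provide full documentation.",
]
print(coverage(example_dmps))
```

A scan like this is cheap to run over many DMPs at once, which matches the paper's observation that automated assessment trades granularity for efficiency.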

Published on 2023-01-25 09:51:59

RDM in a Decentralised University Ecosystem—A Case Study of the University of Cologne

The University of Cologne (UoC) has historically grown into highly decentralised structures. This is reflected in a two-layered library structure as well as in a number of decentralised research data management (RDM) activities established at the faculty and research consortium levels. With the aim of fostering networking, cooperation, and synergies between existing activities, a university-wide RDM structure is to be established. A one-year feasibility study was commissioned by the Rectorate in 2016 and carried out by the research management department, the library, and the computing centre. One outcome of the study was the adoption of a university-wide research data guideline. Based on a comprehensive RDM service portfolio, measures were developed to put central RDM into practice. The challenge has been to find the right level of integration with, and adaptation of, existing decentralised structures while developing new structures and services.

We report on the first steps taken to map out central RDM practices at the UoC and to develop a structure of cooperation between loosely coupled information infrastructure actors. Central elements of this structure are a competence centre, an RDM expert network, a forum for exchange on RDM and associated topics, and the faculties with their decentralised, domain-specific RDM services. The Cologne Competence Center for Research Data Management (C3RDM) was founded at the end of 2018 and is still in its development phase. It provides a one-stop entry point for all questions regarding RDM. The centre itself provides basic and generic RDM services, such as training, consulting, and data publication support, and acts as a hub connecting decentralised experts, information infrastructure actors, and resources.

Published on 2022-12-27 09:53:05

Organization IDs in Germany—Results of an Assessment of the Status Quo in 2020

Persistent identifiers (PIDs) for scientific organizations such as research institutions and research funding agencies are a further decisive piece of the puzzle in promoting standardization in the scholarly publication process, especially in light of the already established digital object identifiers (DOIs) for research outputs and ORCID iDs for researchers. The application of these PIDs enables automated data flows and guarantees the persistent linking of information objects. Moreover, PIDs are fundamental components for the implementation of open science. For example, PIDs for scientific organizations are of crucial importance when analyzing an institution's publications and the costs of its transition to open access.

To find out more about the status quo of the use and adoption of organization IDs in Germany, a ‘Survey on the Need for and Use of Organization IDs at Higher Education Institutions and Non-University Research Institutions in Germany’ was conducted among 548 scientific institutions in Germany in the period from July 13 to December 4, 2020, as part of the DFG-funded project ORCID DE. One hundred and eighty-three institutions participated in what was the largest survey to date on organization IDs in Germany. The survey included questions on the knowledge, adoption, and use of organization IDs at scientific institutions. Moreover, respondent institutions were asked about their needs with regard to organization IDs and their metadata (e.g., in terms of relationships and granularity). The present paper provides a comprehensive overview of the results of the survey conducted as part of the aforementioned project and contributes to the promotion and increased awareness of organization IDs.

Published on 2022-12-22 10:58:11

Data Quality Assurance at Research Data Repositories

This paper presents findings from a survey on the status quo of data quality assurance practices at research data repositories.

The personalised online survey was conducted in 2021 among repositories indexed in re3data. It covered the scope of the repository, types of data quality assessment, quality criteria, responsibilities, details of the review process, and data quality information, and it yielded 332 complete responses.

The results demonstrate that most repositories perform data quality assurance measures and that, overall, research data repositories contribute significantly to data quality. Quality assurance at research data repositories is multifaceted and nonlinear, and although there are some common patterns, individual approaches to ensuring data quality are diverse. The survey showed that data quality assurance places high expectations on repositories and requires substantial resources. Several challenges were identified, for example the adequate recognition of the contributions of data reviewers and repositories, the path dependence of data review on the review processes for text publications, and the lack of data quality information. The study could not confirm that the certification status of a repository is a clear indicator of whether it conducts in-depth quality assurance.

Published on 2022-11-22 09:46:46