Realities of Academic Data Sharing (RADS) Initiative: Research Update #1

Last Updated on April 14, 2022, 5:46 pm ET

graphic depicting open laptop with data on screen, networked to the cloud
image © iStock.com/Andrey Suslov

Central to the aims of the Realities of Academic Data Sharing (RADS) project is an examination of the research-data landscape since the release of the Holdren public-access memo in 2013. The 2013 mandate, while requiring public access to federally funded research data, created direct and indirect infrastructure costs for academic institutions and researchers. Six months into the project, the RADS project team has examined costing frameworks to better understand the activities and services needed to make research data publicly accessible. The team has also extensively examined metadata quality at the project’s six participating institutions to assess research data FAIRness (findability, accessibility, interoperability, and reusability), a useful indicator of data access and reusability.

Metadata Analysis

Where are funded researchers across these institutions making their data publicly accessible?

To determine where researchers are making their data publicly accessible, the RADS team began with an examination of DataCite, a global repository containing a variety of resource types, mostly data sets, primarily used as a source of digital object identifiers (DOIs) for these resources. The team initially queried the DataCite repository for affiliations associated with resource creators at each RADS institution. Table 1 lists the number of DataCite records for each of these institutional affiliations.

Creator Affiliation                   Count
Cornell University                    7,167
Duke University                       4,255
University of Michigan               15,462
University of Minnesota               3,312
Virginia Tech                         2,930
Washington University in St. Louis    2,473


Table 1. Count of DataCite records for each RADS institutional affiliation
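As a rough illustration of what an affiliation query of this kind can look like, the sketch below builds a search URL against DataCite's public REST API. The post does not say which field or tool the team used, so the query field `creators.affiliation.name`, the page size, and the helper name `affiliation_query_url` are assumptions made for this example only.

```python
from urllib.parse import urlencode

# Public DataCite REST API endpoint for DOI records.
DATACITE_DOIS = "https://api.datacite.org/dois"


def affiliation_query_url(affiliation: str) -> str:
    """Build a DOI search URL filtered on a creator affiliation name."""
    params = {
        # Assumed query field; the post does not state which field was searched.
        "query": f'creators.affiliation.name:"{affiliation}"',
        # Only a count is of interest here, so ask for a single record per page.
        "page[size]": 1,
    }
    return f"{DATACITE_DOIS}?{urlencode(params)}"


# Two of the six RADS affiliations from Table 1, for illustration.
for name in ("Cornell University", "Duke University"):
    print(affiliation_query_url(name))
```

The sketch stops short of sending the request. If the API behaves as its documentation describes, the number of matching records should be reported in the response metadata, but the counts in Table 1 may reflect additional filtering not shown here.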

To move data into a global research infrastructure such as DataCite, resource metadata is often created by researchers and data curators; however, metadata is also transferred among global infrastructure services (DataCite, ORCID, etc.). Those transfers often include changes in the content, such as adding or removing fields appropriate for the metadata dialect in the target infrastructure.

The number of records for each institutional affiliation and repository source is shown in Figure 1 below. The three most common repositories (Harvard Dataverse, Zenodo, and Dryad) are used by all six RADS institutions, while the next most common (The Qualitative Data Repository and ICPSR) are used by three or four institutions.

Figure 1. Number of resources for each RADS institution, by most common repository

What is the “FAIRness” of the metadata?

Working with Ted Habermann at Metadata Game Changers, we use his FAIR rubric, which is based on metadata elements found in DataCite and described in a series of posts on his blog, to define metadata quality as measured by metadata completeness. Elements are classified into four categories:

  • Findability—Essential metadata elements, typically text fields that can support human searches on titles, abstracts, keywords, author names, affiliations, etc.
  • Accessibility—Primarily distribution and rights information, helping users get the data and understand what they can or cannot do with it.
  • Interoperability—Use case primarily aimed at integrating data sets into tools for analysis and comparison with other data sets.
  • Reusability—Elements described to support data reuse, including such elements as Resource Contact, Cited By, and Shared By.

“The DataCite metadata dialect is focused primarily on discovery, i.e., the F in FAIR,” Habermann notes. “Nevertheless, there are a number of DataCite elements that support data access and making connections between datasets, software, institutions, documentation, and people. These connections are critical in supporting interoperability and reusability.”

Habermann’s assessment tests for completeness of 85 different DataCite metadata elements in four different groups (FAIR: Findable-Essential; FAIR: Findable-Support; FAIR: AIR-Essential; FAIR: AIR-Support). Each of these groups was given a score between 0 and 1 for the number of elements in the group observed in the metadata.
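A completeness score of this kind can be sketched in a few lines. The example below assumes the score for a group is the fraction of that group's elements present in a record, which is one natural reading of the description above rather than a confirmed detail of Habermann's rubric. The element names and the two small groups are hypothetical stand-ins for the 85 elements in the real assessment.

```python
def completeness(record_elements: set[str], group_elements: set[str]) -> float:
    """Fraction of a group's elements that appear in a record (0 to 1)."""
    if not group_elements:
        return 0.0
    return len(record_elements & group_elements) / len(group_elements)


# Hypothetical groups for illustration; the real rubric covers 85 elements
# across four groups.
GROUPS = {
    "Findable-Essential": {"title", "creator", "publicationYear", "resourceType"},
    "AIR-Essential": {"rights", "relatedIdentifier", "fundingReference"},
}

# Elements observed in one hypothetical metadata record.
record = {"title", "creator", "publicationYear", "rights"}

scores = {group: completeness(record, elements) for group, elements in GROUPS.items()}
print(scores)
```

Averaging these per-record scores over all records from a repository would give a per-group percentage comparable in spirit to the values plotted in Figure 2.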

As described by Habermann, Figure 2 shows a comparison of FAIRness scores, defined in percent complete, for DataCite metadata from institutional repositories (dashed lines) and from data sets submitted by institutional researchers to other repositories (solid lines).

Figure 2. Comparison of FAIRness scores for DataCite metadata from five RADS institutional repositories (dashed lines) and by RADS institutional researchers to other repositories (solid lines), where % is the average FAIR score for each FAIR completeness category

Next, metadata quality was assessed by comparing institutional metadata completeness across repositories. The completeness of local repository metadata was compared with that of the metadata found in DataCite and of the metadata created by researchers from the same institutions when submitting to other repositories such as Zenodo, Dryad, and Dataverse.

Table 2 shows the comparison of metadata completeness for the Data Repository of the University of Minnesota (DRUM) and occurrence rates of various metadata elements for the same data sets in three repository sets:

  1. DRUM Metadata—The DRUM repository (local data repository)
  2. DRUM@DataCite—Metadata from DRUM at DataCite
  3. Other Repositories—Metadata for University of Minnesota data sets in other repositories at DataCite

Metadata element              DRUM Metadata   DRUM@DataCite   Other Repositories
dc.description                95%             0.30%           10%
dc.description.abstract       80%             12%             95%
dc.subject                    85%             2%              63%
dc.relation.isreferencedby    78%             0%              0.6%
dc.description.sponsorship    72%             1%              12%


Table 2. Occurrence % for metadata elements in three repository sets

The difference between the DRUM Metadata and DRUM@DataCite columns reflects an opportunity to improve the workflows used to share DRUM metadata with DataCite. Ted Habermann is currently developing methods leveraging DataCite’s API to transfer metadata from local repositories, without creating new metadata.
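To make the size of that opportunity concrete, the short sketch below computes the percentage-point drop for each element between the DRUM Metadata and DRUM@DataCite columns of Table 2. The values are copied from the table; the function name `transfer_gaps` is ours and is only illustrative.

```python
# Occurrence rates (%) copied from Table 2: (DRUM Metadata, DRUM@DataCite).
TABLE_2 = {
    "dc.description": (95, 0.30),
    "dc.description.abstract": (80, 12),
    "dc.subject": (85, 2),
    "dc.relation.isreferencedby": (78, 0),
    "dc.description.sponsorship": (72, 1),
}


def transfer_gaps(table: dict[str, tuple[float, float]]) -> dict[str, float]:
    """Percentage-point drop from local repository to DataCite, largest first."""
    gaps = {
        element: round(local - datacite, 2)
        for element, (local, datacite) in table.items()
    }
    return dict(sorted(gaps.items(), key=lambda item: item[1], reverse=True))


for element, gap in transfer_gaps(TABLE_2).items():
    print(f"{element}: {gap} points")
```

Ranked this way, dc.description shows the largest drop and dc.description.abstract the smallest, although every element in the table loses well over half of its local coverage on the way to DataCite.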

Finally, as shown in Figure 3, RADS institution researchers have increasingly shared their data in a wider variety of other repositories, such as Zenodo, Dryad, or Dataverse, with a clear uptick in submissions since 2013.

Figure 3. Data shared by RADS institutional researchers in “other repositories,” indexed by DataCite 2012–2021

Cost Analysis

As Figure 3 above illustrates, there is an upward trend in research data submissions to repositories, as indexed by DataCite. Understanding the infrastructure required to support this growth, and the resulting costs, requires knowledge of the activities and services needed to make research data publicly accessible: not just data curation or preservation, but the entire research-data life cycle, up to reuse.

The Association of Research Libraries (ARL) and the RADS initiative have partnered with the Council on Governmental Relations (COGR) and the Federal Demonstration Partnership (FDP) to thoroughly assess existing costing frameworks used to support public access to research data. These frameworks and tools include the COGR/APARD Costing Framework (in development), Keeping Research Data Safe (KRDS2), the NASEM Biomedical Data Cost Driver Framework, the OCLC Total Cost of Special Collections Stewardship Tool Suite, and the UK Data Service Data Management Costing Tool and Checklist.

Several of these frameworks include activities or (sub)categories that define public access activities within the larger costs of data management. All of these costing frameworks are useful in identifying activities and processes required for public access to research data, but none address the entire research life cycle.

RADS project investigators recognized the need for a more comprehensive understanding of the processes and activities required to make research data publicly accessible, in order to then determine their corresponding costs. Cynthia Hudson Vitale (ARL, RADS), Wendy Kozlowski (Cornell, RADS), Jim Luther (COGR & FDP), Christi Keene (FDP), Melissa Korf (FDP), and I presented at the Coalition for Networked Information (CNI) Spring 2022 Virtual Membership Meeting on March 21, 2022. Our presentation, “COGR, FDP, and ARL: Putting Numbers behind Institutional Expenses for Public Access to Research Data,” addressed the challenges of understanding the true cost of making research data publicly accessible, and the efforts currently underway to understand these costs.

Personnel Changes

Lisa Johnston has left her position as research data management/curation lead at the University of Minnesota (UMN) and has moved into the new position of Director, Data Governance at the University of Wisconsin–Madison in the Office of Data Management and Analytics Services. Johnston continues in the RADS initiative as a co–principal investigator, lending her expertise to project metadata analysis.

Lisa Johnston’s departure meant a potential gap for research to be conducted locally at UMN, but we are delighted Alicia Hofelich Mohr will continue this work, as she has agreed to join the project team. Hofelich Mohr is the research support coordinator in the Liberal Arts Technologies and Innovation Services (LATIS) at UMN and has a background in statistics, research-data management, data analysis, and reproducible research using R.

What’s Next?

  • RADS will hold its first in-person meeting at the ARL offices at the end of April, with project researchers from member institutions of the Data Curation Network (DCN) in attendance.
  • Ted Habermann, the RADS metadata consultant, is continuing to assess institutional repository metadata completeness and is analyzing element transfer to DataCite at the remaining five RADS institutions.
  • Project researchers are determining metadata best practices for local repositories, with element transfer to global infrastructure, such as DataCite, and persistent identifiers (PIDs) as a priority for research data connectedness.

Acknowledgements

Metadata analysis in this post is attributed to Ted Habermann of Metadata Game Changers, with Lisa Johnston collaborating with Habermann on the University of Minnesota DRUM analysis.

Additional Information—Related Presentations and Blog Posts

Habermann, Ted. “Metadata Life Cycle: Mountain or Superhighway?” Metadata Game Changers Blog. March 7, 2022. https://metadatagamechangers.com/blog/2022/3/7/ivfrlw6naf7am3bvord8pldtuyqn4r.

———. “Funder Metadata: Identifiers and Award Numbers.” Metadata Game Changers Blog. February 2, 2022. https://metadatagamechangers.com/blog/2022/2/2/funder-metadata-identifiers-and-award-numbers.

Hudson Vitale, Cynthia, Wendy Kozlowski, Jim Luther, Christi Keene, Melissa Korf, and Shawna Taylor. “COGR, FDP, and ARL: Putting Numbers behind Institutional Expenses for Public Access to Research Data.” Presentation at CNI Spring 2022 Membership Meeting, virtual, March 21, 2022. https://www.cni.org/topics/assessment/cogr-fdp-and-arl-putting-numbers-behind-institutional-expenses-for-public-access-to-research-data.
