Research Library Collections and AI

Last Updated on April 16, 2026, 1:43 pm ET

Taped inside the back cover of poet Anne Spencer’s The Complete Works of Shakespeare (1951) are two syllabi for courses in Shakespeare. Photo by Shane Lin / University of Virginia Library.

This post is an extended version of a recent lightning talk I gave as part of a private workshop hosted by the Alliance for Responsible Data Collection (ARDC). The TL;DR: libraries have decades of expertise in copyright, mass digitization, and balancing the stewardship of sensitive materials with a mission of broad public access, all of which is relevant to conversations about AI access to library collections.

Libraries are attractive to AI companies because of their collections of rare and unique content; structured, machine-readable metadata; and multi-faceted discovery systems. As a result, AI firms are trying to access library content in a variety of ways. Some are deploying excessive bot traffic and swarms to harvest the rich collections and metadata held by research libraries as training data. Others are reaching out to form partnerships with libraries and to support the responsible digitization of library resources. In both scenarios, libraries have an opportunity to lean into our strongly held values and our experience adapting to new technology as we determine how to respond while access to collections is changing so quickly.

The Unique Vulnerability and Responsibility of Research Libraries

The nature of library archival and special collections illustrates why it is essential to engage in deliberate conversations now about exercising agency over these materials. Libraries preserve and provide access to archival and special collections—such as oral histories, personal letters, and family papers—that are unique to their community or state. For instance, my alma mater, Johns Hopkins University (JHU), hosts the American Prison Writing Archive (APWA), an open-source archive of first-person accounts from incarcerated writers accessible to a global readership. The University of North Carolina at Chapel Hill’s Southern Historical Collection (SHC) at Wilson Special Collections Library includes American slavery documents and letters written by enslaved people. Additionally, libraries operate institutional repositories (IRs), which make research outputs created by faculty, students, and staff available, generally on an open-access basis. Libraries have unique obligations to each of their communities.

Library and Archival Standards

Libraries don’t have to start from scratch in approaching the question of how to respond when the scale of technology creates new challenges for infrastructure and for humanity. When the internet transformed access to library and archival materials in the 1990s and 2000s, libraries developed frameworks based on library values and professional consensus to balance preservation of sensitive information with the imperative to make materials discoverable online. Those frameworks are relevant launching points for today’s conversations about AI.

For instance, in 2012 the Association of Research Libraries (ARL) updated its model deed-of-gift in acknowledgment that the internet and other network technologies are both a way to expand access and an opportunity to revisit complex legal and professional obligations to donors and communities. ARL’s model deed-of-gift articulates some of the very considerations that libraries are grappling with today when it comes to AI access:

Digitization is one key strategy in [the] movement toward expanded access, and with it are associated complex, evolving professional practices and legal obligations with respect to donors, intellectual property, and risk.

ARL’s 2010 “Principles to Guide Vendor/Publisher Relations in Large-Scale Digitization Projects of Special Collections Materials” can also be instructive. The principles encourage libraries to seek the broadest possible user access to digitized content, while acknowledging that the nature of distinct collections requires vigilance in digitization. They call for the careful development and application of standards around digitizing distinct collections based on the inherent characteristics of the item to be digitized. The principles can translate to AI, where it is also important to give careful consideration to copyright, privacy, and moral and cultural heritage concerns.

More recently, “The University of Virginia Archival AI Protocol” started an important conversation around ensuring that the provenance of sensitive materials donated to archives is protected in partnerships with AI firms to digitize and train on archival content. Thinking about these issues now can give researchers and libraries an opportunity to guide important conversations about traceability, linking, transparency, and credit in ways that are meaningful to the research community, without succumbing to requirements of vendors or AI firms that may not share library values or commitment to the public good.

Libraries have developed ethical collection and stewardship policies and practices that work toward acknowledging and repairing relationships between communities and institutions whose collection practices have been extractive and harmful to those communities. Such policies require documentation of the context and provenance of collections, a requirement that bears directly on conversations about digitization and AI training.

Legal Obligations

In addition to special collections, library collections include databases of journals, ebooks, digitized newspapers, streaming audio and video, datasets, and other types of electronic resources that are critical to research and scholarship. These materials are governed by copyright law as well as license agreements. Libraries work hard to ensure that they provide broad access to licensed digital materials in line with copyright law, and communicate those rights to faculty, students, and other researchers. At the institutional and association level, libraries advocate for strong fair use rights, including in the AI context; the Library Copyright Alliance (LCA) “Statement on Copyright and Generative Artificial Intelligence” holds that training AI models on copyrighted works is fair use.

Libraries also have expertise in navigating legal frameworks to protect communities that have already been harmed by extractive practices. The 1990 Native American Graves Protection and Repatriation Act (NAGPRA) governs the repatriation of Indigenous human remains, sacred items, and funerary objects to their rightful Tribal communities. In 2023, commenters suggested that the US Department of the Interior (DOI) update the NAGPRA definitions and policies to include protections for digital scans and casts of human remains. DOI declined, but pointed to flexibility in other regulations that allow Tribes to request records about repatriated objects, which can include digital data.

Access Conversations

The important conversations libraries are having now about how collections are accessed and consumed, and what “open” means in this moment and in the future, are partly driven by scale. Since the advent of computers, machines have always accessed library data (again, libraries structure metadata for machine-readability), but the scale of this access is new, and the potential for its misuse warrants discussion. The scale of the AI industry itself is also new, and libraries are grappling with the idea that a corporate entity can profit from decades of libraries digitizing collections and making them as broadly accessible as possible. In a relatively unregulated policy environment, there are few legal guardrails to prevent harmful misuse of health data or other sensitive information.

Libraries are also thinking deeply about new questions around how humans access library and archival content. If humans are more likely to access library materials through chatbots and AI agents, rather than through the library directly, how can libraries work with AI companies to improve retrieval-augmented generation (RAG) results? And how can libraries protect the provenance of collections used in RAG results so people can better understand their content and pursue research pathways that lead back to primary sources?

Another important piece of the access conversation is ensuring that researchers can use generative and non-generative AI research methods on digital collections that libraries license and pay for. ARL has been raising concerns about licenses for digital works restricting fair use and other library rights for at least 26 years, since the 2000 US Copyright Office study of the Digital Millennium Copyright Act (DMCA) Section 104. In recent years ARL has advocated for a legislative solution to this problem, though in the current environment such a solution remains unlikely. e-Resource Licensing Explained, written by a group of copyright and licensing librarians, empowers libraries to negotiate for broad usage rights.

Technical Mitigation

Bot traffic and swarms can disrupt research by overwhelming and slowing down systems, including institutional repositories. Libraries are also experiencing disproportionate disruptions to front-end interfaces to digital collections, like catalog discovery layers, which are especially appealing to bots because of the seemingly endless search results.

Libraries are implementing technical solutions that mitigate these problems while maintaining access for legitimate users. Early attempts to block IP addresses ran into circumvention techniques: bots disguised their IP addresses and ignored voluntary signals like robots.txt. Although libraries’ countermeasures have reduced the abuse of library infrastructure, the methods used by AI companies that are not collaborating directly with libraries keep evolving and demand continued vigilance.
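
To make concrete why robots.txt is only a voluntary signal, here is a minimal sketch using Python's standard urllib.robotparser module. The bot name ExampleAIBot, the rules, and the library.example.edu URLs are all hypothetical, chosen for illustration. The point is that the check runs on the crawler's side: a well-behaved crawler asks before fetching, and nothing in the file itself stops a crawler that never asks.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary library site (an assumption for
# illustration): one named AI crawler is asked to stay out entirely, and
# everyone else is asked to skip the catalog search pages.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /catalog
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the robots.txt rules permit user_agent to fetch url.

    This is the check a cooperative crawler performs on itself. A crawler
    that skips this step is not blocked by robots.txt in any way.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    item = "https://library.example.edu/collections/item/1"
    search = "https://library.example.edu/catalog?q=shakespeare"
    for agent, url in [("ExampleAIBot", item), ("Mozilla/5.0", item), ("Mozilla/5.0", search)]:
        print(agent, url, is_allowed(agent, url))
```

Because compliance is entirely up to the client, libraries that see robots.txt ignored have to enforce limits on the server side instead.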

The most effective solutions for blocking swarms and bot attacks on open systems without affecting legitimate users are “prove you’re human” checks, like Cloudflare Turnstile and Anubis. At the University of North Carolina at Chapel Hill (UNC), a novel approach that detects bot patterns and blocks faceted searches above a certain threshold cut malicious requests by 81% in one week.
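
This post does not describe how the faceted-search threshold is implemented, so the following is only a rough sketch of the general idea, not UNC's method. It assumes a hypothetical discovery layer whose facet filters appear as query parameters beginning with "f[", and a hypothetical limit of three facets per request. Both are illustrative values, not settings from any real catalog. Requests that stack more facets than the limit are flagged so the server can challenge or refuse them.

```python
from urllib.parse import parse_qsl, urlsplit

# Assumptions for illustration only: the parameter prefix that marks a facet
# filter, and the number of facets a request may combine before it is flagged.
FACET_PREFIX = "f["
MAX_FACETS = 3


def count_facets(url: str) -> int:
    """Count query parameters whose name starts with the facet prefix."""
    query = urlsplit(url).query
    # parse_qsl percent-decodes names, so "f%5Bauthor%5D" is seen as "f[author]".
    pairs = parse_qsl(query, keep_blank_values=True)
    return sum(1 for name, _ in pairs if name.startswith(FACET_PREFIX))


def should_challenge(url: str, max_facets: int = MAX_FACETS) -> bool:
    """Return True when a request combines more facets than the limit allows."""
    return count_facets(url) > max_facets


if __name__ == "__main__":
    shallow = "/catalog?q=shakespeare&f%5Bauthor%5D%5B%5D=Spencer"
    deep = "/catalog?f[a][]=1&f[b][]=2&f[c][]=3&f[d][]=4"
    print(should_challenge(shallow), should_challenge(deep))
```

The reasoning behind a threshold like this is that human researchers rarely stack many facets at once, while crawlers following every facet link generate deep combinations. A real deployment would tune the limit against its own traffic and would likely pair it with a human-verification check rather than a hard block.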

Community Knowledge Sharing

Ultimately, libraries and archives are working together to build trusted networks to share challenges and solutions, and to work toward building and sustaining open and stable infrastructure. A Code4Lib Slack channel developed organically as a way for institutions to share bot detection patterns and mitigation strategies across libraries and archives. As library communities continue to engage in these discussions, clear and distinct roles will emerge for individual libraries, consortia, and associations like ARL.

The ARDC workshop made clear that collaboration among and across sectors is not only necessary but desirable. Research libraries can contribute deep experience in honoring obligations to communities, decades of governance experience, and a commitment to the public good that predates, and will outlast, any particular technology. And the time is now, because at some point libraries might be the only home of original sources and a verifiable historical record; already, a significant and growing portion of online content is AI-generated.
