Association of Research Libraries (ARL®)

ARL: A Bimonthly Report, no. 217 (August 2001)

The Metadata Harvesting Initiative of the Mellon Foundation

by Donald J. Waters, Program Officer, The Andrew W. Mellon Foundation

While the OAI Metadata Harvesting Protocol was being developed, the Mellon Foundation sponsored a series of workshops at Harvard University to explore how libraries and other repositories of scholarly information could effectively utilize the new protocol in conjunction with harvesting, search engine, and other core Internet technologies to make cataloging and related metadata about scholarly collections more visible to Internet users. The Harvard workshops produced two interrelated recommendations: first, that a set of leading repositories of scholarly materials begin to reveal metadata about their collections to potential portal services according to the specifications of the OAI protocol; and, second, that a series of experiments be organized to demonstrate the kinds of discovery and retrieval services targeted to the scholarly community that would be possible and sustainable given these metadata and the application of the Internet portal technologies.

By November 2000, the Digital Library Federation (DLF) had identified libraries and other repositories that were prepared to expend their own resources to reveal OAI-compliant metadata for over one million items in more than 50 collections. The Mellon Foundation then invited 16 institutions to suggest gateway or portal services that would use the OAI Metadata Harvesting Protocol and other technologies to support inquiries in a significant field or subfield of scholarly interest, or across a wide range of fields. Seven institutions responded with strong proposals and in June 2001 were funded with grants totaling approximately $1.5M.

Each of the seven institutions invited to participate in this initiative was asked to design projects that would test the application of harvesting and search engine technologies in one or more of the following ways:

  • By delivering scholarly information from the "hidden web": information in Internet-accessible databases, including library catalogs, that is not normally reachable by Internet search engines.

  • By handling formats that present special processing or presentation problems, such as the information hierarchies contained in Encoded Archival Descriptions (EAD) or metadata about visual resources.

  • By constructing necessary tools or "middleware" such as registries or broker services.

Given these general parameters, the participants proposed an imaginative set of experiments.

Two institutions will design portal services based on metadata from broad, multi-institutional, and multi-disciplinary domains:

The Research Libraries Group (RLG)

RLG manages one of the largest union library catalogs in the world. It received support to explore how this major scholarly resource could be completely redesigned to take advantage of Internet portal technologies. In its proposed project, RLG catalog records would be made accessible via the Open Archives protocol to a standard Internet search engine, such as Google. Search results would be expected to link the user not only to a set of catalog entries, but also to a set of service options that might, for example, tell users how to purchase the book, find the nearest library that owns the title, find related works by the author or on the subject, or link to an online version if one exists. This project has the potential to revolutionize the way that access to library catalogs is designed and presented to users.

The University of Michigan (UM)

UM’s library expects to harvest, index, and present metadata about digital library objects that are held by academic and scholarly institutions, but which are not currently accessible through standard search engines. UM plans especially to target the collections that DLF had identified. Like RLG, UM has not placed any limitations on subject domain. The University of Illinois at Urbana-Champaign (UIUC) will develop and provide UM with the actual harvester mechanism that will systematically collect, aggregate, and update the metadata from contributing institutions; UM will construct the indexing and presentation tools for organizing harvested data and then provide a search engine service.
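The harvester's job of collecting, aggregating, and updating metadata can be pictured with a minimal Python sketch. It is not the UIUC tool. The `Record` structure, the `merge_harvest` function, and the sample identifiers are all hypothetical, and the sketch assumes each harvested record carries a unique identifier, a last-modified datestamp, and an optional deletion flag.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """Hypothetical shape of one harvested metadata record."""
    identifier: str          # unique ID assigned by the contributing repository
    datestamp: str           # ISO 8601 date of last change, e.g. "2001-06-15"
    metadata: dict = field(default_factory=dict)
    deleted: bool = False    # True when the repository reports a withdrawal


def merge_harvest(store: dict, incoming: list) -> dict:
    """Fold one batch of harvested records into the aggregate store.

    The newest version of each record wins, and withdrawn records are
    dropped. ISO 8601 dates of equal precision compare correctly as strings.
    """
    for rec in incoming:
        current = store.get(rec.identifier)
        if current is not None and current.datestamp >= rec.datestamp:
            continue  # the store already holds this version or a newer one
        if rec.deleted:
            store.pop(rec.identifier, None)
        else:
            store[rec.identifier] = rec
    return store


# Two harvesting passes over the same (made-up) repository.
store = {}
merge_harvest(store, [
    Record("oai:example:1", "2001-05-01", {"title": "Letters"}),
    Record("oai:example:2", "2001-05-01", {"title": "Maps"}),
])
merge_harvest(store, [
    Record("oai:example:1", "2001-06-10", {"title": "Letters, revised"}),
    Record("oai:example:2", "2001-06-11", deleted=True),
])
```

After the second pass the store holds only the revised first record. A production harvester would also need to remember withdrawals so that a stale copy arriving later is not re-added, which this sketch leaves out.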

The following three participants will focus primarily on the special problems encountered in harvesting metadata from archives and special collections:

University of Illinois at Urbana-Champaign (UIUC)

UIUC received support to develop a general-purpose harvesting tool that it will use to create a portal for searching special collection materials held by members of the Committee on Institutional Cooperation (CIC). UM also plans to use the harvesting tool, and UIUC plans to focus development on the special problems of using the portal as a vehicle for integrating information about special collections, which are described using different standards: the EAD, the Text Encoding Initiative (TEI), and Machine Readable Cataloging (MARC). In developing its portal, UIUC plans to make use of the search and presentation tools that UM would be creating in its project.
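One of the special problems with EAD is that a finding aid is a hierarchy, while a harvested record is flat. The following Python sketch shows one possible way to flatten such a hierarchy so that each component keeps the context of its ancestors. It is not UIUC's approach. The sample finding aid is invented and heavily simplified, and the function names are hypothetical, though `archdesc`, `dsc`, `c01`, `c02`, `did`, and `unittitle` are genuine EAD element names.

```python
import xml.etree.ElementTree as ET

# Invented, heavily simplified finding aid for illustration only.
SAMPLE = """
<ead>
  <archdesc level="collection">
    <did><unittitle>Smith Family Papers</unittitle></did>
    <dsc>
      <c01 level="series">
        <did><unittitle>Correspondence</unittitle></did>
        <c02 level="file">
          <did><unittitle>Letters, 1861-1865</unittitle></did>
        </c02>
      </c01>
      <c01 level="series">
        <did><unittitle>Photographs</unittitle></did>
      </c01>
    </dsc>
  </archdesc>
</ead>
"""


def is_component(tag):
    """True for EAD component tags: <c>, <c01>, <c02>, and so on."""
    return tag == "c" or (tag.startswith("c") and tag[1:].isdigit())


def own_title(element):
    """Title from the element's own <did>, ignoring nested components."""
    node = element.find("did/unittitle")
    return (node.text or "").strip() if node is not None else ""


def child_components(element):
    """Yield direct child components, looking through any <dsc> wrapper."""
    for child in element:
        if child.tag == "dsc":
            yield from child_components(child)
        elif is_component(child.tag):
            yield child


def flatten(element, ancestors=()):
    """Yield one flat record per level, carrying the full title path."""
    path = ancestors + (own_title(element),)
    yield {"level": element.get("level", ""), "path": " / ".join(path)}
    for comp in child_components(element):
        yield from flatten(comp, path)


root = ET.fromstring(SAMPLE.strip())
records = list(flatten(root.find("archdesc")))
for rec in records:
    print(rec["level"], "|", rec["path"])
```

Carrying the ancestor titles in each record is one design choice among several. It makes a file-level entry intelligible in a search result, at the cost of repeating collection-level text in every record.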

Emory University

Emory expects to explore the feasibility of a scholarly portal service based on metadata harvested from selected archives in two research domains—politics and theology. The portal would refer primarily to the papers of major political figures and the institutional records of religious organizations. Fourteen institutions have expressed an interest in contributing records to the portal and experimenting with the results.

Woodrow Wilson International Center for Scholars (WWICS)

The WWICS sought support to develop a service allowing scholars to search across the catalogs of major Cold War document repositories in the U.S. Not all of these archives have cataloged their documents, but those that have, including the Hoover Institution and the National Security Archive, employ very different schemes. Use of the OAI protocol would allow these metadata to be harvested from the different participating archives and indexed for searching through a common interface.
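The idea of harvesting records cataloged under different schemes and indexing them behind a common interface can be sketched in Python as follows. Everything here is an assumption made for illustration: the archive labels, the local field names, the sample records, and the functions. The common fields (title, creator, date) borrow their names from Dublin Core, and the keyword index covers titles only.

```python
from collections import defaultdict

# Hypothetical local field names for two archives that catalog differently,
# each mapped onto a small set of common fields.
CROSSWALKS = {
    "archive_a": {"doc_title": "title", "author": "creator", "year": "date"},
    "archive_b": {"heading": "title", "originator": "creator", "created": "date"},
}


def normalize(source, record):
    """Translate one local record into the common field names."""
    mapping = CROSSWALKS[source]
    common = {mapping[key]: value for key, value in record.items() if key in mapping}
    common["source"] = source  # remember where the record came from
    return common


def build_index(records):
    """Build a simple inverted index from title words to record positions."""
    index = defaultdict(set)
    for position, rec in enumerate(records):
        for word in rec.get("title", "").lower().split():
            index[word.strip(".,;:")].add(position)
    return index


def search(index, records, term):
    """Return every record whose title contains the term as a whole word."""
    return [records[i] for i in sorted(index.get(term.lower(), ()))]


# Made-up sample records, one per archive.
records = [
    normalize("archive_a", {"doc_title": "Berlin Crisis Memorandum",
                            "author": "State Department", "year": "1961"}),
    normalize("archive_b", {"heading": "Cable on the Berlin blockade",
                            "originator": "Embassy", "created": "1948"}),
]
index = build_index(records)
hits = search(index, records, "berlin")
```

A single query for "berlin" returns both records even though the two archives used different field names for their titles. The crosswalk tables are where most of the real effort would go, since they encode each archive's cataloging decisions.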

Finally, two institutions will create portal services based on harvested metadata referring to materials on specific topics, but across a range of formats:

University of Virginia (UVa)

The UVa Library holds one of the world’s best collections of rare books and manuscripts in American literature and history, has been an international leader in digitizing materials from these collections, and hosts several innovative academic centers that make use of these materials. The Library received funding to exploit its rich resources in this area by harvesting metadata for an Information Community—a group of scholars, students, researchers, librarians, information specialists, and citizens—formed around American Studies. The metadata—which will cover a wide variety of formats, such as documents, maps, and data sets, and will span several disciplines—will be harvested initially from Virginia’s own extensive and varied online resources, but the Library will also investigate collaborations with such institutions as the Thomas Jefferson Foundation, Virginia Tech, and the Smithsonian Museum of American Art.

Southeastern Library Network, Inc. (SOLINET)

SOLINET received support to harvest metadata for AmericanSouth.org, a portal that has been under development over the past year. It will initially gather, organize, and present online materials related to the history of the American South from ten participating institutions (Auburn University, Emory University, Louisiana State University, University of Florida, University of Georgia, University of Kentucky, The Kentucky Virtual Library, University of North Carolina at Chapel Hill, University of Tennessee at Knoxville, and Vanderbilt University). Selected scholars from these institutions will participate in the selection of materials and in the organization, design, and testing of the portal.

~ ~ ~

Covering a wide span of subject domains and constituencies, these seven projects will explore the requirements for developing scholarly-oriented portal services based on a variety of Internet technologies, including the new Metadata Harvesting Protocol, to make the contents of library catalogs and other elements of the "deep" Web more easily accessible. Each project is highly experimental and is designed to explore not just the technical requirements but also a range of organizational, political, and economic issues associated with the development of scholarly portals. Although the OAI protocol was key to the conception and development of these projects, in many of them it plays a relatively minor role compared with these other issues.

What is really at stake in these projects is a set of much larger questions concerning the value of networked information. The annual study of college freshmen conducted by UCLA's Higher Education Research Institute recently showed that a whopping 82.9 percent of new freshmen, more than four out of five students, are using the Internet for research or homework.[1] This finding raises all the usual questions about the uneven quality of Internet-based resources. A recent study of Harvard seniors, however, suggests a very different picture. The Harvard survey looks at student use of print, library electronic sources, and non-library electronic sources in researching papers in the humanities, social sciences, and natural sciences. The results revealed that print materials made up the largest share of resources used (75% in humanities, 69% in social science, and 65% in natural science), and that students ranked print higher than library and non-library electronic sources on four out of five factors. The Internet-based sources scored high only on the factor of convenience, while print materials scored high on the factors that make a difference in the quality of research and learning: generating the information for which the student is looking, the usefulness of the material, its reliability, and the availability of assistance.[2] It is, of course, difficult to combine the results of these two very different studies, but one suggestive interpretation is that a college course of study has a tremendously sobering effect, revealing the real value of the currently available Internet resources. If the seven projects that the Mellon Foundation has recently funded to explore methods of metadata harvesting achieve their goals, then the Internet will have come a large step closer to being an indispensable place for scholarly research by faculty, by those Harvard seniors, and by the incoming students who are relatively uninitiated in the practices of higher learning.

Copyright © 2001 Donald J. Waters. This article is drawn from a presentation given at the Joint Conference on Digital Libraries in June 2001 and is published in the ARL Bimonthly Report by permission of the author.

Footnotes

  1. L.J. Sax, et al., The American Freshman: National Norms for Fall 1998 (Los Angeles: Higher Education Research Institute, UCLA Graduate School of Education and Information Studies, 1998).
  2. Harvard University, "Class of 2000 Senior Survey," data supplied by the Harvard University Library.

To cite this article

Donald J. Waters, "The Metadata Harvesting Initiative of the Mellon Foundation," ARL: A Bimonthly Report, no. 217 (August 2001): 10–11, http://www.arl.org/resources/pubs/br/br217/br217waters.shtml.