Association of Research Libraries (ARL®)

Membership Meeting Proceedings

Partners in Preservation: The ICPSR Experience

Washington, D.C.
October 15-17, 1997

Preservation of Digital Information

Richard Rockwell, Executive Director and
Janet K. Vavra
Inter-university Consortium for Political and Social Research

Background

Digital information has been a part of America’s life since the nation’s infancy. Among the first producers of digital information were the country’s founders, who mandated a decennial census in the Constitution, thereby initiating the collection of basic demographic and socioeconomic information about the population that continues to this day. In 1790, the year of the first census, population statistics were collected by assistants appointed by United States Marshals working under the direction of Thomas Jefferson. Jefferson summarized the information and had it transcribed onto parchment with quill pen in two colors (one for the actual count, and the other for what Jefferson thought was the correct count). [For a fascinating account of the census through the years, see The American Census: A Social History, by Margo J. Anderson, published by Yale University Press, 1988.]

The 1790 census data now reside in electronic form in the archives of the Inter-university Consortium for Political and Social Research (ICPSR). The ICPSR, headquartered at the University of Michigan, is a computer-readable archive of social science data that is accessible through the World Wide Web from anywhere in the world. The 1790 census data have been in the holdings of ICPSR for 30 years.

The 1790 census data are a perfect example of preservation of a data collection for over 200 years. More than two centuries after their collection, the data remain available to scholars throughout the world and in a format that makes them totally compatible with today’s computerized information-handling environment. Since the first paper copies were created, the census data have been preserved, primarily by migrating the information from one medium to another as the times demanded. The data were put on microfilm by the Bureau of the Census (probably about 1890) and converted into machine-readable form by the ICPSR in the late 1960s. The data have since migrated through several different media, keeping pace with changing technology. They were initially entered onto keypunch cards by ICPSR and subsequently written onto 9-track tapes. Later, the data were moved to 3480 cartridge tapes, on which archival security copies are currently stored. They now also reside on magnetic disk so that users can download them directly from the Internet.

Today, college and university libraries find themselves facing the same circumstances that previously confronted organizations such as ICPSR in dealing with the ongoing changes and challenges resulting from the technical revolution in information gathering, storage, and dissemination. These challenges are illustrated by the 1790 census example, which applies to only one small collection of data but must be multiplied by the thousands of collections that now exist at ICPSR and elsewhere. These challenges, while exciting, also frequently produce sleepless nights for the directors and staffs of libraries (and other organizations) who must deal with the daily explosion of digital information sources. Further, setting aside the 1790 census data (where there is surely little argument about the need for preservation), libraries must have serious questions about what to acquire, what to preserve, how to preserve it, and how to finance the activity. These questions will loom large for each collection in the archive. We have yet to de-accession a data set, although this has been considered from time to time for collections thought to be of inferior scientific quality.

Libraries are natural homes for the preservation of much of the digital information that is being generated and certainly are natural potential partners with other organizations involved in digital preservation. Libraries recognize this, as evidenced by this panel. Together we now find ourselves routinely struggling with issues such as what digital information to preserve, what media to use for preservation and distribution, what format to employ, how to retrain staff to continue to be productive and comfortable in this rapidly changing environment, how to keep abreast of new developments, and, finally, how to pay for it all. Using the ICPSR experience, we will report how ICPSR has faced many of these issues and suggest what we hope can be opportunities for cooperation in the future.

The ICPSR Experience

ICPSR has been archiving digital data since 1962. Until a few years ago, nothing like ICPSR existed in the natural sciences. Founded on the premise that the data collected by research projects are valuable resources that can be mined by many, for both research and teaching, ICPSR was from its beginning charged with responsibility for archiving and preserving electronic social science data in perpetuity and making them available indefinitely. This has enabled the writing of thousands of research papers, books, dissertations, and student papers, and it has conserved scarce resources. The very existence of ICPSR has stimulated the collection of new information, resulting in at least half a century’s accumulating observations on virtually every aspect of the society, and the emergence of standards for survey research, particularly regarding the continuity of survey questions. While the first data collections arrived on punch cards at what was then ICPR (the “Social” was added later), today scholars from around the world download data from ICPSR 24 hours a day.

Our website (http://www.icpsr.umich.edu) received 11,493 different visitors last week from 8,455 different computing sites, who “hit” our website 73,114 times, accessed 4,745 different data files, and downloaded at least 3.1 gigabytes of data (some of it went out other ways, including quarterly CD-ROMs that contain all additions to the archive in the previous quarter).

That wasn’t our busiest week at all, and the counts actually under-represent usage; for example, the 50 hits from AOL subscribers show up in our records as always being from the same user. Less than half the accesses to ICPSR came from the educational (.edu) domain. U.S. commercial users represent the single largest cluster of users outside academia. Users in 70 countries hit us, including the Faroe Islands. Only the continent of Africa was not among our users that week; Antarctica shows up as “U.S. government.” People in the U.S. government itself came to us 437 times that week for data. We did not distribute data on this scale until a few years ago, and then annually, rather than weekly. Today, we are a multi-terabyte operation each year. We service this clientele with a staff no larger than when our customer base was a fifth of today’s size.

We have this enormous customer demand because, first, ICPSR currently has over 40,000 individual data files that represent nearly 3,500 discrete study titles, many of which are unique to ICPSR and were developed at ICPSR; second, because we provide excellent user support, including outstanding documentation and computer support; and third, because our Summer Training Program is world-renowned. Additionally, in recent years we have added a number of new services, including online interactive analytical and data extraction services, computer conferences, web bulletin boards, etc. And increasingly, we see ourselves as a “virtual” archive as well as a real archive: we provide links on our website to other data providers all over the world, and we offer to “really” archive their data for security reasons.

ICPSR is a membership-based organization that has been supported by member dues since its inception. It has grown from 21 research universities plus the University of Michigan to encompass more than 325 member colleges and universities in North America, and the national archives of most developed nations. Essentially, every research university is a member of ICPSR. Additional funding (which has now grown to over 60% of the organization’s budget) comes from special grants and contracts from a variety of funding agencies. It is with this combination of funds that ICPSR pursues its mission of archiving and distributing social science computer-readable data sources. The key to ICPSR’s stability and flexibility is our foundation of funding by the social science community itself, through member dues. Without that funding, we would have been tossed about far more by the changing winds of funding agencies. An endowment would be most welcome, of course; $60 million would do nicely.

ICPSR has been archiving data since its founding, with many of the older files in the holdings moving from early electronic formats to the current formats in which they are stored and used by scholars. ICPSR’s guiding principle is that preservation archiving must be the top priority for the organization. This dedication comes with a commitment of resources, including staff, equipment, media supplies, training and retraining, and climate-controlled storage facilities. From its beginning, ICPSR had as one of its central functions the preservation and security of all files for all collections in its holdings. We have not yet lost a data collection, although over the years we have had a few scares.

Today, ICPSR maintains an Archival Operations unit whose sole responsibility is the maintenance and security of copies of all files in the holdings. Staffed by full-time permanent employees with a dedication to the preservation and security of the organization’s data holdings, the unit enjoys equal status with the five other functional units of ICPSR: Administration, Archival Development, Computer and Network Services, Training Program, and User Support. Archival Operations is thus administratively shielded from both external users and internal ICPSR staff, who do not interfere in its operations, and it maintains not only a separate archival collection that is inaccessible by anyone outside the unit, but also separate databases about that archival collection.

The Process of Ensuring Archival Integrity

Archival Operations receives the original copies of all files coming into ICPSR. As soon as a new data collection arrives, it is promptly accessioned by being assigned a unique ICPSR study number. The files are then evaluated for contents and format. Virus-checking procedures are employed for any studies coming into the archive that might be vulnerable to viruses (such as studies arriving on diskettes, ftp, etc.). A non-networked virus-checking computer station is equipped with a variety of virus-hunter software, which is rigorously kept current. After the preliminary identification and evaluation steps are completed, the staff creates two copies of any files that have been supplied. Two copies are created to ensure that there is always a back-up copy for any file in the holdings, as a security precaution.

Currently files are stored either on IBM 3480 cartridge tape or on Digital Linear Tape (DLT), depending upon the format in which the files were submitted. Both media are rated to have expected lifetimes of 10-100 years by the National Media Laboratories. Straight ASCII or EBCDIC files can be stored on the cartridge tapes, but files that may contain information embedded in special formats are frequently best stored on DLT tapes. When the tapes are filled, they are physically stored in separate climate-controlled locations as a further security measure.

All materials are inventoried, and the inventory becomes part of the permanent record of the archive. All copies of the files created are checked as soon as they are made to assure that there has not been any corruption during the copying process. This checking is performed by comparing the original file to the files created, and, if no discrepancies exist, the copy is considered a duplicate of the original. In addition to creating two archival copies of each file and doing an inventory of the materials received, staff record identifying information about the collection and all of the electronic files and their characteristics in an electronic database.
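The copy-and-compare step described above can be sketched in Python. This is only an illustration under stated assumptions, not ICPSR's actual procedure: the function name `make_verified_copies`, the file name, and the directory names are hypothetical, and the original workflow wrote to tape rather than to disk directories.

```python
import filecmp
import shutil
import tempfile
from pathlib import Path


def make_verified_copies(original: Path, destinations: list[Path]) -> list[Path]:
    """Copy `original` to each destination and compare each copy byte for byte."""
    verified = []
    for dest in destinations:
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(original, dest)
        # shallow=False forces a comparison of file contents,
        # not just of size and modification time.
        if not filecmp.cmp(original, dest, shallow=False):
            raise IOError(f"copy {dest} differs from original {original}")
        verified.append(dest)
    return verified


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # Hypothetical incoming file, created here only so the sketch can run.
        original = root / "incoming" / "study_0001.dat"
        original.parent.mkdir()
        original.write_bytes(b"sample record\n")
        copies = make_verified_copies(
            original,
            [root / "site_a" / original.name, root / "site_b" / original.name],
        )
        print(f"{len(copies)} copies verified")
```

A full content comparison is the most direct reading of "comparing the original file to the files created"; it takes longer than comparing sizes, but it catches corruption that leaves the file length unchanged.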

The same procedures are followed for any data files processed by ICPSR staff (files to which ICPSR adds value). The processed files are treated essentially the same way as newly-accessioned material. An additional step with processed data is the creation of a servicing or distribution copy of the files and the entry of the information about each file in a collection into an Oracle database that we use to manage the electronic data distribution service of ICPSR. Additionally, when data are processed for release, the Archival Development staff who prepare the files for distribution prepare metadata for the collection. The metadata are then made available on our website when the data collection is released for distribution. Finally, the disk copy of the servicing collection is incrementally backed up in an automated procedure each evening, and those back-ups are stored in a bank vault outside the building in which ICPSR is housed. But we’re still a bit nervous, and for that reason we have agreed with Cornell University to create an incremental mirror archive there. Our partnership with the San Diego Supercomputer Center in the NSF NPACI award may also create a mirror archive there, this time on the San Andreas fault.

The distribution or servicing copy of the data is generated by the User Support staff, who are responsible for maintaining and providing user support for all data available from ICPSR. The process of preparing the distribution file includes making sure that the documentation for any data collection is available, whether in machine-readable form or in hard-copy. While ICPSR plans on eventually converting all hard-copy documentation into electronic text that can be browsed on the Internet, currently a significant proportion of the collection still has only hard-copy documentation. The hard-copy documentation is automatically supplied to users when they order the data.

As technology continues to change at an ever more rapid pace, ICPSR has had to migrate its data holdings from one medium to another just to keep the data viable. We spent about a quarter of a million dollars from our equity on the last migration. While in the past the medium on which the data files were stored remained constant for a significant number of years, each generation of new media has had roughly half the lifetime of its predecessor. Therefore, ICPSR now considers data migration to be more of an ongoing process than a project undertaken every five or even ten years, and data migration has replaced the “refresh” procedures that we followed for almost 30 years.

Data Migration

Data migration involves more than simply copying files from one medium to another. In the past, when the storage medium remained constant for a number of years, preservation activities centered around ensuring that the medium remained readable; this was where “refreshing” copies came in. In recent years, significant changes in computing platforms have made many older data formats much more difficult to use. Therefore, migration now routinely includes an evaluation of the formats in which any data are presently stored and the viability of those formats with the current software and hardware. For example, ICPSR has been actively converting many of its older holdings that have been stored exclusively in OSIRIS format (OSIRIS is a statistical package developed at the University of Michigan in the 1960s) to data files with SAS and SPSS data definition statements. If this conversion were not started at this time, many of ICPSR’s older holdings would be at risk of being very difficult if not impossible to use in the near future. This conversion is now a critical part of the preservation process for these collections. In general, we find that embedding data into any software format, such as that of OSIRIS, is a bad idea; the SAS and SPSS data definition statements do not embed the data in a proprietary format.
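The distinction drawn above, between data embedded in a software format and plain character data accompanied by separate definition statements, can be illustrated with a short Python sketch. Everything specific here is an assumption made for illustration: the variable names, the column positions, and the sample record are hypothetical, and the layout list only mimics the role of data definition statements; it is not SAS or SPSS syntax.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variable:
    name: str
    start: int  # 1-based first column, as a codebook would list it
    end: int    # 1-based last column, inclusive


# Hypothetical layout, kept apart from the data themselves.
LAYOUT = [
    Variable("state", 1, 2),
    Variable("county", 3, 5),
    Variable("population", 6, 12),
]


def parse_record(line: str, layout: list[Variable]) -> dict[str, str]:
    """Slice one fixed-width character record into named fields."""
    return {v.name: line[v.start - 1 : v.end].strip() for v in layout}


if __name__ == "__main__":
    raw = "01003  12345\n"  # hypothetical record in plain character form
    print(parse_record(raw, LAYOUT))
```

Because the record is ordinary text and the layout is stored separately, the same data can be read by any later tool once the layout is re-expressed for it; only the definitions, not the data, have to be rewritten.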

Besides preserving data, migration fulfills another critical function: providing easy access to the data resources. Not only must the data be preserved, but they also must remain accessible and usable by the clientele. A critical component of this migration is assuring that the data continue to be compatible with the current computing environments. Just as staff responsible for the migration of the data must be knowledgeable about technical and substantive matters, so also must the staff who assist scholars in locating and using the data. Consequently, all staff in the archive must be trained and retrained on an ongoing basis to keep pace with the changes in the software and hardware currently in use and to keep abreast of what principal investigators, researchers and users are doing.

Our experience with data migration leads us to a recommendation about standards for preservation archiving. The archiving/library profession has not yet agreed to standards for preservation archiving, but it might try to move in the direction it moved with preservation microfilming and acid-free paper. It would be entirely inappropriate, however, to apply that model to the preservation archiving of digital data. We do not need standards for which media to choose, which format to adopt, how to store the archive physically, etc. What we need, instead, is a standard for the functionality of any preservation archiving process: What must be achieved for what period of time, at what cost, for which kinds of digital archives?

Partnerships are the Future

As the above surely communicates, preservation archiving of digital information is neither simple nor cheap. As the amount of digital data grows and the technological changes continue at an ever more rapid pace, other methods for reaching the goals of preservation and access will have to be identified. While ICPSR has functioned as a membership-based organization where membership dues and special grants have funded preservation and distribution, the very high costs of these preservation activities and the need to pursue them within an ever shorter time frame call for complementary approaches, most of which have yet to be developed. It is clear that further automation of all our procedures is essential. It is also clear that we do need that endowment to ensure that the nation’s archive of social science data will remain viable into the next centuries.

Further, the growth and flexibility of the Internet challenge the way we did business in the past. This is especially true in an era when almost all organizations engaged in preservation activities face effectively shrinking budgets and growing costs relative to increased demands on them. At the same time, users continue to demand more access and services (such as our new analytical services), and they demonstrate a low level of tolerance for fees associated with those services. Although an institutional membership in ICPSR is priced lower than many journal subscriptions that serve five scholars rather than all the social scientists at an institution, we constantly hear complaints about “high dues.” There is still the folk culture of the Internet that holds that “everything should be free.” We have heard our own version of Ross Perot’s “great sucking sound” as vast portions of our archive flew off (but not away) to massive file servers at member institutions.

We know a lot about preservation archiving of digital information, but there is much more that we need to learn. We have not fully conquered the problem of extracting data from behind software interfaces, which have a life expectancy of a few years. We worry about version control and about authentication: How can the user be certain that the data set purporting to contain the 1790 Census data is really that data set? We are considering encryption of all our data with a public-key system so as to ensure authentication, but we worry about hiding the data behind such a proprietary shield. We do not know how to “archive the Web,” but we are certain that the daily changes to our website will eventually come back to haunt us. We have further to go in developing comprehensive electronic documentation standards; in fact, on another front we are just finishing the development of an SGML/XML Document Type Definition for “codebooks” that document social science data sets; you can find it on our website. We are doing this in partnership with individuals at member institutions and in federal statistical agencies.
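One building block that leaves the data themselves unencrypted is a published checksum. The Python sketch below is not the public-key scheme under consideration above; it swaps in a plain SHA-256 digest, and the function names are hypothetical. It assumes the archive publishes the digest through a channel the user already trusts, and it can only detect that a file differs from the one the digest was computed on.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path: Path, published: str) -> bool:
    """Compare a downloaded file against a digest obtained from a trusted source."""
    return sha256_of(path) == published.strip().lower()
```

A user who downloads a file would call `matches_published_digest` with the digest listed by the archive; a mismatch means the file is not the version the archive described, which also gives a crude form of version control.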

Partnering in Preservation

No one organization can any longer hope to keep pace with the information being generated or to be able to preserve all of it, even if that organization focuses only on the information available in a small field. The ease with which users can access information on the Internet may negate the need for each institution to possess the same materials. Just as many libraries today do not purchase all the publications in every field, organizations involved in digital preservation will not be able to locally maintain all digital information in a given field, but they will want to retain access to it. Therefore, they should look toward developing partnerships in which preservation and access responsibilities are shared with other organizations. And ICPSR plays a critical and central role in making that feasible in the social sciences.

Partnerships appeal to us for many reasons, not the least of which is that none of us has enough resources or enough good ideas to attack the problem alone. The time has come for the organizations facing these challenges to come together to start to look for ways to cooperate. This cooperation is necessary so that valuable information resources now being generated are not lost forever because nobody knew they existed, or everybody thought someone else would do the work, or no one had the resources to archive the information. Accordingly, ICPSR now makes two concrete offers that would start “partnering in preservation.”

First, we offer to organize a Laboratory for Social Science Data Archiving. This will be a place in which we can experiment on some of the unanswered questions mentioned above, in which we can provide training to data archivists from around the world, and in which we can develop new technologies, perhaps including new hardware. We would love to host interns from ARL libraries as soon as possible. We would hope to see similar organizations develop around other topics of digital information, and we offer our hand in partnership with those other efforts.

Second, we offer to host the first National Conference on Social Science Data Archiving Policy. The content would encompass issues of acquisition and processing policies, priorities for preservation, access conditions, funding, and the creation of a “virtual archive” with ICPSR as the solid center. We propose holding this conference in cooperation with ARL and other major archiving and library organizations, and we will allocate some of our funding to that conference. If it makes sense to expand the conference beyond the archiving of social science data, we would be willing to consider doing so. And again, we offer our hand in partnership.