Washington, D.C.
October 15-17, 1997
Preservation of Digital Information
Building Research & Action Agendas for Digital Archiving
Peter Graham, Associate University Librarian
Rutgers University
Good morning. It’s a pleasure to be here.
There’s a story my father-in-law likes very much about a pair of people who drive to a lumber yard. One person gets out of the car and asks the clerk for some lumber. And the clerk says, “Well, what size do you want?” The guy goes back to his car, sticks his head in the window, consults with his colleague, and comes back and says, “Two-by-fours.” The clerk then asks, “All right, how long do you want them?” The guy goes back and sticks his head in the car, consults again, and then returns to the clerk and says, “For a long time; we’re building a house.”
This is, of course, what we’re doing here today: building for a long time. As Meredith and Gloria suggested earlier, it’s very appropriate for ARL to take on the topic of digital archiving. It is the distinctive role and mission of the research library to take on the challenge of making information available for a long time, complementing our colleagues in other important areas of librarianship who are making information currently available, as we do.
One of the questions we have been asked to deal with is whose responsibility it is to archive. I think it is ultimately the library’s and, I would add, the archive community’s responsibility to preserve information over the long haul. Other agencies, such as publishers or authors, basically don’t have a track record on this. Nor do they have the social mandate that we do. If the archivists and librarians together don’t do this, no one else will.
Digital Archiving Challenges
I have been asked to review some of the challenges involved in digital preservation, talk about what is happening here and elsewhere, and then briefly allude to the research and action agendas.
The technical challenges are three, and there are two organizational challenges to be dealt with, as well.
None of the technical challenges is truly easy to deal with, but two of them are relatively easy, while the third will be very difficult and very expensive. The first challenge, technically, is medium preservation. How do we preserve the actual medium on which information is recorded? One of the reasons that this is fairly simple is that most parties talking about digital preservation have decided that this is really not the issue.
There are mechanisms by which we can preserve information that is digitally recorded, but the typical approach is not to conserve the tape itself or to conserve the disk. Instead, it is achieved by copying the information. Michael Lesk, in a Commission on Preservation and Access document, has written very explicitly that preservation in the digital environment means copying or migrating information, not conserving the actual object on which that information is recorded. The present media, such as tapes and CD-ROMs, are simply too volatile to depend on for any extended length of time.
“Migration” brings us to the second technological challenge, which is the expensive and difficult one: technology preservation. How are we to preserve information through the vicissitudes of technologies that we have seen already so rapidly changing over the past couple of decades, and that we know will change even more rapidly over the long period of time that is before us?
The solution here is migration of information. Not simply copying it, but moving it from one technology to another, whether hardware technologies or software technologies, so that information is usable at a later date, even though the technology that created the information may have disappeared or be unavailable.
There are some obvious examples, e.g., eight-inch DisplayWrite disks from the early 1980s. Who can read them? Even if you have the hardware to do it, is the software still available? I have understood that WordPerfect, which is presently at Version 6.0, cannot read its own WordPerfect Version 1.0 files.
How do we migrate information forward? This presents us with a couple of major strategic decisions. If we want information to be available in the future, we have to think about the “just in case” and the “just in time” options.
Just in case: we could migrate everything. Take all information that’s in digital form, everything that we choose to collect, and as each new technology comes along, move it from one technology to the next. This would assure broad availability of information, but it would also be very costly because of the massive task of moving information through the technologies.
Also, there is, of course, uncertainty about what information will be used over time. We know very well that many books in our collections are not used, or will be used very little over time. That is certain to be true for many categories of digital information as well. So, “just in case” could be a very expensive proposition, for not much return.
On the other hand, we could migrate information just in time, that is, migrate it simply when it’s needed. This presents problems in its own right, and many expenses, as well. We will still need to preserve the old applications in order to know what it is we will be migrating from and to, as well as preserving the documents themselves. Further, if we only migrate at the point of need, the point of use, then we’re introducing a time delay that will necessarily discourage use of the information. The cost of analysis and the cost of conversion will still be very substantial.
There is likely to be a mix of both of these approaches. What really ought to be emphasized is not only that there will be a mix in this respect, but that there will also be a mix in many other respects. There will not be a single solution to any of the digital preservation problems. Documents and applications are varied. Preserving word processing will be different from preserving a digital image or an engineering document, for instance. Both the kind of solutions used and the technology of the solutions will be varied, so a single quick fix, or even a single slow fix, is very unlikely.
The third technological issue is intellectual preservation, the assurance of information authenticity and integrity. We must make sure that we know what we have in order to preserve the scholarly discourse that is important in our research environment, and to preserve public credibility in electronic documents of any kind.
A few years ago, at the University of Southern California, I heard Professor Harvey Wheeler speak on the virtues of the dynamic document. He enthusiastically described how one could put a document out electronically, and then a few months later change one’s mind and go back and alter the document. The document thus keeps up with one’s intellectual development. But, for many of us, in scholarly terms, this is an appalling idea. What is sometimes called version control is essential. Does the footnote link that we’re looking at refer to the document that was in place at the time the person first made the footnote, or to a version changed since then? Will we be looking at the same information when we conduct scholarly discourse?
There are three kinds of change that can happen to documents. One is unintended change: accidental loss or modification. The other two are kinds of intended change.
There is the intended change that is well meant. Databases are constantly being developed and in flux, as are directories of various kinds. There are scholarly databases that need constant attention. These constantly change, but there may be snapshots that are needed at points in time, and that can be identified. There is also the intended change that is not well meant, that is to say, fraud. We have had examples of this in all areas of our life: public, government, business, even scholarly. We need protection against it.
There are solutions to all of these problems of intellectual preservation. They involve various uses of cryptographic techniques, which, however, will not require that information be hidden. Information can be public and available, but the cryptographic techniques can assure that a document is what it claims to be.
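To make the point concrete, here is a minimal sketch of one such technique, a cryptographic digest. It is an illustration only, not a method described in this talk: the choice of SHA-256 and the names fingerprint and is_unchanged are assumptions made for the example. The document itself stays fully public; only a short fingerprint of it is recorded when it is archived.

```python
import hashlib


def fingerprint(document: bytes) -> str:
    """Return a hex digest of the document's bytes (SHA-256 is an assumed choice)."""
    return hashlib.sha256(document).hexdigest()


def is_unchanged(document: bytes, recorded: str) -> bool:
    """Compare a document against the fingerprint recorded when it was archived."""
    return fingerprint(document) == recorded


# Hypothetical documents, invented for this illustration.
original = b"The footnote cites the first published version."
altered = b"The footnote cites the second published version."

# Recorded once, at the time the document enters the archive.
recorded = fingerprint(original)
```

Any later change to the text, whether accidental or deliberate, produces a different fingerprint, so is_unchanged(altered, recorded) is false while is_unchanged(original, recorded) is true. A digest by itself only detects change; showing that a document is what it claims to be also depends on keeping the recorded fingerprint somewhere trustworthy, or signing it.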
At this point, I want to note some comments that Cliff Lynch, Executive Director of the Coalition for Networked Information, has recently made on the nature of the document that we’re trying to preserve, and what it is we’re trying to save. Are we trying to preserve simply the content of the document, or the document’s form? A printed document might have various forms of typography or of layout. If we are preserving an old recording, are we concerned about the sound characteristics of that recording as recorded: the snaps, the crackles, the pops of a 1920s recording? What about the Ted Turner approach to preserving movies using colorization?
Interestingly, the issues here are related to current literary critical theory, that is, the relationship of form and text. What is the distinction between content and form? Among the things the history of the book has taught us is that text cannot be simply abstracted from its appearance on the page. The way the text is presented to a public has an enormous effect on how the content is perceived and used.
This question of how form affects content is the right question to be asked when considering preservation. What is it we’re really trying to preserve? If we are migrating information through various word processors, to take a simple case, do we need to worry about the boldface and italics, or do we just want to get the basic text (begging the question of what that really is)? The question gets even larger as you consider more complex forms of data. It is certainly a question that will have to be dealt with, in practice, by any preservation techniques we come up with.
There are two organizational challenges, as well, as we face digital preservation. The first, for libraries, is to restructure our personnel arrangements and organization within libraries, recognizing that electronic information crosses many present departmental lines.
How decisions are made to acquire electronic information is intimately tied up with systems issues, with public access issues, and with cataloging issues. The standard tri-partite division of libraries into technical services, public services, and collection development doesn’t serve us too well when we look at how electronic information is acquired and made available. We will have to find ways to use people in organizational structures that will be different from what we tend to have now.
The second organizational challenge is for our institutions, and by that I mean our parent institutions as well as ourselves, the university libraries. We need to face the question of funding over time.
We are used to funding our library operations on a year-to-year basis. But one of the problems with electronic information is that it needs constant monitoring and tending. Broadly speaking, if you throw a book in a closet and shut the door, in 500 years, you can open the closet, pull out the book and use it. With electronic information, this simply isn’t the case. If it is not cared for more or less carefully, over the course of a year or two that information will be very volatile and very difficult to deal with. The question of how to construct institutional arrangements that provide for funding over long periods of time is one that will call on some of our best management and leadership skills.
National and International Activities
I want to talk briefly about what’s happening, both in this country and abroad, as organizations and communities prepare for digital preservation. Let me start with what’s happening in North America. There is a lot of isolated activity. But, in some ways, compared to what’s going on in other countries, relatively little is happening here.
The NSF Digital Library projects, the six major digital library activities funded by the National Science Foundation that started several years ago, have hardly noticed the issue of preservation. The second round of NSF proposals, which is currently being prepared, only uses the word “preservation” once. But, again, the projects that are currently being touted as candidates don’t, in fact, have much to do with preservation.
In the library community, there are several activities going on and we’re hearing about more and more every day. In the past few years, for example, there has been the Research Libraries Group ARCHES project, which is an outgrowth of some activities that the MidDecade Planning Group within RLG proposed in 1994. The ARCHES project is intended to be a model for digital preservation. But, to an extent, it has been subordinated to the content emphases of the Studies in Scarlet projects and now to the immigration projects. ARCHES is a quiet project. We have heard a little bit about it, especially since May, when it was more publicly announced, but there’s not a great deal available about what, in fact, is going on.
OCLC has committed to archiving electronic journals. Among the things it has said is that it is committing to the migration issue, one of the most significant issues about digital preservation. It is silent on the details of how it proposes to go about the migration, except that it is clear they recognize that there is a challenge. They recognize it when they say that, “at its discretion,” OCLC will migrate information forward. That’s not encouraging to those of us who don’t think we have a lot of discretion about this.
The Digital Library Federation (DLF) has proposed archiving as one of its three main thrusts. When the DLF was formed a couple of years ago, it was very encouraging to many of us, and it preempted the field in some respects by raising our hopes and expectations about what would be happening. But, in fact, we haven’t heard much from DLF about successes, accomplishments, or planning. I think, however, that will change. It’s very encouraging that we’re scheduled to hear more about it today from Don Waters, Director of the DLF.
George Soete, in his document Issues and Innovations in the Preservation of Digital Information, has summarized many other efforts that are going on in a kind of partial mode (see http://www.arl.org/transform/pdi/index.html). LC and individual libraries are beginning to attack parts of the digital preservation problem.
In Canada, the Canadian Initiative in Digital Libraries has just been formed and announced. Programs are not evident in archiving. Their focus is primarily on the collections front, that is, on digitizing materials and on rights management. To some extent, they intend conscious emulation of the print environment.
What’s really interesting is the progress in the United Kingdom and Australia. The contrast to North American activity is notable. There is a sustained focused attention from the national communities that has been in place for several years. These are communities that have long traditions of national planning and of doing things in concert.
In the United Kingdom, a libraries review at a national level was conducted in 1993. The Follett Commission, headed by Sir Brian Follett, resulted in the formation of the Joint Information Systems Committee (JISC). From 1994 until the present, JISC has overseen the distribution of £15 million of funding. The funding is likely to be renewed in two waves, supporting at least 63 projects in access, digitization, document delivery, electronic journals, on-demand publication, electronic reserves, and other digital library components.
The mandate that JISC has given on preservation issues has been taken up by the British Library Research Initiatives Consortium. In 1996, the British Library (BL) conducted a major workshop on preservation, at which Margaret Hedstrom (University of Michigan) and I were speakers. The BL has recently produced a paper entitled “Digital Archives: Who Keeps Them and Who Pays?” This is just one example of the many reports that they have presented. Chris Rusbridge, who is Director of Electronic Library Activities for the U.K. Office of Library and Information Networking (UKOLN), has been a participant in digital library meetings in this country, as well.
In Australia, in 1993, after recommendations from some oversight committees, the Preserving Australian Digital Information agency, or PADI, was formed. It has developed vision statements, goals, and objectives, and now has a budget of $150,000 a year. It is headed by Jan Lyall, of the National Library of Australia, who is the agency’s Director of National Initiatives and Coordination.
Australians have consciously built strong links between the library and archiving communities. They have developed reports on electronic records management, as well as a number of essays on their web pages that cover such matters as definitions, proposed propositions, and statements of principles on the preservation of Australian digital objects. A national consensus has apparently already been reached. The reports and essays that the Australians have produced include documents on responsibilities. Who is responsible for what kind of archiving? What must be done? How should the information be preserved? They have also begun grappling with what the costs are, and how they are to be paid.
After this survey of international activities, some contrasts and commonalities are evident. What’s common among the efforts going on in the English-speaking world is consortial activity. Where activity is taking place, it is consortial: NDLF, OCLC, RLG, PADI, JISC, the CIDL. These are all group organizations, working together with the understanding that digital archiving is not a matter that can be solved individually. There are good collection development and technological reasons for reaching this conclusion.
In contrast with the U.S., however, the U.K. and Australian agencies have been very public with their concerns, intentions, and projects. They are very effective at communicating with each other internally and externally. In North America, it really is difficult to ascertain from the outside what thinking is going on among the many agencies that are working on digital archiving. These major players don’t appear to be talking to each other, either. An exception, of course, is the excellent CPA/RLG report of a year and a half ago (itself a joint project), which laid out many of the important issues. It has been a foundation document for the U.K. and Australian efforts, as well.
Research and Action Agendas
One of the things that came out of the Preservation Committee meeting last May was that there was no clear research agenda for digital archiving. What is it that really needs to be done? Is there a consensus? We agreed that there was none. As a result, one outcome of our meeting was a proposition to the National Endowment for the Humanities for a series of preservation archiving workshops. ARL, CNI, and CLIR have all agreed to be sponsoring organizations for these workshops. We have asked for funding, which, if it comes, will be announced in April for a year-long series of about five workshops on specific topics, such as collection development: What are the redundancy issues? What are the collection development issues? Are they different now than in the print environment? Migration issues, authenticity issues, rescue and trusteeship and other intellectual property issues are other topic examples.
About 30 to 50 of the active archiving organizations in North America will be invited. There will be some overseas invited guests, as well; you can see why, from the kinds of international activities I have talked about. We will try to bring in people who are active in these issues from a variety of communities: not only the libraries and archiving communities, but the computing community, museums, scholarly societies, vendors, publishers, and other organizations that are succeeding in this area, such as ICPSR. To my mind, one of the most important functions of this series of workshops will be to bring together the players who need to be brought together.
Our intent is to review the present state of understanding on these topics, and then gain consensus on what needs to be done and on who the proper players are to either volunteer or to be asked to take on a particular task. Of course, we will produce publications, both on the Web and in paper format, that report on the discussions. Research agendas are one thing, but action agendas are another. In my mind, the important thing to do at this point is get started. It’s late, and this country is behind, relative to other countries, in actually planning and accomplishing digital preservation.
As an action agenda goes, we need to take a lead from the computing community. They may have a good deal to learn from the library community about user orientation and about careful planning, but we have a good deal to learn from the computing people about getting moving, trying things, making mistakes and moving on. The approach of the Internet Engineering Task Force (IETF) is not to create standards over a long period of time and then implement them, but to implement things first, and worry about the standards later. Their explicit motto is, “rough consensus and working code.” It is a motto that we could well take on in the digital archiving environment, for it is an orientation toward action.
Another goal is communication among us all, which has been relatively lacking compared to what’s been happening in the UK and Australia. Don Simpson this morning made an announcement about their intent at the Center for Research Libraries (CRL) to take on an archiving role. In conversation, he told me that he is concerned he may be duplicating some effort elsewhere, or that CRL may be taking on tasks that others have already tried. The mutual communication issue is one that is before us, and we need to get moving on that to make sure that we’re not tripping over each other.
The sponsoring organizations met about a month ago, and we agreed that the first workshop will be held in June of 1998. Planning for that will begin in January. We are looking for comments and suggestions about the topics and the structure of these workshops. I will be glad to hear from you about that.
I look forward very much to hearing from Dick Rockwell, and from Don Waters and DLF, about what we can begin doing. I think that the iceberg is cracking and we will begin moving ahead. It is very important for us and for our future scholars and students.
Thank you very much.