Washington, D.C.
October 16-18, 1996
Kevin Guthrie, Executive Director
JSTOR
Elaine Sloan, University Librarian
Columbia University and
Chair, ARL Committee on Scholarly Communication
MS. SLOAN: Good morning. It is a pleasure to welcome you and to introduce Kevin Guthrie, who will brief us on JSTOR.
Kevin graduated from Princeton University, where he majored in Chemical Engineering, and then he received an M.A. in Business from Columbia University. He is a co-founder of C.C. Sports Associates, a video products and computer software consulting firm. He was also a research associate at the Andrew W. Mellon Foundation and is author of the book The New York Historical Society: Lessons from One Nonprofit's Long Struggle for Survival.
A short time ago, JSTOR, an independent non-profit organization, was established with the assistance of The Andrew W. Mellon Foundation. In a relatively brief period of time JSTOR has moved in a remarkable way to become the service that Kevin will tell us about today. It is with great pleasure that I introduce Kevin Guthrie.
MR. GUTHRIE: I want to thank Duane Webster for inviting me here. Most of you know about the JSTOR (an acronym for Journal STORage) Project, so I won't spend a lot of time on the background, but I do want to provide some context by talking about the project's history, goals, and objectives. I then would like to spend some time on some of the things we have learned since this project started in early 1994. JSTOR has been tested internationally, and we have made use of what we learned from that; I will also talk a bit about the service we are prepared to provide and give updates on where we are with publishers and pricing. That is an overview of what I hope to cover here.
As I said, JSTOR was started in early 1994 as a project of The Andrew W. Mellon Foundation. Jim Blackman and Bill Bowen spearheaded the early stages, trying to build a demonstration project to show that it was possible to create electronic back files for core journals and that there were opportunities for long-term savings in shelf space. The idea they came up with was to build a prototype archive of ten journals, five in economics and five in history, and to mount it at a number of test sites. Originally there was a test at the University of Michigan. The test database had 750,000 pages, and the basic technology was image-based: imaging the pages and then creating an OCR text file that would be fully searchable. We have taken the original work and put a lot more into it, but that is the basic starting point for the technology process.
As mentioned, we became an independent not-for-profit organization in August of last year. We are a 501(c)(3) charitable organization dedicated to helping the scholarly community deal with advances in technology. We have an independent Board of Trustees; this is a list of the board members: Dick De Gennaro, Librarian of Harvard College; Cathleen Synge Morawetz, Professor of Mathematics at New York University and President of the American Mathematical Society; Bruce Simmons, president of SEN; Ira Fuchs, Chief Technology Officer at Princeton, and also our chief scientist; Bill Whitaker, Provost Emeritus and Professor of Business Economics at the University of Michigan; Mary Patterson McPherson, President of Bryn Mawr College; and R. Elton White, who was President of NCR before it was acquired by AT&T, so he gives us business and technology management expertise.
I think there is a fair amount of confusion out there about exactly what and where JSTOR is and how it fits in with other things. Some people think it is in Ann Arbor at the University of Michigan, and others think it is still at the Mellon Foundation. In reality, we are based in New York, in offices above the SIBL branch of the New York Public Library, with whom we have a nice relationship. We now are working with two universities. We started by collaborating with just the University of Michigan, but now we also work with Princeton University. We maintain a duplicate of the database in its entirety at Princeton, because we wanted to show from the very start that we could replicate the system. This speaks to the protection of the data, so that people would feel comfortable, but it also proves that we can run mirror sites. We will need mirror sites overseas; I tried to use JSTOR from Hong Kong, and it was insufferably slow, so we decided first to create a mirror site in the United States in order to find where the problems may lie. Furthermore, in addition to the people at Michigan, we also have software developers at Princeton working on the project.
Let's just quickly cover some of the objectives of our enterprise. First, as I mentioned, our basic mission is to help the scholarly community deal with the advances in information technology. Also, a primary objective is to develop trusted archives of the core journal literature, emphasizing the conversion of entire journal back files. In every case we go back to the very first issue.
At one point we were contemplating mounting current issues and back issues together. Over time, as we talked with publishers, we moved our emphasis entirely to the back files. We will ourselves carry out three demonstration projects that deal with current journal issues, but in general we are going to work with publishers and link with them to create search capabilities, letting the publishers take care of the current issues while we stake out our area in the back files.
One very important aspect of JSTOR is that it is set up with a system-wide perspective. We think there are opportunities for a win-win situation among all the actors in the scholarly community with this particular material. We are sometimes caught in between scholars, libraries, and publishers, but assuring that win-win situation is a very important part of what we are trying to do.
I am going to talk about most of these objectives in more detail as we go on, but I want to emphasize this last one, which is to provide data and to study the impact of electronically accessing these materials. You know, it was very interesting for me to be here yesterday and learn more about librarianship and libraries; one of the things that is obvious is that there is a need for data and for projects that will do something about that need.
It is very important for JSTOR to publish information about what we do in order to help you understand what is happening with our particular project. I am totally in awe of the range of things that you all have to deal with when we start to talk about the electronic information and multi-media projects that are coming. We are focusing on just one little part of what you have to do, and we have found our task to be quite complicated, but we are hoping that we can help you by taking at least that one part off of your plate.
One of the things we have learned is that it is a lot harder than you might think to do something like what we have done. When it started, Bill Bowen talked to Ira Fuchs and it seemed simple enough to scan the pages, put them in a database and then have access. You can hear somebody talk about the concept, but when it comes to actually working it out, it is a whole different beast. So I will talk a little bit about some of the issues we ran into in the process. We do have over a million pages in the database now, and it has been quite a task.
Here are a couple of the highlights. The biggest difficulty for me has been dealing with publishers and encouraging them to participate; it was more than a full-time job for the first eight to ten months of this project. You all know that everybody has a certain measure of fear about electronic technologies and how they may affect businesses. Even though we were dealing with back files, an area where publishers don't really get a significant amount of revenue, they are still worried about letting that aspect go. They are at the frontier of electronic technologies, and they think that just because the back files don't create revenue now, that doesn't mean they won't in the future. So trying to get publishers to participate has been a monumental task.
We then needed to answer the question, "What is a complete run?" Originally, in the simple-minded way we first thought about this project, we assumed we could just go to a library, get its past copies, and have a complete run. We had a relatively good relationship with Harvard, so we asked them if they could give us a complete run, to which they replied, "Sure, no problem." We bundled the volumes up and sent them to where we were going to keep them, and, of course, when we tried to input the data, we discovered that they weren't complete. There were missing issues and missing volumes and pages torn out and all kinds of problems. That is not meant as criticism of Harvard; it is only to say that there is nowhere you can go where all the pages will be there. So then we went looking for a publication record. But the record for The Economic Review, for example, didn't exist. So trying to figure out what a complete run is was another problem.
The next problem we ran across was how to organize information in a way that ensures convenient accessibility. It is an incredibly labor-intensive job; that is just what we have learned. Actually organizing the information and getting it into a format usable in an electronic system was extremely difficult. But, at the University of Michigan, we have developed an entire process for doing this. This is one of the areas where I hope we can be helpful as we learn more about electronic publishing, because, for those of you who will be converting your own materials, I am sure it would be useful to know what we learned and what steps we took to get the material converted. For example, although you all have experience with microfilm, microfiche, etc., it is limited to organizing the materials to be sent out to someone else who does the converting. But we have an extensive quality control process on the back end of this project. Also, we have had to learn how to ensure quality performance once we do get good data into the system. How do you make sure that you have a system that is really usable, dealing with issues such as response time and printing? We have really made great progress in these areas.
We are in the process of developing a paper we can put forth regarding the production process, but in the meantime let me quickly go through the steps in that process. There is preparation, conversion, and then the actual deployment. In each step there are special tasks that need to be done. It is a much more complicated undertaking than we thought, as I have said. The first thing to do after we get a publisher to sign up is to locate a complete run. Usually, we try to get it from the publisher, because, quite frankly, the publisher's run usually hasn't really been used, and so there is less likelihood of marked or torn pages than in a run that is publicly accessible in a library. We then go through every page of every journal. This is mostly done by students in the library school at Michigan, who literally page through the entire journal looking for missing pages, pages that have marks, or pages that are otherwise unsuitable for the database. They then make note of these pages so that we can find replacements.
Then guidelines are needed for organizing information for the scanning process. It is very difficult to actually translate a table of contents, for example, from the paper world to the electronic world. What we needed to solve was how to make something, whether it is an article or an information table, usable when it is on a screen and completely out of context of the entire issue. For example, a general discussion, or sometimes even the title of the issue, can be meaningless without a context to work from. So you have to figure out a way to organize this material so that once it is whisked out of context the user has some sense of what he or she is seeing for that particular citation. So that kind of work goes into preparing a set of guidelines for the people who are scanning the information, in order to create the usable database.
For the scanning process itself, we scan the pages in, creating images. These images form just one of the databases that are part of the JSTOR system. The second database is created by running OCR on those images to produce an unstructured text file. That text goes through two steps of error correction for quality control; we contract with the scanning vendor to guarantee at least 99.95 percent accuracy, and the pages are scanned at 600 dots per inch. We do all this to make the text usable for searching. It is not meant to be displayed on the screen; that is a whole other use, and it is much more expensive to reach an accuracy that would allow on-screen display. So we have created a text file which is useful for searching, but strictly as a complement to the page images.
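To put the 99.95 percent figure in perspective, here is a back-of-the-envelope calculation; the characters-per-page number is an assumption for illustration, not anything stated here:

```python
# Rough arithmetic on the 99.95 percent OCR accuracy guarantee.
# CHARS_PER_PAGE is an assumed figure for a dense journal page,
# not a JSTOR specification.
ACCURACY = 0.9995
CHARS_PER_PAGE = 3000

expected_errors_per_page = (1 - ACCURACY) * CHARS_PER_PAGE
# About 1.5 expected character errors per page: acceptable for
# searching, but visible if the raw OCR text were shown to readers.
```

This is why the text file works as a search complement but not as a display copy.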
Once we have images that are 100 percent correct -- perfect -- then comes the search text. There is a third, keyed-in database, an electronic table of contents; it contains all the citation information and, if there is an abstract, the abstract as well. So, when a user searches words in a title, author name, or abstract, the search runs against the table of contents database, which is linked to every article in the database. If you search for citation information and find an item, JSTOR knows exactly which pages the article is on and retrieves those page images for display.
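The linkage just described -- a searchable citation index resolving to page images, with OCR text kept separately for full-text search -- can be sketched roughly as follows. All names and structures here are hypothetical; the talk does not describe the actual implementation:

```python
# Hypothetical sketch of the three linked databases: page images,
# OCR full text (search only, never displayed), and a keyed citation index.
from dataclasses import dataclass

@dataclass
class Article:
    title: str
    author: str
    abstract: str
    page_ids: list  # keys into the page-image store

# Database 1: scanned page images (placeholder strings here).
page_images = {"p101": "<image of page 101>", "p102": "<image of page 102>"}

# Database 2: OCR text, used only for full-text search.
ocr_text = {"p101": "the marginal utility of ...", "p102": "... demand curve"}

# Database 3: keyed-in citation index, linked to the page images.
citations = [Article("On Demand", "A. Marshall", "A study of demand.",
                     ["p101", "p102"])]

def search_citations(term):
    """Search title/author/abstract fields of the citation index."""
    term = term.lower()
    return [a for a in citations
            if term in a.title.lower() or term in a.author.lower()
            or term in a.abstract.lower()]

def display(article):
    """Resolve a matched article to its page images for display."""
    return [page_images[p] for p in article.page_ids]

hits = search_citations("demand")
pages = display(hits[0])
```

The design point is the separation: searches never return raw OCR text to the reader; they return clean page images located through the citation index.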
These images all end up on CDs and are shipped back to the University of Michigan, where they are uploaded into the main database. Right now they live on magnetic disk drive servers with a RAID organization scheme.
We learned several things during the process of this conversion. We may, for instance, run across difficulties in publication records, such as what happened with the Review of Economics and Statistics. In its first year it had issues one through four, plus two supplementary issues. The next year there were four issues and one supplement. But then, in 1923, there were issues every month. So, if you were looking at the back file, didn't know its publication record, and saw that there were 12 issues in 1923, you might think that there were also 12 issues in 1922 and that we simply didn't have them, or you might begin wondering whether you were looking at the correct information. If the record shows four issues for every year, people are comfortable. When it shows four, seven, twelve, and then four, it is a problem. So what we have had to do in this case is go back to the publishers and ask for their complete publication records. The record we now have goes back to 1973, but we know for a fact about the erratic record of the 1920s, and so we have to start asking libraries and comparing. That is the kind of work that goes into this enterprise.
Another issue that comes up involves complex structures. For example, the American Economic Review may include the proceedings from a meeting in Akron, Ohio. Six to eight papers will have been presented at this meeting, but in the table of contents only the meeting title is listed. The papers are not indexed individually. This is why it is so important to have a librarian specialist look at the information from the start and do a kind of intellectual preparation of the materials: when the librarian goes through the record, he will find this kind of situation and tell us that, while typing in the index, we have to watch out for papers in these issues that require special treatment. In this case, the article is the entire proceedings, but we want to break the article down into the individual papers so that each one will be searchable by author and title. So indexing levels are created that allow these papers to be indexed as articles, and searchable as such. That is the kind of thing we have to do up front to get this right.
We now have a site on the Net: http://www.jstor.org. In the system you have options to search, browse, get help, etc. Browse just gives you an option to go in and look at your volume and issue numbers and see all the issues in sequence. You have the capability to search on the full text, the author, the title, or the abstract. We have tried to make the search form as easy to navigate as possible and to put as much self-documentation on the screen as possible. You can identify dates that you want to look at and put in what fields you might want to search. We have a listing at the top of the screen that says what you have searched for in each journal. Back on the shelf you would have to work your way through the issues to find the information; now you can put in a full text search on all the titles within the database, as far back as 1920 for the American Economic Review, or 1890 for the Journal of Political Economy.
Michigan also developed a process for making these journals very readable on-screen without the blotchy look so often seen from scanners. The display is nice and smooth and clear. When you scan images at 600 dpi, all the graphs, charts, subscripts, etc. come up readable, and if you print out the document it actually looks better than the original, especially if the original has been around awhile. The print-out is very crisp, very clear, and very, very useful.
What does all this offer people? Well, I will answer that from a system-wide perspective, considering scholars, publishers, and libraries, and talk about what we think the benefits are. First, obviously, there is the new level of searchability to which I just referred. You are getting access to information that was never accessible in this kind of form before. We are going to pay close attention to how this affects scholars. Yesterday there was a lot of discussion about whether or not undergraduates might benefit from these kinds of conversion processes, and I think they will, also. The availability of complete runs, the fact that this material is all available in a standard form right at the desktop, and the fact that there are no missing issues are all benefits. Then our experience will benefit all those who are trying to collect the titles in a discipline so that scholars can search across titles. Furthermore, as JSTOR goes forward and has some success getting publishers to sign up, there will be the capability to search across many different fields, all the way back to the first issues.
The benefits to the publishers are important. We believe that there is value in the publishing process, and we believe that there is also value in the branding, if you will, that exists right now. In scholarship, if somebody writes an article and knows, based on the quality of his or her work, that the article won't get into a particular journal, he or she will send it to a second-level journal or what have you. So there is a supply side efficiency, and there is obviously the real downside to this that no one can read the full volume of journals out there. But whatever happens to the publishers in the future, no one can question that, looking back, there were important titles out there that were published by publishers, and so we want them to participate.
The dissemination of published literature for which there is currently little economic return is something that we are doing, and something that no publisher would do. That is because it doesn't make sense for them to take this risk; but we have the resources, and we don't have to get a return on our investment. There it is: we are doing what publishers cannot do, both in that respect and in the fact that we are going backwards; we are doing a retrospective conversion. Furthermore, we can fulfill the mission to disseminate the material right away, whether or not we get any economic return, because all of this material is becoming available in a way it hasn't been before. We believe that there is a move towards the electronic publication of current issues, and we are building on that foundation. Basically, the concept is that we are giving the publishers a look at electronic publication, but with a set of materials that makes them less nervous than if we were using current issues. So I think we can get cooperation from publishers that we wouldn't be able to get otherwise.
In terms of benefits to libraries, I have talked a lot about access, the availability of entire runs, and the ability to build a block of collections. A lot of libraries have intermittent runs, caused by periods of time, like the war years, when library services were disrupted. Also, small libraries may not have some runs at all. As I mentioned earlier, learning about new technologies for conversion and about what has been happening in electronic literature is very important, and we are committed to being open about this information, sharing with you and others in the community what we are learning.
We also think -- and this is important -- that collaboration will provide many opportunities for lower costs. If 2,000 libraries all convert the same core titles, there are, of course, no economies of scale; they will all have converted the same material. But if we can do that job once and share the resources to do so, there will be savings, particularly long-term savings, and everybody prospers. We believe very strongly that we have an opportunity to start a project that is really about archiving electronic information.
Admittedly, we are very new, and so everyone asks me how I can be certain that we will be here ten or 20 years from now. I can't answer that. We have to prove over time that we will sustain ourselves, and we cannot ask you simply to wait and see; if this is going to work we obviously do need people to jump in with us at the beginning. It is very important to emphasize that there are long-term savings in storage, shelf space, operating costs, and other areas, but we won't see them until we do the work. We will work very hard to demonstrate the dimensions of these long-term savings.
Demonstrating and actually getting out there, doing something, learning, and getting real data is very important. We have struggled for a long time to negotiate with publishers, with the emphasis on the back file. Here is a quick count of the journals that have signed up: seven in economics; eight in mathematics, including all the journals of the American Mathematical Society; six in sociology, including the journals of the American Sociological Association; about six in history; and four spread across population studies, political science, and ecology; the list goes on. We now have 42 titles. Essentially, we have broken through the barrier. We have shifted from going out to publishers and pleading with them, to publishers coming to us and telling us that they are interested in participating in the program. They want to be associated with the other titles we have and with this market, so our concern has shifted from how to get publishers to sign up to how to get the material online.
There is no question that there are parallel production processes at other places, and we are committed to joining with them. There will be opportunities to do that at many places, and institutions will have a chance to learn more about this process by participating in this project.
So what is it that we are actually now unveiling? What are we doing? We are trying to get libraries and institutions to commit to coming on board, and we have gotten very enthusiastic responses. What is it that we are trying to do? Phase I of JSTOR is to have a minimum of 100 titles converted and available within three years.
We are going to start out on the first of January, and I want to talk briefly about the one-time JSTOR development fee. We have a one-time database development fee, and then a continuing annual fee for maintenance. Some libraries have asked why we don't just have an annual cost like every other vendor does, and the answer is that we are not like every other vendor. We are trying to convert something that nobody else is converting. Considering the economics, one sees that if there is a one-time fee for content and yet content is added all the time, the one-time fee isn't so bad. We will have different pricing for large, medium, and small libraries, because our goal is really to get as many people into the process as possible. A great number of people have told us that our eventual target of 750 libraries is too ambitious, but that is the point at which our annual access fees will pay for the enterprise's recurring costs.
What exactly are the prices? These are the prices for the large libraries. The formula is the number of undergraduates enrolled in FTEs plus two times the number of graduate students enrolled. The idea is to reflect some of the research aspects of the institution, and there are cutoffs. We will continue to work on that formula and publish what we find. For charter publishers and libraries we offer a discount, to encourage people to participate. That discount would apply not only to Phase I of JSTOR but to any future phases. One example of such a discount involves Science magazine. We are not going to put Science magazine into the main JSTOR, because it is a gigantic task that would bring us to a halt, but we have been talking to the AAAS about linking in an electronic version of Science. If you think about what this is, even if JSTOR didn't have the other benefits for archiving and for giving information to the community, if you just talked about access, it is still a fair price. It would cost more for you to convert the complete runs of just two titles to microfilm.
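The weighting just described is simple arithmetic; the size-class cutoffs below are invented placeholders, since the actual figures are not given here:

```python
def weighted_fte(undergrad_fte, grad_fte):
    """Weighted enrollment: undergraduate FTEs plus twice the graduate
    FTEs, reflecting the research intensity of the institution."""
    return undergrad_fte + 2 * grad_fte

def size_class(wfte, small_cutoff=5000, large_cutoff=15000):
    """Classify a library as small/medium/large by weighted FTE.
    Both cutoff values are hypothetical examples."""
    if wfte < small_cutoff:
        return "small"
    if wfte < large_cutoff:
        return "medium"
    return "large"

# An institution with 12,000 undergraduate and 3,000 graduate FTEs
# has a weighted FTE of 18,000 and falls in the "large" class here.
w = weighted_fte(12000, 3000)
```

Doubling the graduate count is what lets a research-heavy institution with modest total enrollment land in a higher pricing tier than its headcount alone would suggest.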
Not to end on a difficult note, but the final thing I want to mention is that it is difficult being drawn into the vortex between publishers and libraries on issues of copyright, intellectual property, etc. We have to find compromises and opportunities; ready-made solutions to these issues won't work for us, so we try to figure out new ways to approach these problems. Right now access will be restricted by IP address. Hopefully, though, as the technology evolves, we will help this access develop in more secure ways. There are many difficult issues to deal with.
We are going to start with a working term of three years, but we do want to assure you of our permanence. If JSTOR were to cease to exist for whatever reason, the data would still be delivered to the library in whatever the prevailing format was. We would not want libraries to move paper off the shelf and have it replaced or gotten rid of and then have nothing if JSTOR were not to succeed. As part of that, in all of our negotiations with publishers we have made sure that, although they have the right to withdraw from participation in this project, they cannot withdraw data from the participating libraries. In other words, whatever material we have converted and made accessible, according to the contract with our publishers, even if they withdraw, will still be accessible by anybody who has already purchased rights to it.
That was a very quick overview, so if any of you have questions I will gladly answer those now.
Copyright © 1999 by Kevin Guthrie
(From the Audience): Two questions. One, in the fees, have you considered or will you consider multi-institutional consortia?
MR. GUTHRIE: Yes.
(From the Audience): The second question is, what is the JSTOR view on interlibrary loan?
MR. GUTHRIE: JSTOR is essentially one giant consortium for this information. We have created an economic model based on individual participants paying a certain amount, with that sum hopefully covering our costs. That having been said, we recognize that consortia exist and are being put together all the time. What we will do is sit down with each consortium and identify exactly what costs are actually saved. Will there be one negotiation, and how much does that save everybody? How much does that save the system? Is there one set of people who would install the system at multiple locations? We will determine what the savings are and will pass them all along. We will be very open about how we determine that. We are happy to pass along the savings in the system, but we don't want to offer arbitrary ranges of discounts. We will work systematically through this question and offer the appropriate discounts.
We have talked a great deal about interlibrary loan, and we have learned a lot in the process. Basically, we would not want to be in a situation where one library subscribes to the data and then ten other libraries do not participate in our collaborative enterprise because they are getting access to the material from that one library. Now, the libraries tell me that won't happen, that these are core titles, so I needn't worry. We are going to do whatever we can to encourage those people to come to our site and to go directly to the publisher, but having said that, we will also allow interlibrary loans and allow libraries to continue the processes that they have had. What we will try to do is set up a period of, say, two years as a test when interlibrary loans are allowed, and ask you to help us collect data on how well it works. We are initially saying yes, but we will see what happens. What we don't want is for individual libraries to serve other libraries; it will just end up hurting them, because we will then have to raise prices to cover the costs. So we will work through this together, I hope, and test the proposition as to whether or not interlibrary loan will have a negative impact.
(From the Audience): Where does the division between current issues and the back file lie? Is it the same in every case, and do you have commitments of back files from all the participating publishers?
MR. GUTHRIE: Let me talk about just the concept of the current issues and back file collaboration. We are talking to publishers about how we can make a search on current issues continue back through the back files. We have talked a lot with OCLC, as an example, about their ECO project, making it possible for people who come in and search on ECO to search through a journal's current issues right back into the back file. I talked a minute ago about Science magazine. Johns Hopkins University was about to sign an agreement with them that would make the journal fully searchable. That requires a certain collaboration between the current issues and the back file. That, though, is a case where there are current electronic issues; most of the publishers we are dealing with right now don't have that.
So what we have is a non-standard approach, unfortunately. Different publishers want different amounts of protection, but most of them end up defining "current" issues as a period of either three or five years. It is not a fixed date, as it is in the test projects, where 1990 is the stop date. Rather, it is a moving wall, so we can guarantee that the archives are being kept up as time progresses.
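The distinction between a fixed stop date and a moving wall can be sketched in a few lines; the wall lengths here are the three- and five-year periods mentioned above:

```python
def most_recent_available_year(current_year, wall_years):
    """With a moving wall, the archive always ends a fixed number of
    years behind the present, rather than at a fixed stop date."""
    return current_year - wall_years

# With a five-year wall, in 1996 the archive runs through 1991;
# a year later it automatically runs through 1992, with no renegotiation.
y1996 = most_recent_available_year(1996, 5)
y1997 = most_recent_available_year(1997, 5)
```

A fixed stop date (say, 1990) would freeze the archive; the moving wall is what keeps the guarantee alive as time progresses.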
Enterprises, like OCLC, the University of Chicago Press, or HighWire Press, that are doing electronic conversions are not so keen on the moving wall concept, because they are building a larger database of information. Basically, in the old days when there was only a paper inventory, only the current year was sold and converted. After that, libraries took over, for the most part. That is archiving. I think of archiving as an economic construct; it is data or information that you don't sell and don't get a return on.
That changes in the electronic realm. Publishers will say they are in the archiving business. Well, don't believe that; they are really in a deeper-inventory business. Inventory is bigger in the electronic realm, and so what we say to the publisher is that it is fine if they want to keep maintaining the material, as long as they continue to provide access to it as a true, accessible archive. We will then agree to hold the wall at a certain point, but if they stop providing access for whatever reason, we ask that they turn the material over to us, and we will be responsible for the archives. We will continue to be an archiver of this material.
Finally, one of the factors in the OCLC discussions is that we will take the whole database and package it onto a separate physical medium, not an accessible one, but one used as a kind of "escrow" site for all the data, to provide another level of protection.
(From the Audience): Kevin, it is reasonable to assume that, as people become used to using JSTOR, they will find that it is, indeed, a superior mechanism for journals than hard copy is. Therefore, this sliding wall between the paper and the electronic will not only become irritating, but the publishers will have to determine if they would like to publish in electronic form. How would you deal with that? Do you have any plans to enter the publishing business with this so that there eventually will be an electronic journal from the outset, or do you see there being a potential conflict down the way if the publishers go electronic, which will then draw from you?
MR. GUTHRIE: This is one of our big challenges. We have to work through those questions of seamless searchability with the "current" issues. We already have irritations, if you will, from users in the communities that get JSTOR asking why we don't have the 1993 issues, for example. We are working toward that end. We have three demonstration projects where we will link current issues to back files, but we are not going to be in the current-issues publishing business. So we are working with the publishers to seamlessly link these two sets of data.
(From the Audience): Kevin, restricting access by IP address basically limits the access to campus use. Are you working to provide even an extended domain access, which is still probably not as much as we would need, but would at least be better?
MR. GUTHRIE: Yes, and we are also working on other methods of individual access to the databases, and we will work to help broaden these types of projects to deal with the issues I talked about a minute ago. For instance, a faculty member who is an authorized part of the Stanford community didn't have access to JSTOR from his location. That individual feels deserving of access to JSTOR, so we have to figure out a way to provide it from different locations.
(From the Audience): What is the monitoring process to assure one print-out, one electronic copy per person? How is that governed?
MR. GUTHRIE: We will have some software thresholds that will make it obvious when somebody is systematically downloading the database and trying to do something with it. This does happen, so we will pay attention to it. We won't, though, be running around trying to find every single copy.
Thank you.