Some new evidence is presented suggesting that the costs of digitization, or at least compression, may be shrinking. In order to prepare this book for publication, the editors artificially cut off a discussion that still continues at the time of final proofs (May 1995) and shows no likelihood of ending for a long time.
From: Stevan Harnad harnad@ecs.soton.ac.uk Date: Thu, 19 Jan 95 09:23:53 GMT Message-Id: 21840.9501190923@cogsci.ecs.soton.ac.uk To: ginsparg@qfwfq.lanl.gov Subject: One other thing Cc: B.Naylor@soton.ac.uk (Bernard Naylor LIB), amo@research.att.com (Andrew Odlyzko)
Hi Paul, a graduate student with many prior years in publishing just told me something interesting about the economics of electronic publishing for paper publishers that made me whack my head and say "Why didn't I think of that?" It's about the contentious issue of the true per-page costs. He said that the reason the electronic-only, nonpublishers' estimates are so much lower than the paper publishers' estimates for the same electronic page is not ONLY because of what I had said (i.e., that theirs is based on subtracting electronic savings from a production system that is still designed for paper); or rather, what I said was naive and can be put much more explicitly:
TWO THIRDS OF THEIR PER-PAGE ESTIMATE IS BASED ON THE OVERHEAD FROM THEIR PAPER OPERATIONS! That is, they are not reckoning the true saving for the electronic journal considered in isolation, but for the electronic journal within the context of the continuing costs of all the other paper journals! If the overhead were figured (as I think it should be) proportionately to the actual costs of the electronic journal itself, our page-cost figures would be in much closer agreement.
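To make the accounting point concrete, here is a purely hypothetical illustration: the dollar figures are invented, and only the structure of the calculation is taken from the argument above.

# Hypothetical figures only: shows how allocating paper-era overhead to an
# electronic journal inflates its apparent per-page cost.
publisher_estimate = 30.00        # hypothetical publisher estimate, $/page
paper_overhead_share = 2.0 / 3.0  # fraction attributed above to paper-operation overhead
direct_cost = publisher_estimate * (1 - paper_overhead_share)
print(direct_cost)                # $10/page once the paper overhead is excluded

On these invented numbers, a publisher quoting $30 per electronic page and an electronic-only operation quoting $10 per page would in fact be describing the same direct costs.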
There may be a general economic lesson/problem in all this (though I, and I'm sure you too, would not be prepared to become an economist TOO, and not just an amateur publisher, in the service of hastening the electronic day for us all -- and for our own research lives in particular): The question is, How does one make a transition from an expensive technology to a cheap technology for accomplishing much the same thing? Railroads tried to protect coal-stoking jobs, and IBM tried to sustain mainframes, so I suppose the advent of diesel engines and universal PCs was slowed by this economic inertial force. Maybe it's natural. Is there any benign way to help hasten things without radically jeopardizing people's livelihoods?
It's easy to mask the sympathy that I know you too feel about this side of it, with an emphasis on the stereotype of profiteering, but I wonder how fair and realistic that is, if one looks at scholarly publishing more closely. (I'm not pretending to know, just confessing ignorance, and some worries that one does not want, historically, to become a scholarly capitalist, saying "People and environment be damned, the Darwinian market forces will converge on the optimal material outcome on their own.")
Just a sample of some moral vacillation that comes over me now and again. (Ironic, considering that at the moment, it's certainly electronic publishing that is the economic underdog...)
Chrs, Stevan
From: amo@research.att.com Date: Sun, 19 Mar 95 21:48 EST To: ann@cni.org, ginsparg@qfwfq.lanl.gov, harnad@ecs.soton.ac.uk Subject: conversion to electronics
Here is an interesting observation which suggests that conversion of old scholarly material to digital form may be feasible very soon, even sooner than I expected. I learned from Hal Varian, an economist from the Univ. of Michigan, that economists are setting up a project to digitize all the articles in the top half dozen or so journals in their field. Their costs (which are pretty reliable, as they are based on signed contracts) come to 40 cents per page, and "that includes scanning and OCR conversion to a very high (about 1 error/page) level of accuracy." The scanning takes place in the Dominican Republic, and involves destroying one copy of each journal issue. The scanning is done at 600 dpi, and (after application of lossless compression) leads to a file of about 40KB per journal page. If this file is kept, then as better OCR systems become available, they can be applied to obtain better text.
The cost of $0.40 per page has interesting implications. The estimates in my "Tragic loss ..." article were that the total mathematical literature consists of about 20 million pages. To digitize that would then cost $8M. This would be a one-time cost. By comparison, the total revenues of all publishers from mathematical publications are about $200M per year. Similar figures must apply to other fields. Thus all it would take to move towards converting everything to digital form is somebody to take on the task of organizing the effort. One can imagine all sorts of ways of paying for it.
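A quick back-of-envelope check (an editorial sketch; the page count and per-page figures are the ones quoted above):

# Arithmetic behind the estimates quoted above.
pages = 20_000_000     # approximate size of the mathematical literature
print(pages * 0.40)    # $8,000,000 one-time scanning cost
print(pages * 40 / 1e6)  # ~800 GB of compressed 600 dpi bitmaps at 40KB/page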
One major problem in undertaking such a digitization project is the issue of copyrights. The economists do not have a problem. The journals they are digitizing are all run by learned societies, which feel that placing old issues in the public domain would be good advertising for them. What about commercial publishers? The only data point is derived from a conversation I had this past Friday with a math editor for a large commercial house. He told me that his view, which he feels others at his company share, is similar, namely that there is no money to be made on back issues, and that releasing old copies into the public domain could only be helpful. It would be nice to get more data on this topic.
Best regards,
Andrew
From: Paul Ginsparg 505-667-7353 ginsparg@qfwfq.lanl.gov Date: Mon, 20 Mar 95 17:04:04 -0700 To: amo@research.att.com Subject: Re: conversion to electronics Cc: ann@cni.org, harnad@ecs.soton.ac.uk
andrew,
...40 cents per page, and "that includes scanning and OCR conversion to a very high (about 1 error/page) level of accuracy."
unlikely. what kind of ocr conversion are they talking about? is it to plain ascii, or does it preserve page markup and font information? (1 error/page is conceivable for single font clean text, but currently far from achievable for anything involving equations or in-line mathematics).
the only ocr i've played with that preserves page markup and font information is adobe's (the beta version of green giant), which translates to postscript (with little bitmapped inclusions of anything it can't identify) -- it does quite a good job on the overall page markup and font preservation, but on careful reading it made too many errors in equations (specifically greek subscripts/superscripts, due to their smaller size being effectively scanned at lower resolution). but it had some tunable parameters, so it could conceivably be optimized better for any specific application (e.g. not even try to identify anything smaller than a certain size, and instead leave it as a bitmap, which still left readable sub/superscripts)
on the other hand, as you point out, once one has the bitmaps the ocr can be done at any later time and is relatively straightforward (perhaps involving an additional level of interaction with a system that prompts to resolve marginal cases), so the scanning alone will be a useful enterprise. (though i preferred your alternate "suggestion" to retypeset it all in tex, modulo the difficulty of getting deceased authors to correct the new proofs...)
The scanning is done at 600 dpi, and (after application of lossless compression) leads to a file of about 40KB per journal page.
perhaps they have a compression algorithm optimized for bitmapped pages; standard gzip will not do nearly this well for 8.5 x 11 pages at 600 dpi.
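As a rough check on why the 40KB figure looks tight (editorial sketch, assuming an 8.5 x 11 inch page scanned at 1 bit per pixel):

# Raw size of a 1-bit 600 dpi scan and the lossless compression ratio that
# the quoted 40KB/page figure would require.
dpi = 600
width_px, height_px = int(8.5 * dpi), int(11 * dpi)  # 5100 x 6600 pixels
raw_bytes = width_px * height_px // 8                # about 4.2 MB uncompressed
print(raw_bytes)                                     # 4207500
print(raw_bytes / 40_000)                            # ~105:1 compression needed

A lossless ratio of better than 100:1 on every page is what the claim amounts to.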
so i question some of hal varian's info, but we know he's serious (i spoke to him a bit at msri in december, and had corresponded with him a few times previously and since), so we can wait and see what exactly they do, and at what functionality and cost.
He told me that his view, which he feels others at his company share, is similar, namely that there is no money to be made on back issues,
the lack of commercial value to the copyright of the older material is of course what stevan has been harping on all along -- except they may soon find that there is significantly less money to be made on future issues as well...
pg
From: amo@research.att.com Date: Tue, 21 Mar 95 22:45 EST To: ginsparg@qfwfq.lanl.gov Cc: ann@cni.org, harnad@ecs.soton.ac.uk Subject: Re: conversion to electronics
Paul,
The quotes were for straight ascii text. I did not get the details, but I doubt there was any worry about preserving page markup. (Hal Varian was not involved in the negotiations. In fact, on some details his memory was hazy. When I checked with him on the accuracy of my message, he found that the scanning actually takes place in Barbados, not the Dominican Republic. Also, the journals are not destroyed, only cut up and then rebound. However, the $0.40 price per page is correct, and includes the cutting up and rebinding.)
I do not know what compression algorithm is used, but 40KB per page of text at 600 dpi is reasonable, I have been assured by one of our local experts. Ordinary FAX compression (CCITT Group 4) is often superior to Lempel-Ziv, which is what gzip uses. It is a matter of tuning the algorithm for text.
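A minimal sketch of how one might test this comparison on an actual page image (editorial; "scan.tif" is a hypothetical local file holding a 1-bit scanned page, and the numbers will depend entirely on the scans in question):

# Compare CCITT Group 4 (fax) compression against gzip on the raw bitmap of a
# scanned page. "scan.tif" is assumed to exist locally; this shows how the
# comparison could be made, not what the result will be.
import gzip, os
from PIL import Image  # Pillow

img = Image.open("scan.tif").convert("1")        # force 1 bit per pixel
raw = img.tobytes()                              # uncompressed bitmap
img.save("scan_g4.tif", compression="group4")    # 2-D run-length (fax) coding
print("raw bytes:    ", len(raw))
print("gzip bytes:   ", len(gzip.compress(raw)))
print("Group 4 bytes:", os.path.getsize("scan_g4.tif"))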
I agree with you that it would be better to TeX all the old literature. However, that takes an order of magnitude more money. I got some precise figures from Keith Dennis, who is the new Executive Editor of Math. Rev. There are outfits in India and Poland that will do technical TeX for about $5/page. They use two typists, and have error rates of less than one error per 5 pages. (The corresponding figures for doing this in the US are around $20/page.) Thus processing the entire mathematical literature would cost around $100M. This is again a one-time cost, and is only half a year's cost of the present math journals. However, it is a large sum, and it might be hard to assemble. On the other hand, $8M is within reach of an organization such as the AMS, and indeed I have managed to get several people there interested in trying to do this.
Basically, if you have a project costing $8M, you can think of doing it on speculation. You would only need 200 libraries to cough up $40K apiece to recover your costs. Now $40K is what a good library spends on math journals in 3-4 months, and much less than it spends annually on space for all the old journals. Thus you can easily make a compelling economic case for a library to buy the digitized version. If you increase the price by a factor of 10, though, it is a different story.
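The same arithmetic for the TeX alternative and for cost recovery (editorial sketch, using only the figures quoted above):

# $5/page offshore TeX vs. $0.40/page scanning, over 20 million pages,
# with 200 libraries sharing the scanning cost.
pages = 20_000_000
print(pages * 5.00)        # $100,000,000 to re-key the literature in TeX
print(pages * 0.40)        # $8,000,000 to scan it
print(pages * 0.40 / 200)  # $40,000 per library to recover the scanning cost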
What excited me most about the information I sent out this past weekend was that we finally had reliable figures for the total cost of the entire process of taking old bound volumes, transporting them, cutting them up, scanning them, etc. The resolution of the scan, the storage requirement, and the quality of the OCR output are not as important. Those are going to improve dramatically in the next few years. If a terabyte is too expensive to store today, it will be half that price in 18 months, and a quarter in 3 years, and so on. On the other hand, the need to handle the original printed articles puts an effective floor on the cost of the project (at least until robots capable of going through books page by page are developed). To know that everything can be done for $0.40 per page is heartening.
In some cases the basic physical processes are expensive and likely to stay that way. In all the hoopla about the Information Superhighway you will sometimes see very high figures for the cost of rewiring the world with optical fiber. As it turns out, most of that cost is for general plant work. To wire up the average household in the US costs about $1,000. It does not matter much whether you use optical fiber, coax, or ordinary copper wire. Of this $1,000, about $400 is for the high-tech equipment, such as the cables, the switches, etc. These costs are going down rapidly. However, the other $600 is just for the stringing or burying of the cables, and this cost is going down very slowly (as improved tools counteract increases in wages).
Best regards,
Andrew
From: amo@research.att.com Date: Wed, 22 Mar 95 21:40 EST To: ginsparg@qfwfq.lanl.gov Cc: ann@cni.org, harnad@ecs.soton.ac.uk Subject: Re: conversion to electronics
Paul,
Hal Varian checked more carefully with his sources, and it turns out that the 600 dpi scans take about 100KB per page, using standard Group 4 Fax/TIFF compression. Thus while there may be algorithms around that do get down to 40KB, they are not going to be used in this project. Sorry for the confusion.
Regards,
Andrew
Date: Thu, 23 Mar 95 01:43:28 -0700 From: Paul Ginsparg 505-667-7353 ginsparg@qfwfq.lanl.gov To: amo@research.att.com Subject: Re: conversion to electronics Cc: ann@cni.org, harnad@ecs.soton.ac.uk
Hal Varian checked more carefully with his sources, and it turns out that the 600 dpi scans take about 100KB per page, using standard Group 4 Fax/TIFF compression. Thus while there may be algorithms around that do get down to 40KB, they are not going to be used in this project. Sorry for the confusion.
that sounds more realistic, though still a tight squeeze at that resolution, depending on what the pages actually look like. (as you know, from a simple counting argument there is no lossless compression scheme guaranteed to shrink every file of size N bits: since 1 + 2 + 2^2 + ... + 2^{N-1} = 2^N - 1, the total number of files shorter than N bits is less than the total number of possible N-bit files. in the case of bitmapped pages, however, these are anything but generic binary files, and the compression scheme can be tuned to take maximal advantage of the advance knowledge that there will be mainly whitespace. but as a simple exercise, consider a maximally idealized compression scheme that encodes only white<->black transitions, and the changes in their locations from one scanned line to the next, and see a) how small a compressed file that is likely to give, and b) how its size scales with the scanning resolution.)
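A toy version of the exercise proposed above (editorial sketch: the page content is synthetic and perfectly clean, so the totals are a lower bound rather than a realistic estimate, and the bit-cost accounting is deliberately crude):

from math import ceil, log2

def transitions(row):
    # column indices where the scanline flips white<->black
    return [i for i in range(1, len(row)) if row[i] != row[i - 1]]

def synthetic_page(dpi):
    # idealized "text": 40 lines of 30 rectangular glyphs on an 8.5 x 11 page
    w, h = int(8.5 * dpi), int(11 * dpi)
    page = [[0] * w for _ in range(h)]
    for line in range(40):
        top = h // 10 + line * dpi // 5
        for char in range(30):
            left = w // 10 + char * (dpi * 12 // 100)
            for y in range(top, top + dpi // 10):
                for x in range(left, left + dpi * 6 // 100):
                    page[y][x] = 1
    return page

def estimated_bits(page):
    # encode each scanline as transition positions, expressed relative to the
    # previous scanline's transitions (a crude stand-in for fax-style coding)
    bits, prev = 0, []
    for row in page:
        cur = transitions(row)
        if cur == prev:
            bits += 1                       # "same as previous line" flag
        else:
            bits += 8                       # flag plus transition count
            for t in cur:
                ref = min(prev, key=lambda p: abs(p - t)) if prev else 0
                bits += 2 + ceil(log2(abs(t - ref) + 2))
        prev = cur
    return bits

for dpi in (300, 600):
    print(dpi, "dpi:", estimated_bits(synthetic_page(dpi)) // 8, "bytes")

Under this toy scheme the size grows roughly linearly with the resolution (one flag bit per scanline plus slightly longer offsets), whereas the raw bitmap grows quadratically.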
pg
From: amo@research.att.com Date: Sat, 25 Mar 95 19:22 EST To: ginsparg@qfwfq.lanl.gov Cc: ann@cni.org, B.Naylor@soton.ac.uk, harnad@ecs.soton.ac.uk Subject: Re: conversion to electronics
Paul,
Certainly one cannot compress every file. The issue is how well you can do with standard software on printed text. The chaps who are doing the economics journal project have demonstrated 100KB per page with the usual fax algorithm at 600 dpi, 1 bit per pixel (and apparently about half that at 300 dpi). That already makes storage requirements manageable. If one is willing to give up on lossless compression, one could surely do better by applying various cleanup methods to the scanned image, but there does not seem to be any need for that.
Andrew