Association of Research Libraries (ARL®)

http://www.arl.org/resources/pubs/symp3/blair.shtml

Gateways, Gatekeepers, and Roles in the Information Omniverse

Will it Scale up? Thoughts About Intellectual Access in the Electronic Networks

David C. Blair

Computer and Information Systems
University of Michigan

"I am convinced, Yorick, continued my father, half reading and half discoursing, that there is a northwest passage to the intellectual world; and that the soul of man has shorter ways of going to work, in furnishing itself with knowledge and instruction, than we generally take with it."
- Laurence Sterne, Tristram Shandy

I would like to examine, briefly, the prospects for intellectual access on large electronic networks; in particular, I am interested in how a searcher might find information that has a particular intellectual content or subject. I will not be considering searches for information by specific authors or with precise titles. I consider searches like this to be comparatively straightforward. Here I am concerned with the much more problematic searches for information that has a fairly specific intellectual content, but no precise authorship. Let me begin with a brief anecdote that may make this clearer:

Although I grew up on the East Coast, I went West for college, and like many before me I acquired an Easterner's fascination and love for the frontier. In particular, I became interested in those hardy individuals who preceded me, when Route 66 was just a series of Indian trails. Histories of the Westward expansion were interesting, but they were always told from the historian's viewpoint and lacked the immediacy that was apparent in a first-hand narrative. Of course I read Lewis and Clark's journals, but I also wanted to read about the journeys of less heroic proportions -- of those virtually anonymous individuals who went West in the early years of our country not because Thomas Jefferson asked them to, but because they, for lack of a better phrase, just wanted to try something new. Perhaps you can understand my enthusiasm when, by chance, I came across Osborne Russell's Journal of a Trapper [1834-1843] published by the University of Nebraska Press. It was an articulate, first-hand account of a fur-trapper who spent much of his time in and around Yellowstone and Jackson's Hole, Wyoming, an area that I knew well.

During the late 1960's and early 1970's, I chanced on other frontier narratives: Powell's Exploration of the Colorado River and its Canyons [Dover Books], Langford's B [University of Nebraska Press], and Tabeau's Narrative of Loisel's Expedition to the Upper Missouri [edited by Annie Abel, University of Oklahoma Press], to name a few. But my selection of books was not systematic; it relied on the pure chance of my browsing the gift shops of National Parks and the American history sections of random bookstores. My impression was that such first-hand narratives were rare items, so I treasured the few I had, and did not expect ever to find many more. A few years ago, though, as I was moving some of my books, the dust jacket to Tabeau's Narrative... came off and I found on the inside of that dust jacket, to my great surprise, a list of almost a hundred first-hand frontier narratives published by the University of Oklahoma Press. Here was the treasure trove of publications that I had dreamed of years before. Sadly, by this time, the urgencies of adult life had squeezed out most of the time I had for such recreational reading; I had been overtaken by what Zorba the Greek called the "full catastrophe" of adult responsibilities. I only wish I had chanced to read the inside of that dust jacket earlier.

The point of this anecdote is that such experiences, I think, are not uncommon; and it is not just a personal disappointment, but a commercial opportunity that was missed. I was an enthusiastic reader who would have gladly purchased and recommended many first-hand narratives of our Westward expansion -- the University of Oklahoma Press had published many of the kinds of books I wanted. Yet, neither of us knew the other existed. True, I had one or two of their books, but I did not know then that university presses often concentrated on certain types of books, so I did not pursue my interest with them; and certainly even if I had found out about Oklahoma's publications, there were likely to be other similar publications by other similarly unobtrusive publishers. I had almost no chance of tracking them down. I was like a hapless pioneer who wanted to go West, but didn't know where the Oregon Trail started. Bookstore assistants are often helpful and friendly, but their knowledge of the publishing world rarely goes beyond the best-seller lists and comprises information that is available to most of us, anyway.

Realistically, university presses and other not-for-profit publishers cannot expect to have the kind of publicity that major publishing houses have. But I am convinced that there are many customers who, like me, are enthusiastic about the specialty books that these small presses offer. In light of this, the rapid expansion of the loose "ad-hocracy" of networks that we call the Internet offers an enormous opportunity both for small publishers to become more widely visible, without incurring dramatic marketing costs, and for enthusiastic readers to "cruise" these networks in search of what Laurence Sterne, in his great 18th century novel Tristram Shandy, called the "northwest passage to the intellectual world". For Sterne, the "northwest passage to the intellectual world" was through the "auxiliary verbs"---for us, it may be through the Internet.

But it's easy to be enthusiastic at the beginning of any grand endeavor. The pioneering users of the Internet have returned with wonderful tales of access to vast intellectual riches. And this intellectual universe expands before us in much the same way that the frontier must have to the young Osborne Russell growing up in a small 19th century town in Maine. But can we map the topography of the intellectual resources on the Internet as easily as, say, the topography of the Pacific Northwest? Or, perhaps more pointedly, will someone who frequently loses his car keys be able to find his way to the information that he wants on a large electronic network? In order to understand this problem we need to make some basic distinctions. First of all, we need to have some understanding of the size of these publicly available networks; and by size, what I mean is how difficult it is to find what you want on them. The problem of size, then, is really a problem of access. But even here we must make another distinction---between physical access and intellectual access. Physical access is concerned with how you can get your hands on some information whose address you already know. Intellectual access is concerned with finding the address of some information that has a desired intellectual content. In terms of a library, finding where on the shelves the book with call number QA76.A1A84 is, is a problem of physical access; finding out whether the book with that call number is the one that you want, is a problem of intellectual access.

Clearly, the problem of intellectual access must be solved before the problem of physical access. But this is not the way the access problem of electronic networks is presented. In short, the dramatic physical access speeds of electronic networks seduce us into believing that our speeds of intellectual access will be commensurately fast. Such is not the case. But it's hard to understand the magnitude of the intellectual access problem. Perhaps an analogy will help. Suppose that we wanted to find a book that is one of several hundred accessible to us. This is rather like finding a particular individual in a crowded room of modest size -- this room, for example. Not a particularly difficult problem, even if our description of the book or person we are looking for is fairly general. But suppose we wanted to find a book in a small library of 50,000 books. Although we have all been to libraries of this size, it may still be difficult to imagine the magnitude of the task. Consider a similar problem: Many professional baseball parks in the United States hold around 50,000 spectators, so we might be able to better visualize our search task if we imagine our goal is to find a single individual attending a sold-out game at, say, Fenway Park or Tiger Stadium. Now our task is more formidable. Suppose also that our guidelines for finding the person we want are fairly general: that he is middle-aged, has dark hair, dark eyes, is 5'10" and slim. Our search is more difficult still. Now suppose we are searching for a book in a moderately large library of a few hundred thousand books. Here, the analogy would be to finding someone at a Rolling Stones concert in New York's Central Park. But even now, I don't think that we have yet comprehended the magnitude of the intellectual search space on the Internet.
Searching through the millions of intellectual resources that are currently available through the Internet, utilizing only the search tools also currently available, is analogous to searching through New York City for a specific person with only the general description that he has dark hair, dark eyes, is middle-aged and slim. Even if we could see a different individual every second -- that is, our physical access to these individuals was optimal -- our search would likely end in failure. Why? For two reasons: First, there are too many other people in New York City who have the same general description as the person we are looking for; and, second, as a searcher, there is a limit to our searching persistence---we can't, or won't, search forever. This is called the searcher's "futility point".
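
The arithmetic behind the futility point can be made concrete with a minimal sketch. It assumes, purely for illustration, that exactly one of the items matching a general description is the one wanted, that the searcher examines those items in random order, and that the searcher gives up after a fixed number of looks. The function name and all of the figures are hypothetical; none of them is a measurement.

```python
from fractions import Fraction


def chance_of_success(matching_items: int, futility_point: int) -> Fraction:
    """Probability of reaching the one wanted item before giving up.

    Illustrative assumptions: exactly one of the `matching_items` is the
    wanted one, the items are examined in random order, and the searcher
    quits after `futility_point` looks.
    """
    if matching_items < 1 or futility_point < 0:
        raise ValueError("need at least one matching item and a non-negative futility point")
    # The wanted item is equally likely to sit at any position in the
    # examination order, so the chance of reaching it in time is
    # futility_point / matching_items, capped at certainty.
    return min(Fraction(1), Fraction(futility_point, matching_items))


# Hypothetical searcher who will look at no more than 200 candidates.
for matches in (50, 2_000, 200_000):
    print(matches, float(chance_of_success(matches, 200)))
```

Under these assumptions the searcher is certain to succeed when 50 items fit the description, succeeds one time in ten when 2,000 fit, and one time in a thousand when 200,000 fit. Faster physical access shortens each look but does not change these odds.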

On large text retrieval systems, good physical access methods do not necessarily improve our overall search prospects. Faster computers and faster networks just get us more quickly to the wrong places---and on a network the size of the Internet, there are too many "wrong" places to go. Our patience with the search will run out long before we have exhausted the places to look.

If we look closely at the problem of intellectual access, we can see that the success of any search is critically dependent on how the desired information is represented or described. These representations are abstractions of the intellectual content of information. They may consist of titles, abstracts, keywords or other similar devices, and they may be applied either automatically or through some manual indexing process. Regardless of the methodology for creating such representations, they remain the key link in the process of intellectual access---no search can be better than the representations on which it depends. What can we say about these representations? To be effective, these representations must satisfy three criteria:

  1. They must accurately describe the intellectual content of the information they represent.

  2. They must clearly distinguish the content of the information they represent from the content of similar, but different, accessible information.

  3. They must uniquely describe and retrieve a small enough number of information items that the searcher can examine them without reaching her futility point and giving up.

Traditional subject descriptions usually satisfy the first criterion, but on large text retrieval systems they do not typically satisfy the second or the third. This is what we call the "scaling problem" in Information Retrieval. Using subject descriptions to find specific information in a large collection is like trying to find a specific individual in New York City using only general physical descriptions. Anyone who has tried to find fairly precisely defined books through the subject catalogue at the Library of Congress has an understanding of the true magnitude of this problem. One of the ironies of the publishing industry, though, is that the need for reliable, specific access to texts with precisely definable intellectual content is greater for the smaller publishers than it is for larger, mass-market publishers. Smaller publishers, such as university presses, have a fairly narrowly-defined segment of the book market, and must be accessible to customers with fairly precise interests. If these precisely definable publishers and customers cannot find each other, then we have a situation like the one I described at the beginning of this paper. No small publisher has enough economic slack to endure too many such missed opportunities.

What should we do if the representation of texts is the key to how accessible those texts are in a large retrieval system or network, and traditional subject descriptions are, as we have said, inadequate to this task? Let's go back to Fenway Park. We stated that under normal circumstances, it was futile to look for someone in a sold-out baseball stadium with such general descriptions as the color of his hair and eyes, his height and his weight. But if we also know that the individual we want always sits along the first base line, takes his 8-year-old twins with him, has a handlebar mustache, and wears a bright yellow jacket, then we have a much better chance of finding him. What we have done is to partition the search space---we have reduced a large search space to a smaller one. This is the key to intellectual access on large text retrieval systems. So now we can say that there are essentially two kinds of representations: those that describe the intellectual content of specific texts, and those that describe the partitions in the search space. The representations that describe the intellectual content of specific texts are already familiar to us; they are the titles, keywords, and subject descriptions that are commonly used to represent textual information. They are frequently imprecise, but this is only a problem when they are used to represent information in fairly large document or book collections. On small systems they work reasonably well.

Representations which partition the search space differ from the previous kind of representation in that they must describe information more precisely. Back to our baseball analogy: the representations that describe the intellectual contents of specific texts are like the general physical characteristics that describe the person we are looking for. The representations that describe the partitions in our search space are those like the fact that our baseball fan sits along the first base line. Such a description narrows our search space considerably and allows us to concentrate on a specific part of the stadium -- greatly increasing our chances of success. Descriptions such as "dark hair", "yellow jacket", or "brings his 8-year-old twins" don't help us nearly as much since we would still be committed to searching the entire ball park for individuals with these characteristics. Representations that partition a search space must be able to delineate a fairly precisely definable "region" in the search space---a region which almost certainly contains what you're after, whether a person or a book. If the information that you want is not in that definable region, then the partition can do more harm than good. In our baseball example, if we are led to believe that our friend sits along the first base line, but in fact, he prefers the bleachers, then we are likely to expend all of our search effort in a completely unproductive area.
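
The effect of a partition, including the harm done by a wrong one, can be illustrated with a small sketch of the ball park. Everything in it is hypothetical: the `Fan` record, the `looks_needed` function, the section names and the crowd sizes are assumptions made for the example, and the target is deliberately placed last so that the counts show the worst case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fan:
    section: str             # the partition, e.g. "first base line"
    hair: str                # a general description shared by many fans
    is_target: bool = False  # True only for the person we are looking for


def looks_needed(crowd, hair, section=None):
    """Count how many fans are examined before the target turns up.

    Returns None when the target is not among the candidates, which is
    what happens when the partition we were given is wrong.
    """
    candidates = [
        fan for fan in crowd
        if fan.hair == hair and (section is None or fan.section == section)
    ]
    for looks, fan in enumerate(candidates, start=1):
        if fan.is_target:
            return looks
    return None


# Hypothetical crowd: many dark-haired fans, only a few along first base.
crowd = (
    [Fan("bleachers", "dark") for _ in range(3000)]
    + [Fan("first base line", "dark") for _ in range(40)]
    + [Fan("first base line", "dark", is_target=True)]
)

print(looks_needed(crowd, "dark"))                     # whole stadium
print(looks_needed(crowd, "dark", "first base line"))  # correct partition
print(looks_needed(crowd, "dark", "bleachers"))        # wrong partition
```

With these invented numbers the general description alone takes 3,041 looks, the correct partition takes 41, and the wrong partition never finds the target at all, however long the search continues.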

So what's the "bottom line" in all of this? Can we really plot a route to the Northwest Passage of the intellect? I'm not sure, but we can certainly do better than we are doing today, and if we are to provide effective intellectual access across the large electronic networks we are now building, we will have to do better in a hurry. Most published book descriptions rely on the vague and overworked category of subject descriptions---representations that are not precise enough to accurately distinguish even a modest number of books, much less the total production of all the small presses that might have access to the Internet. We may be able to improve subject descriptions somewhat, but it is unlikely that they will get much better than they are now, and even marginal improvements will require substantial expenditures of effort. Subject descriptions were simply never meant to make fine intellectual distinctions among the texts in large collections. This is a fact of language. Does this mean that subject descriptions are useless? No. But they are only useful in making distinctions between small numbers of items. They need to be supplemented by better ways than we now have of partitioning large collections into smaller collections. For example, searching for a book on "computers" doesn't help much if we are looking across the listings of a large number of publishers. But if we know that what we want is very likely to be published by one or two small publishers, then, within that partition, the generally vague term "computers" may be useful. Here we can see one way in which publishers might be able to provide better access to their publications: if at all possible, they should describe the kind of material that they publish, and these descriptions, like the abstracts of journal articles, should be searchable as a separate category on a large network.
Here the searcher looks first for a publisher who is intellectually compatible with his search criteria, then he tries to describe the intellectual content of the specific publications he wants.
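
The two-stage search described here can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a description of any existing network service: the publishers ("Press A" and so on), their descriptions, the titles, and the `two_stage_search` function are all invented for the example, and matching is done by simple whole-word comparison.

```python
# Hypothetical catalogue: each publisher describes the kind of material it
# publishes, and that description is searchable separately from the titles.
CATALOGUE = {
    "Press A": {
        "description": "first-hand narratives of the american frontier",
        "titles": ["Journal of a River Expedition", "Letters from a Trading Post"],
    },
    "Press B": {
        "description": "computers and information systems",
        "titles": ["Computers in the Library", "Networks and Retrieval"],
    },
    "Press C": {
        "description": "popular fiction and general interest",
        "titles": ["Computers Took My Job", "A Frontier Romance"],
    },
}


def words(text):
    """Split text into a set of lower-case words."""
    return set(text.lower().split())


def two_stage_search(publisher_terms, title_terms, catalogue=CATALOGUE):
    """Stage 1 picks publishers by description; stage 2 searches only their titles."""
    wanted_publisher = words(publisher_terms)
    wanted_title = words(title_terms)
    hits = []
    for publisher, entry in catalogue.items():
        # Every publisher term must appear in the publisher's description.
        if not wanted_publisher <= words(entry["description"]):
            continue  # outside the partition, so its titles are never examined
        for title in entry["titles"]:
            if wanted_title & words(title):
                hits.append((publisher, title))
    return hits


# The vague term "computers" alone, searched across every publisher:
print(two_stage_search("", "computers"))
# The same vague term, searched only within a compatible publisher:
print(two_stage_search("information systems", "computers"))
```

Searched across the whole invented catalogue, "computers" retrieves a title from Press B and an unrelated one from Press C; restricted to publishers whose description mentions "information systems", the same vague term retrieves only the Press B title.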

But there are other ways of partitioning the intellectual universe of publications: some are obvious, some are less so, and some remain to be discovered. For example: types or forms of publication may be a useful way of partitioning a large collection of publications---diaries, letters, essays, collections, festschrifts, oral histories, to name a few. All of these could be searchable categories. Time and geographic partitions could be useful too, for both fiction and non-fiction works. Time and geography might also have more than one dimension: they could refer to the time or region in which the work was produced, or the time or place that the work is concerned with. In some cases, both dimensions might be useful. In other cases, expanding the context of publications might make useful partitions: for example, by making the institutional affiliations of authors a searchable field, inquirers could get access to publications written by authors connected to institutions which deal with issues they are interested in. In some instances, publications may be part of a related series of publications. It might be useful to make the name and some description of the series accessible to searchers as well.

These are just some examples of the kinds of partitions that could be made in the intellectual search space that publishers may find themselves in (and I'm sure that publishers can come up with much better candidates for partitions than I can). If we see the representation of the intellectual content of publications as a problem of simply describing their content, then it's hard to see the need for such partitions. But if we see the problem of intellectual access, as I do, as critically dependent on the number of items accessible, then we must not only describe the accessible texts faithfully, we must represent them in a way that makes them distinguishable from many other texts that have similar representations. This is the purpose of the partitions.

There is one more issue of intellectual access that I would like to mention---the issue of "closeness". Currently, we don't have a good sense of how close our near misses are. On small retrieval systems, this is not so much of an issue, but as the retrieval spaces get larger and larger, as they have on electronic networks, it is not enough to know whether our searches have failed; we need to have some sense of how far off the mark we have been. If not, then searching these large intellectual spaces becomes like pitching horseshoes in the dark---we can hear the "ringers", but if we miss, we have no idea how to correct our aim. It is possible that the partitions that I have discussed will give us a sense of how close our misses are, but clearly building this kind of feedback into our search mechanisms will be a challenge of major proportions.
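
One way partitions might supply this kind of feedback is sketched below. It is only an assumption about what a closeness measure could look like, not an established method: each record carries a few partition fields (form, region, period), and a failed search reports, for every record, how many of the requested fields it matched and which ones it missed. The field names, records and functions are all hypothetical.

```python
def closeness(query, record):
    """Return how many query fields a record matches, and which ones it misses."""
    missed = [field for field, wanted in query.items() if record.get(field) != wanted]
    return len(query) - len(missed), missed


def near_misses(query, records):
    """List the records that fail the query, closest first.

    Each entry is (fields_matched, fields_missed, record). Records that
    match equally well keep their original order.
    """
    failures = []
    for record in records:
        matched, missed = closeness(query, record)
        if missed:  # a full match is a hit, not a near miss
            failures.append((matched, missed, record))
    return sorted(failures, key=lambda entry: -entry[0])


# Hypothetical partition fields and records.
query = {"form": "diary", "region": "upper missouri", "period": "1830s"}
records = [
    {"id": 1, "form": "diary", "region": "upper missouri", "period": "1840s"},
    {"id": 2, "form": "essay", "region": "new england", "period": "1900s"},
    {"id": 3, "form": "letters", "region": "upper missouri", "period": "1830s"},
]

for matched, missed, record in near_misses(query, records):
    print(record["id"], matched, missed)
```

In this invented example nothing matches the query exactly, but the searcher learns that record 1 missed only on period and record 3 only on form, which suggests how the aim might be corrected, while record 2 missed on every field.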

Improving intellectual access will not be easy. But it will be necessary, I think, if we are to have adequate intellectual access to information on large networks. The Northwest Passage to the intellect may be, like the real Northwest Passage, just a vision, but, like the efforts of the pioneers who preceded us, our quest for it will, I think, improve our ability to get where we want to go.
