Jim Self
Director, Management Information Services
University of Virginia Library
December 2001
This document is designed to demonstrate the use of sampling to obtain accurate information about a library collection. Two research projects are explicitly described; the sampling methodology as described can be adapted for use in other situations and projects.
In a research library with millions of volumes, a complete census that examines every volume is rarely practical. To learn about a sizable collection, we need to sample it.
Sampling is a means of obtaining information about a population without examining every single member of the population. Sampling saves time and money at a cost of precision and accuracy.
We can use sampling to estimate:
The size of a sample necessary to produce a good estimate is a matter of judgment. A larger sample will of course bring greater precision and certainty than a smaller sample, but larger samples cost more than smaller samples. After considering the consequences of error and the costs of sampling, the researcher must decide how much time and money to invest in a sampling project.
This brief guide will not specify a sample size; rather, it will inform the researcher just how much precision and certainty can be obtained from a given sample size. In addition, it will offer practical advice and suggestions on how to conduct a sampling project in a library collection.
For a statistical procedure to be valid, the sample must be unbiased. Every member of the population should have an equal chance of being chosen. Extracting an unbiased sample from a large library collection is a challenging task. A typical research library contains a number of stack floors of varying size and arrangement. Some floors may be tightly packed with books while others contain many empty shelves. No one can make a good estimate of the collection by walking through it and looking at the shelves.
For research libraries, with their numerous and separated spaces of irregular layout, systematic sampling is generally the best procedure. The researcher counts all the shelves in the collection and selects every nth shelf for the sample. The same number, henceforth called the sampling index value, is used for the entire collection under study. Thus a sample might include every 17th shelf. Another researcher on another project might decide to sample every 37th shelf.
The sampling index value should be selected to yield a sample of appropriate size - not too large, not too small. It should be a number that does not fit a regular pattern of shelves per section in the library. If there are 7 shelves in most sections of a collection, a sampling index value of 7 would not work well, but 10 or 17 or 23 would be fine.
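The shelf selection described above can be sketched in a few lines of Python. This is a minimal illustration rather than part of the original procedure. The function names are hypothetical, the random starting shelf is an assumption (the text does not say where to begin counting), and the pattern check simply tests whether the sampling index value and the number of shelves per section share a common factor.

```python
import math
import random

def systematic_sample(total_shelves, index_value, start=None):
    """Return the 1-based numbers of the shelves to examine.

    Assumption: the first shelf is picked at random from the first
    `index_value` shelves, then every `index_value`-th shelf follows.
    """
    if start is None:
        start = random.randint(1, index_value)
    return list(range(start, total_shelves + 1, index_value))

def shares_pattern(index_value, shelves_per_section):
    """True if the index value shares a factor with the section size,
    so the sample would keep landing on the same shelf positions."""
    return math.gcd(index_value, shelves_per_section) > 1

# Every 17th shelf of a 100-shelf range, starting at shelf 5.
print(systematic_sample(100, 17, start=5))  # [5, 22, 39, 56, 73, 90]
print(shares_pattern(7, 7))    # True: always the same position in a section
print(shares_pattern(17, 7))   # False
```

Numbering the shelves in one continuous sequence across all floors, as the counting step implies, lets the same index value be applied to the entire collection.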
The sample size is determined by the need for certainty and precision. The level of confidence indicates the certainty of an estimate. The confidence interval indicates the precision.
As an example, suppose we are estimating the "fullness" of the shelves in a library. The project might find that the shelves are 74% full, with a confidence interval of plus or minus 3.5% at a confidence level of 95%.
This means we are 95 percent sure that the true percentage of "fullness" is somewhere between 70.5% (74 - 3.5) and 77.5% (74 + 3.5). It also means that in five percent of cases the true percentage would be below 70.5% or above 77.5%.
In most library work a 95% level of confidence is sufficient. In other fields (e.g., biomedical research or airplane manufacture) greater certainty is required.
Described below is the sampling methodology used in two research projects. The same methodology may be used or adapted for other library projects.
Project I: Estimating the amount of used and unused shelf space in a collection
In this scenario a researcher determines the percentage of the shelf space that is actually in use. As a practical matter, if an active collection has 85% of its shelf space filled, the collection is "full." Shelving books is very inefficient and time consuming when space is so limited.
Carrying out this project involves several steps:
| Confidence Interval | Approximate Sample Size for a 95% Level of Confidence | Approximate Sample Size for a 99% Level of Confidence |
| --- | --- | --- |
| 1% | 9,800 | 12,800 |
| 2% | 2,450 | 3,200 |
| 3% | 1,089 | 1,422 |
| 4% | 613 | 800 |
| 5% | 392 | 512 |
| 6% | 272 | 356 |
| 7% | 200 | 261 |
| 8% | 153 | 200 |
| 9% | 121 | 158 |
| 10% | 98 | 128 |
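For readers who want to compute a sample size for other confidence intervals, the usual textbook formula for a proportion can be sketched as follows. This is an assumption on our part, not the method used to build the table above: it takes the most conservative case (a true proportion of 50%) and ignores the size of the collection. Its results are close to the 95% column of the table but noticeably larger than the 99% column, so treat the two as separate approximations.

```python
import math
from statistics import NormalDist

def sample_size(margin, confidence=0.95, proportion=0.5):
    """Shelves needed so the interval is plus or minus `margin`.

    Textbook formula n = z^2 * p * (1 - p) / margin^2, rounded up.
    `proportion=0.5` is the worst case and gives the largest sample.
    """
    # Two-sided critical value of the normal distribution.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.ceil(z * z * proportion * (1 - proportion) / margin ** 2)

print(sample_size(0.05))        # 385
print(sample_size(0.05, 0.99))  # 664
```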
Calculating Confidence Intervals

| Sampling Index Value | Number of Shelves in Sample | Sample Results: % of Shelves Occupied | Plus or Minus Sample Error | Minimum Value in Confidence Interval | Maximum Value in Confidence Interval |
| --- | --- | --- | --- | --- | --- |
| 11 | 105 | 11.0% | 5.7% | 5.3% | 16.7% |
| 8 | 235 | 48.4% | 6.0% | 42.4% | 54.4% |
| 13 | 400 | 50.0% | 4.7% | 45.3% | 54.7% |
| 27 | 350 | 64.0% | 4.9% | 59.1% | 68.9% |
Table 2. Calculating confidence intervals at a 95% level of confidence. Double-click the table to activate the data-entry and calculation functions. (Editor's note: the tables in this document cannot be activated. Download the original Microsoft Word file with interactive tables.)
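Because the table cannot be activated here, the calculation can be approximated in Python. The sketch below is our reconstruction, not the original spreadsheet formula: it assumes a critical value of 1.96 for 95% confidence and a finite population correction based on the total number of shelves (sampling index value times shelves in the sample). Under those assumptions it reproduces the figures shown in Table 2; the function name is hypothetical.

```python
import math

def proportion_interval(index_value, sample_shelves, proportion, z=1.96):
    """Return (margin, minimum, maximum) for a sampled proportion."""
    total_shelves = index_value * sample_shelves
    # Standard error of a proportion from a simple sample.
    standard_error = math.sqrt(proportion * (1 - proportion) / sample_shelves)
    # Assumed correction for sampling a sizable share of a finite collection.
    correction = math.sqrt((total_shelves - sample_shelves) / (total_shelves - 1))
    margin = z * standard_error * correction
    return margin, proportion - margin, proportion + margin

margin, low, high = proportion_interval(11, 105, 0.11)
print(f"{margin:.1%} {low:.1%} {high:.1%}")  # 5.7% 5.3% 16.7%
```

The correction matters most when the sampling index value is small, because the sample then covers a large share of the shelves and the estimate is more certain than the uncorrected formula suggests.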
Project II: Estimating the number of volumes in a collection
This project requires the researcher to find two statistical values: the mean of the sample, and the standard deviation of the sample.
The mean is a measure of central tendency; it is calculated by summing all values in the sample and dividing by the number of observations. The term "average" usually refers to the mean. The standard deviation is a measure of dispersion; for a sample, it is found by summing the squared differences between each value and the mean, dividing that sum by one less than the number of observations, and taking the square root of the result. Fortunately, it is easy to use Excel to calculate these statistical values; the procedure is outlined below.
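The same two values can be obtained outside Excel. The short Python sketch below computes them by hand and checks the result against the standard library; the shelf counts are made up for illustration, and the helper name is hypothetical.

```python
import math
from statistics import mean, stdev

def sample_standard_deviation(values):
    """Sample standard deviation computed step by step."""
    average = sum(values) / len(values)
    squared_differences = sum((v - average) ** 2 for v in values)
    # Divide by n - 1 because the values are a sample, not the whole collection.
    return math.sqrt(squared_differences / (len(values) - 1))

volumes_per_shelf = [20, 25, 30]  # invented counts from three sampled shelves

print(mean(volumes_per_shelf))                       # 25
print(sample_standard_deviation(volumes_per_shelf))  # 5.0
print(stdev(volumes_per_shelf))                      # 5.0
```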
To estimate the number of volumes in the collection, start with steps 1-4 as described in Project I, then proceed as follows:
Table 3. Estimating Collection Size From a Sample

| Sampling Index Value | Number of Shelves in Sample | Mean Volumes per Shelf | Sample Standard Deviation | Estimated Number of Volumes in Collection | Minimum Number of Volumes | Maximum Number of Volumes |
| --- | --- | --- | --- | --- | --- | --- |
| 17 | 800 | 27.30 | 10.60 | 371,280 | 361,588 | 380,972 |
| 2 | 670 | 22.63 | 9.72 | 30,324 | 29,627 | 31,022 |
The results in Table 3 are calculated as follows:
Multiplying the sampling index value by the "number of shelves in sample" gives the total number of shelves in the collection (not displayed in the table). The total number of shelves is multiplied by the "mean volumes per shelf" to produce the "estimated number of volumes in collection."
Four variables are used to calculate a mean error of the sample: "number of shelves in sample," the total number of shelves in the collection (not displayed), "mean volumes per shelf," and "sample standard deviation." The mean error of the sample is used to calculate a confidence interval for the "mean volumes per shelf" variable, as well as the "minimum number of volumes" and the "maximum number of volumes."
All calculated variables may be viewed if the table is activated and the columns are "unhidden." The calculation formulas are also visible if one clicks on the appropriate cell. The formulas assume a sample size of at least 100 shelves; if fewer than 100 shelves are sampled, the estimates will be somewhat less reliable.
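Since the hidden spreadsheet formulas are not available here, the following Python sketch shows one way to carry out the calculation. It is our reconstruction under stated assumptions: a critical value of 1.96 for 95% confidence, the standard error of the mean, and a finite population correction based on the total number of shelves. With these assumptions it reproduces the figures in Table 3, though the original formulas may differ in detail. The function name is hypothetical.

```python
import math

def estimate_collection(index_value, sample_shelves, mean_volumes, std_dev, z=1.96):
    """Return (estimate, minimum, maximum) volumes in the collection."""
    total_shelves = index_value * sample_shelves
    estimate = total_shelves * mean_volumes
    # Error of the mean volumes per shelf.
    standard_error = std_dev / math.sqrt(sample_shelves)
    # Assumed correction for sampling from a finite number of shelves.
    correction = math.sqrt((total_shelves - sample_shelves) / (total_shelves - 1))
    # Scale the per-shelf margin up to the whole collection.
    margin = z * standard_error * correction * total_shelves
    return estimate, estimate - margin, estimate + margin

estimate, low, high = estimate_collection(17, 800, 27.30, 10.60)
print(round(estimate), round(low), round(high))  # 371280 361588 380972
```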
Other Projects
As noted above, we can use sampling to estimate the incidence of a particular characteristic in a collection (e.g., damaged books). Such an undertaking would be a variant of Project I. Table 2 would calculate the estimates; one simply changes the header from "% of Shelves Occupied" to an appropriate description such as "% of Books Found to Be Damaged."
To find the number of linear feet (or meters) of filled and available shelf space, we can use a modification of Project II. Instead of counting volumes, we measure (in inches or centimeters) the available and filled shelf space, and adapt the procedures for Project II.
To determine a collection growth rate, we can carry out sampling projects at regular intervals and compare the results over time. Alternatively, if we know the number of volumes added per year, we can adjust the results from Project II to estimate the number of years before a facility is completely filled.
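The second approach can be illustrated with a rough calculation. The sketch below is only one possible way to make the adjustment and rests on several assumptions of ours: that the share of shelf space in use is known (for example from Project I), that new volumes take up about as much space as existing ones, and that growth is steady. The function name, the parameter names, and the figures are hypothetical; the default threshold of 85% follows the working definition of "full" given for Project I.

```python
def years_until_full(estimated_volumes, fraction_occupied,
                     volumes_added_per_year, full_at=0.85):
    """Rough number of years before the shelves reach `full_at` of capacity."""
    # Volumes the shelving could hold if every shelf were packed.
    capacity = estimated_volumes / fraction_occupied
    remaining = capacity * full_at - estimated_volumes
    # A collection already past the threshold has zero years left.
    return max(remaining, 0) / volumes_added_per_year

# Invented figures: 300,000 volumes, shelves 75% occupied, 5,000 added per year.
print(round(years_until_full(300_000, 0.75, 5_000), 1))               # 8.0
print(round(years_until_full(300_000, 0.75, 5_000, full_at=1.0), 1))  # 20.0
```

Setting `full_at=1.0` answers the stricter question of when the facility would be completely filled.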
If it is done properly, sampling gives us the opportunity to obtain reliable information at a comparatively low cost. But to ensure reliability, we must select the sample in an unbiased fashion, and we must choose a sample of appropriate size.
self@virginia.edu