UCLA Library/OCLC Non-roman Script Project Bridges Access Gaps to Research Materials

Last Updated on July 9, 2022, 9:45 am ET

person wearing a headscarf and earbuds seated at a library carrel desk using a laptop computer
image by Elena Zhukova for UCLA Library

With resources available in 414 languages and 12 million print and electronic volumes, UCLA Library attracts more than 15 million information seekers from around the world annually via its website.

For a certain segment of UCLA Library catalog users, namely speakers of languages written in non-roman scripts, such as Russian, the search may be arduous, especially if they are searching among the millions of bibliographic records that were created before 1980. Instead of text in the original Cyrillic alphabet, a Russian speaker will find text that has been transliterated into the Latin alphabet, sometimes with errors that were inadvertently introduced.

The problem is not unique to UCLA Library’s catalog; it is common to library databases globally. It is rooted in the changes that the US Library of Congress and others made as the catalog evolved from cards handwritten in the original script to machine-printed cards with roman script to fully computerized systems.

Thanks to technological advances and Unicode, computerized records can now display text in both the original script and transliteration. Unfortunately, millions of older records offer only the transliteration.

How extensive is the problem? In OCLC’s international database, WorldCat, 67 languages written in non-roman scripts have a sizable number of records containing original script. Together they account for 52 million records, only 70 percent of which contain script. The remaining 30 percent, on average, lack their original script, and for many of these languages the percentage is much higher. As of this writing, for example, 73 percent of Russian records in WorldCat are available only in Latin script. For works in Urdu, the figure is 80 percent; for Hindi, 93 percent; and for Greek, 86 percent.

This situation has caused a great deal of difficulty and frustration for some catalog users trying to find works written in non-roman script.

While a few local library systems have attempted to add original non-roman script to their local copies of bibliographic records, it wasn’t considered a high priority until 2019. At that time, UCLA Library, in collaboration with OCLC, embarked on the first large-scale project to augment legacy records for Russian and Armenian works.

A UCLA Library team of catalogers, language experts, and a computer programmer has now augmented about 100,000 bibliographic records for Russian and Armenian works in the library’s catalog. Using the same process, OCLC has enhanced up to 1 million of its Russian records in WorldCat. More than 1.4 million Russian-language records newly display Cyrillic script; 43,000 Armenian titles have been similarly improved.

“For those of us who work in non-roman scripts, it’s been something we’ve always wanted to bring back to the catalog,” said Peter Fletcher, Cyrillic catalog librarian and leader of the international team at UCLA Library. “Other libraries have done it for their own collections, but no one was thinking of doing it nationally or internationally.”

There were other problems with the transliterations in WorldCat records, said OCLC senior program officer Karen Smith-Yoshimura, now retired, in a 2020 Hanging Together blog post. The romanization in Russian records predominantly follows rules set by the American Library Association (ALA) and the Library of Congress. But libraries outside the Anglo-American sphere follow the ISO standard, she pointed out.

“This has been a severe handicap to users who look for Russian titles by Russian authors and expect that, of course, the titles would be available in the script the material was written—Cyrillic,” Smith-Yoshimura wrote.

Another complication stems from the fact that in some cases the same transliteration tables were incorrectly used for different languages that shared the same script.

“The problem catalog users have with transliteration is that you have to know what system the library world is using,” said John Riemer, UCLA Library’s head of Resource Acquisitions & Metadata Services, who led the project with Fletcher. “The Library of Congress uses one standard and the Europeans use another. So when you are searching for a name like Tchaikovsky, do you search for ‘Tch’ or ‘Ch’ when you type in the roman alphabet characters? It’s much easier for speakers of Russian to type in the Cyrillic characters.”

But the difficult job of going back to add the original script to older records was, at first, not a high priority for OCLC. So UCLA Library began planning to go it alone. It was time, they decided.

“We had been asking OCLC for years to please consider enabling the addition of Armenian script to WorldCat records because we have a large Armenian collection,” Riemer said.

Russian would make a good first test case, the UCLA Library decided. The Cyrillic characters in Russian have a one-to-one relationship with corresponding Latin characters. Moreover, the Russian script is similar in that regard to Armenian. If they succeeded with Russian, Armenian would be next.
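To illustrate why a one-to-one character relationship lends itself to automated processing, here is a minimal Python sketch of reversing a romanization back into Cyrillic. It is not the method the UCLA and OCLC teams used, which the article does not describe. The table `ROMAN_TO_CYRILLIC` and the function `to_cyrillic` are hypothetical names, and the table is a deliberately incomplete subset that omits the diacritics, ligatures, and hard and soft signs that a real romanization table includes.

```python
# Hypothetical, incomplete romanization table for illustration only.
# Multi-letter sequences are listed alongside single letters; the
# function below always tries the longest possible match first.
ROMAN_TO_CYRILLIC = {
    "shch": "щ",
    "zh": "ж", "kh": "х", "ch": "ч", "sh": "ш",
    "a": "а", "b": "б", "v": "в", "g": "г", "d": "д", "e": "е",
    "z": "з", "i": "и", "k": "к", "l": "л", "m": "м", "n": "н",
    "o": "о", "p": "п", "r": "р", "s": "с", "t": "т", "u": "у",
    "f": "ф", "y": "ы",
}

MAX_KEY_LENGTH = max(len(key) for key in ROMAN_TO_CYRILLIC)


def to_cyrillic(text: str) -> str:
    """Convert romanized text to Cyrillic using greedy longest-match lookup.

    Characters with no entry in the table (spaces, digits, punctuation)
    are copied through unchanged.
    """
    result = []
    position = 0
    while position < len(text):
        for size in range(MAX_KEY_LENGTH, 0, -1):
            chunk = text[position:position + size].lower()
            if chunk in ROMAN_TO_CYRILLIC:
                cyrillic = ROMAN_TO_CYRILLIC[chunk]
                # Preserve capitalization of the first romanized letter.
                if text[position].isupper():
                    cyrillic = cyrillic.capitalize()
                result.append(cyrillic)
                position += size
                break
        else:
            # No table entry matched at this position; keep the character.
            result.append(text[position])
            position += 1
    return "".join(result)


for romanized in ("Moskva", "Pushkin", "Chekhov"):
    print(romanized, "->", to_cyrillic(romanized))
```

The longest-match step matters because some Cyrillic letters are romanized with several Latin letters. Without it, "sh" would be read as two separate characters. The error patterns described later in the article, such as incorrect diacritical marks, are exactly the kind of input a simple table lookup like this cannot handle without prior cleanup.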

Before the project was launched, the UCLA team needed to make sure it had the support of an international community of library catalogers who work with Russian and Armenian.

With that assurance, the team then had to identify the older records that needed to be improved. With millions of records to scan, they estimated it would take two and a half years to find them in the WorldCat database at the scanning rate of one second per record.
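The scale behind that estimate can be checked with simple arithmetic. The sketch below assumes continuous scanning and a 365.25-day year, neither of which the article specifies; `records_scanned` is a hypothetical helper name. The resulting figure is only what the stated rate and duration imply, not a record count reported by the project.

```python
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_YEAR = 365.25  # assumption: average year length, scanning around the clock


def records_scanned(years: float, seconds_per_record: float = 1.0) -> int:
    """Return how many records can be scanned in the given number of years."""
    total_seconds = years * DAYS_PER_YEAR * SECONDS_PER_DAY
    return int(total_seconds / seconds_per_record)


# Two and a half years at one second per record.
print(f"{records_scanned(2.5):,}")  # prints 78,894,000
```

Under those assumptions, two and a half years of scanning corresponds to roughly 79 million records examined one at a time.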

Fortunately, support for the project began building at OCLC. With dedicated resources available, OCLC could quickly locate the records the project needed. And OCLC could also clean up inconsistencies in the Latin-alphabet transliteration.

“Whatever approach we decided to take,” said Riemer, “it had to lend itself to batch mode. It couldn’t involve human review of hundreds of thousands of records.”

Fletcher, who did his undergraduate degree in Russian and East European studies, worked with a Russian language expert to spot patterns of errors, such as incorrect diacritical marks applied to transliterated characters. Fletcher also served as a Russian language consultant for OCLC.

The positive feedback they’ve received has motivated OCLC and UCLA to move forward with other languages. While OCLC wants to focus on other Cyrillic languages, such as Serbo-Croatian and Bulgarian, the UCLA team hopes to work on Arabic, Persian, and Hebrew records.

OCLC staffers said they don’t consider this a one-and-done project, since new records lacking the non-roman script continue to arrive in the WorldCat database. They are looking forward to working with UCLA on future endeavors.

“We hope to enlist OCLC’s support in cleaning up the transliterations in other cataloging records,” said Riemer. “We also plan to write an article for libraries documenting our approach with the hope of inspiring them to augment the metadata for other languages.”

The project team has already heard from someone at Yale University Library who wants to apply the strategy to Lao, the official language of Laos.

“Communities using the more obscure languages get excited when they hear about the project because their languages are very dear to them,” Fletcher said.

The UCLA Library aspires to eventually add script to all of its transliterated records, opening up broader access to millions of resources for users worldwide.