Association of Research Libraries (ARL®)


Scholarly Publishing on the Electronic Networks

SGML and the Text Encoding Initiative: What and Why?

Susan Hockey

Director, Center for Electronic Texts in the Humanities
Rutgers and Princeton Universities

Electronic texts which contain encoding or markup tags are much more useful than those without markup. For on-line searching, markup is needed to specify the document structure, so that text which is retrieved can be identified by its location within the text. That location may be chapter, page, verse, act, scene, title or whatever the units are which make up the text. Markup is also needed when sections of the text are to be searched, for example all the documents by author Smith or all those published later than 1980. For printing, markup is used to format the document or to indicate codes which drive a typesetter.
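The kind of fielded searching described above can be sketched in a few lines of Python. The record layout, field names and sample data here are invented for illustration and are not drawn from any of the systems discussed.

```python
# Illustrative sketch: markup lets retrieved text be reported by its
# location (chapter, page, etc.) and lets searches be restricted to
# documents with certain properties (author, date of publication).
# The record structure below stands in for a marked-up text collection.

documents = [
    {"author": "Smith", "year": 1985, "chapter": 2, "text": "the whale appeared"},
    {"author": "Jones", "year": 1975, "chapter": 1, "text": "a whale of a time"},
]

def search(word, author=None, after=None):
    """Find a word, optionally limited by author or by year of publication."""
    hits = []
    for doc in documents:
        if author is not None and doc["author"] != author:
            continue
        if after is not None and doc["year"] <= after:
            continue
        if word in doc["text"]:
            # report the hit together with its location in the text
            hits.append((doc["author"], doc["chapter"], doc["text"]))
    return hits

print(search("whale", author="Smith", after=1980))
# [('Smith', 2, 'the whale appeared')]
```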

The history of encoding and markup shows the development of many different schemes. For the kinds of text analysis which scholars in the humanities have been performing for years, e.g. concordances and text retrieval, schemes were developed to denote the canonical reference structure of scholarly texts. Perhaps the best known example is COCOA, which was first devised for an archive of Older Scottish Texts in the 1960s and was later adopted by the COCOA concordance program and its successor OCP (the Oxford Concordance Program). The text retrieval program WordCruncher uses a markup scheme for references which gives a three-level hierarchical structure within a text, reflecting WordCruncher's first use for the Book of Mormon and the Bible. The Thesaurus Linguae Graecae (TLG), some 60 million words covering all the major works of Ancient Greek literature and based at Irvine, also developed its own markup scheme, called beta code, which has been used for some other biblical and ancient texts. At least ten other schemes have been in use in humanities computing.

For formatting and printing, a parallel group of markup schemes was developed, most notably that used by the typesetting program TEX, as well as Scribe, TROFF and some typesetter-specific schemes. The internal format used by some word processing programs is not unlike these markup schemes, all of which specify how text is to be printed, not what its intrinsic structure is. The inevitable result of this plethora of encoding schemes has been time wasted on conversion, lack of adequate documentation, texts unusable for any purpose other than that for which they were originally created, and the inability to extend the coding in a text.

A Standard Format -- SGML

In the fully electronic world, a standard format which can be used for many different purposes provides a way forward which will pay back the investment in it and yield texts which will be usable for many years. The same text should be used for different applications such as on-line searching, hypertext, and display and/or printing. The Standard Generalized Markup Language (SGML) now provides a means of creating such a text. SGML became an international standard in 1986. It is not itself an encoding scheme, but is a meta-language in which markup codes or tags can be defined. The principle of SGML is 'descriptive' not 'prescriptive', that is, it provides a means of describing or marking the components of a text. It does not prescribe what processes are to be performed on the text. That is the function of the particular computer program which operates on the text. Typical components of a text may be title, author, paragraph, chapter, act, scene, speech or features such as quotations, lists, names, addresses, dates etc. The components are marked by encoding tags within the texts and what is called an 'SGML application' provides the set of tags for one application area.

At one level, SGML considers a text to be composed simply of streams of symbols, which are known as 'entities'. An entity is any named bit of text and an entity definition associates a name with a bit of text. One use for entities is to encode characters which are not on the keyboard. For example, the entity reference &beta; could be used for the Greek letter beta, or &alef; for the Hebrew letter alef. Although this may seem a clumsy way of encoding non-standard characters, it is needed for transmission across all networks, and a computer program can convert from other machine-specific character formats to entity references. A second use is for expanding abbreviations, for example &TEI; for Text Encoding Initiative.
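The conversion into entity references that this paragraph describes can be sketched as follows. The entity table and function are illustrative assumptions, not part of any particular SGML application.

```python
# Minimal sketch of converting non-ASCII characters into SGML entity
# references so that a text survives transmission across networks that
# handle only plain ASCII. The entity names follow common conventions
# but are illustrative only.

ENTITY_NAMES = {
    "β": "beta",    # Greek letter beta  -> &beta;
    "α": "alpha",   # Greek letter alpha -> &alpha;
    "é": "eacute",  # e with acute accent -> &eacute;
}

def to_entity_references(text):
    """Replace characters that are not plain ASCII with &name; references."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)              # plain ASCII passes through unchanged
        elif ch in ENTITY_NAMES:
            out.append("&" + ENTITY_NAMES[ch] + ";")
        else:
            out.append("&#%d;" % ord(ch))  # fall back to a numeric reference
    return "".join(out)

print(to_entity_references("café"))  # caf&eacute;
```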

At a higher level a text is composed of objects of various kinds, which are known as 'elements'. These identify the various components of a text, which are whatever the compiler of the text chooses to encode. Each element is marked by a start and end tag. For example, a novel's title might be encoded as

<title>Middlemarch</title>

Here the title of the novel is tagged as a title. Angle brackets delimit the tags, with the end tag beginning with </.

Attributes may be associated with elements to give further information about the element. For example, the tag <chapter> may carry an attribute giving the number of the chapter, as in

<chapter n=1>

and the tag <name> may carry the attributes

type     type of name
normal   normalized form

as in

<name type=personal normal='SmithJ'>Jack Smyth</name>

This would enable an index of personal names to be made in which Jack Smyth would be listed under SmithJ. Attributes can also be used extensively for cross-references, which are resolved into concrete references only when the text is processed.
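A sketch of how the normal attribute supports such an index follows, assuming a simplified tag syntax matched with a regular expression; a real SGML application would use a validating parser rather than pattern matching.

```python
import re

# Illustrative sketch: collect <name> tags and list each surface form
# under its normalized form, so that "Jack Smyth" is indexed under
# "SmithJ". The regular expression handles only the simple tag shape
# shown in the text above.

NAME_TAG = re.compile(r"<name\s+type=(\w+)\s+normal='([^']*)'>([^<]*)</name>")

def name_index(text):
    """Map each normalized personal name to the surface forms found."""
    index = {}
    for tag_type, normal, surface in NAME_TAG.findall(text):
        if tag_type == "personal":          # index personal names only
            index.setdefault(normal, []).append(surface)
    return index

sample = ("<name type=personal normal='SmithJ'>Jack Smyth</name> met "
          "<name type=personal normal='SmithJ'>J. Smith</name>.")
print(name_index(sample))  # {'SmithJ': ['Jack Smyth', 'J. Smith']}
```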

The SGML Document Type Definition (DTD) defines the elements which make up a document, giving relationships between them. This is called the content model. A very simple example could be a play which has a content model of acts which are composed of scenes which are in turn composed of speeches. The DTD is used by an SGML 'parser' to validate a document, by checking that it conforms to the model which has been defined. SGML is based on a content model which consists of a single hierarchic structure or tree.
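The validation step can be illustrated with a toy checker for the play content model just described. The data structure and checker are inventions for illustration; they show what a parser verifies, not how a real SGML parser is implemented.

```python
# Toy content model: a play contains acts, an act contains scenes, and
# a scene contains speeches, mirroring the example in the text. Each
# element is a (tag, children) pair; validate() checks that every
# element contains only the children its content model allows.

CONTENT_MODEL = {
    "play": {"act"},
    "act": {"scene"},
    "scene": {"speech"},
    "speech": set(),   # speeches contain only text, no sub-elements
}

def validate(element):
    """Return True if the element tree conforms to CONTENT_MODEL."""
    tag, children = element
    allowed = CONTENT_MODEL[tag]
    for child in children:
        if child[0] not in allowed or not validate(child):
            return False
    return True

play = ("play", [
    ("act", [("scene", [("speech", []), ("speech", [])])]),
    ("act", [("scene", [("speech", [])])]),
])
print(validate(play))                       # True
print(validate(("play", [("scene", [])])))  # False: scene directly inside play
```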

Within SGML it is not easy to build a content model which permits more than one tree in a document. Such multiple hierarchies are often necessary for humanities texts which have more than one canonical referencing scheme. For example, a printed edition of a scholarly text may use references (folios, line numbers etc.) from the original manuscript and also generate its own structure of sections, pages etc. These structures run in parallel throughout the text. One is not a subset of the other, as is needed for a hierarchy.
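The non-nesting of two such schemes can be shown with a flat record structure that keys each line of text to both references at once, an approach some non-SGML tools take. All names and data here are invented for illustration.

```python
# Two parallel referencing schemes over the same text: manuscript folios
# and edition pages. Their boundaries cross, so neither can serve as a
# parent of the other in a single hierarchy.

lines = [
    {"text": "First line",  "folio": "1r", "page": 1},
    {"text": "Second line", "folio": "1r", "page": 2},  # page turns mid-folio
    {"text": "Third line",  "folio": "1v", "page": 2},  # folio turns mid-page
]

def by_reference(scheme, value):
    """Retrieve the lines belonging to one unit of either scheme."""
    return [rec["text"] for rec in lines if rec[scheme] == value]

print(by_reference("folio", "1r"))  # ['First line', 'Second line']
print(by_reference("page", 2))      # ['Second line', 'Third line']
```

Because folio 1r and page 2 overlap without either containing the other, no single tree can express both structures at once.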

Text Encoding Initiative

The Text Encoding Initiative (TEI) is a major international project to develop an SGML tag set for use by the humanities and language industries. It is sponsored by the Association for Computers and the Humanities, the Association for Computational Linguistics and the Association for Literary and Linguistic Computing. Following a planning meeting in November 1987, funding was provided by the National Endowment for the Humanities, the Commission of the European Communities and the Andrew W. Mellon Foundation. The TEI published the first draft of its guidelines in July 1990 and this is now being substantially revised and expanded. The second draft is initially being published in fascicles on the TEI's electronic discussion list TEI-L@UICVM. It is not expected to be complete until mid-1993, but sufficient fascicles have now appeared for most texts which do not contain any unusual features to be encoded. DTDs are also available from the TEI-L fileserver.

The TEI is managed by a Steering Committee consisting of two representatives from each of the sponsoring associations. Two editors, one each in the USA and Europe, co-ordinate the work and are responsible for drafting the guidelines. Fifteen scholarly organizations participate in the TEI by representation on the project's Advisory Board. The TEI has also set up arrangements with a number of affiliated projects which are testing the guidelines and reporting back on problems encountered.

For the first draft of the guidelines the TEI set up four Working Committees. The Committee on Text Documentation, with expertise in librarianship and archive management, was charged with the problem of providing in-file cataloguing information about the electronic text itself, its source and the relationship between the two. The Committee on Text Representation addressed issues concerning character sets, the logical structure of a text and the physical features represented in the source material. The Committee on Text Analysis and Interpretation devised ways of incorporating analytic and interpretive information in a text, which can include more than one interpretation of the same component of a text. The Committee on Syntax and Metalanguage Issues worked on producing recommendations on how SGML should best be used.

For the second draft of the guidelines, a number of small work groups addressed specific areas in more detail. These included character sets, text criticism, hypermedia, formulae and tables, language corpora, physical description, verse, performance texts, literary prose, linguistic analysis, spoken texts, historical studies, dictionaries, computational lexica and terminological data.

The TEI proposals give guidance both on what features to encode and how to encode them. Although the TEI will propose between 250 and 300 different tags, very few indeed are absolutely required. The basic philosophy is 'if you want to encode this feature, do it this way'. Sufficient information is provided for the DTDs to be extended by users. TEI conformant texts consist of a TEI header, which provides the documentation or labelling, followed by the text.

TEI Header

The TEI header is believed to be the first systematic attempt to provide in-file documentation of an electronic text. An outline of the header, showing the four major sections, follows:

<TeiHeader>
<fileDesc> ... </fileDesc>
<encodingDesc> ... </encodingDesc>
<profileDesc> ... </profileDesc>
<revisionDesc> ... </revisionDesc>
</TeiHeader>

Within the header, the <fileDesc> element is the most important. It gives the bibliographic description of the electronic file and contains at least the following elements:

<titleStmt>       title of the work and those responsible for its intellectual content
<publicationStmt> publication or distribution of the electronic text
<sourceDesc>      bibliographic description of the source text

The <encodingDesc> element documents the methods and editorial principles used for encoding the text, including, for example, the treatment of canonical references and any sampling techniques used. The <profileDesc> gives the non-bibliographic aspects of a text, for example languages, dialects, classification and, for spoken texts, the setting. The <revisionDesc> gives the revision history, in a format similar to that used within computer program source code.

Prose Texts

The basic structure of a prose text is

<text>
<front> ... </front>
<body> ... </body>
<back> ... </back>
</text>

The body of a prose text is divided into units which, for convenience, the TEI has chosen to call divisions, using the tag <div> for unnumbered divisions, or <div0>, <div1>, <div2> etc., where it is useful to number the depth of the divisions within the hierarchy. The type of division is denoted by an attribute. This example shows the body of a text which is composed of two parts, each of which is composed of two chapters.

<body>
<div0 type='part'>
<div1 type='chapter'> ... </div1>
<div1> ... </div1>
</div0>
<div0>
<div1> ... </div1>
<div1> ... </div1>
</div0>
</body>

Note that the type attribute need not be repeated when it has the same value as the previous instance of the tag.

The front matter is also composed of <div>s. The following example shows front matter containing a dedication and a preface:

<div type='dedication'>
<p>To my parents, Ida and Max Fish
</div>
<div type='preface'>
<head>Preface</head>
<p>The answer this book gives to its title question is <q>there is and there isn't</q>.

...
<p>Chapters 1-12 have been previously published in the following journals and collections:
<list>
<item>chapters 1 and 3 in <title>New Literary History</title></item>
...

<item>chapter 10 in <title>Boundary II</title> (1980) </item>
</list>. I am grateful for permission to reprint.
<signed>S.F.</signed>
</div>

Most of the elements used here are self-explanatory. <p> is a paragraph. <q> is a quotation. A <list> is divided into <item>s. This example also shows how end tags (</p> in this case) can be omitted, if they are so defined in the DTD. SGML can appear to be verbose (and even more so in examples which only show portions of a text), but it is also readable. There are various shorthand mechanisms in the syntax.
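The end-tag omission just mentioned can be sketched as follows, assuming a simplified DTD in which a paragraph ends at the next <p> or at the enclosing </div>. The normalizer below is an illustration of the idea, not real SGML parser code.

```python
import re

# Sketch of SGML end-tag omission for <p>: since a paragraph cannot
# contain another paragraph, a new <p> (or the enclosing </div>)
# implies </p>, so a normalizer can reinsert the omitted end tags.
# Only <p> inside <div> is handled, for illustration.

def insert_implied_p_ends(text):
    out, open_p = [], False
    for token in re.split(r"(<[^>]+>)", text):
        if token == "<p>" and open_p:
            out.append("</p>")       # a new paragraph closes the open one
        if token == "<p>":
            open_p = True
        elif token == "</p>":
            open_p = False
        elif token == "</div>" and open_p:
            out.append("</p>")       # the division's end closes it too
            open_p = False
        out.append(token)
    return "".join(out)

print(insert_implied_p_ends("<div><p>One<p>Two</div>"))
# <div><p>One</p><p>Two</p></div>
```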

SGML Software

SGML-aware software is needed at several stages. There are various software tools to aid conversion from non-SGML encoded text. An SGML parser validates the tagging to ensure that it is processable. Some of the best known software includes Author/Editor from SoftQuad in Toronto, OmniMark from Software Exoterica, also in Toronto, and MarkIt from Sema Group in Belgium. Avalanche Technology of Colorado now has tools to aid conversion. DynaText from Electronic Book Technologies of Rhode Island is the only true SGML browser/searcher that I am aware of. PAT from Open Text Corporation of Waterloo searches text which contains SGML-like tags but does not require them to conform to a DTD. CONCUR, the SGML feature which allows multiple hierarchies in a document, is implemented by very few of these. The forthcoming Document Style Semantics and Specification Language (DSSSL) standard for processing SGML documents works with tree structures rather than directly on SGML syntax. Other efforts to create SGML browsers are based on SQL and developed from its relational database model. More software development is needed to handle multiple hierarchies effectively.

The Acceptance of SGML

SGML provides a method of encoding which addresses many of the intellectual issues which previously used encoding schemes have not. It also provides links to material which is not ASCII text, for example images and sound. These would normally be stored in separate files, with an SGML tag used to mark an image and indicate its format and filename. This means that mixed text and images can be described in SGML, and even that TEI headers can be used for files which are in image format. The Hypermedia/Time-based Document Structuring Language (HyTime) was recently adopted as an SGML-based international standard for hypermedia structures.

SGML was adopted early on by the US Department of Defense and the Commission of the European Communities and, within the humanities, the Perseus project, coordinated from Harvard, began to encode Ancient Greek texts in SGML in 1987. It has been adopted by major publishers, one of the first being Oxford University Press for the New Oxford English Dictionary. The movement to SGML has accelerated in the last year or two. Random House has moved entirely to SGML, and Mead Data Central recently announced plans to convert the 180 million word LEXIS and NEXIS databases to SGML. It is expected that more software will become available to ease the move to SGML as more organizations realize the benefit of re-usable multi-purpose electronic text.

References

Martin Bryan, SGML: An Author's Guide to the Standard Generalized Markup Language, Addison-Wesley, 1988.

Lou Burnard, What is SGML and How Does it Help?, TEI document TEI ED W25, October 1991, available from TEI fileserver.

Susan Hockey, The ACH-ACL-ALLC Text Encoding Initiative: An Overview, TEI document TEI J16, June 1991, available from TEI fileserver.

International Organization for Standardization, ISO 8879: Information Processing - Text and Office Systems - Standard Generalized Markup Language (SGML), ISO, 1986.

International Organization for Standardization, ISO/IEC DIS 10744: Hypermedia/Time-based Document Structuring Language (HyTime), ISO, 1992.

C.M. Sperberg-McQueen and Lou Burnard (eds), ACH-ACL-ALLC Guidelines for the Encoding and Interchange of Machine-Readable Texts, draft version 1.1, Chicago and Oxford, 1990.

Oxford University Computing Service, Micro-OCP Manual, Oxford University Press, 1988.