TEI Lite History and Evaluation

New and disparate ways of digitally encoding texts were developed as computing became available to humanities scholars in the 1980s. Encoding textual objects in digital form creates opportunities to examine old and rare texts simultaneously and without the risk of wear or damage to the original object. Additionally, an encoded object permits new ways of interacting with the text, such as concurrent views of different versions and the display of subsequent editorial markup or annotations. The lack of standard methods for encoding and describing texts, however, made it difficult for researchers to exchange objects and diminished the benefits that the digital format offers.

The Text Encoding Initiative (TEI) was conceived in this disjointed digitization environment. TEI is a successful and influential metadata encoding standard that is primarily concerned with the encoding of textual objects, but is flexible enough to apply to many other types of information objects. The standard is customizable and extensible. One such customization is TEI Lite, a subset of the TEI specification. In this essay we examine the development and history of TEI Lite as well as the role it plays in documenting the lifecycle of digital objects. We also explore TEI Lite's relationship to other metadata initiatives. Finally, we evaluate how well TEI Lite achieves its purpose and discuss some of the problems the specification faces.

History of TEI Lite

TEI Lite shares its formative history with its superset, TEI. Work on TEI formally began in 1987 with a meeting of 32 scholars from North America, Europe, and Asia held at Vassar College in Poughkeepsie, NY. The meeting was convened by the Association for Computers and the Humanities and funded by the National Endowment for the Humanities with the purpose of beginning work on the problems facing digital text encoding (Mylonas & Renear, 1999, pp. 3-4).

At the close of the conference, the group issued a statement to provide direction for the development of guidelines. The statement, known as the “Poughkeepsie Principles,” directed that the forthcoming guidelines should (Burnard & Sperberg-McQueen, 2002, p. 1):

  • suffice to represent the textual features needed for research;

  • be simple, clear, and concrete;

  • be easy for researchers to use without special-purpose software;

  • allow the rigorous definition and efficient processing of texts;

  • provide for user-defined extensions;

  • conform to existing and emergent standards.

After the meeting in 1987, three organizations participated in forming the guidelines: the Association for Computers and the Humanities, the Association for Literary and Linguistic Computing, and the Association for Computational Linguistics (Mylonas & Renear, 1999, p. 3). Draft versions of the TEI Header and Guidelines were completed and distributed in 1990 (MIT Libraries, 2004). After several years of refinement, the final draft (version P3) was released in 1994. “Guidelines for the encoding and interchange of Machine-Readable Texts” spanned 1300 pages and defined over 600 elements of Standard Generalized Markup Language (SGML). The TEI specifications defined an extensible set of elements that could be customized by user communities for their specific needs. One of these customizations is TEI Lite, which defines a subset of TEI meant to serve as a “starter set” of core elements to assist in learning the extensive TEI set (Burnard, 2000).

The P3 guidelines underwent several minor revisions between 1994 and 2001, mostly to clarify varying interpretations and practices (Burnard & Popham, 1999, p. 39). During this time, the success of TEI as a metadata specification informed and influenced the development of the eXtensible Markup Language (XML) (DeRose, 1999). TEI was subsequently converted to XML and released as version P4 in summer 2001 (MIT Libraries, 2004). Development of TEI Lite continues in parallel with TEI, the next revision of which (P5) is expected to be released at the end of 2004 (TEI Consortium, 2003, How to participate - Next version).

Formal structures were developed to guide future development as participation increased. An Executive Committee was formed in the mid-1990s that included representatives from each of the three sponsoring associations and two influential researchers, Michael Sperberg-McQueen (University of Illinois at Chicago) and Lou Burnard (Oxford University). By 1996, a Technical Review Committee was established to conduct the development and maintenance of the guidelines in a manner similar to the International Organization for Standardization (ISO) (Burnard & Light, 1996, pp. 25-26).

In 1999, the Executive Committee was petitioned to create an international membership organization that could better handle the TEI's increasing administration and development responsibilities. The petition resulted in the formation of a non-profit corporation, the TEI Consortium (Burnard, 2000). Membership in the Consortium includes dozens of agencies from the humanities, education, computing, linguistics, and librarianship. Members elect a technical council that oversees development of the guidelines and funding for the organization. The Consortium's first Council was elected in 2001 and met for the first time in 2002. Members may also participate in the various special interest groups or working groups that develop the guidelines (TEI Consortium, 2004, How to participate). The Consortium relies on its members to expand TEI's user base and has chartered a special interest group for training to support their efforts (TEI Consortium, 2004, How to participate – Special interest groups).

TEI is hosted by four universities and is sponsored by the three associations originally responsible for initial development of the guidelines. Significant support is provided by the U.S. National Endowment for the Humanities (NEH), Directorate XIII of the Commission of the European Communities (CEC/DG-XIII), the Andrew W. Mellon Foundation, and the Social Sciences and Humanities Research Council of Canada (TEI Consortium, 2004).

The Functional Role and Structure of TEI Lite

TEI and TEI Lite intend to define a framework for the encoding of texts that facilitates the interchange of digital objects. The specification defines a common and extensible language that different software platforms can understand and use to render the digital object in consistent ways. Although the development of TEI has focused on the encoding of texts, particularly capturing non-digital texts, the framework is applicable to the description of non-text objects such as images and sound (Burnard & Sperberg-McQueen, 2002, p. 1).

The description and interchange goals of TEI imply fidelity to the structure and content of the object being encoded. As such, much of the focus of TEI is on the structural description of textual objects, while the TEI header supports most of the lifecycle metadata functions (see Table 1). Elements of the TEI header provide creation, appraisal, and descriptive metadata and, to a lesser extent, transfer/authenticity and preservation metadata. Accession and usage metadata are much less apparent, but may be augmented by the information system that stores the digital object. Rights metadata is represented only in regard to the original object. In fact, the distinction between metadata about the digital encoding and metadata about the original object is difficult to discern from the element definitions and likely results in differing practices. In general, the file description (fileDesc) describes attributes of the original text while the encoding description (encodingDesc) concentrates on aspects of the digital implementation.


Creation
    <fileDesc>: <titleStmt> (all)
    <profileDesc>: <creation>
    <revisionDesc>

Appraisal
    <fileDesc>: <editionStmt> (all), <seriesStmt>, <sourceDesc>
    <encodingDesc>: <projectDesc>, <samplingDecl>, <editorialDecl>

Transfer/Authenticity
    <fileDesc>: <extent>, <publicationStmt>, <publisher>, <distributor>, <authority>

Accession
    <fileDesc>: <publicationStmt>, <authority>, <notesStmt>
    (also defined by containing system)

Descriptive
    <fileDesc>: <titleStmt>, <title>, <editionStmt>, <edition>, <seriesStmt>, <sourceDesc>
    <profileDesc>: <textClass>

Preservation
    <fileDesc>: <extent>
    <encodingDesc>: <tagsDecl> (all), <refsDecl>
    <profileDesc>: <langUsage>
    <revisionDesc>

Usage
    (also defined by containing system)

Rights
    <fileDesc>: <publicationStmt>, <publisher> / <availability>, <distributor> / <availability>, <authority> / <availability>

Table 1: Metadata life-cycle roles of TEI header elements. (derived from Burnard & Sperberg-McQueen, 2002)

A complete TEI Lite document contains a header and text body (see Figure 1). The header, as indicated above, contains metadata related to the digital object and the original information object. The header is separable from the encoded body, which allows it to serve as a description for non-text objects stored separately from the header. The TEI header is analogous to the title page of a text. It has up to four parts: a description of the electronic file (fileDesc), an encoding description (encodingDesc), a non-bibliographic description of the text (profileDesc), and a revision history (revisionDesc) (Burnard & Sperberg-McQueen, 2002, p. 6). Of these, only the file description is required; its elements can be related directly to MAchine-Readable Cataloging (MARC) fields. Unlike MARC, elements of the TEI header are not required to conform to a controlled vocabulary such as that described by the Anglo-American Cataloguing Rules (AACR), although such rules may be applied at the encoder's discretion (Pouchard, 1998).
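The header layout described above can be illustrated with a short sketch. The sample document below is hypothetical and has not been validated against the TEI Lite DTD; the TEI.2 root element and the contents of each header part are assumptions based on P4-era practice, and the function name header_parts is invented for this illustration. The sketch only reports which of the four header parts are present and enforces the one rule stated above, that fileDesc is required.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: a header with the required fileDesc plus an optional
# encodingDesc, followed by a minimal text body.
SAMPLE = """
<TEI.2>
  <teiHeader>
    <fileDesc>
      <titleStmt><title>An example text: an electronic transcription</title></titleStmt>
      <publicationStmt><p>Unpublished demonstration file.</p></publicationStmt>
      <sourceDesc><p>Transcribed from a hypothetical print edition.</p></sourceDesc>
    </fileDesc>
    <encodingDesc>
      <projectDesc><p>Encoded to illustrate the header structure.</p></projectDesc>
    </encodingDesc>
  </teiHeader>
  <text><body><p>Text goes here.</p></body></text>
</TEI.2>
"""

# The four header parts named in the essay, in their usual order.
HEADER_PARTS = ("fileDesc", "encodingDesc", "profileDesc", "revisionDesc")


def header_parts(xml_text):
    """Return the names of the header parts present, in document order."""
    root = ET.fromstring(xml_text.strip())
    header = root.find("teiHeader")
    if header is None:
        raise ValueError("no teiHeader element found")
    present = [child.tag for child in header if child.tag in HEADER_PARTS]
    if "fileDesc" not in present:
        raise ValueError("fileDesc is the one required header part")
    return present


print(header_parts(SAMPLE))
```

Because the check looks only at the teiHeader element, the same function would work on a header stored separately from the object it describes.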




Figure 1: Structure of a TEI Lite document (HTML Writers Guild, 2001)



For textual objects, structural encoding is defined within the text element. A TEI text may contain a single, unitary work, or a group of works as realized in a series or anthology. For the latter case, the text element may contain an arbitrary number of group elements, each containing a text body with optional front and back matter. Additionally, multiple TEI objects may be grouped as a corpus, analogous to a collection of texts (Burnard & Sperberg-McQueen, 2002, p. 7).
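As a rough illustration of the unitary and composite cases, the sketch below walks a text element and lists the front, body, and back parts of each work it contains. The anthology fragment is hypothetical, and the function name describe_works is invented here; the nesting of member text elements inside group is an assumption about how the grouping is expressed in markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical composite text: shared front matter, then a group of two works.
ANTHOLOGY = """
<text>
  <front><div><p>Preface to the whole anthology.</p></div></front>
  <group>
    <text>
      <front><div><p>Introduction to the first work.</p></div></front>
      <body><p>First work.</p></body>
    </text>
    <text>
      <body><p>Second work.</p></body>
      <back><div><p>Notes to the second work.</p></div></back>
    </text>
  </group>
</text>
"""

PARTS = ("front", "body", "back")


def describe_works(xml_text):
    """List the front/body/back parts found in each work of a text element."""
    root = ET.fromstring(xml_text.strip())
    groups = root.findall("group")
    if not groups:
        # Unitary text: the parts belong to the text element itself.
        return [[part.tag for part in root if part.tag in PARTS]]
    works = []
    for group in groups:
        for member in group.findall("text"):
            works.append([part.tag for part in member if part.tag in PARTS])
    return works


print(describe_works(ANTHOLOGY))
print(describe_works("<text><body><p>A single work.</p></body></text>"))
```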

The range of elements and the relatively relaxed markup rules allow for varying granularity depending on the intended usage. The body of an encoded text is structured by p and div elements, similar to those in HyperText Markup Language (HTML), that represent chapters, sections, and subsections of a text. Text within these structures may be further encoded using a wide range of elements that indicate layout and appearance. Furthermore, elements are available for defining alternate appearances or versions of text and editorial markup or annotations as applied to the original object. Additionally, elements such as unclear allow for the indication of unintelligible or damaged areas of text (Burnard & Sperberg-McQueen, 2002). Such elements enhance textual analysis by allowing the encoding of multiple versions of a text within the same electronic file.
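A small sketch may make the body markup concrete. The fragment below is a hypothetical example: the type and n values on div, the reason value on unclear, and the helper names outline and unclear_passages are all illustrative choices, not prescribed usage. The code prints the nesting of div elements and collects the passages marked as unclear.

```python
import xml.etree.ElementTree as ET

# Hypothetical body: one chapter containing a paragraph and a nested section.
BODY = """
<body>
  <div type="chapter" n="1">
    <head>Chapter One</head>
    <p>It was a <unclear reason="faded ink">dark</unclear> night.</p>
    <div type="section" n="1.1">
      <p>A nested section.</p>
    </div>
  </div>
</body>
"""


def outline(element, depth=0):
    """Return one indented line per div, following the nesting of the markup."""
    lines = []
    for div in element.findall("div"):
        label = f"{div.get('type', 'div')} {div.get('n', '')}".rstrip()
        lines.append("  " * depth + label)
        lines.extend(outline(div, depth + 1))
    return lines


def unclear_passages(element):
    """Return (reason, text) pairs for every unclear element below element."""
    return [(u.get("reason"), "".join(u.itertext())) for u in element.iter("unclear")]


body = ET.fromstring(BODY.strip())
print("\n".join(outline(body)))
print(unclear_passages(body))
```

How finely a project subdivides its div elements is left to the encoder, which is the source of the varying granularity noted above.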

Relationship to Other Metadata Initiatives

TEI was one of the first metadata initiatives, predated only by the MARC, International Standard Bibliographic Description (ISBD), AACR, and SGML standards (Burnard & Light, 1996). The TEI header and the descriptive fields of later versions of MARC closely resemble the functional structure of the ISBD. Despite the structural similarities with MARC and ISBD, however, TEI does not require the use of controlled vocabulary and as such does not readily convert to either standard. Early TEI development eschewed strict cataloging requirements in the expectation that non-catalogers would use the specification. The decision to conform to standard cataloging practices is left to the creating agency (SCHEMAS Registry, 2002). Such a flexible approach favors ease of use over uniformity in order to facilitate a wider adoption of the standard (MIT Libraries, 2004), an approach that Dublin Core also uses.

Metadata initiatives developed subsequent to TEI have benefited from TEI's success and derive structures from TEI Lite. Encoded Archival Description (EAD) borrowed TEI's header concept (Burnard & Light, 1996, p. 13). Other metadata initiatives are domain-specific applications of TEI. The Consortium for the Computer Interchange of Museum Information (CIMI) uses the TEI framework for the description of museum resources (Burnard & Light, 1996, p. 15). Another derivation is the Spoken Text Markup Language (STML), a text-to-speech markup language inspired by TEI (Sproat et al., 1997). Similarly, the Music Encoding Initiative (MEI) was based on TEI (Roland, 2002).

The most notable of TEI's influences was on the development of XML. TEI represented the first and most precise SGML implementation at the time of XML's development. As a result, developers of TEI were closely involved in defining XML. Especially useful to the nascent XML specification was TEI's extended pointer language, which served as a prototype for XLink and XPointer (DeRose, 1999).

Evaluation of TEI Lite

TEI Lite was created to present a useful subset of TEI that provides the elements necessary for most common encodings. The 140 elements of TEI Lite represent only a fraction of the hundreds of elements available in TEI and its extensions. The majority of the subset, besides those in the header, define basic structural and perceptual attributes necessary for textual objects, but not so many as to become overly granular. Additionally, the use of fewer elements restricts the size of the metadata vocabulary that different agencies need to share in order to understand the conventions used during encoding and markup. The subset represents a lowest common denominator of sorts that is compliant with and upgradeable to the full TEI specification.

There are a number of criticisms of TEI encodings. First, as mentioned previously, the header lacks a controlled vocabulary for bibliographic elements. This is a compromise between usability from the perspective of creation and accessibility in terms of resource location. Free-text bibliographic descriptions, however, could prove to be more useful for scholars of ancient texts which, by their unique character, require more detailed descriptions than those afforded in library cataloging (Pouchard, 1998). The upcoming P5 version of TEI will allow external metadata and namespaces to be included in TEI documents (TEI Consortium, 2004, Guidelines - P5 status). Embedding MARC-encoded data may offer a solution to the controlled vocabulary problem, although it is uncertain how such features will cascade into TEI Lite.

Second, the semantic and organizational structures of a text may overlap. XML and TEI are hierarchical languages that require inelegant procedures to represent such overlapping structures. The overlap problem is especially pertinent to representing variant structures beyond the word or character level, such as macro-level versions and variations (Smith, 1999).
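The difficulty can be stated in terms of character ranges. In the sketch below, each structure is modeled as a hypothetical (start, end) span of character offsets; this model and the function name relation are assumptions made for illustration, not part of TEI. Two spans can share a single element hierarchy only if they are disjoint or one contains the other; a partial overlap cannot be expressed as parent and child.

```python
def relation(a, b):
    """Classify two half-open (start, end) spans as disjoint, nested, or overlapping."""
    (a_start, a_end), (b_start, b_end) = a, b
    if a_end <= b_start or b_end <= a_start:
        return "disjoint"
    a_contains_b = a_start <= b_start and b_end <= a_end
    b_contains_a = b_start <= a_start and a_end <= b_end
    if a_contains_b or b_contains_a:
        return "nested"
    # Neither span contains the other, so no single tree can hold both.
    return "overlapping"


# Hypothetical offsets: a verse line and a sentence that runs past its end.
verse_line = (0, 20)
sentence = (10, 30)
print(relation(verse_line, sentence))
```

In this example the sentence begins inside the verse line and ends after it, so one of the two structures would have to be split or represented indirectly.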

Third, the reduced set of elements available in TEI Lite reduces the chance of over-granular structure, but divergent encoding practices are still possible. The basic structural elements (p and div) and their attributes may be used differently and result in confusion when encoded documents are exchanged. Numerous “best practices” standards have been created to help alleviate variation within institutions (TEI Consortium, 2004, Tutorials). The loosely prescribed structuring rules, however, demand that TEI rendering tools be just as flexible and not beholden to a particular encoding practice.

Finally, the basic assumptions underlying the use of structural elements create problems for representing the physical structure of a work. TEI is based primarily on encoding the intellectual structures of a text, such as chapters, acts, volumes, and other semantic containers. Such assumptions preempt encoding structures based on physical attributes of the container, such as the sequence of formes in early printed texts (Bauman & Catapano, 1999). The scope of this problem may be beyond the capabilities of TEI Lite and require use of the larger element set of TEI.

Despite these criticisms, TEI Lite successfully achieves its goal of providing a readily adaptable point of entry to TEI. Furthermore, TEI Lite sufficiently addresses the domain problems that TEI was meant to solve. We can judge the current implementation of TEI Lite against the 1987 Poughkeepsie Principles. TEI Lite provides simple, clear, and concrete representations of textual objects. Its expression in XML allows for efficient processing and the use of non-specialized software, and conforms to existing standards. Structural definitions in TEI Lite are not as rigorous as those in TEI, and the user may not extend TEI Lite freely; upward compatibility between the specifications, however, provides a solution.

In addition to basic principles, we may judge success by the degree to which TEI Lite has been adopted. The Oxford Text Archive and the Electronic Text Centers at the University of Virginia and the University of Michigan use TEI Lite to encode their holdings. The TEI Consortium uses TEI Lite in its technical documentation (Burnard & Sperberg-McQueen, 2002, p. 2). Additionally, a significant number of the projects listed on the Consortium's Web site use TEI Lite (TEI Consortium, 2004, Projects using TEI), and a cursory Web and journal search reveals that TEI Lite is frequently used for encoding projects and research.

Conclusion

TEI Lite is an introductory subset of TEI, one of the earliest metadata initiatives. The encoding standard blends a flexible implementation with established descriptive principles. The result is a metadata set that is easy to apply and capable of describing many types of objects. The success of TEI, representing the efforts of scholars worldwide, has informed the development of many subsequent metadata standards and influenced the development of XML. Development of the standard continues, as does its growing use in projects across a variety of domains.

References
(see also pathfinder & annotated bibliography)

Bauman, S. & Catapano, T. (1999). TEI and the encoding of the physical structure of books. Computers and the Humanities, 33(1/2), 113–127.

Burnard, L. & Sperberg-McQueen, C. (1995, updated 2002). TEI Lite: An introduction to text encoding for interchange. Retrieved on 18 September, 2004, from http://www.tei-c.org/Lite/teiu5_en.pdf.

Burnard, L. & Light, R. (1996). Three SGML metadata formats: TEI, EAD, and CIMI: A Study for BIBLINK Work Package 1.1. Retrieved on 18 September, 2004, from http://www.ifla.org/documents/libraries/cataloging/metadata/biblink2.pdf.

Burnard, L. & Popham, M. (1999). Putting our headers together: A report on the TEI header meeting 12 September 1997. Computers and the Humanities, 33(1/2), 39–47.

Burnard, L. (2000). Text encoding for interchange: A new consortium. Ariadne, 24(21 June 2000). Retrieved on 16 September, 2004, from http://www.ariadne.ac.uk/issue24/tei/.

DeRose, S. (1999). XML and the TEI. Computers and the Humanities, 33(1/2), 11–30.

HTML Writers Guild (2001). An introduction to the Text Encoding Initiative (TEI), DTD. Retrieved on 26 November, 2004, from http://gutenberg.hwg.org/teidtds.html.

MIT Libraries (2004). MIT metadata reference guide: TEI (Text Encoding Initiative) metadata. Retrieved on 16 September, 2004, from http://libraries.mit.edu/guides/subjects/metadata/standards/tei.html.

Mylonas, E. & Renear, A. (1999). The Text Encoding Initiative at 10: Not just an interchange format anymore – But a new research community. Computers and the Humanities, 33(1/2), 1–9.

Pouchard, L. (1998). Cataloging for digital libraries: The TEI scheme and the TEI header. Katharine Sharp Review, 6(Winter 1998). Retrieved on 18 September, 2004, from http://alexia.lis.uiuc.edu/review/6/pouchard.html.

Roland, P. (2002). The Music Encoding Initiative (MEI). Musical Applications using XML (MAX) 2002 Conference. Retrieved on 21 November, 2004, from http://dl.lib.virginia.edu/bin/dtd/mei/maxpaper.pdf.

SCHEMAS Registry (2002). Activity reports: Text Encoding Initiative. Retrieved on 16 September, 2004, from http://www.schemas-forum.org/registry/desire/activityreports.php3?field=filename&value=TEI_D29D35(RDF).rtf.

Smith, D. (1999). Textual variation and version control in the TEI. Computers and the Humanities, 33(1/2), 103–112.

Sproat, R., Taylor, P., Tanenblatt, M. & Isard, A. (1997). A markup language for text-to-speech synthesis. 5th European Conference on Speech Communication and Technology, Rhodes, Greece, September 22-25, 1997. Retrieved on 21 November, 2004, from http://www.talkingheads.computing.edu.au/resources/documents/serge/Sproat/A%20Markup%20Language%20for%20TTS%20Synthesis-Sproat.pdf.

TEI Consortium (2004). Text Encoding Initiative. Retrieved on 16 September, 2004, from http://www.tei-c.org.
