infoSpace - Digital Archives

Guarding the Guards: Archiving the Electronic Records of Hypertext Author Michael Joyce

tkiehne — Fri, 28 Jul 2006 18:27:45 +0000

In June of 2006, Thomas Kiehne and Catherine Stollar were selected to present the results of work performed the previous year at the Harry Ransom Center in Austin, Texas to a colloquium assembled by the Society of American Archivists and the National Archives and Records Administration. The following is the case study that was presented.

Update: This text was re-published by SAA in the proceedings from the "New Skills for a Digital Era" colloquium.

Abstract

In 2005, the Harry Ransom Center at the University of Texas at Austin acquired the fonds of hypertext author Michael Joyce. The major emphasis of the Ransom Center's collections is the study of literature and culture in the late 20th and early 21st century of the United States, Great Britain, and France. Michael Joyce's groundbreaking work in hypertext poetry and fiction makes his papers a desirable addition to the Ransom Center holdings.

The Michael Joyce Papers are mostly composed of electronic records with an additional 60 manuscript boxes of paper-based materials. This is the first mostly electronic archive the Ransom Center has acquired, and new strategies for preserving digital content were employed. This case study discusses the techniques and skills utilized to preserve the electronic records of Michael Joyce as a model for processing future digital manuscripts at the Ransom Center.

Scenario

Established in 1957 by University of Texas Vice President and Provost Harry Huntt Ransom, the Harry Ransom Humanities Research Center at The University of Texas at Austin incorporated a strategy for collecting older rare books and manuscript collections with a new initiative to collect literary, photographic, and theatrical works by modern artists. Some of the authors whose works are included in the Ransom Center's collections are Norman Bel Geddes, Don DeLillo, T.S. Eliot, James Joyce, Ernest Hemingway, Norman Mailer, D.H. Lawrence, Ezra Pound, Anne Sexton, Isaac Bashevis Singer, and Tennessee Williams. Michael Joyce's work as perhaps the most influential hypertext poet and author fits nicely into the Ransom Center's contemporary author collecting policy.

Our case study to preserve Michael Joyce's digital manuscripts resulted from collaboration between the School of Information at the University of Texas at Austin and the Harry Ransom Center. Three students, Thomas Kiehne, Vivian Spoliansky, and Catherine Stollar, from Dr. Patricia Galloway's Problems in Permanent Retention of Electronic Records course offered at the School of Information undertook a semester-long project to develop a strategy for archiving an initial accession of electronic materials saved on 371 3.5" floppy disks (totaling 211 KB) from author Michael Joyce. Upon completion of the project, a second accession of electronic and paper-based materials, including the contents of three hard drives (totaling 8.38 GB) and 60 manuscript boxes, was acquired by the Harry Ransom Center and is currently being processed by staff archivist Catherine Stollar according to the strategy developed during the class project. Our case study discusses strategies for file recovery, migration, preservation, arrangement, and description developed while working with both accessions of Joyce's materials. The electronic records are currently maintained in a DSpace repository administered by the School of Information; however, in the future the Joyce records will move to a DSpace repository controlled by the Ransom Center and the General Libraries of the University of Texas.

The lack of Ransom Center staff with skills in digital archivy provided the impetus for the Ransom Center to partner with the School of Information on the Michael Joyce Papers. Although the Ransom Center employs talented archivists and IT professionals, no staff member possessed skills necessary for archiving digital manuscripts. The Ransom Center sought advice from Professor Galloway and agreed to use the Joyce materials as a case study in the Problems in Permanent Retention of Electronic Records course.

Some migration projects were already in progress at the Ransom Center in the Department of Photography and Visual Collections to preserve audio and video works, but there were no concerted efforts to preserve born-digital manuscripts. Policies and procedures for migration of audio and video content to new media were unsuited for born-digital manuscript preservation, and policies for preserving digital manuscripts were inadequate to capture the complete behavior of the original digital record. Previously, the few electronic manuscripts and correspondence already in the Ransom Center's manuscript collections were printed and organized in boxes like paper records. Because digital records are entirely unlike paper-based records, a preservation strategy based on printing records preserved very little of the original document. Electronic media were saved, but researchers were prevented from viewing original disks and no access copies were created.

The main component of our preservation strategy is to ingest electronic records and associated metadata into an institutional repository. DSpace, created from a joint project between MIT and Hewlett-Packard, is the institutional repository we used and will continue to use for electronic record preservation. At the heart of DSpace, like most open archival information systems (OAIS), is a database populated by individual digital objects supported by content, context, and structure metadata. We used DSpace, instead of FEDORA or another institutional repository, because it was already established as the repository of choice for the School of Information. Although we had issues with the web user interface for ingesting, viewing, and accessing materials within the repository, we plan to work with a talented ISchool student with Java programming skills to make our installation of DSpace more user-friendly.

Partnerships were key components to our case study's productivity and success. The initial group of students participating in the first part of the case study each represented a different background. Thomas Kiehne brought a wealth of Information Technology skills to the project, including programming and operating systems knowledge. Catherine Stollar shared her knowledge of archival theory and practice during the case study. Vivian Spoliansky viewed the case study through the lens of preservation and shed light on aspects of authenticity and desired levels of service for object preservation. Working with a variety of subject specialists on the project enabled participants to learn key skills from the others that will be useful on future digital record preservation projects.

Processing as Digital Archeology

One of the more distinctive aspects of this project involved the processing of 371 3.5" floppy discs that contained the digital objects of the first accession. The provided floppy discs were mostly from the Macintosh "classic" era, some of which date as far back as the mid to late 1980s. The assumption at this stage was that the original storage media are not stable or reliable and that the information they hold must be moved quickly and efficiently. Otherwise, little was known about what to expect in terms of specific technological issues or challenges.

At the outset of the project, we had only a general idea of the process of moving the digital files from the source media to a repository, and as such, we could not express specific requirements for software tools and utilities that might be needed. In order to minimize project overhead in terms of time and resources, we desired to use only open source, shareware, or freeware tools that are readily available in order to assist with the extraction process. This approach allowed us to assess the suitability of tools that are currently available and their ability to interoperate. In the absence of suitable free tools, we intended to find commercial software or create our own programs or scripts to perform the required tasks as we identified them. In the course of processing the first accession of discs, we quickly elucidated a more detailed procedural framework that can be abstracted and applied to future projects.

The general process implemented during the processing of the discs is as follows:

Receive and identify physical media
Catalog the physical media
Copy files to newer physical media
Perform initial file processing
Create an item-level index of all recovered files
Create and process working copies of all files while retaining the original bitstream copies

Technical metadata is collected at each step in the process not only to facilitate the work in progress, but to support provenance and authenticity. Each operation performed on the bitstream (every copy and access) provides the opportunity for inadvertent loss or alteration, so careful recordkeeping is as essential as careful handling. Additionally, all personnel involved in processing must thoroughly understand the procedures involved in order to prevent duplication of effort or discontinuities in results. In many cases, software can automate these processes, thus reducing the chance of errors, but the extent to which software can mitigate such risks is limited by the assumptions made by the creators of the software and how well the personnel making use of this software understand these limitations.

Given that time was of the essence, we opted to use text entries in Microsoft Excel spreadsheets to create the initial disc catalog and the associated metadata. This approach allowed us to leverage existing proficiency with spreadsheets and the availability of the software to eliminate the time needed to create a custom database application or to learn project management software. Unfortunately, the conspicuous absence of relational or workflow aspects in the spreadsheet format made us vulnerable to recordkeeping errors, making quality control a primary concern.

The copy functionality of the computer operating systems involved was sufficient to perform the movement of digital files from floppies to hard drives and removable media. Unfortunately, the differences between Macintosh and Windows in the management of file system metadata became significant. Creation dates are handled differently between these two operating systems such that a copy made in Windows takes on the date of the copy operation, not the creation date of the original from which it was made. Additionally, file system metadata for Macintosh files is stored in separate, invisible resource forks that are notorious for becoming corrupted. As a result, we often could not trust the dates ascribed by the operating system and had to refer to external resources, such as Michael Joyce's curriculum vitae, to confirm or provide date metadata at a later time. Issues with Macintosh resource forks also affected file downloads from DSpace after ingest.

At many points during the processing, we encountered technical difficulties in the form of file or disc errors. These errors can occur for a number of reasons, including damaged media, exposure to magnetic or other hazards, dirty data surface areas, and so on. In the case of dirty surface areas, several attempts were needed to overcome a copy error. We suggest keeping a drive cleaning kit available and using it periodically to prevent buildup of debris on the drive head. For other errors, it was necessary to have available software utilities that can attempt to recover from file copying errors. Windows provides such capabilities within the operating system (e.g., ScanDisk), but Macintosh does not. For our purposes we were able to find an older version of a commercial program, Norton Utilities, which allowed us to recover many files that could not be copied initially. Virus checking was also a preeminent concern. Errors and crashes must be met with persistence as they are often surmountable, which implies at least a minimum degree of technical knowledge.
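One way to guard against the date loss described above is to record each source file's timestamps in a separate record before copying, rather than trusting the copy to carry them. The following Python sketch is an illustration only and is not the procedure used in the project; the function name copy_with_recorded_dates and the record fields are hypothetical. It relies on shutil.copy2, which tries to preserve the modification time but does not preserve creation dates on every platform and does not carry classic Macintosh resource forks.

```python
import os
import shutil
from datetime import datetime, timezone


def copy_with_recorded_dates(src, dst):
    """Copy src to dst and return the source metadata captured before the copy."""
    info = os.stat(src)
    record = {
        "source": src,
        "copy": dst,
        "size_bytes": info.st_size,
        # Modification time as reported by the source file system (UTC, ISO 8601).
        "modified": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc).isoformat(),
        # When the copy operation itself took place.
        "copied": datetime.now(timezone.utc).isoformat(),
    }
    shutil.copy2(src, dst)  # copies the data and tries to keep the modification time
    return record
```

Because the record is built before the copy is made, the dates it holds do not depend on how the destination operating system stamps the new file.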

In moving the digital files to other media, we created a filesystem hierarchy that mimicked the physical arrangement of the discs. Such hierarchical arrangement allowed us to use file system tools to generate some of the metadata automatically. There are a number of freeware, shareware, and commercial applications for Macintosh that will catalog a file volume and produce reports. We used a shareware utility called CatFinder to index the copied files and export a report to a delimited format that was imported into Excel. This report formed the basis of our item-level metadata, including fields for filename, file size, kind (document or folder), Macintosh file type (analogous to the Windows file extension), Macintosh creator code, creation date, and modification date. To this basic report we added a comments field for use during appraisal and to collect technical notes.
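As a rough illustration of what such a volume index contains, the Python sketch below walks a directory tree and writes one row per file or folder to a delimited file. It is not CatFinder and does not reproduce its report; the function names index_tree and write_index and the column names are assumptions made for this example. It also cannot capture Macintosh file type or creator codes, which are not exposed by the standard library calls used here.

```python
import csv
import os
from datetime import datetime, timezone


def index_tree(root):
    """Return one metadata row per file and folder under root, sorted by path."""
    rows = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            info = os.stat(path)
            rows.append({
                "path": os.path.relpath(path, root),
                "filename": name,
                "kind": "folder" if os.path.isdir(path) else "document",
                "size_bytes": info.st_size,
                "modified": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc).isoformat(),
                "comments": "",  # left empty for appraisal and technical notes
            })
    return sorted(rows, key=lambda row: row["path"])


def write_index(rows, csv_path):
    """Write the rows to a delimited file that a spreadsheet can import."""
    fields = ["path", "filename", "kind", "size_bytes", "modified", "comments"]
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fields)
        writer.writeheader()
        writer.writerows(rows)
```

Because the directory hierarchy mirrors the physical arrangement of the discs, the relative path in each row already records which disc a file came from.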

MD5 file hashes were also generated for each file. Having an MD5 hash for each file allowed us to do two important things: to identify and/or eliminate redundant files, and to support provenance auditing during the repository ingest process. A freeware Perl application called Integrity automatically created MD5 hash calculations and exported the results to a delimited text file. Unfortunately, integrating the MD5 hashes into the CatFinder index was not trivial due to differences between the two applications in file name recursion and handling of hidden files.
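The two uses of MD5 hashes mentioned above can be sketched in a few lines of Python. This is a minimal illustration using the standard hashlib module, not the Integrity application the project used; the function names md5_of_file and find_duplicates are hypothetical.

```python
import hashlib
import os
from collections import defaultdict


def md5_of_file(path, chunk_size=65536):
    """Return the hexadecimal MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Group files under root by MD5 digest and return only digests shared by 2+ files."""
    by_hash = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            by_hash[md5_of_file(path)].append(os.path.relpath(path, root))
    return {h: sorted(paths) for h, paths in by_hash.items() if len(paths) > 1}
```

The same md5_of_file call can be repeated after ingest and compared with the stored value to confirm that a bitstream has not changed. Keying both the hash list and the file index on the same relative path is one way to avoid the merge problem described above.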

Having created a unified index of filesystem metadata, augmented with processing notes and MD5 hashes, we were able to more accurately assess the extent of the digital files and facilitate arrangement and appraisal. Unfortunately, the index was in no way tied to the digital files and presented us with a significant information management problem. For example, any movement of files was not automatically noted in the index, nor was any change or deletion in the index reflected in the filesystem. We can envision a workflow-oriented system that stands between the filesystem and a metadata database that would greatly increase the speed and reliability of processing large bodies of digital documents.
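A very small step toward the kind of system envisioned above is a periodic check that reports where the index and the filesystem have drifted apart. The Python sketch below assumes the index stores paths relative to a root directory; the function name reconcile and the result keys are hypothetical, and a real workflow system would do considerably more.

```python
import os


def reconcile(index_paths, root):
    """Compare relative paths listed in an index with the files actually under root."""
    on_disk = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            on_disk.add(os.path.relpath(os.path.join(dirpath, name), root))
    indexed = set(index_paths)
    return {
        # Listed in the index but no longer found: moved, renamed, or deleted files.
        "missing_from_disk": sorted(indexed - on_disk),
        # Found on disk but never indexed: new or moved files.
        "missing_from_index": sorted(on_disk - indexed),
    }
```

A check like this only detects drift after the fact; it does not prevent it, which is why a system that mediates every file operation would be preferable.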

Arrangement

(Note: More detail about our project can be found in a forthcoming article by Catherine Stollar about processing the Michael Joyce Papers in Provenance.)

After recovering most of the unique digital content from the first accession of floppy disks, we began the process of archival arrangement. In the beginning, we asked ourselves some questions. Can and should digital files be arranged like paper-based records? Should we heed traditional archival arrangement practices or follow newer theories of arrangement based on item-level metadata? Do electronic records have a natural hierarchy that can be expressed in a traditional arrangement? Should physical housing for digital materials be kept? If so, where? Our answers to these questions are not definitive, but we came to a compromise incorporating basic tenets of archival theory with features of on-demand, flexible file arrangement using item-level metadata.

A number of digital materials, including emails and published articles, within the archive had a paper-based counterpart, demonstrating that Michael Joyce created both digital and analog records while performing the same activities. Both formats of records were created synchronously, and at an institution like the Ransom Center that preserves not only works that have influenced the arts and humanities fields, but also the context in which those works were created, we determined it would be desirable to reflect synchronous creation in the arrangement. We did not originally understand relationships between Joyce's digital and paper materials because our first portion of the case study only dealt with electronic records from the first accession of floppy disks. We initially arranged the files into six series: Works, Academic Career, Correspondence, Storyspace, Third-party Works, and Personal. After surveying the paper-based materials and the second accession of electronic materials, we had to alter our original arrangement to include the newly accessioned materials. The final arrangement we created is Works and Related Materials, Academic Career, Correspondence, Storyspace, Journals and Appointment Books, Personal, Works by Other Authors, and Published Materials.

Institutional repositories like DSpace can facilitate digital object arrangement into our specified series by using the community, sub-community, collection, sub-collection, and item level hierarchies. DSpace's hierarchies relate to traditional archival hierarchical levels: communities equate to archival fonds, sub-communities to series and sub-series, collections to other layers of granularity within a series, and item-level entries to digital objects. In an additional level of granularity, items composed of multiple sub-components or related files, e.g., websites with multiple linked HTML files, can be ingested as bundled files.

After determining how to arrange the paper and digital materials, we decided how to arrange the physical housing (jewel cases, magnetic media, paper holders, plastic cases, etc.) from Joyce's electronic works. Previous policies and procedures at the Ransom Center dictated that electronic media should be physically housed in Hollinger boxes separate from the rest of the paper-based materials. This separation policy apparently arose out of concern for potential damage to other materials caused by degrading electronic media and to limit access to the electronic materials by researchers. No studies on electronic media degradation have found any instances of off-gassing or other damaging effects of filing electronic media with paper-based materials, so we determined physically integrating paper-based material and digital media would be the best policy for physically arranging the Michael Joyce Papers. The Ransom Center will still limit access to files saved on original media because researchers will have access to the files via DSpace.

Although we integrated Joyce's digital objects into a functional group arrangement similar to his paper-based records, we also took advantage of the flexible, non-linear nature of digital object arrangement by enabling on-demand, user-controlled arrangement by item-level metadata. Preservation of digital objects depends on item-level metadata used to document, migrate, emulate, and preserve the objects. Item-level metadata recorded for preservation in DSpace's database also enables flexible arrangement of digital objects. Digital arrangement allows archivists, and users, multiple options for organizing objects depending on the parameters set by the user interface, such as file name, title, author, date created, subject, or other metadata element. Arrangement is limited only by the skills of the programmer developing the user interface used to access the OAIS database and the precision of metadata recorded for each object.
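The idea of on-demand arrangement can be illustrated without reference to any particular repository: given item-level metadata records, the same set of objects can be regrouped by whichever field the user chooses. The Python sketch below is a generic illustration and does not use the DSpace API; the function name arrange and the sample records are invented for the example.

```python
from collections import defaultdict


def arrange(items, field):
    """Group item titles by the value of one metadata field, with groups and titles sorted."""
    groups = defaultdict(list)
    for item in items:
        groups[item.get(field, "(unknown)")].append(item["title"])
    return {key: sorted(titles) for key, titles in sorted(groups.items())}


# Invented sample records; real records would carry the metadata captured at ingest.
items = [
    {"title": "Item B", "series": "Correspondence", "format": "text"},
    {"title": "Item A", "series": "Correspondence", "format": "image"},
    {"title": "Item C", "series": "Works", "format": "text"},
]
by_series = arrange(items, "series")  # the functional series view
by_format = arrange(items, "format")  # an alternative, user-chosen view
```

Both views are derived from the same records, so the quality of any arrangement produced this way depends on how precisely each field was recorded.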

Arrangement is also affected by how we ingested objects into DSpace because our method of ingest affected what metadata fields we included. Although manual metadata assignment of all files within the Joyce archive was laborious, certain metadata fields were impossible to record automatically. Content metadata, such as subject and title of work, had to be entered by hand because automatic tools to accurately extract content were not available.¹ We found it difficult to use file names within the archive to associate files with published titles because the file names were not specific or standardized.

We incorporated methods for traditional archival arrangement and strategies for on-demand item-level arrangement while processing digital objects within the Michael Joyce Papers. Together, both methods allow users to browse records according to functional series and create new arrangements based on any metadata available for individual objects.

¹ Literary text comparison tools designed for use with small numbers of digital works were not sufficient for our large collection of files. Apparently text-mining tools could serve our purposes to compare large bodies of records with each other. We have not utilized any text-mining tools to date.

Challenges

In addition to the challenges we encountered developing a strategy to preserve Joyce's text and graphic files, we faced unique challenges associated with preserving Joyce's most influential creative works written using specialized software called Storyspace. Storyspace, created by Michael Joyce, Jay David Bolter, and John B. Smith, as a format type presented (and continues to present) challenges for media migration, ingest, and file use. Hypertext works written in Storyspace are composed of multi-faceted texts linked by guard fields (words within texts that enable direct links to other nodes, usually under specific conditions) and can only be viewed using Storyspace software. To complicate matters, we originally thought the latest version of Storyspace was backwards compatible and could read works written in the first version of Storyspace. Unfortunately, this is not entirely the case as older Storyspace documents do not degrade gracefully. For example, the text from files written in Storyspace 1.5 can be read in Storyspace 2, but the individual nodes and links are missing, making the Storyspace 2 rendering of an older work vastly different from the original.

New Skills

A thorough grounding in the various operating systems. The profusion of technical difficulties and operating system inconsistencies required an intuition about the various platforms that can only be gained by direct experience. While processing digital files it is essential to have an understanding of the environment in which they were created. Computer literacy in more than one platform and with networked environments will be ideal traits for archivists of the future.
A basic understanding of the structure of digital documents. Knowing how a digital file is created and stored, including such basics as the difference between binary file formats and textual formats such as ASCII and UTF, helps provide an understanding of what happens during processing. Furthermore, an intimate knowledge of the types of formats and how they might be identified (e.g., file type extensions or creator codes), accessed, and converted is essential. Understanding digital formats offers clues to where to find item-level metadata (e.g., document properties embedded in word processing files, ID3 tags embedded in MP3 audio files, etc.) and suggests migration paths for long-term preservation.
Proficiency with and trust in new tools. Familiar means of handling physical documents are not present with digital documents. Software tools and operating systems augment the functions of our senses in the digital world and mitigate some of this loss, but not completely. Integrated toolkits and processing systems are needed and must be developed so that they can be trusted to conform to the expectations of archival practice.
Establishment of new workflows and procedures. The intangible nature of digital information makes documentary evidence crucial to processing. Many institutions have established procedures for document processing, including audio/visual materials, but these cannot be assumed to be sufficient for digital objects. Operating systems alone cannot document processes, so new systems that function according to sound processing policies are necessary.
Ability to monitor current trends in digital preservation including metadata standards, crosswalks between encoding standards, available tools, storage systems, file format repositories, national and international research initiatives, user expectations, and published best practices guides.
A thorough understanding of traditional archival theory and practice. Archivists who work with digital records should be able to extrapolate traditional theory and apply it to electronic record preservation, but must be flexible enough to create new standards for archival practice. What we do as archivists will change (practice), but why we do it will not (theory).

Discussion Questions

Would automatic content management be more time consuming or less time consuming than manually arranging the digital manuscripts? Would it result in a better arrangement?
Is it feasible to devise a one-size-fits-all processing toolkit?
How can authors, who may deposit their materials in an institution like the Ransom Center, implement a digital preservation strategy at home, closer to the point of document creation?
How desirable is it to keep most files in proprietary formats that are the current de facto standard (e.g., Microsoft Word, Adobe PDF)?
Should files be arranged at all or should they be indexed and sorted using search engines using item-level metadata?
Is DSpace a viable option for smaller repositories and organizations?
How will DSpace integrate into existing points of access (e.g., OPACs, websites, EAD consortium sites)?
How do archivists best obtain the skills we are advocating they have? Classes? Projects? Workshops? Conferences?


From Floppies to Repository: A Transition of Bits

tkiehne — Tue, 14 Jun 2005 07:47:35 +0000

A Case Study in Preserving the Michael Joyce Digital Papers at the Harry Ransom Center

Thomas Kiehne, Vivian Spoliansky, Catherine Stollar

The University of Texas at Austin

In January of 2005, three students at the University of Texas at Austin School of Information undertook a project to preserve the electronic files of hypertext author Michael Joyce for the Harry Ransom Center, an arts and humanities focused archives located on the UT campus. Thomas Kiehne, Vivian Spoliansky and Catherine Stollar, the students involved in the project, spent five months preparing, arranging, describing and ingesting Joyce’s digital files into an Open Archival Information System (OAIS) called DSpace for preservation. The following is a report on the methods, problems and suggestions we developed while participating in the Joyce Project.

Background

The Harry Ransom Center

The Harry Ransom Center, custodian of the Michael Joyce Digital Papers, is an archives focused on advancing the study of the arts and humanities. Fulfilling the core of its mission the Harry Ransom Center:

Acquires original cultural material for the purposes of scholarship, education, and delight
Preserves and makes accessible these creations of our cultural heritage through the highest standards of cataloging, conservation, and collection management
Supports research through public services, symposia, publications, and fellowships
Provides education and enrichment for scholars, students, and the public at large through exhibitions, public performances, and lectures.

In acquiring the Michael Joyce archive, the Harry Ransom Center has the opportunity to preserve rare and unique electronic files that document the creation and evolution of hypertext fiction. The earliest of Joyce’s files were created in the mid-1980s, thereby necessitating the speedy creation of a digital preservation strategy to prevent the loss of some files to media failure or software inoperability.

Michael Joyce

“Michael Joyce is the author of afternoon, a story, perhaps the most celebrated hypertext fiction written to date, and of Twilight, A Symphony. His first novel, The War Outside Ireland (1982), was a Small Press Book Club selection, won the Great Lakes New Writers Award in fiction, and was featured in the USIA international traveling exhibit, "America's Best." He holds an MFA from the Iowa Writers Workshop, where he was a Teaching Writing Fellow; and he has been a Visiting Fellow at the Yale University Artificial Intelligence Project (1984-85). With Jay Bolter and John B. Smith, he developed Storyspace.” (from http://www.eastgate.com/people/Joyce.html, accessed 05/08/05)

Michael Joyce has also played an important role in the evolution of hypertext as a teacher. He began his teaching career in 1975 at Jackson Community College in Jackson, Michigan where he served as Associate Professor and Coordinator of the Center for Narrative and Technology until 1995. Currently, he is a faculty member at Vassar College where he continues to teach students the potential of the relationship between narrative and technology. Hypertext offers its users new tools and methods for textual collaboration, education, learning and entertainment.

Just as hypertext facilitates new relationships between narrative and technology, digital preservation requires relationships to form between traditional archival practice and technology. This makes hypertext an ideal narrative form, and Michael Joyce an ideal author, with which to begin digital preservation at the Harry Ransom Center.

Data Recovery and Preparation (Digital Archeology)

Many of the procedures that were used to extract and identify the electronic records were dictated by the characteristics of the storage media. The assumption at this stage was that the original storage media are not stable or reliable and that the information they hold must be moved quickly and efficiently. Otherwise, little was known about what to expect in terms of specific technological issues or requirements.

The provided floppy discs were mostly from the Macintosh “classic” era, some of which date as far back as the mid to late 1980s. Recent and current Macintosh hardware no longer has floppy drives installed. During our exploratory tests using a Mac OS X computer with an external USB floppy drive, we had some difficulty in accessing the discs. Many of the floppies came to us labeled “unreadable”, we suspect because of this very problem. Fortunately, older Macintosh hardware with integrated floppy drives was readily available and proved to be perfectly able to access the contents of the discs. From this experience we decided to migrate the content of the discs using older hardware and Macintosh operating systems that are roughly contemporary with the computing environments originally used to create the information. This decision would also prove to be helpful in the appraisal phase of the project when trying to access documents extracted from the discs, something that would not have been possible in many cases using recent Macintosh operating systems without conversion or migration utilities.

In addition to the hardware and operating system environment, we desired to use only open source, shareware, or freeware tools that are readily available in order to assist with the extraction process. This decision was motivated by several factors. First, we wished to eliminate the time required to create new programs to perform specific tasks or groups of tasks. Second, we wished to assess the state of existing tools for use in file management and assessment. Finally, we wanted to mitigate the effect of not knowing in advance what tasks we might perform and what tools might be necessary to perform them.

The Digital Archeology Process

In general, the process of extracting information was performed in the following steps:

Receive and identify physical media
Create a cataloging system for the physical media
Copy files from physical media and record metadata
Perform initial file processing
Create an item-level listing of all recovered files
Create working copies of all files and protect the original copies

The process is not necessarily linear, as is evident in the discussion of each step below.

Receive and identify physical media: We initially received packages of floppy discs that were arranged by a student intern working for Michael Joyce. We had to assume that these groupings reflected, in some way, the order in which the discs were originally stored or arranged. Each bundle was labeled in various ways, some having to do with some sort of semantic ordering (“STARWHITE”, etc.), and others with functional indicators (“UNREADABLE”). We could infer no explicit ordering other than that of the labeling, so the order in which we encountered the disc bundles became our original order.

As the discs were unpacked and placed into suitable containers, a sequential number was written in pencil on the outside of the disc housing that reflected the bundle and sequence of the disc within that bundle, e.g.: 13.1 for the first disc in the thirteenth bundle. This number is later referred to as the disc's catalog number, or more simply, disc number.

Create a cataloging system for the physical media: In order to begin gathering metadata, a Microsoft Excel spreadsheet was created that was used to inventory and describe various attributes of the discs. The following fields were chosen: disc number, bundle name, written (physical) disc label, virtual disc label, disc creation date, disc checked date, contents copied date, and notes.

The purpose for the various date fields was to ensure that several individuals could work on processing the discs independently and not duplicate effort. The notes field was used to communicate problems that were encountered to other group members and to record technical notes about any corrective actions taken.
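
The same disc-level catalog could also be kept as a delimited text file. The following is a minimal sketch in Python, not the spreadsheet used in the project: the column identifiers and the helper names `parse_disc_number` and `write_catalog` are hypothetical, chosen only to mirror the fields listed above.

```python
import csv

# Column names mirror the catalog fields described in the text.
# The identifiers themselves are hypothetical.
FIELDS = [
    "disc_number", "bundle_name", "written_label", "virtual_label",
    "disc_creation_date", "disc_checked_date", "contents_copied_date", "notes",
]


def parse_disc_number(disc_number):
    """Split a catalog number such as '13.1' into (bundle, sequence) integers."""
    bundle, sequence = disc_number.split(".")
    return int(bundle), int(sequence)


def write_catalog(rows, handle):
    """Write catalog rows (dicts keyed by FIELDS) as CSV to an open text handle.

    Rows are ordered by bundle and then by sequence, so disc 2.10 is listed
    before disc 13.2. Fields missing from a row are written as empty cells.
    """
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    for row in sorted(rows, key=lambda r: parse_disc_number(r["disc_number"])):
        writer.writerow(row)
```

Sorting on the parsed number rather than the raw text matters because a plain text sort would place "13.2" ahead of "2.10".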

Copy files from physical media and record metadata: Once the media were organized and a method for collecting metadata was established, the contents of the discs were copied to a directory on the workstation’s hard drive. The procedure for each disc was as follows:

Select the next disc from the container
Enter the disc number and written label into the spreadsheet
Insert the disc into the drive
If the disc is accessible (i.e.: formatted and without errors), record the virtual label (that which appears with the disc icon) into the spreadsheet.
Create a new sub-directory in the working directory and label it with the disc number
Drag-and-drop the disc icon into the new sub-directory. On a Macintosh, this creates a directory bearing the name and creation date of the disc. Were we using a Windows workstation, the directory would have to be manually created and the contents copied separately.
Update the creation date, date checked, and date copied in the spreadsheet.
If there are any problems, such as a file copy error, note them in the notes field of the spreadsheet.

The result of this process is a well organized grouping of electronic files that mirrors the organization of the physical discs, and which contains some metadata in the directory structure, specifically, the virtual label, the creation date of the disc, and the date on which the disc was processed. Should the spreadsheet become corrupted, lost, or otherwise unusable, much of the information can be regenerated from the directory structure itself.
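
As an illustration of how the disc-level information might be regenerated from the directory structure, here is a minimal Python sketch. It assumes the layout described above (one sub-directory per disc number, each holding one directory named after the virtual label). The function name `rebuild_disc_listing` is hypothetical, and creation dates are left out because the way they are exposed varies by file system.

```python
from pathlib import Path


def rebuild_disc_listing(working_dir):
    """Return (disc_number, virtual_label) pairs recovered from the directory tree.

    Assumes one sub-directory per disc number, each containing a directory
    named after the disc's virtual label. Loose files are ignored.
    """
    listing = []
    for disc_dir in sorted(Path(working_dir).iterdir()):
        if not disc_dir.is_dir():
            continue
        for label_dir in sorted(p for p in disc_dir.iterdir() if p.is_dir()):
            listing.append((disc_dir.name, label_dir.name))
    return listing
```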

At the end of each work session, a copy was made of the working directory and spreadsheet to removable media to protect against loss due to hard drive or system failure on the workstation.

Perform initial file processing: There are two types of initial file processing that occur during this step: virus checking and file recovery. Virus checking should be performed interactively, that is, any actions attempted by the software should require intervention by a user to make any changes to a file. Ideally, virus checking should happen automatically upon copying files from disc to hard drive, but barring this, a virus check should be performed on all copied files prior to any interactions with them to avoid infection of other files. It should also be noted that the virus checking software should be able to identify viruses that are contemporary with the information being processed – in our case, the mid-1980s to mid-1990s. Additionally, prudence suggests that the processing workstation be isolated from interactions with networks as much as possible during this phase to protect against newer, network based hazards.

The second type of file processing occurs when copying is prevented due to disc or file errors. These errors can occur for a number of reasons, including damaged media, exposure to magnetic or other hazards, dirty data surface areas, and so on. In the case of dirty surface areas, several attempts may be needed to overcome a copy error. We suggest keeping a drive cleaning kit available and using it periodically to prevent buildup of debris on the drive head. For other errors, it is necessary to have available software utilities that can attempt to recover from file copying errors. Windows provides such capabilities within the operating system (e.g.: Scandisk), but Macintosh does not. For our purposes we were able to discover an older version of a commercial program, Norton Utilities, which allowed us to recover many files that could not be copied initially.

Detailed information about actions taken to recover from a file copy error or virus was recorded in the disc catalog. This included identifying any files that could not be recovered and, in several cases, noting when directory structures had to be manually recreated due to unavoidable directory copying errors. This metadata is used later during the repository ingest process to augment the provenance statement for the affected files.

Create an item-level listing of all recovered files: At this point, there will be a complete copy of the contents of the discs and an accompanying document of metadata at the disc level. Since digital preservation operates at the item level, it is essential to generate metadata for each item. Fortunately, our hierarchical arrangement allows us to use file system tools to generate some of the metadata automatically. There are a number of freeware, shareware, and commercial applications for Macintosh that will catalog a file volume and produce reports. We used a shareware utility called CatFinder (http://www.mindspring.com/~shdtree/newsite/id9.html) to index the copied files and export a report to a delimited format that was imported into Excel. This report formed the basis of our item-level metadata, including fields for filename, file size, kind (document or folder), Macintosh file type (analogous to the Windows file extension), Macintosh creator code, creation date, and modification date. To this basic report we added a comments field for use during appraisal.
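
For readers without access to a cataloging utility, an item-level listing of this kind can be approximated with a short script. The sketch below is not a substitute for CatFinder: the function name `list_items` and the record keys are hypothetical, and Macintosh file type and creator codes are omitted because standard Python file calls do not expose them.

```python
import os
from datetime import datetime, timezone


def list_items(root):
    """Yield one metadata record per file or folder beneath root."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()   # sorting in place also fixes the traversal order
        filenames.sort()
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            info = os.stat(path)
            modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
            yield {
                "path": os.path.relpath(path, root),
                "kind": "folder" if os.path.isdir(path) else "document",
                "size_bytes": info.st_size,
                "modified": modified.isoformat(),
                "comments": "",  # left empty for use during appraisal
            }
```

The records can be passed to `csv.DictWriter` to produce a delimited report that opens in a spreadsheet.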

MD5 file hashes were also generated for each file. Having an MD5 hash for each file allows us to do two important things: to identify and/or eliminate redundant files, and to support provenance auditing during the repository ingest process. A freeware Perl application called Integrity (http://therockquarry.com/integrity.htm) provided batch MD5 hash calculations and report exporting capabilities. Unfortunately, integrating the two reports was not trivial due to differences between the two applications in file name recursion and omission of hidden files.
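
The two uses of MD5 hashes described above can be sketched with Python's standard `hashlib` module. This is an illustrative stand-in, not the Integrity application. The function names are hypothetical, and each file is read as an ordinary byte stream, so a Macintosh resource fork would not be covered by the digest.

```python
import hashlib
import os
from collections import defaultdict


def md5_of_file(path, chunk_size=65536):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Map each digest shared by two or more files to their relative paths."""
    by_hash = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:  # hidden files are hashed like any other file
            path = os.path.join(dirpath, name)
            by_hash[md5_of_file(path)].append(os.path.relpath(path, root))
    return {h: sorted(paths) for h, paths in by_hash.items() if len(paths) > 1}
```

Because the listing and the hashes come from the same directory walk and use the same relative paths, a script of this kind avoids the report integration problem noted above.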

Create working copies of all files and protect the original copies: Before proceeding with appraisal, it is necessary to create a copy of the copies in order to avoid having to recover files from the original media in the event of inadvertent file corruption or damage during appraisal. The wisdom of this decision was demonstrated once during the project when we attempted to access a disc that had already been copied and were presented with a disc format prompt! Remember: we are assuming that the original media are not stable or reliable, so we do not wish to access them any more than necessary. The original bit copies should be made read-only or locked by the file system to ensure that the original copies will remain unmodified. This is also a good time to create a copy of all the files on removable media along with a copy of the file and disc level indexes.
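
One way to make the original copies read-only is to clear the write permission bits on every file. The sketch below assumes a file system with POSIX-style permissions and uses a hypothetical function name, `lock_tree`. On a classic Macintosh the equivalent step would be the Finder's "Locked" checkbox, which this code does not set.

```python
import os
import stat


def lock_tree(root):
    """Remove write permission from every file beneath root; return the count."""
    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    locked = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            mode = os.stat(path).st_mode
            os.chmod(path, mode & ~write_bits)  # keep read and execute bits
            locked += 1
    return locked
```

Permission bits protect only against accidental changes, so the copy on removable media remains the real safeguard.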

Observations on the Digital Archeology Process

The process just described is effective and reliable so long as all participants in the process are diligent and consistent in their actions. The use of spreadsheets for file indexing and metadata collection is highly effective in that it allows team members to track issues that arise, presents a familiar and intuitive user interface for data collection, and can be easily exported and manipulated. Unfortunately, once the appraisal process begins, the file hierarchy changes as files are removed, moved, or altered in accordance with collection development. The end result is that the original file index no longer represents the working directory structure and must be regenerated at the completion of the appraisal process. This is not a problem in itself, but creates a challenge in trying to carry forward the comments and other notes from the original file index. In our case it was relatively simple to manually copy these notes, but in a case where a larger number of files or discs are involved, or if an institution must frequently perform digital archeology, a more robust solution must be implemented to track the process. A database or Web-based application could easily be developed to track each file from its original physical location through appraisal and ingest. Data entry for a database application might demand more effort initially, but will reduce error and effort later in the process.
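
To make the idea of a tracking database concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table layout, column names, and function names are hypothetical and no such application was built for this project. The point is only that a file's original location, current location, and comments can stay in one record as the working hierarchy changes.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS files (
    id            INTEGER PRIMARY KEY,
    disc_number   TEXT NOT NULL,
    original_path TEXT NOT NULL,
    current_path  TEXT,
    md5           TEXT,
    status        TEXT NOT NULL DEFAULT 'copied',
    comments      TEXT NOT NULL DEFAULT ''
);
"""


def open_tracker(db_path=":memory:"):
    """Open (or create) the tracking database and return the connection."""
    conn = sqlite3.connect(db_path)
    conn.execute(SCHEMA)
    return conn


def register_file(conn, disc_number, original_path, md5):
    """Record a newly copied file; its current path starts as the original."""
    cur = conn.execute(
        "INSERT INTO files (disc_number, original_path, current_path, md5) "
        "VALUES (?, ?, ?, ?)",
        (disc_number, original_path, original_path, md5),
    )
    return cur.lastrowid


def move_file(conn, file_id, new_path, comment=""):
    """Record a move made during appraisal, appending to earlier comments."""
    conn.execute(
        "UPDATE files SET current_path = ?, status = 'appraised', "
        "comments = trim(comments || ' ' || ?) WHERE id = ?",
        (new_path, comment, file_id),
    )
```

Because `original_path` is never overwritten, the index does not have to be regenerated after appraisal and the notes travel with the file.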

A number of choices may be made about the depth and veracity of metadata generation. For example, we captured only the prevailing physical labeling on each disc. It is possible however to go further with physical description, including notes about scratched-out or previous disc labels, disc types and capacity (e.g.: double-sided, 800 Kb), and other such externalities. An especially eager archivist may decide to scan or photograph the disc itself and include an image in the repository for posterity. Furthermore, the operating system may include additional attributes that cannot be captured by a simple file listing. For example, Macintosh files and directories may have customized icons or priority (color) labeling. The depth of description is ultimately a judgment call on the part of the archivist, who should consider the time and resources available for the project against the added benefit of richer description in the context of the collection's purpose.

A second metadata issue involves file and disc creation dates. As indicated earlier, Macintosh's Hierarchical File System (HFS) supports disc creation dates while Windows' File Allocation Table (FAT) does not. Additionally, when a Macintosh file is copied, the creation date of the original is maintained while a copy performed in Windows resets the file creation date to the date of copy. This subtle difference forced us to manually adjust metadata for files copied from FAT formatted discs and files copied using a Windows workstation. We made note of the change in the comments field to inform provenance, but such a fundamental difference is sure to cause issues for similar projects performed in completely Windows-based environments. Solutions to the creation date problem must be found, possibly in the form of new copy utilities for use in Windows environments.

Other issues with dates were encountered that illustrated some very basic assumptions about digital information. The date errors we encountered were dates that were obviously incorrect (e.g.: 01/26/1904, 04/27/1957). These anomalies most likely occur as a result of one of two problems. The first problem is corruption of the file's resource fork. Although the error can be fixed, the date cannot be recovered – the repair process merely resets the date to the current date. The second problem is an incorrectly set internal clock on the creator's computer. In extreme cases, such incorrect dates are easy to discover and discount, but they bring to bear a fundamental assumption in digital archeology: we assume that the date provided by the original user is correct. This is really no different than the analogous case in physical archives, where dates are written on paper as best the creator can recall and are subject to error. The lesson learned here is that although digitally assigned dates may be reliable in most cases, they are not immune to error and must be taken as a best estimate rather than indisputable fact.
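
Obviously incorrect dates of this kind can be flagged automatically for review. The sketch below is a simple plausibility check, not a method used in the project. Classic Mac OS counts time from January 1, 1904, so dates in that year often indicate a reset clock or missing value. The default lower bound of 1984 (the year the Macintosh was introduced) and the function name `classify_date` are assumptions that should be adjusted to the collection.

```python
from datetime import date

MAC_EPOCH = date(1904, 1, 1)  # zero point of classic Mac OS timestamps


def classify_date(value, earliest=date(1984, 1, 1), latest=None):
    """Label a creation or modification date as plausible or suspect.

    earliest and latest bound the period in which the files could have been
    created; latest defaults to today.
    """
    if latest is None:
        latest = date.today()
    if MAC_EPOCH <= value < date(1905, 1, 1):
        return "suspect: near Mac epoch"
    if value < earliest or value > latest:
        return "suspect: outside plausible range"
    return "plausible"
```

A check like this only catches the extreme cases. A clock that was wrong by a few months would still pass, which is why such dates remain best estimates.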

Perhaps the most pertinent observation is that although there are tools available to automatically generate file metadata, they do not work especially well together. We used separate applications to generate file listings, MD5 hashes, check for viruses, and recover from file system errors. If an organization is going to frequently participate in digital archeology, it is advisable that tools be sought or developed that can perform many of these tasks simultaneously. Ideally, these tools would work with the database application suggested above to reduce human intervention, and therefore, time and error. Regardless of what form these tools take, a kit of such tools must be procured for each operating system environment in which digital archeology is to be performed, even if the operating system environment exists as an emulator. Given that numerous operating systems are currently bordering on obscurity, it is imperative that archives gather these tools sooner rather than later. The loss of access to such tools is equivalent to allowing data created in those environments to die, every bit as much as failing to take any action at all to recover the data.

A final observation concerns the affordances of the storage technology from which we recovered information. An interesting longitudinal study could compare the changes in information management practices as storage media capacities and types changed. Specifically, the 800 to 1400 kB storage capacity of the floppies used in the 1980s and 1990s enforced certain organizational practices that disappeared rapidly with the advent of large capacity hard drives. Subdirectories are much rarer on floppies than hard drives and other large capacity media; therefore, we saw little need to retain structural information beyond the disc ID and label. Digital archaeologists working with hard drives and large capacity storage media should collect path data as it will assist in file tracking and identification as well as in establishing the original order and contextual information necessary for appraisal.

Arrangement of the Joyce Files

While copying bits from floppies to a hard drive, we were able to quickly survey the types of materials and some file content to help us develop an arrangement for the files based on the functions in which the files were created. We tried to model the arrangement of the Joyce files after current methods of arrangement employed by the HRC.

Before discussing the arrangement we established, it is necessary to note that a benefit of arranging digital files is the flexibility of those files, or more appropriately of access to those files, to change after deposit. Any arrangement established in an electronic institutional repository can be altered for varying purposes of access. In our case, the arrangement we developed to organize files ingested into DSpace could be altered, to some extent, by the end user's sorting preferences. We will discuss this later in detail.

We approached the arrangement of the Joyce files in the same manner as we would approach arranging paper versions of the same files. We determined the files would fit neatly into six series: Works, Academic Career, Correspondence, Storyspace, Third-Party Works, and Personal. The following is the arrangement we developed for the Joyce files.

Figure 1. Arrangement of Joyce’s Files

Series I. Works

(Subseries for each title)

Series II. Academic Career

Subseries A. Scholarly Material

(Conferences, Presentations, Groups, Correspondence about conferences, Bios/CVs, Scholarly works, and Published papers)

Subseries B. Teaching Material

(Reading exercises, Class exercises, Learning objects)

Subseries C. Administrative Material

(Grant proposals, Departmental correspondence, Requests for fellowships)

Series III. Correspondence

(can be arranged by date, subject, or author metadata)

Series IV. Storyspace

Subseries A. Code

Subseries B. Design

Subseries C. Riverrun Ltd. (company that created the Storyspace reader)

Series V. Third-Party Works

(subseries for each author/work)

Series VI. Personal

(Address books, Expense reports, Other)

In DSpace terms, our arrangement looks like this:

Figure 2. Arrangement in DSpace Terminology

Community: HRC

Sub-community: Michael Joyce Papers

Sub-sub-community: Series I. Works

Series II. Academic Career

Series III. Correspondence

Series IV. Storyspace

Series V. Works By Others

Series VI. Personal

Collections:

Series I. Subseries A. Afternoon

Subseries B. Twilight

Subseries C. Writing on the Edge

(etc.)

Series II. Subseries A. Scholarly Materials

Subseries B. Teaching Materials

Subseries C. Administrative Materials

Series III. None

Series IV. Subseries A. Code

Subseries B. Design

Subseries C. Riverrun, Ltd.

Series V. Subseries A. Uncle Billy’s Funhouse, Pete Jones

Subseries B. Chaos, Lily Wilson (etc.)

Series VI. None

We attempted to create series that were specific enough to prevent an overlap of possible “homes” for files, but broad enough to encompass a significant portion of the archive. This arrangement is based on how files were created by Michael Joyce. As Joyce creates some files in his role as hypertext author and other files in his role as Vassar professor, we attempted to separate the files in the arrangement. This separation was most difficult when evaluating where written works should be placed. We eventually determined, after looking at Joyce’s own distinctions between fiction and academic works within his 1998 curriculum vita (http://faculty.vassar.edu/mijoyce/MJoyceCV04.htm), that fiction would be arranged in subseries according to title within Series I. Works, and his academic papers would be separated into subseries by cause for creation in Series II. Academic Career. Joyce uses these headings to delineate the types of items he has written: Fiction and Hypermedia; Scholarly Books, published lectures, etc.; and Scholarship. We equated our Series I. Works to his “Fiction and Hypermedia” title. Our Series II. Academic Career is the combination of his “Scholarly Books, published lectures, etc.” and “Scholarship.”

We faced some confusion when trying to sort files. The published titles of some files could not be ascertained from the file name, so we had to read through most of the digital files we were trying to arrange at least once. (As a side note, we often had to read through the files a second or third time when assigning content keywords to the files for ingest.) We also had difficulties separating files into Series I. and Series II. because some academic essays were published individually, then at a later date were published together in Othermindedness. This made it difficult to distinguish under which title we should arrange the files: the essay title or the compilation title, Othermindedness. Additionally, some titles were not published at all and therefore not listed on Joyce’s vita. In the instance of unpublished work, we were forced to determine if the file was fictive or academic in nature. The line between narrative work (fiction) and academic material was difficult to distinguish; however, whenever we doubted the arrangement of files, we would turn to the author’s method of distinguishing his work in his 1998 vita.

Despite the size of Joyce’s archive (211 MB, or nearly 4800 files), we were able to arrange all of the files into six series. We found it considerably more difficult to arrange digital files than paper archives because of the initial disorder of the accessioned files, the number and initial placement of duplicates, and the lack of distinguishing features of files visible to the archivist. Processing digital files required at least three passes through the files to separate them into appropriate series. With each pass, we would break the files down into smaller and smaller groups until all of the files were in appropriate folders corresponding to published titles, groups or other context of creation as listed in Figure 1.

Originally the files came into the HRC saved on 370+ floppies. Although we retained the original location, and thus original order, of files by recording disk numbers for each file, we found the original file order to be haphazard and insufficient for research access and file use. We wanted to enhance access to and use of the Joyce archive by intellectually organizing the files in a manner that would reflect Joyce’s creation process. The file/disk order in which the files came to the HRC mainly reflected Joyce’s file preservation process. We know this because most of his disks were labeled “backup” or “b/u,” indicating that the files on the disks were saved as backups of an initial copy, presumably on Joyce’s hard drive. It is important to note that the files saved on Joyce’s disks reflect his own appraisal. He determined which files should be saved to disk. Since the files on the disks we received were not only created by Joyce and saved to the hard drive, but were also saved again, possibly multiple times, as backups by Joyce, we are preserving only those files Joyce intended to preserve. Hopefully, when the HRC receives the files mirrored from Joyce’s hard drives we will find even more files that relate to his process of creation than the initial files we have preserved from his floppy disks. Unfortunately, due to Joyce’s computer upgrades through the years, the files from his early working years, circa 1980s, which were not saved to floppy disk may be lost forever if not migrated to his new hardware.

Differences Between Arranging Paper and Digital Files

Certainly some aspects of digital and paper arrangement are similar. Hierarchical relationships exposed by traditional archival file arrangement into series and subseries can still be utilized in DSpace. Archivists may map traditional archival arrangements onto hierarchical groups within DSpace, specifically onto communities, sub-communities and collections. However, while processing the Joyce files, we discovered a number of differences between arranging paper files and arranging digital files.

Digital arrangement is more flexible than paper arrangement: Flexibility is a key component of DSpace. Items may be ingested into multiple collections by mapping an ingested item from one collection to another. Additionally, since one cause for arranging paper files is to enable easy access, multiple methods of access to files constitute multiple arrangements. One can access an item by searching for subject keywords, performing full-text searches, or organizing the display of files within a collection by author, date or title. One of the greatest aspects of digital archives is the flexibility with which files can be arranged. With digital archives, archivists can retain the original order of accessioned files and also impose an order that facilitates greater intellectual access.

Digital archives require item level metadata: Item level metadata is required for digital arrangement, whereas with paper files, folder level metadata is the most detailed metadata recorded. Current archival practice dictates arrangement of archival items into related groups. The purpose of group arrangement is to cluster items that were created by a similar function together. Clustered items may provide more contextual clues than individually arranged items, and thus reveal more information to the end user. Only in instances where items are collected individually, due to rarity or famous association, are those items individually arranged. At the HRC, for instance, manuscripts created before 1700 are removed from the group in which they were accessioned, placed into the Pre-1700’s Collection and receive an individual access record. This is mainly to ensure the rare items are well preserved and to facilitate easier access. In DSpace, items are issued individual access records for the same reasons: preservation and access.

Item level metadata for digital files is the most important component of digital preservation. The more we know about a digital file, the more options we have to preserve it. If we know how and when a file was created, we can find the original or emulated software to read it. We can also find a translating program to read the text in its original format and produce another version readable by available software. Additionally, in the future, new technologies might emerge that could require any type of metadata. The more item level metadata we keep right now, the better chance to read the file in the future.

Item level metadata is also important for accessing digital files. Just as it is difficult to appraise files that are intangible, it is difficult to differentiate digital files from one another. If all files were placed in collection level groups, researchers would waste energy and valuable time trying to distinguish desired files from others.

Terminology differs between paper and digital arrangement: The first main difference between paper and digital arrangement is the use of the term “papers.” Within traditional archives, items created by one person over a period of time gathered together are called papers. Records are items created by an organization as evidence of business transactions. A collection is the compilation of multiple items that were created by multiple authors documenting various actions. Within the scope of these definitions, if Michael Joyce’s digital files were actually paper, we would have called his archive “The Michael Joyce Papers.” Can digital files be called “papers”? In the academic realm, when researchers present their findings at conferences their lectures are called “papers” even if the text of their presentation is born-digital. The University of Rochester’s DSpace repository has a collection titled “Warner School Conference Papers” where digital files with texts of conference papers have been ingested. Although our solution to this problem of terms was to substitute other words for “papers,” such as “digital files”, “material”, “files”, “items”, and sometimes “collection,” if the term “papers” carries a meaning that already implies more than tangible documents, it is appropriate to title our compilation “The Michael Joyce Papers.”

The second issue with traditional archival and digital arrangement terms, specifically within DSpace, concerns DSpace’s hierarchy terminology. DSpace documentation differentiates communities and collections. “Each DSpace site is divided into communities; these typically correspond to a laboratory, research center or department. As of DSpace version 1.2, these communities can be organized into an hierarchy. Communities contain collections, which are groupings of related content.” Based on these definitions, communities would correspond to the creator(s) of digital items and collections would correspond to the content of the items. In our arrangement (see Figure 2), communities and sub-communities refer to differences in both creator and content. Preferably, we would have mapped our series as collections in DSpace because our series were established due to content differences. Unfortunately, due to limitations of DSpace 1.2, collections are shallow and may contain items only, not sub-collections. We need deeper levels of description in collections instead of communities.

Suggestions to Facilitate Archival Arrangement

Traditional archival theories can and must be integrated into institutional repositories best practices. Both archivists and DSpace administrators have the same goal in mind: digital file preservation. We have a few suggestions to merge archival methods and digital preservation within DSpace.

First, the DSpace hierarchy should be deeper at the collection level and allow for sub-collections to facilitate representative content distinctions between collections and sub-collections. Using sub-communities as content distinctions is not the implied intent of the DSpace 1.2 documentation.

Secondly, collection administrators should be given options to alter templates for item ingest and display in the web-based user interface. The main benefit of digital arrangement is the flexibility with which users can arrange items. If item record lists (found within collections) display only title, author and date issued, users have limited choices for arrangement. Title, description, author, date created, and date issued should be fields listed in item lines when displayed at the collection level. End users will want to sort by those fields.

Third, DSpace and EAD records can work together. EAD files can act as the liaison between traditional finding aids and DSpace. The use of an <extref> tag can link the lowest level of description within an EAD file (essentially at “folder” level) to the corresponding level of description within DSpace (in our case, the collection level). See Appendix A for an example. Linking EAD files to DSpace records gives end users more opportunities to access information recorded in DSpace hierarchies. It also provides a smooth access route for users who want item level information but cannot find such detail from traditional archival sources.
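
As a rough illustration of the kind of link described above, the following Python sketch builds a folder-level EAD component whose title is wrapped in an <extref> element pointing at a DSpace collection. It is not the markup from Appendix A. The function name `folder_component`, the placement of <extref> inside <unittitle>, and the example URL are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def folder_component(title, dspace_url):
    """Build an EAD <c> element whose title links out to a DSpace collection."""
    component = ET.Element("c", {"level": "file"})
    did = ET.SubElement(component, "did")
    unittitle = ET.SubElement(did, "unittitle")
    # The href value is whatever persistent address the repository assigns.
    extref = ET.SubElement(unittitle, "extref", {"href": dspace_url})
    extref.text = title
    return component


if __name__ == "__main__":
    # example.org stands in for a real repository address.
    element = folder_component(
        "Subseries A. Scholarly Materials",
        "http://example.org/dspace/collection-placeholder",
    )
    print(ET.tostring(element, encoding="unicode"))
```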

Finally, software tools to extract content and creation dates of files would greatly reduce the time spent on arranging digital files. If we had passed our files through a software tool that could have extracted subject keywords from the files, our subject metadata would have better reflected the content of the files. Although automatic keyword extraction is not the ideal way to assign keywords, it would have been better than the method we employed, which was to skim the file and assign keywords that we thought were pertinent. Our system was flawed because none of us assigned similar keywords, our subject evaluation was based on a cursory glance, our keywords were not based on a standard vocabulary, and often we assigned no keywords because we could not access the file at the time of ingest. Additionally, we would have saved time if the creation dates of our Mac files had been automatically extracted, instead of our having to manually enter the metadata field for creation date and its value after each file had been ingested into DSpace.
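
Automatic keyword extraction can be as simple as counting the most frequent non-trivial words in a file's text. The sketch below shows that naive approach in Python. It is an assumption about what such a tool might do, not a tool that was available to the project. The stopword list is deliberately short, the function name `suggest_keywords` is hypothetical, and the output would still need to be mapped to a standard vocabulary.

```python
import re
from collections import Counter

# A deliberately small stopword list; a real tool would use a fuller one.
STOPWORDS = frozenset(
    "a an and are as at be by for from in is it of on or that the this to was with".split()
)


def suggest_keywords(text, limit=5):
    """Return up to `limit` of the most frequent non-stopword terms in text."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [word for word, _count in counts.most_common(limit)]
```

Even a crude method like this applies the same rule to every file, which addresses the inconsistency between individual processors noted above.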

The Appraisal Process

In the Archival Terminology of the International Council of Archives (ICA, 1984) appraisal is defined as: “a basic archival function of determining the eventual disposal of records based upon their archival value”. Appraisal is also referred to as evaluation, review, selection or selective retention. If we consider this definition, the appraisal process would not present a difference between paper-based archives and electronic archives. In fact, the decision of the HRC to collect Michael Joyce’s materials was based on their “archival value,” and this value is related to HRC’s collecting policies. Nevertheless, the differences between paper and electronic archives appraisal become apparent only when the process is undertaken. Differences in the following areas are discussed below: author identification, mode of creation, distinguishing files, disposal and rights management.

One of the first issues that we noticed while working with Joyce’s digital records was that many clues one uses when working with paper documents are not present in the electronic environment. For example, when the author did not explicitly insert his name in the text and the document was not clearly perceived as being his, we could not count on handwriting analysis, letterhead, type of paper, ink color, ink type, smell, type of copy, and other clues that are generally applied in paper contexts. We were completely dependent on the language that was used and the content of the file, which sometimes could be quite misleading. In some cases it took two people working through the documents more than once to find enough clues to identify not only the author, but also what the file actually was. Despite these disadvantages, digital files offer their own advantages for identification, including potential date stamps, the type of software used, e-mail headers, MD5 hashes (to differentiate copies), and creator metadata attached to files by newer software. These electronic identifiers may not be failsafe; however, they do provide alternatives to the tangible differences between paper files.

The second issue is the mode of creation of digital files and how it differs from that of paper documents. Given the possibility of keeping different versions of a record and also making backup copies on diverse media (hard drive, floppies, etc.), people tend to have multiple copies of the same file, or files differing only in system metadata rather than content. The ease of copying digital files facilitates the creation of more duplicates than with paper files, and digital duplicates seem more widely dispersed than paper duplicates. In paper files, carbon copies are usually found near the original copy. This is not true with digital files. In our project the discs received were usually backups^¹. Even though the number of duplicates or near duplicates was not significant, it required work on our part to identify them, to keep all the different versions (even when the changes were minimal) and to “dispose” of the exact duplicates. The random assortment of files on the floppies also led to difficulties identifying the pertinent “associations” of the files, which made this task very time-consuming. In addition, our team members’ experience with paper documents made these tasks a challenging learning experience. Finally, an obvious difference is that digital files are “saved” multiple times over the course of creation, which destroys previous versions. In paper archives one might encounter not only several copies of the same document in different locations, but also different versions, which makes it possible to track changes and compare them with the final document. This process is usually lost with digital records unless all the changes are saved separately or software versioning features are used.

The third issue is that of distinguishing and identifying files, which is technologically dependent since a reader is required to access the contents of a digital file. When the files could not be opened because of the lack of the appropriate software, the information could not be accessed, which meant that some records could not be identified. This does not occur with paper documents, since a glance can distinguish at least the type of document. With electronic records we initially relied upon the name of the file to begin our identification process. With older software, however, where filenames were shorter, the filenames frequently were either misleading or just not indicative of any particular content. This forced us to classify some records as “Unidentified.” Interestingly enough, once we had processed many files, we began to recognize Joyce’s file naming patterns^². Finally, the loss of file associations was another recurrent problem. Simply identifying the appropriate application to use for opening a file proved to be difficult in many cases since the workstation did not have the requisite software installed. For example, before we installed Storyspace on our computer, we would see a generic file icon instead of the Storyspace icon.

Disposal is an important archival process related to the life cycle of the records. Disposal is “the action taken with regard to non-current records following their appraisal and the expiration of their retention periods as provided for by legislation, regulation or administrative procedure. Frequently used as synonymous with destruction.” Related to disposal is a practice called weeding, which is “the removal of individual documents or files lacking continuing value from a series” (ICA, 1984). As we stated above in dealing with appraisal, the definition of disposal could theoretically be applied to electronic records, but in practice we noticed some differences from a paper-based archive. These will be addressed in relation to disposition and rights management issues.

Even though, from a traditional archival perspective, there are files in Joyce’s fonds that could have been disposed of during appraisal, the HRC will keep all the “original” materials contained on the floppies, as well as backups of all the materials that were produced while implementing this project. This differs from the approach we would have taken with a paper-based collection, where disposed materials are removed and destroyed. With these electronic records we could have followed the same procedure and deleted the disposed records, but because the HRC will keep the “originals” and the backups, the disposal process instead becomes an access restriction on the selected materials. The “disposed” files were not included in the appraisal directory and final arrangement and, consequently, were not ingested into DSpace. There are three groups of materials that were disposed of: software application files, duplicate files, and files that contained student works. The software application files were not kept because Joyce did not create them, and so we did not consider them part of his fonds. The only software that was kept was related to Storyspace because of its uniqueness to this collection, since it is part of the author’s artistic and academic development, and because of Joyce’s involvement with the creation of Storyspace. It is likely that the HRC will acquire some rights to keep Storyspace software, the new version of which will be installed on the reading room computer in order to view Joyce’s hypertext novels.

The second group of disposed materials was composed of duplicates. These files were readily identified by comparing MD5 hashes generated during the creation of the file index. Further, we checked that the dates (creation and modified date), the format and the size of the file were all exactly the same in order to be assured that we were disposing of exact copies. There were some cases where one word changed from one file to the other, and we kept both versions. For example, the poems “Eislied: a melody in black and white” and “A melody in black and white” are identical poems with one word inserted in the first title. When we disposed of the duplicates we kept the one that was first according to our disk folder number. In order to keep his working style for posterity, we considered documenting in the metadata the identification number of the duplicates to make the relation accessible to researchers, and possibly modeling the relationship in the digital repository.
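The hash comparison described above can be sketched as follows. This is an illustration rather than the script used to build our file index: the function names are hypothetical, and it assumes the disk images have been copied into one directory tree whose folder names sort in disk folder order, so that the first path in each group is the copy to retain.

```python
import hashlib
import os
from collections import defaultdict


def md5_of_file(path, chunk_size=65536):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Group files under root by MD5 hash and return groups with more than one file.

    Paths in each group are sorted, so the first entry is the one that comes
    first by folder name (assumed here to mirror the disk folder number).
    """
    by_hash = defaultdict(list)
    for folder, _subdirs, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            by_hash[md5_of_file(path)].append(path)
    return {h: sorted(paths) for h, paths in by_hash.items() if len(paths) > 1}
```

A matching hash establishes only that the bytes read are identical. The additional checks on dates, format and size mentioned above would still be made separately before treating a file as an exact duplicate.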

The third type of materials disposed of were third-party works that were not created, all or in part, by Michael Joyce. We disposed of student works because they are protected by copyright law and it is HRC policy to remove student works from collections. As for non-student third-party works, we originally intended to remove those as well due to copyright restrictions. However, since third-party works are maintained in paper archives, we were instructed to retain those works as we would if they were paper. We created a separate series, Series V., for third-party works and will make it inaccessible until copyright permissions are gained by the HRC.

Appraisal and disposition practices are related to preservation issues. From a practical point of view, a good appraisal process is the first step in the preservation of documents as it ensures that the documents that are retained are well preserved. It is also important to note that according to the InterPARES Project (2002), the assessments of authenticity as well as the feasibility of preservation are criteria for appraisal decisions. Finally, from a theoretical point of view, the relationship between the appraisal and the preservation of cultural property implies a certain perspective on the ideas of uniqueness and permanence. In this sense, only some records will have an enduring value, and therefore will become archival documents that represent our cultural heritage, as in the case of Michael Joyce’s digital fonds.

Preservation

Paul Conway (quoted in Gilliland-Swetland, 2000) states that: “The digital world transforms traditional preservation concepts from protecting the physical integrity of the object to specifying the creation and maintenance of the object whose intellectual integrity is its primary characteristic.” Even though we agree with Conway that the preservation of digital objects poses different issues than the preservation of physical objects, physical objects also have to be preserved in the digital environment.

At the beginning of the project, the HRC’s primary concern was to retain the information, and it was not so concerned with retaining the original floppy disks after the materials were copied in a secure environment. From the perspective of the preservation of the bitstreams, it is unclear whether future developments will be better able to make these bitstreams more easily accessible. In that sense, if we keep the original objects we will be able to work on them again if we have access to new technology. We should also consider the importance of the original as a representative of a type of technology as well as an archival object and as a proof of authenticity of the files and documentation of their existence. Therefore, we highly recommend keeping the floppies as evidence of the “originals.” As physical objects they should be housed in archival quality boxes designed for this type of material and in an optimal environment, which will require insignificant expenditures and storage space.

Returning to the concept of future access, we must understand the ideas of William LeFurgy (2002), who is very optimistic regarding the progress of digital preservation, when he proposes a model of levels of service. LeFurgy posits that, depending on the type of object we are faced with and on the state of the art of digital technologies, we will always be capable of improving our preservation systems and our levels of service for users.

The levels of service, according to LeFurgy, are related to the degree to which the digital materials can be managed independent of specific technology, in other words, their “persistence.” Persistence is directly linked to the conditions under which the records were created and described. Therefore digital collections can have different levels of persistence: optimal, enhanced and minimal, depending on their persistence characteristics. At a low level of service the formats are not recognized and only the bitstream can be preserved. At a medium level, the formats are known and bit preservation can be achieved, but full support cannot be guaranteed. At a high level of service, formats are supported and therefore both bit preservation and functional preservation are achieved. Using migration or emulation techniques, both types of preservation are possible. MacKenzie Smith (2003) states that bit preservation is achieved when digital files are preserved as they were originally created without any changes. Functional preservation is achieved when the “digital file is kept useable as technology formats, media, and paradigms evolve” and the functionality is maintained.

In the collection’s SIP agreement it is stated that a “medium” level of service will be provided, which entails the following:

Partially persistent materials that enable medium confidence. Preserves the content of the material with degradation of the form allowed. For this level of service, the repository will watch the format in order to try to maintain the data in an accessible format. They will, however, not create their own tools for this conversion unless absolutely necessary. For this level of service, off the shelf conversion tools will be used. Checks will be made to verify that the intellectual content is the same. The original bit stream will be maintained in addition to the converted file. Formats enjoying this level of service include compression schemes and open but proprietary standards.

Initially, the HRC was not very concerned about keeping the look and feel of the original files, as the priority was on retaining the information. As the project progressed, further discussion made clear that retaining as much as possible of the “look and feel” is a concern, especially in the case of the hypertext novels, where aesthetics plays a major role. We had access to the new version of Storyspace, so the hypertext novels are at the moment accessed within the limitations of the available version and technology. This returns us to the two levels of digital preservation discussed by Smith (2003): we were able to achieve both, to a certain point, because the new version of Storyspace allows us to retain the functionality of the files even if the “look and feel” is not exactly the same as when they were originally used. Emulation was considered as a way to recover the “look and feel,” but was rejected because the files could be accessed with the new version of Storyspace.

Kenneth Thibodeau’s (2002) definition of digital objects as being physical, logical and conceptual objects is appropriate to summarize some issues related to the long-term preservation of digital objects. For this author: “A physical object is simply an inscription of signs on some physical medium. A logical object is an object that is recognized and processed by software. The conceptual object is the object as it is recognized and understood by a person, or in some cases recognized and processed by a computer application capable of executing business transactions.” He states that in order to preserve the digital object we must identify and retrieve its digital components. In this sense “The process of digital preservation then, is inseparable from accessing the object” and that is why for the author “the black box for digital preservation is not just a storage container: it includes a process for ingesting objects into storage and a process for retrieving them from storage and delivering them to customers. These processes, for digital objects, inevitably involve transformations.” As we have seen, our project required us to compromise on the types of transformation that would be acceptable, in order to keep as much as possible of the “original” look and feel of Joyce’s materials.

According to the OAIS (2002) there are two types of transformation: reversible and non-reversible. In the preservation field this concept is controversial, because we know that even if we use reversible techniques and materials it is impossible to reverse a conservation treatment without changing the object in a certain way. We would argue that complete reversibility is impossible with both traditional physical objects and digital objects. Migrating from one version of a format to the next introduces some difference between the original bitstream and the new one; therefore it is important not only to note that the change was made, but also to keep the original version.

During the project we considered migration as the preservation strategy that we could use in order to access some of the digital records. We migrated the MacPaint files to Portable Network Graphics (PNG) files using a freeware application that is capable of batch processing. We did not migrate Storyspace, Hypercard, and HTML because these file types are still accessible using current software. Finally, MacWrite files could neither be opened nor migrated because the original software cannot run on the newer Macintosh operating systems and conversion cannot take place using anything other than specific commercial software. Microsoft Word and Excel files comprised the majority of the document types recovered, but conversion one-by-one was deemed to be too labor intensive without the use of commercial software capable of batch conversion. Furthermore, these files are still accessible using the Macintosh version of Microsoft Office.

Collection Implementation & Ingest

Appraisal and the resulting metadata complete the necessary preparations prior to ingest of the documents into the digital repository. DSpace version 1.2 was used to create a digital repository of the Joyce works, starting initially with a small portion of the total documents in order to test and verify procedures to be used for implementation. Two major tasks were accomplished during this test: the development of the repository structure and access rights, and the ingest of documents.

A hierarchy of communities and collections must be created prior to ingesting materials into DSpace. Before these can be created, however, access controls must be designed for defining both access to collections and the submission workflow process. DSpace currently allows two levels of administration: one for collection level administration (Collection Administrator) and another for global administration (Site Administrator). Unfortunately, there is no administrator role for community level administration, so in order to create the communities and sub-communities within DSpace, Site Administrator permissions were granted to one of our group. The collection structure, workflow assignments, and permissions regime had to be established during the short window in which Site Administrator permissions were granted.

The first step in the administration process is to create the community-collection hierarchy. DSpace allows a nested community structure where each community may contain both collections and sub-communities. The nomenclature used by DSpace immediately came into conflict with the nomenclature used in archival practice, where a collection is a hierarchy of series and sub-series. It seemed intuitive at first to create a community to identify the institution holding the collection (HRC), then create a master collection for the Joyce works and subordinate categories for the series that had been defined during appraisal. Unfortunately, DSpace does not support a hierarchy that is analogous to a series, not even nested collections. To remedy this problem, each series was assigned to a sub-community which would then hold collections for each sub-series. If a further level of sub-series were required, another sub-community would be created under the series community. Thus, the nomenclature used in DSpace to model the collection redefined series as communities and inverted the relationship between series/communities and collections. This subtle distinction is likely to create confusion as DSpace is implemented in archival endeavors.
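The mapping described above, in which series become communities and the lowest level of sub-series becomes a collection, can be modeled as a small data structure. This is a sketch of the logic only and does not call any DSpace interface; the class and function names are hypothetical, and the work names in the usage example are placeholders rather than actual titles from the arrangement.

```python
from dataclasses import dataclass, field


@dataclass
class Collection:
    """Stands in for a DSpace collection (here, a lowest-level sub-series)."""
    name: str


@dataclass
class Community:
    """Stands in for a DSpace community or sub-community (here, a series)."""
    name: str
    sub_communities: list = field(default_factory=list)
    collections: list = field(default_factory=list)


def series_to_community(series_name, subseries):
    """Map an archival series onto a community.

    subseries maps each sub-series name either to None (a leaf, which becomes
    a collection) or to a nested dict (a further level of sub-series, which
    becomes a sub-community holding its own collections).
    """
    community = Community(series_name)
    for name, children in subseries.items():
        if children:
            community.sub_communities.append(series_to_community(name, children))
        else:
            community.collections.append(Collection(name))
    return community
```

For example, `series_to_community("Works", {"Work A": None, "Work B": {"Drafts": None}})` yields a "Works" community holding one collection ("Work A") and one sub-community ("Work B") that in turn holds a "Drafts" collection. Writing the mapping out this way makes the inversion visible: what an archivist would call part of a collection is represented as a container of collections.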

Workflow steps had to be defined concurrently with the collection structure. DSpace has a workflow and permissions structure that allows individual users (E-people) and groups of E-people to be assigned to specific workflow steps. Up to three roles may be assigned, including accept or reject submission, edit metadata, and edit metadata with accept or reject ability. When deciding which workflow configuration to use, two possible alternatives were devised: one assumed that Michael Joyce or a designated representative would be allowed to submit items to the collections, and the other considered that HRC staff would submit all items to the collections (see Figure 3). The former workflow process requires more steps for review of submitted documents and metadata review owing to the notion that the submitter may not completely describe the submitted item. The latter workflow process assumes that a minimum of supervision is required when trained staff is responsible for submitting items to the repository. The latter case was implemented.

Figure 3: Proposed DSpace workflow.

In addition to workflow establishment, access rights must also be assigned during the establishment of the collection. DSpace allows the assignment of read and/or write permissions on each level from bitstream to collection to community. The SIP agreement and HRC policies dictated that access to the bitstreams in the collections be restricted only to HRC patrons on the HRC premises. Additionally, submission of items to the repository would be restricted to designated HRC staff. For purposes of creation, all bitstreams were restricted to the collection administrators, while the public was allowed to read the listings of collections and items in the repository. Once the collection is completed, a special E-person will be created with certificate access from a designated workstation in the HRC reading room that will be allowed read access to all materials in the Joyce collection.

One series out of the entire collection – Works – was chosen to comprise a pilot ingest to test the assumptions and decisions made up to this point. Upon completing each of the previous steps, a hierarchy with permissions and workflow procedures was in place for each of the works (DSpace collections) within the Works series (DSpace community). A final set of decisions had to be made before ingest about how to handle items within each collection. The number of individual documents within each work varied from one to hundreds, which may be further graded by the different versions and instances. DSpace defines Items as the container for bitstreams within collections, which may contain Bundles of bitstreams or a single bitstream. Metadata is defined at the Item level with limited metadata for each bitstream (e.g.: file identifier, file size or extent, checksum, etc.).

During the digital archeology and appraisal processes, metadata was collected for each bitstream that goes beyond that which DSpace captures (e.g.: creation date, modification date, etc.). This left us with a quandary about how to represent each semantic item within DSpace and still maintain the maximal amount of bitstream-level metadata. One solution to this problem is to create an item for each bitstream that would allow the maximal amount of metadata to be captured for each bitstream. Unfortunately, this approach would require a separate item for each file, which could number into the hundreds for each work, and would not group the bitstreams in the context of other bitstreams that were originally grouped (such as Web pages). Alternatively, an item could bundle all applicable bitstreams, to include migrated or converted versions, within the same DSpace item. Although this approach maintains maximum context, the accuracy of the metadata suffers.

The solution chosen for this implementation was to create a separate DSpace item for each semantic grouping (version, etc.) while keeping converted “use copies” together with the originals to allow users a choice of which version to view. Descriptions can be attached to a bitstream that can differentiate conversions from originals, while using a similar filename. Additionally, the item's provenance description field can be modified to record the applicable migration actions. In this way, the metadata that describes the semantic grouping remains intact while providing new versions for users in an appropriate context. Unfortunately, one problem remains in that each version of a work (DSpace item) within a collection has the same title, and DSpace does not present enough information at the collection level to differentiate between versions. It is poor practice to alter the title metadata to accommodate the differentiation of versions within the DSpace user interface – other metadata fields should be used to convey the distinction.

Observations on Working with DSpace

DSpace has a very thorough and robust data architecture. Unfortunately, a number of implementation issues arose during the creation of the pilot collection that should be noted. The vast majority of the problems encountered can be attributed to the user interface and intervening business logic. This is not a critique of user interface design as such, since the appearance and layout (look & feel) of a Web service is ultimately a question of style and organizational requirements. Rather, the problems encountered appear to be the result of incomplete implementation of the data model and inconsistencies in the user interfaces used in administering collections and submitting content.

For example, some of the most inconsistent interfaces were those involved in the management of workflow and permissions. Site Administrators may create groups of E-people to ease administration of permissions and workflow. A group may be assigned which can be changed at will in one place instead of having to change permissions at each access point. This is analogous to time-tested procedures in systems administration. Unfortunately, when creating a new community or collection, one is forced into a Web form for setting permissions that does not allow access to these pre-defined groups. One must set an arbitrary assignment, and then go back to the collection or community edit dialog to invoke a different Web form that will allow the assignment of an existing group. This creates another problem in the form of multiplicitous “default” groups created in the interim that clutter group selection dialogs across all applicable user interfaces. This is a clear example of a logical data design that is not properly implemented in the user interface.

Other critiques of and major problems that were encountered with DSpace are described below.

Macintosh file issues: Most of the discs processed for the Joyce collection were Macintosh formatted. First, DSpace uses MIME type identification based on Windows file extensions. Macintosh files do not use file type extensions in the filename, but use a creator code that is embedded within the file, which prevents DSpace from automatically recognizing the file type. Second, Macintosh files, particularly executable files, have more than one bitstream; they are split into a resource fork and a data fork. When such a file is uploaded via a Web form, only the data fork is sent; therefore, when pre-computing a checksum for verification, only the data fork needs to be hashed.

For most files, the only effect this distinction has is to remove filesystem metadata from the bitstream. If a file type is not manually set during ingest, the user may have no idea what to use to open the file since the identifying metadata was stripped in the process. For executables, however, functionality may be lost altogether. Thus, it is best when transferring Macintosh files over non-Macintosh networks to use some sort of file packaging scheme such as MacBinary, HQX, or Tar.

These issues are not so much a problem with DSpace as much as they are a persistent problem encountered as the result of working with Macintosh generated files.
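The first of these problems can be illustrated with any extension-based identifier. The sketch below uses Python's standard mimetypes module as a stand-in for DSpace's own lookup, which it does not reproduce. The fallback table keyed on classic Mac OS four-character type codes is an assumption made for illustration, as is the premise that the type code was recorded before the file left the Macintosh; the MIME label chosen for MacPaint files is likewise illustrative.

```python
import mimetypes

# Hypothetical fallback table from Mac OS type codes to MIME types.
# The entries are illustrative, not an authoritative registry.
TYPE_CODE_TO_MIME = {
    "TEXT": "text/plain",
    "PNTG": "image/x-macpaint",
}


def guess_mime(filename, mac_type_code=None):
    """Guess a MIME type from the extension, then from a recorded type code."""
    mime, _encoding = mimetypes.guess_type(filename)
    if mime is None and mac_type_code is not None:
        # No usable extension: fall back to the type code captured at transfer.
        mime = TYPE_CODE_TO_MIME.get(mac_type_code)
    return mime
```

A name such as "untitled draft" has no extension, so the extension lookup returns nothing and the file would arrive in the repository unidentified unless the type code had been captured separately.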

Metadata interaction: Two major issues were observed pertaining to metadata generation and workflow. First, for the majority of the items in our collection, the main author is a constant; therefore, it is preferred to not have to enter the author field for every submission. Unfortunately, the interface did not carry default values set at the collection level into item level metadata.

Second, when establishing workflow steps, we were under the assumption that the metadata editor would be permitted to edit more than the basic metadata presented to the submitter. Unfortunately, the same exact forms were presented to the metadata editor. This amounts to nothing more than a proofreading step and not a comprehensive metadata check as implied in the documentation. Detailed metadata editing can only be performed by collection administrators from the item editor.

Item importer as an automation tool: The item importer or bulk ingest utility was used for a number of items that contained many bitstreams. Any sub-series containing more than 10 bitstreams was selected for bulk ingest. The item importer saves a significant amount of time for ingest, but unfortunately, it can only be invoked on a per-collection basis, one at a time. In other words, if one wishes to import many items into more than one collection, separate commands must be issued for each collection and the file import configurations and metadata must be separated. The item importer would be much better as an automation tool if there were some way to map items to different collections.

Item importer issues: Numerous idiosyncrasies were encountered when invoking the item importer. First, a metadata file, expressed in Dublin Core, is required for each item to be ingested. The system crashed upon execution if any of the elements in the metadata file were empty (i.e.: no text value provided). Since empty elements in XML are completely valid, this sort of behavior should be anticipated by the XML parser to prevent such errors.

Second, the item importer requires a per-item content listing. This feature seems to be something of an evolutionary holdover given that the Java framework could easily recurse the item directory to get a file listing, ignoring the metadata file, rather than relying on accurate generation of a contents listing by the user. Furthermore, if an item should happen to contain a bitstream with the filename “contents” (as occurred once during this project), then a conflict occurs that requires either reversion to manual ingest, or a change to the original file to accommodate the system's architecture.

Finally, the item importer will crash if it encounters a file in the import directory. The system is expecting to recurse a directory and throws an error when the bitstream is encountered. Java should be able to differentiate a file from a directory and ignore it to prevent such errors.
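Several of these idiosyncrasies can be worked around when the import directories are prepared. The following is a minimal sketch, assuming the layout of DSpace's simple archive format as we understand it (a `dublin_core.xml` file of `dcvalue` elements and a `contents` listing in each item directory); the exact file names and attributes should be checked against the documentation for the installed version. The function name is hypothetical. It omits empty metadata values rather than writing empty elements, generates the contents listing from the directory itself, and refuses to overwrite an existing file named "contents" in case it is a bitstream.

```python
import os
import xml.etree.ElementTree as ET

METADATA_NAME = "dublin_core.xml"  # assumed name of the per-item metadata file
CONTENTS_NAME = "contents"         # assumed name of the per-item file listing


def write_item_files(item_dir, fields):
    """Write the metadata file and contents listing for one item directory.

    fields is a list of (element, qualifier, value) tuples. Returns the list
    of bitstream filenames written to the contents listing.
    """
    contents_path = os.path.join(item_dir, CONTENTS_NAME)
    if os.path.exists(contents_path):
        # Could be a real bitstream that happens to be called "contents".
        raise FileExistsError(contents_path)

    root = ET.Element("dublin_core")
    for element, qualifier, value in fields:
        if value is None or not value.strip():
            continue  # leave out empty values instead of emitting empty elements
        node = ET.SubElement(root, "dcvalue", element=element,
                             qualifier=qualifier or "none")
        node.text = value.strip()
    ET.ElementTree(root).write(os.path.join(item_dir, METADATA_NAME),
                               encoding="utf-8", xml_declaration=True)

    # List only regular files, skipping the metadata file itself.
    bitstreams = sorted(
        name for name in os.listdir(item_dir)
        if name != METADATA_NAME and os.path.isfile(os.path.join(item_dir, name))
    )
    with open(contents_path, "w", encoding="utf-8") as handle:
        for name in bitstreams:
            handle.write(name + "\n")
    return bitstreams
```

Generating both files from the same pass over the directory removes the dependence on a hand-built listing, although a bitstream actually named "contents" would still need to be handled manually.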

Web page rendering issues: DSpace can render Web sites and pages within the user interface by making a special handling exception for recognized HTML files. The interface also makes exceptions to allow navigation and link embedding within such pages if they are part of the same item. Unfortunately, the bulk ingest process seems to confound this process. Although the items are properly ingested into the same item grouping with the same handle, each bitstream is assigned an index number within the item. Therefore, each bitstream is effectively blocked from seeing the others, thus rendering linking and images non-functional. The effective rendering of archived Web sites remains a major hurdle for DSpace.

Semantic Divergence: Other problems arise because of the semantic divergence between archival practice and the DSpace framework as described earlier. The treatment of licenses in the DSpace system illustrates this problem. When an item is submitted to DSpace, a generic, site-wide license is appended to the item in the form of a text file. This license is identical for all submissions to the system, regardless of the actual terms that may be defined externally to the system by a SIP or other agreement. Additionally, publication date is preferred over such things as date of creation. Such behavior is understandable if one takes the idea of a single organization running DSpace for a single purpose – particularly, that of a publishing-oriented institutional repository. The Joyce files alone have three different rights regimes governing the various series, which renders inappropriate the idea of a single, site-wide license structure. This issue and the collection implementation challenges addressed earlier are examples of how the DSpace assumption of “one size fits all” is problematic for translating archival practices into DSpace implementation. The software may be re-programmed to accommodate these issues, but it is unclear how such customization would affect software upgrades and interoperability between DSpace instances, as is required for succession of control between repositories.

In summary, these are all important issues to consider for improving DSpace as an archival platform, but it should be noted that the scope of the project that the DSpace federation has undertaken is massive. The work performed thus far is exemplary and the criticisms put forth here should not be taken as an invalidation of the efforts to date.

Conclusion

As Kenneth Thibodeau (2002) stated, “The preservation of digital objects involves a variety of challenges, including policy questions, institutional roles and relationships, legal issues, intellectual property rights and metadata”, and during the project we found that all of these issues were indeed involved. In participating in this project, we discovered that digital preservation is fraught with technical difficulties and unexpected problems. We are fortunate that much of the technology that is currently becoming obsolete is still largely available and that there are people who still have the skills and knowledge to use it. Such luxuries will not persist as hardware and software for legacy systems become scarce and the required knowledge fades. As demonstrated, even emulation or other simulations will not remove the need for adequate tools for digital archeology and appraisal. Archives today are woefully unprepared for the massive change that is about to envelop them as digital preservation needs increase. We hope projects like this will reveal to institutions across campus and across the world the need to take action and preserve our digital heritage now.

Notes

1 It is interesting to highlight that when we began the project, we noticed that many floppy labels said: “b/u”. At first we did not understand what it meant and we even read “blu” because sometimes the handwriting was not clear enough. As the project progressed, we realized that “b/u” and “blu” meant “backup”.

2 It was interesting, though, that the word “contour”, for example, was widely used for different types of files. “Contour” was part of the name of a published work, but it was also an important concept for him, as in “contours of consciousness”. Joyce used “contour” as a title to identify files where he discussed topics related to this concept, so at first we were unable to determine where these files belonged in our arrangement. We finally determined that all files relating to contour or contours belonged in our Series II. Subseries A. Scholarly Material, as they were published works from his academic career.

References

Consultative Committee for Space Data Systems (CCSDS) (2002). Reference model for an open archival information system (OAIS). Retrieved from http://www.ccsds.org/documents/650x0b1.pdf.

Gilliland-Swetland, A. (2000). Setting the stage. In Introduction to Metadata: Pathways to Digital Information. Retrieved from http://www.getty.edu/gri/standard/intrometadata/2_articles/index.htm.

International Council of Archives (1984). Dictionary of archival terminology. New York: ICA.

LeFurgy, W. (2002). Levels of service for digital repositories. D-Lib Magazine (May 2002). Retrieved from http://www.dlib.org/dlib/may02/lefurgy/05lefurgy.html.

MacKenzie S., et al. (2003). DSpace: An open source dynamic digital repository. D-Lib Magazine (January 2003). Retrieved from http://www.dlib.org/dlib/january03/smith/01smith.html.

Thibodeau, K. (2002). Overview of technological approaches to digital preservation and challenges in the coming years. In The State of Digital Preservation: An International Perspective. Washington, D.C.: CLIR. Retrieved from http://www.clir.org/pubs/reports/pub107/pub107.pdf.

US-InterPARES Project (2002). Findings on the preservation of authentic electronic records. Retrieved from http://www.gseis.ucla.edu/us-interpares/pdf/InterPARES1FinalReport.pdf.

Attachment	Size
joyce_app_a.xml	18.36 KB

Digital Preservation Plan for the Texas Legacy Project

tkiehne — Fri, 13 May 2005 07:25:49 +0000

This plan was commissioned during the Spring of 2005 on behalf of the Conservation History Association of Texas (CHAT), a non-profit entity based in Austin, Texas. CHAT desired a comprehensive plan to ensure the long-term preservation of hundreds of hours worth of digital video and audio comprising the association's collected works. The plan includes a needs assessment and inventory of the assets in place and a review of the literature concerning digital media, storage hardware, software formats, and digital repositories.

Attachment	Size
CHAT-plan-complete.pdf	533.15 KB

An OAIS Ingest Metadata Specification

tkiehne — Tue, 14 Dec 2004 06:40:48 +0000

Problem Definition

For this exercise, we will prepare a digital object for submission to a digital archive for long term preservation. The digital object in question is an HTML text with an in-line image and links to several other HTML texts. The objects must be readable, but the specific look and feel of the rendered text is not important. We are to generate a metadata set that will conform to ingest requirements for an Open Archival Information System (OAIS) according to a Submission Information Package (SIP) agreement. Additionally, we will consider the process for converting a SIP into an Archival Information Package (AIP) and extend the metadata set with additional elements for the conversion process, as needed.

OAIS Ingest and SIP

The OAIS model outlines a system-agnostic framework to ensure reliable, long-term preservation of information. The process begins with the submission of an information package from a producer of information (publisher, author, researcher, etc.) to the archive. The information is submitted as one or more SIP objects that conform to the archive's SIP agreement. For a digital archive, the information is sent electronically along with descriptive metadata. The components of an OAIS information package are shown in Figure 1 (CCSDS, 2002, §4).

Figure 1: OAIS information package (CCSDS, 2002, p. 4-31)

As shown above, the information package is comprised of Content Information (CI) and Preservation Description Information (PDI). In the current exercise, the content information consists of the HTML code and image file, or pointers to these resources, and the representation information necessary to decode the content of the digital objects. The Packaging Information binds the CI and PDI by some means, including universal identifiers, encoding (such as used for a CD-ROM), or some other system-specific method.
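As a rough illustration of how these parts relate, the package can be modelled as three nested records. This is only a sketch: the class and field names, the file names, and the identifier are hypothetical and are not taken from the OAIS model or from any repository software.

```python
from dataclasses import dataclass, field


@dataclass
class ContentInformation:
    """Data objects plus the representation information needed to decode them."""
    data_objects: list          # file names or pointers to the resources
    representation_info: dict   # file name -> format description


@dataclass
class PreservationDescription:
    """The four functional areas of the PDI."""
    provenance: dict = field(default_factory=dict)
    context: dict = field(default_factory=dict)    # labelled "Content" in this exercise's tables
    reference: dict = field(default_factory=dict)
    fixity: dict = field(default_factory=dict)


@dataclass
class InformationPackage:
    """CI and PDI bound together by packaging information."""
    content: ContentInformation
    pdi: PreservationDescription
    packaging: dict = field(default_factory=dict)

    def missing_representation_info(self):
        """Return the data objects that lack representation information."""
        return [name for name in self.content.data_objects
                if name not in self.content.representation_info]


# Hypothetical file names standing in for the HTML text and its in-line image.
package = InformationPackage(
    content=ContentInformation(
        data_objects=["index.html", "figure.gif"],
        representation_info={"index.html": "text/html"},
    ),
    pdi=PreservationDescription(reference={"identifier": "example-0001"}),
    packaging={"binding": "shared identifier"},
)
print(package.missing_representation_info())
```

A check of this kind is one way an archive might confirm that every data object in a submission can be decoded before the package is accepted.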

Descriptive information is derived from the PDI or declared prior to submission. PDI contains the metadata needed for conversion into an AIP and further storage within and access from the repository. Four functional areas are defined in the PDI, as summarized in Table 1.

PDI Area	Category	Elements
Provenance	Source	Description and pointer to original object(s); Copyright/legal restrictions; Access restrictions; Authority to modify representation information or migrate; Agreements with external organizations
Provenance	History	Change history; Pointers to other versions; Custody since origination
Content	Relationship to other objects	Description; Pointers to other objects
Content	Purpose/Reason for creation	Reason for creation; Reason for archiving
Content	Encoding environment	Software; Languages; Character set(s)
Reference	Unique identifier(s)	URI, ID Number, etc.
Reference	Bibliographic description	Creator(s), organization(s); Date of creation; Title(s); Etc.
Fixity	Authenticity indicators	Checksum, CRC, MD5 hash; Digital signatures; Encryption
Fixity	Quality of service requirements	Specification of integrity preserving mechanisms; Error protection specifications

Table 1: Elements of PDI (CCSDS, 2002, pp. 4-27 – 4-29)
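The authenticity indicators listed under Fixity in Table 1 can be made concrete with a short sketch. It assumes that the producer records a digest at submission and that the archive recomputes it on receipt. Table 1 names MD5; SHA-256 is used as the default here purely as an assumption, and the function names and sample content are hypothetical.

```python
import hashlib


def fixity_digest(data, algorithm="sha256"):
    """Return the hex digest of a byte string using the named hashlib algorithm."""
    digest = hashlib.new(algorithm)
    digest.update(data)
    return digest.hexdigest()


def verify_fixity(data, recorded, algorithm="sha256"):
    """Compare a freshly computed digest with the value recorded at submission."""
    return fixity_digest(data, algorithm) == recorded.lower()


# Hypothetical content standing in for the submitted HTML file.
original = b"<html><body><p>Example text</p></body></html>"
recorded = fixity_digest(original)

print(verify_fixity(original, recorded))         # unchanged bytes
print(verify_fixity(original + b" ", recorded))  # a single added byte
```

The first check succeeds and the second fails, which is the behavior a fixity value is meant to provide: any change to the bytes, however small, is detected.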

A Model SIP

When approaching the problem of metadata creation, two basic approaches may be used. First, we may generate a new metadata set that conforms to the archive's requirements. Generating a new set allows the specific project requirements to be addressed in detail, but requires a significant amount of labor to produce the DTD or schema and the tools to use them. Furthermore, a new metadata set does not leverage existing standards and practices for interchange. The second, and preferred, approach is to use or extend an existing metadata specification.

A review of existing metadata specifications reveals that the Metadata Encoding and Transmission Standard (METS) provides a framework for many of the necessary elements for a SIP (Library of Congress, 2004a). The first two columns of Table 2 show a basic mapping of METS container elements to OAIS ingest information. Since METS does not provide detailed descriptive metadata elements, other metadata schemes must be used to complete the SIP. Source metadata schemes and sets for detailed metadata are shown in the third column of Table 2. External metadata sets can be linked from the METS container using <mdRef>, or embedded within it using one of the metadata types allowed in <mdWrap>.

OAIS Ingest Information	METS Element Location(s)	Source for Specific Metadata
Preservation Description Information	all elements nested under <mets> root
Provenance
Description of original object(s)	<amdSec> <sourceMD>	Metadata Object Description Schema: <relatedItem type="original">
Rights management information	<amdSec> <rightsMD>	Rights Declaration Extension Schema: <RightsDeclarationMD>
Access restrictions	<amdSec> <rightsMD>	Rights Declaration Extension Schema: <RightsDeclarationMD> <Context>
Agreements with creator(s)	<amdSec> <rightsMD>	Rights Declaration Extension Schema: <RightsDeclarationMD> <Context>
Agreements with external organizations	<amdSec> <rightsMD>	Rights Declaration Extension Schema: <RightsDeclarationMD> <Context>
Change history	<amdSec> <digiprovMD>	Metadata Object Description Schema: <originInfo> <dateModified>
Pointers to other version(s)	<amdSec> <digiprovMD>	Metadata Object Description Schema: <relatedItem type="otherVersion">
Custody since origination	<amdSec> <digiprovMD>	Metadata Object Description Schema: <note type="bibliographic history">
Content
Relationship to related objects	<amdSec> <digiprovMD>	Metadata Object Description Schema: <relatedItem type="XXX">
Pointers to related objects	<amdSec> <digiprovMD>	Metadata Object Description Schema: <relatedItem xlink="XXX"> <location>
Reason for creation	<amdSec> <digiprovMD>	Metadata Object Description Schema: <note type="bibliographic history">
Reason for archiving	<amdSec> <digiprovMD>	Metadata Object Description Schema: <note type="conservation history">
Original encoding/technical environment	<amdSec> <techMD>	Schema for Technical Metadata for Text: <byte_order>, <charset>, <encoding>, <markup_language>
Reference
Unique identifier(s)	<dmdSec>	Metadata Object Description Schema: <identifier>
Bibliographic description	<dmdSec>	Metadata Object Description Schema: <titleInfo>, <name>, <subject>, <language>, <typeOfResource>, <genre>, <abstract>, <originInfo>, <physicalDescription>
Fixity
Authenticity indicators	<fileSec> <fileGrp> <file CHECKSUM="XXX">	N/A
Quality of service requirements	<amdSec> <techMD>	Schema for Technical Metadata for Text: <viewingRequirements>
Content Information	all elements nested under <mets> root
Data object(s)	<fileSec> <fileGrp> <file> <FLocat> <FContent> <binData> | <xmlData>	N/A
Packaging Information	all elements nested under <mets> root
Relationships between data objects	<structMap> <div> <mptr> | <fptr> <structLink>	N/A

Table 2: Basic OAIS ingest and METS container mapping (derived from: Tingle, 2004; Library of Congress, 2003; Library of Congress, 2004b)
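The container layout in Table 2 can be sketched as a skeleton METS document. The following is a minimal illustration built with Python's standard XML library; it is not the profile attached to this exercise. The metadata sections are left empty, whereas a real SIP would place <mdRef> or <mdWrap> content inside each one, and the file names, identifiers, and checksum values are placeholders.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def mets(tag):
    """Qualify a tag name with the METS namespace."""
    return "{%s}%s" % (METS_NS, tag)


def build_sip_skeleton(files):
    """Build the Table 2 containers for (file_id, href, mimetype, checksum) tuples."""
    root = ET.Element(mets("mets"))

    # Reference information: identifiers and bibliographic description.
    ET.SubElement(root, mets("dmdSec"), ID="DMD1")

    # Administrative metadata: technical, rights, source and provenance sections.
    amd = ET.SubElement(root, mets("amdSec"), ID="AMD1")
    for section in ("techMD", "rightsMD", "sourceMD", "digiprovMD"):
        ET.SubElement(amd, mets(section), ID=section.upper() + "1")

    # Content information: one <file> per data object, carrying its fixity value.
    group = ET.SubElement(ET.SubElement(root, mets("fileSec")), mets("fileGrp"))
    # Packaging information: a single <div> pointing at every file.
    div = ET.SubElement(ET.SubElement(root, mets("structMap")), mets("div"))
    for file_id, href, mimetype, checksum in files:
        entry = ET.SubElement(group, mets("file"), ID=file_id,
                              MIMETYPE=mimetype, CHECKSUM=checksum)
        ET.SubElement(entry, mets("FLocat"),
                      {"LOCTYPE": "URL", "{%s}href" % XLINK_NS: href})
        ET.SubElement(div, mets("fptr"), FILEID=file_id)
    return root


# Placeholder entries for the HTML text and its in-line image.
sip = build_sip_skeleton([
    ("F1", "index.html", "text/html", "placeholder-checksum-1"),
    ("F2", "figure.gif", "image/gif", "placeholder-checksum-2"),
])
print(ET.tostring(sip, encoding="unicode"))
```

Keeping the file section and the structural map in step, as the loop does here, means that every data object listed as Content Information is also reachable through the Packaging Information.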

Augmenting the SIP to Facilitate Conversion to an AIP

Figure 2 illustrates the complete OAIS ingest process. Once all of the component SIPs have been received and checked for conformance to the archive's ingest specifications, the SIPs are enhanced and reformatted according to the archive's technical standards, as necessary. Workflow procedures may require an administrative audit of the submitted information package. Once these processes are complete, the AIP is created, descriptive metadata and “browse products” are extracted, then the AIP is sent to data management.

Figure 2: OAIS ingest process (CCSDS, 2002, p. 4-5)

In the process of converting SIPs to an AIP, new metadata is created. To ease the conversion process, the SIP format should include element definitions for these items of information. Additional metadata for the SIP to AIP conversion are shown in Table 3 along with the METS container location and prescribed metadata element source.

Additional Information	METS Element Location(s)	Source for Specific Metadata
Ingest history	<amdSec> <digiprovMD>	Metadata Object Description Schema: <note type="acquisition"> <note type="conservation history">
Quality assurance (QA) results	<amdSec> <techMD>	Schema for Technical Metadata for Text: <processingNote>
History of formatting and encoding changes made during conversion process	<amdSec> <digiprovMD>	Metadata Object Description Schema: <note type="conservation history">
Administrative audit reports	<amdSec> <digiprovMD>	Metadata Object Description Schema: <note type="admin">
Additional descriptive or classification data	<dmdSec>	Metadata Object Description Schema: <classification>, <subject>
Relationship to archive objects or collections	<amdSec> <digiprovMD>	Metadata Object Description Schema: <relatedItem>
Internal identifier(s)	<dmdSec>	Metadata Object Description Schema: <identifier>

Table 3: Additional metadata required for SIP to AIP conversion (derived from: Tingle, 2004; Library of Congress, 2003; Library of Congress, 2004b).
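To show how the ingest history rows of Table 3 might be recorded, the sketch below appends MODS notes to an administrative metadata section as the SIP moves through ingest. The helper name, the identifiers, and the note texts are hypothetical; only the element names and note types come from the tables above.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
MODS_NS = "http://www.loc.gov/mods/v3"


def add_provenance_note(amd_sec, section_id, note_type, text):
    """Append a <digiprovMD> section wrapping one MODS <note> of the given type."""
    prov = ET.SubElement(amd_sec, "{%s}digiprovMD" % METS_NS, ID=section_id)
    wrap = ET.SubElement(prov, "{%s}mdWrap" % METS_NS, MDTYPE="MODS")
    data = ET.SubElement(wrap, "{%s}xmlData" % METS_NS)
    mods = ET.SubElement(data, "{%s}mods" % MODS_NS)
    note = ET.SubElement(mods, "{%s}note" % MODS_NS, type=note_type)
    note.text = text
    return note


# Hypothetical events recorded while converting a SIP into an AIP.
amd_sec = ET.Element("{%s}amdSec" % METS_NS, ID="AMD1")
add_provenance_note(amd_sec, "DP1", "acquisition",
                    "SIP received and checked against the ingest specification.")
add_provenance_note(amd_sec, "DP2", "conservation history",
                    "Character encoding normalized during conversion.")

notes = amd_sec.findall(".//{%s}note" % MODS_NS)
print([note.get("type") for note in notes])
```

Because each event becomes its own section, the record of what was done to the package grows by appending rather than by rewriting earlier entries.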

Example SIP Profile

The appendix contains a METS Profile Schema document describing the requirements enumerated in Tables 2 and 3. At the end of the profile is an <Appendix> element containing an example ingest package encoded according to the profile specification.

References

Consultative Committee for Space Data Systems (CCSDS) (2002). Reference model for an Open Archival Information System (OAIS). CCSDS 650.0-B-1 BLUE BOOK. Retrieved on 28 November, 2004, from http://ssdoo.gsfc.nasa.gov/nost/wwwclassic/documents/pdf/CCSDS-650.0-B-1.pdf.

Library of Congress (2003). METS news and announcements: Draft rights declaration schema is ready for review. Retrieved on 11 December, 2004, from http://www.loc.gov/standards/mets/news080503.html.

Library of Congress (2004a). METS: An overview & tutorial. Retrieved on 7 December, 2004, from http://www.loc.gov/standards/mets/METSOverview.v2.html.

Library of Congress (2004b). Metadata Object Description Schema (MODS). Retrieved on 11 December, 2004, from http://www.loc.gov/standards/mods.

McDonough, J. (2003). Schema for Technical Metadata for Text. Retrieved on 11 December, 2004, from http://dlib.nyu.edu/METS/textmd.html.

Tingle, B. (2004). METS 1.3 schema documentation. Retrieved on 9 December, 2004, from http://ark.cdlib.org/mets/schema_documentation.

Attachment	Size
mets_sip_example.xml	18.98 KB

Technologies of Access and the Cultural Record

tkiehne — Thu, 02 Dec 2004 05:44:57 +0000

"Celestial Jukebox" or Digital Dark Age?

A Question of Information Access

Technologies of access redefine the social and cultural aspects of information access. Areas directly affected by this shift include fair use of copyrighted works and the balance of control over statutory rights. Considered over the duration of copyright, the long-term effects of new access regimes could be more extreme. Assuming that technological controls prevail over the public interest in information access, several questions must be asked: Can public access be preserved as information becomes predominantly digital? If not, does our society face a scenario where knowledge and our collective cultural record will be preserved only to the extent that it is profitable?

Although we can safely assume that the printed book and other physical forms of information are not likely to disappear from our libraries, new ways of retrieving information via digital media will have a significant effect on access. Digital information objects such as e-books and online information services may be controlled in ways that are not practical for their analog counterparts. The premise of library shelves full of locked books, or of journals that suddenly vanish after a few readers have viewed their pages, seems incredible. Yet with digital objects protected by access restriction technologies, such occurrences are not so unlikely.

Libraries, Now and in the Future

Libraries cater to all sorts of clientèle, including children, adults, and scholars. Users rely on libraries to provide access to information for many purposes, whether for research or academics, or for personal improvement and fulfillment. No single library contains all possible works within its walls. Libraries are networked in such a way, however, that if specific materials are desired, and are physically retrievable, then access to those materials may be obtained through alternate means. First sale doctrine and interlibrary loan comprise the traditional services that a library provides to the public as explicitly permitted under copyright law (17 USC §108 & 109). Because of this, access to a library's holdings is not revoked if a book goes out of print for whatever reason or if copyright terms are extended. Once a library has an item, access to that item is only affected by physical factors such as distance, the condition of the objects, and the funds that a library or its users have available to facilitate access.

Physical limitations and changes in information seeking behavior encourage libraries to implement digital information services (Bertot, 2003; Moyo, 2004). Subscriptions to digital content aggregators and publishers increase the number of works that a library can make available without having to increase its physical capacity. Furthermore, digital information services shift the focus of information seeking from the container to the content within, which arguably serves the needs of users accustomed to finding information on the Internet. Whether these digital services reduce the procurement costs expended by a library over time is not fully known (Bertot, 2003, p. 222), but the benefits to the users are usually enough to justify their implementation.

Current digital services are subscription based (Bertot, 2003, p. 222), and are offered by publishers such as Reed Elsevier (www.reedelsevier.com) and online content aggregators such as netLibrary (www.netlibrary.com). As these services continue to mature, the aforementioned benefits will be accompanied by significant disadvantages. Licenses that govern digital services will be enforced by technologies that shift control over access from libraries to the entities that provide the services. Such a shift affects uses that are traditionally allowed by copyright law. To understand the implications of this shift we must understand the technologies of access and the laws that affect them.

Technologies of Access

Copying is an intrinsic property of digital information. When users view a text via a digital information service, they view a copy that was ostensibly derived from an authoritative original maintained by the service provider. Information services currently exercise very little control over what happens with the copy that is provided beyond informing users of the terms of contract and protecting the copy with basic access controls. Once a copy of a text is made, the copy can be removed from its licensing environment and thus from its contractual restrictions – the license only restricts the original user. It is this loss of control over the copies that compels content providers to pursue technological means of contractual enforcement.

Digital access controls are more broadly known as Digital Rights Management (DRM). DRM involves converting digital objects to an encrypted form that is configured to allow access only under certain conditions. Access policies may be established at very granular levels for a variety of tasks, such as read-only access or the ability to copy (Erikson, 2003, pp. 35-36). For example, if a library were to subscribe to a publication in digital form, the governing contract could specify that only a certain number of users at a time may view the publication, which would be monitored by software designed to access DRM-aware objects. Should the maximum number of concurrent users be reached, the system's licensing policy might allow additional access for an extra fee (Stefik, 1997, Section D).
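The concurrent-user scenario above can be sketched as a small policy check. This is a hypothetical model of such a license, not a description of any actual DRM system; the class name, the fee amount, and the user names are invented for illustration.

```python
class ConcurrentUseLicense:
    """Hypothetical license: a fixed number of simultaneous readers, extras billed."""

    def __init__(self, max_readers, overflow_fee=None):
        self.max_readers = max_readers
        self.overflow_fee = overflow_fee  # None means extra readers are refused
        self.active = set()
        self.fees_charged = 0.0

    def request_access(self, user):
        """Grant access within the limit, bill for overflow, or refuse."""
        if user in self.active:
            return True
        if len(self.active) < self.max_readers:
            self.active.add(user)
            return True
        if self.overflow_fee is not None:
            self.active.add(user)
            self.fees_charged += self.overflow_fee
            return True
        return False

    def release(self, user):
        """End a reading session, freeing a slot for another reader."""
        self.active.discard(user)


# A license for two readers that bills for each additional one.
billed = ConcurrentUseLicense(max_readers=2, overflow_fee=1.5)
granted = [billed.request_access(name) for name in ("ana", "ben", "cam")]

# A license for one reader with no overflow provision.
strict = ConcurrentUseLicense(max_readers=1)
strict.request_access("ana")
refused = strict.request_access("ben")

print(granted, billed.fees_charged, refused)
```

Even in this toy form, the policy has only two outcomes for a reader beyond the limit, payment or refusal, and the rule is applied without regard to the purpose of the use.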

As it stands now, such granular control over a digital object is not possible without imposing excessive costs on the participating agency. For DRM to perform as described requires more than object-level control – it is necessary that the systems that access protected content respect these controls. These so-called “trusted systems” include hardware and software that are certified to comply with DRM controls (Stefik, 1997; Erikson, 2003). It is conceivable that trusted systems could restrict transfers between software and devices, such as denying the ability to cut-and-paste text from a controlled work. Additionally, “watermarking” could be implemented to prevent capture of audio or video by devices external to the trusted system.

From the publisher's perspective, DRM is an ideal technology for controlling the use of digital objects. As a means of modeling the social expectations of copyright, however, DRM's binary architecture is not so ideal. Copyright is a deliberately “leaky” system that contains many, often loosely defined, exceptions to certain enumerated rights. Content provided under a DRM-controlled contract can readily overstep the boundaries of copyright law (Cohen, 1998, p. 472; Samuelson, 2003, p. 48). For instance, a system that is programmed to prevent copying will not know how to differentiate a fair use copy from an illegal copy (Felten, 2003, p. 58). One might say that if a copy is allowed by law, then any means by which it can be made should be allowed. Unfortunately, it is not that simple. Once control is removed for a legal use, how will unauthorized uses of that copy be prevented? A fundamental conflict arises between arbitrary copyright exceptions and rigid access controls.

The Digital Millennium Copyright Act (DMCA, PL 105-304) was designed to update copyright law in anticipation of technological changes. Some of the most prominent portions of the DMCA criminalize the circumvention of access controls and the development and distribution of tools that can do the same (17 USC §1201). The statute simultaneously states that nothing in the circumvention prohibitions affects the rights of fair use or any of the exceptions granted in the Copyright Act (17 USC §1201(c)). Assuming that technological controls mature and computing environments become complicit in enforcing these controls, these assurances are rendered virtually useless (Burk & Cohen, 2001, p. 54). The conflict between the code of DRM and the code of law is embodied in section 1201. As a result, copyright is defined by contract, enforced by code, and leaves no legal recourse to do what would otherwise be legal.

Failures of Access Control

Let us assume that the current trends in rights management technologies and the laws that affect them continue unabated into the future. We can envision a time when DRM and trusted systems lock down digital information in all its forms, including the subscription services used by libraries to increase inventory and better serve their users. Every conceivable action, including reading, copying, and printing, can now be audited by the service provider and billed incrementally to the library or passed through to the user, as defined by contract. Setting aside privacy concerns for a moment (see writings by Julie Cohen), we can already see how fair use is revoked in this environment (Erikson, 2003; Felten, 2003; Samuelson, 2003). But what other effects will such control have?

Should the majority of a library's digital offerings be provided in the form of service subscriptions, collection management decisions are delegated to the service providers. Decisions by the service provider that affect the type, quantity, and character of their offerings will directly determine what is available to the library's users. One might ask whether collection management would become irrelevant if all possible works were made available, much like Goldstein's “celestial jukebox” (2003). Perhaps it would, much further in the future, when digital storage is essentially free and the difficulties of preserving the reliability of and access to such an enormous volume of data are resolved. Until then, there are several points of failure that may reduce or eliminate access to digital works. These failures may be characterized as technical, economic, and social.

Technical failure is already at issue in current subscription services. If a library has a subscription to a periodical, and later cancels the subscription, patrons may still use the copies that were received before cancellation. Under the digital model, however, cancellation of a subscription may leave the library without access to any of the periodicals (Moyo, 2004, p. 229). Technical failures of this sort result from the characteristic differences between digital and print media.

Additional failures of access are characterized by economic instabilities. Assuming that a service provider must prioritize its holdings because of technical limitations, the relative value of the works will be influential. Only a small percentage of works has an economically viable life approaching the current term of copyright (Rappaport, 1998, p. 4). Likewise, it can be expected that many works beyond a certain age will fail to be of interest or use for other than historical purposes, especially in the case of scientific works or news. A service provider may audit usage data for its inventory and determine that certain works no longer meet the interest criteria to justify the expense of maintaining them. These works may simply be removed from the service (and, presumably, archived), or perhaps exchanged with other content aggregators. Unless a devalued work finds its way to another subscription service of the library, it will be inaccessible to the patrons. Permanent losses due to economic competition are unfortunate, since recent findings reveal that usage trends become difficult to predict, deviating from profitability as selection increases (Anderson, 2004).

Similarly, none but the most well-established publishers are likely to operate indefinitely. Publishers are bought and sold or otherwise succumb to economic changes. One hopes that the works controlled by a failing publisher would be transferred or otherwise preserved in some way. Since the decision is one of market value and not of value to the public, however, preservation of the works is not assured (Kuny, 1998). In recent years, digital collections have nearly vanished as a result of corporate volatility. For example, the music archives of MP3.com were nearly lost when the company was sold to CNET Networks in 2003 (Bialik, 2003). USENET news archives dating back to 1981 narrowly escaped disappearance when Google bought them in 2001 (Google, 2001). These cases illustrate how collections of digital information are susceptible to commodification.

Another economic threat to access results from the consolidation of service providers. Scholarly journals are currently undergoing a transition, due in part to the fact that fewer companies hold more of the assets while charging increasing rates for access (Ganshorn, 2002, pp. 1, 3). If such a trend manifests in other digital content services, libraries with smaller budgets could find themselves unable to gain access to some or all of the available holdings, thus perpetuating the digital divide.

A social failure of access is characterized by First Amendment concerns. Removing control of collections from a local agency to a centralized provider exposes the possibility that external pressures could force the removal of politically inexpedient works. The normalization of community standards could potentially affect all subscribers to the service. At the least, a service provider would be compelled to deny access to certain works in certain locales to satisfy complaints. If service providers tend to be risk-averse, such localized measures would circumvent the traditional barriers to censorship that community libraries currently employ.

Taken together, these failures result in the denial of access to works that library patrons seek. These losses may not be absolute; the market may provide remedies for some of these problems, or patrons could seek alternative facilities to find the information they desire. Furthermore, these failures are not completely foreign to traditional libraries, but the effects are more acute in the case of digital information. Unfortunately, the problem indicated by these points of failure is larger than that of mere convenience (Kuny, 1998). Libraries, from the Library of Congress down to the smallest local library, contain a vast amount of printed material that captures a significant portion of our cultural heritage in literature, music, and scholarly works. The library system, taken in its entirety, represents a massively redundant, fault-tolerant system for preserving the cultural record. Reexamining the digital situation just described, no such system exists for securing digital works beyond that of securing intellectual property. Commercial services have suddenly found themselves having to address issues previously relegated to public archives (Rosenzweig, 2003, p. 752). We may face a time when the only digital works that survive the coming decades are those that are the most profitable or popular.

Towards Preserving the Digital Cultural Record

The problem of long-term digital preservation is significant, even without considering the tension between access and rights. Research in digital archives addresses many of the key problems for ensuring the reliability of digital information across standards and exchanges. Digital archivists are wary of encryption for digital objects, often avoiding the problem altogether by not accepting encrypted objects into their repositories (Waugh, et al., 2000, p. 181). Such policies will not suffice for public information agencies that provide DRM-protected objects. Because of these complexities, long-term digital preservation must involve all parties, public and private, in a coordinated effort to ensure that the balance of public access and private compensation enshrined in copyright is maintained for digital information.

The complexities of the problem and the relatively recent ascension of the digital preservation field discourage the formulation of generalized solutions. We may look to policy to provide the impetus for action. For printed information, the Copyright Act contains provisions that allow libraries and archives to make copies of works for preservation purposes (17 USC §108). The intent of Congress in this case is clear, if only for a relatively narrow definition of preservation. No such intent for digital information is implied in the statute. In fact, the anti-circumvention prohibitions of section 1201 of the Copyright Act seem to remove such concerns from the public interest entirely. If the public interest in preserving digital information is to be served, this dichotomy must be resolved. Provisions for digital archiving that take the restrictions of DRM into account and allow libraries to act before the format becomes obsolete may provide a solution.

Alternatively, content providers could be held accountable for ensuring the long-term reliability of their information. If information can be seen as an asset worthy of copyright protection, then compulsory measures for information preservation are reasonable. A public digital deposit system using trusted third parties could assist these efforts. Congress enacted the National Digital Information Infrastructure and Preservation Program (NDIIPP) in 2000 to begin planning for long-term digital preservation (Friedlander, 2002). Unfortunately, such support from the Federal government is rare and the effect of this legislation has yet to be observed (Rosenzweig, 2003, pp. 752-754). In the private sector, Elsevier Science, a division of Reed Elsevier, is currently involved in trusted repository agreements with Yale University and the National Library of the Netherlands for preservation of electronic journals (Ayre & Muir, 2004, §5). Actions such as these constitute the beginnings of preservation policy.

If self-archiving and digital deposit fail to materialize or to provide adequate solutions, then the basis of current copyright statutes may need to be reexamined. Much of the reasoning behind recent copyright laws rests on the assumption that digital media eliminate content creators' income because of massive, near-perfect distribution. Should DRM become the norm, and remuneration be extracted at the most granular level, then content providers stand to make greater profits than ever. It follows in this scenario that the exclusive rights could be exchanged for greater control over content. A shorter term of copyright could reduce the impacts of format obsolescence and market instability by allowing the content to enter the public domain and the relative safety of unrestrained distribution.

Conclusion

Preserving the cultural record as it becomes digital is a significant challenge. Technologies of access and the transfer of control over information access from public to private interests increase the risk of information loss. The volatility of digital information should compel us to act in a decisive way, in both the public and private interest. Failure to do so will create gaps in our cultural record as digital objects become permanently inaccessible or lost completely.

References

Anderson, C. (2004). The long tail. Wired, 12(10), 170-177.

Ayre, C. & Muir, A. (2004). The right to preserve: The rights issues of digital preservation. D-Lib Magazine, 10(3). Retrieved on 15 November, 2004, from http://www.dlib.org/dlib/march04/ayre/03ayre.html.

Bertot, J. (2003). Internet-based library services. Library Trends, 52(2), 209-227.

Bialik, C. (Nov. 14, 2003). CNET to buy MP3.com assets from Vivendi's U.S. net unit. Wall Street Journal Online. Retrieved on 8 November, 2004, from http://online.wsj.com/article/0,,SB106882967943658100,00.html.

Burk, D. & Cohen, J. (2001). Fair use infrastructure for rights management systems. Harvard Journal of Law & Technology, 15(1), 42-83.

Cohen, J. (1998). Lochner in cyberspace: The new economic orthodoxy of "rights management." Michigan Law Review, 97(2), 462-574.

Erikson, J. (2003). Fair use, DRM, and trusted computing. Communications of the ACM, 46(4), 34-39.

Felten, E. (2003). A skeptical view of DRM and fair use. Communications of the ACM, 46(4), 57-59.

Friedlander, A. (2002). The National Digital Information Infrastructure Preservation Program: Expectations, realities, choices and progress to date. D-Lib Magazine, 8(4). Retrieved on 30 November, 2004, from http://www.dlib.org/dlib/april02/friedlander/04friedlander.html.

Ganshorn, H. (2002). Workshop on alternative publishing: Summary report. University of Calgary: Calgary, Canada. Retrieved on 6 November, 2004, from http://www.ucalgary.ca/library/plans/altpub/altpub.doc.

Goldstein, P. (2003). Copyright's highway: from Gutenberg to the celestial jukebox. Chap 1. New York: Hill and Wang.

Google (2001). Google acquires Usenet discussion service and significant assets from Deja.com. Press release. Retrieved on 8 November, 2004, from http://groups.google.com/press/pressrel/pressrelease48.html.

Kuny, T. (1998, May). The digital dark ages: Challenges in the preservation of electronic information. International Preservation News, 17. Retrieved on 29 October, 2004, from http://www.ifla.org/VI/4/news/17-98.htm#2.

Moyo, L. (2004). Electronic libraries and the emergence of new service paradigms. The Electronic Library, 22(3), 220-230.

Rappaport, E. (1998). Copyright term extension: Estimating the economic values (CRS 98-144 E). Washington D.C.: Congressional Research Service. Retrieved on 30 October, 2003, from http://www.ipmall.info/hosted_resources/CRS_Index_1998.asp.

Rosenzweig, R. (2003). Scarcity or abundance? Preserving the past in a digital era. American Historical Review, 108(3), 735-762.

Samuelson, P. (2003). DRM {and, or, vs.} the law. Communications of the ACM, 46(4), 41-45.

Stefik, M. (1997). Shifting the possible: How trusted systems and digital property rights challenge us to rethink digital publishing. Berkeley Technology Law Journal, 12(1). Retrieved on 30 October, 2004, from http://www.law.berkeley.edu/journals/btlj/articles/vol12/Stefik/html/reader.html.

Waugh, A., Wilkinson, R., Hills, B. & Dell’oro, J. (2000). Preserving digital information forever. Proceedings of the fifth ACM conference on digital libraries, San Antonio, Texas, 175-184.

TEI Lite History and Evaluation

tkiehne — Tue, 30 Nov 2004 05:57:09 +0000

New and disparate ways of digitally encoding texts were developed as computing became available to scholars of the humanities in the 1980s. The encoding of textual objects into a digital form creates opportunities for examining old and rare texts simultaneously and without the risk of wear or damage to the original object. Additionally, an encoded object permits new ways of interacting with the text, such as concurrent views of different versions and the viewing of subsequent editorial changes or annotations. The lack of standard methods for encoding and describing texts made it difficult for researchers to exchange objects and diminished the benefits that the digital format offers.

The Text Encoding Initiative (TEI) was conceived in this disjointed digitization environment. TEI is a successful and influential metadata encoding standard that is primarily concerned with the encoding of textual objects, but is flexible enough to apply to many other types of information objects. The standard is customizable and extensible. One such customization is TEI Lite, a subset of the TEI specification. In this essay we will examine the development and history of TEI Lite as well as the role it plays in documenting the lifecycle of digital objects. TEI Lite's relationship to other metadata initiatives will also be explored. Finally, an evaluation will be made of how well TEI achieves its purpose and some of the problems the specification faces.

History of TEI Lite

TEI Lite shares its formative history with its superset, TEI. Work on TEI formally began in 1987 with a meeting of 32 scholars from North America, Europe, and Asia held at Vassar College in Poughkeepsie, NY. The initial meeting was convened by the Association for Computers and the Humanities and funded by the National Endowment for the Humanities with the purpose of beginning work on the problems facing digital text encoding (Mylonas & Renear, 1999, pp. 3-4).

At the close of the conference, the group issued a statement to provide direction for the development of guidelines. The statement, known as the “Poughkeepsie Principles,” directed that the forthcoming guidelines should (Burnard & Sperberg-McQueen, 2002, p. 1):

suffice to represent the textual features needed for research;
be simple, clear, and concrete;
be easy for researchers to use without special-purpose software;
allow the rigorous definition and efficient processing of texts;
provide for user-defined extensions;
conform to existing and emergent standards.

After the meeting in 1987, three organizations participated in forming the guidelines: the Association for Computers and the Humanities, the Association for Literary and Linguistic Computing, and the Association for Computational Linguistics (Mylonas & Renear, 1999, p. 3). Draft versions of the TEI Header and Guidelines were completed and distributed in 1990 (MIT Libraries, 2004). After several years of refinement, the final draft (version P3) was released in 1994. “Guidelines for the encoding and interchange of Machine-Readable Texts” spanned 1300 pages and defined over 600 elements of Standard Generalized Markup Language (SGML). The TEI specifications defined an extensible set of elements that could be customized by user communities for their specific needs. One of these customizations is TEI Lite, which defines a subset of TEI meant to serve as a “starter set” of core elements to assist in learning the extensive TEI set (Burnard, 2000).

The P3 guidelines underwent several minor revisions between 1994 and 2001, mostly to clarify varying interpretations and practices (Burnard & Popham, 1999, p. 39). During this time, however, the success of TEI as a metadata specification informed and influenced the development of the eXtensible Markup Language (XML) (DeRose, 1999). TEI was subsequently converted to XML and released as version P4 in summer 2001 (MIT Libraries, 2004). Development of TEI Lite continues in parallel with TEI, the next revision of which (P5) is expected to be released at the end of 2004 (TEI Consortium, 2004, How to participate - Next version).

Formal structures were developed to guide future development as participation increased. An Executive Committee was formed in the mid-nineties that included representatives from each of the three sponsoring associations and two influential researchers, Michael Sperberg-McQueen (University of Illinois at Chicago) and Lou Burnard (Oxford University). By 1996, a Technical Review Committee was established to conduct the development and maintenance of the guidelines in a manner similar to the International Organization for Standardization (ISO) (Burnard & Light, 1996, pp. 25-26).

In 1999, the Executive Committee was petitioned to create an international membership organization that could better handle the TEI's increasing administration and development responsibilities. The petition resulted in the formation of a non-profit corporation, the TEI Consortium (Burnard, 2000). Membership in the consortium includes dozens of agencies from the humanities, education, computing, linguistics, and librarianship. Members elect a technical council that oversees development of the guidelines and funding for the organization. The consortium's first Council was elected in 2001 and met for the first time in 2002. Members may also participate in the various special interest groups or work groups that develop the guidelines (TEI Consortium, 2004, How to participate). The Consortium relies on its members to expand TEI's user base and has chartered a special interest group for training to support their efforts (TEI Consortium, 2004, How to participate – Special interest groups).

TEI is hosted by four universities and is sponsored by the three associations originally responsible for initial development of the guidelines. Significant support is provided by the U.S. National Endowment for the Humanities (NEH), Directorate XIII of the Commission of the European Communities (CEC/DG-XIII), the Andrew W. Mellon Foundation, and the Social Sciences and Humanities Research Council of Canada (TEI Consortium, 2004).

The Functional Role and Structure of TEI Lite

TEI and TEI Lite intend to define a framework for the encoding of texts that facilitates the interchange of digital objects. The specification defines a common and extensible language that different software platforms can understand and use to render the digital object in consistent ways. Although the development of TEI has focused on the encoding of texts, particularly capturing non-digital texts, the framework is applicable to the description of non-text objects such as images and sound (Burnard & Sperberg-McQueen, 2002, p. 1).

The description and interchange goals of TEI imply fidelity to the structure and content of the object being encoded. As such, much of the focus of TEI is on the structural description of textual objects, while the TEI header supports most of the lifecycle metadata functions (see Table 1). Elements of the TEI header provide creation, appraisal, and descriptive metadata and, to a lesser extent, transfer/authenticity and preservation metadata. Accession and usage metadata are much less apparent, but may be augmented by the information system that stores the digital object. Rights metadata is represented only in simple form and only in regard to the original object. In fact, the distinction between metadata about the digital encoding and metadata about the original object is difficult to discern from the element definitions and likely results in differing practices. In general, the file description (fileDesc) describes attributes of the original text while the encoding description (encodingDesc) concentrates on aspects of the digital implementation.

Life-cycle role	TEI header elements
Creation	<fileDesc> <titleStmt>(all) <profileDesc> <creation> <revisionDesc>
Appraisal	<fileDesc> <editionStmt>(all) <seriesStmt> <sourceDesc> <encodingDesc> <projectDesc> <samplingDecl> <editorialDecl>
Transfer/Authenticity	<fileDesc> <extent> <publicationStmt> <publisher> <distributor> <authority>
Accession	<fileDesc> <publicationStmt> <authority> <notesStmt> (also defined by containing system)
Descriptive	<fileDesc> <titleStmt> <title> <editionStmt> <edition> <seriesStmt> <sourceDesc> <profileDesc> <textClass>
Preservation	<fileDesc> <extent> <encodingDesc> <tagsDecl>(all) <refsDecl> <profileDesc> <langUsage>
Usage	<revisionDesc> (also defined by containing system)
Rights	<fileDesc> <publicationStmt> <publisher> <availability> <distributor> <availability> <authority> <availability>

Table 1: Metadata life-cycle roles of TEI header elements. (derived from Burnard & Sperberg-McQueen, 2002)

A complete TEI Lite document contains a header and text body (see Figure 1). The header, as indicated above, contains metadata related to the digital object and the original information object. The header is separable from the encoded body, which allows it to serve as a description for non-text objects stored separately from the header. The TEI header is analogous to the title page of a text. It has up to four parts: a description of the electronic file (fileDesc), an encoding description (encodingDesc), a non-bibliographic description of the text (profileDesc), and a revision history (revisionDesc) (Burnard & Sperberg-McQueen, 2002, p. 6). Of these, only the file description is required, the elements of which can be related directly to MAchine-Readable Cataloging (MARC) fields. Unlike MARC, elements of the TEI header are not required to conform to cataloging conventions such as those described by the Anglo-American Cataloguing Rules (AACR), although such rules may be applied at the encoder's discretion (Pouchard, 1998).

Figure 1: Structure of a TEI Lite document (HTML Writers Guild, 2001)
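
The four-part header described above can be illustrated with a minimal sketch in Python, using only the standard library's XML parser. The sample header is hypothetical: its title, publisher, and source note are invented for illustration, and the helper name present_parts is an assumption of this sketch rather than part of any TEI tool. The function reports which of the four header parts are present and rejects a header that lacks the one required part, fileDesc.

```python
import xml.etree.ElementTree as ET

# Hypothetical header; all content values are invented for illustration.
HEADER = """
<teiHeader>
  <fileDesc>
    <titleStmt>
      <title>An example poem: an electronic transcription</title>
    </titleStmt>
    <publicationStmt>
      <publisher>Example Text Archive</publisher>
    </publicationStmt>
    <sourceDesc>
      <p>Transcribed from a hypothetical printed edition.</p>
    </sourceDesc>
  </fileDesc>
  <encodingDesc>
    <projectDesc>
      <p>Encoded as a teaching sample.</p>
    </projectDesc>
  </encodingDesc>
</teiHeader>
"""

# The four header parts named in the text, in document order.
HEADER_PARTS = ("fileDesc", "encodingDesc", "profileDesc", "revisionDesc")


def present_parts(header_xml):
    """Return the header parts found, requiring fileDesc to be among them."""
    root = ET.fromstring(header_xml.strip())
    found = [child.tag for child in root if child.tag in HEADER_PARTS]
    if "fileDesc" not in found:
        raise ValueError("a TEI header must contain a fileDesc")
    return found


parts = present_parts(HEADER)
title = ET.fromstring(HEADER.strip()).findtext("./fileDesc/titleStmt/title")
```

Because the header is an ordinary XML subtree, the same check works whether the header accompanies an encoded text or stands alone as a description of an object stored elsewhere.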

For textual objects, structural encoding is defined within the text element. A TEI text may contain a single, unitary work, or a group of works as realized in a series or anthology. For the latter case, the text element contains a group element that holds an arbitrary number of member texts, each with a body and optional front and back matter. Additionally, multiple TEI objects may be grouped as a corpus, analogous to a collection of texts (Burnard & Sperberg-McQueen, 2002, p. 7).
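
As a rough sketch of the composite case, the following Python fragment parses a small anthology and summarizes its member texts. It assumes the nesting in which a group element inside the outer text wraps the member text elements; the sample content and the helper name member_summaries are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical anthology: one outer text whose group holds two member texts.
ANTHOLOGY = """
<text>
  <front>
    <div type="preface"><p>A note on this collection.</p></div>
  </front>
  <group>
    <text>
      <body>
        <div type="poem"><head>First Poem</head><p>Text of the first poem.</p></div>
      </body>
    </text>
    <text>
      <body>
        <div type="poem"><head>Second Poem</head><p>Text of the second poem.</p></div>
      </body>
      <back>
        <div type="notes"><p>Notes on the second poem.</p></div>
      </back>
    </text>
  </group>
</text>
"""


def member_summaries(composite_xml):
    """Describe each member text of a composite text."""
    root = ET.fromstring(composite_xml.strip())
    summaries = []
    for member in root.findall("./group/text"):
        summaries.append({
            "title": member.findtext("./body/div/head", default="(untitled)"),
            "has_front": member.find("front") is not None,
            "has_back": member.find("back") is not None,
        })
    return summaries


members = member_summaries(ANTHOLOGY)
```

Each member text carries its own optional front and back matter, independent of the front matter that belongs to the collection as a whole.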

The range of elements and the relatively relaxed markup rules allow for varying granularity depending on the intended usage. The body of an encoded text is structured by p and div elements, similar to those in HyperText Markup Language (HTML), that represent chapters, sections, and subsections of a text. Text within these structures may be further encoded using a variety of elements that indicate layout and appearance. Furthermore, elements are available for defining alternate appearances or versions of text and editorial markup or annotations as applied to the original object. Additionally, elements such as unclear allow for the indication of unintelligible or damaged areas of text (Burnard & Sperberg-McQueen, 2002). Such elements enhance textual analysis by allowing the encoding of multiple versions of a text within the same electronic file.
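
A short Python sketch can show how nested div elements and editorial elements such as unclear are read back out of an encoded body. The sample text, the type and n attribute values, and the helper names outline and unclear_passages are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical body: a chapter containing one section and one unclear passage.
BODY = """
<body>
  <div type="chapter" n="1">
    <head>Chapter One</head>
    <p>The letter began <unclear>with a greeting</unclear> and a date.</p>
    <div type="section" n="1.1">
      <p>A second paragraph, fully legible.</p>
    </div>
  </div>
</body>
"""


def outline(element, depth=0):
    """List (depth, type, n) for every div, in document order."""
    rows = []
    for child in element:
        if child.tag == "div":
            rows.append((depth, child.get("type"), child.get("n")))
            rows.extend(outline(child, depth + 1))
    return rows


def unclear_passages(element):
    """Collect the text of every passage marked as unclear."""
    return ["".join(u.itertext()) for u in element.iter("unclear")]


root = ET.fromstring(BODY.strip())
structure = outline(root)
doubtful = unclear_passages(root)
```

The granularity is the encoder's choice: the same passage could be encoded with only the outer div, or with further elements inside each paragraph, and the functions above would still apply.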

Relationship to Other Metadata Initiatives

TEI was one of the first metadata initiatives, predated only by MARC, the International Standard Bibliographic Description (ISBD), AACR, and SGML (Burnard & Light, 1996). The TEI header and the descriptive fields of later versions of MARC closely resemble the functional structure of the ISBD. Despite the structural similarities with MARC and ISBD, however, TEI does not require the use of controlled vocabulary and as such does not readily convert to either standard. Early TEI development eschewed strict cataloging requirements in the expectation that non-catalogers would use the specification. The decision to conform to standard cataloging practices is left to the creating agency (SCHEMAS Registry, 2002). Such a flexible approach favors ease of use over uniformity in order to facilitate a wider adoption of the standard (MIT Libraries, 2004) - an approach that Dublin Core also uses.

Metadata initiatives developed subsequent to TEI have benefited from TEI's success and derive structures from TEI Lite. Encoded Archival Description (EAD) borrowed TEI's header concept (Burnard & Light, 1996, p. 13). Other metadata initiatives are domain specific applications of TEI. The Consortium for the Computer Interchange of Museum Information (CIMI) uses the TEI framework for the description of museum resources (Burnard & Light, 1996, p. 15). Another derivation is the Spoken Text Markup Language (STML), a text to speech markup language inspired by TEI (Sproat, 1997). Similarly, the Music Encoding Initiative (MEI) was based on TEI (Roland, 2002).

The most notable of TEI's influences was on the development of XML. TEI represented the first and most precise SGML implementation at the time of XML's development. As a result, developers of TEI were closely involved in defining XML. Especially useful to the nascent XML specification was TEI's extended pointer language, which served as a prototype for XLink and XPointer (DeRose, 1999).

Evaluation of TEI Lite

TEI Lite was created to present a useful subset of TEI that provides the elements necessary for most common encodings. The 140 elements of TEI Lite represent only a fraction of the hundreds of elements available in TEI and its extensions. The majority of the subset, besides those in the header, define basic structural and perceptual attributes necessary for textual objects, but not so many as to become overly granular. Additionally, the use of a lesser number of elements restricts the size of the metadata vocabulary that different agencies need to have in order to understand conventions used during encoding and markup. The subset represents a lowest common denominator of sorts that is compliant with and upgradeable to the full TEI specification.

There are a number of criticisms of TEI encodings. First, as mentioned previously, the header lacks a controlled vocabulary for bibliographic elements. There is a compromise between usability from the perspective of creation and accessibility in terms of resource location. Free text bibliographic descriptions, however, could prove to be more useful for scholars of ancient texts which, by their unique character, require more detailed descriptions than those afforded in library cataloging (Pouchard, 1998). The upcoming P5 version of TEI will allow external metadata and namespaces to be included in TEI documents (TEI Consortium, 2004, Guidelines - P5 status). Embedding MARC encoded data may offer a solution to the controlled vocabulary problem, although it is uncertain how such features will cascade into TEI Lite.

Second, texts may overlap semantic and organizational structures. XML and TEI are hierarchical languages that require inelegant procedures to represent such overlapping structures. The overlap problem is especially pertinent to representing variant structures beyond the word or character level such as macro-level versions and variations (Smith, 1999).
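
The overlap problem can be made concrete with a small Python sketch. In the first fragment a sentence (s) begins in one verse line (l) and ends in the next, so the two hierarchies cross and the fragment is not well-formed XML. The second fragment shows one common workaround, assumed here for illustration: keep one hierarchy as real elements and reduce the other to empty milestone markers (lb for line breaks). The sample text and helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# A sentence that crosses a verse-line boundary: the tags overlap.
OVERLAPPING = "<lg><l><s>The sentence starts here,</l><l>and ends here.</s></l></lg>"

# Workaround: keep the sentence as an element, mark the line break as a milestone.
MILESTONE = "<p><s>The sentence starts here,<lb/>and ends here.</s></p>"


def is_well_formed(fragment):
    """Return True if the fragment parses as XML, False otherwise."""
    try:
        ET.fromstring(fragment)
    except ET.ParseError:
        return False
    return True


def count_lines(fragment):
    """Recover the number of lines from the lb milestones."""
    root = ET.fromstring(fragment)
    return len(root.findall(".//lb")) + 1


overlap_ok = is_well_formed(OVERLAPPING)
milestone_ok = is_well_formed(MILESTONE)
line_count = count_lines(MILESTONE)
```

The milestone version parses, but the line is no longer an element that can carry content or attributes of its own, which is why such procedures are considered inelegant.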

Third, the reduced set of elements available in TEI Lite reduces the chance of over-granular structure, but divergent encoding practices are still possible. The basic structural elements (p and div) and their attributes may be used differently and result in confusion when encoded documents are exchanged. Numerous “best practices” standards have been created to help alleviate variation within institutions (TEI Consortium, 2004, Tutorials). The loosely prescribed structuring rules, however, demand that TEI rendering tools be just as flexible and not beholden to a particular encoding practice.

Finally, the basic assumptions underlying the use of structural elements create problems for representing the physical structure of a work. TEI is based primarily on encoding the intellectual structures of a text, such as chapters, acts, volumes, and other semantic containers. Such assumptions preempt encoding structures based on physical attributes of the container, such as the sequence of formes in early printed texts (Bauman & Catapano, 1999). The scope of this problem may be beyond the capabilities of TEI Lite and require use of the larger element set of TEI.

Despite these criticisms, TEI Lite successfully achieves its goal of providing a readily adaptable point of entry to TEI. Furthermore, TEI Lite sufficiently addresses the domain problems that TEI was meant to solve. We can judge the current implementation of TEI Lite against the 1987 Poughkeepsie Principles. TEI Lite provides for simple, clear, and concrete representations of textual objects. Expression in XML allows for efficient processing and the use of non-specialized software, and conforms to existing standards. Structural definitions in TEI Lite are not as rigorous as those in TEI, and the user may not extend TEI Lite freely; however, upward compatibility between the specifications provides a solution.

In addition to basic principles, we may judge success by the degree to which TEI Lite has been adopted. The Oxford Text Archive and the Electronic Text Centers at the University of Virginia and the University of Michigan use TEI Lite to encode their holdings. The TEI Consortium uses TEI Lite in its technical documentation (Burnard & Sperberg-McQueen, 2002, p. 2). Additionally, a significant number of the projects listed on the consortium Web site use TEI Lite (TEI Consortium, 2004, Projects using TEI) and a cursory Web and journal search reveals that TEI Lite is frequently used for encoding projects and research.

Conclusion

TEI Lite is an introductory subset of TEI, one of the earliest metadata initiatives. The encoding standard blends a flexible implementation with established descriptive principles. The result is a metadata set that is easy to apply and capable of describing many types of objects. The success of TEI, representing the efforts of scholars worldwide, has informed the development of many subsequent metadata standards and influenced the development of XML. Development of the standard continues as does its increased use in projects for a variety of domains.

References
(see also pathfinder & annotated bibliography)

Bauman, S. & Catapano, T. (1999). TEI and the encoding of the physical structure of books. Computers and the Humanities, 33(1/2), 113–127.

Burnard, L. & Sperberg-McQueen, C. (1995, updated 2002). TEI Lite: An introduction to text encoding for interchange. Retrieved on 18 September, 2004, from http://www.tei-c.org/Lite/teiu5_en.pdf.

Burnard, L. & Light, R. (1996). Three SGML metadata formats: TEI, EAD, and CIMI: A Study for BIBLINK Work Package 1.1. Retrieved on 18 September, 2004, from http://www.ifla.org/documents/libraries/cataloging/metadata/biblink2.pdf.

Burnard, L. & Popham, M. (1999). Putting our headers together: A report on the TEI header meeting 12 September 1997. Computers and the Humanities, 33(1/2), 39–47.

Burnard, L. (2000). Text encoding for interchange: A new consortium. Ariadne, 24(21 June 2000). Retrieved on 16 September, 2004, from http://www.ariadne.ac.uk/issue24/tei/.

DeRose, S. (1999). XML and the TEI. Computers and the Humanities, 33(1/2), 11–30.

HTML Writers Guild (2001). An introduction to the Text Encoding Initiative (TEI), DTD. Retrieved on 26 November, 2004, from http://gutenberg.hwg.org/teidtds.html.

MIT Libraries (2004). MIT metadata reference guide: TEI (Text Encoding Initiative) metadata. Retrieved on 16 September, 2004, from http://libraries.mit.edu/guides/subjects/metadata/standards/tei.html.

Mylonas, E. & Renear, A. (1999). The Text Encoding Initiative at 10: Not just an interchange format anymore – But a new research community. Computers and the Humanities, 33(1/2), 1–9.

Pouchard, L. (1998). Cataloging for digital libraries: The TEI scheme and the TEI header. Katharine Sharp Review, 6(Winter 1998). Retrieved on 18 September, 2004, from http://alexia.lis.uiuc.edu/review/6/pouchard.html.

Roland, P. (2002). The Music Encoding Initiative (MEI). Musical Applications using XML (MAX) 2002 Conference. Retrieved on 21 November, 2004, from http://dl.lib.virginia.edu/bin/dtd/mei/maxpaper.pdf.

SCHEMAS Registry (2002). Activity reports: Text Encoding Initiative. Retrieved on 16 September, 2004, from http://www.schemas-forum.org/registry/desire/activityreports.php3?field=filename&value=TEI_D29D35(RDF).rtf.

Smith, D. (1999). Textual variation and version control in the TEI. Computers and the Humanities, 33(1/2), 103–112.

Sproat, R., Taylor, P., Tanenblatt, M. & Isard, A. (1997). A markup language for text-to-speech synthesis. 5th European Conference on Speech Communication and Technology, Rhodes, Greece, September 22-25, 1997. Retrieved on 21 November, 2004, from http://www.talkingheads.computing.edu.au/resources/documents/serge/Sproat/A%20Markup%20Language%20for%20TTS%20Synthesis-Sproat.pdf.

TEI Consortium (2004). Text Encoding Initiative. Retrieved on 16 September, 2004, from http://www.tei-c.org.

Attachment	Size
TEI-Lite_pathfinder.pdf	235.58 KB