Reflections on the SAA 2006 Annual Conference - Part I

Last week I breezed through Washington, DC to attend the SAA/NAGARA/CoSA Joint Conference. Last year at this time, I attended the SAA conference as a new student member and, as it was my first ever professional conference, I spent most of the time trying to acclimate myself to the conference ebb and flow. This year I've committed to taking better notes, talking a bit more, and, of course, sharing my observations here.

First off, these notes are my attempt to forge meaning from the shards of information that reached me. They are not meant to be comprehensive in their coverage of the sessions I attended, but merely to document my thoughts and observations, which, predictably, are skewed towards my own research interests. These observations are very raw and are meant to suggest areas for further research or verification. As clearly as possible, I will try to indicate what was directly expressed versus what I interpreted or generated.

Second, I consciously entered each of these sessions with some overarching personal question or intent, not only to help me decide which sessions to attend but to ensure that my mind remained focused on the topics and issues that are of interest to me. I will state these for each session's notes, which should help the reader understand my mindset and the subsequent observations.

In this episode, the first day: Thursday, 3 August, 2006.

Session #103: “'X' Marks the Spot: Archiving GIS Databases”

I attended this session because I hoped to gain some insight into preservation efforts focused on what I will call “non-linear” records – things like data sets, Web applications, and other “New Media” information. It has long puzzled me how to apply the best practices of digital document preservation to digital forms that span application domains, physical locations, networks, and so on. My concern arose during the processing of the Joyce papers, where hypertext was salient to many of the underlying works, but it also haunts me regularly in my capacity as a Web applications developer. My working theory here is that geospatial data sets and the applications used to access them present generally the same preservation challenges as software, multimedia & games, relational databases, and so on.

Three presentations were given, each with distinctive backgrounds and approaches. Helen Wong Smith of the Kamehameha Schools of Hawaii presented a geospatial cultural / historical database project used to document and maintain land holdings in Hawaii. Next, Richard Marciano of the San Diego Supercomputer Center presented briefs about several ongoing projects with GIS and geospatial aspects. Among these were the InterPARES VanMap project, the Persistent Archival Testbed (PAT) project, ICAP, and a new project called eLegacy. Finally, James Henderson of the Maine State Archives presented some of his perspectives and challenges in preserving geospatial data as state government records.

Geospatial data refers to data sets that link some sort of information (text, image, etc.) to a fixed location or area at a specified time period. In the case of the Kamehameha Schools, diverse media such as songs, images, and historical accounts are linked to specific locations within the School's land holdings. Localities in the state of Maine maintain road and property data in GIS systems to support applications such as E911. The most salient aspect of these data sets is that they change over time – notable historical events happen periodically, roads are re-routed or built, and so on – much as any other database changes when updated, which suggests that preservation efforts for one can be applied to the other and to other similarly structured applications.

The three presentations did not flow seamlessly, but did manage to expose some overarching themes. Perhaps the most significant theme that I observed is the relationship between data sets that change over time and versioning in unitary documents. The key difference between these two concepts is that examining versions of a document reveals the thought process involved in achieving a final or published work, while examining geospatial data shows how things were at various points in time. Additionally, the time between discrete versions of documents is usually much shorter than that between versions of geospatial data, usually days versus years, and documents often have a terminal form after which changes cease, whereas geospatial data is usually open-ended or otherwise arbitrarily bounded. Aside from these differences, the approach to preserving and accessing versions and geospatial data seems very similar. Data sets that change over time lend themselves to access via temporal queries, where a date or date range becomes part of the query criteria. For a suitably large number of versions, an access mechanism based on date queries would work just as well as it would for geospatial data. Further, for any body of records that spans a period of time, temporal queries can be an immensely useful tool for narrowing query results to relevant time periods.
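
To make the temporal-query idea concrete, here is a minimal sketch of date-range access to versioned records. The table, column names, and sample data are invented for illustration and do not reflect any of the systems presented in the session.

```python
# A minimal sketch of temporal access to versioned records, assuming a
# hypothetical table where each row carries the period during which it was
# the current version (valid_from / valid_to, with NULL meaning "still
# current"). All names and data here are placeholders.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE parcel_versions (
        parcel_id   TEXT,
        description TEXT,
        valid_from  TEXT,   -- ISO dates, compared as strings for simplicity
        valid_to    TEXT    -- NULL = still current
    )
""")
conn.executemany(
    "INSERT INTO parcel_versions VALUES (?, ?, ?, ?)",
    [
        ("P-001", "Unimproved lot",       "1998-01-01", "2003-06-30"),
        ("P-001", "Lot with access road", "2003-07-01", None),
    ],
)

def as_of(conn, date):
    """Return the version of every parcel that was current on `date`."""
    return conn.execute(
        """
        SELECT parcel_id, description
        FROM parcel_versions
        WHERE valid_from <= ?
          AND (valid_to IS NULL OR valid_to >= ?)
        """,
        (date, date),
    ).fetchall()

print(as_of(conn, "2001-05-15"))   # -> [('P-001', 'Unimproved lot')]
print(as_of(conn, "2006-08-03"))   # -> [('P-001', 'Lot with access road')]
```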

When I thought about these ideas in terms of Web applications (such as CRM, sales support, inventory management, and so on – putting aside the question of why save them at all), some of the analogies with GIS data break down. For one, GIS data works in “layers,” where types of data can be segregated much as unitary documents can be. Relational databases, unfortunately, have no such abstraction – they are built to store data efficiently, not in ways that can be easily separated.

Another problem is that even though Web application data can be captured by taking snapshots, in much the same way as GIS data, the rate of change within the data set can often be much faster – on the order of seconds rather than the years over which historical events accumulate and roads change. Further, as the snapshot interval shrinks toward continuous capture, the storage and processing requirements become untenable – it is impossible to snapshot a database more often than the time the snapshot itself takes to produce. As an aside, I wonder what solutions might be suggested by data warehousing techniques.
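
On that aside: one data-warehousing-style answer is to record individual changes as they happen rather than taking ever more frequent full snapshots. The sketch below is my own rough illustration of that idea, with invented names; it is not something presented in the session.

```python
# A rough sketch of the change-log alternative: instead of snapshotting the
# whole database faster than it changes, record each change as it happens
# (roughly what data warehouses call change data capture). Any past state can
# then be rebuilt by replaying the log up to a chosen moment.
from datetime import datetime

change_log = []  # append-only record of (timestamp, key, new_value)

def record_change(key, new_value):
    """Log a single change the moment the application applies it."""
    change_log.append((datetime.utcnow(), key, new_value))

def state_as_of(moment):
    """Replay the log to reconstruct the data set as it stood at `moment`."""
    state = {}
    for ts, key, value in change_log:
        if ts <= moment:
            state[key] = value
    return state
```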

Beyond capturing the state of the data, Web applications require that not only the data but also the application code itself be maintained. Seldom does an application remain unchanged over its service life – bugs are repaired, features are added and removed, and so on. These changes can affect the way the underlying data is represented to the user, and they are often accompanied by changes to the database structure itself. As a result, snapshots should be acquired after such changes are applied. Although not enough detail was given for each of these projects, I wonder whether some of the same issues manifested in work with GIS data sets.
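
As a small illustration of what acquiring snapshots alongside code and schema changes might look like, here is a hedged sketch that pairs each database dump with the application and schema versions that produced it. The manifest fields are assumptions of mine, not a practice described by any of the presenters.

```python
# Pair each data snapshot with the application and schema versions that
# produced it, so the dump can still be interpreted after the code and the
# database structure have moved on. Field names are illustrative only.
import hashlib
import json
from datetime import datetime, timezone

def write_snapshot_manifest(dump_path, app_version, schema_version):
    """Write a sidecar manifest describing how a database dump was produced."""
    with open(dump_path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    manifest = {
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "dump_file": dump_path,
        "sha256": digest,
        "application_version": app_version,   # e.g. a release tag
        "schema_version": schema_version,     # e.g. a migration number
    }
    with open(dump_path + ".manifest.json", "w") as f:
        json.dump(manifest, f, indent=2)
```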

Session #208: “Big Bird's Digital Future: Appraisal and Selection of Public Television Programming”

I attended this session in order to revisit my work on the CHAT digital video preservation plan in the context of similar video preservation projects. I hoped to validate the decisions that were made in formulating the plan and see what new work, if any, had been done in digital video preservation and access since early last year. As the title of the session suggests, the subject area focused on TV broadcasts, but I anticipated that the overarching preservation concerns would be indistinguishable from any other video preservation effort.

The three presentations fit together well, despite differences in scope. Thomas Connors of the National Public Broadcasting Archives and the University of Maryland gave the first presentation. Connors led us through a brief presentation that started with mention of a podcast by Brewster Kahle of Internet Archive fame, which invokes the contentious “save everything” debate. Connors invoked the scarcity argument, which allowed him to move into a discussion of the lack of literature treating video appraisal criteria. The remainder of his presentation described Danielle Dumerer's ranking system, which I interpreted as a risk assessment matrix, for appraising video collections and prioritizing preservation efforts. This system operationalizes criteria such as current condition of the assets, cost of retention, intellectual rights, use potential, and perceived production value – essentially the same process I used for the CHAT plan, only more formalized. He then showed how this system mirrors guidelines described by the RLG and NPO.
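
To show roughly how I understood such a ranking system to work, here is a small sketch that scores collections against weighted criteria and sorts them by preservation priority. The criteria names, weights, and example collections are my own placeholders, not the actual matrix presented.

```python
# A hedged sketch of a risk-assessment-style ranking matrix: each collection
# is scored on a handful of criteria and the weighted sum gives a priority
# for preservation action. Weights and criteria are invented placeholders.
WEIGHTS = {
    "condition_risk":   0.30,  # how badly is the carrier deteriorating?
    "retention_cost":   0.15,
    "rights_clarity":   0.15,
    "use_potential":    0.20,
    "production_value": 0.20,
}

def priority(scores):
    """Weighted sum of per-criterion scores (each on a 0-5 scale)."""
    return sum(WEIGHTS[c] * scores.get(c, 0) for c in WEIGHTS)

collections = {
    "Series A (open-reel masters)": {"condition_risk": 5, "retention_cost": 2,
                                     "rights_clarity": 4, "use_potential": 3,
                                     "production_value": 4},
    "Series B (Betacam dubs)":      {"condition_risk": 2, "retention_cost": 3,
                                     "rights_clarity": 5, "use_potential": 2,
                                     "production_value": 3},
}

# Rank collections from highest to lowest preservation priority.
for name, scores in sorted(collections.items(),
                           key=lambda kv: priority(kv[1]), reverse=True):
    print(f"{name}: {priority(scores):.2f}")
```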

Next in the session was Lisa Carter of the University of Kentucky. Carter shared her observations in working with television archives, mostly those based on magnetic analog media. Among these observations were the importance of proper storage of media, the frailty of tape-based media, and the importance of keeping the original media even upon conversion to more stable media or digital versions – all of which were expressed in the CHAT plan. Much of her talk focused on the importance of metadata for both access and preservation, most notably the need to work metadata collection into formal workflows. I found the concept of “shutdown procedures” most interesting: the creators of a video execute a series of steps to describe, document, and otherwise properly close out a production, countering the ad hoc practices that producers adopt for the sake of speed and that leave archivists in the dark.

Leah Weisse of the WGBH (Boston) Media Archives and Preservation Center presented some of her observations in working with the significant back catalog of WGBH broadcasts, reaching all the way back to the 1950s. One important issue that she presented is the challenge that new direct-to-drive and flash memory systems pose to preservation. In these cases, there is no original media to work with in the future, since users of these devices move the digital file off the memory device and reuse the device for subsequent productions. This mirrors the behavior of digital camera users, but I had never thought of it in terms of full video capture. Perhaps the greatest challenge in this situation is the need for more rigorous descriptive procedures to ensure that the digital files can be identified, and thus managed, after they have been moved from the capture device. One observation I made during her presentation concerns the versioning issue I noted during the GIS session. In this case, the versioning is not only in terms of initial or draft productions (think director's cut versus theatrical release in film), but also reformatted versions (letterbox, etc.) and display formats (HD, streaming, etc.). Weisse had to deal with many of these for many of the works, which implies that the versioning issue is really a genre- and form-crossing concern. I need to see what has been said about versioning in the archival literature and how it translates to other forms.

Session #310: “The Current State of Electronic Records Preservation”

Despite its comprehensive title, I knew that this session would likely offer only a high-level review of some of the major projects. With this understanding, I approached this session as a brief update to material I had received while in classes a year or so prior.

David Lake of NARA and Lee Stout of Penn State University addressed ongoing work on the Electronic Records Archives (ERA) for the National Archives. The ERA seems to be the flagship project in North America, at least judging by the amount of information about it that I have encountered of late. At this point, the ERA has a developer – Lockheed-Martin – and is slated for an initial, though not comprehensive, release in the fall of 2007. Many of the questions about the ERA focused on the potential for using the resulting products in venues outside of the National Archives and whether it would be available as an open-source or similar product. The response emphasized that this project is not only a set of software but also an instantiation of NARA's workflow processes. The message seemed to be that while some products that do specific tasks may be portable to other environments, the core of ERA is specific to NARA and its practices.

Next, Hans Hofman from the National Archives of the Netherlands presented a general overview of three current European projects: Digital Preservation Europe (DPE), PLANETS (a research project), and CASPAR. Much of what Hofman presented was conceptually very high-level, but he did take care to place these projects in the context of the previous research and efforts upon which they build.

Finally, Kenneth Thibodeau of NARA wrapped up the session, offering some thoughts that transcended the specifics of the previous presentations. One thought that I took away from his remarks is, paraphrased, that the ERA has shown that preservation has to be attacked as an organizational problem, not a process in isolation – something that mirrors what I have said before about archival thought infiltrating the process of creation and the tools used by the creators. One other take-away was his emphasis on the need for digital format repositories of the type that Harvard is developing. I interpreted these not merely as reference databases, but as living applications that can provide a supporting framework for preservation software platforms and applications – think Web services for digital format preservation information.
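
To picture what “Web services for digital format preservation information” could look like in practice, here is a purely hypothetical client sketch. The registry URL, query parameters, and response fields are invented for illustration; they describe neither Harvard's work nor any existing registry service.

```python
# A hypothetical client that asks a format registry Web service for the
# record describing a given file format. Endpoint and response shape are
# placeholders, not a real API.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

REGISTRY = "https://registry.example.org/formats"  # placeholder URL

def lookup_format(format_id):
    """Fetch the registry record for a format identifier."""
    url = f"{REGISTRY}?{urlencode({'id': format_id})}"
    with urlopen(url) as resp:
        return json.load(resp)

# A preservation application could then consult the record before acting,
# for example (hypothetical fields):
#   record = lookup_format("fmt/40")
#   if record.get("status") == "at-risk":
#       plan_migration(record["recommended_successor"])
```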

General Observations

I had one meta-observation concerning the conference as a whole. Each session was recorded by the conference staff using each room's audio setup. The inputs usually consisted of three microphones, one at the podium and two on the panel table. In virtually every session I attended, the panel participants had to consciously remind themselves to repeat questions from the audience into the microphone so that the questions would be recorded in addition to the responses given. This process strikes me as a visceral metaphor for the function of archivists and the frustrations they feel when working with their various constituents. I often hear the refrain that archival thought needs to happen early in the creation of records, if not before, and given that the recording of these sessions is an inherently future-focused activity – an attempt to create a complete record of the proceedings – the panel's self-reminding process seems apropos.

I have said it before in this venue in different ways, but if we are to capture a more complete cultural record for the future, archival thought in the form of deliberately future-minded actions must be insinuated into our information management – not only by archivists, but by everyone who creates information and, especially in the digital realm, into the tools that we use. I envision this as a sort of repurposing of the seventh generation concept for our cultural memory as it is represented in our information objects.