<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xml:base="http://thomas.kiehnefamily.us"  xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel>
 <title>infoSpace - Digital Archives</title>
 <link>http://thomas.kiehnefamily.us/taxonomy/term/11/0</link>
 <description>Blog posts about Digital Archives</description>
 <language>en</language>
<item>
 <title>An Emulation Experiment</title>
 <link>http://thomas.kiehnefamily.us/an_emulation_experiment</link>
 <description>&lt;p&gt;Through my technical work and experience with &lt;a href=&quot;from_floppies_to_repository_a_transition_of_bits&quot;&gt;preservation projects&lt;/a&gt;, I feel that I have a good grasp of migration as a digital preservation strategy.  Unfortunately, I have much less functional experience with emulation as a digital preservation strategy.  The concept of running a virtual machine within the physical resources of another is intuitive enough, but I have as yet had no real experience with a full emulation environment.  Recently I thought about some classic-era Macintosh files that I have had in storage and figured that recovering them could make for an excellent hands-on emulation experience.&lt;/p&gt;
&lt;p&gt;I used Macintosh computers from their introduction in 1984 until 1998.  My last Mac was a 16 MHz 68030 machine with 8 MB of RAM and an 80 MB hard drive, running OS 7.5.  Before disposing of the machine in 2005 (donating it to Goodwill), I copied the entire contents of the hard drive – system software, applications, and all – to a 100 MB ZIP disk. Simply using one of the several Mac-to-Windows utilities would suffice for recovering many of the documents that are cross-platform (images, raw text, etc.) or that could be migrated (Microsoft Word, Excel, etc.), but there are some documents that might be better off using the original application environment to make the conversion to a cross-platform format.&lt;/p&gt;
&lt;p&gt;My initial search for Mac emulators turned up a number of programs, including &lt;a href=&quot;http://www.emulators.com/softmac.htm&quot;&gt;Softmac&lt;/a&gt;, &lt;a href=&quot;http://pearpc.sourceforge.net/&quot;&gt;PearPC&lt;/a&gt;, &lt;a href=&quot;http://www.vmac.org/&quot;&gt;vMac&lt;/a&gt;, &lt;a href=&quot;http://shapeshifter.cebix.net/&quot;&gt;Shapeshifter&lt;/a&gt;, &lt;a href=&quot;http://www.ardi.com/executor.php&quot;&gt;Executor&lt;/a&gt;, and &lt;a href=&quot;http://basilisk.cebix.net/&quot;&gt;Basilisk II&lt;/a&gt;.  These programs span the spectrum from proprietary to open source, and differ in the hardware environments they emulate and the platforms they run on.  Additionally, many of them have not been maintained in some years.  For my experiment, I chose the &lt;a href=&quot;http://gwenole.beauchesne.info/projects/basilisk2/&quot;&gt;Windows port of Basilisk II&lt;/a&gt; since it met my basic criteria: a free, open source program that is still somewhat current.&lt;/p&gt;
&lt;p&gt;The basic concept behind emulators is that they provide hardware encapsulation and interfaces to allow an operating system (OS) to run in a non-native environment.  The system requirements for these emulators are quite modest for current computing hardware; however, there are some extra software requirements that are somewhat peculiar.  Many of the programs mentioned above require a copy of the Mac ROM BIOS just to run, and in every case, a complete copy of the emulated OS is required in order to run software within the emulator.&lt;/p&gt;
&lt;p&gt;The ROM BIOS is a physical chip in the Macintosh hardware that contains the basic machine instructions used by the OS, which Apple considers proprietary code.  The emulators avoid copyright infringement by requiring the user to provide a copy of the ROM rather than embedding one in the software, which would violate Apple&#039;s intellectual property rights. The ROM can be obtained legally in one of two ways: extract the ROM BIOS from a functional Mac of the correct vintage for the OS version to be emulated; or purchase a ROM card with an actual Macintosh ROM chip from a commercial vendor.  The preservation-minded among us can already see issues for future emulation efforts. Incidentally, the Copyright Office has allowed &lt;a href=&quot;http://www.copyright.gov/1201/&quot;&gt;exceptions to copyright&lt;/a&gt; and the anti-circumvention provisions of the Digital Millennium Copyright Act (DMCA) in certain circumstances for archives and libraries, which would seem to include this very situation.  (The exception will be in force until October 2009, at which time it will have to be renewed... would it not be nice to have a permanent exception?)&lt;/p&gt;
&lt;p&gt;As for having a copy of the OS, there are two ways to get one: 1) use an existing system disk or a copy of a system folder; or 2) get a disk image from another source.  Apple still offers some older system software disk images, installers, upgraders and the like for download, and it is pretty easy to find pre-made Mac system disk images for emulators out on the net.  Fortunately, I already have a system copy on my ZIP disk, which will run so long as the emulation environment is of a vintage that supports the OS.  &lt;/p&gt;
&lt;p&gt;The only issue is how to copy the contents of the &lt;a href=&quot;http://en.wikipedia.org/wiki/Hierarchical_File_System&quot;&gt;HFS formatted&lt;/a&gt; ZIP disk, using Windows, to a place where the emulator can access it.  There are two steps: 1) creating an HFS volume within the Windows filesystem, and 2) accurately copying the contents of the original HFS media to the new volume, preserving all OS-specific aspects of the data.  For Macs, this means preserving both the resource and data forks of the files. (See the &lt;a href=&quot;from_floppies_to_repository_a_transition_of_bits&quot;&gt;Joyce project report&lt;/a&gt; for more on Mac files and HFS.)&lt;/p&gt;
&lt;p&gt;Basilisk II uses disk volume images, called hardfiles (file extension: HFV), to simulate an HFS volume within the emulator. The Windows GUI (and presumably the command line tools for the non-Windows versions of Basilisk) can create a raw hardfile, but cannot copy anything into it. Fortunately there is a free program for Windows called &lt;a href=&quot;http://fenestrated.net/~macman/stuff/HFVExplorer/&quot;&gt;HFVExplorer&lt;/a&gt; that can create HFV files and view or manage their contents by copying to or from Windows volumes or other HFS volumes that are accessible to Windows (including CD-ROMs, floppy, SCSI, removables, etc.).  &lt;/p&gt;
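&lt;p&gt;As an aside, a raw hardfile is nothing exotic: as best I can tell, it is simply a fixed-size file of bytes that the emulated Mac OS can then initialize like a blank disk.  A minimal Python sketch follows; the size and filename are arbitrary, and the assumption that the emulated OS will offer to format the blank volume at boot is mine:&lt;/p&gt;
&lt;pre&gt;# Minimal sketch: create a blank 100 MB hardfile for Basilisk II.
# Assumption: the emulated Mac OS will offer to initialize (format)
# the blank, unformatted volume when it mounts it at boot.
SIZE = 100 * 1024 * 1024                # 100 MB, roughly one ZIP disk
with open(&#039;macvol.hfv&#039;, &#039;wb&#039;) as f:
    f.seek(SIZE - 1)                    # sparse where the OS supports it
    f.write(b&#039;\x00&#039;)
&lt;/pre&gt;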
&lt;p&gt;Unfortunately, HFVExplorer would not mount my ZIP drive – using either the parallel port or the USB model.  I would have to guess that, because the program has not been updated since 1999 – before Windows 2000/XP – it was unable to correctly access the removable media.  It is possible that HFVExplorer running on Windows 98 or NT would not encounter this problem, but I am not about to revert to either of those operating systems.  Besides, having to use an older operating system in order to get an emulator to work runs counter to common sense.&lt;/p&gt;
&lt;p&gt;Unable to rectify the issue, I tracked down a copy of &lt;a href=&quot;http://www.mars.org/home/rob/proj/hfs/&quot;&gt;HFSUtils&lt;/a&gt; (&lt;a href=&quot;http://www.student.nada.kth.se/~f96-bet/hfsutils/&quot;&gt;Windows port&lt;/a&gt;), which is the utility package that HFVExplorer was originally based upon.  HFSUtils is a set of command line tools that provide basic file management tasks for mounting and manipulating HFS volumes.  I mounted the ZIP volume and moved around using the command line with ease.  Copying was laborious, however, because the hcopy program could not recursively copy nested directories. Ideally I would have modified the source to do this, or created a script to use HFSUtils, but then I might as well figure out how to fix HFVExplorer.  &lt;/p&gt;
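&lt;p&gt;For the curious, the script I had in mind would be little more than a thin wrapper around the hfsutils commands.  A rough, untested Python sketch follows; the hls and hcopy flags reflect my reading of the man pages (-1 for one entry per line, -F to mark folders with a trailing colon, -m for MacBinary II mode, which keeps both forks), so treat it as a starting point rather than a working tool:&lt;/p&gt;
&lt;pre&gt;# Rough sketch: recursive copy out of a mounted HFS volume by
# shelling out to the hfsutils tools (hmount, hls, hcopy, humount).
import os
import subprocess

def run(*args):
    return subprocess.run(args, capture_output=True, text=True,
                          check=True).stdout

def copy_tree(hfs_path, dest):
    os.makedirs(dest, exist_ok=True)
    for entry in run(&#039;hls&#039;, &#039;-1F&#039;, hfs_path).splitlines():
        if entry.endswith(&#039;:&#039;):     # folder: recurse into it
            copy_tree(hfs_path + entry, os.path.join(dest, entry[:-1]))
        else:                       # file: copy with both forks intact
            run(&#039;hcopy&#039;, &#039;-m&#039;, hfs_path + entry, dest)

run(&#039;hmount&#039;, &#039;/dev/sdb4&#039;)          # device name is illustrative
copy_tree(&#039;:&#039;, &#039;/tmp/macfiles&#039;)     # &#039;:&#039; is the HFS root
run(&#039;humount&#039;)
&lt;/pre&gt;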
&lt;p&gt;In order to move the experiment along, I proceeded by copying the directories and files individually using hcopy at the command line; the most banal of tasks, to be sure, but quite effective.  I copied files from the HFS ZIP to a location on my Windows hard drive, then used HFVExplorer to copy the files into an HFV volume file.&lt;/p&gt;
&lt;p&gt;Aside from the lack of a recursive directory copy, there were some other annoying problems.  First, any source filenames containing characters outside of the standard ASCII range (above 127) had translation issues.  The problems came from trademark symbols, em dashes, and other special characters used in directory and file names in Mac OS. When the command line utility rendered these characters in a file list, they appeared as question marks by default.  Even when using some of the program options for hls (the file listing utility), the characters still did not display correctly in the Windows character set.  Attempts to copy files with special characters failed since the DOS command line sent the translated character to the command.  As an aside, it is possible to access HFS nodes (directories and files) by their node ID (or &lt;a href=&quot;http://www.mactech.com/articles/mactech/Vol.02/02.01/HFS/index.html&quot;&gt;node specification pair&lt;/a&gt;), but unfortunately, hcopy exposed no means to exploit this feature.&lt;/p&gt;
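&lt;p&gt;The root of the issue is that classic Mac OS filenames are typically encoded in MacRoman, which does not line up with the Windows console&#039;s code page above 127.  A tool that handled the raw bytes itself could map them correctly; Python, for one, ships a mac_roman codec:&lt;/p&gt;
&lt;pre&gt;# Illustration: the raw MacRoman bytes for a trademark sign (0xAA)
# and an em dash (0xD1) decode cleanly with the mac_roman codec,
# instead of degrading to question marks at the console.
for byte in (0xAA, 0xD1):
    print(hex(byte), bytes([byte]).decode(&#039;mac_roman&#039;))
&lt;/pre&gt;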
&lt;p&gt;I noticed a second issue that relates to the file metadata, specifically the creation and modification dates for the files copied from the HFS source to the Windows volume.  The copies on the Windows volume showed creation and modification dates as the date of the copy operation and not the original dates.  Fortunately, the files retained their resource forks during the transfer to the emulation environment, meaning that they had all of the file metadata intact with the original dates.  I&#039;m already well aware of &lt;a href=&quot;whats_in_a_creation_date&quot;&gt;inconsistencies in file dates&lt;/a&gt; from platform to platform, but it would be ideal from a preservation perspective if the hcopy routine were to read the file metadata in the HFS source and set the correct creation date on the Windows copies.&lt;/p&gt;
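&lt;p&gt;If a copy routine did surface the original HFS dates, re-stamping the Windows copies would be simple, at least for the modification date.  A sketch (the date and filename are invented; note that os.utime sets only access and modification times, while the Windows creation date requires the Win32 SetFileTime API instead):&lt;/p&gt;
&lt;pre&gt;# Sketch: reapply an original modification date to a copied file.
# The date and filename here are invented for the example.
import os, time

original = time.mktime((1997, 5, 1, 12, 0, 0, 0, 0, -1))
os.utime(&#039;copy_of_file.doc&#039;, (original, original))
&lt;/pre&gt;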
&lt;p&gt;Incidentally, there are other programs available for copying HFS volume data to Windows: &lt;a href=&quot;http://www.mediafour.com/macdrive&quot;&gt;MacDrive&lt;/a&gt; and &lt;a href=&quot;http://www.asy.com/scrtm.htm&quot;&gt;TransMac&lt;/a&gt;. Both are commercial software, but free demos are sometimes available.  On a whim I tried a demo version of TransMac, which copied the source files just fine, but I found out upon transferring the files to the HFV volume that the binary (program) files were converted in such a way that they would not function in the emulation environment.  Unless I missed something in my attempt, this issue would effectively prevent emulation using files copied with these programs.&lt;/p&gt;
&lt;p&gt;With all of the emulator software in place and a testbed of data from the original ZIP disk copied into an HFV volume file, I could test the emulator.  The Basilisk GUI program in Windows allows you to define a set of disk or volume images that will be loaded at startup.  This is the equivalent of having hard drives installed or discs inserted at boot time.  For an initial test, I downloaded and used a bare System 7.0 disk image.  It worked flawlessly on the first try, providing me with a visceral demonstration of the difference between 16 MHz and 1 GHz processor speeds – my 1997 self would have been dazzled.&lt;/p&gt;
&lt;p&gt;&lt;span class=&quot;inline center&quot; style=&quot;width: 300px;&quot;&gt;&lt;a href=&quot;/basilisk_ii_gui&quot;&gt;&lt;img src=&quot;http://thomas.kiehnefamily.us/thomas_files/images/mac_emulation_gui.jpg&quot; alt=&quot;Basilisk II GUI&quot; title=&quot;Basilisk II GUI&quot; class=&quot;image thumbnail&quot; height=&quot;264&quot; width=&quot;300&quot; /&gt;&lt;/a&gt;&lt;span class=&quot;caption&quot; style=&quot;width: 300px;&quot;&gt;&lt;strong&gt;Basilisk II GUI: &lt;/strong&gt;Selecting volumes at startup&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;After fixing a directory hierarchy issue with my copied volume (the system folder was not at the root level on the hard drive image), I relaunched the emulator, which resulted in an almost perfect reproduction of the Mac desktop I last viewed in 2005.&lt;/p&gt;
&lt;p&gt;&lt;span class=&quot;inline center&quot; style=&quot;width: 400px;&quot;&gt;&lt;a href=&quot;/basilisk_ii_running_os_7_5_in_windows_2000&quot;&gt;&lt;img src=&quot;http://thomas.kiehnefamily.us/thomas_files/images/mac_emulation_desktop.jpg&quot; alt=&quot;Basilisk II Running OS 7.5 in Windows 2000&quot; title=&quot;Basilisk II Running OS 7.5 in Windows 2000&quot; class=&quot;image thumbnail&quot; height=&quot;300&quot; width=&quot;400&quot; /&gt;&lt;/a&gt;&lt;span class=&quot;caption&quot; style=&quot;width: 400px;&quot;&gt;&lt;strong&gt;Basilisk II: &lt;/strong&gt;Running Mac OS 7.5 in Windows 2000&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Next step: resolve the transfer issues and try to establish a full, unmodified copy of the HFS source ZIP disk in the HFV volume file.  Ideally, an unfettered copy should work flawlessly for all programs, assuming that the emulator is complete. If that is the case, then I will finally be able to access the last remaining documents that I need to convert within the original application environment.&lt;/p&gt;
</description>
 <comments>http://thomas.kiehnefamily.us/an_emulation_experiment#comments</comments>
 <category domain="http://thomas.kiehnefamily.us/blog_topics/digital_archives">Digital Archives</category>
 <category domain="http://thomas.kiehnefamily.us/blog_topics/digital_preservation">Digital Preservation</category>
 <category domain="http://thomas.kiehnefamily.us/blog_topics/emulation">Emulation</category>
 <pubDate>Tue, 30 Oct 2007 06:01:53 +0000</pubDate>
 <dc:creator>tkiehne</dc:creator>
 <guid isPermaLink="false">48 at http://thomas.kiehnefamily.us</guid>
</item>
<item>
 <title>Audio Encoding Project Milestone</title>
 <link>http://thomas.kiehnefamily.us/audio_encoding_project_milestone</link>
 <description>&lt;p&gt;This week I achieved a major milestone in my personal audio encoding project -- after a longer period of time than I planned.  Other than a few stragglers, I have managed to completely encode all of my compact discs, including creating use copies and fully describing the objects using embedded metadata.  These are all stored on a 0.75 TB NAS appliance configured in a RAID 5 array.  Additionally, I have ingested (using the term rather loosely) the use copies into an access system, &lt;a href=&quot;http://www.ampache.org&quot;&gt;Ampache&lt;/a&gt;.&lt;/p&gt;
&lt;!--break--&gt;&lt;!--break--&gt;&lt;p&gt;When I embarked on this project, &lt;a href=&quot;a_personal_audio_encoding_project&quot;&gt;I estimated&lt;/a&gt; that my entire collection -- CDs, albums, and cassettes -- would amount to about 300GB in total.  The lossless formats in the corpus currently comprise approximately 230GB across 7500 distinct objects.  Once I finish encoding the analog formats I expect to exceed my initial 300GB estimate.  Incidentally, the total corpus including use copies amounts to about 280GB across 14,500 objects*.&lt;/p&gt;
&lt;p&gt;The next step, other than sweeping up the straggler CDs, is to move to encoding my cassettes.  I have fewer cassettes than vinyl records and, since there are fewer noise reduction and quality issues to address, I figure that encoding them is the logical next step.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;*Note: Although one might think the number of objects would simply be double the number of lossless objects, some releases were live or continuously mixed, and these I encoded as one track for the use copy rather than as the individual tracks that appeared on the CD.  This approach preserves the work as a unitary effort for access while maintaining the original order and structure in the lossless format.&lt;/small&gt;&lt;/p&gt;
</description>
 <comments>http://thomas.kiehnefamily.us/audio_encoding_project_milestone#comments</comments>
 <category domain="http://thomas.kiehnefamily.us/blog_topics/audio_encoding_project">Audio Encoding Project</category>
 <category domain="http://thomas.kiehnefamily.us/blog_topics/digital_archives">Digital Archives</category>
 <pubDate>Mon, 19 Feb 2007 00:05:25 +0000</pubDate>
 <dc:creator>tkiehne</dc:creator>
 <guid isPermaLink="false">41 at http://thomas.kiehnefamily.us</guid>
</item>
<item>
 <title>Metadata Quality Control</title>
 <link>http://thomas.kiehnefamily.us/metadata_quality_control</link>
 <description>&lt;p&gt;As I near the end of the first phase of my &lt;a href=&quot;/audio_encoding_project_resumes&quot;&gt;audio encoding project&lt;/a&gt; I feel the need to share some of the metadata quality control observations that I have collected. &lt;/p&gt;
&lt;p&gt;Although ripping my CDs to digital media has been time consuming, it has not been nearly as laborious as checking and correcting the metadata that was automatically gathered during the process.  &lt;a href=&quot;http://www.freedb.org&quot;&gt;FreeDB&lt;/a&gt; as an automatic metadata gathering service has been very helpful, but as I reviewed the corpus of encoded audio, I found many disturbing errors:  misspellings, typos, missing articles, missing fields, omission or misrepresentation of international characters, and, of course, the usual discrepancies in case handling, title formatting, and normalized forms.  &lt;/p&gt;
&lt;!--break--&gt;&lt;!--break--&gt;&lt;p&gt;To find and correct these errors, I relied on one or more separate discography databases and integrated metadata management software.  &lt;a href=&quot;http://www.discogs.com&quot;&gt;Discogs.com&lt;/a&gt; has been an invaluable resource for confirming apparent discrepancies, as well as helping me find and correct release dates, many of which were not found in the FreeDB data.  Discogs does some rather stringent data normalization, often deviating from what is present on the actual releases, in order to eliminate redundancy and excessive cross-linking between records.  This issue has been the source of heated debate on the site&#039;s forums, as well it ought to be.  Having submitted information to Discogs for many of my rarer CDs helped me to understand the compromises that they have made in their system, so that I could understand why deviations from the original objects occurred and make an informed decision as to whether to apply changes.  In the absence of information from Discogs, label, band, and fan sites have also come in handy for verifying information.&lt;/p&gt;
&lt;p&gt;The most important tool I&#039;ve used is an integrated metadata management program called &lt;a href=&quot;http://www.softpointer.com/tr.htm&quot;&gt;Tag &amp;amp; Rename&lt;/a&gt;.  This particular program merges a Windows explorer-like interface for viewing directories and files with an embedded metadata viewer that is capable of extracting and manipulating all of the major audio metadata formats (id3, ogg vorbis, aac, ape, etc.).  The software provides a middle ground between the file system and the content which greatly increases the speed at which I can update embedded metadata.  &lt;/p&gt;
&lt;p&gt;In fact, there seem to be many such tools for this purpose for all sorts of digital object types.  Another one I have come across is &lt;a href=&quot;http://www.friedemann-schmidt.com/software/exifer/&quot;&gt;Exifer&lt;/a&gt;, a program that allows editing of embedded EXIF information in digital photos.  I expect this program will come in handy when I begin processing my seven years&#039; worth of digital images.&lt;/p&gt;
&lt;p&gt;Between these two programs, I have come up with a general list of essential characteristics for embedded metadata editors:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;b&gt;Filesystem integration&lt;/b&gt;: Functions such as copy, move, delete, rename, create directories, and so on. This feature ensures that you can stay within the metadata editing environment, which saves the time wasted in switching programs.  One thing that has been missing in my experience, and which could be useful, is having multiple filesystem views so that you can jump between directories or volumes without leaving your working directory.  This idea is inspired by my preferred code editor, HomeSite.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Metadata listed in directory views&lt;/b&gt;: Selected metadata should be shown as part of the file list to allow a quick appraisal of the contents of the embedded metadata.  Like a file list in the operating system, the list should be sortable and/or filterable by metadata field.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Ability to manipulate many different metadata standards&lt;/b&gt;: The program should be able to manipulate all applicable formats for the target object type (image, sound, text, etc.).  Additionally, an ideal program would be extensible such that new metadata and file types could be added as needed.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Automated or batch editing&lt;/b&gt;: Manual, object by object editing is an expected feature, but the greatest time saver is the ability to modify entire directories or lists at once.  Additionally, the ability to convert from one metadata format to another applicable format (e.g.: id3v1 to id3v2) is essential.  Copying tags directly from one field to another in the same file, swapping tags, and copying tags from one file to another have also been essential features.  Finally, extraction of metadata from the filesystem, such as regular expression or pattern conversion of filenames into metadata, and vice-versa, has also come in handy (see the sketch after this list).&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Ability to create or access authority files&lt;/b&gt;: Tag &amp;amp; Rename allows me to create a list of genres for music files and exposes that list in edit dialogs, although it does not apparently have the option to force me to use only this list.  In the absence of pre-coordinate lists, input masks should be available, especially for date and time fields.  An added bonus for more detailed metadata formats could be accessing authoritative Web services for standard entries, such as a &lt;a href=&quot;http://en.wikipedia.org/wiki/Library_of_Congress_Subject_Headings&quot;&gt;LCSH&lt;/a&gt; service for subjects, though I am not certain that such things yet exist.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Aggregate and summary views&lt;/b&gt;:  This feature does not exist in Tag &amp;amp; Rename, but having brought all of my encoded music into an access system I have found the feature sorely lacking. Essentially, there should be a way to see the total number of objects marked with specific data, for example: grouped by genre.  By browsing a list of all genres returned by my access system I was able to see outliers or variants that were present (e.g.: Synth Pop vs. Synthpop) and find them so that I could go back to the metadata editor and normalize as needed.  It would be ideal to have this capability within the editor; although simply conforming to an authority list of genres would have prevented this particular problem, there may be situations where a strict authority list is not desirable.&lt;/li&gt;
&lt;/ul&gt;
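&lt;p&gt;As a small illustration of the filename-to-metadata conversion mentioned above, a single regular expression can lift tags out of a consistently named file; the naming layout here is a made-up convention for the example:&lt;/p&gt;
&lt;pre&gt;# Toy filename-to-metadata extraction; the layout
# &#039;Artist - Album - NN - Title.ext&#039; is an invented convention.
import re

pattern = re.compile(r&#039;(.+) - (.+) - (\d+) - (.+)\.\w+$&#039;)
artist, album, track, title = pattern.match(
    &#039;Some Artist - Some Album - 01 - Some Title.flac&#039;).groups()
print(artist, album, track, title)
&lt;/pre&gt;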
&lt;p&gt;This is by no means an exhaustive list, and is perhaps too general to fit all object types, but the basic concept is clear.  As a rule, the less typing one does, the more accurate the metadata, but as I have experienced, even external databases have errors.  Over the course of thousands of files or records, small error percentages accrue quickly.  I can only imagine the headaches that would have arisen were my project to take place in a larger organization, with many people participating in the encoding and preservation process, let alone with a much larger corpus.  It is clear that quality control of metadata, whether hand-entered or not, is crucial, and these software tools make that quality control tractable.&lt;/p&gt;
</description>
 <comments>http://thomas.kiehnefamily.us/metadata_quality_control#comments</comments>
 <category domain="http://thomas.kiehnefamily.us/blog_topics/audio_encoding_project">Audio Encoding Project</category>
 <category domain="http://thomas.kiehnefamily.us/blog_topics/digital_archives">Digital Archives</category>
 <category domain="http://thomas.kiehnefamily.us/blog_topics/metadata">Metadata</category>
 <pubDate>Wed, 17 Jan 2007 05:20:54 +0000</pubDate>
 <dc:creator>tkiehne</dc:creator>
 <guid isPermaLink="false">38 at http://thomas.kiehnefamily.us</guid>
</item>
<item>
 <title>Slashdot: Archiving Digital Data an Unsolved Problem</title>
 <link>http://thomas.kiehnefamily.us/slashdot_archiving_digital_data_an_unsolved_problem</link>
 <description>&lt;p&gt;The headline on a front page &lt;a href=&quot;http://hardware.slashdot.org/article.pl?sid=06/11/20/2036247&quot;&gt;post on Slashdot&lt;/a&gt; today reads:&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;&lt;b&gt;&quot;Archiving Digital Data an Unsolved Problem&quot;&lt;/b&gt;&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;which links to this &lt;a href=&quot;http://www.popularmechanics.com/technology/industry/4201645.html&quot;&gt;article in &lt;i&gt;Popular Mechanics&lt;/i&gt;&lt;/a&gt;.  For archivists, this headline states the obvious, but the words betray how the technology sector, at least stereotypically, views archives and backups as equivalent.  Wading through the comments (and discarding the obligatory comical entries), we find a rather robust discussion on digital preservation, sans academic terminology.  All the familiar preservation topics -- migration, emulation, media and file formats, genres, the influence of intellectual property law -- are touched upon, if rather superficially.  One commenter brought up the issue of compression in digital archives, but it seems that no one has touched the &lt;a href=&quot;/technologies_of_access_and_the_cultural_record&quot;&gt;DRM issue&lt;/a&gt; (I&#039;ll have to remedy that).&lt;/p&gt;
&lt;p&gt;That said, however, it is encouraging to see this article highlighted on one of the premier tech blogs as well as in &lt;i&gt;Popular Mechanics&lt;/i&gt;.  It&#039;s going to take quite a bit more exposure to digital preservation problems in the tech community to get the point across -- to impart the long view, as it were -- but this is a good start.&lt;/p&gt;
&lt;!--break--&gt;&lt;!--break--&gt;</description>
 <comments>http://thomas.kiehnefamily.us/slashdot_archiving_digital_data_an_unsolved_problem#comments</comments>
 <category domain="http://thomas.kiehnefamily.us/blog_topics/digital_archives">Digital Archives</category>
 <category domain="http://thomas.kiehnefamily.us/blog_topics/digital_preservation">Digital Preservation</category>
 <pubDate>Tue, 21 Nov 2006 06:24:33 +0000</pubDate>
 <dc:creator>tkiehne</dc:creator>
 <guid isPermaLink="false">37 at http://thomas.kiehnefamily.us</guid>
</item>
<item>
 <title>Neil Beagrie on Personal Digital Libraries and Collections</title>
 <link>http://thomas.kiehnefamily.us/neil_beagrie_on_personal_digital_libraries_and_collections</link>
 <description>&lt;p&gt;I finally got around to reading Neil Beagrie&#039;s D-Lib article, &quot;&lt;a href=&quot;http://dlib.org/dlib/june05/beagrie/06beagrie.html&quot; rel=&quot;nofollow&quot;&gt;Plenty of Room at the Bottom? Personal Digital Libraries and Collections&lt;/a&gt;&quot; (June 2005), and I regret not having done so sooner (alas, I have a great deal left in my &quot;to read&quot; folder).  This article touches on several major themes in my academic pursuits of the last few years, which I will briefly describe here.&lt;/p&gt;
&lt;p&gt;What drew me to the archival field was the overarching concern I have about the potential loss to our external memory in the sense of our information bearing objects.  Being firmly seated in the digital generation, I am mostly concerned with digital materials, and having completed my information science degree I find that, though I still worry about our institutions&#039; digital preservation efforts, it is the enormous amount of personal digital information held by people the world over that really worries me.  Beagrie&#039;s article attacks this issue head-on, naming this body &quot;personal digital collections&quot; and enumerating not only the threat of loss, but the challenge these non-traditional collections pose to our &quot;memory institutions.&quot; &lt;/p&gt;
&lt;p&gt;Personal digital collections are subject to the same threats to persistence that the large institutional and academic projects are – obsolete formats and media, access regimes such as passwords and DRM, and so on.  Beagrie also enumerates missing data as a threat, with the parenthetical &quot;email, webpages, etc.&quot;  It seems that he means links to web pages, references to emails that have been deleted and so on, but I wonder if mere information mismanagement is also intended.  A recent episode in my own personal digital information management should illustrate the point.&lt;/p&gt;
&lt;p&gt;As part of my ongoing &lt;a href=&quot;/a_personal_audio_encoding_project&quot; rel=&quot;nofollow&quot;&gt;audio encoding project&lt;/a&gt;, I have been preserving some of my own audio works from the last decade.  I have also been checking my music collection, including these personal works, against an online discography database, &lt;a href=&quot;http://www.discogs.com&quot; rel=&quot;nofollow&quot;&gt;Discogs.com&lt;/a&gt;.  Every release in the Discogs database represents a physical object (CD, LP, etc.) released by a specific entity (record label), and lists not only the track information, but catalog information, liner notes, and cover art.  As you can probably guess, the music that I created and released was not widely known or distributed (I still have a day job), so naturally there were no previous entries in Discogs.com.  In the process of updating the database with my defunct label&#039;s releases, I found to my horror that I had lost some of the original digital files containing artwork and layout for some of my releases!  Granted, I have not always been preservation-minded, but I had always assumed that these files were migrated from computer to computer over the past decade.  Certainly lapses of this sort pose a significant hazard to personal digital collections, and I&#039;m sure that it qualifies as &quot;missing data.&quot;&lt;/p&gt;
&lt;p&gt;Interestingly enough, my Discogs example also touches on Beagrie&#039;s discussion of &quot;information banks.&quot;  Although Discogs does not store the actual information represented in its indexes (the music), it is easy to visualize how it could were it not for the copyright regime so voraciously defended by the music industry.  This worn argument aside, Discogs does implement a social networking component of the likes proffered in Beagrie&#039;s discussion of information sharing services such as blogs and sites like Flickr.  By adding a social networking component, all of these sites, whether they publish unique user content or merely aggregate collected information (like Discogs), add a layer of informational value in the form of contributed information (e.g.: blog comments) or linked information (e.g.: relationships between artists in Discogs).  But perhaps more importantly, the creation of these information banks, whatever their form, supports my assertion that digital preservation efforts must be aggregated at some level beyond a single (physical) entity&#039;s capabilities -- that only distributed efforts will ensure that digital assets are adequately preserved and accessed, let alone described and identified.  This is as true for the National Archives as it is for Joe Q. Public&#039;s personal works.&lt;/p&gt;
&lt;p&gt;As an aside, I could not help but notice that all of the talk about social networks and personal collections seemed to echo writings on digitally mediated identity by &lt;a href=&quot;http://www.danah.org/&quot; rel=&quot;nofollow&quot;&gt;Danah Boyd&lt;/a&gt;.  Beagrie&#039;s Venn diagram showing the definition of &quot;public persona&quot; begs comparison to &lt;a href=&quot;http://smg.media.mit.edu/people/danah/thesis&quot; rel=&quot;nofollow&quot;&gt;Boyd&#039;s thesis work in faceted identity&lt;/a&gt;.  I imagine that there is much to explore about the intersection of faceted identities or, for that matter, multiple public personas, and the consequences to the &quot;Lifetime Personal Web-spaces&quot; concept mentioned at the end of the article.&lt;/p&gt;
&lt;p&gt;In closing, one quote in particular caught my attention as it factors into my explorations into the &quot;save everything&quot; debate.  Beagrie says (which he credits to Michael Lesk): &quot;The combination of cheap digital storage and very sophisticated retrieval tools is shifting the balance of costs: digitally it is becoming cheaper to collect and more expensive to select, and cheaper to search than to organize.&quot;  In other words, the scarcity argument is shifting from &quot;we don&#039;t have enough space&quot; to &quot;we don&#039;t have the time to organize what we have,&quot; but as Beagrie seems to say, it no longer matters so long as you do not expect traditional access mechanisms.  Or, more succinctly (with a nod to &lt;a href=&quot;http://rpm.lib.az.us/newskills/CaseStudies/4_Stollar_Kiehne.pdf&quot; rel=&quot;nofollow&quot;&gt;Catherine Stollar&lt;/a&gt; for originally expressing it): &quot;what we do... will change, but why we do it does not.&quot;&lt;/p&gt;
</description>
 <comments>http://thomas.kiehnefamily.us/neil_beagrie_on_personal_digital_libraries_and_collections#comments</comments>
 <category domain="http://thomas.kiehnefamily.us/blog_topics/digital_archives">Digital Archives</category>
 <category domain="http://thomas.kiehnefamily.us/blog_topics/digital_libraries">Digital Libraries</category>
 <category domain="http://thomas.kiehnefamily.us/blog_topics/save_everything">Save Everything</category>
 <pubDate>Mon, 16 Oct 2006 07:37:50 +0000</pubDate>
 <dc:creator>tkiehne</dc:creator>
 <guid isPermaLink="false">36 at http://thomas.kiehnefamily.us</guid>
</item>
<item>
 <title>Audio Encoding Project Resumes (or, a funny thing happened on the way to 300 GB)</title>
 <link>http://thomas.kiehnefamily.us/audio_encoding_project_resumes</link>
 <description>&lt;p&gt;It&#039;s been a while (almost 8 months, to be exact) since I have updated this forum on the status of my &lt;a href=&quot;/encoding_project_on_genre_description&quot; rel=&quot;nofollow&quot;&gt;audio encoding project&lt;/a&gt;.  I could cite the usual life delays and an unusually busy Summer as excuses, but there is more to it.  &lt;/p&gt;
&lt;p&gt;So, a funny thing happened on my way to 300 GB...&lt;/p&gt;
&lt;p&gt;Not long after my &lt;a href=&quot;/encoding_project_on_genre_description&quot; rel=&quot;nofollow&quot;&gt;last update&lt;/a&gt;, steady encoding progress brought me to about 240 GB of encoded music.  As far as CDs go not much is left to encode -- perhaps 100 CDs out of the originally estimated 800 -- and I have mostly caught up in creating Ogg Vorbis reference copies.  As I worked my way towards filling my 300 GB external drive, however, I began having strange pangs of trepidation centered on the thought: what happens if I lose this drive?  Knowing full well that the roughly 240 GB of data represented a significant investment in time and effort, and also knowing full well the fallibility of technology and the loss risk inherent in only one copy of, well, anything, I became reluctant to continue encoding until some of these risks could be mitigated.  I cannot say that this trepidation is anything nearly as harrowing as what must be felt by an archivist handling rare, unique manuscripts – I have the original objects to re-encode from, and most of them are not unique – but through my meager risk I certainly feel for those who work in such risky situations.&lt;/p&gt;
&lt;p&gt;Since halting progress, and having finished the aforementioned busy Summer, I have come into possession of a network attached storage (NAS) server, specifically, a &lt;a href=&quot;http://www.buffalotech.com/products/product-detail.php?productid=133&amp;amp;categoryid=25&quot; rel=&quot;nofollow&quot;&gt;1 Terabyte Buffalo TeraStation&lt;/a&gt;.  The prices have recently dropped on these units in the wake of the newer 2 TB versions and, likely, pressure from a spate of competing 1 TB boxes.  For the benefit of those who didn&#039;t just click the link, the 1TB model contains four 250 GB hard drives and is capable of a variety of RAID configurations and storage capacities.  I opted for the relative safety of a 750 GB &lt;a href=&quot;http://en.wikipedia.org/wiki/RAID#RAID_5&quot; rel=&quot;nofollow&quot;&gt;RAID 5&lt;/a&gt; configuration which, though not absolutely fail-safe, does protect against a single drive failure and quite effectively allays my trepidation over continuing the project.  &lt;/p&gt;
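&lt;p&gt;(For the arithmetic-minded: RAID 5 spends one drive&#039;s worth of capacity on parity, so the four 250 GB drives yield 3 × 250 GB = 750 GB of usable space.)&lt;/p&gt;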
&lt;p&gt;I&#039;ve since copied the entire contents of the 300 GB external drive to the TeraStation in preparation for resuming the encoding process, unencumbered by worry.  More to come.&lt;/p&gt;
</description>
 <comments>http://thomas.kiehnefamily.us/audio_encoding_project_resumes#comments</comments>
 <category domain="http://thomas.kiehnefamily.us/blog_topics/audio_encoding_project">Audio Encoding Project</category>
 <category domain="http://thomas.kiehnefamily.us/blog_topics/digital_archives">Digital Archives</category>
 <category domain="http://thomas.kiehnefamily.us/blog_topics/storage">Storage</category>
 <pubDate>Tue, 26 Sep 2006 06:41:22 +0000</pubDate>
 <dc:creator>tkiehne</dc:creator>
 <guid isPermaLink="false">34 at http://thomas.kiehnefamily.us</guid>
</item>
<item>
 <title>Reflections on the SAA 2006 Annual Conference - Part II</title>
 <link>http://thomas.kiehnefamily.us/reflections_on_the_saa_2006_annual_conference_part_ii</link>
 <description>&lt;p&gt;This entry is a continuation of my observations on this year&#039;s SAA annual conference.  For more, see &lt;a href=&quot;http://thomas.kiehnefamily.us/reflections_on_the_saa_2006_annual_conference_part_i&quot;&gt;Part I&lt;/a&gt;.&lt;/p&gt;
&lt;!--break--&gt;&lt;!--break--&gt;&lt;p&gt;&lt;a href=&quot;http://www.archivists.org/conference/dc2006/dc2006prog-Session.asp?event=1738&quot;&gt;&lt;b&gt;Plenary Session II: &quot;Technology&quot;&lt;/b&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Each of the three plenary sessions was hosted by one of the three joint conference organizations; the second was headlined by SAA.  This year, SAA president Richard Pearce-Moses opened the session with a talk summarizing his work over the last year in exploring the &quot;new skills&quot; needed by archivists for the digital era.  Between his writings in Archival Outlook (&lt;a href=&quot;http://www.archivists.org/periodicals/ao_backissues/AO-Sept05.pdf&quot;&gt;here&lt;/a&gt; and &lt;a href=&quot;http://www.archivists.org/periodicals/ao_backissues/AO-Jan06.pdf&quot;&gt;here&lt;/a&gt;) and the &lt;a href=&quot;http://rpm.lib.az.us/newskills&quot;&gt;New Skills Colloquium&lt;/a&gt; in June I have already heard much of what he had to say, but it was nice to see it presented so succinctly to a room full of archivists – a name drop didn&#039;t hurt, either!&lt;/p&gt;
&lt;p&gt;There were a couple of key points that he made which validate opinions I have had and expressed in the past.  One is that traditional archivists have a tendency to avoid the challenges presented by digital records – paraphrasing Pearce-Moses: to hope that someone else will deal with it instead.  Second, he essentially stated that if archivists do not rise to the challenge, other professions will.  I have &lt;a href=&quot;/musings_on_a_systems_view_of_digital_archives&quot;&gt;previously expressed&lt;/a&gt; my concern over how the terminology and practices that technology vendors use come into direct conflict with those that archivists use, so it was encouraging to hear it put to the audience.&lt;/p&gt;
&lt;p&gt;Following Pearce-Moses was a talk by Brewster Kahle of Internet Archive fame, which was a pleasant surprise, mainly because I was curious to see how he would present the &quot;save everything&quot; argument in this venue.  Kahle&#039;s presentation was decidedly oriented towards a lay audience, rather shallow in scope and simple in terms of technical detail, but I can understand his trepidation over being inaccessible to a non-technical audience – in fact, I have seen this happen on numerous occasions when tech industry professionals or computer science academics are asked to speak to librarians or archivists.  &lt;/p&gt;
&lt;p&gt;Aside from this lapse, however, Kahle definitely had a couple of key points to make and drove them home.  One main point can be paraphrased as: we can save everything digitally because in the grand scheme of things it&#039;s not really that expensive.  He threw out some general figures based on estimated amounts of data found in print, film, etc., and showed how modest the corresponding costs are in an institutional or government context.  Kahle didn&#039;t address appraisal and selection, which I am certain many in the audience would have loved to bring up, but I believe that addressing such concerns would have made for a significantly longer presentation.  Second, he mentioned very little about preservation and preservation strategies and how they might impact the costs and requirements for storage and management.  The main point he made about preservation was to reiterate the &lt;a href=&quot;http://www.lockss.org&quot;&gt;LOCKSS&lt;/a&gt; principle, saying essentially that the only proven way to keep information safe is to make lots of copies.  But, I can understand why he would not delve too deeply into this topic as it brings into play discussion of formats, technological obsolescence, and of course, increased storage and costs.  In summary, I appreciated his presentation as a general position statement, but I can easily imagine that few skeptics in the audience were turned.&lt;/p&gt;
&lt;p&gt;The plenary session was wrapped up with a star appearance by &quot;Cokie&quot; Roberts, writer and ABC News correspondent.  Her speech was quite entertaining, the content of which was mostly focused on her experiences in researching for her various books and how her experiences in advocating for breast cancer research could apply to helping fund archives and archival research.  The most interesting part of her presentation was most likely an unintended argument for &quot;save it all.&quot;  &lt;/p&gt;
&lt;p&gt;During her speech, Roberts discussed how difficult it was for her to find source documents regarding or by the wives and women related to the &quot;founding fathers&quot; for her book &lt;i&gt;Founding Mothers&lt;/i&gt;.  Some of the difficulty was due to the usual mis-management of documents, including deliberate destruction by the creators, but more of a problem was the fact that the perspectives of the women of the subject period were considered to be inferior to those of the men – in other words, there was a conscious selection judgment made on the part of archivists not to keep such records.  These decisions could be written off as sexist or as part of some related conspiratorial power struggle, and no doubt some of it is, but the issue I keyed in on is that no one can know with certainty what will be of interest to future researchers.  This is perhaps the strongest argument for &quot;save it all,&quot; not only because of the value to users, but because it is not a technological reason.  It is this one thought that wove Roberts&#039;s speech seamlessly into the previous two, a feat that is tempting to attribute to her renowned brilliance, but then again may just as likely be due to the latent inertia behind the notion to &quot;save it all.&quot;&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;http://www.archivists.org/conference/dc2006/dc2006posterPresentations.asp&quot;&gt;&lt;b&gt;Exhibit Hall and Student Poster Sessions&lt;/b&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Having presented a &lt;a href=&quot;http://www.archivists.org/conference/neworleans2005/no2005prog-Detail.asp?event=1556&quot;&gt;poster&lt;/a&gt; at last year&#039;s exhibit, I felt a responsibility to take a look at this year&#039;s presentations.  Two posters caught my attention this year.  The first was &quot;Search and Preserve: Collecting the Punk and Hardcore Communities&quot; by Debi Griffith of the University of Wisconsin at Madison.  I found this poster to be of personal interest for many reasons:  First, it embodies a core argument behind my desire to save everything, namely that relying on conventional institutions and their selection and appraisal can introduce bias against &quot;fringe&quot; or unpopular communities, thus ensuring a biased or incomplete cultural record.  Another reason is that I have been a participant in some of these communities, from punk and alternative music, to industrial, techno, and experimental music.  I am a semi-avid collector of DIY-style zines and publications put forth by these communities, a habit that started well before I had any idea about archives and such.  My participation in these communities has taught me how the &quot;mainstream&quot; can easily, if not deliberately, misrepresent such movements and how important it is to ensure that the record includes the perspectives and views of the communities in question.&lt;/p&gt;
&lt;p&gt;The second poster that caught my interest was &quot;Digital Object Identifiers and Resource Identifiers in Archival Description&quot; by Krista Ferrante of Simmons College.  This poster was fairly simple, presenting DOI and handle servers as a means of providing persistent identification of electronic records, but it did remind me that I need to finally get something together on the &lt;a href=&quot;http://www.xdi.org&quot;&gt;XRI/XDI specification&lt;/a&gt; in the archival context.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;http://www.archivists.org/conference/dc2006/dc2006prog-Session.asp?event=1759&quot;&gt;&lt;b&gt;Session #508: &quot;Future Shock: Saving the Signals of Audio-visual Records&quot;&lt;/b&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I attended this session for much the same reasons I attended session #208, that is, to validate the decisions that were made in formulating the &lt;a href=&quot;/digital_preservation_plan_for_the_texas_legacy_project&quot;&gt;CHAT preservation plan&lt;/a&gt; and see what new work had been done in digital video preservation and access since early last year.  The difference between this session and #208 is that this session covered projects specifically dealing with audio, video, and audiovisual research rather than TV.&lt;/p&gt;
&lt;p&gt;The first presentation was by Steve Weiss of the University of North Carolina at Chapel Hill, who presented his work restoring and preserving African American cultural audio works.  His presentation was heavy on demonstrations of the various music and voice recordings, but fairly light on process and lessons learned.  One idea that I took from his presentation had to do with software for testing CD recordable media prior to use.  During the CHAT research I had not come across such software, but it struck me as not only plausible but desirable to confirm recordable media before attempting to write data, in order to avoid having to troubleshoot bad recordings after the fact.  No specific software was mentioned, but knowing that such software exists should make it easy to find – more research is necessary here.&lt;/p&gt;
&lt;p&gt;The second presentation was given by Joanne Rudof of the &lt;a href=&quot;http://www.library.yale.edu/testimonies&quot;&gt;Fortunoff Video Archive&lt;/a&gt; for Holocaust Testimonies.  Rudof described in detail the process used to migrate and preserve a large number of Beta-SP cassettes of oral histories and testimonies.  Much of the initial process she described sounded similar in concept to the CHAT plan: surveying and inventorying existing media, developing a &quot;triage&quot; plan to prioritize preservation efforts, etc.  The major portion of the effort centered on the implementation of an experimental robotic system called SAMMA, a semi-automated system for copying the existing cassettes to newer media and creating MPEG-2 digital surrogates.  It was difficult to tell from the information presented how much material (in hours) was actually migrated – one figure I heard was about 250 hours or 10 TB of MPEG-2 – but the final number of cassettes migrated came out to over 2000.  The mini-DV cassettes used by CHAT are newer and at less risk than those of the Fortunoff archive, but if the number of hours was correct, then we managed to develop a plan that took more time and individual work effort at only a fraction of the cost – a few thousand dollars versus several hundred thousand.  I&#039;ve been meaning to revisit the CHAT project in terms of results and I think the low budget aspect may be the tack to take.&lt;/p&gt;
&lt;p&gt;Some research findings were presented, one set by Virginia Danielson of Harvard University, who gave an overview of her work with the &lt;a href=&quot;http://www.dlib.indiana.edu/projects/sounddirections&quot;&gt;&quot;Sound Directions&quot; project&lt;/a&gt;, and the other a short update by Jim Reilly, who was brought in by the session chair to discuss some of his work.  Danielson&#039;s presentation focused on some of the best practices determinations made by her project, or as she put it, &quot;not bad practices.&quot;  One thing I noted to research from her presentation is the &lt;a href=&quot;http://www.iasa-web.org/tc04&quot;&gt;IASA TC-04 preservation manual&lt;/a&gt;.  The main takeaway from Reilly&#039;s presentation was, paraphrased, that there is no single or simple cause of physical degradation of magnetic media.  This reinforces my doubt, stated in the CHAT plan, over the long-term efficacy of tape media as an archival solution.  As optical and disc-based magnetic media overtake magnetic tape in storage capacity, the days of tape media certainly seem numbered.&lt;/p&gt;
</description>
 <comments>http://thomas.kiehnefamily.us/reflections_on_the_saa_2006_annual_conference_part_ii#comments</comments>
 <category domain="http://thomas.kiehnefamily.us/blog_topics/conferences">Conferences</category>
 <category domain="http://thomas.kiehnefamily.us/blog_topics/digital_archives">Digital Archives</category>
 <category domain="http://thomas.kiehnefamily.us/blog_topics/saa">SAA</category>
 <category domain="http://thomas.kiehnefamily.us/blog_topics/save_everything">Save Everything</category>
 <pubDate>Thu, 24 Aug 2006 03:25:50 +0000</pubDate>
 <dc:creator>tkiehne</dc:creator>
 <guid isPermaLink="false">33 at http://thomas.kiehnefamily.us</guid>
</item>
<item>
 <title>Reflections on the SAA 2006 Annual Conference - Part I</title>
 <link>http://thomas.kiehnefamily.us/reflections_on_the_saa_2006_annual_conference_part_i</link>
 <description>&lt;p&gt;Last week I breezed through Washington, DC to attend the &lt;a href=&quot;http://www.archivists.org/conference/dc2006/&quot;&gt;SAA/NAGARA/CoSA Joint Conference&lt;/a&gt;.  Last year at this time, I attended the SAA conference as a new, student member and, as it was my first ever professional conference, I spent most of the time trying to acclimate myself to the conference ebb and flow.  This year I&#039;ve committed to taking better notes, talking a bit more, and, of course, sharing my observations here.&lt;/p&gt;
&lt;!--break--&gt;&lt;!--break--&gt;&lt;p&gt;First off, these notes are my attempt to forge meaning from the shards of information that reached me.  They are not meant to be comprehensive in their coverage of the sessions I attended, but merely document my thoughts and observations which, predictably, are skewed towards my own research interests.  These observations are very raw and are meant to suggest areas of further research or verification.  As clearly as possible I will try to indicate what was directly expressed versus what I interpreted or generated.&lt;/p&gt;
&lt;p&gt;Second, I consciously entered each of these sessions with some overarching personal question or intent, not only to help me decide which sessions to attend but to ensure that my mind remained focused on the topics and issues that are of interest to me.  I will state these for each session&#039;s notes which should help the reader understand my mindset and the subsequent observations.&lt;/p&gt;
&lt;p&gt;In this episode, the first day:  Thursday, 3 August, 2006.&lt;/p&gt;
&lt;p&gt;&lt;b&gt;&lt;a href=&quot;http://www.archivists.org/conference/dc2006/dc2006prog-Session.asp?event=1708&quot;&gt;Session #103: &amp;ldquo;&#039;X&#039; Marks the Spot: Archiving GIS Databases&amp;rdquo;&lt;/a&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;I attended this session because I hoped to gain some insight into preservation efforts focused on what I will call &amp;ldquo;non-linear&amp;rdquo; records &amp;ndash; things like data sets, Web applications, and other &amp;ldquo;New Media&amp;rdquo; information.  It has long puzzled me how to apply the best practices of digital document preservation to digital forms that span application domains, physical locations, networks, and so on.  My concern arose during the processing of the &lt;a href=&quot;/from_floppies_to_repository_a_transition_of_bits&quot;&gt;Joyce papers&lt;/a&gt;, where hypertext was salient to many of the underlying works, but it also haunts me regularly in my capacity as a Web applications developer.  My working theory here is that geospatial data sets and the applications used to access them present generally the same preservation challenges as software, multimedia &amp;amp; games, relational databases, and so on.&lt;/p&gt;
&lt;p&gt;Three presentations were given, each with distinctive backgrounds and approaches.  Helen Wong Smith of the Kamehameha Schools of Hawaii presented a geospatial cultural / historical database project used to document and maintain land holdings in Hawaii.  Next, Richard Marciano of the San Diego Supercomputer Center presented briefs about several ongoing projects with GIS and geospatial aspects.  Among these were the &lt;a href=&quot;http://www.interpares.org/ip2/ip2_case_studies.cfm?study=23&quot;&gt;InterPARES VanMap project&lt;/a&gt;, the &lt;a href=&quot;http://www.sdsc.edu/PAT/&quot;&gt;Persistent Archival Testbed (PAT) project&lt;/a&gt;, &lt;a href=&quot;http://www.sdsc.edu/ICAP&quot;&gt;ICAP&lt;/a&gt;, and a new project called eLegacy.  Finally, James Henderson of the Maine State Archives presented some of his perspectives and challenges in preserving geospatial data as state government records.&lt;/p&gt;
&lt;p&gt;Geospatial data refers to data sets that link some sort of information (text, image, etc.) to a fixed location or area at a specified time period.  In the case of the Kamehameha Schools, diverse media such as songs, images, and historical accounts are linked to specific locations within the School&#039;s land holdings.  Localities in the state of Maine maintain road and property data in GIS systems to support applications such as E911.  The most salient aspect of these data sets is that they change over time &amp;ndash; notable historical events happen periodically, roads are re-routed or built, and so on &amp;ndash; much as any other database changes when updated, which suggests that preservation efforts for one can be applied to the other and to other similarly structured applications.&lt;/p&gt;
&lt;p&gt;The three presentations did not flow seamlessly, but did manage to expose some overarching themes.  Perhaps the most significant theme that I observed is the relationship between data sets that change over time and versioning in unitary documents.  The key difference between these two concepts is that examining versions of a document reveals the thought process involved in achieving a final or published work, while examining geospatial data shows how things were at various points in time.  Additionally, the time between discrete versions of documents is usually much shorter than for geospatial data, usually days versus years, and documents often have a terminal form after which changes cease, whereas geospatial data is usually open-ended or otherwise arbitrarily bounded.  Aside from these differences, the approach to preserving and accessing versions and geospatial data seems very similar.  Data sets that change over time lend themselves to access via temporal queries, where date or date range becomes part of the query criteria.  For a suitably large number of versions, an access mechanism based on date queries would work just as well as it would for geospatial data.  Further, for any body of records that span a period of time, temporal queries can be an immensely useful tool for narrowing query results to relevant time periods.&lt;/p&gt;
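&lt;p&gt;To make the temporal-query idea concrete, here is a toy sketch; the schema and rows are invented for the illustration.  Each record carries a validity interval, and an &quot;as of&quot; date in the query criteria narrows results to a single point in time:&lt;/p&gt;
&lt;pre&gt;# Toy temporal query: rows carry a validity interval, and an as-of
# date narrows results to one point in time.  Schema and data are
# invented for the example.
import sqlite3

db = sqlite3.connect(&#039;:memory:&#039;)
db.execute(&#039;CREATE TABLE road (name TEXT, route TEXT, &#039;
           &#039;valid_from TEXT, valid_to TEXT)&#039;)
db.executemany(&#039;INSERT INTO road VALUES (?, ?, ?, ?)&#039;, [
    (&#039;Main St&#039;, &#039;old alignment&#039;, &#039;1950-01-01&#039;, &#039;1988-06-30&#039;),
    (&#039;Main St&#039;, &#039;new bypass&#039;, &#039;1988-07-01&#039;, &#039;9999-12-31&#039;),
])
as_of = &#039;1975-05-01&#039;
for row in db.execute(&#039;SELECT name, route FROM road &#039;
                      &#039;WHERE ? BETWEEN valid_from AND valid_to&#039;, (as_of,)):
    print(row)    # (&#039;Main St&#039;, &#039;old alignment&#039;)
&lt;/pre&gt;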
&lt;p&gt;When I thought about these ideas in terms of Web applications (such as CRM, sales support, inventory management, etc. -- putting aside the question of why we would save them) some of the analogies with GIS data break down.  For one, GIS data works in &amp;ldquo;layers,&amp;rdquo; where types of data can be segregated like unitary documents.  Unfortunately, relational databases have no such abstraction &amp;ndash; they are built to store data efficiently, not in ways that can be easily separated.&lt;/p&gt;
&lt;p&gt;Another problem is that even though Web application data can be captured by taking snapshots, in much the same way as GIS data, the rate of change within the data set can often be much faster &amp;ndash; on the order of seconds &amp;ndash; than changes in things such as historical events and roads.  Further, as the snapshot horizon nears the immediate, the storage and processing requirements become untenable &amp;ndash; it is impossible to snapshot a database at an interval shorter than the time required to make the snapshot itself.  As an aside, I wonder what solutions might be suggested by data warehousing techniques.&lt;/p&gt;
&lt;p&gt;Beyond capturing the state of the data, Web applications require that not only the data but the application code itself be maintained.  Seldom does an application remain unchanged over its service life &amp;ndash; bugs are repaired, features are added and removed, and so on.  These changes can affect the way that the underlying data is represented to the user.  Additionally, such changes are often accompanied by changes to the database structure itself.  As a result, snapshots should be acquired after such changes are applied.  Although not enough detail was given about each of these projects to tell, I wonder whether some of the same issues manifest in work with GIS data sets.&lt;/p&gt;
&lt;p&gt;&lt;b&gt;&lt;a href=&quot;http://www.archivists.org/conference/dc2006/dc2006prog-Session.asp?event=1724&quot;&gt;Session #208: &amp;ldquo;Big Bird&#039;s Digital Future: Appraisal and Selection of Public Television Programming&amp;rdquo;&lt;/a&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;I attended this session in order to revisit my work on the &lt;a href=&quot;/digital_preservation_plan_for_the_texas_legacy_project&quot;&gt;CHAT digital video preservation plan&lt;/a&gt; in the context of similar video preservation projects.  I hoped to validate the decisions that were made in formulating the plan and see what new work, if any, had been done in digital video preservation and access since early last year.  As the title of the session suggests, the subject area focused on TV broadcasts, but I anticipated that the overarching preservation concerns would be indistinguishable from any other video preservation effort. &lt;/p&gt;
&lt;p&gt;The three presentations fit together well, despite differences in scope.  Thomas Connors of the National Public Broadcasting Archives and the University of Maryland gave the first presentation, which opened with mention of a &lt;a href=&quot;http://www.itconversations.com/shows/detail400.html&quot;&gt;podcast&lt;/a&gt; by Brewster Kahle of &lt;a href=&quot;http://www.archive.org&quot;&gt;Internet Archive&lt;/a&gt; fame, raising the contentious &amp;ldquo;save everything&amp;rdquo; debate.  Connors invoked the scarcity argument, which allowed him to move into a discussion of the lack of literature treating video appraisal criteria.  The remainder of his presentation described Danielle Dumerer&#039;s ranking system, which I interpreted as a risk assessment matrix, for appraising video collections and prioritizing preservation efforts.  This system operationalizes criteria such as the current condition of the assets, cost of retention, intellectual rights, use potential, and perceived production value &amp;ndash; a more formalized version of the same process I used for the CHAT plan.  He then showed how this system mirrors &lt;a href=&quot;http://www.rlg.org/legacy/preserv/joint/gertz.html&quot;&gt;guidelines&lt;/a&gt; described by the RLG and NPO.&lt;/p&gt;
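&lt;p&gt;As an illustration of how such a ranking system might be operationalized, here is a minimal sketch in Python.  The criteria are those mentioned in the session, but the weights, scoring scale, and combination rule are hypothetical choices of my own, not Dumerer&#039;s actual system.&lt;/p&gt;
&lt;pre&gt;
# Hypothetical weights for the appraisal criteria named in the session;
# the real system may use a different scale and combination rule.
WEIGHTS = {
    &quot;condition&quot;: 3,         # current condition of the assets
    &quot;retention_cost&quot;: 2,    # cost of retention
    &quot;rights&quot;: 1,            # intellectual rights status
    &quot;use_potential&quot;: 2,     # potential for future use
    &quot;production_value&quot;: 1,  # perceived production value
}

def priority(scores):
    &quot;&quot;&quot;Combine per-criterion risk scores (0 = low, 5 = high) into one number.&quot;&quot;&quot;
    return sum(WEIGHTS[name] * score for name, score in scores.items())

# A higher total suggests a higher preservation priority.
tape = {&quot;condition&quot;: 4, &quot;retention_cost&quot;: 2, &quot;rights&quot;: 5,
        &quot;use_potential&quot;: 3, &quot;production_value&quot;: 2}
print(priority(tape))  # 29
&lt;/pre&gt;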
&lt;p&gt;Next in the session was Lisa Carter of the University of Kentucky.  Carter shared her observations in working with television archives, mostly those based on magnetic analog media.  Among these observations were the importance of proper storage of media, the frailty of tape-based media, and the importance of keeping the original media even upon conversion to more stable media or digital versions &amp;ndash; all of which were expressed in the CHAT plan.  Much of her talk focused on the importance of metadata for both access and preservation, most notably the need to work metadata collection into formal workflows.  I found the concept of &amp;ldquo;shutdown procedures&amp;rdquo; to be most interesting: the creators of a video execute a series of steps to describe, document, and otherwise properly close out a production, as a means of combating the ad hoc practices that producers often adopt for the sake of expediency, leaving archivists in the dark.&lt;/p&gt;
&lt;p&gt;Leah Weisse of the WGBH (Boston) Media Archives and Preservation Center presented some of her observations in working with the significant back catalog of WGBH broadcasts, reaching all the way back to the 1950s.  One important issue that she presented is the challenges that new direct-to-drive and flash memory systems pose to preservation.  In these cases, there is no original media to work with in the future, since the impetus for users of these devices is to move the digital file off of the memory device and reuse it for subsequent productions.  This is identical to the behavior of digital camera users, but I had never thought of it in terms of full video capture.  Perhaps the greatest challenge presented in this situation is the need for more rigorous descriptive procedures to ensure that the digital files can be identified, and thus managed, after they have been moved from the capture device.  Her presentation also returned me to the issue of versioning that I observed during the GIS session.  In this case, the versioning is not only a matter of initial or draft productions (think director&#039;s cut versus theatrical release in film), but also reformatted versions (letterbox, etc.) and display formats (HD, streaming, etc.).  Weisse has had to deal with many of these variants across many of the works, which implies that versioning is really a genre- and form-crossing concern.  I need to see what has been said about versioning in the archival literature and how it translates to other forms.&lt;/p&gt;
&lt;p&gt;&lt;b&gt;&lt;a href=&quot;http://www.archivists.org/conference/dc2006/dc2006prog-Session.asp?event=1737&quot;&gt;Session #310: &amp;ldquo;The Current State of Electronic Records Preservation&amp;rdquo;&lt;/a&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;Despite its comprehensive title, I knew that this session would likely offer only a high-level review of some of the major projects.  With this understanding, I approached this session as a brief update to material I had received while in classes a year or so prior.&lt;/p&gt;
&lt;p&gt;David Lake of NARA and Lee Stout of Penn State University addressed ongoing work on the &lt;a href=&quot;http://www.archives.gov/era&quot;&gt;Electronic Records Archives (ERA)&lt;/a&gt; for the National Archives.  The ERA seems to be the flagship project in North America, at least judging by the amount of information about it that I have encountered of late.  At this point, the ERA has a developer &amp;ndash; Lockheed-Martin &amp;ndash; and is slated for an initial, though not comprehensive, release in the fall of 2007.  Many of the questions about the ERA focused on the potential for using the resulting products in venues outside of the National Archives and whether it would be available as an open-source or similar product.  The response emphasized that this project is not merely a set of software, but an instantiation of NARA&#039;s workflow processes.  The message seemed to be that while some products that do specific tasks may be portable to other environments, the core of ERA is specific to NARA and its practices.&lt;/p&gt;
&lt;p&gt;Next, Hans Hofman from the National Archives of the Netherlands presented a general overview of three current European projects: &lt;a href=&quot;http://www.digitalpreservationeurope.eu/&quot;&gt;Digital Preservation Europe (DPE)&lt;/a&gt;, the &lt;a href=&quot;http://www.dl-forum.de/englisch/projekte/projekte_eng_2711_ENG_HTML.htm&quot;&gt;PLANETS&lt;/a&gt; research project, and &lt;a href=&quot;http://www.casparpreserves.eu/&quot;&gt;CASPAR&lt;/a&gt;.  Much of what Hofman presented was conceptually high-level, but he did take care to place these projects into the context of the previous research and efforts upon which they build.&lt;/p&gt;
&lt;p&gt;Finally, Kenneth Thibodeau of NARA wrapped up the session, providing some thoughts that transcended the specifics of the previous presenters.  One thought that I took away from his remarks is, paraphrased, that the ERA has shown that preservation must be attacked as an organizational problem, not as a process in isolation &amp;ndash; something that mirrors what I have said before in terms of archival thought infiltrating the process of creation and the tools used by the creators.  One other take-away was his emphasis on the need for digital format repositories of the type that &lt;a href=&quot;http://hul.harvard.edu/gdfr/&quot;&gt;Harvard&lt;/a&gt; is developing.  I interpreted these not merely as reference databases, but as living applications that can provide a supporting framework for preservation software platforms and applications &amp;ndash; think Web services for digital format preservation information.&lt;/p&gt;
&lt;p&gt;&lt;b&gt;General Observations&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;I had one meta-observation concerning the conference as a whole.  Each session was recorded by the conference staff using each room&#039;s audio setup.  The inputs consisted of usually three microphones, one at the podium and two on the panel table.  In virtually every session I attended, the panel participants had to consciously remind themselves to repeat questions from the audience into the microphone so that the questions would be recorded in addition to the responses given.  This process strikes me as a visceral metaphor for the function of archivists and the frustrations they feel when working with their various constituents.  I often hear the refrain that archival thought needs to happen early in the creation of records, if not before, and given that the recording of these sessions is an inherently future-focused activity &amp;ndash; an attempt to create a complete record of the proceedings &amp;ndash; the panel&#039;s self-reminding process seems apropos.  I have said it before in this venue in different ways, but if we are to capture a more complete cultural record for the future, archival thought in the form of deliberately future-minded actions must be insinuated into our information management &amp;ndash; not only by archivists, but by everyone who creates information and, especially in the digital realm, into the tools that we use.  I envision this as a sort of repurposing of the &lt;a href=&quot;http://en.wikipedia.org/wiki/Seventh_Generation&quot;&gt;seventh generation&lt;/a&gt; concept for our cultural memory as it is represented in our information objects.&lt;/p&gt;
</description>
 <comments>http://thomas.kiehnefamily.us/reflections_on_the_saa_2006_annual_conference_part_i#comments</comments>
 <category domain="http://thomas.kiehnefamily.us/blog_topics/conferences">Conferences</category>
 <category domain="http://thomas.kiehnefamily.us/blog_topics/digital_archives">Digital Archives</category>
 <category domain="http://thomas.kiehnefamily.us/blog_topics/digital_preservation">Digital Preservation</category>
 <category domain="http://thomas.kiehnefamily.us/blog_topics/saa">SAA</category>
 <category domain="http://thomas.kiehnefamily.us/blog_topics/video_preservation">Video Preservation</category>
 <pubDate>Tue, 15 Aug 2006 01:10:50 +0000</pubDate>
 <dc:creator>tkiehne</dc:creator>
 <guid isPermaLink="false">32 at http://thomas.kiehnefamily.us</guid>
</item>
<item>
 <title>What&#039;s in a Creation Date?</title>
 <link>http://thomas.kiehnefamily.us/whats_in_a_creation_date</link>
 <description>&lt;p&gt;There is a certain perception that often accompanies digital objects and, more broadly, computer systems as a whole.  This sort of perception manifests itself when, for example, we hear that massively compressed digital MP3 files are &quot;perfect&quot; quality audio, or in similar myths concerning the infallibility of all things digital.  These perceptions are based on incomplete or inaccurate assumptions about how software, operating systems, or file systems function.  My favorite way of stating this is that computers are only as smart as those who designed them &amp;ndash; if to err is human, then the same goes for our electronic creations.&lt;/p&gt;
&lt;p&gt;When making the transition from paper to digital records, these assumptions are likely to appear in unexpected places.  While working on the &lt;a href=&quot;http://thomas.kiehnefamily.us/from_floppies_to_repository_a_transition_of_bits&quot;&gt;Joyce collection&lt;/a&gt;, we ran headlong into one of these assumptions, made a note of it, then moved on.  But I promised that I would look closer at the issue at a later time... so here I go.&lt;/p&gt;
&lt;!--break--&gt;&lt;p&gt;Anyone with a modicum of computer literacy is familiar with managing digital files through some means -- be it command line or GUI -- and has been exposed to the fact that the computer&#039;s file system(s) maintain not only filenames, but various dates such as creation date, modification date, and access date.  At first blush, this seems like a godsend for archivists struggling to put concrete attributes on virtual objects.  Certainly these dates mean what they say &amp;ndash; the creation date is the date it was created, etc. -- and these attributes follow the digital object wherever it goes, correct?  Unfortunately, a little investigation sheds some doubt on the subject.&lt;/p&gt;
&lt;p&gt;I devised a simple set of experiments to confirm or deny the assumption that all filesystem date metadata is the same and means what we assume it to mean.  I selected the three major operating systems in use today, Windows 2000/NT, Macintosh OS X, and Linux, and conducted a variation of the following sequence on each:&lt;/p&gt;
&lt;p&gt;Create an arbitrary text file in an arbitrary location on a local hard drive (volume)&lt;br /&gt;
Modify the text of the file and save it to the same location&lt;br /&gt;
Move the file from one directory to another on the same local volume&lt;br /&gt;
Make a copy of the file to another directory on same volume&lt;br /&gt;
Copy the file to a separate volume&lt;/p&gt;
&lt;p&gt;After each step, I gathered date information from the filesystem (e.g., creation and modification dates) and generated an MD5 hash to confirm whether the contents of the file stayed the same or changed; a scriptable sketch of these measurements appears below.  I will then discuss the details of this experiment for each operating system.&lt;/p&gt;
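&lt;p&gt;The experiment itself was conducted by hand with the tools noted under each section below, but for anyone who wants to replicate it, the same measurements can be gathered with a short script.  Here is a minimal sketch in Python:&lt;/p&gt;
&lt;pre&gt;
import hashlib
import os
import time

def report(path):
    &quot;&quot;&quot;Print the filesystem dates and the MD5 hash for one file.

    Caveat: st_ctime means creation time on Windows but inode change
    time on Unix-like systems -- an instance of the very ambiguity
    this experiment is probing.
    &quot;&quot;&quot;
    st = os.stat(path)
    with open(path, &quot;rb&quot;) as f:
        digest = hashlib.md5(f.read()).hexdigest()
    print(time.ctime(st.st_ctime), time.ctime(st.st_mtime), digest)

report(&quot;test.txt&quot;)  # run after each step in the sequence above
&lt;/pre&gt;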
&lt;h2&gt;Windows 2000/NT&lt;/h2&gt;
&lt;table border=&quot;1&quot;&gt;
&lt;caption&gt;Table 1: Windows 2000 (NTFS)&lt;/caption&gt;
&lt;tr&gt;
&lt;th&gt;Action&lt;/th&gt;
&lt;th&gt;Creation Date&lt;/th&gt;
&lt;th&gt;Modified Date&lt;/th&gt;
&lt;th&gt;MD5 Hash&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Created&lt;/td&gt;
&lt;td&gt;07/27/2006 19:48:43&lt;/td&gt;
&lt;td&gt;07/27/2006 19:48:43&lt;/td&gt;
&lt;td&gt;d41d8cd98f00b204e9800998ecf8427e&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Modified and Saved&lt;/td&gt;
&lt;td&gt;07/27/2006 19:48:43&lt;/td&gt;
&lt;td&gt;07/27/2006 19:50:05&lt;/td&gt;
&lt;td&gt;0aa9bd7d122205a12e939f14d6946c14&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Moved from one directory to another&lt;/td&gt;
&lt;td&gt;07/27/2006 19:48:43&lt;/td&gt;
&lt;td&gt;07/27/2006 19:50:05&lt;/td&gt;
&lt;td&gt;0aa9bd7d122205a12e939f14d6946c14&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Copied to another directory on same NTFS volume&lt;/td&gt;
&lt;td&gt;07/27/2006 19:52:14&lt;/td&gt;
&lt;td&gt;07/27/2006 19:50:05&lt;/td&gt;
&lt;td&gt;0aa9bd7d122205a12e939f14d6946c14&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Copied to another NTFS volume&lt;/td&gt;
&lt;td&gt;07/27/2006 19:53:25&lt;/td&gt;
&lt;td&gt;07/27/2006 19:50:05&lt;/td&gt;
&lt;td&gt;0aa9bd7d122205a12e939f14d6946c14&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;p&gt;Table 1 above shows a tabulated view of the experiment&#039;s results for a Windows computer using the &lt;a href=&quot;http://en.wikipedia.org/wiki/NTFS&quot;&gt;NTFS file system&lt;/a&gt;.  File dates were collected from the Windows file properties dialog, while MD5 hashes were generated using a freeware program called &lt;a href=&quot;http://www.slavasoft.com/hashcalc/overview.htm&quot;&gt;HashCalc&lt;/a&gt;.  The first two steps passed as predicted, with the new file correctly showing a change in modification date and MD5 hash.  The third step shows that Windows considers a moved file on the same volume to be the same before and after the move &amp;ndash; again, this makes sense.&lt;/p&gt;
&lt;p&gt;Upon making a new copy, however, common sense starts to break down.  The modification date stays the same as before the copy &amp;ndash; demonstrating, as the MD5 hash confirmed, that no changes have been applied &amp;ndash; but the creation date has changed to the time of the copy operation.  This simultaneously makes sense and is confusing: we have a new copy of the file, with its own creation date, but the modification date now &lt;u&gt;precedes&lt;/u&gt; the creation date, which flies in the face of common sense.  How can a file have been modified before it was created?  But it does not end there.  Upon copying across hard drives, the creation date is again modified, once again bringing up the creation/modification dichotomy.&lt;/p&gt;
&lt;h2&gt;Macintosh OS X&lt;/h2&gt;
&lt;table border=&quot;1&quot;&gt;
&lt;caption&gt;Table 2: MacOS X (HFS)&lt;/caption&gt;
&lt;tr&gt;
&lt;th&gt;Action&lt;/th&gt;
&lt;th&gt;Creation Date&lt;/th&gt;
&lt;th&gt;Modified Date&lt;/th&gt;
&lt;th&gt;MD5 Hash&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Created&lt;/td&gt;
&lt;td&gt;07/27/2006 20:04:00&lt;/td&gt;
&lt;td&gt;07/27/2006 20:04:00&lt;/td&gt;
&lt;td&gt;a53165315d1e86c5739d34e1243f5f4d&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Modified and Saved&lt;/td&gt;
&lt;td&gt;07/27/2006 20:04:00&lt;/td&gt;
&lt;td&gt;07/27/2006 20:07:00&lt;/td&gt;
&lt;td&gt;cb697c6c073f85c43e2dfb100f5b725e&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Moved from one directory to another&lt;/td&gt;
&lt;td&gt;07/27/2006 20:04:00&lt;/td&gt;
&lt;td&gt;07/27/2006 20:07:00&lt;/td&gt;
&lt;td&gt;cb697c6c073f85c43e2dfb100f5b725e&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Copied to another directory on same HFS volume&lt;/td&gt;
&lt;td&gt;07/27/2006 20:04:00&lt;/td&gt;
&lt;td&gt;07/27/2006 20:07:00&lt;/td&gt;
&lt;td&gt;cb697c6c073f85c43e2dfb100f5b725e&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Copied to another HFS volume&lt;/td&gt;
&lt;td&gt;07/27/2006 20:04:00&lt;/td&gt;
&lt;td&gt;07/27/2006 20:07:00&lt;/td&gt;
&lt;td&gt;cb697c6c073f85c43e2dfb100f5b725e&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Copied to FAT (MS-DOS) volume (OS X view)&lt;/td&gt;
&lt;td&gt;12/31/1903 16:00:00&lt;/td&gt;
&lt;td&gt;07/27/2006 20:18:00&lt;/td&gt;
&lt;td&gt;cb697c6c073f85c43e2dfb100f5b725e&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Copied to FAT (MS-DOS) volume (Windows view)&lt;/td&gt;
&lt;td&gt;07/27/2006 20:18:59&lt;/td&gt;
&lt;td&gt;07/27/2006 20:18:58&lt;/td&gt;
&lt;td&gt;cb697c6c073f85c43e2dfb100f5b725e&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Copied back to HFS volume from FAT volume&lt;/td&gt;
&lt;td&gt;12/31/1903 16:00:00&lt;/td&gt;
&lt;td&gt;07/27/2006 20:18:00&lt;/td&gt;
&lt;td&gt;cb697c6c073f85c43e2dfb100f5b725e&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;p&gt;Table 2 above shows a tabulated view of the experiment&#039;s results for a Macintosh OS X computer using the &lt;a href=&quot;http://en.wikipedia.org/wiki/HFS_Plus&quot;&gt;HFS+ file system&lt;/a&gt;.  File dates were gathered using the Finder&#039;s Get Info command and MD5 hashes were computed using the built-in command line program, &quot;md5&quot; (note that the Get Info dialog does not show seconds in dates).  Here we see behavior that conforms to our assumptions &amp;ndash; the creation date follows the file throughout its movements on the machine, and the modification date is only changed when an actual modification is made.&lt;/p&gt;
&lt;p&gt;With a little extra knowledge of how Macintosh file systems work, however, it is understood that each file actually consists of two forks: the resource fork, which holds metadata about the file, and a data fork, which holds the content of the file.  Many file systems do not respect this dyadic system, which can create problems when Macintosh files are exchanged with other operating systems or through network transfer.  To this end, I conducted a few more steps that involved transferring the file to a non-HFS+ volume (in the form of a &lt;a href=&quot;http://en.wikipedia.org/wiki/File_Allocation_Table&quot;&gt;FAT&lt;/a&gt; formatted &lt;a href=&quot;http://en.wikipedia.org/wiki/USB_Flash_Drive&quot;&gt;USB flash drive&lt;/a&gt;) and viewing the transferred file in both Macintosh and Windows environments.  As you can see, both dates were significantly affected.  The transfer was considered to be a modification, thus changing the modification date, but then the creation date became skewed.  Macintosh could not recognize the creation date, instead displaying the Macintosh epoch, while Windows interpreted the creation date as being the same as the time of the copy operation (but off by one second, likely an artifact of FAT&#039;s coarse two-second resolution for modification times).  Upon copying the file from the flash drive back to the HFS+ volume, we see that the dates, as altered by the transfer to the flash drive, were preserved.&lt;/p&gt;
&lt;h2&gt;Linux&lt;/h2&gt;
&lt;table border=&quot;1&quot;&gt;
&lt;caption&gt;Table 3: Linux (ext2)&lt;/caption&gt;
&lt;tr&gt;
&lt;th&gt;Action&lt;/th&gt;
&lt;th&gt;Changed Date&lt;/th&gt;
&lt;th&gt;Modified Date&lt;/th&gt;
&lt;th&gt;MD5 Hash&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Created&lt;/td&gt;
&lt;td&gt;07/28/2006 00:12:28&lt;/td&gt;
&lt;td&gt;07/28/2006 00:12:28&lt;/td&gt;
&lt;td&gt;d41d8cd98f00b204e9800998ecf8427e&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Modified and Saved&lt;/td&gt;
&lt;td&gt;07/28/2006 00:13:43&lt;/td&gt;
&lt;td&gt;07/28/2006 00:13:43&lt;/td&gt;
&lt;td&gt;93ad68660a99d36a665a553672a8148d&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Moved from one directory to another&lt;/td&gt;
&lt;td&gt;07/28/2006 00:14:38&lt;/td&gt;
&lt;td&gt;07/28/2006 00:13:43&lt;/td&gt;
&lt;td&gt;93ad68660a99d36a665a553672a8148d&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Copied to another directory on same ext2 volume&lt;/td&gt;
&lt;td&gt;07/28/2006 00:15:51&lt;/td&gt;
&lt;td&gt;07/28/2006 00:15:51&lt;/td&gt;
&lt;td&gt;93ad68660a99d36a665a553672a8148d&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;p&gt;Table 3 above shows a tabulated view of the results for a Red Hat Linux computer using the &lt;a href=&quot;http://en.wikipedia.org/wiki/Ext2&quot;&gt;ext2 file system&lt;/a&gt;.  File dates were gathered using the &quot;stat&quot; command and MD5 hashes computed using the &quot;md5sum&quot; command.  Already there is one glaring difference between the Linux results and the previous two: the absence of a creation date.  Instead, I have shown the Changed date (status change or ctime) as reported by stat.  It was difficult to determine the reason for this omission, especially since some references incorrectly referred to the Changed date as the creation date (e.g.: &lt;a href=&quot;http://www.nongnu.org./ext2-doc/ext2.html#I-CTIME&quot;&gt;Poirier, 2001&lt;/a&gt;), but I found an &lt;a href=&quot;http://www.nongnu.org./ext2-doc/ext2.html#I-CTIME&quot;&gt;email discussion thread&lt;/a&gt; that helped to clarify some of the reasons.  In short, the creators of the ext2 filesystem, and of Linux in general, deemed the concept of a creation date too nebulous to model, so they omitted it.  The Windows experiment demonstrates some of the potential issues behind the concept of a digital creation date and lends some legitimacy to the decision to omit it, even if that decision does seem a bit unsettling.&lt;/p&gt;
&lt;p&gt;Continuing with the experiment anyway, we can see that the modification dates and hashes behave as they did under Windows when the file is modified and moved.  Copying the file, however, altered the modification date, which differs from the behavior of the other two operating systems.  Additionally, we see that the Changed date is updated with each action, regardless of the effect on the content of the file.  It is worth noting that the Changed date is also updated by content-neutral operations such as chmod or chown, which alter only the file&#039;s metadata without touching its content.  In this way, the Changed date records inode bookkeeping rather than anything about the file&#039;s content history, and lends very little help to archival processing.&lt;/p&gt;
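&lt;p&gt;You can observe this behavior directly with a short script.  The following minimal sketch in Python creates a scratch file and performs a metadata-only operation (chmod); on an ext2-style filesystem the Changed date moves while the Modified date stays put.&lt;/p&gt;
&lt;pre&gt;
import os
import stat
import tempfile
import time

# Create a scratch file, then change only its permissions and watch
# which dates move.
fd, path = tempfile.mkstemp()
os.write(fd, b&quot;unchanging content&quot;)
os.close(fd)

before = os.stat(path)
time.sleep(2)                    # make any timestamp change visible
os.chmod(path, stat.S_IRUSR)     # read-only; content is untouched
after = os.stat(path)

print(&quot;Modified date moved:&quot;, before.st_mtime != after.st_mtime)  # False
print(&quot;Changed date moved: &quot;, before.st_ctime != after.st_ctime)  # True
os.remove(path)
&lt;/pre&gt;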
&lt;h2&gt;Analysis&lt;/h2&gt;
&lt;p&gt;Each of these experiments shows how the assumptions of the software makers dictate the behavior of what seem to be common sense concepts, thus threatening the validity of the assumptions we make while using them.  In the case of Windows, the assumption is that any copy operation creates a new file that is treated as a new object, but this leaves behind a paradoxical situation where the modification date precedes creation.  On the Macintosh, every copy of a file, so long as it is made on a compatible volume, can be traced back to the original object by creation date &amp;ndash; in essence, every copy of a Macintosh file is simply a new version, not a new object.  Linux, on the other hand, repairs the Windows dichotomy by bringing the modification date forward with each new object instance.&lt;/p&gt;
&lt;p&gt;At the surface, we may want to proclaim that filesystem metadata cannot be trusted and debate the merits of ignoring it completely.  This is understandable, but perhaps a bit hasty.  It might be better to consider filesystem metadata as helpful &lt;i&gt;to the extent that it has been properly maintained&lt;/i&gt; during the record&#039;s lifetime.  Since creation and modification dates support authenticity, it only seems fitting that our treatment of their apparent flaws should derive from similar concepts.  In other words, the lessons of this experiment should not only guide the handling of digital objects in a repository setting, but also inform the assessment of the reliability of filesystem metadata as generated in the originating environment.  If the recordkeeping systems that generated the digital objects &amp;ndash; including any policies and documented procedures outside the systems &amp;ndash; can be assessed, then the metadata accompanying the objects may be salvageable.  Without such knowledge, though, it is wise to treat any and all filesystem metadata with suspicion.&lt;/p&gt;
&lt;p&gt;Even with a thorough knowledge of the originating environment, can we trust dates and times as the filesystem reports them?  Certainly a to-the-second time should be taken with skepticism -- time zone settings, variations between computers, and clock drift ensure that exact times can only be compared within the same system.  But beyond that, even dates may prove fallible: unskilled users may neglect to set the system clock correctly, or miss a daylight saving time shift.  Further, power outages or system failures can have detrimental effects on the system clock and, in the case of Macintosh systems, where file metadata is stored in one of the file&#039;s two forks, metadata may become corrupted just as normal data files can.  And all of this is to say nothing of date errors deliberately introduced by knowing users in order to deceive or conceal -- forgery is always a risk.&lt;/p&gt;
&lt;p&gt;These problems should demonstrate that the skills of an archivist in determining the authenticity and reliability of records do not fade away in a digital environment, but that the means of performing these tasks change.  An intuition for and knowledge of the assumptions underlying the technology are key, as is a thorough understanding of the origin of the records &amp;ndash; the latter being a skill that archivists already possess.  Hopefully this experiment will help to build the former.&lt;/p&gt;
</description>
 <comments>http://thomas.kiehnefamily.us/whats_in_a_creation_date#comments</comments>
 <category domain="http://thomas.kiehnefamily.us/blog_topics/digital_archives">Digital Archives</category>
 <category domain="http://thomas.kiehnefamily.us/blog_topics/metadata">Metadata</category>
 <pubDate>Fri, 28 Jul 2006 07:59:50 +0000</pubDate>
 <dc:creator>tkiehne</dc:creator>
 <guid isPermaLink="false">30 at http://thomas.kiehnefamily.us</guid>
</item>
<item>
 <title>PAT Project Lessons Learned, Part 2</title>
 <link>http://thomas.kiehnefamily.us/pat_project_lessons_learned_part_2</link>
 <description>&lt;p&gt;I first heard about the Persistent Archives Testbed (PAT) Project at the &lt;a href=&quot;http://www.archivists.org/conference/neworleans2005/&quot; rel=&quot;nofollow&quot;&gt;SAA Annual Meeting&lt;/a&gt; in August 2005.  The project merges the efforts of several large institutions -- NHPRC, NARA, SDSC, etc. -- in an effort to test data grid technology as a means of federated archival storage.  In two of the more recent issues of Archival Outlook published by SAA, a question has been posed to two different groups.  The question is, roughly: what skills are needed to work with electronic records?  The two groups asked were archivists and IT professionals.  In light of my recent musings, and the upcoming &lt;a href=&quot;http://rpm.lib.az.us/NewSkills/&quot; rel=&quot;nofollow&quot;&gt;colloquium&lt;/a&gt; in Washington D.C., I took great interest in the most recent article.&lt;/p&gt;
&lt;p&gt;Part two of the article series, IT Professionals&#039; Perspectives (Archival Outlook, Mar/Apr 2006, pp. 8 &amp;amp; 27, not yet available online), asks: &quot;what skills / knowledge should IT professionals have to work with archival records and archivists?&quot;  Three people were asked (or at least responded to) this question: Adil Hasan of the e-Science Center at the Rutherford Appleton Laboratory in the UK, and Richard Marciano and Reagan Moore of the SDSC.  Eureka, I thought &amp;ndash; this is exactly the &lt;a href=&quot;http://thomas.kiehnefamily.us/musings_on_a_systems_view_of_digital_archives&quot; rel=&quot;nofollow&quot;&gt;issue&lt;/a&gt; I have had running around in my mind lately, and from the perspective that has the most to offer with regard to my personal interest.&lt;/p&gt;
&lt;p&gt;Hasan starts off strong, proffering that IT types working alongside archivists must have explicit domain knowledge of archival workflows and concepts.  This is basic, however, as any programmer trying to develop applications for any domain must have an understanding of that domain -- be it supply chain management, inventory control, marketing, data mining, or even archives.  The important takeaway from Hasan is the notion of developing a &quot;toolkit&quot; for archivists to &quot;[combat] the deluge of electronic information.&quot;  This is exactly the conclusion I came to working on the &lt;a href=&quot;http://thomas.kiehnefamily.us/from_floppies_to_repository_a_transition_of_bits&quot; rel=&quot;nofollow&quot;&gt;Joyce collection&lt;/a&gt; last year, and from what I understand, quite a lot of attention is being brought to the issue of archival toolkits.&lt;/p&gt;
&lt;p&gt;Marciano also touches a salient point: that archivists and IT have different ways of speaking that often overlap.  Archivists do much eye-rolling in response to the ways tech companies use the term &quot;archival,&quot; particularly when describing storage media.  Less frequent, but no less important, are differences in the meanings of creation date (especially as implemented by a certain dominant computer operating system), &quot;archive&quot; as a type of compressed file, and other notions of metadata and naming conventions that are imposed by well-meaning IT professionals.  Marciano correctly asserts that this terminology gap must be bridged by IT personnel who work with archivists and describes the ongoing efforts by Richard Pearce-Moses to that effect.&lt;/p&gt;
&lt;p&gt;But although the respondents answered the question asked of them, this is where my enthusiasm faded slightly.  Both Marciano and Moore, from what I was able to read of their comments, address the immediate concerns of IT personnel working on archival projects and with archivists, but appear to avoid the notion that there must be some kind of backflow of concepts into the IT field itself in order to truly address the question of long term preservation of electronic records.  Marciano gets close when he says: &quot;If navigated properly, the unsuspected world of archives that unfolds has the potential to draw IT folks in and transform them into champions of the cause.&quot;  Precisely.  And this is the crux of the challenge: how can archivists instill some of the basic, time-worn traditions of records management and preservation for maintaining reliable and authentic records back into the short-term, light-speed horizon of IT?  IT is great at manipulating electronic information in any way imaginable, so it is not a stretch to believe that it can effectively and definitively extend that information&#039;s longevity as well, beyond mere &quot;backwards compatibility&quot; and better search methods.&lt;/p&gt;
&lt;p&gt;Somehow, we as archivists (or the archivally-aware) must figure out how to imbue programmers with a consciousness of how the artifacts produced by their applications are used, now and into the far future, and of how their decisions affect not only the ability of future users to access the information, but the ability of the custodians of that information to work with it as well.  Such an awareness will present the IT field with a challenge that it cannot ignore and will conquer (with guidance, of course).  In summary, I maintain that we must take a two-pronged approach in order to overcome the problem of electronic information: combat the deluge of information with appropriate tools (as Hasan put it), and expand the utility of artifacts produced by the tech sector as informed by archival practice.&lt;/p&gt;
</description>
 <comments>http://thomas.kiehnefamily.us/pat_project_lessons_learned_part_2#comments</comments>
 <category domain="http://thomas.kiehnefamily.us/blog_topics/digital_archives">Digital Archives</category>
 <category domain="http://thomas.kiehnefamily.us/blog_topics/digital_preservation">Digital Preservation</category>
 <pubDate>Tue, 25 Apr 2006 06:02:23 +0000</pubDate>
 <dc:creator>tkiehne</dc:creator>
 <guid isPermaLink="false">29 at http://thomas.kiehnefamily.us</guid>
</item>
</channel>
</rss>
