<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xml:base="http://thomas.kiehnefamily.us"  xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel>
 <title>infoSpace blogs</title>
 <link>http://thomas.kiehnefamily.us/blog</link>
 <description>The academic, non-academic, and just plain random meanderings of Thomas P Kiehne.</description>
 <language>en</language>
<item>
 <title>An Emulation Experiment</title>
 <link>http://thomas.kiehnefamily.us/an_emulation_experiment</link>
 <description>&lt;p&gt;Through my technical work and experience with &lt;a href=&quot;from_floppies_to_repository_a_transition_of_bits&quot;&gt;preservation projects&lt;/a&gt;, I feel that I have a good grasp of migration as a digital preservation strategy.  Unfortunately, I have much less functional experience with emulation.  The concept of running a virtual machine within the physical resources of another is intuitive enough, but I have as yet had no real experience with a full emulation environment.  Recently I thought about recovering some classic-era Macintosh files that I have had in storage and figured that their recovery could make for an excellent hands-on emulation experience.&lt;/p&gt;
&lt;p&gt;I used Macintosh computers from their introduction in 1984 until 1998.  My last Mac was a 16 MHz 68030 machine with 8 MB RAM and 80 MB of hard drive space running OS 7.5.  Before disposing of the machine in 2005 (donating it to Goodwill), I copied the entire contents of the hard drive – system software, applications, and all – to a 100 MB ZIP disc. Simply using one of the several Mac-to-Windows utilities would suffice for recovering many of the documents that are cross-platform (images, raw text, etc.) or that could be migrated (Microsoft Word, Excel, etc.), but some documents might be better served by using the original application environment to make the conversion to a cross-platform format.&lt;/p&gt;
&lt;p&gt;My initial search for Mac emulators turned up a number of programs, including &lt;a href=&quot;http://www.emulators.com/softmac.htm&quot;&gt;Softmac&lt;/a&gt;, &lt;a href=&quot;http://pearpc.sourceforge.net/&quot;&gt;PearPC&lt;/a&gt;, &lt;a href=&quot;http://www.vmac.org/&quot;&gt;vMac&lt;/a&gt;, &lt;a href=&quot;http://shapeshifter.cebix.net/&quot;&gt;Shapeshifter&lt;/a&gt;, &lt;a href=&quot;http://www.ardi.com/executor.php&quot;&gt;Executor&lt;/a&gt;, and &lt;a href=&quot;http://basilisk.cebix.net/&quot;&gt;Basilisk II&lt;/a&gt;.  These programs run the spectrum in licensing (from proprietary to open source), in the hardware environments they emulate, and in the platforms they run on.  Additionally, many of them have not been maintained in some years.  For my experiment, I chose the &lt;a href=&quot;http://gwenole.beauchesne.info/projects/basilisk2/&quot;&gt;Windows port of Basilisk II&lt;/a&gt; since it met my basic criteria: a free, open source program that is still somewhat current.&lt;/p&gt;
&lt;p&gt;The basic concept behind emulators is that they provide hardware encapsulation and interfaces to allow an operating system (OS) to run in a non-native environment.  The system requirements for these emulators are quite unassuming by current hardware standards; however, there are some peculiar extra software requirements.  Many of the programs mentioned above require a copy of the Mac ROM BIOS to run at all, and in every case a complete copy of the emulated OS is required in order to run software within the emulator.&lt;/p&gt;
&lt;p&gt;The ROM BIOS is a physical chip present in the Macintosh hardware that contains the basic machine instructions used by the OS, which Apple considers proprietary code.  The emulators avoid copyright infringement by requiring the user to provide a copy of the ROM rather than embedding one into the software, which would violate Apple&#039;s intellectual property rights. The ROM can be obtained legally in one of two ways: extract the ROM BIOS from a functional Mac of the correct vintage for the OS version to be emulated; or purchase a ROM card with an actual Macintosh ROM chip from a commercial vendor.  The preservation-minded among us can already see issues for future emulation efforts. Incidentally, the Copyright Office has allowed &lt;a href=&quot;http://www.copyright.gov/1201/&quot;&gt;exceptions to copyright&lt;/a&gt; and the anti-circumvention provisions of the Digital Millennium Copyright Act (DMCA) in certain circumstances for archives and libraries, which would seem to include this very situation.  (The exception will be in force until October 2009, at which time it will have to be renewed... would it not be nice to have a permanent exception?)&lt;/p&gt;
&lt;p&gt;As for having a copy of the OS, there are also two ways to get a copy: 1) have an existing system disc or copy of a system folder; or 2) get a disc image from another source.  Apple still has some older system software disc images, installers, upgraders and the like for download, and it is pretty easy to find pre-made Mac system disc images for emulators out on the net.  Fortunately, I already have a system copy on my ZIP disc which will run so long as the emulation environment is of a vintage that supports the OS.  &lt;/p&gt;
&lt;p&gt;The only issue is how to copy the contents of the &lt;a href=&quot;http://en.wikipedia.org/wiki/Hierarchical_File_System&quot;&gt;HFS formatted&lt;/a&gt; ZIP disc, using Windows, to a place where the emulator can access it.  There are two steps to this: 1) creating an HFS volume within the Windows filesystem, and 2) accurately copying the contents of the original HFS media to the new volume, preserving all OS-specific aspects of the data.  For Macs, this means preserving both the resource and data forks of the files. (see the &lt;a href=&quot;from_floppies_to_repository_a_transition_of_bits&quot;&gt;Joyce project report&lt;/a&gt; for more on Mac files and HFS)&lt;/p&gt;
&lt;p&gt;Basilisk II uses disk volume images, called hardfiles (file extension: HFV), to simulate an HFS volume within the emulator. The Windows GUI (and presumably the command line tools for the non-Windows versions of Basilisk) can create a raw hardfile, but cannot copy anything into it. Fortunately there is a free program for Windows called &lt;a href=&quot;http://fenestrated.net/~macman/stuff/HFVExplorer/&quot;&gt;HFVExplorer&lt;/a&gt; that can create HFV files and view or manage their contents by copying to or from Windows volumes or other HFS volumes that are accessible to Windows (including CD-ROM, floppy, SCSI, and other removable media).&lt;/p&gt;
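&lt;p&gt;As an aside, a hardfile is nothing more exotic than a raw, byte-for-byte disk image.  A minimal sketch of creating an empty one follows (in Python, a convenience I did not use at the time; the size and filename are assumptions for illustration).  If I understand the format correctly, the emulated Mac OS should offer to initialize such a blank volume as HFS when it is first mounted:&lt;/p&gt;
&lt;pre&gt;
# Minimal sketch: write an empty 100 MB hardfile (a raw disk image)
# for Basilisk II.  The size matches a 100 MB ZIP disc; the filename
# is arbitrary.
CHUNK = 1024 * 1024        # write one megabyte at a time
SIZE_MB = 100

with open(&quot;zipdisk.hfv&quot;, &quot;wb&quot;) as hfv:
    for _ in range(SIZE_MB):
        hfv.write(b&quot;\x00&quot; * CHUNK)
&lt;/pre&gt;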
&lt;p&gt;Unfortunately, HFVExplorer would not mount my ZIP drive – using either the parallel port or the USB model.  My guess is that because the program has not been updated since 1999 – before Windows 2000/XP – it is unable to correctly access the removable media.  It is possible that HFVExplorer running on Windows 98 or NT would not encounter this problem, but I am not about to revert to either of those operating systems.  Besides, having to use an older operating system in order to get an emulator to work runs counter to common sense.&lt;/p&gt;
&lt;p&gt;Unable to rectify the issue, I tracked down a copy of &lt;a href=&quot;http://www.mars.org/home/rob/proj/hfs/&quot;&gt;HFSUtils&lt;/a&gt; (&lt;a href=&quot;http://www.student.nada.kth.se/~f96-bet/hfsutils/&quot;&gt;Windows port&lt;/a&gt;), which is the utility package that HFVExplorer was originally based upon.  HFSUtils is a set of command line tools that provide basic file management tasks for mounting and manipulating HFS volumes.  I mounted the ZIP volume and moved around using the command line with ease.  Copying was laborious, however, because the hcopy program could not recursively copy nested directories. Ideally I would have modified the source to do this, or created a script to use HFSUtils, but then I might as well figure out how to fix HFVExplorer.  &lt;/p&gt;
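&lt;p&gt;For the curious, the sort of wrapper script I had in mind is sketched below.  It drives the real hfsutils commands (hmount, hls, hcopy) from Python, but it is untested: the hls flags and the trailing-colon directory convention are my assumptions about the listing output, and the mount argument would depend on how the Windows port addresses the ZIP drive.&lt;/p&gt;
&lt;pre&gt;
# Untested sketch of a recursive copy using the hfsutils command-line
# tools.  hmount, hls, and hcopy are real hfsutils programs; the -F1
# flags (one name per line, directories marked with a trailing colon)
# and the device name are assumptions that would need verifying.
import os
import subprocess

def copy_tree(hfs_dir, local_dir):
    os.makedirs(local_dir, exist_ok=True)
    listing = subprocess.run([&quot;hls&quot;, &quot;-F1&quot;, hfs_dir],
                             capture_output=True, text=True, check=True)
    for name in listing.stdout.splitlines():
        if name.endswith(&quot;:&quot;):        # assumed directory marker
            copy_tree(hfs_dir + name, os.path.join(local_dir, name[:-1]))
        else:
            # -m copies as MacBinary, keeping data and resource forks
            subprocess.run([&quot;hcopy&quot;, &quot;-m&quot;, hfs_dir + name, local_dir],
                           check=True)

subprocess.run([&quot;hmount&quot;, &quot;E:&quot;], check=True)   # drive name is a guess
copy_tree(&quot;:&quot;, &quot;C:/mac_rescue&quot;)
&lt;/pre&gt;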
&lt;p&gt;In order to move the experiment along, I proceeded by copying the directories and files individually using hcopy at the command line; the most banal of tasks, to be sure, but quite effective.  I copied files from the HFS ZIP to a location on my Windows hard drive, then used HFVExplorer to copy the files into an HFV volume file.&lt;/p&gt;
&lt;p&gt;Aside from the lack of a recursive directory copy, there were some other annoying problems.  First, any source filenames containing characters outside the standard ASCII range (above code 127) had translation issues.  The problems came from trademark symbols, em dashes, and other special characters used in directory and file names in Mac OS. When the command line utility rendered these characters in a file list, they appeared as question marks by default.  Even when using some of the program options for hls (the file listing utility), the characters still did not display correctly in the Windows character set.  Attempts to copy files with special characters failed since the DOS command line sent the translated character to the command.  As an aside, it is possible to access HFS nodes (directories and files) by their node ID (or &lt;a href=&quot;http://www.mactech.com/articles/mactech/Vol.02/02.01/HFS/index.html&quot;&gt;node specification pair&lt;/a&gt;), but unfortunately, hcopy exposed no means to exploit this feature.&lt;/p&gt;
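&lt;p&gt;To make the character problem concrete: the same byte value names different characters in the Mac and DOS character sets.  A quick demonstration using Python&#039;s standard codecs (which, again, postdate this experiment):&lt;/p&gt;
&lt;pre&gt;
# Byte 0xAA is the trademark sign in the Mac character set, but the
# DOS console set assigns that byte to a different character entirely,
# so filenames passed through the console get mangled.
byte = bytes([0xAA])
print(byte.decode(&quot;mac_roman&quot;))   # trademark sign on the Mac side
print(byte.decode(&quot;cp437&quot;))       # a negation sign on the DOS console
&lt;/pre&gt;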
&lt;p&gt;I noticed a second issue that relates to the file metadata, specifically the creation and modification dates for the files copied from the HFS source to the Windows volume.  The copies on the Windows volume showed creation and modification dates as the date of the copy operation and not the original dates.  Fortunately, the files retained their resource forks during the transfer to the emulation environment, meaning that they had all of the file metadata intact with the original dates.  I&#039;m already well aware of &lt;a href=&quot;whats_in_a_creation_date&quot;&gt;inconsistencies in file dates&lt;/a&gt; from platform to platform, but it would be ideal from a preservation perspective if the hcopy routine were to access the file metadata in the HFS source and set the correct creation date for the Windows copies.&lt;/p&gt;
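&lt;p&gt;If hcopy (or a wrapper around it) did expose the original HFS dates, restoring at least the modification time on the Windows side would be trivial.  A hedged sketch (the hfs_mtime value is hypothetical and would have to come from the HFS catalog entry; note that os.utime cannot set the Windows &lt;i&gt;creation&lt;/i&gt; date, which requires the Win32 API):&lt;/p&gt;
&lt;pre&gt;
import os
import time

def restore_dates(windows_path, hfs_mtime):
    # Set access and modification times on the copied file; the
    # creation date would need a separate Win32 call.
    os.utime(windows_path, (time.time(), hfs_mtime))
&lt;/pre&gt;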
&lt;p&gt;Incidentally, there are other programs available for copying HFS volume data to Windows: &lt;a href=&quot;http://www.mediafour.com/macdrive&quot;&gt;MacDrive&lt;/a&gt; and &lt;a href=&quot;http://www.asy.com/scrtm.htm&quot;&gt;TransMac&lt;/a&gt;. Each of these is commercial software, but free demos are sometimes available.  On a whim I tried a demo version of TransMac, which copied the source files just fine, but I found out upon transferring the files to the HFV volume that the binary (program) files were converted in such a way that they would not function in the emulation environment.  Unless I missed something in my attempt, this issue would effectively prevent emulation using files copied with these programs.&lt;/p&gt;
&lt;p&gt;With all of the emulator software in place and a testbed of data from the original ZIP disk copied into an HFV volume file, I could test the emulator.  The Basilisk GUI program in Windows allows you to define a set of disk or volume images that will be loaded at startup.  This is the equivalent of having hard drives installed or discs inserted at boot time.  For an initial test, I downloaded and used a bare System 7.0 disc image.  It worked flawlessly on the first try, providing me with a visceral demonstration of the difference between 16 MHz and 1 GHz processor speeds – my 1997 self would have been dazzled.&lt;/p&gt;
&lt;p&gt;&lt;span class=&quot;inline center&quot; style=&quot;width: 300px;&quot;&gt;&lt;a href=&quot;/basilisk_ii_gui&quot;&gt;&lt;img src=&quot;http://thomas.kiehnefamily.us/thomas_files/images/mac_emulation_gui.jpg&quot; alt=&quot;Basilisk II GUI&quot; title=&quot;Basilisk II GUI&quot; class=&quot;image thumbnail&quot; height=&quot;264&quot; width=&quot;300&quot; /&gt;&lt;/a&gt;&lt;span class=&quot;caption&quot; style=&quot;width: 300px;&quot;&gt;&lt;strong&gt;Basilisk II GUI: &lt;/strong&gt;Selecting volumes at startup&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;After fixing a directory hierarchy issue with my copied volume (the system folder was not at the root level of the hard drive image), I relaunched the emulator, which resulted in an almost perfect reproduction of the Mac desktop I last viewed in 2005.&lt;/p&gt;
&lt;p&gt;&lt;span class=&quot;inline center&quot; style=&quot;width: 400px;&quot;&gt;&lt;a href=&quot;/basilisk_ii_running_os_7_5_in_windows_2000&quot;&gt;&lt;img src=&quot;http://thomas.kiehnefamily.us/thomas_files/images/mac_emulation_desktop.jpg&quot; alt=&quot;Basilisk II Running OS 7.5 in Windows 2000&quot; title=&quot;Basilisk II Running OS 7.5 in Windows 2000&quot; class=&quot;image thumbnail&quot; height=&quot;300&quot; width=&quot;400&quot; /&gt;&lt;/a&gt;&lt;span class=&quot;caption&quot; style=&quot;width: 400px;&quot;&gt;&lt;strong&gt;Basilisk II: &lt;/strong&gt;Running Mac OS 7.5 in Windows 2000&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Next step: resolve the transfer issues and try to establish a full, unmodified copy of the HFS source ZIP disk in the HFV volume file.  Ideally, an unfettered copy should work flawlessly for all programs, assuming that the emulator is complete. If that is the case, then I will finally be able to access the last remaining documents that I need to convert within the original application environment.&lt;/p&gt;
</description>
 <comments>http://thomas.kiehnefamily.us/an_emulation_experiment#comments</comments>
 <category domain="http://thomas.kiehnefamily.us/blog_topics/digital_archives">Digital Archives</category>
 <category domain="http://thomas.kiehnefamily.us/blog_topics/digital_preservation">Digital Preservation</category>
 <category domain="http://thomas.kiehnefamily.us/blog_topics/emulation">Emulation</category>
 <pubDate>Tue, 30 Oct 2007 06:01:53 +0000</pubDate>
 <dc:creator>tkiehne</dc:creator>
 <guid isPermaLink="false">48 at http://thomas.kiehnefamily.us</guid>
</item>
<item>
 <title>Software Activation, DRM, and Implications for Digital Preservation</title>
 <link>http://thomas.kiehnefamily.us/software_activation_drm_and_implications_for_digital_preservation</link>
 <description>&lt;p&gt;It&#039;s time again for another installment in my ongoing &lt;a href=&quot;/blog_topics/audio_encoding_project&quot; rel=&quot;nofollow&quot;&gt;audio encoding project&lt;/a&gt; saga.  For some time now I have been on the verge of the next phase of the project, which involves encoding the remaining analog sound objects in my collection, specifically cassette tapes and vinyl records.  Procrastination, combined with a serious dose of being busy with other things, has delayed my progress on this phase of the project, but one technical snag has also played a part.&lt;/p&gt;
&lt;p&gt;In order to digitize the analog sound objects I require a software platform for encoding the analog input into digital objects, one that is also capable of cleaning up analog artifacts such as tape hiss, pops, clicks, scratches, etc.  There are many software packages on the market for sound recording and processing and, fortunately, I already &quot;own&quot; one of them: Sonic Foundry&#039;s Sound Forge.&lt;/p&gt;
&lt;p&gt;So, what&#039;s the technical problem, you ask?  Well, I purchased version 5 of the software in 2001 as part of a special introductory promotion at a very reasonable price.  Unfortunately, Sonic Foundry transferred ownership of the entire Sound Forge product line, as well as a few other key products, to Sony in 2003.  Normally this wouldn&#039;t mean a thing, except for the fact that professional-level software like Sound Forge is protected by an online registration/activation scheme.  In a nutshell, the software will install and run just fine for a 30-day trial period.  During that period, you are expected to perform one of a set of procedures to register the product with the vendor which, when completed, will eliminate the 30-day countdown and give you full, unlimited access to the program.  As you can guess, the transfer to Sony complicated the process in that the online registration routine built into the original program could no longer find the registration server – these functions had been transferred to Sony while the software remained unchanged.&lt;/p&gt;
&lt;p&gt;Not being satisfied with only 30 days of the program at a time, and unwilling to shell out the bucks to upgrade, I embarked on a search to figure out the new registration procedures.  I&#039;ll spare the details, except to say that it took some Googling, several failed customer service contact attempts, numerous user forum searches, and a call to a number that I finally managed to track down, which directed me to a chat application on their Web site in order to get the information I needed to reactivate &quot;my&quot; software.&lt;/p&gt;
&lt;p&gt;In the end, no big deal, right?  But my experience exposes some very important digital preservation issues.  Sound Forge is not in itself a particularly important piece of digital information.  It is a toolkit used to create the artifacts in which we are interested; in this case, sound artifacts.  The same could be said about Photoshop, or any of an increasing number of professional media toolkits.  Perhaps the most a person in the future might need current or past versions of these software tools for would be to regenerate projects created with them, or to analyze detailed technical aspects of the software.  But, again, it is the products of these programs that will most likely interest future users, archivists, and the like.&lt;/p&gt;
&lt;p&gt;But consider this: the registration and activation process used in software like Sound Forge is conceptually identical to the license management process in the Digital Rights Management (DRM) schemes used to protect digital information, particularly music, movies, and other copyrighted works.  Having reviewed my account above, one could imagine that, instead of activating software I purchased, I might have been trying to access a DRM-encoded sound or video file that I had purchased in the past.  The same issues with license servers, transfer of ownership/responsibility, changes in the license registration schemes, and so on are just as pertinent in this new situation.&lt;/p&gt;
&lt;p&gt;Everything managed to turn out alright for me in this case, but imagine if Sonic Foundry had simply disappeared instead of selling off its product line?  Or what if I had tried to install this software 10 or 15 years later, after the market had decided that the software no longer held enough value to justify supporting it?  All discussion about ownership of digital information aside (a discussion that would explain my liberal use of scare quotes), it seems apparent from this example that if matters are left to the market (as governed by long copyright terms and far-reaching copyright legislation), we stand to lose not only the right to preserve digital information, but the technical ability to do so.  Conveniently enough, I&#039;ve written about &lt;a href=&quot;/technologies_of_access_and_the_cultural_record&quot; rel=&quot;nofollow&quot;&gt;this situation&lt;/a&gt; before.&lt;/p&gt;
&lt;p&gt;Stay tuned as I embark on the more complicated phases of my encoding project.&lt;/p&gt;
</description>
 <comments>http://thomas.kiehnefamily.us/software_activation_drm_and_implications_for_digital_preservation#comments</comments>
 <category domain="http://thomas.kiehnefamily.us/blog_topics/access">Access</category>
 <category domain="http://thomas.kiehnefamily.us/blog_topics/audio_encoding_project">Audio Encoding Project</category>
 <category domain="http://thomas.kiehnefamily.us/blog_topics/digital_preservation">Digital Preservation</category>
 <category domain="http://thomas.kiehnefamily.us/blog_topics/drm">DRM</category>
 <pubDate>Fri, 03 Aug 2007 23:00:39 +0000</pubDate>
 <dc:creator>tkiehne</dc:creator>
 <guid isPermaLink="false">45 at http://thomas.kiehnefamily.us</guid>
</item>
<item>
 <title>Bringing Records to the Users</title>
 <link>http://thomas.kiehnefamily.us/bringing_records_to_the_users</link>
 <description>&lt;p&gt;I&#039;ve volunteered at the local National Archives branch for over a year now.  Over this time I have gained an acute sense of the lament over the dwindling number of researchers and members of the public who make the trip out to the archives to do research.  Indeed, it is not uncommon for me to enter a virtually empty research room during my Thursday afternoon visits.&lt;/p&gt;
&lt;p&gt;What is seldom spoken of, but is vitally central to the issue, is the instant information gratification that the general populace receives from their increasingly ubiquitous internet connections.  The reams of information available through Google, and the convenience of accessing Ancestry.com from home (instead of for free at the archives), keep would-be patrons at home and only add insult to the injury of dwindling visits.&lt;/p&gt;
&lt;!--break--&gt;&lt;p&gt;NARA is not alone in this lament. Günter Waibel of OCLC recently spoke of it in a &lt;a href=&quot;http://hangingtogether.org/wp-trackback.php?p=186&quot;&gt;Hanging Together blog post&lt;/a&gt; covering the IMLS WebWise Conference, where he reposts a quote by Deanna Marcum of the Library of Congress that rather eloquently sums up the issue:&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;&lt;i&gt;“What I think our challenge is, it is not enough for us to create the perfect finding system, we know from all the user studies that individuals, who are looking for information, go directly to the open web, and our marvelous catalogues are not getting used. We have to find ways to take our content and the metadata and move that content to the open web.”&lt;/i&gt;&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;NARA relies on an online system called the &lt;a href=&quot;http://archives.gov/research/arc/&quot;&gt;Archival Research Catalog (ARC)&lt;/a&gt; to catalog and describe its considerable physical holdings.  This system was derived from an earlier prototype called NARA&#039;s Archival Information Locator (NAIL), which primarily contained images and other non-textual surrogates.  Each of these systems is of late-1990s vintage – a Web lifetime ago – and, quite frankly, the user interface shows it.  One of my contacts at NARA once stated that the government seems to be about a decade behind when it comes to information management, and in this case it seems true enough.&lt;/p&gt;
&lt;p&gt;Granted, the scope of these projects is monumental and one cannot expect them to be updated at Web speed to, say, incorporate some of the more useful “Web 2.0” principles.  However, given the exodus of potential patrons to full-text search engines, there is a particularly devastating truth hidden behind the aging ARC code.  To paraphrase Marcum, we need to take our content and metadata to where the users are.  As it turns out, some informal research provides a striking illustration of this need.&lt;/p&gt;
&lt;p&gt;One evening I found myself looking through the Web site of the &lt;a href=&quot;http://nwda.wsulibs.wsu.edu/&quot;&gt;Northwest Digital Archives&lt;/a&gt;, a group of education, government, and private archives across the Pacific Northwest and Alaska.  I was curious about a number of things, not least of which was the potential for sharing finding aids for holdings at NARA&#039;s Seattle branch with the NWDA.  This led me to search and browse some of the NWDA finding aids available via their &lt;a href=&quot;http://nwda-db.wsulibs.wsu.edu/nwda-search/&quot;&gt;search engine&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;After browsing some of their EAD-encoded, HTML-transformed finding aids, it occurred to me that this method of presentation very likely exposed the finding aids to Web agents acting on behalf of search engines like Google.  I immediately wondered if the same was true of ARC and set out to verify.&lt;/p&gt;
&lt;p&gt;Google exposes a number of advanced search parameters that can, among other things, limit the results to one site or domain.  Using this feature, I performed a search for the number of records returned by Google for the &lt;a href=&quot;http://www.google.com/search?q=site%3Anwda-db.wsulibs.wsu.edu&quot;&gt;NWDA&lt;/a&gt; and &lt;a href=&quot;http://www.google.com/search?q=site%3Aarcweb.archives.gov&quot;&gt;ARC&lt;/a&gt; respectively (click these links to see for yourself or reference the images below).  The results are shocking: on the order of 3000 results from the NWDA, most of which are finding aids that have been full-text indexed; but only one result for ARC – its main search page!&lt;/p&gt;
&lt;p&gt;&lt;span class=&quot;inline center&quot; style=&quot;width: 594px;&quot;&gt;&lt;img src=&quot;http://thomas.kiehnefamily.us/thomas_files/images/search_nwda.png&quot; alt=&quot;Search - NWDA: A Google search result for the Northwest Digital Archives&quot; title=&quot;Figure 1: A Google search result for the Northwest Digital Archives&quot; class=&quot;image preview&quot; height=&quot;111&quot; width=&quot;594&quot; /&gt;&lt;span class=&quot;caption&quot; style=&quot;width: 592px;&quot;&gt;&lt;strong&gt;Figure 1: &lt;/strong&gt;A Google search result for the Northwest Digital Archives&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span class=&quot;inline center&quot; style=&quot;width: 594px;&quot;&gt;&lt;img src=&quot;http://thomas.kiehnefamily.us/thomas_files/images/search_arc.png&quot; alt=&quot;Search - ARC: A Google search for ARC records&quot; title=&quot;Figure 2: A Google search for ARC records&quot; class=&quot;image preview&quot; height=&quot;111&quot; width=&quot;594&quot; /&gt;&lt;span class=&quot;caption&quot; style=&quot;width: 592px;&quot;&gt;&lt;strong&gt;Figure 2: &lt;/strong&gt;A Google search for ARC records&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
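&lt;p&gt;Reproducing the comparison is easy enough; here is a small Python sketch that builds the same site-restricted queries (the result-page URL structure is simply copied from the links above and may change):&lt;/p&gt;
&lt;pre&gt;
import urllib.parse

def site_search_url(domain):
    # Restrict a Google query to a single host via the site: operator.
    query = urllib.parse.urlencode({&quot;q&quot;: &quot;site:&quot; + domain})
    return &quot;http://www.google.com/search?&quot; + query

print(site_search_url(&quot;nwda-db.wsulibs.wsu.edu&quot;))
print(site_search_url(&quot;arcweb.archives.gov&quot;))
&lt;/pre&gt;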
&lt;p&gt;The point is clear: the catalog information about the holdings of the NWDA members is being seen by the masses searching Google, while not one of the multitudinous ARC records is being seen outside NARA&#039;s domain.  Although seasoned researchers will know to go directly to the NARA site to search its holdings, the well-meaning masses will not.  But which is easier: changing millions of people&#039;s Web searching habits, or altering an application to get the data to where those millions of people already are?&lt;/p&gt;
&lt;p&gt;Fortunately, this example was very clearly understood by NARA staff at the Seattle branch, and apparently by the staff in D.C. as well.  Within weeks of my demonstration I heard reports of changes being made to ARC to allow direct links to the catalog records, and of a plan to phase in search-engine exposure by submitting indexes of those links directly to the major search engines.  Regardless of where the impetus came from or what plans were already in motion, I am happy to know that progress is being made.&lt;/p&gt;
&lt;p&gt;I&#039;ve been working with the marketing sector as a programmer for over nine years.  The companies I have worked for have specialized in Web and online marketing, and, although I have no direct role in  marketing planning, I have of necessity become intimately familiar with such concepts as ROI and performance tracking, sales cycles, and lead/customer database development.  Because of this exposure, the notion of how to reach people online is somewhat intuitive to me, but that knowledge is not yet widely dispersed among the archival profession.  From this example, the intersection of Web marketing and archives seems to be an interesting space for me to explore.&lt;/p&gt;
</description>
 <comments>http://thomas.kiehnefamily.us/bringing_records_to_the_users#comments</comments>
 <category domain="http://thomas.kiehnefamily.us/blog_topics/access">Access</category>
 <category domain="http://thomas.kiehnefamily.us/blog_topics/archives">Archives</category>
 <category domain="http://thomas.kiehnefamily.us/blog_topics/marketing">Marketing</category>
 <category domain="http://thomas.kiehnefamily.us/blog_topics/nara">NARA</category>
 <category domain="http://thomas.kiehnefamily.us/blog_topics/search">Search</category>
 <pubDate>Tue, 17 Apr 2007 07:06:37 +0000</pubDate>
 <dc:creator>tkiehne</dc:creator>
 <guid isPermaLink="false">44 at http://thomas.kiehnefamily.us</guid>
</item>
<item>
 <title>Audio Encoding Project Milestone</title>
 <link>http://thomas.kiehnefamily.us/audio_encoding_project_milestone</link>
 <description>&lt;p&gt;This week I achieved a major milestone in my personal audio encoding project -- after a longer period of time than I planned.  Other than a few stragglers, I have managed to completely encode all of my compact discs, including creating use copies and fully describing the objects using embedded metadata.  These are all stored on a 0.75 TB NAS appliance configured in a RAID 5 array.  Additionally, I have ingested (using the term rather loosely) the use copies into an access system, &lt;a href=&quot;http://www.ampache.org&quot;&gt;Ampache&lt;/a&gt;.&lt;/p&gt;
&lt;!--break--&gt;&lt;p&gt;When I embarked on this project, &lt;a href=&quot;a_personal_audio_encoding_project&quot;&gt;I estimated&lt;/a&gt; that my entire collection -- CDs, albums, and cassettes -- would amount to about 300GB in total.  The lossless formats in the corpus currently comprise approximately 230GB across 7500 distinct objects.  Once I finish encoding the analog formats I expect to exceed my initial 300GB estimate.  Incidentally, the total corpus including use copies amounts to about 280GB across 14,500 objects*.&lt;/p&gt;
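&lt;p&gt;(Working the numbers from the figures above: 230GB across 7,500 lossless objects averages roughly 31MB per object, while the remaining ~50GB across the roughly 7,000 use copies averages about 7MB each -- consistent with the size ratio one would expect between lossless and lossy encodings.)&lt;/p&gt;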
&lt;p&gt;The next step, other than sweeping up the straggler CDs, is to move to encoding my cassettes.  I have fewer cassettes than vinyl records and, since there are fewer noise reduction and quality issues to address, I figure that encoding them is the logical next step.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;*Note: Although one might think the number of objects would simply be double the number of lossless objects, some releases were live or mixed albums, which I encoded as one track for the use copy rather than as the individual tracks that appeared on the CD.  This approach preserves the work as a unitary effort for access while maintaining the original order and structure in the lossless format.&lt;/small&gt;&lt;/p&gt;
</description>
 <comments>http://thomas.kiehnefamily.us/audio_encoding_project_milestone#comments</comments>
 <category domain="http://thomas.kiehnefamily.us/blog_topics/audio_encoding_project">Audio Encoding Project</category>
 <category domain="http://thomas.kiehnefamily.us/blog_topics/digital_archives">Digital Archives</category>
 <pubDate>Mon, 19 Feb 2007 00:05:25 +0000</pubDate>
 <dc:creator>tkiehne</dc:creator>
 <guid isPermaLink="false">41 at http://thomas.kiehnefamily.us</guid>
</item>
<item>
 <title>Digital Storage Update 2007</title>
 <link>http://thomas.kiehnefamily.us/digital_storage_update_2007</link>
 <description>&lt;p&gt;It has been well over a year since my last &lt;a href=&quot;new_mass_storage_technology_and_research&quot;&gt;digital storage update&lt;/a&gt;, and though there has not been any earthshaking new technology announced within that time, there has nevertheless been some advancement in several areas that I would like to address.&lt;/p&gt;
&lt;!--break--&gt;&lt;p&gt;&lt;b&gt;Vertical / Perpendicular Drives&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;One of the major highlights of the last year has been the introduction of so-called vertical or perpendicular drive technology.  Vertical recording aligns the data bits in a vertical, or perpendicular, orientation with respect to the plane of the storage media, instead of the traditional horizontal arrangement.  Vertical techniques are already in use and have significantly increased storage densities, particularly for compact notebook drives (see &lt;a href=&quot;http://www.wired.com/news/wireservice/0,70024-0.html&quot;&gt;Wired: Hard Drives Get Vertical Boost&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New Media Manipulation Techniques&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;Recording data perpendicular to the plane of the media works around the superparamagnetic limit, but is expected to peak at a data density of about 1TB per square inch.  Seagate is looking to extend this gain by combining it with other technologies, to the extent that we could see data densities of 50TB per square inch within 10 years.  A technique called HAMR (heat-assisted magnetic recording) uses lasers to heat the disk surface while writing, after which it cools to a more stable state.  The heating exposes fewer individual grains of disc material to the write process, thus increasing data density.  This process is further refined by organizing the grains into a more regular pattern in a process called bit patterning, where a chemically encoded molecular pattern is infused into the substrate during manufacture.  The combination of these techniques with vertical recording yields a bit of data per grain of magnetic substrate, compared to the roughly one bit per 50 grains that we see now (see &lt;a href=&quot;http://www.wired.com/news/technology/0,72387-0.html&quot;&gt;Wired: Inside Seagate&#039;s R&amp;amp;D Labs&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;&lt;b&gt;Hybrid Drives and Solid State Storage&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;While some manufacturers continue to push for higher data densities, others have improved devices in different ways.  Hybrid drives have been developed that combine solid-state flash memory and conventional magnetic discs to increase speed and reliability.  Though this development does little to increase storage capacities, it does help with reducing power consumption and portends the elimination of moving parts -- and the corresponding risk of mechanical failure -- as flash memory increases in capacity (see: &lt;a href=&quot;http://www.pcmag.com/article2/0,1895,1973122,00.asp&quot;&gt;PC Magazine: Seagate Launches First Hybrid Hard Drive&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;Speaking of Flash memory, Freescale has improved on the concept by introducing MRAM (magnetoresistive random-access memory).  MRAM boasts faster read/write speeds and better stability than current Flash memory while still holding data after power has been removed from the chip.  This technology improves on the upper limit of the lifespan of Flash memory (see: &lt;a href=&quot;http://news.bbc.co.uk/2/hi/technology/5164110.stm&quot;&gt;BBC: &#039;Magnetic memory&#039; chip unveiled&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;As if MRAM and Flash are not enough, research is continuing on “phase change” memory that promises more stable storage than Flash memory at as much as 500 times the speed.  In addition to faster and more stable storage, phase change chips promise to be much more compact.  Initial prototypes of phase change chips have already been introduced by Samsung, and there will likely be production models out within a couple of years (see: &lt;a href=&quot;http://online.wsj.com/public/article/SB116580685002446215-4Hx7rrKLHcz7OHLOMyKOqi0aXlk_20061218.html&quot;&gt;Wall Street Journal: Disk Drives Face Challenge If New Chip Comes to Market&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;Again, these solid-state technologies do little to increase storage capacities, but improve stability and power consumption, and thus, offer more efficient and stable overall storage and retrieval systems.&lt;/p&gt;
&lt;p&gt;&lt;b&gt;Optical Storage&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;HD-DVD and Blu-Ray media are set to multiply their storage capacities by adding additional layers and increasing the data density per layer.  At base specifications, 10 layers on an HD DVD would yield 150GB, assuming 15GB per layer. For Blu-ray, the total over 10 layers jumps to 250GB, assuming the base 25GB per layer.  These extra layers are not supported by current readers, but the concept indicates a potentially longer lifespan for standards that initially seemed to be dead on arrival.  (see: &lt;a href=&quot;http://www.dailytech.com/article.aspx?newsid=5656&quot;&gt;Daily Tech: Three HD Layers Today, Ten Tomorrow&lt;/a&gt;)&lt;/p&gt;
&lt;p&gt;Meanwhile, much of the HD-DVD vs Blu-Ray debate has been thwarted by the announcement of a hybrid disc capable of storing data in both formats on one disc.  Warner Brothers recently unveiled Total HD Disc, which eschews a standard format DVD layer in order to bundle the two competing HD formats into one disc playable in either type of HD player.  This approach is contrasted by the introduction of dual players which have both HD-DVD and Blu-Ray capabilities (see: &lt;a href=&quot;http://www.nytimes.com/2007/01/04/technology/04video.html?ex=1169701200&amp;amp;en=6d726c6a23497a50&amp;amp;ei=5070&quot;&gt;New York Times: New Disc May Sway DVD Wars&lt;/a&gt;). &lt;/p&gt;
&lt;p&gt;Even at the higher recording capacities afforded by multiple layers, neither standard will approach the capacities of the terabyte holographic discs that I reported on last year.  Given the massive marketing effort behind HD-DVD and Blu-Ray, and the emphasis on applications in entertainment as opposed to mass storage, I have doubts about commercial manufacturers dumping these in favor of holographic media any time soon.  The most likely effect of the market battles over the two dominant HD formats is that newer, higher capacity formats will come at a premium for those seeking to implement high capacity data storage solutions.  Ars Technica suggests that smaller capacity formats will be exploited first in order to decrease the cost to end users and hasten adoption.  But even such decreased capacities are expected to be greater than the multi-layer HD-DVD and Blu-Ray concepts discussed above (see: &lt;a href=&quot;http://arstechnica.com/news.ars/post/20060804-7424.html&quot;&gt;Ars Technica: Holographic storage a reality before the end of the year&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;While the HD-DVD / Blu-Ray market squabbles continue, yet another terabyte optical technique has been developed.  Researchers at the University of Central Florida have developed a 3-D optical system that uses two different light wavelengths to write to multi-layer DVD media, promising more than a terabyte per disc.  There are no plans yet for bringing it to market, but with so many terabyte optical techniques in the works, one or more are bound to arrive soon (see: &lt;a href=&quot;http://news.ucf.edu/UCFnews/index?page=article&amp;amp;id=0024004105bd60439010c0c76ce2f00409b&quot;&gt;University of Central Florida: UCF Researcher’s 3-D Digital Storage System Could Hold a Library on One Disc&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;&lt;b&gt;Tape Scrolls On&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;Not to be outdone, new developments in tape technology promise 15 times greater data density in new cassette form factors within five years.  This translates to roughly 8 TB per cartridge (see: &lt;a href=&quot;http://www.spacemart.com/reports/IBM_breakthrough_multiplies_the_amount_of_data_that_can_be_stored_on_tapes.html&quot;&gt;SpaceMart: IBM breakthrough multiplies the amount of data that can be stored on tapes&lt;/a&gt; and &lt;a href=&quot;http://www.wired.com/news/technology/0,70904-0.html&quot;&gt;Wired: Tape Storage Increases 15 Times&lt;/a&gt;).  At this sort of density, tape still offers the best price-to-capacity ratio and out-carries all storage media short of large magnetic disc arrays.  The question of tape&#039;s long-term reliability remains open, however.&lt;/p&gt;
&lt;p&gt;&lt;b&gt;The X-Factor&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;Moving on to more theoretical realms, scientists at the Max Planck Institute have made a breakthrough on a 40-year-old theory, revealing tiny, closed magnetic circuits -- vortexes -- whose polar properties could represent data bits.  This phenomenon occurs on a scale of about 20 atoms in diameter, which is much smaller than the single grains of magnetic material that Seagate hopes to exploit in the near future (see above).  Techniques exploiting this phenomenon are expected to be much more resilient against external disruptions such as heat and magnetic fields, but there is no word yet on a horizon for practical application or storage densities (see: &lt;a href=&quot;http://www.mpg.de/english/illustrationsDocumentation/documentation/pressReleases/2006/pressRelease200611281&quot;&gt;Max-Planck-Gesellschaft: Magnetic Needles turn Somersaults&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;Even further out there is a scheme proposed by a Drexel University professor that claims 12.8 petabytes in the space of a cubic centimeter!  The technique exploits the properties of nano-scale, ferromagnetic wires stabilized by water.  Again, commercialization seems quite a way off, but this should provide good fodder for speculative fiction writers everywhere, at least until the shock of a petabyte iPod Nano wears off (see: &lt;a href=&quot;http://www.drexel.edu/univrel/dateline/default_nik.pl?p=releaseview&amp;amp;of=1&amp;amp;f=20060508-01&quot;&gt;Drexel University: For a Bigger Computer Hard-drive, Just Add Water&lt;/a&gt;).&lt;/p&gt;
</description>
 <comments>http://thomas.kiehnefamily.us/digital_storage_update_2007#comments</comments>
 <category domain="http://thomas.kiehnefamily.us/blog_topics/storage">Storage</category>
 <category domain="http://thomas.kiehnefamily.us/blog_topics/technology">Technology</category>
 <pubDate>Thu, 15 Feb 2007 07:57:45 +0000</pubDate>
 <dc:creator>tkiehne</dc:creator>
 <guid isPermaLink="false">40 at http://thomas.kiehnefamily.us</guid>
</item>
<item>
 <title>Metadata Quality Control</title>
 <link>http://thomas.kiehnefamily.us/metadata_quality_control</link>
 <description>&lt;p&gt;As I near the end of the first phase of my &lt;a href=&quot;/audio_encoding_project_resumes&quot;&gt;audio encoding project&lt;/a&gt; I feel the need to share some of the metadata quality control observations that I have collected. &lt;/p&gt;
&lt;p&gt;Although ripping my CDs to digital media has been time consuming, it has not been nearly as laborious as checking and correcting the metadata that was automatically gathered during the process.  &lt;a href=&quot;http://www.freedb.org&quot;&gt;FreeDB&lt;/a&gt; as an automatic metadata gathering service has been very helpful, but as I reviewed the corpus of encoded audio, I found many disturbing errors: misspellings, typos, missing articles, missing fields, omission or misrepresentation of international characters, and, of course, the usual discrepancies in case handling, title formatting, and normalized forms.&lt;/p&gt;
&lt;!--break--&gt;&lt;p&gt;To find and correct these errors, I relied on separate discography databases and integrated metadata management software.  &lt;a href=&quot;http://www.discogs.com&quot;&gt;Discogs.com&lt;/a&gt; has been an invaluable resource for confirming apparent discrepancies, as well as helping me find and correct release dates, many of which were not present in the FreeDB data.  Discogs does some rather stringent data normalization, often deviating from what is present on the actual releases, so that it can eliminate redundancy and excessive cross-linking between records.  This issue has been the source of heated debate on the site&#039;s forums, as well it ought to be.  Having submitted information to Discogs for many of my rarer CDs helped me to understand the compromises that they have made in their system, so that I could understand why deviations from the original objects occurred and make an informed decision as to whether to apply changes.  In the absence of information from Discogs, label, band, and fan sites have also come in handy for verifying information.&lt;/p&gt;
&lt;p&gt;The most important tool I&#039;ve used is an integrated metadata management program called &lt;a href=&quot;http://www.softpointer.com/tr.htm&quot;&gt;Tag &amp;amp; Rename&lt;/a&gt;.  This particular program merges a Windows Explorer-like interface for viewing directories and files with an embedded metadata viewer that is capable of extracting and manipulating all of the major audio metadata formats (ID3, Ogg Vorbis, AAC, APE, etc.).  The software provides a middle ground between the file system and the content, which greatly increases the speed at which I can update embedded metadata.&lt;/p&gt;
&lt;p&gt;In fact, there seem to be many such tools for all sorts of digital object types.  Another one I have come across is &lt;a href=&quot;http://www.friedemann-schmidt.com/software/exifer/&quot;&gt;Exifer&lt;/a&gt;, a program that allows editing of embedded EXIF information in digital photos.  I expect this program will come in handy when I begin processing my seven years&#039; worth of digital images.&lt;/p&gt;
&lt;p&gt;Between these two programs, I have come up with a general list of essential characteristics for embedded metadata editors:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;b&gt;Filesystem integration&lt;/b&gt;: Functions such as copy, move, delete, rename, create directories, and so on. This feature ensures that you can stay within the metadata editing environment, which saves time otherwise wasted in program switching.  One thing that has been missing in my experience, and which could be useful, is having multiple filesystem views so that you can jump between directories or volumes without leaving your working directory.  This idea is inspired by my preferred code editor, HomeSite.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Metadata listed in directory views&lt;/b&gt;: Selected metadata should be shown as part of the file list to allow a quick appraisal of the contents of the embedded metadata.  Like a file list in the operating system, the list should be sortable and/or filterable by metadata field.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Ability to manipulate many different metadata standards&lt;/b&gt;: The program should be able to manipulate all applicable formats for the target object type (image, sound, text, etc.).  Additionally, an ideal program would be extensible such that new metadata and file types could be added as needed.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Automated or batch editing&lt;/b&gt;: Manual, object-by-object editing is an expected feature, but the greatest time saver is the ability to modify entire directories or lists at once.  Additionally, the ability to convert from one metadata format to another applicable format (e.g.: id3v1 to id3v2) is essential.  Copying tags directly from one field to another in the same file, swapping tags, and copying tags from one file to another have also been essential features.  Finally, extraction of metadata from the filesystem, such as regular expression or pattern conversion of filenames into metadata, and vice-versa, has also come in handy (see the sketch after this list).&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Ability to create or access authority files&lt;/b&gt;: Tag &amp;amp; Rename allows me to create a list of genres for music files and exposes that list in edit dialogs, although it does not apparently have the option to force me to use only this list.  In the absence of pre-coordinate lists, input masks should be available, especially for date and time fields.  An added bonus for more detailed metadata formats could be accessing authoritative Web services for standard entries, such as a &lt;a href=&quot;http://en.wikipedia.org/wiki/Library_of_Congress_Subject_Headings&quot;&gt;LCSH&lt;/a&gt; service for subjects, though I am not certain that such things yet exist.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Aggregate and summary views&lt;/b&gt;:  This feature does not exist in Tag &amp;amp; Rename, but having brought all of my encoded music into an access system, I have sorely missed it. Essentially, there should be a way to see the total number of objects marked with specific data, for example: grouped by genre.  By browsing a list of all genres returned by my access system I was able to see outliers or variants that were present (e.g.: Synth Pop vs. Synthpop) and find them so that I could go back to the metadata editor and normalize as needed (the sketch after this list includes an example).  It would be ideal to have this capability within the editor; although simply conforming to an authority list of genres would have prevented this particular problem, there may be situations where a strict authority list is not desirable.&lt;/li&gt;
&lt;/ul&gt;
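&lt;p&gt;To illustrate the last two points (the filename-pattern conversion and the aggregate view), here is a small, self-contained Python sketch.  The &quot;Artist - Album - 01 - Title&quot; filename layout and the sample genre strings are assumptions for illustration; a real collection would need several patterns:&lt;/p&gt;
&lt;pre&gt;
import re
from collections import Counter

# Pattern conversion of filenames into metadata fields.
PATTERN = re.compile(r&quot;(.+) - (.+) - (\d+) - (.+)\.flac&quot;)
FIELDS = (&quot;artist&quot;, &quot;album&quot;, &quot;track&quot;, &quot;title&quot;)

def tags_from_filename(name):
    match = PATTERN.match(name)
    return dict(zip(FIELDS, match.groups())) if match else None

print(tags_from_filename(&quot;Orbital - In Sides - 03 - The Box.flac&quot;))

# The aggregate view I wished for: counting objects per distinct
# genre string makes variants (Synth Pop vs. Synthpop) jump out.
def genre_summary(genres):
    for genre, count in Counter(genres).most_common():
        print(count, genre)

genre_summary([&quot;Synthpop&quot;, &quot;Synth Pop&quot;, &quot;Synthpop&quot;, &quot;Industrial&quot;])
&lt;/pre&gt;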
&lt;p&gt;This is by no means an exhaustive list, and is perhaps too general to fit all object types, but the basic concept is clear.  As a rule, the less typing one does, the more accurate the metadata, but as I have experienced, even external databases have errors.  Over the course of thousands of files or records, small error percentages accrue quickly.  I can only imagine the headaches that would have arisen were my project to take place in a larger organization, with many people participating in the encoding and preservation process, let alone with a much larger corpus.  It is clear that quality control of metadata, whether hand-entered or not, is crucial, and software tools like these are what make it feasible.&lt;/p&gt;
</description>
 <comments>http://thomas.kiehnefamily.us/metadata_quality_control#comments</comments>
 <category domain="http://thomas.kiehnefamily.us/blog_topics/audio_encoding_project">Audio Encoding Project</category>
 <category domain="http://thomas.kiehnefamily.us/blog_topics/digital_archives">Digital Archives</category>
 <category domain="http://thomas.kiehnefamily.us/blog_topics/metadata">Metadata</category>
 <pubDate>Wed, 17 Jan 2007 05:20:54 +0000</pubDate>
 <dc:creator>tkiehne</dc:creator>
 <guid isPermaLink="false">38 at http://thomas.kiehnefamily.us</guid>
</item>
<item>
 <title>Slashdot: Archiving Digital Data an Unsolved Problem</title>
 <link>http://thomas.kiehnefamily.us/slashdot_archiving_digital_data_an_unsolved_problem</link>
 <description>&lt;p&gt;The headline on a front page &lt;a href=&quot;http://hardware.slashdot.org/article.pl?sid=06/11/20/2036247&quot;&gt;post on Slashdot&lt;/a&gt; today reads:&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;&lt;b&gt;&quot;Archiving Digital Data an Unsolved Problem&quot;&lt;/b&gt;&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;which links to this &lt;a href=&quot;http://www.popularmechanics.com/technology/industry/4201645.html&quot;&gt;article in &lt;i&gt;Popular Mechanics&lt;/i&gt;&lt;/a&gt;.  For archivists, this headline states the obvious, but the words betray how the technology sector, at least stereotypically, views archives and backups as equivalent.  Wading through the comments (and discarding the obligatory comical entries), we find a rather robust discussion on digital preservation, sans academic terminology.  All the familiar preservation topics -- migration, emulation, media and file formats, genres, the influence of intellectual property law -- are touched upon, if rather superficially.  One commenter brought up the issue of compression in digital archives, but it seems that no one has touched the &lt;a href=&quot;/technologies_of_access_and_the_cultural_record&quot;&gt;DRM issue&lt;/a&gt; (I&#039;ll have to remedy that).&lt;/p&gt;
&lt;p&gt;That said, however, it is encouraging to see this article highlighted on one of the premier tech blogs as well as in &lt;i&gt;Popular Mechanics&lt;/i&gt;.  It&#039;s going to take quite a bit more exposure to digital preservation problems in the tech community to get the point across -- to impart the long view, as it were -- but this is a good start.&lt;/p&gt;
</description>
 <comments>http://thomas.kiehnefamily.us/slashdot_archiving_digital_data_an_unsolved_problem#comments</comments>
 <category domain="http://thomas.kiehnefamily.us/blog_topics/digital_archives">Digital Archives</category>
 <category domain="http://thomas.kiehnefamily.us/blog_topics/digital_preservation">Digital Preservation</category>
 <pubDate>Tue, 21 Nov 2006 06:24:33 +0000</pubDate>
 <dc:creator>tkiehne</dc:creator>
 <guid isPermaLink="false">37 at http://thomas.kiehnefamily.us</guid>
</item>
<item>
 <title>Neil Beagrie on Personal Digital Libraries and Collections</title>
 <link>http://thomas.kiehnefamily.us/neil_beagrie_on_personal_digital_libraries_and_collections</link>
 <description>&lt;p&gt;I finally got around to reading Neil Beagrie&#039;s D-Lib article, &quot;&lt;a href=&quot;http://dlib.org/dlib/june05/beagrie/06beagrie.html&quot; rel=&quot;nofollow&quot;&gt;Plenty of Room at the Bottom? Personal Digital Libraries and Collections&lt;/a&gt;&quot; (June 2005), and I regret not having done so sooner (alas, I have a great deal left in my &quot;to read&quot; folder).  This article touches on several major themes in my academic pursuits of the last few years, which I will briefly describe here.&lt;/p&gt;
&lt;p&gt;What drew me to the archival field was the overarching concern I have about the potential loss to our external memory in the sense of our information bearing objects.  Being firmly seated in the digital generation, my concern is mostly over digital materials, and having completed my information science degree I find that, though I still worry about our institutions&#039; digital preservation efforts, it is the enormous amount of personal digital information that people the world over possess that really worries me.  Beagrie&#039;s article attacks this issue head-on, naming this body &quot;personal digital collections&quot; and enumerating not only the threat of loss, but the challenge these non-traditional collections pose to our &quot;memory institutions.&quot; &lt;/p&gt;
&lt;p&gt;Personal digital collections are subject to the same threats to persistence that the large institutional and academic projects are – obsolete formats and media, access regimes such as passwords and DRM, and so on.  Beagrie also enumerates missing data as a threat, with the parenthetical &quot;email, webpages, etc.&quot;  It seems that he means links to web pages, references to emails that have been deleted, and so on, but I wonder if mere information mismanagement is also intended?  A recent episode in my own personal digital information management should elucidate.&lt;/p&gt;
&lt;p&gt;As part of my ongoing &lt;a href=&quot;/a_personal_audio_encoding_project&quot; rel=&quot;nofollow&quot;&gt;audio encoding project&lt;/a&gt;, I have been preserving some of my own audio works from the last decade.  I have also been checking my music collection, including these personal works, against an online discography database, &lt;a href=&quot;http://www.discogs.com&quot; rel=&quot;nofollow&quot;&gt;Discogs.com&lt;/a&gt;.  Every release in the Discogs database represents a physical object (CD, LP, etc.) released by a specific entity (record label), and lists not only the track information, but catalog information, liner notes, and cover art.  As you can probably guess, the music that I created and released was not widely known or distributed (I still have a day job), so naturally there were no previous entries in Discogs.com.  In the process of updating the database with my defunct label&#039;s releases, I found to my horror that I had lost some of the original digital files containing artwork and layout for some of my releases!  Granted, I have not always been preservation-minded, but I had always assumed that these files were migrated from computer to computer over the past decade.  Certainly lapses of this sort pose a significant hazard to personal digital collections, and I&#039;m sure that it qualifies as &quot;missing data.&quot;&lt;/p&gt;
&lt;p&gt;Interestingly enough, my Discogs example also touches on Beagrie&#039;s discussion of &quot;information banks.&quot;  Although Discogs does not store the actual information represented in its indexes (the music), it is easy to visualize how it could were it not for the copyright regime so voraciously defended by the music industry.  This worn argument aside, Discogs does implement a social networking component of the likes proffered in Beagrie&#039;s discussion of information sharing services such as blogs and sites like Flickr.  By adding a social networking component, all of these sites, whether they publish unique user content or merely aggregate collected information (like Discogs), add a layer of informational value in the form of contributed information (e.g.: blog comments) or linked information (e.g.: relationships between artists in Discogs).  But perhaps more importantly, the creation of these information banks, whatever their form, supports my assertion that digital preservation efforts must be aggregated at some level beyond a single (physical) entity&#039;s capabilities -- that only distributed efforts will ensure that digital assets are adequately preserved and accessed, let alone described and identified.  This is as true for the National Archives as it is for Joe Q. Public&#039;s personal works.&lt;/p&gt;
&lt;p&gt;As an aside, I could not help but notice that all of the talk about social networks and personal collections echoes writings on digitally mediated identity by &lt;a href=&quot;http://www.danah.org/&quot; rel=&quot;nofollow&quot;&gt;Danah Boyd&lt;/a&gt;.  Beagrie&#039;s Venn diagram defining &quot;public persona&quot; invites comparison to &lt;a href=&quot;http://smg.media.mit.edu/people/danah/thesis&quot; rel=&quot;nofollow&quot;&gt;Boyd&#039;s thesis work on faceted identity&lt;/a&gt;.  I imagine that there is much to explore in the intersection of faceted identities -- or, for that matter, multiple public personas -- and the consequences for the &quot;Lifetime Personal Web-spaces&quot; concept mentioned at the end of the article.&lt;/p&gt;
&lt;p&gt;In closing, one quote in particular caught my attention, as it factors into my explorations of the &quot;save everything&quot; debate.  Beagrie writes (crediting Michael Lesk): &quot;The combination of cheap digital storage and very sophisticated retrieval tools is shifting the balance of costs: digitally it is becoming cheaper to collect and more expensive to select, and cheaper to search than to organize.&quot;  In other words, the scarcity argument is shifting from &quot;we don&#039;t have enough space&quot; to &quot;we don&#039;t have the time to organize what we have,&quot; but, as Beagrie seems to say, that no longer matters so long as you do not expect traditional access mechanisms.  Or, more succinctly (with a nod to &lt;a href=&quot;http://rpm.lib.az.us/newskills/CaseStudies/4_Stollar_Kiehne.pdf&quot; rel=&quot;nofollow&quot;&gt;Catherine Stollar&lt;/a&gt; for originally expressing it): &quot;what we do... will change, but why we do it does not.&quot;&lt;/p&gt;
</description>
 <comments>http://thomas.kiehnefamily.us/neil_beagrie_on_personal_digital_libraries_and_collections#comments</comments>
 <category domain="http://thomas.kiehnefamily.us/blog_topics/digital_archives">Digital Archives</category>
 <category domain="http://thomas.kiehnefamily.us/blog_topics/digital_libraries">Digital Libraries</category>
 <category domain="http://thomas.kiehnefamily.us/blog_topics/save_everything">Save Everything</category>
 <pubDate>Mon, 16 Oct 2006 07:37:50 +0000</pubDate>
 <dc:creator>tkiehne</dc:creator>
 <guid isPermaLink="false">36 at http://thomas.kiehnefamily.us</guid>
</item>
<item>
 <title>The Future of the Hard Drive</title>
 <link>http://thomas.kiehnefamily.us/the_future_of_the_hard_drive</link>
 <description>&lt;p&gt;On (roughly) the 50th anniversary of the invention of the hard drive, Tom&#039;s Hardware interviews Seagate&#039;s Senior Field Applications Engineer Henrique Atzkern (&lt;a href=&quot;http://www.tomshardware.com/2006/09/14/50th_anniversary_hard_drive/&quot;&gt;Quo Vadis, Hard Drive? The 50th Anniversary of the HDD&lt;/a&gt;).  In it, we catch a glimpse of some of the ideas being explored for increasing hard drive density, speed, and reliability, among other things.  Parsing through the acronym alphabet soup and surface technicality, one thing remains clear: hard drive manufacturers are not running out of ideas for increasing storage capacity, so we can expect to continue seeing dramatic leaps in storage capacities.&lt;/p&gt;
&lt;!--break--&gt;&lt;p&gt;Let&#039;s look at this in terms of what we are storing.  Most people around my age can remember how any increase in storage capacity seemed to be followed immediately by increases in program size -- developers used the extra space to put more functionality and features into their programs.  The storage capacity gap has long since dwarfed the needs of applications and operating systems, but users have since taken the lead.  First, users struggled with storing images and audio while developers introduced new compression schemes to accommodate them.  Later, video reached the masses and started filling hard drives, even in greatly compressed states.  &lt;/p&gt;
&lt;p&gt;But the gap keeps expanding as hard drives increase in size.  Text documents are not getting any bigger, even though the applications that create them keep bloating.  Moving from binary to XML representations has not significantly increased word processing file sizes.  The same holds for images and audio -- the bits needed to losslessly represent a 1200 dpi scan have not increased, nor have those needed for a 48 kHz digital audio file.  In fact, the bits needed to losslessly represent audio have actually &lt;em&gt;decreased&lt;/em&gt; with file formats such as &lt;a href=&quot;http://flac.sourceforge.net/&quot;&gt;FLAC&lt;/a&gt;. On the other hand, video storage requirements are still expanding.  DV quality is now giving way to HD, and I would expect a few more developments before we reach a state where more bits do not yield better quality (for most typical applications).  &lt;/p&gt;
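&lt;p&gt;To put rough numbers on that claim, here is a quick back-of-the-envelope sketch in Python of what a 48 kHz stereo recording demands in raw form; the FLAC compression ratio below is an assumed typical figure for illustration, not a measurement:&lt;/p&gt;
&lt;pre&gt;
# Raw PCM storage for one hour of 48 kHz, 16-bit, stereo audio
sample_rate = 48000      # samples per second
bit_depth = 16           # bits per sample
channels = 2
seconds = 60 * 60        # one hour

raw_bytes = sample_rate * (bit_depth // 8) * channels * seconds
print(raw_bytes / 2**20)               # ~659 MB per hour, uncompressed

# FLAC is lossless, yet typically trims 40-50% of the size;
# 0.55 is an assumed ratio for illustration, not a measured one
flac_ratio = 0.55
print(raw_bytes * flac_ratio / 2**20)  # ~363 MB per hour
&lt;/pre&gt;
&lt;p&gt;The raw figure is fixed by the sampling parameters, which is the point: a recording made today at 48 kHz needs no more bits than one made a decade ago, while drive capacities have grown by orders of magnitude.&lt;/p&gt;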
&lt;p&gt;The only thing left to close the gap for these types of digital media is to have a lot of them.  Even then, I expect that the total unused storage, taken across all systems, will increase as dramatically as the storage devices themselves.  This can only mean good things for those who want to &quot;save it all.&quot;&lt;/p&gt;
</description>
 <comments>http://thomas.kiehnefamily.us/the_future_of_the_hard_drive#comments</comments>
 <category domain="http://thomas.kiehnefamily.us/blog_topics/storage">Storage</category>
 <category domain="http://thomas.kiehnefamily.us/blog_topics/technology">Technology</category>
 <pubDate>Thu, 12 Oct 2006 21:34:02 +0000</pubDate>
 <dc:creator>tkiehne</dc:creator>
 <guid isPermaLink="false">35 at http://thomas.kiehnefamily.us</guid>
</item>
<item>
 <title>Audio Encoding Project Resumes (or, a funny thing happened on the way to 300 GB)</title>
 <link>http://thomas.kiehnefamily.us/audio_encoding_project_resumes</link>
 <description>&lt;p&gt;It&#039;s been a while (almost 8 months, to be exact) since I have updated this forum on the status of my &lt;a href=&quot;/encoding_project_on_genre_description&quot; rel=&quot;nofollow&quot;&gt;audio encoding project&lt;/a&gt;.  I could cite the usual life delays and an unusually busy Summer as excuses, but there is more to it.  &lt;/p&gt;
&lt;p&gt;So, a funny thing happened on my way to 300 GB...&lt;/p&gt;
&lt;p&gt;Not long after my &lt;a href=&quot;/encoding_project_on_genre_description&quot; rel=&quot;nofollow&quot;&gt;last update&lt;/a&gt;, steady encoding progress brought me to about 240 GB of encoded music.  As far as CDs go, not much is left to encode -- perhaps 100 CDs out of the originally estimated 800 -- and I have mostly caught up in creating Ogg Vorbis reference copies.  As I worked my way toward filling my 300 GB external drive, however, I began having strange pangs of trepidation centered on the thought: what happens if I lose this drive?  Knowing full well that the roughly 240 GB of data represented a significant investment of time and effort, and knowing equally well the fallibility of technology and the risk of loss inherent in having only one copy of, well, anything, I became reluctant to continue encoding until some of these risks could be mitigated.  I cannot say that this trepidation is anything near as harrowing as what must be felt by an archivist handling rare, unique manuscripts -- I have the original objects to re-encode from, and most of them are not unique -- but through my meager taste of risk I certainly feel for those who work in truly risky situations.&lt;/p&gt;
&lt;p&gt;Since halting progress, and having finished the aforementioned busy Summer, I have come into possession of a network attached storage (NAS) server, specifically, a &lt;a href=&quot;http://www.buffalotech.com/products/product-detail.php?productid=133&amp;amp;categoryid=25&quot; rel=&quot;nofollow&quot;&gt;1 Terabyte Buffalo TeraStation&lt;/a&gt;.  The prices have recently dropped on these units in the wake of the newer 2 TB versions and, likely, pressure from a spate of competing 1 TB boxes.  For the benefit of those who didn&#039;t just click the link, the 1TB model contains four 250 GB hard drives and is capable of a variety of RAID configurations and storage capacities.  I opted for the relative safety of a 750 GB &lt;a href=&quot;http://en.wikipedia.org/wiki/RAID#RAID_5&quot; rel=&quot;nofollow&quot;&gt;RAID 5&lt;/a&gt; configuration which, though not absolutely fail-safe, does protect against a single drive failure and quite effectively allays my trepidation over continuing the project.  &lt;/p&gt;
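&lt;p&gt;For anyone wondering where the 750 GB figure comes from, the RAID 5 arithmetic is simple enough to sketch in a few lines of Python (a generic illustration of the scheme, not anything specific to the TeraStation&#039;s firmware):&lt;/p&gt;
&lt;pre&gt;
# RAID 5 stripes data across n drives and devotes one drive&#039;s
# worth of space to parity, so usable capacity is (n - 1) * size.
drives = 4
drive_gb = 250

usable_gb = (drives - 1) * drive_gb
print(usable_gb)   # 750 -- matches the TeraStation configuration

# The parity lets any single drive fail without data loss; a second
# failure before the rebuild finishes would still take the array down.
&lt;/pre&gt;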
&lt;p&gt;I&#039;ve since copied the entire contents of the 300 GB external drive to the TeraStation in preparation for resuming the encoding process, unencumbered by worry.  More to come.&lt;/p&gt;
</description>
 <comments>http://thomas.kiehnefamily.us/audio_encoding_project_resumes#comments</comments>
 <category domain="http://thomas.kiehnefamily.us/blog_topics/audio_encoding_project">Audio Encoding Project</category>
 <category domain="http://thomas.kiehnefamily.us/blog_topics/digital_archives">Digital Archives</category>
 <category domain="http://thomas.kiehnefamily.us/blog_topics/storage">Storage</category>
 <pubDate>Tue, 26 Sep 2006 06:41:22 +0000</pubDate>
 <dc:creator>tkiehne</dc:creator>
 <guid isPermaLink="false">34 at http://thomas.kiehnefamily.us</guid>
</item>
</channel>
</rss>
