Ruminations on Generating Project Metadata

Although I am still debugging the CD ripping problems I have been having, I have enough of a music corpus to begin thinking about second-stage metadata generation. I also have a corpus of 1700+ digital photos that I can begin thinking about describing.

At this point, I have a small amount of metadata in the form of ID3-style tags embedded in the music files, playlist files describing relationships between music files, and possibly EXIF or other digital camera data embedded in the images. There are also latent attributes that can be extracted with the proper tools: color depth, resolution, and color profile for the images; compression profile, playing time, and file size for the music. The music has so far been described using metadata extracted from the CDDB, which may not prove accurate in some cases (I have a rant about genre description coming up, so stay tuned). All of this metadata must be proofed before it is extended into separable metadata objects, and quite a bit more must be added, especially in describing the contents of the photos and indexing them by subject.
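To give a feel for how little it takes to pull out some of this embedded data, here is a minimal Python sketch that reads an ID3v1 tag (the fixed 128-byte block at the end of an MP3 file) along with the file size. It is only an illustration: the function name read_id3v1 is my own invention, and it ignores ID3v2 tags, which live at the start of the file and would need a proper library.

```python
import os


def read_id3v1(path):
    """Return ID3v1 fields plus file size; only the size if no tag is found."""
    size = os.path.getsize(path)
    info = {"filesize_bytes": size}
    if size < 128:
        return info  # too small to hold an ID3v1 tag

    # ID3v1 occupies the last 128 bytes and starts with the marker "TAG".
    with open(path, "rb") as fh:
        fh.seek(-128, os.SEEK_END)
        block = fh.read(128)
    if block[:3] != b"TAG":
        return info

    def text(raw):
        # Fields are fixed width, padded with NULs or spaces.
        return raw.split(b"\x00", 1)[0].decode("latin-1").strip()

    info.update(
        title=text(block[3:33]),
        artist=text(block[33:63]),
        album=text(block[63:93]),
        year=text(block[93:97]),
        genre_code=block[127],  # numeric index into the ID3v1 genre list
    )
    return info
```

Calling read_id3v1 on a tagged file returns a plain dictionary, which is a convenient shape for proofing the values before they go anywhere near a formal schema.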

Knowing the types of information that are currently available and having an eye towards the long-term requirements of the collections, I can begin formulating a plan for metadata representation. One popular metadata standard is Dublin Core -- a simple, straightforward descriptive scheme. Unfortunately, DC is quite weak when it comes to encoding detailed technical or structural data, both of which are important for preservation. In all, DC is something of a cop-out to me -- something to use only in a situation where time is of the essence and description is of primary (sole?) importance.

METS, on the other hand, is a highly extensible container framework that can accommodate many other schemas. Having researched METS in the past, I know that the structural portions of the METS schema are particularly attractive for the projects I am working on. Many extensions and versions of METS already exist to handle a wide variety of situations, including photographs and sound recordings. Some of the possible extensions that I may use or derive from are:

For images:
UCB/Model Imaged Object Profile: These are probably overkill for born-digital images but could serve as a starting point.
If I am able to automatically extract more technical data from the image itself (EXIF, etc.), the MIX extension (developed in partnership with the NISO Technical Metadata for Digital Still Images Standards Committee) could be of use here.

For music:
Library of Congress profile for Audio CDs.
MODS for description.

Unfortunately, METS is a fairly complex scheme. Having created a schema by hand, I know that the process of manual encoding is time-consuming and error-prone. I know that much of the existing metadata described above can be harvested automatically and placed into the correct areas of the schema – IF a tool exists for the purpose and IF there is a mapping of external data to the appropriate place(s) in the schema. Should these conditions be met, however, all that would be needed of me is to review the automatically generated data, add unique descriptions (or choose them from previously used values), and let the tool do the dirty work of creating the XML and saving the data to disk.
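The mapping step described above can be sketched in a few lines of Python. This is a rough illustration under my own assumptions, not a finished tool: the CROSSWALK table, the build_mets function, and the identifiers "obj1" and "dmd1" are all hypothetical, only three fields are mapped, and the output has not been validated against the METS or MODS schemas.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("mods", MODS_NS)

# Hypothetical crosswalk: harvested key -> nested MODS element names.
CROSSWALK = {
    "title": ("titleInfo", "title"),
    "artist": ("name", "namePart"),
    "genre": ("genre",),
}


def build_mets(record, object_id="obj1"):
    """Wrap harvested values as MODS inside a minimal METS skeleton."""
    mets = ET.Element(f"{{{METS_NS}}}mets", {"OBJID": object_id})
    dmd = ET.SubElement(mets, f"{{{METS_NS}}}dmdSec", {"ID": "dmd1"})
    wrap = ET.SubElement(dmd, f"{{{METS_NS}}}mdWrap", {"MDTYPE": "MODS"})
    xml_data = ET.SubElement(wrap, f"{{{METS_NS}}}xmlData")
    mods = ET.SubElement(xml_data, f"{{{MODS_NS}}}mods")

    for key, path in CROSSWALK.items():
        value = record.get(key)
        if not value:
            continue  # leave gaps for the human reviewer to fill
        node = mods
        for tag in path:
            node = ET.SubElement(node, f"{{{MODS_NS}}}{tag}")
        node.text = value

    # A single top-level div that points back at the descriptive section.
    struct_map = ET.SubElement(mets, f"{{{METS_NS}}}structMap")
    ET.SubElement(struct_map, f"{{{METS_NS}}}div", {"DMDID": "dmd1"})
    return mets


if __name__ == "__main__":
    doc = build_mets({"title": "Example Track", "artist": "Example Artist"})
    print(ET.tostring(doc, encoding="unicode"))
```

Keeping the crosswalk as data rather than code is the point: reviewing or extending the mapping then means editing a table, not rewriting the XML generation.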

From what I know, however, this process is the biggest roadblock to thorough description of digital objects for all types of projects. I have come across many "roll-your-own" systems used by various institutions and academic groups that I know from my Web development experience would be hard pressed to extend beyond the specifics of the environment for which they were produced. In other words, the tools created for these projects are not portable, probably not terribly scalable or extensible, and thus, of little practical use to others. This reminds me of the Web applications development environment some 6-7 years ago, where every new e-commerce or content delivery idea generated a new set of code, standards, and procedures. May I dream of an imminent development of standard frameworks for metadata generation tools in the spirit of current Web application frameworks?

For the time being, I might have to do the deed and roll-my-own as well. So far, I have found a METS Java toolkit that shows promise for developing a custom tool. With any luck, what I create might be something I can release into the wilds for other intrepid researchers to use (and critique).

For the digital images, I might be able to extend existing image management software. For some of my images, I use a framework called Gallery for sharing on the Web. So far, I have not come upon a suitable metadata extension for this application, but I do not see why I can't create one. There are already extensions used by Gallery for image manipulation that could be used for generating some technical metadata. Additionally, description is as easy as a Web form, and structure can be inferred from the application itself (the hierarchy within photo albums). Once a means of encoding metadata is established, I can envision adding an OAI service as well. The main concern I have with this approach (not so much for my sake, but for the greater Gallery user base) is authenticity. In other words, with the image manipulation capabilities of the application, users may be led to believe that they are describing and making available an original object, when in fact they are working with a lesser-quality copy. Perhaps I can revisit this in more detail later – it is not enough to dissuade me from the notion that such extensions to this widely used application are a good idea all around.
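As a small illustration of inferring structure from the album hierarchy, the following Python sketch turns a nested album description into a METS structMap of nested div elements. The input dictionary layout ("label", "photos", "albums") and the function names are my own assumptions and have nothing to do with how Gallery actually stores its data; a real structMap would also link each photo to a file entry through fptr elements, which I leave out here.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS_NS)


def album_to_div(album, parent):
    """Add one div for the album, then divs for its photos and sub-albums."""
    div = ET.SubElement(
        parent, f"{{{METS_NS}}}div", {"TYPE": "album", "LABEL": album["label"]}
    )
    for photo in album.get("photos", []):
        ET.SubElement(div, f"{{{METS_NS}}}div", {"TYPE": "photo", "LABEL": photo})
    for child in album.get("albums", []):
        album_to_div(child, div)  # recurse to mirror the nesting
    return div


def build_struct_map(root_album):
    """Return a logical structMap mirroring the album tree."""
    struct_map = ET.Element(f"{{{METS_NS}}}structMap", {"TYPE": "logical"})
    album_to_div(root_album, struct_map)
    return struct_map


if __name__ == "__main__":
    albums = {
        "label": "Photos",
        "photos": ["cover.jpg"],
        "albums": [{"label": "Trip", "photos": ["one.jpg", "two.jpg"]}],
    }
    print(ET.tostring(build_struct_map(albums), encoding="unicode"))
```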