infoSpace - Metadata

Metadata Quality Control

tkiehne — Wed, 17 Jan 2007 05:20:54 +0000

As I near the end of the first phase of my audio encoding project I feel the need to share some of the metadata quality control observations that I have collected.

Although ripping my CDs to digital media has been time consuming, it has not been nearly as laborious as checking and correcting the metadata that was automatically gathered during the process. FreeDB as an automatic metadata gathering service has been very helpful, but as I reviewed the corpus of encoded audio, I found many disturbing errors: misspellings, typos, missing articles, missing fields, omission or misrepresentation of international characters, and, of course, the usual discrepancies in case handling, title formatting, and normalized forms.

To find and correct these errors, I relied on one or more separate discography databases and integrated metadata management software. Discogs.com has been an invaluable resource for confirming apparent discrepancies, as well as helping me find and correct release dates, many of which were not found in the FreeDB data. Discogs does some rather stringent data normalization, often deviating from what is present on the actual releases, so that they can eliminate redundancy and excessive cross-linking between records. This issue has been the source of heated debate on the site's forums, as well it ought to be. Having submitted information to Discogs for many of my rarer CDs helped me to understand the compromises that they have made in their system so that I could understand why deviations from the original objects occurred and make an informed decision as to whether to apply changes. In the absence of information from Discogs, label, band, and fan sites have also come in handy for verifying information.

The most important tool I've used is an integrated metadata management program called Tag & Rename. This particular program merges a Windows Explorer-like interface for viewing directories and files with an embedded metadata viewer that is capable of extracting and manipulating all of the major audio metadata formats (ID3, Ogg Vorbis, AAC, APE, etc.). The software provides a middle ground between the file system and the content, which greatly increases the speed at which I can update embedded metadata.

In fact, there seem to be many such tools for all sorts of digital object types. Another one I have come across is Exifer, a program that allows editing of embedded EXIF information in digital photos. I expect this program will come in handy when I begin processing my seven years' worth of digital images.

Between these two programs, I have come up with a general list of essential characteristics for embedded metadata editors:

Filesystem integration: Functions such as copy, move, delete, rename, create directories, and so on. This feature ensures that you can stay within the metadata editing environment, which saves time wasted in program switching. One thing that has been missing in my experience, and which could be useful, is having multiple filesystem views so that you can jump between directories or volumes without leaving your working directory. This idea is inspired by my preferred code editor, HomeSite.
Metadata listed in directory views: Selected metadata should be shown as part of the file list to allow a quick appraisal of the contents of the embedded metadata. Like a file list in the operating system, the list should be sortable and/or filterable by metadata field.
Ability to manipulate many different metadata standards: The program should be able to manipulate all applicable formats for the target object type (image, sound, text, etc.). Additionally, an ideal program would be extensible such that new metadata and file types could be added as needed.
Automated or batch editing: Manual, object-by-object editing is an expected feature, but the greatest time saver is the ability to modify entire directories or lists at once. Additionally, the ability to convert from one metadata format to another applicable format (e.g.: id3v1 to id3v2) is essential. Copying tags directly from one field to another in the same file, swapping tags, and copying tags from one file to another have also been essential features. Finally, extraction of metadata from the filesystem, such as regular expression or pattern conversion of filenames into metadata, and vice-versa, has also come in handy.
Ability to create or access authority files: Tag & Rename allows me to create a list of genres for music files and exposes that list in edit dialogs, although it does not appear to have an option to force me to use only this list. In the absence of pre-coordinate lists, input masks should be available, especially for date and time fields. An added bonus for more detailed metadata formats could be accessing authoritative Web services for standard entries, such as an LCSH service for subjects, though I am not certain that such things yet exist.
Aggregate and summary views: This feature does not exist in Tag & Rename, but having brought all of my encoded music into an access system I have found the feature sorely lacking. Essentially, there should be a way to see the total number of objects marked with specific data, for example: grouped by genre. By browsing a list of all genres returned by my access system I was able to see outliers or variants that were present (e.g.: Synth Pop vs. Synthpop) and find them so that I could go back to the metadata editor and normalize as needed. It would be ideal to have this capability within the editor; although simply conforming to an authority list of genres would have prevented this particular problem, there may be situations where a strict authority list is not desirable.
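
The aggregate view described in the last item can be approximated outside the editor. What follows is a minimal sketch, not a feature of Tag & Rename: it assumes the genre values have already been read out of the files into a plain list, and both the sample values and the name `genre_variants` are hypothetical. It groups spellings that differ only in case, spacing, or punctuation.

```python
from collections import Counter, defaultdict


def genre_variants(genres):
    """Group genre strings that collapse to the same normalized key."""
    counts = Counter(genres)
    groups = defaultdict(dict)
    for genre, n in counts.items():
        # Normalize: lowercase and keep only letters and digits
        key = "".join(ch for ch in genre.lower() if ch.isalnum())
        groups[key][genre] = n
    # Report only keys that have more than one distinct spelling
    return {key: spellings for key, spellings in groups.items() if len(spellings) > 1}


# Hypothetical genre values, one per file
sample = ["Synth Pop", "Synthpop", "Ambient", "Synthpop", "Industrial"]
for key, spellings in genre_variants(sample).items():
    print(key, spellings)
```

For the sample list, "Synth Pop" and "Synthpop" fall into one group with their file counts, which is enough to decide which spelling to normalize to.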

This is by no means an exhaustive list, and is perhaps too general to fit all object types, but the basic concept is clear. As a rule, the less typing one does, the more accurate the metadata, but as I have experienced, even external databases have errors. Over the course of thousands of files or records, small error percentages accrue quickly. I can only imagine the headaches that would have arisen were my project to take place in a larger organization, with many people participating in the encoding and preservation process, let alone with a much larger corpus. It is clear that quality control of metadata, whether hand-entered or not, is crucial. These software tools

What's in a Creation Date?

tkiehne — Fri, 28 Jul 2006 07:59:50 +0000

There is a certain perception that often accompanies digital objects and, more broadly, computer systems as a whole. This sort of perception manifests itself when, for example, we hear about how massively compressed digital MP3 files are considered to be "perfect" quality audio or in similar myths concerning the infallibility of all things digital. These perceptions are based on incomplete or inaccurate assumptions about how software, operating systems, or file systems function. My favorite way of stating this is that computers are only as smart as those who designed them -- if to err is human, then the same goes for our electronic creations.

When making the transition from paper to digital records, these assumptions are likely to appear in unexpected places. While working on the Joyce collection, we ran headlong into one of these assumptions, made a note of it, then moved on. But I promised that I would look closer at the issue at a later time... so here I go.

Anyone with a modicum of computer literacy is familiar with managing digital files through some means -- be it command line or GUI -- and has been exposed to the fact that the computer's file system(s) maintain not only filenames, but various dates such as creation date, modification date, and access date. At first blush, this seems like a godsend for archivists struggling to put concrete attributes on virtual objects. Certainly these dates mean what they say -- the creation date is the date it was created, etc. -- and these attributes follow the digital object wherever it goes, correct? Unfortunately, a little investigation sheds some doubt on the subject.

I devised a simple set of experiments to confirm or deny the assumption that all filesystem date metadata is the same and means what we assume it to mean. I selected the three major operating systems in use today, Windows 2000/NT, Macintosh OS X, and Linux, and conducted a variation of the following sequence on each:

Create an arbitrary text file in an arbitrary location on a local hard drive (volume)
Modify the text of the file and save it to the same location
Move the file from one directory to another on the same local volume
Make a copy of the file to another directory on the same volume
Copy the file to a separate volume

After each step, I gathered date information from the filesystem (e.g.: creation and modification dates) and generated an MD5 hash to confirm whether the contents of the file stayed the same or changed. I will now discuss the details of this experiment for each operating system.
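
The per-step measurement can also be scripted. The sketch below is an illustration only: the experiment itself used the tools named in each section, and the function name `snapshot` is hypothetical. It relies on Python's standard `os.stat` and `hashlib.md5`. Note that `st_birthtime` is only available on some platforms, so the code falls back to `None` where the operating system does not expose a creation date.

```python
import hashlib
import os
import tempfile
from datetime import datetime


def snapshot(path):
    """Return the filesystem dates and MD5 hash for one file."""
    st = os.stat(path)
    with open(path, "rb") as fh:
        digest = hashlib.md5(fh.read()).hexdigest()
    # Creation ("birth") time is not exposed on every platform
    birth = getattr(st, "st_birthtime", None)
    return {
        "created": datetime.fromtimestamp(birth) if birth is not None else None,
        "modified": datetime.fromtimestamp(st.st_mtime),
        # On Unix st_ctime is the status-change time, not a creation time
        "changed": datetime.fromtimestamp(st.st_ctime),
        "md5": digest,
    }


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "experiment.txt")
    open(path, "w").close()       # step 1: create an empty file
    print(snapshot(path)["md5"])  # d41d8cd98f00b204e9800998ecf8427e
```

The hash printed for the empty file is the MD5 of zero bytes of input, the same value that appears in the "Created" rows of Tables 1 and 3.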

Windows 2000/NT

Table 1: Windows 2000 (NTFS)
Action	Creation Date	Modified Date	MD5 Hash
Created	07/27/2006 19:48:43	07/27/2006 19:48:43	d41d8cd98f00b204e9800998ecf8427e
Modified and Saved	07/27/2006 19:48:43	07/27/2006 19:50:05	0aa9bd7d122205a12e939f14d6946c14
Moved from one directory to another	07/27/2006 19:48:43	07/27/2006 19:50:05	0aa9bd7d122205a12e939f14d6946c14
Copied to another directory on same NTFS volume	07/27/2006 19:52:14	07/27/2006 19:50:05	0aa9bd7d122205a12e939f14d6946c14
Copied to another NTFS volume	07/27/2006 19:53:25	07/27/2006 19:50:05	0aa9bd7d122205a12e939f14d6946c14

Table 1 above shows a tabulated view of the experiment's results for a Windows computer using the NTFS file system. File dates were collected from the Windows file properties dialog, while MD5 hashes were generated using a freeware program called HashCalc. The first two steps passed as predicted, with the new file correctly showing a change in modification date and MD5 hash. The third step shows that Windows considers a moved file on the same volume to be the same before and after the move -- again, this makes sense.

Upon making a new copy, however, common sense starts to break down. The modification date stays the same as before the copy -- demonstrating, as the MD5 hash confirmed, that no changes have been applied -- but the creation date has changed to the time of the copy operation. This simultaneously makes sense and is confusing: we now have a new copy of the file, with its own creation date, but now the modification date precedes the creation date, which flies in the face of common sense. How can a file have been modified before it was created? But it does not end there. Upon copying across hard drives, the creation date is again modified, once again bringing up the creation/modification dichotomy.
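
The paradox is at least easy to detect mechanically. Here is a minimal sketch using values taken from Table 1; the function name `modified_before_created` is hypothetical, and the date format string simply matches how the tables are written.

```python
from datetime import datetime

FMT = "%m/%d/%Y %H:%M:%S"  # date layout used in the tables


def modified_before_created(created, modified):
    """True when a file claims to have been modified before it was created."""
    return datetime.strptime(modified, FMT) < datetime.strptime(created, FMT)


# "Copied to another directory on same NTFS volume" row of Table 1
print(modified_before_created("07/27/2006 19:52:14", "07/27/2006 19:50:05"))  # True
# "Modified and Saved" row of Table 1: no anomaly
print(modified_before_created("07/27/2006 19:48:43", "07/27/2006 19:50:05"))  # False
```

On an NTFS volume, a hit from a check like this suggests that the file is a copy and that its creation date records the copy operation rather than the origin of the content.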

Macintosh OS X

Table 2: MacOS X (HFS)
Action	Creation Date	Modified Date	MD5 Hash
Created	07/27/2006 20:04:00	07/27/2006 20:04:00	a53165315d1e86c5739d34e1243f5f4d
Modified and Saved	07/27/2006 20:04:00	07/27/2006 20:07:00	cb697c6c073f85c43e2dfb100f5b725e
Moved from one directory to another	07/27/2006 20:04:00	07/27/2006 20:07:00	cb697c6c073f85c43e2dfb100f5b725e
Copied to another directory on same HFS volume	07/27/2006 20:04:00	07/27/2006 20:07:00	cb697c6c073f85c43e2dfb100f5b725e
Copied to another HFS volume	07/27/2006 20:04:00	07/27/2006 20:07:00	cb697c6c073f85c43e2dfb100f5b725e
Copied to FAT (MS-DOS) volume (OS X view)	12/31/1903 16:00:00	07/27/2006 20:18:00	cb697c6c073f85c43e2dfb100f5b725e
Copied to FAT (MS-DOS) volume (Windows view)	07/27/2006 20:18:59	07/27/2006 20:18:58	cb697c6c073f85c43e2dfb100f5b725e
Copied back to HFS volume from FAT volume	12/31/1903 16:00:00	07/27/2006 20:18:00	cb697c6c073f85c43e2dfb100f5b725e

Table 2 above shows a tabulated view of the experiment's results for a Macintosh OS X computer using the HFS+ file system. File dates were gathered using the Finder's Get Info command and MD5 hashes were computed using the built-in command line program, "md5". (Note that the Get Info dialog does not show seconds in dates.) Here we see behavior that conforms to our assumptions -- the creation date follows the file throughout its movements on the machine and the modification date is only changed when an actual modification is made.

With a little extra knowledge of how Macintosh file systems work, however, it is understood that each file is actually a pair of files: the resource fork, which holds metadata about the file, and a data fork, which holds the content of the file. Many file systems do not respect this dyadic system, which can create problems when Macintosh files are exchanged with other operating systems or through network transfer. To this end, I conducted a few more steps that involved transferring the file to a non-HFS+ volume (in the form of a FAT formatted USB flash drive) and viewing the transferred file in both Macintosh and Windows environments. As you can see, both dates were significantly affected. The transfer was considered to be a modification, thus changing the modification date, but then the creation date became skewed. Macintosh could not recognize the creation date, instead displaying the Macintosh epoch, while Windows interpreted the creation date as being the same (but off by one second, for some odd reason) as the time of the copy operation. Upon copying the file from the flash drive back to the HFS+ volume, we see that the dates were preserved as changed during the transfer to the flash drive.
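
Both oddities in Table 2 have a plausible arithmetic explanation, sketched below under stated assumptions. The Macintosh epoch is midnight on January 1, 1904; viewed at a UTC-8 offset it displays as the 1903 date in the table. The offset is an assumption, since the machine's time zone setting was not recorded. The one-second discrepancy in the Windows view may come from FAT storing modification times in two-second units while keeping finer resolution for creation times; that is a hypothesis the experiment did not test. The names `mac_epoch_local` and `fat_mtime` are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Zero in a Macintosh date field means midnight, January 1, 1904
MAC_EPOCH_UTC = datetime(1904, 1, 1, tzinfo=timezone.utc)


def mac_epoch_local(utc_offset_hours):
    """Show the Macintosh epoch as a clock at the given UTC offset would."""
    local = timezone(timedelta(hours=utc_offset_hours))
    return MAC_EPOCH_UTC.astimezone(local)


def fat_mtime(ts):
    """Round down to an even second, the granularity of FAT modification times."""
    return ts.replace(second=ts.second - ts.second % 2, microsecond=0)


# Assumed offset of UTC-8 reproduces the creation date shown in the OS X view
print(mac_epoch_local(-8).strftime("%m/%d/%Y %H:%M:%S"))  # 12/31/1903 16:00:00
# A copy made at 20:18:59 would get a modification time of 20:18:58
print(fat_mtime(datetime(2006, 7, 27, 20, 18, 59)).strftime("%H:%M:%S"))  # 20:18:58
```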

Linux

Table 3: Linux (ext2)
Action	Change Date*	Modified Date	MD5 Hash
Created	07/28/2006 00:12:28	07/28/2006 00:12:28	d41d8cd98f00b204e9800998ecf8427e
Modified and Saved	07/28/2006 00:13:43	07/28/2006 00:13:43	93ad68660a99d36a665a553672a8148d
Moved from one directory to another	07/28/2006 00:14:38	07/28/2006 00:13:43	93ad68660a99d36a665a553672a8148d
Copied to another directory on same ext2 volume	07/28/2006 00:15:51	07/28/2006 00:15:51	93ad68660a99d36a665a553672a8148d

Table 3 above shows a tabulated view of the results for a Red Hat Linux computer using the ext2 file system. File dates were gathered using the "stat" command and MD5 hashes computed using the "md5sum" command. Already there is one glaring difference between the Linux results and the previous two, that is, the non-existence of a creation date. Instead, I have shown the Changed date (status change or ctime) as reported by stat. It was difficult to determine the reason for this omission, especially since some references incorrectly referred to the Changed date as the creation date (e.g.: Poirier, 2001), but I found an email discussion thread that helped to clarify some of the reasons. In short, the creators of the ext2 filesystem, and Linux in general, deemed the concept of a creation date as being too nebulous to model, so they omitted it. The Windows experiment demonstrates some of the potential issues behind the concept of a digital creation date and lends some legitimacy to the decision to omit, even if it does seem a bit unsettling.

Continuing with the experiment anyway, we can see that the modification dates and hashes behave as they did under Windows when the file was modified and moved. Copying the file, however, altered the modification date, which differs from the behavior of the other two operating systems. Additionally, we see that the Changed date is updated with each action, regardless of the effect on the content of the file. This is by design: ctime records any change to the file's inode, including renames and permission or ownership changes, not just writes to its content. Commands that merely read a file, such as grep, touch the separate access date (atime) rather than ctime. Either way, the Changed date says nothing about when a file originated and lends very little help to archival processing.
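
The changed modification date on the Linux copy may reflect the copy tool's defaults rather than anything inherent to ext2. The experiment does not record which command or options were used, so this is an assumption. The same distinction exists in Python's standard library, where `shutil.copy` leaves the copy with a fresh modification time and `shutil.copy2` attempts to carry the original's timestamps across (roughly what `cp -p` does). The helper name `copy_mtimes` is hypothetical.

```python
import os
import shutil
import tempfile


def copy_mtimes(preserve):
    """Copy a file and return (source mtime, copy mtime) in whole seconds."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "original.txt")
        dst = os.path.join(tmp, "copy.txt")
        with open(src, "w") as fh:
            fh.write("sample text\n")
        # Backdate the source to an arbitrary moment in July 2006
        os.utime(src, (1154000000, 1154000000))
        if preserve:
            shutil.copy2(src, dst)  # content plus timestamps
        else:
            shutil.copy(src, dst)   # content and permissions only
        return int(os.stat(src).st_mtime), int(os.stat(dst).st_mtime)


print(copy_mtimes(preserve=False))  # copy's mtime is the time of the copy
print(copy_mtimes(preserve=True))   # copy's mtime matches the source
```

Whichever behavior a given tool defaults to, the point for archival work is the same: the modification date on a copy depends on how the copy was made.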

Analysis

Each of these experiments shows how the assumptions of the software makers dictate the behavior of what seem to be common sense concepts, thus threatening the validity of assumptions we make while using them. In the case of Windows, the assumption is that any copy operation creates a new file and is treated as a new object, but leaves behind a paradoxical situation where the modification date precedes creation. In Macintosh, every copy of a file, so long as it is made on a compatible volume, can be traced back to the original object by creation date -- in essence, every copy of a Macintosh file is simply a new version, not a new object. Linux, on the other hand, repairs the Windows dichotomy by bringing the modification date forward with each new object instance.

On the surface, we may want to proclaim that filesystem metadata cannot be trusted and debate the merits of ignoring it completely. This is understandable, but perhaps a bit hasty. It might be better to consider filesystem metadata as helpful to the extent that it has been properly maintained during the record's lifetime. Since creation and modification dates support authenticity, it only seems fitting that our treatment of their apparent flaws should derive from similar concepts. In other words, the lessons of this experiment should not only guide the handling of digital objects in a repository setting, but also the assessment of the reliability of filesystem metadata as generated in the originating environment. If the recordkeeping systems that generated the digital objects, including policies and documented procedures outside the systems, if any, can be assessed, then the metadata accompanying the objects may be salvageable. Without such knowledge, though, it is wise to treat any and all filesystem metadata with prejudice.

Even with a thorough knowledge of the originating environment, can we trust dates and times as the filesystem reports them? Certainly a to-the-second time should be taken with skepticism -- time zone settings, variations between computers, and clock drift ensure that the only way exact times can be compared is within the same system. But beyond that, dates may even prove fallible: unskilled users may neglect to set the system clock correctly, or miss a daylight saving shift. Further, power outages or system failures can have detrimental effects on the system clock and, in the case of Macintosh systems where the file metadata is stored as one of the two file parts, metadata may become corrupted just as normal data files can. All of this goes without mentioning date errors deliberately created by knowing users in order to deceive or conceal -- forgery is always a risk.

These problems should demonstrate that the skills of an archivist in determining the authenticity and reliability of records do not fade away in a digital environment, but that the means of performing these tasks change. An intuition and knowledge of the assumptions underlying the technology is key, as is a thorough understanding of the origin of the records -- the latter being a skill that archivists already possess. Hopefully this experiment will help to increase the skills of the former.

Audio Encoding Project: On Genre Description

tkiehne — Sun, 29 Jan 2006 03:30:27 +0000

First, a status update on the project. At this point, I have lost track of exactly how many discs I have encoded. This is probably because the ripping environment has been working virtually flawlessly since I finished troubleshooting, but a rough estimate puts me at around 200-250 discs encoded. Now, to move on to an issue that has been in the back of my mind for a while: genre description.

Specifically, I am finding myself increasingly annoyed by the lack of depth in genre description allowed in ID3-type metadata. To expound: most digital audio formats support some sort of embedded metadata, one of the most common being the ID3 tag block used by MP3s and FLAC. The ID3 specification allows for a single field to describe the genre of the object. Since the ID3 tag is embedded within each unitary object, this allows for record-level description. Unfortunately, I have been encoding a disc at a time and the software I am using only allows descriptive metadata to be defined at the disc level, which is then copied into each file upon encoding. This is fine for fields such as album name, release year, or artist, but is quite frustrating for its tendency to stereotype artists or releases as a whole. To make things worse, the centralized database of CD information, FreeDB, limits the genre field to a choice among only 11 -- and not a single one of them represents any type of electronic music. For a collection such as mine that is dominated by electronic and abstract styles, this limitation is unacceptable. Fortunately, I can override the FreeDB defaults for purposes of encoding.

Determining the most appropriate descriptive term for the genre of an object is a problem that is not at all new to descriptive cataloging. Any object can have different semantic uses and/or meanings depending upon the attitude and understanding of the describer and the user. Furthermore, using a pre-coordinate description precludes the notion that new understandings or uses for the content (hence, new genre descriptions) could become apparent at a later time. As a result, I have been selecting the most specific possible genre term that can help identify the musical genre within a fairly broad tolerance, avoiding overly obscure or transient terms. For some works, the best term is obvious, but compendiums of music with very little in common among songs or artists or eclectic compositions confound any attempt at detailed description without record-level control.

So, a combination of technological limitations and the theoretical limitations of description have conspired to limit the genre choices I may make while encoding. I can overcome the record level constraint by going back through my encoded collection with an ID3 tagger, but I am still limited to one, single term. This may suffice on some general level, but is highly unsatisfactory to me personally, and this is why...

Envision, if you will, a media player or other system that provides access to my corpus of encoded music. The system in question could access by artist, title or year with high recall. But, imagine that I want to use this system to generate playlists on-the-fly based on various content semantics, the content being the music itself. The single-term genre field will yield high recall for many types of music, but recall suffers for types of music that cross genres or are equally applicable to several at a time. For example, much of my music could be termed "ambient," implying slow to no beats or rhythm and a generally softer or quieter composition. Some ambient tracks, however, lend themselves well to more traditionally industrial genres, or downtempo/chill, or experimental -- all of which are genres that can stand in their own right apart from ambient. If we were to visualize this graphically, imagine a Venn diagram with all of these genres overlapping with ambient (and in some cases, a bit with each other). A single term is unable to capture this depth; thus, the recall of automatically generated playlists is limited.
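
The recall problem can be made concrete with a toy calculation. Everything in this sketch is hypothetical: the four tracks, their genre assignments, and the `recall` helper are invented for illustration. Each track carries the single term that would fit in the genre field alongside the fuller set of genres a listener might apply to it.

```python
def recall(tracks, query, field):
    """Fraction of the tracks relevant to `query` that a lookup on `field` finds."""
    relevant = [t for t in tracks if query in t["all_genres"]]
    found = [t for t in relevant if query in t[field]]
    return len(found) / len(relevant)


# Hypothetical tracks: "genre" holds the single stored term,
# "all_genres" holds every genre that reasonably applies
tracks = [
    {"title": "Track A", "genre": ["ambient"], "all_genres": ["ambient"]},
    {"title": "Track B", "genre": ["industrial"], "all_genres": ["industrial", "ambient"]},
    {"title": "Track C", "genre": ["downtempo"], "all_genres": ["downtempo", "ambient"]},
    {"title": "Track D", "genre": ["experimental"], "all_genres": ["experimental"]},
]

print(recall(tracks, "ambient", "genre"))       # one of three ambient tracks found
print(recall(tracks, "ambient", "all_genres"))  # 1.0
```

With a single stored term, a request for ambient music finds only one of the three tracks that qualify; with multiple terms per track, it finds all of them.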

Additionally, the genre description does little to capture two other semantic aspects of song content: tempo and (what I will refer to as) energy. Any song, no matter what genre, can be classified according to its rhythmic speed with very little disagreement among users. This additional level of description would enhance playlist generation by preventing the sudden acceleration or deceleration between tracks that is so prevalent among streaming Internet radio stations. A smooth, consistent feel can be projected across a whole playlist or between groups of songs, or algorithms could be devised to create a change in tempo across the playlist in a myriad of creative ways. One may also see a similar role for key and time signature -- all three of these could be determined automatically with great accuracy during the encoding process.
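
As a rough illustration of tempo-aware sequencing, the sketch below orders a playlist by beats per minute and measures the largest jump between neighboring tracks. The track data, the `bpm` field, and both function names are hypothetical; sorting is only the simplest of the many possible orderings mentioned above.

```python
def order_by_tempo(tracks):
    """Return the tracks sorted from slowest to fastest."""
    return sorted(tracks, key=lambda t: t["bpm"])


def largest_jump(playlist):
    """Largest BPM difference between two consecutive tracks."""
    pairs = zip(playlist, playlist[1:])
    return max((abs(b["bpm"] - a["bpm"]) for a, b in pairs), default=0)


# Hypothetical tracks in the order a shuffle might produce
shuffled = [
    {"title": "Track A", "bpm": 70},
    {"title": "Track B", "bpm": 140},
    {"title": "Track C", "bpm": 90},
    {"title": "Track D", "bpm": 120},
]

print(largest_jump(shuffled))                  # 70
print(largest_jump(order_by_tempo(shuffled)))  # 30
```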

Energy is a bit less definitive. What I mean by energy is a description that takes into account the emotional states that may be experienced by the listener. There is an inherent bias that is transferred by the describer, but I feel that, like the genre description, energy could be described consistently enough to be used in an advisory capacity. Genre has been used to encompass energy to some degree. For example, I have seen CDDB descriptions such as "dark techno" or "ambient industrial" that do the work of describing both the energy and technical style. Unfortunately, the result is that the genre term is devalued by ever more granular descriptions that can become lost over time as collective definitions of genre morph and change. As with tempo, energy can be used to prevent abrupt transitions in automatically generated playlists. Unlike tempo, which uses a linear scale, energy is much less definite and will require a thesaurus to determine appropriate transitions and relationships.

Regardless of the depth of description that is possible, it is clear that a single descriptive genre term is not sufficient. A simple modification to the ID3 specification could allow multiple genre terms to be stored at the record level, thus improving recall for access systems.

Ruminations on Generating Project Metadata

tkiehne — Fri, 02 Dec 2005 04:27:17 +0000

Although I am still debugging the CD ripping problems I have been having, I have enough of a music corpus to begin thinking about second stage metadata generation. Additionally, I already have a corpus of 1700+ digital photos that I can also begin thinking about describing.

At this point, I have a small amount of metadata in the form of ID3-style tags embedded in the music files, playlist files describing relationships between music files, and possibly EXIF or other digital camera data embedded in the images. There are also latent attributes such as color depth, resolution, and color profile for the images, and compression profile, playing time, and file size for the music, which can be extracted with the proper tools. The music has been described thus far using metadata extracted from the CDDB and may not prove to be accurate in some cases (I have a rant about genre description coming up, so stay tuned). All this metadata must be proofed before extending it into separable metadata objects, and quite a bit more must be added, especially in terms of describing the contents of and subject indexing the photos.

Knowing the types of information that are currently available and having an eye towards the long term requirements of the collections, I can begin formulating a plan for metadata representation. One popular metadata standard is Dublin Core -- a simple, straightforward descriptive scheme. Unfortunately, DC is quite weak when it comes to encoding detailed technical or structural data, both of which are important for preservation. In all, DC is something of a cop-out to me -- something to use only in a situation where time is of the essence and description is of primary (sole?) importance.

METS, on the other hand, is a highly extensible container framework that can accommodate many other schemas. Having researched METS in the past, I know that the structural portions of the METS schema are particularly attractive for the projects I am working on. Many extensions and versions of METS already exist to handle a wide variety of situations, including photographs and sound recordings. Some of the possible extensions that I may use or derive from are:

For images:
UCB/Model Imaged Object Profile: These are probably overkill for born-digital images but could serve as a starting point.
If I am able to automatically extract more technical data from the image itself (EXIF, etc), the MIX extension (developed in partnership with the NISO Technical Metadata for Digital Still Images Standards Committee) could be of use here.

For music:
Library of Congress profile for Audio CDs.
MODS for description.

Unfortunately, METS is a fairly complex scheme. Having created a schema by hand, I know that the process of manual encoding is time consuming and error-prone. I know that much of the existing metadata described above can automatically be harvested and placed into the correct areas of the schema -- IF a tool exists for the purpose and IF there is a mapping of external data to the appropriate place(s) in the schema. Should these conditions be met, however, all that would be needed of me is to review the automatically generated data, add unique descriptions (or choose them from previously used values), and let the tool do the dirty work of creating the XML and saving the data to disk.
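
To give a sense of what such a tool would do, here is a minimal sketch that maps harvested tag values into a METS descriptive metadata section wrapping Dublin Core elements. It uses only Python's standard `xml.etree.ElementTree`. The field mapping, the sample tag values, and the function name `tags_to_mets` are hypothetical, and the output has not been validated against the METS schema; a complete METS document would also need at least a structural map.

```python
import xml.etree.ElementTree as ET

METS = "http://www.loc.gov/METS/"
DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("mets", METS)
ET.register_namespace("dc", DC)

# Hypothetical mapping from ID3-style field names to Dublin Core elements
FIELD_MAP = {"title": "title", "artist": "creator", "year": "date", "genre": "subject"}


def tags_to_mets(tags, record_id="dmd1"):
    """Wrap harvested tag values in a METS dmdSec containing Dublin Core."""
    root = ET.Element(f"{{{METS}}}mets")
    dmd = ET.SubElement(root, f"{{{METS}}}dmdSec", ID=record_id)
    wrap = ET.SubElement(dmd, f"{{{METS}}}mdWrap", MDTYPE="DC")
    data = ET.SubElement(wrap, f"{{{METS}}}xmlData")
    for field, element in FIELD_MAP.items():
        if field in tags:  # skip fields the source file did not supply
            ET.SubElement(data, f"{{{DC}}}{element}").text = str(tags[field])
    return ET.tostring(root, encoding="unicode")


# Hypothetical values as they might be harvested from one file
print(tags_to_mets({"title": "Example Title", "artist": "Example Artist", "year": 2005}))
```

The reviewing step described above would then amount to checking and supplementing the dictionary before it is serialized, rather than editing XML by hand.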

From what I know, however, this process is the biggest roadblock to thorough description of digital objects for all types of projects. I have come across many "roll-your-own" systems used by various institutions and academic groups that I know from my Web development experience would be hard pressed to extend beyond the specifics of the environment for which they were produced. In other words, the tools created for these projects are not portable, probably not terribly scalable or extensible, and thus, of little practical use to others. This reminds me of the Web applications development environment some 6-7 years ago, where every new e-commerce or content delivery idea generated a new set of code, standards, and procedures. May I dream of an imminent development of standard frameworks for metadata generation tools in the spirit of current Web application frameworks?

For the time being, I might have to do the deed and roll-my-own as well. So far, I have found a METS Java toolkit that shows promise for developing a custom tool. With any luck, what I create might be something I can release into the wilds for other intrepid researchers to use (and critique).

For the digital images, I might be able to extend existing image management software. For some of my images, I use a framework called Gallery for sharing on the Web. So far, I have not come upon a suitable metadata extension for this application, but I do not see why I can't create one. There are already extensions used by Gallery for image manipulation that could be used for generating some technical metadata. Additionally, description is as easy as a Web form and structure can be inferred from the application itself (the hierarchy within photo albums). Once a means of encoding metadata is established, I can envision adding an OAI service as well. The main concern I have with this approach (not so much for my sake, but for the greater Gallery user base) is that of authenticity. In other words, with the image manipulation capabilities of the application, one may be led to believe that they are describing and making available an original object, when in fact they are working with a lesser-quality copy. Perhaps I can revisit this in more detail later -- it is not enough to dissuade me from the notion that such extensions to this widely used application are a good idea all around.