Metadata Quality Control

As I near the end of the first phase of my audio encoding project, I feel the need to share some of the metadata quality control observations I have collected.

Although ripping my CDs to digital media has been time consuming, it has not been nearly as laborious as checking and correcting the metadata that was automatically gathered during the process. FreeDB has been very helpful as an automatic metadata gathering service, but as I reviewed the corpus of encoded audio, I found many disturbing errors: misspellings, typos, missing articles, missing fields, omission or misrepresentation of international characters, and, of course, the usual discrepancies in case handling, title formatting, and normalized forms.

To find and correct these errors, I relied on separate discography databases and integrated metadata management software. Discogs.com has been an invaluable resource for confirming apparent discrepancies, as well as for finding and correcting release dates, many of which were absent from the FreeDB data. Discogs does some rather stringent data normalization, often deviating from what appears on the actual releases, in order to eliminate redundancy and excessive cross-linking between records. This has been the source of heated debate on the site's forums, as well it ought to be. Having submitted information to Discogs for many of my rarer CDs helped me understand the compromises they have made in their system, so that I could see why deviations from the original objects occurred and make an informed decision as to whether to apply changes. In the absence of information from Discogs, label, band, and fan sites have also come in handy for verifying information.

The most important tool I've used is an integrated metadata management program called Tag & Rename. This program merges a Windows Explorer-like interface for viewing directories and files with an embedded metadata viewer capable of extracting and manipulating all of the major audio metadata formats (ID3, Ogg Vorbis, AAC, APE, etc.). The software provides a middle ground between the file system and the content, which greatly increases the speed at which I can update embedded metadata.
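For anyone who would rather script this sort of work than use a GUI, the same job can be done programmatically. As a minimal sketch, and not a tool I used in this project, the Python library Mutagen can read and update tags across several of these formats through a common interface; the file path here is hypothetical:

    from mutagen import File

    # Mutagen's generic File() detects the container (MP3/ID3, Ogg
    # Vorbis, FLAC, etc.); easy=True exposes a simple dict-like view.
    audio = File("08 - Example Track.mp3", easy=True)  # hypothetical path

    # Tag values come back as lists of strings.
    print(audio.get("artist"), audio.get("title"))

    # Correct a misspelled artist name and write it back to the file.
    audio["artist"] = ["Corrected Artist Name"]
    audio.save()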

In fact, there seem to be many such tools for all sorts of digital object types. Another one I have come across is Exifer, a program that allows editing of the embedded EXIF information in digital photos. I expect this program will come in handy when I begin processing my seven years' worth of digital images.
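EXIF data can likewise be inspected in code. As a small sketch, Python's Pillow library can at least read these fields; the numeric keys are standard EXIF tag IDs, and the filename is a made-up example:

    from PIL import Image

    img = Image.open("IMG_0042.jpg")  # hypothetical file
    exif = img.getexif()

    # EXIF fields are keyed by numeric tag IDs defined in the standard.
    print(exif.get(271))  # Make: camera manufacturer
    print(exif.get(272))  # Model: camera model
    print(exif.get(306))  # DateTime: when the image was last modified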

Working with these two programs, I have come up with a general list of essential characteristics for embedded metadata editors:

  • Filesystem integration: Functions such as copy, move, delete, rename, create directories, and so on. This feature ensures that you can stay within the metadata editing environment, saving the time otherwise wasted switching between programs. One thing that has been missing in my experience, and which could be useful, is multiple filesystem views, so that you can jump between directories or volumes without leaving your working directory. This idea is inspired by my preferred code editor, HomeSite.
  • Metadata listed in directory views: Selected metadata should be shown as part of the file list to allow a quick appraisal of the embedded metadata. Like a file list in the operating system, the list should be sortable and/or filterable by metadata field.
  • Ability to manipulate many different metadata standards: The program should be able to manipulate all applicable formats for the target object type (image, sound, text, etc.). Additionally, an ideal program would be extensible such that new metadata and file types could be added as needed.
  • Automated or batch editing: Manual, object-by-object editing is an expected feature, but the greatest time saver is the ability to modify entire directories or lists at once. Additionally, the ability to convert from one metadata format to another applicable format (e.g., ID3v1 to ID3v2) is essential. Copying tags directly from one field to another in the same file, swapping tags, and copying tags from one file to another have also been essential features. Finally, extraction of metadata from the filesystem, such as regular expression or pattern conversion of filenames into metadata, and vice versa, has also come in handy (see the sketches following this list).
  • Ability to create or access authority files: Tag & Rename allows me to create a list of genres for music files and exposes that list in edit dialogs, although it does not appear to have an option to force me to use only that list. In the absence of pre-coordinate lists, input masks should be available, especially for date and time fields. An added bonus for more detailed metadata formats would be access to authoritative Web services for standard entries, such as an LCSH service for subjects, though I am not certain that such things yet exist.
  • Aggregate and summary views: This feature does not exist in Tag & Rename, but having brought all of my encoded music into an access system, I have found it sorely lacking. Essentially, there should be a way to see the total number of objects marked with specific data, for example, grouped by genre. By browsing a list of all genres returned by my access system, I was able to spot outliers or variants (e.g., Synth Pop vs. Synthpop) and go back to the metadata editor to normalize as needed (one of the sketches after this list illustrates this). It would be ideal to have this capability within the editor; although simply conforming to an authority list of genres would have prevented this particular problem, there may be situations where a strict authority list is not desirable.
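To make the batch-editing point concrete, here is a minimal sketch, again using Python and Mutagen rather than anything from my actual workflow, of pattern conversion from filenames into metadata. The "NN - Title.mp3" naming convention and the artist and album values are assumptions for illustration:

    import re
    from pathlib import Path
    from mutagen import File

    # Assumed filename convention: "07 - Some Title.mp3" (hypothetical).
    PATTERN = re.compile(r"^(?P<track>\d+) - (?P<title>.+)$")

    def tag_directory(directory, artist, album):
        """Batch-fill tags for every MP3 in a directory from its filename."""
        for path in Path(directory).glob("*.mp3"):
            match = PATTERN.match(path.stem)
            if match is None:
                print(f"skipped (no pattern match): {path.name}")
                continue
            audio = File(str(path), easy=True)
            audio["artist"] = [artist]
            audio["album"] = [album]
            audio["tracknumber"] = [match.group("track")]
            audio["title"] = [match.group("title")]
            audio.save()

    tag_directory("Some Album Directory", "Some Artist", "Some Album")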
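And for the aggregate view and authority-list points, a short sketch along the same lines: count genre values across a collection and flag anything that falls outside a controlled list. The authority list and the library root here are made-up examples:

    from collections import Counter
    from pathlib import Path
    from mutagen import File

    AUTHORITY = {"Synthpop", "Post-Punk", "New Wave"}  # example controlled list

    genres = Counter()
    for path in Path("music").rglob("*.mp3"):  # hypothetical library root
        audio = File(str(path), easy=True)
        for genre in audio.get("genre", []):
            genres[genre] += 1

    # The summary makes variants like "Synth Pop" vs. "Synthpop" obvious.
    for genre, count in genres.most_common():
        flag = "" if genre in AUTHORITY else "  <-- not in authority list"
        print(f"{count:5d}  {genre}{flag}")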

This is by no means an exhaustive list, and is perhaps too general to fit all object types, but the basic concept is clear. As a rule, the less typing one does, the more accurate the metadata, but as I have experienced, even external databases have errors. Over the course of thousands of files or records, small error percentages accrue quickly. I can only imagine the headaches that would have arisen were my project to take place in a larger organization, with many people participating in the encoding and preservation process, let alone with a much larger corpus. It is clear that quality control of metadata, whether hand-entered or not, is crucial. These software tools make that quality control feasible for a single person, but they do not remove the need for it.