Why MARC drives me nuts

2008 May 2
by Karen

I’m doing some work with an early release of the WorldCat API to incorporate some new functionality into our content management system. Part of this involves searching WorldCat and manipulating the records it returns as MARCXML. The service will also return Dublin Core records, but I decided that I preferred to work with the full information. This has created a variety of headaches and reacquainted me with why I think MARCXML is just plain wrong, very wrong in fact.

First of all, the structure of a MARCXML document immediately makes it difficult to select nodes: instead of rationally using things like 245 as node names, MARCXML makes this a tag attribute on a datafield node. To make matters worse, every subfield is a generic subfield node, and which subfield it is gets stored in a code attribute. This makes for convoluted XPath statements just to select nodes.
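
To make that concrete, here is a rough sketch in Python of what pulling a title out of a record looks like. It uses only the standard library's ElementTree and its limited XPath support. The sample record is made up for illustration, and get_subfields is a hypothetical helper name, not something from the WorldCat API or any MARC library.

```python
import xml.etree.ElementTree as ET

# MARCXML namespace (the "slim" schema published by the Library of Congress).
NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# A made-up, minimal record for illustration only.
SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">An example title :</subfield>
    <subfield code="b">a subtitle /</subfield>
    <subfield code="c">by A. Author.</subfield>
  </datafield>
</record>"""


def get_subfields(record, tag, code):
    """Return the text of every subfield `code` in every datafield `tag`."""
    # The field and subfield identities live in attributes, so both steps
    # of the path need a predicate instead of a plain element name.
    path = f"marc:datafield[@tag='{tag}']/marc:subfield[@code='{code}']"
    return [el.text or "" for el in record.findall(path, NS)]


record = ET.fromstring(SAMPLE)
print(get_subfields(record, "245", "a"))
```

Both steps of the path need an attribute predicate, and the namespace prefix has to be carried along on every step as well.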

The second thing that infuriates me about MARCXML isn’t really a MARCXML phenomenon but rather a carryover from MARC itself. The issue is the punctuation that delimits subfields. Instead of having the system supply the proper punctuation from context, given the fields and subfields, MARC has catalogers encode the fields WITH the punctuation. The problem with this is that the stored value is never the clean value of the field. As a result, you need to strip off characters to get a proper display.
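
The cleanup ends up looking something like the sketch below. This is a heuristic, not a rule taken from any cataloging standard: the set of trailing characters is my own guess at the common ISBD separators, and strip_isbd is a hypothetical name. It deliberately leaves trailing periods alone, because they may belong to an abbreviation or initial.

```python
import re

# Assumed set of trailing separators: slash, colon, semicolon, equals sign
# or comma, with any surrounding whitespace, at the very end of the value.
_TRAILING = re.compile(r"\s*[/:;=,]\s*$")


def strip_isbd(value):
    """Remove one trailing separator (and surrounding whitespace) from a subfield value."""
    return _TRAILING.sub("", value).strip()


for raw in ["An example title :", "a subtitle /", "by A. Author."]:
    print(repr(strip_isbd(raw)))
```

A value that legitimately ends in one of these characters would be clipped too, which is exactly the sort of guesswork that embedded punctuation forces on the display code.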

Part of me wishes that I’d not chosen to go the MARCXML route for this project. But the alternative of Dublin Core isn’t necessarily preferable. While the WorldCat API’s DC output does clean up some of the punctuation issues and creates a simpler node structure, it also has the failings of Dublin Core: a creator can be many different things, from author to illustrator to editor. A compromise might be to do a MARCXML to MODS transformation and work with the records as MODS.

For now, though, I’ll keep working with the MARC so I can get the project done. Maybe the next version will use a MARCXML to MODS transform!

7 Responses
  1. 2008 May 2

    We (the Evergreen team) came to much the same conclusion regarding the utility of manipulating MARCXML directly. And, we came to the same solution — use the MODS version of the record instead. Based on our experience, I would highly recommend that tactic for your next version.

    The benefits are pretty clear once you’ve considered what MODS is actually providing. As a first-order example, you get semantic interpretation of MARC for free. For instance, titles, with all their constituent parts, are separated out for simple use using straightforward XPath. This means you don’t have to be a cataloger to actually build a display system around the data. No referencing AACR2, no learning the (sometimes vague and often confusing) interplay between subfields and indicators. Just simple, straightforward, explicit data with a well thought out structure and layout.

    It’s hard to overstate just that benefit alone, really, because most library developers are not, in fact, catalogers.

    Also, consider newly emerging efforts, such as the Bibliographic CQL Context Set ( http://www.loc.gov/standards/sru/cql-bibliographic-searching.html ), which explicitly references MODS as the structural model for searching. Non-cataloger developers, let alone non-library developers, can read that and understand easily and at a glance how to implement a standards-compliant SRU interface to their data or even how to consume other sites’ data, MARC or otherwise. One of the implications is that we can push ourselves and our data out into the wider world by lowering the barriers to entry for library/non-library collaboration.

    I may sound like I’m gushing a bit, but I honestly believe that MODS is one of the most important foundational (and, unfortunately, one of the more under-utilized and under-appreciated) technologies that the library world has developed.

    –miker

  2. 2008 May 2

    This reminds me of one of the things lost when UKMARC was folded into MARC21: UKMARC didn’t require that the punctuation be embedded in the record. Nor does UNIMARC. I’ve often wished that USMARC/MARC21 had taken up that idea.

  3. 2008 May 2

    Thanks for suffering through the pain, Karen. We’re excited about opening up the WorldCat API to a wider audience…and glad that we have smart people like you, thinking about it!

  4. 2008 May 3

    To be fair to the creators of MARCXML, “245” would be an invalid element name and wouldn’t work with any conforming XML tools. The XML standard (http://www.w3.org/TR/xml/) defines a valid name as:

    Name ::= (Letter | '_' | ':') (NameChar)*

    … so it has to begin with a letter, underscore, or colon.

    Even if XML allowed numeric names, I don’t see how it’s significantly more difficult to match nodes with XPath syntax like "//245" vs. "//datafield[@tag='245']". Selecting records based on the value of leader position 18 certainly isn’t fun, but then it isn’t much fun in MARC21 either.

    So MARCXML is nothing more than re-encoding MARC21 in XML format. The major benefit over MARC21 is that you get to work with standard tools and standard encodings.

  5. 2008 May 5

    I don’t understand your rant. Sure, MARC is a terrible format in some sense, but MARCXML is just an XML representation of MARC, so don’t blame MARCXML but MARC! As Dan showed, you only have to know how to write the right XPath statement. By the way, the only additional value of MARCXML compared to MARC is that you have a character encoding and basic structure in a format that can be used by all common programming libraries. Don’t expect more magic.

  6. 2008 May 5

    My rant isn’t just about MARCXML; it is about MARC in general. I do think that in the process of deciding to create MARCXML, someone should have thought about the parts of MARC that really don’t work well, particularly if this is the data format we use to transmit between ourselves and non-library folk. The node names don’t make sense (which I acknowledge is a legacy of MARC), but that doesn’t make it right or easy to use.

    On the subject of XPath, the example I crabbed about is EASY compared to some of the things one has to do to get the right data out of MARCXML. The XPath syntax when one has to get a particular subfield from a particular field with certain indicators is IMHO extremely convoluted. The two problems I see with using MARCXML are that the person writing the code has to have a pretty good knowledge of XPath to deal with it, and that same person needs to understand the intricacies of the MARC record syntax. Frankly, I don’t find that all that common in libraries, let alone outside of libraries.

    Which leaves us caught between a complex and cumbersome standard (MARCXML) and a simpler standard that isn’t rich enough from a metadata perspective (Dublin Core). Forgive me, but I just don’t find that satisfactory.

  7. 2008 May 6

    Apologies to Mike, whose comment got marked as spam for some reason. I’ve retrieved it from Akismet and pushed it through. I wish that it had gone through first, because it hits on many of my underlying frustrations.
