Harvard’s library released millions of MARC records into the wild recently. They’re not the first, as Jessamyn West points out, but as a wild creature herself, the Loon cannot but approve of additional wild data.
Stuart Shieber, who is not a librarian, disclaims any sense of how the records will be used, though his curiosity on the point is gratifying. Permit the Loon to blue-sky a few potential uses.
Libraries face a tremendous data migration. Our catalog records, currently employing 1960s-developed MARC as their carrier and a rather toxic and inconsistent stew of AACR2, ISBD(G), and “local practice” (different in every locality, of course) as their content, will be changing both carrier and content rules in relatively short order. The new carrier will in all likelihood be RDF. The new content rules, and the data models underlying them, are still under construction.
Harvard’s records may present a curious and useful study. Harvard’s library system is extremely decentralized, pathologically so at this historical juncture. Catalogers have been scattered hither and yon through various little autonomous physical libraries. How this affects the entirety of Harvard cataloging practice the Loon can’t say for sure, but in the natural course of things, it would tend to work against uniform practices and uniform record quality. Quality-control processes, especially after initial record creation, are dicey in any library; given how long Harvard libraries have been MARC-cataloging materials, that adds a layer of temporal variation to Harvard’s dataset. Again, how extensive that layer is the Loon can’t rightly say.
Contrast this with OCLC’s WorldCat, which for all its faults (and does it ever have them!) is able to impose layers of automated record-checking, migration, and consistent practice that are out of reach of any individual library or library system. OCLC Research has been doing considerable valuable research on its own records, asking questions about how migration might work and what its results will be, but the broader applicability of its results is somewhat limited by the relative homogeneity and high quality of its recordset.
So Harvard, Michigan, and others offer migration researchers and developers real-world, warts-and-all datasets to test their theories and tools on. That’s excellent.
Going a little further into the blue sky… the new catalog environment will, from the point of view of most libraries, contain considerably fewer moving parts than the current one. (This is good, because the current cataloging workflow-set is lamentably inefficient, and libraries have better uses for all those skilled people.) In a true “shared cataloging” environment, when a library receives a new copy of a book, cataloging will amount to stating “Library X (identified by its URI) has a copy of Book Y (identified by its URI; yes, there are complexities here, but they’ll be hidden from most libraries).” That’s it. The rest of the necessary information about Book Y will generally have been pushed into the networked linked-data environment by other actors, such as other libraries or even publishers or booksellers.
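To make the “Library X has a copy of Book Y” statement concrete, here is a minimal sketch of what such a holdings assertion might look like as a single RDF triple, written out in N-Triples syntax. Everything named here is an assumption for illustration only: the `example.org` URIs, the `holds` predicate, and the `holding_triple` helper are hypothetical, and no actual shared-cataloging vocabulary is implied.

```python
# Hypothetical predicate URI meaning "this library holds a copy of".
# No real vocabulary is implied; a production system would use an agreed one.
HOLDS = "http://example.org/vocab/holds"


def _check_uri(uri):
    """Reject strings that cannot sit inside <...> in N-Triples."""
    if not uri or any(ch in uri for ch in ' <>"\n\t'):
        raise ValueError(f"not usable as a URI: {uri!r}")
    return uri


def holding_triple(library_uri, book_uri, predicate_uri=HOLDS):
    """Return one N-Triples line: <library> <holds> <book> ."""
    subject = _check_uri(library_uri)
    predicate = _check_uri(predicate_uri)
    obj = _check_uri(book_uri)
    return f"<{subject}> <{predicate}> <{obj}> ."


if __name__ == "__main__":
    # Both URIs below are made up for the example.
    print(holding_triple(
        "http://example.org/library/x",
        "http://example.org/book/y",
    ))
```

The point of the sketch is how little the holding library has to say: one subject, one predicate, one object. Everything else about Book Y would hang off its URI, contributed by whoever described it first.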
So given libraries’ legacy-MARC-record environment—a vast heterogeneous semi-structured record soup—how feasible is it to migrate a MARC recordset into the shared-cataloging cloud? How much of it can be done more or less automagically, keying in on the (sadly few) unambiguous (or even ambiguous) identifiers recorded in typical MARC practice? What’s the minimal human intervention required to make leftover records loadable?
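A first automated pass might look something like the sketch below, which sorts records into those carrying a usable identifier and those left over for human attention. It assumes a drastically simplified record shape (a dict mapping a MARC tag to a list of subfield `$a` values) rather than real MARC parsing, and it looks only at ISBNs in field 020 and OCLC numbers in field 035. The sample records, the regular expressions, and the function names are all hypothetical; real-world data will be messier than these patterns allow for.

```python
import re

# Hypothetical, simplified records: MARC tag -> list of subfield $a values.
# Parsing actual MARC files is deliberately out of scope for this sketch.
SAMPLE_RECORDS = [
    {"245": ["An example title"], "020": ["9780306406157 (pbk.)"]},
    {"245": ["Another example"], "035": ["(OCoLC)12345678"]},
    {"245": ["A pamphlet with local cataloging only"]},
]

# 13-digit or 10-character ISBN (check character may be X), hyphens removed.
ISBN_RE = re.compile(r"\b(?:\d{13}|\d{9}[\dXx])\b")
# OCLC number in an 035, with an optional ocm/ocn/on prefix and leading zeros.
OCLC_RE = re.compile(r"\(OCoLC\)\s*(?:ocm|ocn|on)?0*(\d+)")


def extract_identifiers(record):
    """Return a list of (scheme, value) pairs found in one record."""
    found = []
    for value in record.get("020", []):
        match = ISBN_RE.search(value.replace("-", ""))
        if match:
            found.append(("isbn", match.group(0).upper()))
    for value in record.get("035", []):
        match = OCLC_RE.search(value)
        if match:
            found.append(("oclc", match.group(1)))
    return found


def triage(records):
    """Split records into (record, identifiers) matches and leftovers."""
    matched, leftover = [], []
    for record in records:
        identifiers = extract_identifiers(record)
        if identifiers:
            matched.append((record, identifiers))
        else:
            leftover.append(record)
    return matched, leftover


if __name__ == "__main__":
    matched, leftover = triage(SAMPLE_RECORDS)
    print(f"{len(matched)} keyed on an identifier, {len(leftover)} left over")
```

Note that this sketch does not validate ISBN check digits or confirm that an identifier resolves to anything, so "matched" here only means "worth trying automatically." How large the leftover pile turns out to be in a real recordset is exactly what remains to be discovered.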
This question has implications far beyond Harvard or Michigan, of course. One of the plaints from libraries (particularly public libraries) about the upcoming migration is “how will we ever afford it?” If all, or even most, of the work turns out to be automatable, that objection disappears, along with considerable amounts of expensive, clunky MARC-based technical infrastructure. (Will RDF-based infrastructure be any less clunky? The Loon suspects so, for the simple reason that libraries and their vendors are not the only actors in the linked-data environment. More eyes, more developers, make environments less clunky. Usually, anyway.)
Without real-world recordsets, we can’t discover the answers we’ll need as we contemplate migration. With them, we have a fighting chance.
The Loon values these metadata caches as a library-school instructor, too. They’re a bit formidable for the introductory information-organization class she teaches, but bright students with programming experience pop up now and again, and the Loon sees no reason not to have them cut their teeth on a twelve-million-record MARC dataset. (Big Data? We got your Big Data right here.)
So thank you, Harvard. Thank you, Michigan. Thanks to every library that overcomes fear and (sometimes) shame to put raw records out there. You are helping build a truly global network of library-held materials, a laudable process.

What does one do with millions of MARC records? by Library Loon, unless otherwise expressly stated, is licensed under a Creative Commons Attribution 3.0 United States License.