Roy Tennant of OCLC Research, known to many infoDOCKET readers for his work, writing, and presentations, has a great post (with data) on the Hanging Together blog explaining how he used the WorldCat database to build a list of the countries of origin for the 300+ million MARC records in WorldCat.
Roy's research started with the 260 $a subfield (place of publication, distribution, etc.).
Tennant writes about some of the challenges he had with this project.
As you might imagine, what results from such an investigation is a complete dog’s breakfast, with a large variety of punctuation marks, typographical errors, imaginative spellings, and just plain junk. No, it is much better to parse bytes 15-17 of the 008 field, which at least are supposed to only contain values from this list maintained by the Library of Congress. Progress.
That is, until one discovers that this “Code List for Countries” is not exactly that. If you happen to be in a certain select part of the world (mostly the United States, Canada, and Australia), you can also select state or province-specific codes. So before I used this table to translate the codes for actual countries I first had to translate the table, so that the code for “California” translated instead to “United States”. Progress.
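For readers who want to picture the approach Roy describes, here is a minimal Python sketch, not his actual code. It assumes the 008 field is available as a plain string, reads positions 15-17, and rolls state or province codes up to their country. The function name and the two small lookup tables are hypothetical and cover only a handful of entries from the Library of Congress list.

```python
# Hypothetical, hand-picked sample of the MARC Code List for Countries.
# The real list is much longer; these tables are for illustration only.
SUBDIVISION_TO_COUNTRY = {
    "cau": "xxu",  # California -> United States
    "nyu": "xxu",  # New York (State) -> United States
    "onc": "xxc",  # Ontario -> Canada
    "xna": "at",   # New South Wales -> Australia
}

COUNTRY_NAMES = {
    "xxu": "United States",
    "xxc": "Canada",
    "at": "Australia",
    "fr": "France",
}


def country_from_008(field_008):
    """Return a country name for an 008 field string, or None if unknown."""
    if len(field_008) < 18:
        return None  # too short to contain positions 15-17
    # Positions are zero-based; two-letter codes are padded with a blank.
    code = field_008[15:18].strip().lower()
    # Translate a state or province code to its parent country first.
    code = SUBDIVISION_TO_COUNTRY.get(code, code)
    return COUNTRY_NAMES.get(code)
```

Calling `country_from_008` on an 008 string whose positions 15-17 hold `cau` would return "United States" rather than "California", which is the table translation step described in the quote. Codes missing from the sample tables simply come back as `None`.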
Make sure to read the complete blog post, where Roy explains the other issues he faced.
But Wait…There’s More!
Roy and His OCLC Research Colleagues Recently Shared MARC Usage in WorldCat Data Visualizations
See Also: A Real-Time Stream of New/Updated WorldCat Records (with Visualization)
It’s WorldCat Live!