January 17, 2022

The Library of Congress Posts Update and Releases Report About What’s Going On With Their Twitter Archive

Update: Digital preservation expert and founder of LOCKSS, Dr. David Rosenthal, offers some analysis of the amount of data the archive contains. Hat Tip: @lorcand

The Library of Congress is out with a blog post and white paper (embedded below) that provide info about the complete archive of tweets that Twitter donated to The Library of Congress.

The donation was first announced on April 15, 2010 in blog posts by LC and Twitter.

Since then, LC has remained very quiet about how the Twitter archive might be used and whether it would be available to the public, either online or in person at LC.

While LC officials did make comments from time to time, almost no new details emerged, although we asked…a lot. We never understood (and still don’t) why LC has been so tight-lipped about this project.

One thing we did learn was that a Boulder, CO company named Gnip was working with LC to build the archive. By the way, Gnip also provides exclusive (fee-based) access to every publicly available tweet back to 2006.

Today’s Update

Today, almost 1,000 days after the donation was first announced, LC’s Director of Communications, Gayle Osterberg, has written a blog post with an update about LC’s Twitter archive.

Key Points from the Blog Post

  • Archive of tweets from 2006-2010 now complete.
  • The full collection now contains 170 billion tweets.
  • “The volume of tweets the Library receives each day has grown from 140 million beginning in February 2011 to nearly half a billion tweets each day as of October 2012.”
  • LC’s focus is now, “addressing the significant technology challenges to making the archive accessible to researchers in a comprehensive, useful way.”
  • Getting this done is a priority for LC.
  • LC has received more than 400 requests from researchers to use the archive.

It’s good to learn some new details about how the project is going.

However, the post and report lack specifics about:

  • Access to the archive (Who will be able to access? How will the process work?)
  • A preliminary/tentative timeline about when this access might become available. Later this year? Next year?
  • Details about the technology that will be used to search and organize tweets.
  • We did learn when the project launched that the Computational Approaches to Digital Stewardship partnership between Stanford and LC might be involved. Are they? Were they?
  • Why LC has been so quiet about how the project was developing.

The Washington Post has a story about the Twitter archive that includes several interesting details (not included in the LC document) that help answer some of the questions listed above. The article includes several quotes from Deputy Librarian of Congress Robert Dizard that make it sound like access for researchers will not be available anytime soon.

See: “Library of Congress has archive of tweets, but no plan for its public display.”

On the Data LC Has Now Archived

“It’s pretty raw,” [Deputy Librarian of Congress Robert] Dizard said. “You often hear a reference to Twitter as a fire hose, that constant stream of tweets going around the world. What we have here is a large and growing lake. What we need is the technology that allows us to both understand and make useful that lake of information.”

On Access

For now, giving researchers access to the archive remains cost-prohibitive for the cash-strapped library, which has spent tens of thousands of dollars on the project so far, Dizard says.

“We know from the testing we’ve done with even small parts of the data that we are not going to be able to, on our own, provide really useful access at a cost that is reasonable for us,” Dizard said. “For even just the 2006 to 2010 [portion of the] archive, which is about 21 billion tweets, just to do one search could take 24 hours using our existing servers.”

Future Plans

The eventual plan is to make the collection available only within the Library of Congress reading rooms. Requiring an in-person visit to search a database of material that originated online may seem incongruous, but Dizard says it’s a condition of the deal with Twitter, which gifted the archive, so that the library won’t be “competing with the commercial sector.”

Finally, here’s the complete white paper that LC made available today. The section titled “The Library of Congress Agreement with Twitter” includes details that had not been made public until now, although we asked LC for them several times back when the archive project was first announced.

Update on the Twitter Archive At The Library of Congress

About Gary Price

Gary Price (gprice@mediasourceinc.com) is a librarian, writer, consultant, and frequent conference speaker based in the Washington D.C. metro area. Before launching INFOdocket, Price and Shirl Kennedy were the founders and senior editors at ResourceShelf and DocuTicker for 10 years. From 2006-2009 he was Director of Online Information Services at Ask.com, and is currently a contributing editor at Search Engine Land.