November 26, 2020

"Lessons From the Library: Behind the UK's Web Archive"

From Computer Weekly:

The truth is that the British Library has some of the most experienced and talented technologists in the IT and communications space, applying cutting-edge technology to solve some pretty tough and very interesting problems.

And their mission is not to bring a traditional institution into the modern age, because it’s already there. Indeed, in areas such as digitisation and information storage, archiving and retrieval, it would put many big corporate IT departments to shame. A quick browse of www.bl.uk will provide a flavour of how some of this manifests itself on the Web, though a lot more goes on behind the scenes in support of academic institutions and researchers around the world.

I got a first-hand glimpse of all this when I visited the library’s facility in Boston Spa a few months ago and was hosted by Nicki Clegg, who manages the technical architecture group. Nicki oversees the evolution of the library’s information systems’ architecture, and leads a team that provides technical architectural expertise to programmes and projects.

One of these is the Web Archiving programme, which has been selectively preserving UK websites through a permission-based process since 2004 and making them accessible through the UK Web Archive. The programme acknowledges that a lot of the UK’s history now plays out on the web. It also works on the premise that website content is very often transient in nature. As any website designer or online media strategist will tell you, the key to a successful site is keeping the content fresh, current and relevant to your audience.

[Clip]

The current architecture is therefore a hybrid one, with local Tomcat servers running a version of the Wayback open source archiving software, working in tandem with very selective cloud-based processing. The clever piece (or one of them) is the hosting of the archive index on EC2. The index is a critical component from a performance and scalability perspective, yet is very compressed and therefore easier to move around. Sending index requests to the cloud for fast resolution, while keeping the heavy-lifting content-serving mechanism local, is a good compromise.
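The split described above can be illustrated with a minimal Python sketch. It is not the library's implementation or the Wayback software's actual index format, neither of which the excerpt describes. The class names (`RemoteIndex`, `LocalStore`), the fields of `IndexEntry`, and the "latest capture at or before the requested time" rule are all assumptions made for illustration. The point is only the shape of the design: a small index answers "where is this capture?", and the bulky archived bytes are read from local storage.

```python
import bisect
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexEntry:
    """One capture of a URL (hypothetical fields, for illustration only)."""
    timestamp: str  # 14-digit capture time, e.g. "20100315120000"
    container: str  # name of the local archive file holding the capture
    offset: int     # byte offset of the capture inside that file
    length: int     # number of bytes in the capture


class RemoteIndex:
    """Stand-in for the compact index that would be hosted in the cloud."""

    def __init__(self):
        self._entries = {}  # url -> list of IndexEntry, sorted by timestamp

    def add(self, url, entry):
        entries = self._entries.setdefault(url, [])
        entries.append(entry)
        entries.sort(key=lambda e: e.timestamp)

    def lookup(self, url, timestamp):
        """Return the latest capture at or before `timestamp`.

        Falls back to the earliest capture if none is that old, and
        returns None if the URL was never archived.
        """
        entries = self._entries.get(url)
        if not entries:
            return None
        stamps = [e.timestamp for e in entries]
        pos = bisect.bisect_right(stamps, timestamp)
        return entries[pos - 1] if pos else entries[0]


class LocalStore:
    """Stand-in for the locally held archive files (the heavy content)."""

    def __init__(self, containers):
        self._containers = containers  # container name -> bytes

    def read(self, entry):
        data = self._containers[entry.container]
        return data[entry.offset:entry.offset + entry.length]


def replay(url, timestamp, index, store):
    """Resolve the capture via the index, then serve the bytes locally."""
    entry = index.lookup(url, timestamp)
    return None if entry is None else store.read(entry)


if __name__ == "__main__":
    store = LocalStore({"a.dat": b"AAAABBBBBB"})
    index = RemoteIndex()
    index.add("http://example.org/", IndexEntry("20080101000000", "a.dat", 0, 4))
    index.add("http://example.org/", IndexEntry("20100101000000", "a.dat", 4, 6))
    print(replay("http://example.org/", "20090101000000", index, store))
```

In this sketch only the small `lookup` call would cross the network; the `read` that moves the actual content stays on the local side, which mirrors the compromise the article describes.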

Read the Complete Article

Direct to the UK Web Archive

About Gary Price

Gary Price (gprice@mediasourceinc.com) is a librarian, writer, consultant, and frequent conference speaker based in the Washington, D.C. metro area. Before launching INFOdocket, Price and Shirl Kennedy were the founders and senior editors at ResourceShelf and DocuTicker for 10 years. From 2006 to 2009 he was Director of Online Information Services at Ask.com, and he is currently a contributing editor at Search Engine Land.
