"Lessons From the Library: Behind the UK's Web Archive"
The truth is that the British Library has some of the most experienced and talented technologists in the IT and communications space, applying cutting-edge technology to solve some pretty tough and very interesting problems.
And their mission is not to bring a traditional institution into the modern age, because it’s already there. Indeed, in areas such as digitisation and information storage, archiving and retrieval, the library would put many big corporate IT departments to shame. A quick browse of www.bl.uk will provide a flavour of how some of this manifests itself on the Web, though a lot more goes on behind the scenes in support of academic institutions and researchers around the world.
I got a first-hand glimpse of all this when I visited the library’s facility in Boston Spa a few months ago and was hosted by Nicki Clegg, who manages the technical architecture group. Nicki oversees the evolution of the library’s information systems’ architecture, and leads a team that provides technical architectural expertise to programmes and projects.
One of these is the Web Archiving programme, which has been selectively preserving UK websites through a permission-based process since 2004 and making them accessible through the UK Web Archive. The programme acknowledges that a lot of the UK’s history now plays out on the web. It also works on the premise that website content is very often transient in nature. As any website designer or online media strategist will tell you, the key to a successful site is keeping the content fresh, current and relevant to your audience.
The current architecture is therefore a hybrid one, with local Tomcat servers running a version of the Wayback open source archiving software, working in tandem with very selective cloud-based processing. The clever piece (or one of them) is hosting the archive index on Amazon EC2. The index is a critical component from a performance and scalability perspective, yet it is highly compressed and therefore easy to move around. Sending index requests to the cloud for fast resolution, while keeping the heavy-lifting content-serving mechanism local, is a good compromise.
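To make the split concrete, here is a minimal Python sketch of the idea: a small index resolves a URL and timestamp to a location, and a separate local store serves the bytes. It is an illustration under stated assumptions, not the library's implementation. The class names (RemoteIndex, LocalStore), the entry fields and the nearest-capture rule are all hypothetical, and the "remote" index is simply an in-memory list standing in for a service hosted in the cloud.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class IndexEntry:
    """One capture: where the archived bytes for a URL at a given time live."""
    url: str
    timestamp: str      # assumed 14-digit form, e.g. "20080101000000"
    archive_file: str   # hypothetical name of a locally stored archive file
    offset: int
    length: int


class RemoteIndex:
    """Stand-in for the cloud-hosted index (hypothetical, in-memory)."""

    def __init__(self, entries: Iterable[IndexEntry]) -> None:
        self._entries = sorted(entries, key=lambda e: (e.url, e.timestamp))
        self._keys = [(e.url, e.timestamp) for e in self._entries]

    def resolve(self, url: str, timestamp: str) -> Optional[IndexEntry]:
        """Return the first capture at or after timestamp, else the latest earlier one."""
        i = bisect_left(self._keys, (url, timestamp))
        if i < len(self._entries) and self._entries[i].url == url:
            return self._entries[i]
        if i > 0 and self._entries[i - 1].url == url:
            return self._entries[i - 1]
        return None


class LocalStore:
    """Stand-in for the local content-serving tier (hypothetical)."""

    def __init__(self, files: Dict[str, bytes]) -> None:
        self._files = files

    def read(self, entry: IndexEntry) -> bytes:
        data = self._files[entry.archive_file]
        return data[entry.offset:entry.offset + entry.length]


def replay(index: RemoteIndex, store: LocalStore, url: str, timestamp: str) -> Optional[bytes]:
    """Small lookup goes to the index; the bulky payload is read locally."""
    entry = index.resolve(url, timestamp)
    return None if entry is None else store.read(entry)
```

The point of the design is that the index answers a small question (which file, which offset) and can live wherever lookups are fastest, while the bulky archived content never has to leave the local servers.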
About Gary Price
Gary Price (firstname.lastname@example.org) is a librarian, writer, consultant, and frequent conference speaker based in the Washington D.C. metro area. He earned his MLIS degree from Wayne State University in Detroit. Price has won several awards, including the SLA Innovations in Technology Award and Alumnus of the Year from the Wayne State University Library and Information Science Program. From 2006 to 2009 he was Director of Online Information Services at Ask.com.