The New York Times Digitizes Millions of Historical Photos Using Google Cloud Technology
From The NY Times:
The New York Times today announced that it is leveraging the power of Google Cloud technology to digitize an extensive collection of photographs dating back to as early as the late 19th century. The process will uncover some never-before-seen documents, equip Times journalists with an easily accessible historical reference source, and preserve The Times’s history, one of its most unique assets.
Prior to the digitization, millions of photographs, along with tens of millions of historical news clippings, microfilm records and other archival materials, existed only in a physical archive called “The New York Times Archival Library,” also known as the “morgue,” located three levels below ground near The Times headquarters in New York City. Though The Times officially began clipping and saving articles in the 1870s, they were not formally codified into a library until 1907.
Here’s how it works. Once an image is ingested into Cloud Storage, The Times uses Cloud Pub/Sub to kick off the processing pipeline to accomplish several tasks. Images are resized through services running on Google Kubernetes Engine (GKE) and the image’s metadata is stored in a PostgreSQL database running on Cloud SQL, Google’s fully-managed database offering.
Cloud Pub/Sub helped The New York Times create its processing pipeline without having to build complex APIs or business process systems. It’s a fully-managed solution, so there’s no time spent maintaining the underlying infrastructure.
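To make the hand-off from Cloud Storage to the processing pipeline more concrete, here is a minimal Go sketch of how a service might decode an upload event. It is not The Times's code. It assumes the bucket publishes Cloud Storage notifications to a Pub/Sub topic and that the topic delivers them to the service through a push subscription; the source does not say which delivery mode is used. The names pushEnvelope, objectInfo, decodeUpload and newPushBody are hypothetical, and newPushBody exists only to fabricate a sample request body for the demo.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// pushEnvelope mirrors the JSON body of a Pub/Sub push delivery.
// Only the fields this sketch needs are declared.
type pushEnvelope struct {
	Message struct {
		Attributes map[string]string `json:"attributes"`
		// Data is base64 text in the JSON body; encoding/json decodes it
		// into raw bytes automatically because the field is a []byte.
		Data []byte `json:"data"`
	} `json:"message"`
}

// objectInfo is a small subset of the Cloud Storage object resource
// carried in the notification payload (assumed JSON payload format).
type objectInfo struct {
	Bucket      string `json:"bucket"`
	Name        string `json:"name"`
	ContentType string `json:"contentType"`
}

// errNotUpload reports an event that is not a newly written object.
var errNotUpload = errors.New("event is not an object upload")

// decodeUpload extracts the uploaded object's details from a push body.
// It rejects events whose eventType attribute is not OBJECT_FINALIZE.
func decodeUpload(body []byte) (objectInfo, error) {
	var env pushEnvelope
	if err := json.Unmarshal(body, &env); err != nil {
		return objectInfo{}, fmt.Errorf("decode envelope: %w", err)
	}
	if env.Message.Attributes["eventType"] != "OBJECT_FINALIZE" {
		return objectInfo{}, errNotUpload
	}
	var obj objectInfo
	if err := json.Unmarshal(env.Message.Data, &obj); err != nil {
		return objectInfo{}, fmt.Errorf("decode payload: %w", err)
	}
	return obj, nil
}

// newPushBody builds a sample push body for demonstration purposes only.
func newPushBody(eventType string, obj objectInfo) ([]byte, error) {
	data, err := json.Marshal(obj)
	if err != nil {
		return nil, err
	}
	var env pushEnvelope
	env.Message.Attributes = map[string]string{"eventType": eventType}
	env.Message.Data = data
	return json.Marshal(env)
}

func main() {
	// Hypothetical bucket and object names.
	body, err := newPushBody("OBJECT_FINALIZE", objectInfo{
		Bucket:      "photo-archive",
		Name:        "scan-0001.tif",
		ContentType: "image/tiff",
	})
	if err != nil {
		panic(err)
	}
	obj, err := decodeUpload(body)
	if err != nil {
		panic(err)
	}
	fmt.Printf("resize gs://%s/%s (%s)\n", obj.Bucket, obj.Name, obj.ContentType)
}
```

In a real service, decodeUpload would sit inside an HTTP handler, and its result would drive the resize and metadata steps described in the article.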
To resize the images and modify image metadata, The Times uses ImageMagick and ExifTool, two open-source command-line programs. They added ImageMagick and ExifTool, wrapped with Go services, to Docker images in order to run them on GKE in a horizontally scalable manner with minimal administrative effort. Adding capacity to process more images is trivial, and The Times can stop its Kubernetes cluster when the service is not needed and start it again later. The images are also stored in Cloud Storage multi-region buckets for availability in multiple locations.
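As a rough illustration of what wrapping the two command-line tools in a Go service can look like, the sketch below builds the argument lists and runs the programs with os/exec. It is an assumption-laden example, not The Times's implementation. The helper names (resizeArgs, describeArgs, run) are hypothetical, the ImageMagick binary is assumed to be available as convert (ImageMagick 7 installs it as magick), and the ExifTool tag written here, ImageDescription, is an arbitrary choice for the demo.

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"strings"
)

// resizeArgs builds ImageMagick arguments that fit src inside a
// maxW x maxH box, preserving aspect ratio, and write the result to dst.
func resizeArgs(src, dst string, maxW, maxH int) []string {
	return []string{src, "-resize", fmt.Sprintf("%dx%d", maxW, maxH), dst}
}

// describeArgs builds ExifTool arguments that set the ImageDescription
// tag on path in place, without leaving a "_original" backup copy.
func describeArgs(path, description string) []string {
	return []string{"-overwrite_original", "-ImageDescription=" + description, path}
}

// run executes an external tool and folds its combined output into the
// returned error, so a failure can be diagnosed from the service logs.
func run(ctx context.Context, bin string, args []string) error {
	out, err := exec.CommandContext(ctx, bin, args...).CombinedOutput()
	if err != nil {
		return fmt.Errorf("%s %s: %w: %s", bin, strings.Join(args, " "), err, out)
	}
	return nil
}

func main() {
	// Print the commands instead of executing them, so the sketch runs
	// even where ImageMagick and ExifTool are not installed.
	fmt.Println("convert", strings.Join(resizeArgs("in.tif", "out.jpg", 1024, 1024), " "))
	fmt.Println("exiftool", strings.Join(describeArgs("out.jpg", "Archive scan"), " "))
}
```

To execute the tools for real, a service would call run(ctx, "convert", resizeArgs(...)) followed by run(ctx, "exiftool", describeArgs(...)). Passing arguments as a slice rather than through a shell avoids quoting problems with file names and captions.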
About Gary Price
Gary Price (email@example.com) is a librarian, writer, consultant, and frequent conference speaker based in the Washington D.C. metro area. He earned his MLIS degree from Wayne State University in Detroit. Price has won several awards, including the SLA Innovations in Technology Award and Alumnus of the Year from the Wayne State University Library and Information Science Program. From 2006 to 2009 he was Director of Online Information Services at Ask.com.