January 18, 2019

The New York Times Digitizes Millions of Historical Photos Using Google Cloud Technology

From The NY Times:

The New York Times today announced that it is leveraging the power of Google Cloud technology to digitize an extensive collection of photographs dating back to as early as the late 19th century. The process will uncover some never-before-seen documents, equip Times journalists with an easily accessible historical reference source, and preserve The Times’s history, one of its most unique assets.

Prior to the digitization, millions of photographs, along with tens of millions of historical news clippings, microfilm records and other archival materials, existed only in a physical archive three levels below ground near The Times headquarters in New York City called “The New York Times Archival Library,” also known as the “morgue.” Though The Times officially began clipping and saving articles in the 1870s, they were not formally codified into a library until 1907.

Read the Complete Announcement

From a Google Blog Post:

Here’s how it works. Once an image is ingested into Cloud Storage, The Times uses Cloud Pub/Sub to kick off the processing pipeline to accomplish several tasks. Images are resized through services running on Google Kubernetes Engine (GKE) and the image’s metadata is stored in a PostgreSQL database running on Cloud SQL, Google’s fully-managed database offering.

Cloud Pub/Sub helped The New York Times create its processing pipeline without having to build complex APIs or business process systems. It’s a fully-managed solution, so there’s no time spent maintaining the underlying infrastructure.

In order to resize the images and modify image metadata, The Times uses “ImageMagick” and “ExifTool”, open-source command-line programs. They added ImageMagick and ExifTool wrapped with Go services to Docker images in order to run them on GKE in a horizontally-scalable manner with minimal administrative effort. Adding more capacity to process more images is trivial, and The Times can stop or start its Kubernetes cluster when the service is not needed. The images are also stored in Cloud Storage multi-region buckets for availability in multiple locations.
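As a rough illustration of what "wrapped with Go services" could mean, the sketch below shells out to the two command-line tools from Go using `os/exec`. It is not The Times's code. It assumes the `convert` (ImageMagick) and `exiftool` binaries are on the PATH inside the container, and the helper names (`resizeArgs`, `metadataArgs`, `run`) and file names are hypothetical. ImageMagick's `-resize WxH` fits the image inside the given box while preserving aspect ratio, and ExifTool's `-json` prints the file's metadata as JSON.

```go
package main

import (
	"fmt"
	"os/exec"
)

// resizeArgs builds the argument list for ImageMagick's convert command:
// read src, fit it inside a maxW x maxH box, and write the result to dst.
func resizeArgs(src, dst string, maxW, maxH int) []string {
	return []string{src, "-resize", fmt.Sprintf("%dx%d", maxW, maxH), dst}
}

// metadataArgs builds the argument list that asks ExifTool to print
// the metadata of src as JSON.
func metadataArgs(src string) []string {
	return []string{"-json", src}
}

// run executes an external program and returns its combined output.
// The error is wrapped with the program name to ease debugging.
func run(name string, args []string) ([]byte, error) {
	out, err := exec.Command(name, args...).CombinedOutput()
	if err != nil {
		return out, fmt.Errorf("%s failed: %w", name, err)
	}
	return out, nil
}

func main() {
	// Hypothetical file names; in a service these would come from the request.
	if _, err := run("convert", resizeArgs("in.tif", "out.jpg", 1024, 1024)); err != nil {
		fmt.Println(err)
		return
	}
	out, err := run("exiftool", metadataArgs("out.jpg"))
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("%s", out)
}
```

Keeping argument construction separate from execution makes the wrapper easy to unit test without the binaries installed, and each container replica can handle images independently, which is what allows the horizontal scaling described above.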

Learn More, Read the Complete Blog Post

About Gary Price

Gary Price (gprice@mediasourceinc.com) is a librarian, writer, consultant, and frequent conference speaker based in the Washington, D.C. metro area. Before launching INFOdocket, Price and Shirl Kennedy were the founders and senior editors at ResourceShelf and DocuTicker for 10 years. From 2006 to 2009 he was Director of Online Information Services at Ask.com, and he is currently a contributing editor at Search Engine Land.
