May 17, 2022

A New Book Chapter by Two Googlers: "Indexing the World Wide Web: The Journey So Far"

Title: Indexing the World Wide Web: The Journey So Far (PDF; Full Text)
Authors: Abhishek Das, Ankit Jain

This Chapter is Scheduled To Appear In: Next Generation Search Engines: Advanced Models for Information Retrieval, 2011

In this chapter, we describe the key indexing components of today’s web search engines. As the World Wide Web has grown, the systems and methods for indexing have changed significantly. We present the data structures used, the features extracted, the infrastructure needed, and the options available for designing a brand new search engine. We highlight techniques that improve relevance of results, discuss trade-offs to best utilize machine resources, and cover distributed processing concepts in this context. In particular, we delve into the topics of indexing phrases instead of terms, storage in memory vs. on disk, and data partitioning. We will finish with some thoughts on information organization for the newly emerging data-forms.

UPDATED: 10/15/2012
Added Embed.


About Gary Price

Gary Price is a librarian, writer, consultant, and frequent conference speaker based in the Washington, D.C. metro area. Before launching INFOdocket, Price and Shirl Kennedy were the founders and senior editors at ResourceShelf and DocuTicker for 10 years. From 2006 to 2009 he was a Director of Online Information Services, and he is currently a contributing editor at Search Engine Land.