A New Book Chapter by Two Googlers: "Indexing the World Wide Web: The Journey So Far"
Title: Indexing the World Wide Web: The Journey So Far (PDF; Full Text)
Authors: Abhishek Das, Ankit Jain
This Chapter is Scheduled To Appear In: Next Generation Search Engines: Advanced Models for Information Retrieval, 2011
Abstract: In this chapter, we describe the key indexing components of today’s web search engines. As the World Wide Web has grown, the systems and methods for indexing have changed significantly. We present the data structures used, the features extracted, the infrastructure needed, and the options available for designing a brand new search engine. We highlight techniques that improve relevance of results, discuss trade-offs to best utilize machine resources, and cover distributed processing concepts in this context. In particular, we delve into the topics of indexing phrases instead of terms, storage in memory vs. on disk, and data partitioning. We will finish with some thoughts on information organization for the newly emerging data-forms.
UPDATED: 10/15/2012
Added Embed.
About Gary Price
Gary Price (gprice@gmail.com) is a librarian, writer, consultant, and frequent conference speaker based in the Washington, D.C. metro area. He earned his MLIS degree from Wayne State University in Detroit. Price has won several awards, including the SLA Innovations in Technology Award and Alumnus of the Year from the Wayne State University Library and Information Science Program. From 2006 to 2009 he was Director of Online Information Services at Ask.com.