January 24, 2013 by Gary Price

Web Search: What’s Common Crawl All About?

From MIT Technology Review:

A nonprofit called Common Crawl is now using its own Web crawler and making a giant copy of the Web that it makes accessible to anyone. The organization offers up over five billion Web pages, available for free so that researchers and entrepreneurs can try things otherwise possible only for those with access to resources on the scale of Google’s.
[Clip]
Common Crawl has so far indexed more than five billion pages, adding up to 81 terabytes of data, made available through Amazon’s cloud computing service. For about $25 a programmer could set up an account with Amazon and get to work crunching Common Crawl data, says Lisa Green, Common Crawl’s director. The Internet Archive, another nonprofit, also compiles a copy of the Web and offers a service called the “Wayback Machine” that can show old versions of a particular page. However, it doesn’t allow anyone to analyze all its data at once in that way.
Common Crawl has already inspired or helped out some new Web startups. TinEye, a “reverse” search engine that finds images similar to one provided by the user, made use of early Common Crawl data to get started. One programmer’s personal project using Common Crawl data to measure how many of the Web’s pages connect to Facebook—some 22 percent, he concluded—led to his securing funding for a startup, Lucky Oyster, based on helping people find useful information in their social data.

Read the Complete Article
Learn More: Visit the Common Crawl Web Site and Take a Look at the Winners of Common Crawl’s Code Contest

Filed under: Data Files, Funding, News

About Gary Price

Gary Price (gprice@mediasourceinc.com) is a librarian, writer, consultant, and frequent conference speaker based in the Washington, D.C. metro area. Before launching INFOdocket, Price and Shirl Kennedy were the founders and senior editors at ResourceShelf and DocuTicker for 10 years. From 2006 to 2009 he was Director of Online Information Services at Ask.com, and he is currently a contributing editor at Search Engine Land.

© 2022 Library Journal. All rights reserved.

