April 24, 2014

Data Mining: New Research From Google’s Peter Norvig Finds Most-Used English Words and Letters


Update: At the bottom of this post we’ve added a link to the full text research paper that the article below references.

From TPM Idea Land:

“Etaoin srhldcu” may read like nonsense to most English speakers upon first blush, but as it turns out, the combination is quite significant. It represents, in order, the most used letters in the English language, according to a new survey of 743 billion words conducted by Google’s head of research Peter Norvig.

The survey, which was publicized by Google Research on Monday, was an update to the seminal 1965 survey of some 20,000 words gathered from a variety of printed sources — books, magazines, newspapers — conducted by Mark Mayzner, a former Bell Labs researcher.

[Clip]

Using the Google Books Ngram viewer (which shows word popularity over time), Norvig created a new dataset of some 97,565 unique words, collectively repeated 743.8 billion times, which he noted on his blog is roughly 37 million times as many occurrences as in the 20,000-word sample that Mayzner assembled. Norvig’s sample also included over 3 trillion individual letters.

Read the Complete Article

Highlights (via Google Research on Google+)

- R, L, and C are more common than originally thought.
- The average English word is 4.79 letters long.
- The most common 4-gram is “tion”.
- The most common 7-gram is “present”.
- The most common 9-gram is “different”.
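
Statistics like the ones above can be derived from a table that maps each distinct word to the number of times it occurs. The following Python sketch shows one way to do that; it is not Norvig's code. The function names and the toy counts are made up for illustration, and the sketch assumes every word is purely alphabetic and that n-grams are counted only inside words, never across word boundaries.

```python
from collections import Counter


def letter_counts(word_counts):
    """Count letters, weighting each word by how often it occurs."""
    totals = Counter()
    for word, n in word_counts.items():
        for ch in word.lower():
            if ch.isalpha():
                totals[ch] += n
    return totals


def average_word_length(word_counts):
    """Mean word length in letters, weighted by occurrence count."""
    total_letters = sum(len(word) * n for word, n in word_counts.items())
    total_words = sum(word_counts.values())
    return total_letters / total_words


def ngram_counts(word_counts, k):
    """Count letter k-grams found inside words, weighted by occurrence count."""
    totals = Counter()
    for word, n in word_counts.items():
        w = word.lower()
        for i in range(len(w) - k + 1):
            totals[w[i:i + k]] += n
    return totals


# Toy, made-up counts for illustration only; these are not Norvig's data.
toy_counts = {"the": 10, "of": 5, "nation": 2, "motion": 1}

if __name__ == "__main__":
    # Letters ordered from most to least frequent in the toy table.
    print("".join(ch for ch, _ in letter_counts(toy_counts).most_common()))
    print(round(average_word_length(toy_counts), 2))
    print(ngram_counts(toy_counts, 4).most_common(1))
```

Weighting by occurrence count matters: a word such as "the" contributes its letters once per appearance in the corpus, not once per dictionary entry, which is why short, common words pull the average word length down.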

Charts

Direct to Chart: Most Used Letters in the English Language

Direct to Chart: Most Frequently Appearing Words in the English Language

Top 10 Words

1. The
2. Of
3. And
4. To
5. In
6. A
7. Is
8. That
9. For
10. It

See Also: Here’s the Full Text of Peter Norvig’s New Research Paper:
“English Letter Frequency Counts: Mayzner Revisited or ETAOIN SRHLDCU”

More background, findings, and charts.

About Gary Price

Gary Price (gprice@mediasourceinc.com) is a librarian, writer, consultant, and frequent conference speaker based in the Washington, D.C. metro area. Before launching INFOdocket, Price and Shirl Kennedy were the founders and senior editors at ResourceShelf and DocuTicker for 10 years. From 2006 to 2009 he was Director of Online Information Services at Ask.com, and he is currently a contributing editor at Search Engine Land.