From the Wikimedia Blog:
The Wikimedia Foundation’s Analytics team is releasing a monthly clickstream dataset.
Aggregate data on how readers browse Wikipedia contents can provide priceless insights into the structure of free knowledge and how different topics relate to each other. It can help identify gaps in content coverage (do readers stop browsing when they can’t find what they are looking for?) and help determine if the link structure of the largest online encyclopedia is optimally designed to support a learner’s needs.
Perhaps the most obvious usage of this data is to find where Wikipedia gets its traffic from. Not only clickstream data can be used to confirm that most traffic to Wikipedia comes via search engines, it can also be analyzed to find out—at any given time—which topics were popular on social media that resulted in a large number of clicks to Wikipedia articles.
A quick look into the November 2017 data for English Wikipedia tells us it contains nearly 26 million distinct links, between over 4.4 million nodes (articles), for a total of more than 6.7 billion clicks. The distribution of distinct links by type (see Ellery’s blog post for more details) is as follow:
- 60% of links (15.6M) are internal and account for 1.2 billion clicks (18%).
- 37% of links (9.6M) are from external entry-points (like a Google search results page) to an article and count for 5.5 billion clicks.
- 3% of links (773k) have type “other”, meaning they reference internal articles but the link to the destination page was not present in the source article at the time of computation. They account for 46 million clicks.