New Research on Privacy: “Online Tracking: A 1-Million-Site Measurement and Analysis”
From researchers at Princeton University comes a new research paper (draft) titled “Online tracking: A 1-million-site measurement and analysis,” by Steven Englehardt and Arvind Narayanan.
From the Abstract
We present the largest and most detailed measurement of online tracking conducted to date, based on a crawl of the top 1 million websites. We make 15 types of measurements on each site, including stateful (cookie-based) and stateless (fingerprinting-based) tracking, the effect of browser privacy tools, and the exchange of tracking data between different sites (“cookie syncing”). Our findings include multiple sophisticated fingerprinting techniques never before measured in the wild.
This measurement is made possible by our web privacy measurement tool, OpenWPM, which uses an automated version of a full-fledged consumer browser. It supports parallelism for speed and scale, automatic recovery from failures of the underlying browser, and comprehensive browser instrumentation.
OpenWPM is open-source and has already been used as the basis of seven published studies on web privacy and security.
The total number of third parties present on at least two first parties is over 81,000, but the prevalence quickly drops off. Only 123 of these 81,000 are present on more than 1% of sites. This suggests that the number of third parties that a regular user will encounter on a daily basis is relatively small. The effect is accentuated when we consider that different third parties may be owned by the same entity. All of the top 5 third parties, as well as 12 of the top 20, are Google-owned domains. In fact, Google, Facebook, and Twitter are the only third-party entities present on more than 10% of sites.
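As a rough illustration of the prevalence figures quoted above, here is a minimal sketch of how the share of sites embedding each third party might be computed from crawl output. The function names, the input shape (a mapping from each first-party site to the set of third-party domains seen on it), and the toy data are all assumptions made for this example; this is not the authors' code or the OpenWPM API.

```python
from collections import Counter


def third_party_prevalence(site_to_third_parties):
    """Return {third_party: fraction of crawled sites that include it}.

    `site_to_third_parties` is a hypothetical mapping from a first-party
    site to the third-party domains observed on it.
    """
    total_sites = len(site_to_third_parties)
    counts = Counter()
    for third_parties in site_to_third_parties.values():
        # Count each third party at most once per first-party site.
        counts.update(set(third_parties))
    return {tp: n / total_sites for tp, n in counts.items()}


def present_on_more_than(prevalence, threshold):
    """List third parties whose share of sites exceeds `threshold`."""
    return sorted(tp for tp, share in prevalence.items() if share > threshold)


# Toy data, invented for illustration only.
crawl = {
    "news.example": {"tracker-a.example", "cdn-b.example"},
    "shop.example": {"tracker-a.example"},
    "blog.example": {"tracker-a.example", "widget-c.example"},
    "wiki.example": set(),
}

prevalence = third_party_prevalence(crawl)
common = present_on_more_than(prevalence, 0.5)
print(common)
```

With a threshold of 0.01, the same filter would correspond to the "more than 1% of sites" cutoff mentioned in the excerpt.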
Third parties are a major roadblock to HTTPS adoption; insecure third-party resources loaded on secure sites (i.e. mixed content on HTTPS sites) will either be blocked or cause the browser to display security warnings. We find that a large number of third parties (54%) are only ever loaded over HTTP. A significant fraction of HTTP-default sites (26%) embed resources from at least one of the HTTP-only third parties on their homepage. These sites would be unable to upgrade to HTTPS without browsers displaying mixed content errors to their users, the majority of which (92%) would contain active content which would be blocked.
Around 78,000 first-party sites currently support HTTPS by default on their home pages. Nearly 8% of these load with mixed content warnings, of which 12% are caused by third-party trackers.
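To make the mixed content idea concrete, here is a small sketch that flags resources loaded over plain HTTP on an HTTPS page and sorts them into "active" and "passive" groups. The function name, the resource-type labels, and the set of types treated as active are simplifying assumptions for illustration; real browsers apply more detailed rules, and this is not how the paper's measurement was implemented.

```python
from urllib.parse import urlparse

# Assumption: a simplified set of resource types treated as "active" content.
ACTIVE_TYPES = {"script", "stylesheet", "iframe", "xhr"}


def classify_mixed_content(page_url, resources):
    """Group insecure (http://) resources on an HTTPS page.

    `resources` is a list of (url, resource_type) pairs. Pages that are not
    served over HTTPS cannot have mixed content, so they yield empty lists.
    """
    result = {"active": [], "passive": []}
    if urlparse(page_url).scheme != "https":
        return result
    for url, resource_type in resources:
        if urlparse(url).scheme == "http":
            kind = "active" if resource_type in ACTIVE_TYPES else "passive"
            result[kind].append(url)
    return result


# Hypothetical page and resources, invented for illustration only.
report = classify_mixed_content(
    "https://library.example/",
    [
        ("https://cdn.example/app.js", "script"),
        ("http://tracker.example/t.js", "script"),
        ("http://ads.example/banner.png", "image"),
    ],
)
print(report)
```

In this sketch the insecure script lands in the "active" group (the kind browsers block), while the insecure image lands in the "passive" group (the kind that typically triggers a warning).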
Top Category of Sites For Tracking
News sites have the most trackers.
Firefox’s third-party cookie blocking is very effective, only 237 sites (0.4%) have any third-party cookies set from a domain other than the landing page of the site. Most of these are for benign reasons, such as redirecting to the U.S. version of a non-U.S. site. We did find a handful of exceptions, including 32 that contained ID cookies. These sites appeared to be deliberately redirecting the landing page to a separate domain before redirecting back to the initial domain. Ghostery was effective at reducing both the number of third parties and ID cookies. The average number of third-party includes went down from 17.7 to 3.3, of which just 0.3 had third-party cookies (0.1 with IDs).
Direct to complete summary, including info and findings re: device fingerprinting (which is of growing concern). It also includes links to previously published papers from the Princeton Web Census as well as data sets and software.
Direct to Full Text Article (Draft): “Online tracking: A 1-million-site measurement and analysis.”
UPDATE: Eric Hellman has just posted findings from tracking research he recently completed on ARL member library web sites. Must read!
Hellman’s post is titled, “97% of Research Library Searches Leak Privacy… and Other Disappointing Statistics”
Note From Gary Price, infoDOCKET Founder and Editor:
See Also: I discussed some of these issues in a video interview from the 2015 Charleston Conference.
See Also: I was part of a panel that included Peter Brantley, Marshall Breeding, and Eric Hellman at the Fall 2014 CNI Meeting where these issues were discussed. Video and slides here.
One final thought: if these issues of digital privacy and library privacy concern you (personally, professionally, or both), take some time to learn what you can do to reduce the amount of data that you and your online work leak. You can do this while also supporting efforts to make changes for all.
Here’s one example: make sure you and your users understand that when they borrow an ebook from OverDrive and place it on their Kindle, Amazon retains the borrow record and any notes made in the book UNLESS the user erases them.
This is hardly a new issue, and it could be addressed quickly with a disclaimer and a link explaining how to remove the data manually. Doing this is easy and fast.
Here’s something I wrote on this topic about three years ago.
The library community asks for transparency from others but we could do a better job of this ourselves.
Example 2: Instead of using Google Analytics, consider using Piwik, a free, open source analytics package.
About Gary Price
Gary Price is a librarian, writer, consultant, and frequent conference speaker based in the Washington D.C. metro area. Before launching INFOdocket, Price and Shirl Kennedy were the founders and senior editors at ResourceShelf and DocuTicker for 10 years. From 2006-2009 he was Director of Online Information Services at Ask.com, and is currently a contributing editor at Search Engine Land.