January 22, 2020

New NIST Study Evaluates Effects of Race, Age, Sex on Face Recognition Software

From NIST:

How accurately do face recognition software tools identify people of varied sex, age and racial background? According to a new study by the National Institute of Standards and Technology (NIST), the answer depends on the algorithm at the heart of the system, the application that uses it and the data it’s fed — but the majority of face recognition algorithms exhibit demographic differentials. A differential means that an algorithm’s ability to match two images of the same person varies from one demographic group to another.

Results captured in the report, NISTIR 8280 Face Recognition Vendor Test (FRVT) Part 3: Demographic Effects, are intended to inform policymakers and to help software developers better understand the performance of their algorithms. Face recognition technology has inspired public debate in part because of the need to understand the effect of demographics on face recognition algorithms.

“While it is usually incorrect to make statements across algorithms, we found empirical evidence for the existence of demographic differentials in the majority of the face recognition algorithms we studied,” said Patrick Grother, a NIST computer scientist and the report’s primary author. “While we do not explore what might cause these differentials, this data will be valuable to policymakers, developers and end users in thinking about the limitations and appropriate use of these algorithms.”


What sets the publication apart from most other face recognition research is its concern with each algorithm’s performance when considering demographic factors. For one-to-one matching, only a few previous studies explore demographic effects; for one-to-many matching, none have.

To evaluate the algorithms, the NIST team used four collections of photographs containing 18.27 million images of 8.49 million people. All came from operational databases provided by the State Department, the Department of Homeland Security and the FBI. The team did not use any images “scraped” directly from internet sources such as social media or from video surveillance.

The photos in the databases included metadata information indicating the subject’s age, sex, and either race or country of birth. Not only did the team measure each algorithm’s false positives and false negatives for both search types, but it also determined how much these error rates varied among the tags. In other words, how comparatively well did the algorithm perform on images of people from different groups?
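As a rough illustration of the kind of measurement described above (this is not NIST's code or data), the Python sketch below tallies false positives and false negatives per demographic tag at a fixed match threshold, then expresses one group's error rate as a multiple of another's. The function names, the record layout, the group labels "A" and "B", the threshold and every score are hypothetical and chosen only for the example.

```python
from collections import defaultdict

def error_rates_by_group(comparisons, threshold):
    """Tally per-group error rates.

    comparisons: iterable of (group, same_person, score) tuples, where
    same_person is True for two images of one person and False for
    images of two different people (hypothetical record layout).
    """
    counts = defaultdict(lambda: {"fp": 0, "impostor": 0, "fn": 0, "genuine": 0})
    for group, same_person, score in comparisons:
        c = counts[group]
        matched = score >= threshold
        if same_person:
            c["genuine"] += 1
            if not matched:      # same person rejected: false negative
                c["fn"] += 1
        else:
            c["impostor"] += 1
            if matched:          # different people accepted: false positive
                c["fp"] += 1
    return {
        group: {
            "fpr": c["fp"] / c["impostor"] if c["impostor"] else None,
            "fnr": c["fn"] / c["genuine"] if c["genuine"] else None,
        }
        for group, c in counts.items()
    }

def differential(rates, group, reference, key="fpr"):
    """Ratio of one group's error rate to a reference group's rate.

    Returns None when the ratio is undefined (missing or zero reference).
    """
    ref = rates[reference][key]
    val = rates[group][key]
    if not ref or val is None:
        return None
    return val / ref

# Made-up comparison records: (group, same_person, similarity score).
TOY = [
    ("A", False, 0.9), ("A", False, 0.1), ("A", False, 0.2), ("A", False, 0.3),
    ("A", True, 0.8), ("A", True, 0.95),
    ("B", False, 0.9), ("B", False, 0.7), ("B", False, 0.2), ("B", False, 0.1),
    ("B", True, 0.8), ("B", True, 0.4),
]

rates = error_rates_by_group(TOY, threshold=0.6)
fpr_ratio = differential(rates, "B", "A", key="fpr")
print(rates)
print("False positive differential, B relative to A:", fpr_ratio)
```

In this toy data, group B's false positive rate is twice group A's. The report describes differentials of a similar kind, often far larger, between real demographic groups.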

Tests showed a wide range in accuracy across developers, with the most accurate algorithms producing many fewer errors. While the study’s focus was on individual algorithms, Grother pointed out five broader findings:

  1. For one-to-one matching, the team saw higher rates of false positives for Asian and African American faces relative to images of Caucasians. The differentials often ranged from a factor of 10 to a factor of 100, depending on the individual algorithm. False positives might present a security concern to the system owner, as they may allow access to impostors.
  2. Among U.S.-developed algorithms, there were similarly high rates of false positives in one-to-one matching for Asians, African Americans and native groups (which include Native American, American Indian, Alaskan Indian and Pacific Islanders). The American Indian demographic had the highest rates of false positives.
  3. However, a notable exception was for some algorithms developed in Asian countries. There was no such dramatic difference in false positives in one-to-one matching between Asian and Caucasian faces for algorithms developed in Asia. While Grother reiterated that the NIST study does not explore the relationship between cause and effect, one possible connection, and area for research, is the relationship between an algorithm’s performance and the data used to train it. “These results are an encouraging sign that more diverse training data may produce more equitable outcomes, should it be possible for developers to use such data,” he said.
  4. For one-to-many matching, the team saw higher rates of false positives for African American females. Differentials in false positives in one-to-many matching are particularly important because the consequences could include false accusations. (In this case, the test did not use the entire set of photos, but only one FBI database containing 1.6 million domestic mugshots.)
  5. However, not all algorithms give this high rate of false positives across demographics in one-to-many matching, and those that are the most equitable also rank among the most accurate. This last point underscores one overall message of the report: Different algorithms perform differently.

Direct to Full Text Report: NISTIR 8280 Face Recognition Vendor Test (FRVT) Part 3: Demographic Effects
82 pages; PDF. 

About Gary Price

Gary Price (gprice@mediasourceinc.com) is a librarian, writer, consultant, and frequent conference speaker based in the Washington, D.C. metro area. Before launching INFOdocket, Price and Shirl Kennedy were the founders and senior editors at ResourceShelf and DocuTicker for 10 years. From 2006 to 2009 he was Director of Online Information Services at Ask.com, and he is currently a contributing editor at Search Engine Land.