In a TED Featured Talk, I spoke about my personal experience with the coded gaze, my term for algorithmic bias. While my lighter-skinned friend had her face detected, my darker face was not detected until I put on a white mask. After the talk was posted, I was curious about what results I’d get if I ran my profile image through different facial analysis demos. Two demos did not detect my face, and the other two misgendered me. The demos did not distinguish between gender identity and biological sex. They simply provided two labels, male and female, if they even got to that point. I wanted to see whether the results were due to my unique facial features, or whether I would find similar results by testing a larger group of faces.
This work continues the theme of exploring automated facial analysis technology. It’s important to understand that there is a clear distinction between face detection (“Is there a face?”) and face classification (“What kind of face?”). If no face is detected (as happened to me unless I put on a white mask), no further work can be done using that face, so no classification can happen. My prior work on the coded gaze explores face detection from the standpoint of a personal experience as a way of introducing the subject of algorithmic bias more broadly. The Gender Shades project dives deeper into gender classification, using 1,270 faces to demonstrate the need for more inclusive benchmark datasets, both to evaluate the performance of facial analysis technology and to disaggregate performance metrics to get a better sense of how automated systems perform across different subgroups of faces.
Once a system has detected a face, there are two types of recognition tasks that can be done: verification and identification. Verification is a one-to-one task that asks whether a face matches a single enrolled identity; it is used, for example, by Face ID on the iPhone X. Identification is a one-to-many task that compares a probe image against a gallery of other images and asks whether the probe matches any item in the gallery. Identification would be done, for example, by law enforcement looking for criminal matches or a missing person.
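To make the one-to-one versus one-to-many distinction concrete, here is a minimal sketch. It assumes each face has already been reduced to a numeric feature vector and that two faces "match" when the cosine similarity of their vectors reaches a threshold. The vectors, the 0.8 threshold, and the function names are all hypothetical choices made for illustration; they do not describe how any particular product works.

```python
import math

def cosine_similarity(a, b):
    """Similarity of two feature vectors, in the range [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def verify(probe, enrolled, threshold=0.8):
    """Verification (one to one): does the probe match one enrolled face?"""
    return cosine_similarity(probe, enrolled) >= threshold

def identify(probe, gallery, threshold=0.8):
    """Identification (one to many): best gallery match, or None."""
    best_id, best_score = None, threshold
    for person_id, template in gallery.items():
        score = cosine_similarity(probe, template)
        if score >= best_score:
            best_id, best_score = person_id, score
    return best_id

# Hypothetical two-dimensional feature vectors, invented for this sketch.
gallery = {"person_a": [0.9, 0.1], "person_b": [0.0, 1.0]}
probe = [1.0, 0.0]

same_person = verify(probe, gallery["person_a"])   # one comparison
best_match = identify(probe, gallery)              # one comparison per gallery entry
no_match = identify([0.7, 0.7], gallery)           # nothing clears the threshold
```

Note that identification performs one comparison per gallery entry, so a larger gallery gives more opportunities for a false match.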
Gender Shades explores face classification (“What type of face?”), which infers soft biometrics such as gender, age, ethnicity, or emotion from a face. The novel work in this project focuses on gender classification.
First, this work advances gender classification benchmarking by introducing a new face dataset composed of 1,270 unique individuals that is more phenotypically balanced on the basis of skin type than existing benchmarks. To our knowledge this is the first gender classification benchmark labeled by the Fitzpatrick six-point skin type scale, allowing us to benchmark the performance of gender classification algorithms by skin type. Second, this work introduces the first intersectional demographic and phenotypic evaluation of face-based gender classification accuracy. Instead of evaluating accuracy by gender or skin type alone, accuracy is also examined on four intersectional subgroups: darker females, darker males, lighter females, and lighter males. The three commercial gender classifiers we evaluated have the lowest accuracy on darker females. Since computer vision technology is being used in high-stakes sectors such as healthcare and law enforcement, more work needs to be done in benchmarking vision algorithms for various demographic and phenotypic groups.
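The idea of disaggregating accuracy can be sketched in a few lines. The example below computes accuracy overall, by gender alone, by skin type alone, and for the four intersectional subgroups. The records are toy data invented purely to show the mechanics; they are not results from the study, and the field names and the `accuracy_by_group` helper are hypothetical.

```python
from collections import defaultdict

def accuracy_by_group(records, key):
    """Fraction of correct predictions within each group chosen by `key`."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        group = key(record)
        total[group] += 1
        correct[group] += int(record["predicted"] == record["gender"])
    return {group: correct[group] / total[group] for group in total}

# Toy records (invented, NOT study results): two faces per subgroup.
records = [
    {"skin": "darker", "gender": "female", "predicted": "male"},
    {"skin": "darker", "gender": "female", "predicted": "female"},
    {"skin": "darker", "gender": "male", "predicted": "male"},
    {"skin": "darker", "gender": "male", "predicted": "male"},
    {"skin": "lighter", "gender": "female", "predicted": "female"},
    {"skin": "lighter", "gender": "female", "predicted": "female"},
    {"skin": "lighter", "gender": "male", "predicted": "male"},
    {"skin": "lighter", "gender": "male", "predicted": "male"},
]

overall = accuracy_by_group(records, lambda r: "all")
by_gender = accuracy_by_group(records, lambda r: r["gender"])
by_skin = accuracy_by_group(records, lambda r: r["skin"])
intersectional = accuracy_by_group(records, lambda r: (r["skin"], r["gender"]))

for group, accuracy in sorted(intersectional.items()):
    print(group, accuracy)
```

In this toy data the overall accuracy is 0.875, the single-axis breakdowns show 0.75 for females and 0.75 for darker faces, and only the intersectional view shows that the darker female subgroup sits at 0.5. Each coarser metric averages away part of the gap.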
Using the dermatologist-approved Fitzpatrick Skin Type classification system, we characterize the gender and skin type distribution of two facial analysis benchmarks, IJB-A and Adience. We find that these datasets are overwhelmingly composed of lighter-skinned subjects (79.6% for IJB-A and 86.2% for Adience).
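A composition audit of this kind amounts to counting labeled subjects per subgroup. The sketch below assumes each subject carries a gender label and a Fitzpatrick type from 1 to 6, and that types I to III are grouped as lighter and types IV to VI as darker; that grouping, the helper names, and the sample subjects are assumptions made for illustration, not data from IJB-A or Adience.

```python
from collections import Counter

# Assumption for this sketch: Fitzpatrick types I-III count as "lighter",
# types IV-VI as "darker".
LIGHTER_TYPES = {1, 2, 3}

def skin_group(fitzpatrick_type):
    """Map a Fitzpatrick type (1-6) to a binary lighter/darker group."""
    if fitzpatrick_type not in range(1, 7):
        raise ValueError(f"Fitzpatrick type must be 1-6, got {fitzpatrick_type}")
    return "lighter" if fitzpatrick_type in LIGHTER_TYPES else "darker"

def composition(subjects):
    """Share of the dataset in each (skin group, gender) subgroup."""
    counts = Counter((skin_group(ftype), gender) for gender, ftype in subjects)
    total = sum(counts.values())
    return {group: count / total for group, count in counts.items()}

# Hypothetical (gender, Fitzpatrick type) labels, invented for this sketch.
subjects = [("female", 1), ("male", 2), ("male", 3), ("female", 5)]
shares = composition(subjects)
lighter_share = sum(v for (skin, _), v in shares.items() if skin == "lighter")
```

Here `lighter_share` is 0.75, and the darker male subgroup is absent from `shares` altogether, which is exactly the kind of gap a composition audit is meant to surface.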
Preliminary analysis of the IJB-A and Adience benchmarks revealed overrepresentation of lighter males, underrepresentation of darker females, and underrepresentation of darker individuals in general. We developed the Pilot Parliaments Benchmark (PPB) to achieve better intersectional representation on the basis of gender and skin type. PPB consists of 1,270 individuals from three African countries (Rwanda, Senegal, and South Africa) and three European countries (Iceland, Finland, and Sweden), selected for gender parity in the national parliaments.
All evaluated companies provided a “gender classification” feature that uses the binary sex labels of female and male. This reductionist view of gender does not adequately capture the complexities of gender or address transgender identities. The companies provide no documentation to clarify whether their gender classification systems that provide sex labels are classifying gender identity or biological sex. To label the PPB data, we use female and male labels to indicate subjects perceived as women or men respectively.
Skin type is a limited proxy for ethnicity, and ethnicity is an unstable predictor of skin type. This is not to say there can be no correlation between the two. Given how much the phenotypes associated with an ethnicity vary within that ethnicity, assessing phenotype directly is more useful than using demographic proxies when we want to evaluate how specific facial characteristics influence classification accuracy. Most critically, when attempting to create inclusive benchmarks, we need to account for intraclass variation within demographic groups.
Also, because we’re looking at computer vision, we wanted an objective measure. Ethnicities and races are not stable; they’re historically and socially constructed. As a result, even if I wanted, for example, a dataset of Black people in the USA, phenotypic measures would be more objective. This isn’t to say that the two don’t correlate or map back to each other, but because it’s computer vision we wanted to use as objective a measure as possible.
Artificial intelligence, which is infiltrating society and helping determine who is hired, fired, granted a loan, or even how long someone spends in prison, has a bias problem, and the scope and nature of this problem are largely hidden. Selecting training data to fine-tune artificial intelligence systems is a pivotal part of developing robust predictive models. However, bias reflecting social inequities in training data can embed unintended bias in the models that are created. Furthermore, benchmark datasets are used to assess progress on specific tasks like machine translation and pedestrian detection. Unrepresentative benchmark datasets and aggregate accuracy metrics can provide a false sense of universal progress on these tasks.
For a gender classification task, where within the inherited structure of a binary (female or male) there’s a 50 percent chance of being correct by guessing, the results failed for more than one in three women of color. It was surprising that, even within a binary where guessing gives 50/50 odds, some of the error rates were a little over 30 percent.
IBM’s self-reported accuracy/performance results for its new API are available in the full response to the Gender Shades paper. Excerpts of the full paper are below.
The paper “Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification” by Joy Buolamwini and Timnit Gebru, which will be presented at the Conference on Fairness, Accountability, and Transparency (FAT*) in February 2018, evaluates three commercial API-based classifiers of gender from facial images, including IBM Watson Visual Recognition. The study finds these services to have recognition capabilities that are not balanced over genders and skin tones. In particular, the authors show that the highest error involves images of dark-skinned women, while the most accurate result is for light-skinned men.
For the past nine months, IBM has been working towards substantially increasing the accuracy of its new Watson Visual Recognition for facial analysis, which now uses different training data and different recognition capabilities than the service evaluated in this study conducted in April 2017. IBM will be bringing this service out in production over the next several weeks. As with all of IBM's publicly available software services, we are constantly evaluating and updating them with new features and capabilities that continue to significantly enhance their overall performance.
Even before the deployment of AI, IBM believes that organizations that collect, store, manage and process data have an obligation to handle it responsibly. That belief, embodied in our century-long commitment to trust and responsibility in all relationships, is why the world’s largest enterprises trust IBM as a steward of their most valuable data. We take that trust seriously and earn it every day by following beliefs and practices outlined in our description of Data Responsibility@IBM, and by working to continually enhance and improve the technologies we bring to the world.
Data ethics and AI has to be a conversation and commitment that transcends any one company and we’re grateful for your important contribution. Thank you again.
We believe the fairness of AI technologies is a critical issue for the industry and one that Microsoft takes very seriously. We've already taken steps to improve the accuracy of our facial recognition technology and we’re continuing to invest in research to recognize, understand and remove bias. Microsoft is fully committed to continuously improving the accuracy of the outcomes from the technology and is making further investments to do so.
I have yet to hear from Face++ as of February 6, 2018 even though I shared the research results with them on December 22, 2017, the same day I released the results to IBM and Microsoft.
I am advocating for full-spectrum inclusion that goes beyond tokenism. We should think of inclusion as a continuous process. Full-spectrum inclusion is not a one-shot situation; it is ongoing work to shift the ecosystem of tech itself.
The goal of facial analysis technology is to be able to work on any human face. Thus, researchers and companies with global ambitions must address the fact that they are not reaching undersampled majorities. Needless to say, people of color make up the majority of the world’s population. Additionally, in places such as the US, where communities of color are more subject to police scrutiny, their faces are the least represented in the training and benchmark datasets used for facial analysis technology. Flawed facial analysis technology poses a threat to civil liberties and provides a guise of machine neutrality that can subject innocent people to unwarranted scrutiny.
Even flawless facial analysis technology in the hands of authoritarian governments, personal adversaries, and aggressive marketers can be abused.
My goal with establishing the IEEE International Standards for Automated Facial Analysis Technology is to provide more inclusive benchmarks and rigorous reporting protocols that increase transparency about the accuracy of these systems and accountability in their use, while also stipulating context limitations. Facial analysis systems that have not been publicly audited for subgroup accuracy should not be used by law enforcement. Citizens should be given an opportunity to decide whether this kind of technology should be used in their municipalities, and if it is adopted, ongoing reports must be provided about its use and about whether that use has in fact contributed to specific goals for community safety. If diverse voices are not part of the decision-making processes around AI-fueled innovations, the inadvertent use of bias-prone technology will continue. The tech industry and AI research community need the undersampled majority.