Gender Shades

The Gender Shades project pilots an intersectional approach to inclusive product testing for AI.

Algorithmic Bias Persists

Gender Shades is a preliminary excavation of the inadvertent negligence that will cripple the age of automation and further exacerbate inequality if left to fester. The deeper we dig, the more remnants of bias we will find in our technology. We cannot afford to look away this time, because the stakes are simply too high. We risk losing the gains made by the civil rights and women's movements under the false assumption of machine neutrality. Automated systems are not inherently neutral. They reflect the priorities, preferences, and prejudices (the coded gaze) of those who have the power to mold artificial intelligence.

Gender Shades: leading tech companies' commercial AI systems significantly misgender women and darker-skinned individuals. Researcher Joy Buolamwini initiated a systematic investigation after testing her TED speaker photo on facial analysis technology from leading companies. Some companies' systems did not detect her face. Others labeled her face as male. After analyzing results on 1,270 unique faces, the Gender Shades authors uncovered severe gender and skin-type bias in gender classification.

In the worst case, the failure rate on darker female faces is over one in three, on a binary classification task where random guessing would be correct 50 percent of the time. In the best case, one classifier achieves flawless performance on lighter males: a 0 percent error rate.

Pale Male Data: existing measures of success in AI don't reflect the global majority; we are fooling ourselves. Existing benchmark datasets overrepresent lighter men in particular and lighter individuals in general. Gender and skin-type skews in existing datasets led to the creation of a new Pilot Parliaments Benchmark, composed of parliamentarians from the top three African and top three European countries, as ranked by gender parity in their parliaments as of May 2017.

Deploying AI in Ignorance: There is a need for inclusive AI testing and subgroup (demographic, appearance, etc.) accuracy reports. Companies do not disclose how well AI systems perform on different subgroups. Some admit to not checking. Evaluation needs to be intersectional: in addition to comparing male vs. female and lighter vs. darker, we need to look at the intersections, namely darker females, darker males, lighter females, and lighter males. Phenotypic accuracy (performance on different skin types) should also be reported where appropriate (for example, in computer vision applications).
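
To make the idea of an intersectional accuracy report concrete, here is a minimal sketch in Python. It is not the Gender Shades authors' evaluation code. The function name `subgroup_error_rates`, the record fields (`skin`, `actual`, `predicted`), and the eight toy records are all assumptions made up for illustration; they are not results from the study.

```python
from collections import defaultdict


def subgroup_error_rates(records, keys):
    """Return {subgroup tuple: error rate}, grouping records by the given keys.

    Hypothetical helper: each record is a dict holding the attribute values
    named in `keys` plus "actual" and "predicted" labels.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for record in records:
        group = tuple(record[k] for k in keys)
        totals[group] += 1
        if record["predicted"] != record["actual"]:
            errors[group] += 1
    return {group: errors[group] / totals[group] for group in totals}


# Toy, invented records (NOT data from the study): one misclassification,
# and it falls on a darker female face.
records = [
    {"skin": "darker", "actual": "female", "predicted": "male"},
    {"skin": "darker", "actual": "female", "predicted": "female"},
    {"skin": "darker", "actual": "male", "predicted": "male"},
    {"skin": "darker", "actual": "male", "predicted": "male"},
    {"skin": "lighter", "actual": "female", "predicted": "female"},
    {"skin": "lighter", "actual": "female", "predicted": "female"},
    {"skin": "lighter", "actual": "male", "predicted": "male"},
    {"skin": "lighter", "actual": "male", "predicted": "male"},
]

# Single-axis reports: gender alone, then skin type alone.
by_gender = subgroup_error_rates(records, ["actual"])
by_skin = subgroup_error_rates(records, ["skin"])

# Intersectional report: skin type and gender together.
by_intersection = subgroup_error_rates(records, ["skin", "actual"])

for group, rate in sorted(by_intersection.items()):
    print(group, f"{rate:.0%}")
```

In this toy data, each single-axis report shows at most a 25 percent error rate (for females overall and for darker faces overall), while the intersectional report shows a 50 percent error rate for darker females. Averaging over a single attribute can hide the subgroup on which a system performs worst.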