Google reclaiming identity labels to improve machine learning abuse filters

Heads to Mardi Gras to teach its neural networks that 'gay' is not a toxic term

Train a machine learning model to detect ‘toxic’ words in online comments and it comes to some depressing conclusions.

During Google’s ongoing work on Perspective, an API that uses machine learning to detect abuse and harassment online, engineers found the models identified sentences that use words such as ‘gay’, ‘lesbian’ or ‘transgender’ as abusive.

“Unfortunately what happens when we give it this input – I’m a proud gay person – is the model predicts this is toxic,” said Google AI senior software engineer Ben Hutchinson at an ethics of data science conference at the University of Sydney last week.

“And the reason this happens, it seems, is because the majority of language on the internet on which this model was trained, which uses the word ‘gay’, is in language that is used to abuse and harass people. So the model has learnt the pattern that the word ‘gay’ is a toxic word,” he explained.

If that conclusion finds its way back into online moderation tools, the consequences are serious.

“Models like this are being deployed on websites to automatically moderate comments and if we started blocking online comments like this one [I am a young proud gay male living with HIV] we take the voice away from marginalised communities,” Hutchinson said.

The issue is that identity labels like ‘gay’, ‘lesbian’ or ‘transgender’ are over-represented in abusive and toxic online comments. It follows that based on this data, the machine learning models attach negative connotations to the labels.
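The effect can be reproduced in miniature. The sketch below is not Perspective's model: it is a deliberately naive word-rate scorer, and the eight-comment corpus is invented purely for illustration. It shows how a label that appears mostly in abusive comments picks up a high toxicity rate, which then leaks into a harmless sentence.

```python
from collections import Counter

# Hypothetical toy corpus of (comment, is_toxic) pairs, invented for illustration.
TRAIN = [
    ("you are gay and stupid", 1),
    ("gay people are awful", 1),
    ("shut up you gay idiot", 1),
    ("i am a proud gay person", 0),
    ("what a lovely day", 0),
    ("thanks for the helpful answer", 0),
    ("you are an idiot", 1),
    ("have a great weekend", 0),
]


def word_toxicity(corpus):
    """Fraction of comments containing each word that are labelled toxic."""
    seen, toxic = Counter(), Counter()
    for text, label in corpus:
        for word in set(text.split()):
            seen[word] += 1
            toxic[word] += label
    return {word: toxic[word] / seen[word] for word in seen}


def score(text, rates, default=0.5):
    """Average per-word toxicity rate; unseen words get a neutral default."""
    words = text.split()
    return sum(rates.get(word, default) for word in words) / len(words)


rates = word_toxicity(TRAIN)
with_label = score("i am a proud gay person", rates)
without_label = score("i am a proud person", rates)
print(rates["gay"], with_label, without_label)
```

In this toy corpus three of the four comments containing the label are abusive, so the label alone raises the score of the self-description above that of the same sentence without it.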

It’s not the models’ fault: Hutchinson likens them to the jinns, genies and golems found in ancient folklore. They are non-human intelligences, he says, that “are neither inherently good nor inherently evil, but they are prone to misinterpreting things”.

“The important question is not ‘is our model learning patterns from the data correctly?’ but rather ‘how do we want our systems to impact people?’” Hutchinson added.

To overcome what the machine learning community refers to as insufficient diversity in the training data, Google in March last year set about collecting statements about how marginalised groups describe themselves and their loved ones.

Throughout 2018, stalls were set up at Sydney Gay and Lesbian Mardi Gras, Auckland Pride and San Francisco Pride events. Attendees were invited to anonymously write down the identity labels they might give themselves and make a statement describing themselves and the ones they love. The labels and statements are still being collected online.

“It’s really important to go out there and get the data you need,” Hutchinson said. “The idea is to create a targeted test data set specifically aimed at testing whether models have harmful biases for a particular community using the language which they use.”
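A targeted test set of this kind can be used roughly as follows. This is a minimal sketch under stated assumptions, not Google's tooling: `biased_model` is a hypothetical stand-in for a real classifier, the 0.5 threshold is arbitrary, and the last two statements are invented examples (the first two are quoted earlier in this article). Because every statement is a positive self-description, anything the model flags counts as a false positive.

```python
def false_positives(score_fn, statements, threshold=0.5):
    """Return the benign statements that the model would flag as toxic."""
    return [s for s in statements if score_fn(s) >= threshold]


def biased_model(text):
    """Hypothetical stand-in model that has learnt 'gay' is a toxic word."""
    return 0.9 if "gay" in text.lower().split() else 0.1


# Self-descriptions that should never be flagged.
STATEMENTS = [
    "I'm a proud gay person",
    "I'm an out gay person",
    "I love my two mums",        # invented example
    "I am a proud trans woman",  # invented example
]

flagged = false_positives(biased_model, STATEMENTS)
rate = len(flagged) / len(STATEMENTS)
print(f"{len(flagged)} of {len(STATEMENTS)} benign statements flagged ({rate:.0%})")
```

Swapping `biased_model` for a call to a real scoring function would turn the same loop into a regression check that can be re-run whenever the model is retrained.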

Armed with the new datasets, Hutchinson and his team can now better understand where in a neural network a model decides that an identity label is toxic, by “peering inside the network”. They do this using a method called Concept Activation Vectors.

“The idea here is that we can take an input – such as ‘I’m an out gay person’ – we can pass it through a series of layers of our network, and then stop at one of the internal layers and look at the strengths of the activations of the nodes in that layer,” Hutchinson explained.
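The step Hutchinson describes, stopping at an internal layer and reading off its activations, can be sketched with a toy network. Everything here is an assumption made for illustration: the three-word vocabulary, the bag-of-words encoding and the hand-picked weights bear no relation to the real model, and the sketch covers only the activation read-out, not the full Concept Activation Vectors method.

```python
# Hypothetical three-word vocabulary for a bag-of-words input encoding.
VOCAB = ["proud", "gay", "person"]

# Hypothetical hand-picked weights: 3 inputs -> 2 hidden units -> 1 output.
LAYERS = [
    ([[0.0, 1.0, 0.0], [1.0, -1.0, 1.0]], [0.0, 0.0]),
    ([[2.0, -0.5]], [0.0]),
]


def encode(text):
    """Mark which vocabulary words occur in the text."""
    words = text.lower().split()
    return [1.0 if word in words else 0.0 for word in VOCAB]


def dense(weights, bias, x):
    """One fully connected layer: weights times x, plus bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(values):
    """Clamp negative pre-activations to zero."""
    return [max(0.0, v) for v in values]


def activations_up_to(x, layers, stop_layer):
    """Run x through layers[0..stop_layer]; return that layer's activations."""
    hidden = list(x)
    for weights, bias in layers[: stop_layer + 1]:
        hidden = relu(dense(weights, bias, hidden))
    return hidden


hidden = activations_up_to(encode("I'm an out gay person"), LAYERS, stop_layer=0)
print(hidden)  # strength of each node in the first hidden layer
```

Comparing these internal activations across many inputs, with and without a given identity label, is what lets researchers see which nodes respond to the label itself.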

Essentially the techniques allow for two important things, Hutchinson said.

“Normative testing – how do we know that the model we’ve trained is actually working in accordance with our human social norms and values? And the second goal is interpretability – we’d like to know if the model is misinterpreting what it is that we’re trying to do,” he explained.

The labels and statements – which continue to be collected online and in person as part of Project Respect – will, later this year, go into an open source dataset so developers can use it to teach their own machine learning models which words people use to positively identify themselves.

“The hope is that by expanding the diversity of training data, these models will be able to better parse what’s actually toxic and what’s not,” Hutchinson said.