Gender and Racial Bias in Cloud NLP Sentiment APIs

At Automattic, I work in a group that frequently uses natural language processing (NLP), a kind of artificial intelligence (AI) that tries to understand text. We have used NLP to suggest domain names, to tag support interactions, and to understand the different verticals our customers build sites for.

In the course of building these tools, we have often encountered, and have had to work around, gender and racial bias that gets baked into the machine learning models we use for text analysis. This is an acknowledged problem confronting NLP, and the solutions are not simple. Building fair and non-toxic NLP systems requires constant vigilance, and we are continuously auditing new platforms and models to make sure that the users of our systems are not adversely impacted.

In the course of these audits, I’ve found evidence of gender and racial bias in the sentiment analysis services offered by Amazon (called Amazon Comprehend) and to a much lesser extent by Google (part of its Cloud Natural Language API). Developers and companies wishing to use these services should at the very least conduct audits and analysis on representative data to make sure their applications are not impacted adversely.

What is sentiment analysis, and why does it matter?

Sentiment analysis is a specialized kind of natural language processing. A software program scans a piece of text and then outputs an estimate of the emotion expressed in that text. The score is usually expressed along a positive-negative scale. You can think about this in terms of a product review — on Amazon, let’s say. Someone really enjoys a book: we’d expect the sentiment program to give the review a high positive score. Another person trashes the same book: we’d expect the program to produce a very negative score. Let’s say that positive scores fall between 0 and 1.0 and negative scores between -1.0 and 0.

Sentiment analysis is a blunt instrument. The theories on which it is founded (that human emotion can be quantified into a limited set of fixed categories, that Western emotional norms are universal) are open for interrogation. That said, in the world of web marketing and analytics, it can at least give people on both sides of the website a rough pulse on what’s going on. (Think of consumers trying to get a sense of the product reviews, marketers and designers who are trying to improve what they are selling, listserv moderators trying to get a quick sense of when comments are getting into the bullying realm, etc.)

From the perspective of machine learning, the process of developing a system to score sentiment is straightforward, at least conceptually: Collect documents that you want to understand the sentiment of, get people to look at those documents and rank their sentiment, then run your machine learning algorithm to learn a scoring function that approximates the behavior of the human annotators.
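To make that recipe concrete, here is a minimal sketch using scikit-learn. The reviews.csv file, its column names, and the TF-IDF plus logistic regression choices are illustrative assumptions on my part; they are not how the cloud vendors build their models.

```python
# Minimal sketch of the collect -> annotate -> learn-a-scoring-function recipe.
# Assumes a hypothetical reviews.csv with columns "text" and "label"
# (label = human-annotated sentiment, 1 positive / 0 negative).
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

df = pd.read_csv("reviews.csv")
train_texts, test_texts, train_labels, test_labels = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=0
)

# Learn a scoring function that approximates the human annotators.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
model.fit(train_texts, train_labels)

print("held-out accuracy:", model.score(test_texts, test_labels))
# model.predict_proba(...)[:, 1] can be rescaled to a -1..1 sentiment score.
```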

Bias in sentiment

In a recent paper, Examining Gender and Race Bias in Two Hundred Sentiment Analysis Systems, Svetlana Kiritchenko and Saif M. Mohammad identified racial and gender bias in research-grade sentiment analysis systems. Each of these systems was entered in a sentiment analysis “bake off” called SemEval-2018 Task 1: Affect in Tweets. These systems were developed by academic AI research labs — many of which achieve state-of-the-art results in automated sentiment labeling. That is, their machine learning systems come close to scoring the documents the way that the humans who labeled them did.

Kiritchenko and Mohammad generated more than 8,000 sentences by starting with core sentences that encompassed a range of emotions. They then varied these sentences by substituting names with high gender and racial association. (Think: common African American names vs. common European American names.) The results show a statistically significant difference in the scoring of emotion associated with African American names on the tasks of anger, fear, and sadness intensity prediction. That is, the systems were more likely to score sentences with African American names higher on these negatively associated emotions — just on the basis of name substitution. They report similar disparities in scores across gender. Over three quarters of the systems in their study mark sentences “involving one gender/race with higher intensity scores than the sentences involving the other gender/race.”
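The sketch below shows roughly how such a template corpus can be generated. The two templates and the handful of names are short illustrative samples I chose for the example, not the full template and name lists from the paper.

```python
# Rough sketch of the template-substitution idea behind the study's corpus.
# Templates and name lists here are illustrative samples only.
from itertools import product

templates = [
    "{name} feels {emotion}.",
    "The conversation with {name} was {emotion}.",
]
emotions = ["angry", "fearful", "sad", "happy"]

name_groups = {
    ("female", "African American"): ["Ebony", "Latisha"],
    ("male", "African American"): ["Jamel", "Darnell"],
    ("female", "European"): ["Amanda", "Ellen"],
    ("male", "European"): ["Adam", "Frank"],
}

corpus = []
for (gender, race), names in name_groups.items():
    for template, emotion, name in product(templates, emotions, names):
        corpus.append({
            "sentence": template.format(name=name, emotion=emotion),
            "gender": gender,
            "race": race,
            "template": template,
            "emotion": emotion,
        })

print(len(corpus), "sentences, e.g.:", corpus[0]["sentence"])
```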

My evaluation of AWS and Google

Dr. Kiritchenko mentioned that their work did not look at cloud systems. Among web companies, Amazon Web Services (AWS) and Google Cloud Platform (GCP) both offer sophisticated pay-per-use natural language processing systems — Amazon Comprehend in AWS (I’ll refer to it as AWS), and the Cloud Natural Language API in Google Cloud Platform (I’ll refer to it as GCP). The two companies employ a significant number of highly skilled NLP and machine learning researchers and engineers — many of whom regularly publish research that defines the state of the art in the field. All that said:

  1. We’d expect AWS and GCP to have sentiment analysis that is “good enough” for most applications: that is, the sentiment scores are in line with expectations, are consistent, and don’t exhibit bias against protected groups. This in itself requires that a company define its expectations before deploying the service — that is, assemble a test set for the particular sentiment analysis use case of interest, with expectations for how particular phrases should be scored (a minimal sketch of such a check appears after this list).
  2. On the other hand, the two have not been without trouble regarding bias in their computer vision and AI offerings: This study identified racial and gender bias in AWS’ Rekognition machine vision service, while this paper identified racial bias in an NLP tool Google developed to moderate hate speech.
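As a sketch of the expectation check mentioned in point 1: score_fn stands in for whichever scoring call you choose, and the paired sentences and tolerance below are assumptions for illustration, not a recommended threshold.

```python
# Minimal sketch of a pre-deployment expectation check (see point 1 above).
# score_fn is whichever scoring call you choose (AWS, GCP, or your own model);
# the paired sentences and tolerance are illustrative assumptions.
from typing import Callable, List, Tuple

PAIRED_EXPECTATIONS = [
    # (sentence_a, sentence_b, maximum acceptable score gap)
    ("Jamel made me feel relieved.", "Frank made me feel relieved.", 0.05),
    ("The conversation with Latisha was funny.",
     "The conversation with Ellen was funny.", 0.05),
]

def audit(score_fn: Callable[[str], float],
          pairs: List[Tuple[str, str, float]]) -> List[Tuple[str, str, float]]:
    """Return the pairs whose score gap exceeds the stated tolerance."""
    failures = []
    for sentence_a, sentence_b, tolerance in pairs:
        gap = abs(score_fn(sentence_a) - score_fn(sentence_b))
        if gap > tolerance:
            failures.append((sentence_a, sentence_b, gap))
    return failures

# Example: audit(gcp_sentiment, PAIRED_EXPECTATIONS), using a scoring
# function like the ones sketched later in this post.
```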

I’ve used AWS since shortly after it was announced. After reading Kiritchenko and Mohammad’s article nearly a year ago, I decided to run a similar test on the GCP and AWS sentiment analysis systems using the tweets that they used.

In the evaluation I performed using the AWS and GCP services, I found a similar pattern of gender and racial bias. I have placed the code and figures associated with the analysis in this GitHub repository. This Python notebook pushes the corpus to AWS and GCP for analysis, while this notebook does the analysis calculation, running the statistical tests and creating the notched box plots (which give us a visual of where the medians of the different score distributions lie) shown below.
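For reference, this is roughly what a single call to each service looks like with the standard client libraries (boto3 for AWS Comprehend, google-cloud-language for GCP). The region and example sentence are assumptions, and credentials are assumed to already be configured in the environment.

```python
# Sketch of scoring one sentence with each service.
import boto3
from google.cloud import language_v1

def aws_sentiment(text: str) -> dict:
    comprehend = boto3.client("comprehend", region_name="us-east-1")
    resp = comprehend.detect_sentiment(Text=text, LanguageCode="en")
    # Comprehend returns a label plus Positive/Negative/Neutral/Mixed probabilities.
    return resp["SentimentScore"]

def gcp_sentiment(text: str) -> float:
    client = language_v1.LanguageServiceClient()
    doc = language_v1.Document(
        content=text, type_=language_v1.Document.Type.PLAIN_TEXT
    )
    # GCP returns a single document-level score in [-1.0, 1.0].
    return client.analyze_sentiment(request={"document": doc}).document_sentiment.score

sentence = "Latisha feels angry."
print("AWS:", aws_sentiment(sentence))
print("GCP:", gcp_sentiment(sentence))
```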

A word on statistical testing here. To determine whether there is some difference in the way the systems score sentences with African American names vs. those with European names, or sentences that use female-gendered names vs. male-gendered ones, we need a hypothesis that can be accepted or rejected based on the evidence of the scores.

Our hypothesis in this case is that the systems will score those sets of sentences the same — that there is no inherent bias in the systems with respect to gender or race. We then ask the question: in a world where the systems are unbiased, how surprising is it that a particular difference in scores occurred? Statistical hypothesis tests give us a p-value, which I’ve included with the box plots below. P-values are always between 0 and 1. The larger the p-value, the more consistent the observed scores are with the no-bias hypothesis, and the more trusting we can be of our sentiment analysis. The smaller the p-value, the more our hypothesis is thrown into doubt. For example, a p-value of 0.05 means that, if the system really were unbiased, there would be only a 5% chance of observing a difference in scores at least this large. At or below that level we usually reject the no-bias hypothesis.

For these analyses, I’ve used a paired Wilcoxon signed-rank test, which applies when the data come in matched pairs — in this case, female-male and African American-European name substitutions on the same sentence template.
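Here is a minimal sketch of that test using SciPy; the two score lists are made-up stand-ins for the paired API outputs, not real results.

```python
# Sketch of the paired test: each African American-named sentence is paired
# with the same template filled with a European-associated name.
from scipy.stats import wilcoxon

african_american_scores = [-0.62, -0.55, -0.70, -0.48, -0.59]  # illustrative
european_scores         = [-0.51, -0.49, -0.61, -0.40, -0.52]  # illustrative

stat, p_value = wilcoxon(african_american_scores, european_scores)
print(f"Wilcoxon statistic={stat:.3f}, p-value={p_value:.4f}")
# A small p-value means the paired differences would be unlikely under the
# no-bias hypothesis.
```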

As you can see, both visually from the notches (the location of the median) and from the p-values, there is evidence of bias in the AWS sentiment analysis system. It is not scoring sentences involving African American names the same as sentences involving European names. The GCP system looks good from this analysis, but there are some quirks that I will delve into below. Overall, the AWS sentiment analysis service performs worse than the GCP service with respect to fair analysis of sentences involving African American-associated names.
AWS and GCP sentiment analysis APIs both show evidence of gender bias, AWS more so in having p-values that are all much less than 0.05 (i.e., the test is saying that there is virtually no chance of seeing these differences if the system were unbiased).
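For readers who want to reproduce the visual check, this is a minimal sketch of a notched box plot in matplotlib. The score lists are made-up stand-ins, not the actual API output; non-overlapping notches are a quick visual sign that the medians differ.

```python
# Sketch of the notched box plots used above.
import matplotlib.pyplot as plt

female_scores = [-0.52, -0.61, -0.43, -0.58, -0.47, -0.55]  # illustrative
male_scores   = [-0.41, -0.50, -0.35, -0.46, -0.39, -0.44]  # illustrative

fig, ax = plt.subplots()
ax.boxplot([female_scores, male_scores], notch=True,
           labels=["female names", "male names"])
ax.set_ylabel("sentiment score")
ax.set_title("Anger templates, by gendered name substitution")
plt.show()
```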

Looking at the differences in means, the gaps are certainly more pronounced for anger in both cases, with both systems scoring sentences involving male and European-named male subjects as negative. It is not possible to say why the systems score this way, or even to impute what kinds of bias lie behind it; we can only say that the systems score a particular group differently, when we should expect no difference at all.

When we look at individual sentences, the difference across gender and race is even more pronounced. The figure below compares the AWS sentiment scores on a set of example sentences that express fear and anger respectively. Along the top row, I compare scores on the exact same sentence template where the placeholder SOMEONE is replaced by a name or pronoun of a different (binary) gender: the distribution of scores where a female name or pronoun is substituted for SOMEONE is shown on top, and the distribution where a male name or pronoun is used is shown on the bottom. Along the second row, I compare the same sentence with SOMEONE replaced by a first name that has a racial association — the sentiment score distribution for African American-associated names on top and the distribution for European-associated names on the bottom.

For each of these specific cases, the difference in score distribution is significant — the p-value says there is a less than 0.1% chance that these differences would occur with an unbiased system.

GCP, on the other hand, for a lot of cases just seems to clamp everything to the same value.

Here, all of the sentences following the template are clamped to the same value of 0.0. That seems a bit extreme. In fact, for the words “vexing,” “outrageous,” “serious,” and “funny,” the sentiment scores for the sentence template “SOMEONE found himself/herself in a/an WORD situation” are all 0. That doesn’t inspire much confidence either. This could simply be because the GCP algorithm assigns a fixed score unless the sentence is longer, or because it restricts deeper analysis to particular sentence structures and words. For example, substituting the word “gloomy” into the same template gives non-zero scores with significant (p << 0.05) differences across race and gender. The GCP service also offers more sophisticated capabilities, including the ability to score the sentiment of individual entities within a sentence.
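For completeness, this is roughly what that entity-level call looks like with the google-cloud-language client. The example sentence is one of the templates above, and credentials are assumed to be configured in the environment.

```python
# Sketch of GCP's entity-level sentiment call, which scores each mention
# (e.g., the person named in the template) rather than only the whole sentence.
from google.cloud import language_v1

client = language_v1.LanguageServiceClient()
doc = language_v1.Document(
    content="Latisha found herself in a gloomy situation.",
    type_=language_v1.Document.Type.PLAIN_TEXT,
)
response = client.analyze_entity_sentiment(request={"document": doc})
for entity in response.entities:
    print(entity.name, entity.sentiment.score, entity.sentiment.magnitude)
```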

My recommendations: test before you leap

My takeaways are as follows:

  1. The AWS sentiment analysis service clearly displays a degree of gender bias. There are indications that it also scores sentences in which the subjects have African American names differently from sentences whose subjects have English-language names of European (Anglo-Saxon) origin. Users of the AWS sentiment analysis service would be well advised to evaluate it carefully on representative test corpora.
  2. The GCP sentiment analysis service seems to be less sensitive to the race or gender of participants. In some instances, GCP clamps the score to a fixed value depending on the emotion word and sentence structure; I haven’t done enough analysis to know why or how. Whether to use the entity-based sentiment or the standard whole-sentence analysis seems to depend on the structure of the corpus being evaluated. Users should conduct test evaluations.
  3. It is difficult to understand the impact of the gender and racial disparities on your particular service without conducting a detailed evaluation. Let’s say that you make skin care products for men and women. Understanding the fine differences of how women react to a new line is important, and you’d want to take account of the possible ways that the sentiment analysis system is clouding the responses of your women customers. On the other hand, the disparities in a particular line might be so small as not to matter — you may just want a rough estimate. Either way, you need to know.
  4. That said, for now I would go with GCP, in that it seems to perform “less bad” — producing biased scoring less often. Better yet, annotate and train a model based on your own data for critical applications. Analysis of newly developed models indicates that they may be less prone to gender, racial, and intersectional bias. It is likely that Google’s Cloud AutoML Natural Language may use these models, but again you will have to bring your own data.
  5. If you are an AWS customer, asking them for more transparency and for modification of the service would be another step forward. Despite the recent issues with the Rekognition face ID service, I’ve found AWS technical support to be responsive.

But to wrap it up, we could ask “Who cares?” Automated sentiment analysis — particularly the cloud-based services — is being folded into more and more consumer-facing pipelines. Comment moderation is one application; I mentioned earlier the May 2019 paper “Racial Bias in Hate Speech and Abusive Language Detection Datasets,” which identified racial bias in Google’s Perspective API (https://www.perspectiveapi.com) comment moderation tool. Customer interaction analysis and HR-oriented feedback tools are other critical applications. Before launching a pipeline into production, please do yourself and your customers a service by conducting an audit. Transparency will also help.

I think it would be interesting to expand on Kiritchenko and Mohammad’s methodology by looking at the impact on scoring of African American- and ethnicity-associated surnames. In my analysis, I noted that GCP sentiment appeared to differentiate sentiment better on longer passages, so it would be interesting to conduct the same analysis on passages of two to five sentences. It would also be interesting to do a cluster analysis to better understand instances where the scoring confounds contrasting words (e.g., GCP not differentiating a “serious situation” from a “hilarious situation”).

Removing evidence of gender or race from sentences might help in some cases. But what if you’re trying to score book reviews of a work with female-gendered characters, or of movies like Black Panther? What if understanding the role that gender plays in your application is important? There is recent work on de-biasing NLP algorithms [https://arxiv.org/abs/1608.07187, https://arxiv.org/abs/1607.06520], but these methods require access to the internals of the algorithm to properly account for the bias.
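To illustrate the idea (and its limits), here is a rough sketch of masking names and gendered pronouns before scoring. The name list and token handling are simplistic assumptions for the example; a real pipeline would lean on named-entity recognition rather than a hand-maintained list.

```python
# Sketch of the "remove evidence of gender or race" idea: mask names and
# gendered pronouns with neutral placeholders before sending text to an API.
import re

NAMES = {"Latisha", "Ebony", "Jamel", "Darnell", "Amanda", "Ellen", "Adam", "Frank"}
PRONOUNS = {"he": "they", "she": "they", "him": "them", "her": "them",
            "his": "their", "hers": "theirs",
            "himself": "themselves", "herself": "themselves"}

def mask(text: str) -> str:
    # Split into word and non-word chunks so the text can be rejoined exactly.
    tokens = re.findall(r"\w+|\W+", text)
    out = []
    for tok in tokens:
        if tok in NAMES:
            out.append("PERSON")
        elif tok.lower() in PRONOUNS:
            out.append(PRONOUNS[tok.lower()])
        else:
            out.append(tok)
    return "".join(out)

print(mask("Latisha found herself in a gloomy situation."))
# -> "PERSON found themselves in a gloomy situation."
```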

To give some further empirical support to my claim that the best option is to measure bias, Gonen and Goldberg argue that de-biasing efforts are more akin to putting lipstick on a pig. They argue that since the clues for gender- and race-based bias are pervasive in text, one can only measure it rather than neutralize it. Understand where and how your AI is biased, and build your application accordingly.
