On May 1, 2019, Dr. Christo Wilson gave a talk on his investigation into online behavioral experiments. The talk was based on a paper entitled Who’s the Guinea Pig? Investigating Online A/B/n Tests in-the-Wild, which he and his students presented at the 2019 ACM Conference on Fairness, Accountability, and Transparency in Atlanta, Georgia.
Online behavioral experiments (OBEs) are studies (aka A/B/n tests) that site operators conduct on their websites to gain insight into their users’ preferences. Users usually aren’t asked for consent, and most of these studies are benign. A typical OBE explores questions such as whether changing the background color influences how the user interacts with the site or whether the user is more likely to read an article if the font is slightly larger.
Sometimes, these studies cross ethical boundaries. For example, Facebook conducted an ethically problematic experiment designed to manipulate the emotional state of its users. Dr. Wilson is concerned about whether we (as users of online services and as practitioners) need to adopt more stringent ethical standards for these experiments. To probe this question, he and his group took a close look at OBEs deployed using a popular platform called Optimizely. In the talk, he went into detail on how Optimizely works, presented data on how sites typically use it, and discussed some possible guidelines for performing online experiments ethically. The main takeaway is to be transparent and up front with users.
Below, I’ve sectioned the talk video into smaller segments and included my notes and insights.
- Christo Wilson — firstname.lastname@example.org, @bowlinearl — is an Associate Professor at the College of Computer and Information Science at Northeastern University.
Background on Online Behavioral Experiments
- Summary: Websites are constantly being adjusted to deliver the best user experience — or to optimize some monetary objective. Many sites run classic human subject experiments to guide this optimization, but websites usually guard this process. In medicine and in the social sciences, these experiments are carefully monitored to prevent abuse. Can we apply the same standards to websites?
- Insights: Informed consent is an important principle that could be easily adopted, even automated. Additionally, third parties versed in ethics could provide independent auditing of experimental practices.
Using Optimizely to Study OBEs
- Summary: In this section of the talk, Wilson described Optimizely, the platform that his group used to take a closer look at some live online experiments. Optimizely’s API enabled their team to gather basic statistics on experiments, including duration, audience segmentation, and device types involved. That is, when you crawl a site that is running an Optimizely test, you can grab a JSON file with details of the experiments running on the site. In all, they audited 575 websites over three months.
- Insights: I thought it was a clever observation to use the configuration file in this way. At least for Optimizely sites, it also gives users an indication of whether they are involved in an experiment, and this might be a path that would allow for users to ask for more transparency.
- Summary: Most sites in their survey didn’t seem to use the full capabilities of Optimizely: most did little or no audience segmentation. The New York Times, AirAsia, and CREDO Mobile were among the sites running the most sophisticated experiments. Most of the sites had only two variations per experiment.
- Insights: It was odd to see such low use of deep segmentation. It may also be, as Dr. Wilson pointed out, that sophisticated experiments are being conducted in internal systems, or with Google Optimize.
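To make the kind of audit described above more concrete, here is a minimal Python sketch of how experiment statistics could be tallied from a configuration file. Everything in it is an assumption for illustration: the field names (`experiments`, `variations`, `audience_ids`) and the sample data are hypothetical and do not reflect Optimizely's actual JSON schema or the tooling Wilson's group used.

```python
import json
from collections import Counter

# Hypothetical config resembling what a crawler might retrieve from a site.
# The schema and values are invented for this example.
SAMPLE_CONFIG = """
{
  "experiments": [
    {"id": "exp-1", "name": "Headline test",
     "variations": ["A", "B"], "audience_ids": []},
    {"id": "exp-2", "name": "Checkout button",
     "variations": ["A", "B", "C"], "audience_ids": ["aud-mobile"]},
    {"id": "exp-3", "name": "Font size",
     "variations": ["A", "B"], "audience_ids": []}
  ]
}
"""


def summarize_experiments(config_text):
    """Count experiments, variations per experiment, and segmented experiments."""
    config = json.loads(config_text)
    experiments = config.get("experiments", [])
    # Histogram: number of variations -> number of experiments with that count.
    variation_counts = Counter(len(e.get("variations", [])) for e in experiments)
    # An experiment counts as segmented if it targets at least one audience.
    segmented = sum(1 for e in experiments if e.get("audience_ids"))
    return {
        "total": len(experiments),
        "variation_counts": dict(variation_counts),
        "segmented": segmented,
    }


summary = summarize_experiments(SAMPLE_CONFIG)
print(summary)
```

Run over many crawled sites, a tally like this would surface the patterns noted in the talk, such as how many experiments have only two variations and how few use audience segmentation.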
Case Studies — Price Discrimination and NY Times
- Summary: This segment of the talk presents two case studies. The first is price discrimination: sites trying to adapt prices to what the customer is willing to pay based on guesses about the audience segment. The API can’t provide deep insight into how prices were being set. This is potentially concerning but there is no data to probe. The other case is the structure of news headlines at The New York Times. “Clickier” headlines seem to do better revenue-wise, which may conflict with journalistic ethics and standards.
- Insights: Wilson makes a good point 13 minutes into this section: subtle changes produced by headline optimization could completely change the impression of the news being reported and, further, change how different audience segments view the same event. In light of concerns about deepfakes, fake news, and infox, this is certainly something to pay attention to.
Limitations and Ethics
- Summary: There are many gaps in understanding the audiences involved in the tests that were observed, and many testing platforms are completely opaque. The encouraging thing is that in this sample, they uncovered no flagrant ethical violations. However, there is no disclosure of tests, even deep inside the terms of service statements. The DETOUR Act, currently only a proposal introduced in April by Senators Mark Warner and Deb Fischer, could provide a starting point to regulate online testing. It seeks to ban online experiments designed to manipulate users and to set standards for informed consent.
- Insights: Optimizely and other providers should probably provide up-front ethics training for setting up experiments, not as an afterthought. A quick video search of “ethics in a/b testing” turned up empty. There is an opportunity for a concerted effort across vendors for ethics training and audits that could be a vital component in securing consumer confidence and safety.
Questions and wrap up
- Summary: Dr. Wilson was asked if there was a good litmus test for assessing the ethics of an experiment. He suggested beginning by asking yourself whether users would become angry if it were revealed that they had unknowingly participated in the study. He also suggested thinking through whether the test might exclude or unfairly treat a particular audience, especially members of a protected class. Assembling a trusted group of reviewers for experimental design, especially for the questionable cases, might be a good way to start.
- Insights: I think there’s an opportunity to take the lead in developing standards for experimental design, formal review processes, and overall transparency. This Udacity course (prepared by data scientists at Google) gives a short lesson on ethics in A/B testing. Again, it provides just an introduction to some of the issues. Given legislative pressures (e.g., the DETOUR Act), which are especially likely as the U.S. election nears, it looks like an ideal opening for practitioners to take the lead and develop standards amongst themselves.