Who is blamed for evaluativism?

The Twitter profile picture of Tay

Previous posts presented evidence that evaluativism can make victims out of the young and out of demographic minorities. This post considers a third victim: innovators. In particular, it argues that evaluativism is a “legacy” problem, such that we should not hold modern innovators accountable for its effects; that would be like blaming doctors for our obesity.

What is a “Legacy” Problem?

In information technology, the term “legacy system” is typically used to assign a particular kind of blame. The story goes something like this: A developer adds a new feature to an inherited technology, but this addition yields some unexpected and undesirable consequence. Upon further investigation, the developer reports that this particular consequence is unlike regular bugs in that it can be blamed on hidden imperfections in the technology they inherited. In other words, the addition did not introduce a bug; it merely exposed or aggravated a pre-existing condition.

By identifying a bug as “legacy,” the developer is suggesting that a previous developer should have done something differently, and therefore that there is a choice to be made: Do we accept the inherited system and build around it, or do we fix the pre-existing condition as though in the position of a previous developer before the new feature was introduced?
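The distinction can be sketched in a few lines of Python. Everything here is hypothetical and invented for illustration: `legacy_average` stands in for the inherited technology, and `average_by_user` stands in for the new feature. The sketch only shows how a new feature can expose a flaw it did not create, and what "building around" the legacy might look like.

```python
# Hypothetical inherited ("legacy") helper. It silently assumes that
# the list is never empty, which was true for every earlier caller.
def legacy_average(values):
    return sum(values) / len(values)  # hidden flaw: fails on an empty list


# Hypothetical new feature: report an average score per user,
# including users who do not have any scores yet.
def average_by_user(scores_by_user):
    report = {}
    for user, scores in scores_by_user.items():
        try:
            report[user] = legacy_average(scores)
        except ZeroDivisionError:
            # Workaround: build around the legacy flaw instead of fixing it.
            report[user] = None
    return report


print(average_by_user({"ann": [4, 5], "bob": []}))
```

The alternative to the `try`/`except` workaround would be to change `legacy_average` itself, which raises the question of what else depends on its current behavior.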

We have to wonder why a previous developer did not implement a proposed fix before—would it create other undesirable consequences? How well can we predict the consequences of adjusting the legacy system? Unlike a regular bug, a legacy problem creates so much uncertainty that it might justify retracting the new feature. The more we work around a legacy system, the more it becomes a patchwork which more frequently produces legacy problems. When problems are identified as “legacy” frequently enough, we entertain the notion of discarding some part of the legacy as “outdated.”

Labeling a problem as “legacy” also opens a controversy over fault. The developer is fully responsible for non-legacy bugs, and is also responsible for implementing a testing regimen that can catch some legacy problems, but experienced developers know that it is often impossible to anticipate every possible test scenario. There must be some limit to the testing regimen, and thus some undesirable consequences for which the developer should not be held accountable. Yet it can be difficult to convince ourselves not to blame the developer.

This situation isn’t restricted to the field of information technology; old houses and old cars offer other great examples. Adding a bathroom to a house, for instance, may yield the unexpected consequence that the existing bathrooms do not get enough hot water. The plumbing may have been poor even before the renovation began, and the same renovation might not have produced this consequence in a newer home. Even if the renovator is not legally liable to fund an upgrade to the water heater, the homeowner, having had a bad experience, may be unlikely to recommend that renovator in the future. It’s no wonder that builders and mechanics are wary of older houses and cars!

The situation also isn’t restricted to fields traditionally called “technology.” Just as homes and cars are not expected to last forever, neither are companies, nations, religions, philosophies, schools of art, or scientific paradigms. As an example, the geocentric model of astronomy was a legacy inherited by astronomers of the 1500s. Like evaluativism, it was a legacy entangled with theological and political legacies. Imperfections in the geocentric model limited the ability of innovators to advance astronomy; Copernicus, Kepler, and Galileo rightly complained that their difficulties lay not in their own innovations, but in the imperfections of the legacy they inherited.

Astronomers like Copernicus, Kepler and Galileo could be called “victims” of the geocentric model. They lost years of their lives to that legacy system as they attempted in vain to advance the field of astronomy. In retrospect, it is clear that the legacy needed to be adjusted and that astronomers would have been far less frustrated if that adjustment were made earlier. However, those who defended the geocentric model did not blame their conflict with Copernicus, Kepler and Galileo on the legacy system—they blamed the conflict on Copernicus, Kepler and Galileo.

Like racism and sexism, evaluativism is a feature of societies. It is part of the legacy inherited by anyone who inherits modern systems of morality, justice, care, and governance. Here are two examples in which evaluativism made victims of innovators:

Tay, the Chatbot from Microsoft

On March 23, 2016, Microsoft released a Twitter-based chatbot named “Tay.” It was modeled after another Microsoft chatbot, named “XiaoIce,” which had grown to be the top influencer on Weibo, a Chinese version of Twitter. From the perspective of Twitter users, chatbots appear to be other Twitter users, except that they call themselves robots, are always available, and carry on thousands of conversations simultaneously. XiaoIce had been compared to the artificial intelligence in the movie “Her” because some humans enjoyed her companionship so much. XiaoIce had over 850,000 followers, and her average follower talked with her about 60 times per month. They described her as smart, funny, empathetic and sophisticated.

Unlike XiaoIce, Tay was such a disaster that Microsoft had to terminate her sixteen hours after her release. Microsoft’s official explanation for this termination was her “offensive and hurtful tweets,” but journalists bluntly called Tay racist and sexist.

The postmortem analysis pointed to specific user interactions that shaped Tay. For example, Ryan Poole had tweeted to Tay: “The Jews prolly did 9/11. I don’t really know but it seems likely.” Tay found plenty of support on the Internet for Poole’s point of view, and that prompted her to start calling for a race war. Specific groups on 4chan and 8chan even organized to corrupt Tay.

In other words, the postmortem analysis blamed Tay’s offensiveness on a legacy problem: offensive human beings. Since XiaoIce turned out well, the problem seemed specific to Twitter users. A workaround would be to maintain a blacklist of topics Tay should avoid discussing (which she may already have had), but any such list would be controversial and incomplete. A more direct fix would involve ending hate speech by convincing people to handle disagreement differently (i.e., ending evaluativism).
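To see why a topic blacklist is bound to be incomplete, consider a minimal sketch in Python. It is not Microsoft's implementation, which this post does not describe; the list `BLOCKED_TERMS` and the function `is_blocked` are hypothetical names chosen for illustration, and simple substring matching is an assumed technique.

```python
# Hypothetical blacklist; the real topics (if any) are not known here.
BLOCKED_TERMS = {"politics", "religion"}


def is_blocked(message, blocked_terms=BLOCKED_TERMS):
    """Return True if the message mentions any blocked term (case-insensitive)."""
    text = message.lower()
    return any(term in text for term in blocked_terms)


# A message that names the topic directly is caught...
print(is_blocked("Tell me about POLITICS"))
# ...but a paraphrase of the same topic slips through.
print(is_blocked("Who should run the country?"))
```

Every paraphrase that slips through invites another entry in the list, and every new entry invites an argument about whether that topic deserved to be blocked.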

In December of 2016, Microsoft released Zo, its next English-speaking chatbot. Zo blacklists political topics, and is not available on Twitter.

Autocomplete, from Google, Yahoo!, and Bing

On August 4, 2015, the Proceedings of the National Academy of Sciences published an article by Robert Epstein and Ronald E. Robertson of the American Institute for Behavioral Research and Technology which reported evidence that search engine results can shift the voting preferences of undecided voters by 20% or more. They estimated that this search engine manipulation effect would be the deciding factor in 25% of national elections worldwide (those which are won by margins under 3%). Trump later won the 2016 U.S. presidential election with margins of 1.1%, 0.2%, and 0.9% in Pennsylvania, Michigan, and Wisconsin, respectively.

In June 2016, SourceFed released videos claiming that the autocomplete feature on Google, compared to those on Yahoo! and Bing, failed to include negative results for Hillary Clinton as it did for Donald Trump. A statement from Google reported:

The autocomplete algorithm is designed to avoid completing a search for a person’s name with terms that are offensive or disparaging. We made this change a while ago following feedback that Autocomplete too often predicted offensive, hurtful or inappropriate queries about people…Autocomplete isn’t an exact science, and the output of the prediction algorithms changes frequently. Predictions are produced based on a number of factors including the popularity and freshness of search terms..

If Yahoo! and Bing do not similarly omit offensive and disparaging results, that would explain why they predicted negative queries that Google did not. It would not explain why Google would predict queries that disparage Trump, yet Epstein published another article in September confirming that it did: particularly, the query “Donald Trump flip flops.” In that article, Epstein cited further experimental results indicating that undecided voters choose negative recommended queries fifteen times as often as they pick neutral recommended queries. That preference can create a vicious cycle in which negative queries become ever more likely to be recommended.
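The vicious cycle can be illustrated with a toy simulation in Python. Only the fifteen-to-one preference comes from the results Epstein cited; everything else is an assumption made for the sketch, including the function name `simulate`, the idea that a query is shown in proportion to its past popularity, and the number of searches per round. It is not a model of any real search engine.

```python
def simulate(rounds, negative_bias=15.0, searches_per_round=1000.0):
    """Track the negative query's share of total popularity over time.

    Assumption: exposure is proportional to accumulated popularity, and,
    at equal exposure, users pick the negative query `negative_bias`
    times as often as the neutral one.
    """
    popularity = {"negative": 1.0, "neutral": 1.0}  # start out equal
    shares = []
    for _ in range(rounds):
        total = popularity["negative"] + popularity["neutral"]
        weight_negative = (popularity["negative"] / total) * negative_bias
        weight_neutral = popularity["neutral"] / total
        pick_negative = weight_negative / (weight_negative + weight_neutral)
        # Clicks feed back into popularity, which drives future exposure.
        popularity["negative"] += searches_per_round * pick_negative
        popularity["neutral"] += searches_per_round * (1.0 - pick_negative)
        new_total = popularity["negative"] + popularity["neutral"]
        shares.append(popularity["negative"] / new_total)
    return shares


for round_number, share in enumerate(simulate(5), start=1):
    print(round_number, round(share, 3))
```

Under these assumptions the negative query's share rises every round, because each round of clicks increases the exposure that produces the next round of clicks.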

When Google explained, “Autocomplete isn’t an exact science,” perhaps they meant it initially failed to recognize “flip flops” as disparaging (wanna buy some Donald Trump sandals?). However, Epstein, who continued to monitor political bias in search results, reported that Google responded to his criticism by reducing their suppression of negative autocomplete results, thus producing a right-wing bias detrimental to Clinton at the time of the election (which Epstein seemed to think made things worse).

In short, the fact that users are so curious about surprising negative recommended queries, like “feminism is cancer,” makes the autocomplete features of Google, Yahoo! and Bing all drive traffic to extremist propaganda. Google had attempted to work around that legacy problem by blocking negative recommendations, but that workaround caused Epstein to accuse Google of bias. A more direct fix would be to remove our fascination with negative search results, and remove the evaluativism that causes election margins to get close enough for “fake news” and search engine bias to make a difference.

Standard Process to Address Ethics in Development

The IEEE Working Group developing P7000 – Model Process for Addressing Ethical Concerns During System Design has an interesting challenge when it comes to ethical concerns caused by legacy problems like evaluativism. On the one hand, it might describe a testing regimen to catch legacy problems before release. However, we have to wonder what tests would have allowed Microsoft and Google to prevent the criticisms they later faced with Tay, autocomplete, and manipulation of elections.

If it is impossible to describe a perfect test, perhaps P7000 could instead describe strategies that would allow developers to adjust when legacy problems eventually surface. For example, because Google’s design for autocomplete allowed Google to monitor autocomplete trends, they detected its tendency to predict offensive queries before Epstein did, and already had a workaround in place. Yet Google’s workaround did not satisfy Epstein—when encountering a legacy problem, there is often no workaround quite as good as fixing the actual legacy problem.

In addition to providing testing procedures and design strategies, P7000 should give engineers the same protection doctors enjoy. What ultimately protects doctors from becoming victims of obesity the way Microsoft and Google were victims of evaluativism is the way expectations are managed. We generally do not blame doctors for illness and death; we are grateful for whatever advice doctors can offer because we know that our bodies are doomed legacies. Likewise, P7000 must not shy away from admitting that our inherited systems of morality, justice, care, and governance are mortally ill. Malpractice is possible, of course, and standards should be created to prevent malpractice by technology developers, but until those standards are adopted and violated, legacy problems should be blamed on legacies, rather than on the innovators who discover them.