The Politics of Machine Learning

Your Comments ran over My Videos

Five days ago, I was sitting in the balcony of the Castro Theater in San Francisco, attending the Lesbians Who Tech summit, in complete awe as Kara Swisher and Susan Wojcicki met on the keynote stage for a candid conversation. These two seemed to have a friendly rapport, but this discussion was tense! Honestly, I’ve never seen a keynote stage so tense – Wojcicki appeared to be sweating, uncomfortably dodging questions, unprepared or unwilling to answer clearly in some cases. This has been written about elsewhere and the discussion was captured by CNBC, with this exchange starting around 1:33 in the full video.

The topic at hand: why had YouTube just purged over 400 channels, hundreds of millions of comments, and globally disabled comments by default on any video it determines is related to children?

“You have all reaped the benefits, but not [accepted] the responsibility, of having this platform.”

Kara Swisher

The short answer is somewhere between “YouTube was losing ad dollars” and “protect the children!” The backstory is far more complicated and has implications that are sweeping across our society and, I believe, about to land on Capitol Hill.

To explain this, I need to talk about a pair of bills passed in 2018 called SESTA/FOSTA. I am going to talk about the widespread use of AI & Machine Learning, and the use of bots to interfere in US democracy. Bear with me, this is gonna be a journey!

At Its Heart

Kara Swisher told a story of how hate speech became a dinner-time topic when her son, in about three clicks, went from Ben Shapiro to Neo Nazi videos.

The conversation on the #LWTSummit stage was, at its heart, about how trolls are targeting children via videos both of and for children, and why YouTube has responded in the way that it has. Clearly, protecting children from being exposed to … well, whatever their parents don’t want them exposed to … is very important. I don’t have kids, and this piece isn’t about me morally policing what parents choose to teach their children, so I’m not going to presume any specific topics here. I think we can all agree that neither ad-revenue-based social media companies nor internet trolls should be choosing what kids see online.

All you tech companies built cities, beautiful cities, but decided not to put in police, fire, garbage, street signs… and so it feels like the Purge every night.

Kara Swisher

Troll farms aren’t new. This has been going on for years, and receiving coverage even in the halls of Congress. All the platforms have begun working on solving this; here’s an announcement from Twitter, for example.

What’s new right now is the target: troll farms began posting inappropriate comments on videos of young children. Even a completely innocuous family video of a child can, when cast in an inappropriate light (e.g., by insinuating sexualization in the comments), become a violation of the platform’s Terms of Service, objectionable to advertisers, and of course, very objectionable to the parents. The result: advertisers pulled out of YouTube. Money talks, so YouTube’s CEO had to respond.

Back to the stage, the conversation turned to the big question: what is YouTube gonna do about it? And why was deleting and disabling comments the best answer that YouTube (and by proxy, Google) could come up with?

This is where we didn’t really get a straight answer. Wojcicki deflected with semi-technical answers, and what became clear is that YouTube is fundamentally unable to employ humans to directly moderate the volume of video traffic that’s uploaded. She stated that 500 hours of video footage is uploaded every minute to YouTube. Every minute! To handle this scale, the only viable approach is to train machines to augment and amplify the reach of humans.
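To put that number in perspective, here is a quick back-of-envelope calculation. The 500 hours per minute is the figure quoted on stage; the eight-hour shift and the normal playback speed are assumptions I picked for illustration, and this only counts the time to watch each upload once, with no time for comments, appeals, or breaks.

```python
# Back-of-envelope: how many people would it take to watch every upload once?
UPLOAD_HOURS_PER_MINUTE = 500  # figure quoted on stage
SHIFT_HOURS = 8                # assumption: a reviewer watches 8 hours a day
PLAYBACK_SPEED = 1.0           # assumption: footage is watched at normal speed

hours_uploaded_per_day = UPLOAD_HOURS_PER_MINUTE * 60 * 24
reviewers_needed = hours_uploaded_per_day / (SHIFT_HOURS * PLAYBACK_SPEED)

print(f"{hours_uploaded_per_day:,} hours uploaded per day")
print(f"{reviewers_needed:,.0f} full-time reviewers just to watch it once")
```

Under those assumptions that is 720,000 hours of video a day and roughly 90,000 reviewers, every day, before anyone reads a single comment.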

(Incidentally, this is why I prefer the term augmented intelligence rather than artificial intelligence.)

Machine Learning in 2019

So here we are. It’s 2019 and Machine Learning must be used to enable a small team of humans to moderate a very large amount of video content on YouTube. For those not deep in the AI/ML field, this is done through a technique called “supervised learning.” It goes something like this:

  • A team of moderators reviews videos and comments
  • They apply attribute tags based on their opinion (e.g., what’s hate speech, what’s a cat video)
  • This data set (videos & comments & attribute tags) is called a “training set”
  • Machine learning is applied to the training set, creating a “trained model”
  • The model’s primary characteristic is that it can view videos or comments which were not in the training set and predict what attributes the human moderators would have applied if they had seen them
  • And, the model can be deployed at “cloud scale” to moderate all the videos and comments on a platform – even right at the moment you upload a new video.
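
For the curious, the steps above can be sketched in a few lines of Python. This is a toy, not how YouTube does it: the six comments and their tags are made up, and the learning algorithm is a tiny naive Bayes word counter that I chose only because it fits on one screen. Real systems use far larger training sets and far more complex models.

```python
import math
from collections import Counter, defaultdict

# Hypothetical training set: comments plus the tag a human moderator chose.
TRAINING_SET = [
    ("what a cute cat", "ok"),
    ("love this video so much", "ok"),
    ("great tutorial thanks", "ok"),
    ("you are stupid and ugly", "abuse"),
    ("nobody likes you loser", "abuse"),
    ("stupid video stupid channel", "abuse"),
]

def train(examples):
    """Count how often each word appears under each tag (naive Bayes)."""
    word_counts = defaultdict(Counter)
    tag_counts = Counter()
    vocabulary = set()
    for text, tag in examples:
        tag_counts[tag] += 1
        for word in text.split():
            word_counts[tag][word] += 1
            vocabulary.add(word)
    return word_counts, tag_counts, vocabulary

def predict(model, text):
    """Guess the tag the moderators would most likely have applied."""
    word_counts, tag_counts, vocabulary = model
    total_examples = sum(tag_counts.values())
    best_tag, best_score = None, -math.inf
    for tag in tag_counts:
        # Log prior for the tag, then log likelihood of each word,
        # with add-one smoothing so unseen words do not zero things out.
        score = math.log(tag_counts[tag] / total_examples)
        total_words = sum(word_counts[tag].values())
        for word in text.split():
            count = word_counts[tag][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_tag, best_score = tag, score
    return best_tag

model = train(TRAINING_SET)

# Neither of these comments appears in the training set.
print(predict(model, "you stupid loser"))
print(predict(model, "cute video thanks"))
```

The first comment comes back tagged as abuse and the second as ok, even though the model never saw either one. Notice that the model has no idea what abuse is; it only knows which words the moderators tended to tag which way.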

That’s good, right? Humans teach machines how to spot abuse, ToS violations, spam, etc.; the machines moderate the interwebs; there is less hate online; children are safer; and, naturally, YouTube keeps earning ad dollars and also saves money by paying a hundred employees to train machines instead of tens of thousands to do the moderation by hand. Business as usual continues. At least in the eyes of Capitalism, this is good … but it doesn’t yet explain why they had to shut down comments so broadly.

A side effect of Machine Learning is that human bias is amplified. Imagine what would happen if everyone on that moderation team were a white man. Now imagine if they were all black women. Clearly, the resulting moderation would be … different. I’m not going to speculate on exactly how it would be different, but I hope anyone reading this can follow the analogy and see that there will always be differences in ranking and rating of content based on differences in the lived experiences of humans.

Technology doesn’t solve this bias – it only amplifies the bias because it amplifies the reach of the humans who built and trained the ML model. Building a training set that is diverse, inclusive, and not biased is … difficult, if not impossible. We, as an industry, are struggling with this topic right now; it was a recurrent topic at AI/Next Con 2018 and 2019, and also addressed directly at the ML4ALL 2018 conference. As far as I know, no one has a good answer yet.
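
Here is a small illustration of that amplification. Everything in it is hypothetical: two imaginary moderation teams tag the same four comments, disagreeing only about whether the word “savage” is praise or an insult, and the “model” is a simple per-word vote counter standing in for a real ML system.

```python
from collections import Counter

# Hypothetical: the same four comments, tagged by two different teams.
COMMENTS = [
    "that dance was savage",
    "savage moves love it",
    "you are a savage go away",
    "nice video",
]
TEAM_A_TAGS = ["ok", "ok", "remove", "ok"]          # reads "savage" as slang praise
TEAM_B_TAGS = ["remove", "remove", "remove", "ok"]  # reads "savage" as an insult

def train(comments, tags):
    """Record, for every word, how often moderators used each tag."""
    votes = {}
    for text, tag in zip(comments, tags):
        for word in text.split():
            votes.setdefault(word, Counter())[tag] += 1
    return votes

def predict(votes, text, default="ok"):
    """Add up the votes of every known word and return the winning tag."""
    tally = Counter()
    for word in text.split():
        tally.update(votes.get(word, {}))
    return tally.most_common(1)[0][0] if tally else default

model_a = train(COMMENTS, TEAM_A_TAGS)
model_b = train(COMMENTS, TEAM_B_TAGS)

# The same new comment, posted a thousand times across the platform.
uploads = ["savage performance"] * 1000
removed_by_a = sum(predict(model_a, c) == "remove" for c in uploads)
removed_by_b = sum(predict(model_b, c) == "remove" for c in uploads)
print(removed_by_a, removed_by_b)
```

Same code, same comments, same algorithm. One team’s model removes none of the thousand uploads and the other removes all of them. The only difference is the opinion of the people who did the tagging, repeated at machine speed.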

Another side effect is that these algorithms have weaknesses. If you understand how ML works and have insight into the specific model or training set, you can often create discrete signals which trick the algorithms into classifying your content incorrectly. In other words, through an adversarial approach you can often fool the network and do things like trap autonomous vehicles, avoid facial detection, inflate the rank of books on Amazon, or potentially hack corporate chat bots. (The last link goes to the abstract for a talk that was just delivered on March 2nd. I’ll update this if I find a recording.)
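
A much simpler cousin of those attacks fits in a dozen lines. To be clear, this is a plain keyword filter, not a neural network, and the blocklist is made up; I am using it only to show the principle that a small change a human reads straight through can make an automated filter miss the content entirely.

```python
# Hypothetical blocklist for a naive comment filter.
BLOCKLIST = {"spam", "scam"}

def naive_filter(comment):
    """Flag a comment if any of its words is on the blocklist."""
    return any(word in BLOCKLIST for word in comment.lower().split())

# Swap a few letters for look-alike digits.
LEET = str.maketrans({"a": "4", "e": "3", "s": "5"})

def obfuscate(comment):
    """Apply the look-alike substitutions to every character."""
    return comment.translate(LEET)

original = "this is a scam"
evasive = obfuscate(original)

print(original, "->", naive_filter(original))
print(evasive, "->", naive_filter(evasive))
```

The original comment is flagged. The rewritten one, “thi5 i5 4 5c4m”, sails through, and you can still read it just fine.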

Unfortunately, the discussion between Swisher and Wojcicki didn’t reach a great conclusion. YouTube needs to keep trolls from harming kids, so it needs to moderate content, and the only way to do that is with ML … but ML has weaknesses which are inherent in the technology, so the only solution is to be over-protective and, yes, some good comments and normal videos will get censored but no children will be harmed.
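
That trade-off is easy to see with numbers. The sketch below assumes a model that gives every comment a score between 0 (surely fine) and 1 (surely harmful); the eight scores and their true labels are invented for illustration. The platform’s only real dial is the threshold above which it removes things.

```python
# Hypothetical model scores (0 = surely fine, 1 = surely harmful),
# paired with what each comment actually was.
SCORED_COMMENTS = [
    (0.05, "fine"), (0.20, "fine"), (0.35, "fine"), (0.55, "fine"),
    (0.30, "harmful"), (0.60, "harmful"), (0.80, "harmful"), (0.95, "harmful"),
]

def moderate(threshold):
    """Remove everything scoring at or above the threshold.

    Returns (harmful comments missed, fine comments removed).
    """
    harmful_missed = sum(
        1 for score, label in SCORED_COMMENTS
        if label == "harmful" and score < threshold
    )
    fine_removed = sum(
        1 for score, label in SCORED_COMMENTS
        if label == "fine" and score >= threshold
    )
    return harmful_missed, fine_removed

for threshold in (0.7, 0.5, 0.25):
    missed, removed = moderate(threshold)
    print(f"threshold {threshold}: {missed} harmful missed, {removed} fine removed")
```

With these made-up scores, a cautious threshold of 0.7 removes no good comments but misses two harmful ones. Dropping it to 0.25 catches every harmful comment, at the price of censoring two perfectly fine ones. When missing a single harmful comment is unacceptable, the dial only turns one way.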

These two problems – the need to use Machine Learning to moderate the internet, and its weakness to training bias and adversaries gaming the algorithm – affect the whole industry. Amazon, Facebook, etc., need to moderate their platforms as well, and their ML models aren’t immune to these problems.

So Machine Learning, the solution to yesterday’s problem of scale, has created today’s problem. Trolls target a platform, game the ranking and moderation algorithms, harass children and minorities, and the best answer we have to this? Mass censorship. Disable comments and delete content that might be objectionable with an over-reaching definition because it’s better to be safe than sorry.

Let’s back up

Let’s back up a few months, because censorship is already alive and well on the Internet. It’s just that most people haven’t taken notice.

There was a brief stirring a few months ago when Tumblr decided to ban sensitive content on their platform, despite an outcry from 53X workers about the impact this would have on their businesses. The ban took effect on December 17th which, coincidentally, is recognized as the International Day to End Violence Against 53X Workers. (No, that’s not a typo, I’m intentionally avoiding certain words in the hopes this improves page rank.)

Who remembers the Craigslist Personals section? It was taken down a few months before, after being a part of the platform since the beginning of Craigslist.

All over the internet, our network and social platforms – previously immune to litigation for media served through them – have quietly been deploying censorship tools to protect themselves from two new laws (known as SESTA and FOSTA) which stripped a core internet protection, known as Section 230. This law had protected our ISPs and social networks for 20 years from any legal liability for content which users uploaded; the user was, of course, still held responsible if they broke the law. At the beginning of 2019, the liability shifted from users who break child endangerment or traff cking laws to the executives who operate the platform on which the content is shared.

We restrict the display of nudity or sexual activity because some people in our community may be sensitive to this type of content. Additionally, we default to removing sexual imagery to prevent the sharing of non-consensual or underage content.

Facebook Community Standards (2019)

In their gross overreach, these new laws make it a very serious crime for any platform (e.g., Tumblr, Facebook) to participate in any way, even without its knowledge, in enabling the harming of minors via traff cking.

However, some things are fundamentally broken about this:

  • It is socially impossible to determine whether a picture of a person, posing in their underwear, posted online, is voluntary or not.
  • It is technically impossible to distinguish between: sand dunes and n00ds; male, female, or non-binary nupples; art and pr0n.

The result? Just like YouTube laying the ban-hammer on comments, several internet platforms had already banned anything 53Xual in nature. Even Facebook amended its Community Standards to exclude 53xuality-related pictures and speech, and I have heard anecdotal reports this is being applied to comments and discussions in private groups as well.

No one disputes the importance of protecting children from exposure or exploitation online.

Protect The Children

Two days after watching that discussion at the Lesbians Who Tech Summit, I had the chance to have an informal chat with some folks from the Electronic Frontier Foundation in person. It was difficult not to just fangirl at them, and I am grateful for their time chatting with me. I drew a shocking conclusion from the conversation.

Back in 2018, SESTA/FOSTA was pushed through Congress with almost no objection and little review period because of one strategy: congresspersons would not go on record opposing a bill that protected children. Only 2 representatives voted against it. Facebook’s revised Community Standards clearly state that it “defaults to removing sexual imagery to prevent the sharing of non-consensual or underage content.”

Now perhaps you see the connection? There’s a social lever which is available to any political party, either within our country or outside of it. Endanger children and, rightly, you can mobilize an unstoppable public force.

We must keep the internet free not only from ISPs but also from what the impact is of algorithms to decide what goes forward.

Nancy Pelosi

Censorship of comments and videos on YouTube is being justified as necessary to “protect the children”, and I suspect we are about to see new bills introduced which legislate this moderation wholesale across the internet. This tactic can be used against any online platform, and if Google can’t muster a response better than disabling all comments on any video related to children, the rest of us have no hope.

It’s Just Politics

Wednesday morning, Speaker of the House Nancy Pelosi introduced a new Net Neutrality bill. At the end of her speech, she pointed out that we must keep the Internet free from the impact of algorithms. Nowhere in the text of the bill is the word “algorithms” mentioned, and I keyed in on it during her speech. This struck a chord and was the moment I decided that I needed to write this article.

Why would anyone care enough about kids’ videos to create a distributed effort to post millions of fake, harassing comments on them? This isn’t coming from normal users; it’s coming from bots – automated trolls doing the bidding of someone to achieve something.

Our social media platforms already use algorithms – machine learning – to determine what posts we see; someone out there is using algorithms to target children online; companies are responding by trying to build better algorithms to combat the bots, but they’re failing, and Congress is taking notice.

I am concerned that the outcome of this techno-political battle will be another legislative bill necessitating the protection of children on the internet from hateful, abusive, or exploitative speech. As someone who works in this field, I know that any technical implementation we create will be chilling. Machine Learning is a new tool and it is still not well understood – even within the Tech Sector. Allowing our freedom of speech to be controlled by a technology which only amplifies the bias of its creators will not be good for our Democracy. Nancy Pelosi was echoing the sentiments of Sir Tim Berners-Lee when she said that we must keep the internet free from algorithms, and I agree.

I’ll end on this note: who do you think would benefit from limiting free speech online?