Comedian George Carlin had a list of Seven Words You Can’t Say on TV. Parts of the internet have a list of 402 banned words, plus one emoji, 🖕.
Slack uses the open source List of Dirty, Naughty, Obscene, and Otherwise Bad Words, found on GitHub, to help groom its search suggestions. Open source mapping project OpenStreetMap uses it to sanitize map edits. Google artificial intelligence researchers recently removed web pages containing any of the words from a dataset used to train a powerful new system for making sense of language.
LDNOOBW, as intimates know it, was a low-profile utility for years but recently became more prominent. Blocklists try to bridge the gulf between the mechanical logic of software and the organic contradictions of human behavior and language. But such lists are inevitably imperfect and can spawn unintended consequences. Some AI researchers have criticized Google’s use of LDNOOBW as narrowing what its software knows about humanity. Another, similar open source list of “bad” words caused the chat software Rocket.Chat to prevent attendees of an event called Queer in AI from typing the word queer.
The initial List of Dirty, Naughty, Obscene, and Otherwise Bad Words was drawn up in 2012 by employees of stock photo site Shutterstock. Dan McCormick, who led the company’s engineering team, wanted a list of obscene or objectionable terms as a safety feature for the autocomplete in the site’s search box. He was happy for users to type whatever they wanted, but didn’t want the site to actively suggest terms people might be surprised to see pop up in an open office. “If someone types in B, you don’t want the first word that comes up to be boobs,” says McCormick, who left Shutterstock in 2015.
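The autocomplete guard McCormick describes can be sketched in a few lines: check each candidate suggestion against the blocklist and drop any that match, while leaving what the user actually types untouched. This is a minimal illustration, not Shutterstock's code; the function name and the placeholder blocklist entries are invented for the example.

```python
# Placeholder entries standing in for the real list; the actual LDNOOBW
# file on GitHub has 403 English entries.
BLOCKLIST = {"badword", "slur"}

def filter_suggestions(suggestions):
    """Drop any suggestion that contains a blocklisted word as a whole token.

    Users can still type whatever they want; the site simply never
    volunteers a blocklisted term in its dropdown.
    """
    def is_clean(phrase):
        return not any(token in BLOCKLIST for token in phrase.lower().split())
    return [s for s in suggestions if is_clean(s)]

# "badword pics" is suppressed; the harmless suggestions survive.
filter_suggestions(["beach", "badword pics", "boots"])  # → ["beach", "boots"]
```

A real deployment would also have to handle misspellings, embedded substrings, and per-language lists, which is part of why the list keeps growing.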
He and some coworkers took Carlin’s Seven Words, tapped the darkest corners of their brains, and used Google to learn sometimes bewildering slang for sexual acts. They posted their initial 342 entries to GitHub with a note inviting contributions and the suggestion that it could “spice up your next game of Scrabble :)”
Almost nine years later, the list is longer and more influential than ever. Shutterstock employees continued curating their list of crudities after McCormick’s departure, with help from outside suggestions, eventually reaching 403 entries for English. The list won users outside the company, including at OpenStreetMap and Slack. There are versions of the list in more than two dozen other languages, including three entries for Klingon—QI’yaH!—and 37 for Esperanto. Shutterstock declined to comment on the list and said it is no longer a company project, although it still bears the company’s name and copyright notice on GitHub.
Artificial intelligence researchers at Google recently won LDNOOBW new fame—and infamy. In 2019, company researchers reported using the list to filter the web pages included in a collection of billions of words scraped from the web called the Colossal Clean Crawled Corpus. The censored collection powered a recent Google project that created the largest language AI system the company has revealed, showing strong results on tasks such as reading comprehension questions or tagging sentences from movie reviews as positive or negative.
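The filtering the Google researchers describe is blunt by design: if any word from the list appears anywhere on a page, the entire page is excluded from the training corpus, regardless of context. A minimal sketch of that document-level rule, with invented placeholder entries rather than the real list:

```python
# Placeholder entries; the real filter used the full LDNOOBW list.
BLOCKLIST = {"sex", "badword"}

def keep_page(text):
    """Return True only if no blocklisted word appears anywhere in the page.

    One hit anywhere discards the whole document -- there is no attempt
    to judge context.
    """
    tokens = set(text.lower().split())
    return BLOCKLIST.isdisjoint(tokens)

keep_page("a forum post full of badword insults")   # → False
keep_page("sex education resources for teenagers")  # → False
keep_page("recipes for winter soups")               # → True
```

The second example shows the side effect critics point to: a page-level match on a single word removes educational and medical material along with abusive content.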
Similar projects have created software that generates astonishingly fluid text. But some AI researchers question Google’s use of LDNOOBW to filter its AI input, saying it blacked out a lot of knowledge. Striking out pages featuring obscenities, racial slurs, anatomical terms, or the word sex regardless of context would remove abusive forum postings—but also swaths of educational and medical material, news coverage about sexual politics, and information about Paridae songbirds. Google didn’t discuss that side effect in its research papers.
“Words on the list are many times used in very offensive ways but they can also be appropriate depending on context and your identity,” says William Agnew, a machine learning researcher at the University of Washington. He is a cofounder of the community group Queer in AI, whose web pages about improving diversity in the AI workforce would likely be excluded from Google’s training data for using the word sex. LDNOOBW appears to reflect historical patterns of disapproval of homosexual relationships, Agnew says, with entries including “gay sex” and “homoerotic.”
Agnew has had first-hand experience of the unintended consequences of such systems. When Queer in AI ran a workshop at a leading AI research conference last year, virtual attendees ran into problems using the conference’s virtual hangout on the service Rocket.Chat. Its optional content filter draws on another GitHub list called badwords, which at the time included the words lesbian and queer. “We couldn’t even type the name of our workshop,” Agnew says.
The list in question has since been updated; its creator declined to speak with WIRED. A spokesperson for Rocket.Chat said the company is investigating and updating its filtering feature “to ensure ‘queer’ is not blocked and there aren’t other ‘restricted’ words that would conflict with our values and commitment to diversity and inclusion.”
Agnew’s questioning of Google’s use of LDNOOBW led to the practice being called out in a research paper released last month warning of ethical downsides to recent AI research; the paper led to prominent researcher Timnit Gebru’s sudden departure from Google. “If we filter out the discourse of marginalized populations, we fail to provide training data that reclaims slurs and otherwise describes marginalized identities in a positive light,” the paper says.
Gebru says she was fired by Google after refusing a manager’s demand to remove her name from or retract the paper before publication; Google has said she resigned and criticized the paper’s quality. The company did not respond to a request for comment about how some of its researchers use LDNOOBW as an oracle of what’s objectionable.
Vinay Prabhu, chief scientist at privacy startup UnifyID, who researches algorithmic bias, says the whole industry should be more transparent about exactly what is being fed into AI models, but that Google’s influence means it has a special responsibility. “Every idiosyncratic thing Google does becomes the industry standard,” he says.
McCormick hadn’t heard of Google’s interest in his creation until WIRED called. He still uses the list to prevent unintentionally eye-catching search suggestions at his current company, Constructor.io, which provides search technology for online stores including beauty brand Sephora. He’s unsure it is well suited to filtering an AI system’s view of the world. “It’s clear the world needs a few different versions,” he says. “Maybe I should start that next.”