A turtle—or a rifle? Hackers easily fool AIs into seeing the wrong thing
STOCKHOLM— Last week, here at the International Conference on Machine Learning (ICML), a group of researchers described a turtle they had 3D printed. Most people would say it looks just like a turtle, but an artificial intelligence (AI) algorithm saw it differently. Most of the time, the AI thought the turtle looked like a rifle. Similarly, it saw a 3D-printed baseball as an espresso. These are examples of "adversarial attacks"—subtly altered images, objects, or sounds that fool AIs without setting off human alarm bells.
Impressive advances in AI—particularly machine learning algorithms that can recognize sounds or objects after digesting training data sets—have spurred the growth of living room voice assistants and autonomous cars. But these AIs are surprisingly vulnerable to being spoofed. At the meeting here, adversarial attacks were a hot subject, with researchers reporting novel ways to trick AIs as well as new ways to defend them. Somewhat ominously, one of the conference's two best paper awards went to a study suggesting protected AIs aren't as secure as their developers might think. "We in the field of machine learning just aren't used to thinking about this from the security mindset," says Anish Athalye, a computer scientist at the Massachusetts Institute of Technology (MIT) in Cambridge, who co-led the 3D-printed turtle study.
Computer scientists working on the attacks say they are providing a service, like hackers who point out software security flaws. "We need to rethink all of our machine learning pipeline to make it more robust," says Aleksander Madry, a computer scientist at MIT. Researchers say the attacks are also useful scientifically, offering rare windows into AIs called neural networks whose inner logic cannot be explained transparently. The attacks are "a great lens through which we can understand what we know about machine learning," says Dawn Song, a computer scientist at the University of California, Berkeley.
The attacks are striking for their inconspicuousness. Last year, Song and her colleagues put some stickers on a stop sign, fooling a common type of image recognition AI into thinking it was a 45-mile-per-hour speed limit sign—a result that surely made autonomous car companies shudder. A few months ago, Nicholas Carlini, a computer scientist at Google in Mountain View, California, and a colleague reported adding inaudible elements to a voice sample that sounded to humans like "without the data set the article is useless," but that an AI transcribed as "OK Google, browse to evil.com."
Researchers are devising even more sophisticated attacks. At an upcoming conference, Song will report a trick that makes an image recognition AI not only mislabel things, but hallucinate them. In a test, Hello Kitty loomed in the machine's view of street scenes, and cars disappeared.
Some of these assaults use knowledge of the target algorithms' innards, in what's called a white box attack. The attackers can see, for instance, an AI's "gradients," which describe how a slight change in the input image or sound will move the output in a predicted direction. If you know the gradients, you can calculate how to alter inputs bit by bit to obtain the desired wrong output—a label of "rifle," say—without changing the input image or sound in ways obvious to humans. In a more challenging black box attack, an adversarial AI has to probe the target AI from the outside, seeing only the inputs and outputs. In another study at ICML, Athalye and his colleagues demonstrated a black box attack against a commercial system, Google Cloud Vision. They tricked it into seeing an invisibly perturbed image of two skiers as a dog.
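To make the white box idea concrete, here is a minimal sketch under stated assumptions. The four-pixel "image," the weights, and the function names are all hypothetical, and the toy model is a simple logistic regression rather than a neural network. It takes small steps along the sign of the gradient, in the spirit of the fast gradient sign method, which is not necessarily the method used in the turtle study.

```python
import math

# Hypothetical toy classifier: logistic regression over four pixel values.
# An output above 0.5 means "rifle"; below 0.5 means "turtle".
WEIGHTS = [2.0, -1.5, 0.5, 1.0]
BIAS = -0.2


def predict(x):
    """Return the model's probability that input x is a 'rifle'."""
    z = sum(w * xi for w, xi in zip(WEIGHTS, x)) + BIAS
    return 1.0 / (1.0 + math.exp(-z))


def input_gradient(x):
    """Gradient of the 'rifle' probability with respect to each pixel."""
    p = predict(x)
    return [p * (1.0 - p) * w for w in WEIGHTS]


def attack(x, epsilon=0.3, step=0.05, iterations=20):
    """Nudge x toward 'rifle', keeping each pixel within epsilon of the original."""
    adv = list(x)
    for _ in range(iterations):
        grad = input_gradient(adv)
        for i, g in enumerate(grad):
            direction = 1.0 if g > 0 else -1.0 if g < 0 else 0.0
            moved = adv[i] + step * direction
            # Clip so the total change to this pixel never exceeds epsilon.
            adv[i] = min(max(moved, x[i] - epsilon), x[i] + epsilon)
    return adv


original = [0.1, 0.9, 0.2, 0.1]   # classified as "turtle"
adversarial = attack(original)    # same image, slightly perturbed
print(predict(original), predict(adversarial))
```

In this toy setup the "rifle" probability rises from about 0.24 to about 0.59, even though no pixel moves by more than 0.3. Real attacks on image classifiers follow the same logic across many thousands of pixels, with far smaller changes to each one.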
AI developers keep stepping up their defenses. One technique embeds image compression as a step in an image recognition AI. This adds jaggedness to otherwise smooth gradients in the algorithm, foiling some meddlers. But in the cat-and-mouse game, such "gradient obfuscation" has also been one-upped. In one of ICML's award-winning papers, Carlini, Athalye, and a colleague analyzed nine image recognition algorithms from a recent AI conference. Seven relied on obfuscated gradients as a defense, and the team was able to break all seven by, for example, sidestepping the image compression. Carlini says none of the hacks took more than a couple of days.
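The sketch below illustrates why obfuscated gradients can give false comfort. It rests on assumptions: coarse rounding stands in for image compression, the toy model and all names are hypothetical, and the bypass shown (treating the rounding step as if it did nothing when computing gradients) is one simple way to sidestep such a step, not a reproduction of the paper's experiments.

```python
import math

WEIGHTS = [2.0, -1.5, 0.5, 1.0]  # hypothetical toy model
BIAS = -0.2
LEVELS = 4                       # number of rounding steps per unit


def quantize(x):
    """Round each value to a coarse grid, a crude stand-in for compression."""
    return [round(xi * LEVELS) / LEVELS for xi in x]


def predict(x):
    """Defended model: inputs are quantized before classification."""
    z = sum(w * xi for w, xi in zip(WEIGHTS, quantize(x))) + BIAS
    return 1.0 / (1.0 + math.exp(-z))


def numeric_gradient(x, h=1e-4):
    """Finite-difference gradient through the full defended model."""
    base = predict(x)
    grads = []
    for i in range(len(x)):
        bumped = list(x)
        bumped[i] += h
        grads.append((predict(bumped) - base) / h)
    return grads


def bypass_gradient(x):
    """Gradient computed as if quantize() were the identity function."""
    p = predict(x)  # the forward pass still uses the real defended model
    return [p * (1.0 - p) * w for w in WEIGHTS]


x = [0.1, 0.9, 0.2, 0.1]
print(numeric_gradient(x))  # all zeros: the staircase hides the slope
print(bypass_gradient(x))   # nonzero: a usable direction for an attack
```

Because rounding turns the model into a staircase, tiny nudges to the input change nothing and the measured gradient is zero. Ignoring the rounding step during the gradient calculation recovers a direction an attacker can follow, which is why this kind of defense hides the slope rather than removing the vulnerability.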
A stronger approach is to train an algorithm with certain constraints that prevent it from being led astray by adversarial attacks, in a verifiable, mathematical way. "If you can verify, that ends the game," says Pushmeet Kohli, a computer scientist at DeepMind in London. But these verifiable defenses, two of which were presented at ICML, so far do not scale to the large neural networks in modern AI systems. Kohli says there is potential to expand them, but Song worries they will have real-world limitations. "There's no mathematical definition of what a pedestrian is," she says, "so how can we prove that the self-driving car won't run into a pedestrian? You cannot!"
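What "verify" can mean is easiest to see on a model far simpler than a modern neural network. The following sketch assumes a hypothetical linear classifier and an attacker who may change each pixel by at most epsilon. It computes the worst possible score over every allowed perturbation and checks whether the label can flip. It is not one of the methods presented at ICML, only an illustration of the kind of guarantee involved.

```python
WEIGHTS = [2.0, -1.5, 0.5, 1.0]  # hypothetical toy linear classifier
BIAS = -0.2


def score(x):
    """Raw score: positive means 'rifle', negative means 'turtle'."""
    return sum(w * xi for w, xi in zip(WEIGHTS, x)) + BIAS


def worst_case_scores(x, epsilon):
    """Lowest and highest score reachable if each pixel moves by at most epsilon."""
    slack = epsilon * sum(abs(w) for w in WEIGHTS)
    z = score(x)
    return z - slack, z + slack


def is_certified(x, epsilon):
    """True if no allowed perturbation can change the sign of the score."""
    low, high = worst_case_scores(x, epsilon)
    return low > 0 or high < 0


x = [0.1, 0.9, 0.2, 0.1]
print(is_certified(x, 0.1))  # True: the label provably cannot flip
print(is_certified(x, 0.3))  # False: some perturbation might flip it
```

For a linear model this bound is exact and cheap to compute. For deep networks, comparable bounds are much harder to obtain and tend to be loose, which is one reason such defenses struggle to scale.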
Carlini hopes developers will think harder about how their defenses work—and how they might fail—in addition to their usual concern: performing well on standard benchmarking tests. "The lack of rigor is hurting us a lot," he says.