Hi Lobsters! I’m one of the researchers who worked on this. Both Nicholas Carlini (a co-author on this paper) and I have done a bunch of work on machine learning security (specifically, adversarial examples), and we’re happy to answer any questions here!
Adversarial examples can be thought of as “fooling examples” for machine learning models. For example, for an image classifier and an image x that it classifies correctly, an adversarial example is an image x* that is visually similar to x but is classified incorrectly.
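To make the definition concrete, here is a toy sketch (not code from the paper). It assumes a hypothetical 3-feature linear classifier with made-up weights, and uses a single signed-gradient step in the style of the fast gradient sign method. Here "visually similar" is approximated by a bound `eps` on how much any single coordinate may change.

```python
def predict(w, b, x):
    """Toy linear binary classifier: class 1 if w.x + b > 0, else class 0."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score > 0 else 0

def sign(v):
    """Return -1, 0, or 1."""
    return (v > 0) - (v < 0)

def signed_gradient_step(w, b, x, eps):
    """Move every coordinate by at most eps against the current class.

    For a linear model the gradient of the score w.r.t. x is just w.
    """
    direction = -1 if predict(w, b, x) == 1 else 1
    return [xi + direction * eps * sign(wi) for xi, wi in zip(x, w)]

# Hypothetical weights and input, chosen only for illustration.
w = [1.0, -2.0, 0.5]
b = 0.0
x = [0.3, 0.1, 0.2]

x_adv = signed_gradient_step(w, b, x, eps=0.1)
print(predict(w, b, x), predict(w, b, x_adv))
```

The perturbed input differs from x by at most 0.1 in every coordinate, yet the predicted class changes. Real image classifiers are nonlinear, so attacks there typically take many small gradient steps instead of one.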
We evaluated the security of 8 defenses accepted at ICLR 2018 (one of the top machine learning conferences) and found that 7 are broken. Our attacks succeeded where others failed because we show how to work around defenses that cause gradient-descent-based attack algorithms to fail.
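A rough illustration of how a defense can make gradient-based attacks fail, and of one way around it (a toy sketch, not code from the paper). The `quantize` "defense", the linear model, and all the numbers are assumptions made up for this example. The workaround shown is to treat the non-differentiable step as the identity when computing gradients.

```python
W = [1.0, -2.0, 0.5]  # hypothetical linear model weights

def quantize(x, levels=4):
    """Hypothetical defense: snap inputs to a coarse grid.

    The function is piecewise constant, so its gradient is zero almost everywhere.
    """
    return [round(xi * levels) / levels for xi in x]

def defended_score(x):
    """Model score after the defense is applied; > 0 means class 1."""
    return sum(wi * qi for wi, qi in zip(W, quantize(x)))

def numeric_grad(f, x, h=1e-4):
    """Central finite-difference gradient of f at x."""
    grad = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        grad.append((f(xp) - f(xm)) / (2 * h))
    return grad

def sign(v):
    return (v > 0) - (v < 0)

x = [0.3, 0.1, 0.2]

# The true gradient through the defense is zero, so a naive attack has no direction.
naive_grad = numeric_grad(defended_score, x)

# Workaround: pretend quantize is the identity on the backward pass,
# so the gradient of the score w.r.t. x is approximated by W.
approx_grad = W
eps = 0.15
x_adv = [xi - eps * sign(gi) for xi, gi in zip(x, approx_grad)]

print(naive_grad, defended_score(x), defended_score(x_adv))
```

The naive gradient is all zeros, which is what stalls a gradient-descent attack, while the approximate gradient still produces an input that the defended model misclassifies.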
This is neat! Adversarial examples at the moment seem to be more of an arms race than anything deeply understood, so I’m not surprised that the new ways of making neural nets resistant to existing attacks would end up broken. I didn’t expect it to happen literally within days of the ICLR 2018 accepted paper list coming out, though.
Do you have any sense of whether constructing more robust defenses is plausible? The only paper I’ve run across that I feel gave me any deeper theoretical insight into adversarial examples is another ICLR 2018 paper, “Adversarial Spheres” by Gilmer et al. Is there anything else out there?
As far as the “arms race” is concerned: I think part of the problem is that many papers simply don’t consider adaptive attacks. In our paper, we noted that some defenses (e.g. Xie et al. 2018) could already be circumvented with a technique that had been described half a year earlier: the existence of robust adversarial examples implies that most defenses based on randomized input transformations are probably broken. I would guess that fewer defenses would be published if they were thoroughly evaluated against adaptive attacks.
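A toy sketch of why randomized input transformations offer little protection against an adaptive attacker (not code from the paper). The attacker averages gradients over sampled transformations, the idea usually called Expectation over Transformation. The random scale-and-noise transform, the linear model, and every constant below are assumptions chosen for illustration.

```python
import random

W = [1.0, -2.0, 0.5]  # hypothetical linear model; score > 0 means class 1

def score(x):
    return sum(wi * xi for wi, xi in zip(W, x))

def sample_transform(rng):
    """Draw one random transform t(x) = s * x + n (the hypothetical defense)."""
    s = rng.uniform(0.8, 1.2)
    n = [rng.uniform(-0.03, 0.03) for _ in W]
    return s, n

def apply_transform(t, x):
    s, n = t
    return [s * xi + ni for xi, ni in zip(x, n)]

def grad_through(t):
    """Gradient of score(t(x)) w.r.t. x; for this linear toy it is s * W."""
    s, _ = t
    return [s * wi for wi in W]

def averaged_gradient(rng, samples=100):
    """Average the gradient over many sampled transforms."""
    total = [0.0] * len(W)
    for _ in range(samples):
        g = grad_through(sample_transform(rng))
        total = [a + gi for a, gi in zip(total, g)]
    return [a / samples for a in total]

def class1_rate(x, rng, trials=1000):
    """Fraction of random transforms under which x is classified as class 1."""
    hits = sum(score(apply_transform(sample_transform(rng), x)) > 0
               for _ in range(trials))
    return hits / trials

def sign(v):
    return (v > 0) - (v < 0)

rng = random.Random(0)
x = [0.3, 0.1, 0.2]
g = averaged_gradient(rng)
eps = 0.12
x_adv = [xi - eps * sign(gi) for xi, gi in zip(x, g)]

clean_rate = class1_rate(x, rng)
adv_rate = class1_rate(x_adv, rng)
print(clean_rate, adv_rate)
```

In this toy setup the clean input is classified as class 1 under every sampled transform, and the perturbed input under none of them, so the randomness does not help the defender once the attacker optimizes over it.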
I think the “Towards Deep Learning Models Resistant to Adversarial Attacks” paper (Madry et al.) is pretty good at developing some theory: it views adversarial examples through the lens of robust optimization. I think adversarial training is the only defense technique that has demonstrated robustness against white-box attacks in a reasonable threat model (Madry’s paper considers white-box access and a bound on the l-infinity perturbation that the attacker is allowed to make).
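For reference, the robust optimization view is usually written as a saddle-point problem like the one below. The notation is an assumption for this sketch and may differ from the paper's: θ are the model parameters, D is the data distribution, L is the training loss, δ is the attacker's perturbation, and ε is the l-infinity bound.

```latex
\min_{\theta} \; \mathbb{E}_{(x, y) \sim \mathcal{D}}
  \left[ \max_{\|\delta\|_\infty \le \epsilon} L(\theta, x + \delta, y) \right]
```

The inner maximization is the attacker finding the worst allowed perturbation, and the outer minimization is training against it, which is roughly what adversarial training approximates.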