How to fool a 27M-parameter model with a bit of Python Do you think it is impossible to fool the vision system of a self-driving Tesla car? Or that machine learning models used in malware detection software are too good to be evaded by hackers? Or that face recognition systems in airports are bulletproof? Like any of us machine learning enthusiasts, you might fall into the trap of thinking that deep models used out there are perfect. Well, you are WRONG. There are easy ways to build that can fool any deep learning model and create security issues. In this post, we will cover the following: adversarial examples What are adversarial examples? How do you generate adversarial examples? Hands-on example: let’s break Inception 3 How to defend your models against adversarial examples All the source code shown in this article is in ** **👇🏽 this Github repo Let’s start! 1. What are adversarial examples? 😈 In the last 10 years, deep learning models have left the academic kindergarten, become big boys, and transformed many industries. This is especially true for computer vision models. When hit the charts in 2012, the deep learning era officially started. AlexNet Nowadays, computer vision models are as good or better than human vision. You can find them in myriad places, including… self-driving cars face recognition medical diagnosis surveillance systems malware detection Until recently, researchers trained and tested machine learning models in a environment, such as in machine learning competitions and academic papers. Nowadays, when deployed in real-world scenarios, security vulnerabilities coming from model errors have become a real concern. laboratory Imagine for a moment that the state-of-the-art-super-fancy-deep learning vision system of your self-driving car was not able to recognize as a stop sign. this stop sign Well, . This stop sign image is an . Think of it as an optical illusion for the model. this is exactly what happened adversarial example Let us look at another example. Below you have two images of a dog that are indistinguishable for us humans. The image on the left is an original picture of a puppy taken by . The one on the right is a slight modification of the first, that I created by adding the noise vector in the central image. Dominika Roseclay Inception v3 correctly classifies the original image as a dog breed (a redbone). However, this same model thinks, with very high confidence, that the modified image I created is a paper towel. In other words, I created an adversarial example. And you will too, as this is the example we will work on later in section 3. An for a computer vision model is an input image with small perturbations, imperceptible to the human eye, that causes a wrong model prediction. adversarial example Do not think these 2 examples are rare edge-case examples found after spending tons of time and computing resources. There are easy ways to generate adversarial examples, and this opens the door to serious vulnerabilities of machine learning systems in production. Let’s see how you can generate an adversarial example and fool a state-of-the-art image classification model. 2. How to generate adversarial examples? Adversarial examples 😈 are generated by taking a clean image 👼 that the model correctly classifies, and finding a small perturbation that causes the new image to be misclassified by the ML model. White-box scenario Let’s suppose you have complete information about the model you want to fool. In this case, you can compute the loss function of the model where is the input image X is the output class, y and is the vector of the network parameters. θ This loss function is typically the negative log-likelihood for classification methods. Your goal is to find a new image that is close to the original and that produces a big change in the value of the loss function. X’ X Imagine you are inside the space of all possible input images, sitting on top of the original image . This space has dimensions , so I will excuse you if you cannot visualize it well 😜. X width x height x channels To find an adversarial example, you need to walk a little bit in some direction in this space until you find another image with a remarkably different loss. You want to choose the direction that maximizes the change in the loss function for a fixed small step . X’ J epsilon Now, if you refresh a bit your Maths Calculus course, the direction in the space where the loss function changes the most is precisely the gradient of with respect to the . X J X The gradient of a function with respect to one of its variables is precisely the direction of maximal change. And by the way, this is the reason people train Neural Networks using Stochastic Gradient Descend and not Stochastic Random-Direction Descend. Fast gradient sign method An easy way to formalize this intuition is as follows: We take only the sign of the gradient and scale it using a small parameter epsilon, to guarantee that the distortion between and is small enough to be imperceptible to the human eye. This method is called the . X X fast gradient sign method Black-box attack In most scenarios, it is very likely you will not have complete information about the model. Hence, the previous method is not useful as you cannot compute the gradient. However, there exists a remarkable property called of adversarial examples that malicious agents can exploit to break a model even if they do not know its internal architecture and parameters. transferability Researchers have repeatedly observed that quite well between models, meaning that they can be designed for a target model A, but end up being effective against any other model trained on a similar dataset. adversarial examples transfer Adversarial examples can be generated as follows: Query the targeted model with inputs X_i for i = 1 … n and store the outputs y_i. Use the training data (X_i, y_i) to build another model, called the substitute model. Use a white-box algorithm like the fast gradient sign to generate adversarial examples for the substitute model. Many of them are going to transfer successfully and become adversarial examples for the target model as well. A successful application of this strategy against a commercial Machine learning model is presented in . this Computer Vision Foundation paper 3. Hands-on example: let’s break Inception 3 Let’s get our hands dirty and implement a few attacks using Python and the great library PyTorch. It always comes in handy to know how the attacker thinks. You can find the complete code in . this Github repo Our target model is going to be Inception V3, a powerful image classification model developed by Google, which has around 27M parameters and was trained on around 14M images belonging to 20k categories. # Load pretrained model from the PyTorch hub # https://pytorch.org/hub/pytorch_vision_inception_v3/ from torchvision.models import inception_v3 model = inception_v3(pretrained=True) model.eval() # Count model parameters: 27,161,264 n_params = sum(p.numel() for p in model.parameters()) print(f'{n_params:,} parameters') We download the list of classes the model was trained on, and build an auxiliary dictionary that maps class ids to labels. # Download the txt file with the list of ImageNet classes the model was trained with # "!" magic to run shell commands from a Jupyter notebook !wget https://raw.githubusercontent.com/pytorch/hub/master/imagenet_classes.txt # id2label maps classes ids to their human-readable names: e.g. id2label[1] = 'goldfish' with open("imagenet_classes.txt", "r") as f: categories = [s.strip() for s in f.readlines()] id2label = {} for idx, category in enumerate(categories): id2label[idx] = category Let us take the image of an innocent redbone dog as the starting image we will carefully modify to build adversarial examples: import requests import io from PIL import Image url = 'https://previews.123rf.com/images/meinzahn/meinzahn1211/meinzahn121100339/16350068-cute-tiger-cat-isolated-on-white.jpg' response = requests.get(url) img = Image.open(io.BytesIO(response.content)) img Inception V3 expects images with dimensions 299 x 299 and normalized pixel ranges between -1 and 1. Let us preprocess our beautiful dog image: import torch from torch import Tensor from torchvision import transforms def preprocess(img) -> Tensor: """ Inception V3 model from pytorch expects input images with pixel values between -1 and 1 and dimensions 299 x 299 """ mean = [0.485, 0.456, 0.406] std = [0.229, 0.224, 0.225] preprocess_fn = transforms.Compose([ transforms.Resize((299,299)), transforms.ToTensor(), transforms.Normalize(mean, std) ]) image_tensor = preprocess_fn(img) # add batch dimension: C x H x W ==> B x C x H x W image_tensor = image_tensor.unsqueeze(0) return image_tensor x = preprocess(img) and check that the model correctly classifies this image. from easydict import EasyDict import torch.nn.functional as F def get_predictions(img: Tensor) -> EasyDict: output = model.forward(img) class_idx = torch.max(output.data, 1)[1][0].item() label = id2label[class_idx] output_probs = F.softmax(output, dim=1) confidence = round(torch.max(output_probs.data, 1)[0][0].item(), 4) return EasyDict( id=class_idx, label=label, confidence=confidence, ) get_predictions(x) # {'id': 168, 'label': 'redbone', 'confidence': 0.8861} Good. The model works as expected and the dog is classified as a dog :-). redbone redbone Let’s move to the fun part and generate adversarial examples using the Fast Gradient Sign method. from typing import Tuple from torch.autograd import Variable def fast_gradient_sign(x: Tensor, eps: float) -> Tuple[Tensor, Tensor]: """""" # convert tensor into a variable, because we will need to compute gradients # of the loss function with respect to the image pixels img_variable = Variable(x, requires_grad=True) # forward pass on the original image output = model.forward(img_variable) # get predicted class y_true = torch.max(output.data, 1)[1][0].item() target = Variable(torch.LongTensor([y_true]), requires_grad=False) # backward pass to compute gradients loss_fn = torch.nn.CrossEntropyLoss() loss = loss_fn(output, target) # this will calculate gradient of each variable (with requires_grad=True) # that you can later access with "var.grad.data" # PyTorch does the heavy lifting, computing the gradient of the cross-entropy # with respect to the input image pixels. loss.backward(retain_graph=True) # sign of gradient of the loss func (with respect to input X) x_grad = torch.sign(img_variable.grad.data) # fast gradient sign formula x_adversarial = img_variable.data + eps * x_grad return x_adversarial, x_grad # keep epsilon small to generate slight changes to the original image epsilon = 0.02 x_adv, grad = fast_gradient_sign(x, epsilon) I have created an auxiliary function to both the original and the adversarial image. You can see the full implementation in this . visualize GitHub repository Well. It is interesting how the model prediction changed for the new image, which is almost indistinguishable from the original one. The new prediction is a , which is another dog breed with very similar skin color and big ears. As the puppy in question could be a mixed breed, the model mistake seems to be small, so we want to work further to really break this model. bloodhound One possibility is to play with different values of and try to find one that clearly gives a wrong prediction. Let’s try this. epsilon epsilons = [0.02, 0.2, 0.9] for epsilon in epsilons: x_adv, grad = fast_gradient_sign(x, epsilon) print('epsilon: ', epsilon) visualize(x, x_adv, grad, epsilon) As epsilon increases the change in the image becomes visible. However, the model predictions are still other dog breeds: and . We need to be smarter than this to break the model. bloodhound basset Remember the intuition behind the Fast gradient sign method, i.e. imagine yourself inside the space of all possible images (with dimension 299 x 299 x 3), right where the original image X is. The gradient of the loss function tells you the direction you need to step to increase its value and make the model less certain about the right prediction. The size of the step is epsilon. You step and you check if you are now sitting on an adversarial example. An extension of this would be to take many smaller steps, instead of a single one. After each step, you re-evaluate the gradient and decide the new direction you are going to walk towards This method is called the Iterative Fast Gradient method. What an original name! More formally: Where , and denotes clipping of the input in the range of [X−ϵ, X+ϵ]. X0 = X Clip X,ϵ An implementation in PyTorch is the following: def iterative_fast_gradient_sign(x_: Tensor, epsilon: float, n_steps: int, alpha: float): """""" # copy to avoid modifying the original tensor x = x_.clone().detach() for step in range(n_steps): # one step using basic FGSM x_adv, grad = fast_gradient_sign(x, alpha) # total perturbation total_grad = x_adv - x_ # force total perturbation to be lower than epsilon in # absolute value total_grad = torch.clamp(total_grad, -epsilon, epsilon) # add total perturbation to the original image x_adv = x_ + total_grad x = x_adv return x_adv, total_grad Now, let us try again to generate a good adversarial example starting with our innocent puppy. : bloodhound again Step 1 : beagle again Step 2 : mousetrap? Interesting. However, the model confidence on this prediction is only 16%. Step 3 Let’s go further. : One more dog breed, boring… Step 4 : beagle again.. Step 5 : Step 6 : redbone again. Step 7 Keep calm and continue walking in the image space. : Step 8 : BINGO! 🔥🔥🔥 Step 9 What an exotic paper towel. And the model is pretty confident about its prediction, almost 99%. If you compare the initial and the final image we found in step 9 you see they are essentially the same, a , but for the Inception V3 model they are two very different things. puppy I was very surprised the first time I saw such a thing. How small modifications to an image can cause such model misbehavior. This example is funny, but if you think about the implications for a self-driving car vision you might start to worry a bit. Deep learning models deployed in critical tasks need to properly handle adversarial examples, but how? 4. How to defend your models against adversarial examples? As of November 2021, there is no universal defense strategy you can follow to protect against adversarial examples. In other words, attacking a model with adversarial examples is easier than protecting it. The best universal strategy, that can protect against any known attack, is to use adversarial examples when training the model. If the model adversarial examples during training, its performance at prediction time will be better for adversarial examples generated in the same way. This technique is called . sees adversarial training You generate them on the fly, as you train the model, and adjust the loss function to look both at clean and adversarial inputs. For example, we could eliminate the adversarial examples we found in the previous section if we added the 10+ examples to the training set and labelled all of them as . redbone This defense is very effective against attacks that use the Fast Gradient Sign method. However, there exist more powerful attacks, that we did not cover in this post, that can bypass this defense. If you wanna know more I encourage you to read by Nicholas Carlini and David Wagner. this amazing paper This is the reason why adversarial defense is an open problem in Machine Learning and cybersecurity. Conclusion Adversarial examples are a fascinating topic at the intersection of cybersecurity and machine learning. However, it is still an open problem waiting to be solved. In this post, I gave a practical introduction to the topic with code examples. If you would like to work further I suggest , a Python library for adversarial machine learning developed by the great Ian Goodfellow and Nicolas Papernot, and currently maintained by the . cleverhans University of Toronto All the source code I shown in this article is in . this Github repo Wanna become a PRO in Machine Learning? I offer hands-on content on real-world Machine Learning, that helps you get a top job in the Data world. 👉🏽 Subscribe to the . datamachines newsletter 👉🏽 Follow me on and . LinkedIn Twitter Have a great day 🧡 Pau