
Security of machine learning algorithms. Protect and test models using Python



In the previous article, we talked about the machine learning problem known as Adversarial examples and some types of attacks that allow them to be generated. This article discusses algorithms for defending against this kind of effect and gives recommendations for testing models.


Protection


First of all, let us make one thing clear right away: it is impossible to defend fully against this effect, and that is quite natural. After all, if we solved the problem of Adversarial examples completely, we would simultaneously solve the problem of constructing an ideal separating hyperplane, which of course cannot be done without having the entire population of data.


A machine learning model can be protected at two stages:


Training - we train our algorithm to respond correctly to Adversarial examples.


Operation - we try to detect Adversarial examples while the model is already in operation.


It should be said right away that the defense methods presented in this article can be applied using the Adversarial Robustness Toolbox (ART-IBM) from IBM.


Adversarial Training




If you ask a person who has just become familiar with the problem of Adversarial examples, "How can you protect yourself from this effect?", then 9 out of 10 people will certainly say: "Let's add the generated objects to the training set." This approach was proposed in the article Intriguing properties of neural networks back in 2013. It is in this article that the problem was first described, along with the L-BFGS attack that makes it possible to obtain Adversarial examples.


This method is very simple. We generate Adversarial examples using various kinds of attacks and add them to the training sample at each iteration, thereby increasing the model's "resistance" to Adversarial examples.


The disadvantage of this method is quite obvious: at each training iteration a very large number of Adversarial examples can be generated for each object, so the time needed to train the model increases many times over.


You can apply this method using the ART-IBM library as follows.


from art.defences.adversarial_trainer import AdversarialTrainer

# wrap the model and a list of attacks, then retrain on the generated Adversarial examples
trainer = AdversarialTrainer(model, attacks)
trainer.fit(x_train, y_train)
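For illustration, here is a minimal sketch of the same idea without a library. The generate_adversarial function is a hypothetical attack wrapper (for example, FGSM) that is assumed to return perturbed copies of the given objects; it is not part of any of the libraries mentioned here.

import numpy as np

def adversarial_training(model, x_train, y_train, generate_adversarial, n_epochs=10):
    # At every epoch, extend the training set with freshly generated Adversarial examples
    for epoch in range(n_epochs):
        x_adv = generate_adversarial(model, x_train, y_train)   # hypothetical attack call
        x_mixed = np.concatenate([x_train, x_adv])
        y_mixed = np.concatenate([y_train, y_train])
        model.fit(x_mixed, y_mixed, epochs=1)                   # assumes a Keras-like fit()
    return model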

Gaussian Data Augmentation




The following method, described in the article Efficient Defenses Against Adversarial Attacks, uses similar logic: it also suggests adding extra objects to the training set, but unlike Adversarial Training these objects are not Adversarial examples but slightly noisy objects from the training set (Gaussian noise, hence the name of the method). And this indeed seems very logical, because the main problem of these models is precisely their poor resistance to noise.


This method shows results similar to Adversarial Training, while spending much less time generating objects for training.


You can apply this method using the GaussianAugmentation class in ART-IBM.


from art.defences.gaussian_augmentation import GaussianAugmentation

# augment the training set with noisy copies of the original objects
GDA = GaussianAugmentation()
new_x = GDA(x_train)
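The same idea is easy to reproduce by hand. Below is a minimal NumPy sketch, assuming x_train is scaled to [0, 1] and sigma is chosen empirically; it is not the ART-IBM implementation.

import numpy as np

def gaussian_augment(x_train, y_train, sigma=0.1, ratio=1.0):
    # Add noisy copies of a random subset of the training objects
    n = int(len(x_train) * ratio)
    idx = np.random.choice(len(x_train), size=n, replace=False)
    noisy = x_train[idx] + np.random.normal(0.0, sigma, size=x_train[idx].shape)
    noisy = np.clip(noisy, 0.0, 1.0)   # keep the values in the valid pixel range
    return np.concatenate([x_train, noisy]), np.concatenate([y_train, y_train[idx]])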

Label Smoothing


The Label Smoothing method is very simple to implement, but nonetheless carries a lot of probabilistic meaning. We will not go into the details of the probabilistic interpretation of this method; you can find it in the original article Rethinking the Inception Architecture for Computer Vision. In short, Label Smoothing is an additional type of regularization for the classification problem which makes the model more resistant to noise.


In essence, this method smooths the class labels, making them, say, 0.9 instead of 1. Thus, during training the model is penalized for excessive "confidence" in the label of a particular object.


The application of this method in Python can be seen below.


from art.defences.label_smoothing import LabelSmoothing

# replace hard one-hot labels with smoothed ones
LS = LabelSmoothing()
new_x, new_y = LS(train_x, train_y)
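To make the transformation explicit, here is a minimal NumPy sketch of the same operation, assuming y is one-hot encoded and eps is the smoothing strength:

import numpy as np

def smooth_labels(y, eps=0.1):
    # 1 becomes 1 - eps + eps/K and 0 becomes eps/K, so each row still sums to 1
    k = y.shape[1]   # number of classes
    return y * (1.0 - eps) + eps / k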

Bounded ReLU




When we talked about attacks, many may have noticed that some attacks (JSMA, OnePixel) depend on how strong the gradient is at a particular point of the input image. The Bounded ReLU method tries to combat this in a simple and "cheap" way (in terms of computation and time).


The essence of the method is as follows. We replace the ReLU activation function in the neural network with one that is bounded not only from below but also from above. This smooths the gradient maps, so that no spike can occur at specific points, and changing a single pixel of the image is no longer enough to deceive the algorithm.



\begin{equation*}
f(x) =
\begin{cases}
0, & x < 0 \\
x, & 0 \leq x \leq t \\
t, & x > t
\end{cases}
\end{equation*}
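A minimal NumPy sketch of this activation, where t is the chosen upper bound:

import numpy as np

def bounded_relu(x, t=6.0):
    # ReLU clipped from above at t, exactly as in the formula
    return np.clip(x, 0.0, t)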


This method was also described in the article Efficient Defenses Against Adversarial Attacks.


Building Model Ensembles


It is not that difficult to deceive a single trained model. It is much harder to deceive two models at once with the same object. And what if there are N such models? This is the basis of the model ensemble method: we simply build N different models and aggregate their outputs into a single answer. If the models are also built on different algorithms, it is still possible to deceive such a system, but it is extremely difficult!


It is quite natural that building model ensembles is a purely architectural approach that raises many questions (Which base models should be taken? How should the outputs of the base models be aggregated? Is there a relationship between the models? And so on). It is for this reason that this approach is not implemented in ART-IBM.
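Nevertheless, the aggregation step itself is easy to sketch. In the snippet below the base models and their predict_proba method are assumptions (any set of trained classifiers with probabilistic outputs would do), and the outputs are combined by simple probability averaging.

import numpy as np

def ensemble_predict(models, x):
    # Average the class probabilities of N base models and take the most likely class
    probs = np.mean([m.predict_proba(x) for m in models], axis=0)
    return np.argmax(probs, axis=1)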


Feature Squeezing


This method, described in the article Feature Squeezing: Detecting Adversarial Examples in Deep Neural Networks, works during the model's operation phase. It makes it possible to detect Adversarial examples.


The idea behind this method is as follows: if we train n models on the same data but with different degrees of compression, the results of their work will still be similar. At the same time, an Adversarial example that works on the original network is likely to fail on the additional ones. Thus, by taking the pairwise differences between the outputs of the original neural network and the additional ones, selecting the maximum of these differences and comparing it with a pre-selected threshold, we can assert that the input object is either Adversarial or perfectly valid.


Below is a method to get compressed objects using ART-IBM.


from art.defences.feature_squeezing import FeatureSqueezing

# compress (squeeze) the features of the input objects
FS = FeatureSqueezing()
new_x = FS(train_x)
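The detection rule described above can be sketched as follows. Here model is the original classifier, squeezers is a list of hypothetical compression functions (for example, bit-depth reduction or median filtering), and threshold is picked in advance on validation data; this is an illustration of the idea, not the API of the original paper.

import numpy as np

def is_adversarial(model, squeezers, x, threshold):
    # Compare predictions on x with predictions on its squeezed versions
    p_orig = model.predict(x)
    diffs = [np.sum(np.abs(p_orig - model.predict(sq(x))), axis=1) for sq in squeezers]
    return np.max(diffs, axis=0) > threshold   # True means the object is flagged as Adversarial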

This concludes the overview of defense methods. But it would be wrong to skip one important point: if an attacker does not have access to the input and output of the model, he will not understand how the raw data is processed inside your system before it reaches the model. Then, and only then, all of his attacks are reduced to a random search over input values, which of course is unlikely to lead to the desired result.


Testing


Now let's talk about testing algorithms to counter Adversarial examples. Here, first of all, it is necessary to understand how we will test our model. If we assume that in some way an attacker can get full access to the entire model, then our model should be tested using WhiteBox attacks.


In the other case, we assume that the attacker will never gain access to the "insides" of our model but will be able, even if indirectly, to influence the input data and see the result of the model's work. Then the methods of BlackBox attacks should be used.


The general testing algorithm can be described by the following example:




Let there be a trained neural network written in TensorFlow (TF NN). Based on expert judgment, we assume that our network can fall into the hands of an attacker who breaks into the system where the model is located. In this case, we need to conduct WhiteBox attacks. To do this, we define a pool of attacks and frameworks (FoolBox - FB, CleverHans - CH, Adversarial Robustness Toolbox - ART) that allow these attacks to be implemented. Then, based on how many attacks were successful, we calculate the success rate (SR). If the SR suits us, we end the testing; otherwise we apply one of the defense methods, for example those implemented in ART-IBM. Then we carry out the attacks again and recalculate the SR. We repeat this cycle until the SR satisfies us.
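Schematically this cycle can be written in code as follows. The attack objects, the apply_defence function and the target SR are assumptions for illustration and do not correspond to a fixed API of FoolBox, CleverHans or ART-IBM.

def test_model(model, attacks, x_test, y_test, apply_defence, target_sr=0.1):
    # Attack the model, measure the success rate (SR), defend and repeat until SR is acceptable
    while True:
        n_success = 0
        for attack in attacks:                         # pool of WhiteBox (or BlackBox) attacks
            x_adv = attack.generate(model, x_test)     # hypothetical attack interface
            preds = model.predict(x_adv).argmax(axis=1)
            n_success += int((preds != y_test).sum())
        sr = n_success / (len(attacks) * len(y_test))
        if sr <= target_sr:                            # SR suits us -> stop testing
            return model, sr
        model = apply_defence(model)                   # e.g. one of the ART-IBM defences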


Conclusions


At this point, I would like to finish with general information about attacks, defenses, and testing machine learning models. Summarizing the two articles, we can conclude the following:


  1. You should not believe in machine learning as a kind of miracle that can solve all your problems.
  2. Applying machine learning algorithms in your tasks, think about how this algorithm is resistant to such a threat as Adversarial examples.
  3. The algorithm can be protected both from the side of machine learning and from the side of the system in which this model is operated.
  4. Test your models, especially in cases where the result of the model directly affects the decision being made.
  5. Libraries like FoolBox, CleverHans, ART-IBM provide a convenient interface for attacking and protecting machine learning models.

Also in this article I would like to summarize the work with the FoolBox, CleverHans and ART-IBM libraries:


FoolBox is a simple and clear library for running attacks on neural networks, supporting many different frameworks.


CleverHans is a library that allows you to carry out attacks while varying many attack parameters; it is a little more complicated than FoolBox and supports fewer frameworks.


ART-IBM is the only library described above that supports defense methods; so far it supports only TensorFlow and Keras, but it is developing faster than the others.


Here it is worth saying that there is another library for working with Adversarial examples from Baidu, but, unfortunately, it is suitable only for people who speak Chinese.


In the next article on this topic, we will examine part of the task that was proposed to be solved in the course of ZeroNights HackQuest 2018 by deceiving a typical neural network using the FoolBox library.



Source: https://habr.com/ru/post/438644/