Explaining the results of machine learning models has become a default expectation in recent years. Explanations give the users of these APIs more visibility into the model's decision-making process and build trust in the model outputs. However, if you are considering shipping an explanation API alongside your machine learning APIs, it is worth asking: how safe is it against external attacks?

There are various ways in which an attacker can exploit machine learning models deployed as APIs, but the one I want to look at today is model extraction. In a model extraction attack, an adversary with access to a machine learning model's API uses that API to train a surrogate model that reproduces the results of the target model.

Let's say you have deployed a brand new model, AwesomeML, which is proprietary to your research. Even with a rate limit on your API, an attacker can build a near-identical model with just a few hundred to a few thousand queries. This is tedious, however, and requires the attacker to understand the model family, the training data, and the deployment. Now assume that alongside AwesomeML you have also deployed a post-hoc explanation model, AwesomeExplain. Not only does this give the attacker more information to estimate your model accurately, it can also let the attacker carry out the extraction in far fewer queries.
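To make this concrete, here is a minimal sketch of plain label-only extraction. The target API, its secret decision rule, the query budget, and the surrogate architecture are all toy assumptions for illustration, not taken from any real service:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Hypothetical stand-in for the deployed AwesomeML API: the attacker
# only ever sees predicted labels, never the model's parameters.
def awesome_ml_api(x):
    return (x[:, 0] + x[:, 1] > 1.0).astype(int)  # secret decision rule

rng = np.random.default_rng(0)

# 1. The attacker samples queries from their guess of the input distribution.
queries = rng.uniform(0, 1, size=(500, 2))

# 2. They label each query with the target API's answer.
labels = awesome_ml_api(queries)

# 3. They fit a surrogate model on the (query, answer) pairs.
surrogate = MLPClassifier(hidden_layer_sizes=(50,), max_iter=2000, random_state=0)
surrogate.fit(queries, labels)

# The surrogate now approximates the target without access to its weights.
X_test = rng.uniform(0, 1, size=(1000, 2))
agreement = (surrogate.predict(X_test) == awesome_ml_api(X_test)).mean()
```

With enough queries the surrogate tracks the target closely; the point of the paper discussed below is that explanation APIs shrink that query budget dramatically.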

There has been significant work on model extraction showing how these model APIs can be replicated with high fidelity and high accuracy. This can have a large business impact, especially for Machine-Learning-as-a-Service offerings. Even complex models like BERT aren't immune to model extraction: see "Thieves on Sesame Street! Model Extraction of BERT-based APIs" (https://openreview.net/pdf?id=Byl5NREFDr).

In this blog, I would like to write about "Model Extraction From Counterfactual Explanations" by Ulrich Aïvodji, Alexandre Bolot, and Sébastien Gambs. The work focuses on how an adversary can leverage counterfactual explanations (a post-hoc explanation method) to build high-fidelity and high-accuracy model extraction attacks. [reference: https://arxiv.org/abs/2009.01884]

What are the assumptions?
The adversary has domain knowledge of the API they are trying to exploit. They also have access to the model API, which provides explanations alongside the model output, and they know how the model is deployed. This information is generally available for many ML-as-a-Service applications, and an attacker can leverage it to gain information.

What information is the attacker trying to gain from this attack?

Model extraction can have multiple intents, but there are two main categories to look at: accuracy-based and fidelity-based model extraction attacks. In an accuracy-based attack, the attacker tries to create a surrogate model whose accuracy is as close as possible to the target model's. This can bring financial benefits to the attacker, who can then use the surrogate model instead of the paid API.

Fidelity-based attacks instead try to maximise the surrogate's agreement with the target model itself. With the following notation:

  • Sa - the surrogate model
  • B - the target model
  • Xr ⊂ X - a reference set

fidelity is the fraction of reference points on which the two models agree:

Fid(Sa, B) = (1 / |Xr|) · Σ_{x ∈ Xr} 1[Sa(x) = B(x)]
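In code, fidelity is just the empirical agreement rate between the surrogate and the target over the reference set. The two decision rules below are hypothetical, purely to illustrate the metric:

```python
import numpy as np

def fidelity(surrogate, target, X_ref):
    """Fid(Sa, B): fraction of reference points where Sa and B agree."""
    return float(np.mean(surrogate(X_ref) == target(X_ref)))

# Hypothetical 1-D decision rules whose thresholds differ slightly.
target_model = lambda X: (X[:, 0] > 0.50).astype(int)
surrogate_model = lambda X: (X[:, 0] > 0.45).astype(int)

X_ref = np.random.default_rng(1).uniform(0, 1, size=(10_000, 1))
fid = fidelity(surrogate_model, target_model, X_ref)
# The models disagree only on the ~5% of points between the two thresholds,
# so fidelity comes out near 0.95.
```

Note that fidelity is measured against the target model's outputs, not against ground-truth labels, which is what distinguishes it from plain accuracy.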

Does explanation API increase the risk of model extraction, and how much?

In the paper, the experiments were conducted on multiple datasets: Adult Income, Default Credit, and COMPAS. The target models are MLPs with 1-2 hidden layers of 50-75 neurons. This architecture information is, however, not made available to the surrogate model (or, by extension, to the attacker).

The paper goes through multiple scenarios with different assumptions about the attacker's knowledge of the training data distribution. It also evaluates both settings where the API returns one counterfactual explanation per query and settings where it returns multiple, diverse counterfactual explanations per query.

Looking at the results, the counterfactual explanations could be leveraged to achieve very good accuracy with just a small number of queries. In particular, when the attacker has partial or complete knowledge of the training data distribution, ~500 queries were enough to obtain high accuracy and high fidelity!
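The intuition behind why so few queries suffice can be sketched as follows: each query returns not just a label but also a counterfactual lying just across the decision boundary, so every query yields two labelled points, with the counterfactuals concentrated exactly where the boundary is. The target rule and counterfactual generator below are toy assumptions, not the paper's setup:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical target API that, like the paper's setting, returns both a
# prediction and a counterfactual: a nearby input with the opposite label.
def target_with_counterfactual(x):
    label = int(x[0] + x[1] > 1.0)
    gap = 1.0 - (x[0] + x[1])
    # Shift x just across the boundary x0 + x1 = 1.
    cf = x + (gap + np.sign(gap) * 0.01) / 2.0
    return label, cf

X_train, y_train = [], []
for _ in range(50):  # far fewer queries than label-only extraction needs
    x = rng.uniform(0, 1, size=2)
    label, cf = target_with_counterfactual(x)
    # Each query yields two labelled points: the query and its counterfactual.
    X_train += [x, cf]
    y_train += [label, 1 - label]

surrogate = LogisticRegression().fit(np.array(X_train), y_train)

# The counterfactuals hug the true boundary, so even a simple surrogate
# recovers it from very few queries.
X_test = rng.uniform(0, 1, size=(2000, 2))
y_test = (X_test.sum(axis=1) > 1.0).astype(int)
agreement = (surrogate.predict(X_test) == y_test).mean()
```

The design choice worth noticing: the counterfactual points are the most informative possible training examples, because they pin down the boundary rather than sampling the input space blindly.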

So what do you do now?

The authors recommend a couple of countermeasures:

  • To deter an attacker who is motivated to steal the model and redeploy it for their own benefit, embedding watermarks can be useful. These watermarks can be detected if the surrogate model is made publicly available.
  • Query monitoring and auditing techniques can also prevent this to a certain extent.
  • Auditing your explanation API and looking for suspicious query patterns adds another layer of security, as does rate-limiting consecutive/programmatic use of the API.
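As an illustration of the query-monitoring idea, a minimal per-client sliding-window rate limiter might look like this. The class name, thresholds, and interface are my own sketch, not from the paper:

```python
import time
from collections import deque

class QueryMonitor:
    """Sliding-window rate limiter for an explanation API.
    Thresholds here are illustrative defaults, not recommendations."""

    def __init__(self, max_queries=100, window_seconds=60.0):
        self.max_queries = max_queries
        self.window = window_seconds
        self.log = {}  # client_id -> deque of query timestamps

    def allow(self, client_id, now=None):
        now = time.monotonic() if now is None else now
        q = self.log.setdefault(client_id, deque())
        # Drop timestamps that have aged out of the window.
        while q and now - q[0] > self.window:
            q.popleft()
        if len(q) >= self.max_queries:
            return False  # throttle/flag: possible programmatic extraction
        q.append(now)
        return True
```

A real deployment would pair this with anomaly detection on query content (e.g. inputs that systematically walk the decision boundary), but even a simple limiter raises the cost of the ~500-query attacks described above.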