Why Allowing Multiple Queries to a Dataset Weakens Differential Privacy

Differential privacy is a moderately complex security topic. Briefly, and loosely, if you have a dataset (such as Census data) you don’t want queries such as “What is the average age of people in the dataset?” to unintentionally reveal information about a specific person in the dataset.

One of the main ways to prevent information leakage is to add random noise to query results. For queries that return a numeric result, a common technique is to add a random value drawn from the Laplace distribution. (See my post at https://jamesmccaffrey.wordpress.com/2021/11/05/understanding-the-laplace-distribution-for-differential-privacy for an explanation.) The idea is that the returned result won’t be completely accurate, but in many situations the approximate result is good enough to be useful.
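To make the idea concrete, here is a minimal sketch of a noisy numeric query. The function name noisy_average_age, the made-up list of ages, and the seed value are all assumptions for illustration, not part of any particular library or of the demo at the end of this post.

```python
import numpy as np

def noisy_average_age(ages, scale, rng):
  # exact query result: the true average age
  true_result = float(np.mean(ages))
  # Laplace noise centered at 0; a larger scale means more noise
  noise = rng.laplace(loc=0.0, scale=scale)
  return true_result + noise

rng = np.random.default_rng(0)   # seeded only so runs are repeatable
ages = [25, 31, 33, 40, 36]      # made-up data; the true average is 33.0
print("noisy average age = %0.2f " % noisy_average_age(ages, 1.0, rng))
```

Each call returns a slightly different answer, so a single query does not pin down the exact average.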

However, if you allow users to query a dataset repeatedly, a user who issues enough queries can estimate the true result, and the true result can potentially be used to reveal sensitive information. Some of the noisy results will be greater than the true value and some will be less, but because the noise has a mean of zero, the average of the query results approaches the true value as the number of queries grows.
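The averaging effect can be sketched in a few lines. This is a rough illustration, assuming each query gets independent Laplace noise with mean 0; the helper name avg_of_noisy_queries and the seed are hypothetical. The standard deviation of a Laplace distribution is sqrt(2) times its scale, so the typical error of an average of n noisy results shrinks roughly like sqrt(2) * scale / sqrt(n).

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded only so runs are repeatable
true_result = 33.0
scale = 1.0

def avg_of_noisy_queries(n):
  # n independent noisy query results, averaged by the attacker
  noise = rng.laplace(loc=0.0, scale=scale, size=n)
  return float(np.mean(true_result + noise))

for n in [10, 100, 1000, 10000]:
  avg = avg_of_noisy_queries(n)
  # theoretical standard error of the average for Laplace noise
  std_err = np.sqrt(2.0) * scale / np.sqrt(n)
  print("n = %5d  avg = %7.3f  typical error ~ %0.3f " % (n, avg, std_err))
```

With more queries the average should land closer and closer to 33, which is exactly the leak described above.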

I coded up a quick demo. I set up an arbitrary true dataset value of 33. For queries, I returned the true value plus Laplace noise with loc (mean) = 0 and scale (spread) = 1. For 100 queries, most of the return results were more than 1 away from the true value of 33. But the average of the query results was 32.62, which is within 0.38 of the true value.

The moral of the story is that security is tricky and failure can have bad consequences.

Dog failure has fewer consequences than computer security failure.

Demo code:

# diff_priv_multiple_queries.py

import numpy as np

print("\nBegin multiple queries demo ")
print("\nSetting true dataset query result = 33 ")
print("Setting Laplace noise loc = 0.0, scale = 1.0 \n")

true_result = 33
sum_query_results = 0.0

for i in range(100):
  noise = np.random.laplace(loc=0.0, scale=1.0)  # scale matches the stated setting
  query_result = true_result + noise

  if i % 10 == 0:
    print("query # %3d " % i, end="")
    print("query result = %6.2f " % query_result, end="")

    # report whether this single noisy result is close to the true value
    if np.abs(true_result - query_result) < 1.0:
      print("within 1.0 is TRUE ")
    else:
      print("within 1.0 is FALSE ")

  sum_query_results += query_result

# after the loop, i+1 equals the number of queries issued
avg_query_result = sum_query_results / (i+1)
print("\navg_query result = %6.2f " % avg_query_result)
if np.abs(true_result - avg_query_result) < 1.0:
  print("avg_query_result within 1.0 of true result is TRUE ")
else:
  print("avg_query_result within 1.0 of true result is FALSE ")

print("\nEnd demo ")