Machine Learning and the GDPR

Many of the client companies I talk to bring up the potential impact of the European Union GDPR (General Data Protection Regulation), which is set to take effect on May 25, 2018. The bottom line, even though I'm not an expert in this area, is that I believe the GDPR rules with respect to machine learning are very vague, will likely impose huge costs on all companies that have European customers, and could open the way to massive amounts of litigation.

Loosely speaking, the GDPR sets up strict rules for the handling of personal customer information by any company that does business in Europe. This is a good thing in principle. Potential fines are astronomical (up to 4% of a company's global annual "turnover", but this being a European thing, the exact amount can be determined at the discretion of the GDPR regulators), unfortunately creating an incentive for bad behavior by just about everyone involved.

The ultimate aim of collecting personal information is to make use of it in some way. Although the GDPR rules are labyrinthine, briefly: 1.) personal data must be anonymized (possibly making it useless for many ML applications) or pseudonymized, 2.) algorithms that use personal data must be explainable (though exactly what that means isn't clearly defined), and 3.) decisions reached using personal data, even if they are completely neutral, cannot have an effect related to race, politics, gender, religion, etc., etc., etc.
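To make the distinction between anonymization and pseudonymization concrete, here is a minimal Python sketch of pseudonymization: a direct identifier is replaced with a keyed hash, so records can still be re-linked by whoever holds the key, but not by anyone who merely sees the data. The key and the record fields are made up for illustration, and a real system would manage the key far more carefully.

```python
import hashlib
import hmac

# Hypothetical secret key, stored separately from the data. With the key,
# records can be re-linked to individuals (pseudonymization); without it,
# the identifier is just an opaque string.
SECRET_KEY = b"replace-with-a-real-secret"

def pseudonymize(customer_id: str) -> str:
    """Replace a direct identifier with a keyed hash (HMAC-SHA256)."""
    return hmac.new(SECRET_KEY, customer_id.encode(), hashlib.sha256).hexdigest()

# A made-up customer record; only the identifier is transformed.
record = {"customer_id": "alice@example.com", "age": 34, "balance": 1200.50}
safe_record = {**record, "customer_id": pseudonymize(record["customer_id"])}
```

Note that the non-identifier fields are untouched, which is exactly why pseudonymized data can remain useful for ML while fully anonymized data may not.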


Each of these three areas is quite interesting and very complex, and the vague phrasing of the GDPR "recitals" makes any concrete discussion difficult. Regarding data anonymization, suppose you are a hospital and you want to use a patient's medical information to determine the best set of treatments. If your data is completely anonymized, you may not be able to use it effectively.

For algorithm explainability, it's not clear what this means at all. Do companies have to explain the exact algorithm used, thereby giving away a competitive advantage? Or do companies just have to say what class of algorithm (such as a decision tree or a neural network) they're using? One interesting related topic here is called counterfactual information, where a company could say something like, "Your credit application would have been approved if your income had been $10,000 higher." The requirement of algorithmic explainability could be a dream come true for unscrupulous lawyers.
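The counterfactual idea can be sketched with a toy linear credit-scoring model. The coefficient, threshold, and point values below are entirely made up; the point is only that for a simple model, the "what would have flipped the decision" statement can be computed directly without revealing much about the algorithm itself.

```python
# Toy linear credit model: approve if score >= THRESHOLD.
# These numbers are invented for illustration only.
INCOME_COEF = 0.002      # score points per dollar of annual income
THRESHOLD = 200.0

def score(income: float, other_points: float) -> float:
    """Combined score from income plus all other (aggregated) factors."""
    return other_points + INCOME_COEF * income

def income_counterfactual(income: float, other_points: float) -> float:
    """Smallest income increase that would flip a denial to an approval."""
    s = score(income, other_points)
    if s >= THRESHOLD:
        return 0.0  # already approved; no change needed
    return (THRESHOLD - s) / INCOME_COEF

# A denied applicant at $50,000 with 80 other points would need
# exactly $10,000 more income to be approved under this toy model.
needed = income_counterfactual(50_000, other_points=80.0)
```

For nonlinear models such as neural networks there is no closed-form answer like this, which is part of why explainability requirements get murky quickly.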

The problems with disparate impact related to just about any personal category are overwhelming. Virtually any algorithm will have a varying impact on several classes of people. For example, a machine learning advertising recommendation system could discriminate against millionaires who have red hair and are left-handed, thereby making them victims and allowing them to complain to the GDPR authorities, which in turn could legally extract millions or even billions of dollars from the offending company.
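A rough way a company might monitor for this kind of varying impact is to compare decision rates across groups. The sketch below borrows the U.S. "four-fifths" employment guideline as a threshold purely for illustration; the GDPR itself specifies no such number, and the group labels and data are invented.

```python
from collections import defaultdict

def approval_rates(decisions):
    """decisions: list of (group, approved) pairs -> approval rate per group."""
    totals, approved = defaultdict(int), defaultdict(int)
    for group, ok in decisions:
        totals[group] += 1
        approved[group] += ok
    return {g: approved[g] / totals[g] for g in totals}

def disparate_impact_ratio(rates):
    """Ratio of the lowest to the highest group approval rate.
    Values below 0.8 would fail the U.S. 'four-fifths' guideline,
    used here only as a rough illustration, not a GDPR rule."""
    return min(rates.values()) / max(rates.values())

# Made-up decisions: group A approved 75% of the time, group B 25%.
decisions = [("A", 1), ("A", 1), ("A", 0), ("A", 1),
             ("B", 1), ("B", 0), ("B", 0), ("B", 0)]
ratio = disparate_impact_ratio(approval_rates(decisions))  # 0.25 / 0.75
```

Of course, a low ratio by itself doesn't prove discrimination, and with enough candidate groupings (red-haired left-handed millionaires, say), some group will almost always look disadvantaged by chance.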

Of course, my examples here are exaggerations. But the point is, the GDPR makes these crazy scenarios at least feasible. And the cost to a company of defending against such actions could easily put it out of business.

I suspect there will be all kinds of unintended consequences of the GDPR. The regulations could easily stifle machine learning innovation by big companies with lots to lose, and push innovation to small startups. The GDPR could greatly reduce mergers and acquisitions, because a large company that acquires a small company inherits any liability the small company may have. And on and on.

The intentions of the GDPR may have been good, but the realization appears to be very weak. I'm no lawyer (thank goodness), so only time will tell regarding the impact of the GDPR.

Signage with unintended consequences

This entry was posted in Machine Learning.