
See how machine learning can violate your privacy



Machine learning has pushed the boundaries in several fields, including personalized medicine, autonomous cars and personalized ads. Research has shown, however, that these systems memorize aspects of the data they were trained on in order to learn patterns, which raises privacy concerns.

In statistics and machine learning, the goal is to learn from past data to make new predictions or inferences about future data. To achieve this goal, the statistician or machine learning specialist selects a model to capture the suspected patterns in the data. A model applies a simplifying structure to data, which makes it possible to learn patterns and make predictions.

Complex machine learning models have some inherent pros and cons. On the positive side, they can learn much more complex patterns and work with richer data sets for tasks such as image recognition and predicting how a specific person will respond to a treatment.

However, they also run the risk of overfitting the data. This means they make accurate predictions on the data they were trained on, but begin to learn additional aspects of the data that are not directly related to the task at hand. This leads to models that do not generalize well, meaning they perform poorly on new data that is of the same type as, but not exactly the same as, the training data.

While there are techniques to deal with the predictive error associated with overfitting, there are also privacy concerns about being able to learn so much from data.

How Machine Learning Algorithms Make Inferences

Each model has a certain number of parameters. A parameter is an element of a model that can be changed. Each parameter has a value, or setting, that the model derives from the training data. Parameters can be thought of as different knobs that can be turned to affect the performance of the algorithm. While a straight line has only two knobs, the slope and the intercept, machine learning models have many parameters. The language model GPT-3, for example, has 175 billion.
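
To make the idea of parameters concrete, here is a minimal sketch (the function and variable names are illustrative, not from the article) of a straight-line model whose only two knobs are the slope and the intercept:

```python
# A straight-line model has exactly two parameters ("knobs"):
# the slope and the intercept.
def line_model(x, slope, intercept):
    return slope * x + intercept

# Turning the knobs changes every prediction the model makes.
print(line_model(3.0, slope=2.0, intercept=1.0))   # 7.0
print(line_model(3.0, slope=0.5, intercept=-1.0))  # 0.5
```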

To choose the parameters, machine learning methods use training data with the aim of minimizing the predictive error in the training data. For example, if the goal is to predict whether a person would respond well to a certain medical treatment based on their medical history, the machine learning model would make predictions on the data where the model developers would know whether someone responded well or poorly. The model is rewarded for correct predictions and penalized for incorrect predictions, which leads the algorithm to adjust its parameters – that is, turn some “knobs” – and try again.
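
As a rough illustration of this reward-and-adjust loop (a hand-rolled sketch with made-up numbers, not the training code of any real system), the example below nudges the two knobs of a straight-line model to reduce its squared prediction error on a tiny training set:

```python
# Toy training data: inputs and the outcomes we already know.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]   # generated by y = 2x + 1

slope, intercept = 0.0, 0.0   # start with the knobs at arbitrary settings
learning_rate = 0.01

for step in range(5000):
    # How the mean squared error changes as each knob is turned.
    grad_slope = sum(2 * (slope * x + intercept - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_intercept = sum(2 * (slope * x + intercept - y) for x, y in zip(xs, ys)) / len(xs)
    # "Turn the knobs" a little in the direction that reduces the error.
    slope -= learning_rate * grad_slope
    intercept -= learning_rate * grad_intercept

print(round(slope, 2), round(intercept, 2))  # approaches 2.0 and 1.0
```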

To avoid overfitting the training data, machine learning models are also checked against a validation dataset. The validation dataset is a separate dataset that is not used in the training process. By checking the model's performance on this validation dataset, developers can ensure that the model is able to generalize its learning beyond the training data, avoiding overfitting.
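
In practice this check is often just a comparison of errors on held-out data. The sketch below (using synthetic data and the scikit-learn library, neither of which the article mentions, purely as an illustration) flags likely overfitting when the training error is much lower than the validation error:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Synthetic data standing in for real training records.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)

# Hold out a separate validation set that training never sees.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

# An unconstrained decision tree can effectively memorize the training data.
model = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

train_err = mean_squared_error(y_train, model.predict(X_train))
val_err = mean_squared_error(y_val, model.predict(X_val))
print(f"training error: {train_err:.3f}, validation error: {val_err:.3f}")
# A large gap between the two errors is the classic symptom of overfitting.
```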

Although this process can ensure good performance of the machine learning model, it does not directly prevent the machine learning model from memorizing information in the training data.

Privacy concerns

Due to the large number of parameters in machine learning models, there is a possibility that the model memorizes some of the data it was trained on. In fact, this is a widespread phenomenon, and users can extract the memorized data from the model by crafting queries designed to retrieve it.
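
One widely studied way to probe for this kind of memorization is a membership-inference test: if a model is noticeably more confident on a particular record than on comparable records it never saw, that record was probably part of its training data. The sketch below is a simplified illustration of that idea on a toy model, not an actual extraction attack on any specific system:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Toy "sensitive" dataset: only the first 100 rows are used for training.
X = rng.normal(size=(200, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)
X_train, y_train = X[:100], y[:100]
X_out, y_out = X[100:], y[100:]          # records the model never saw

model = LogisticRegression().fit(X_train, y_train)

def per_record_loss(x, label):
    # Negative log-likelihood of the true label for a single record.
    p = model.predict_proba(x.reshape(1, -1))[0][label]
    return -np.log(p + 1e-12)

# Records the model has memorized tend to have unusually low loss compared
# with records drawn from the same population but never used in training.
out_losses = [per_record_loss(x, l) for x, l in zip(X_out, y_out)]
threshold = np.mean(out_losses)
flagged = sum(per_record_loss(x, l) < threshold for x, l in zip(X_train, y_train))
print(f"{flagged} of {len(X_train)} training records look 'memorized' by this crude test")
```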

If the training data contains sensitive information, such as medical or genomic data, the privacy of the people whose data was used to train the model could be compromised. Recent research has shown that it is in fact necessary for machine learning models to memorize aspects of the training data in order to achieve optimal performance on certain problems. This indicates that there may be a fundamental trade-off between the performance of a machine learning method and privacy.

Machine learning models also make it possible to predict sensitive information using seemingly non-sensitive data. For example, Target was able to predict which customers were likely pregnant by analyzing the purchasing habits of customers who signed up for Target’s baby registry. Once the model was trained on this dataset, it was able to send pregnancy-related ads to customers it suspected were pregnant because they had purchased items such as supplements or unscented lotions.

Is privacy protection even possible?

Although many methods have been proposed to reduce memorization in machine learning, most have been largely ineffective. Currently, the most promising solution to this problem is to guarantee a mathematical limit on privacy risk.

The state-of-the-art method for formal privacy protection is differential privacy. Differential privacy requires that a machine learning model not change much if one individual’s data is changed in the training dataset. Differential privacy methods achieve this guarantee by introducing additional randomness into the learning algorithm that “masks” the contribution of any particular individual. Once a method is protected with differential privacy, no possible attack can violate that privacy guarantee.
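
The core mechanism is easy to illustrate outside of machine learning. The classic Laplace mechanism releases a statistic plus noise calibrated so that any one person’s presence or absence barely changes the distribution of the output. The sketch below is a textbook example with made-up values, not the specific algorithm used to train differentially private models:

```python
import numpy as np

def dp_count(values, predicate, epsilon):
    """Differentially private count of records satisfying `predicate`.

    Adding or removing one person changes the true count by at most 1,
    so Laplace noise with scale 1/epsilon masks any individual's contribution.
    """
    true_count = sum(predicate(v) for v in values)
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

ages = [34, 51, 29, 62, 47, 38, 55]
print(dp_count(ages, lambda a: a > 40, epsilon=0.5))   # a noisy answer near the true count of 4
```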

Even if a machine learning model is trained using differential privacy, that doesn’t stop it from making sensitive inferences, as in the Target example. To prevent these privacy violations, all data transmitted to the organization needs to be protected. This approach is called local differential privacy, and Apple and Google have implemented it.
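
Local differential privacy means each person randomizes their own data before it ever leaves their device. The oldest example of the idea is randomized response, sketched below with an assumed 75% truth-telling probability; the production systems at Apple and Google are considerably more sophisticated than this:

```python
import random

def randomized_response(true_answer: bool, p_truth: float = 0.75) -> bool:
    """Report the true answer with probability p_truth, otherwise a coin flip.

    Each individual's report is already noisy when it reaches the collector,
    so no single response reveals that person's true answer with certainty.
    """
    if random.random() < p_truth:
        return true_answer
    return random.random() < 0.5

# The collector can still estimate the population rate from many noisy reports.
true_answers = [random.random() < 0.3 for _ in range(100_000)]   # true rate of 30%
reports = [randomized_response(a) for a in true_answers]
observed = sum(reports) / len(reports)
# Invert the randomization: observed = p_truth * rate + (1 - p_truth) * 0.5
estimated = (observed - (1 - 0.75) * 0.5) / 0.75
print(f"estimated rate of 'yes' answers: {estimated:.3f}")
```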

Because differential privacy limits how much the machine learning model can depend on any one individual’s data, it prevents memorization. Unfortunately, it also limits the performance of machine learning methods. Because of this trade-off, there is criticism of the usefulness of differential privacy, as it often results in a significant drop in performance.
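
The trade-off follows directly from how the noise is calibrated: the stronger the privacy guarantee (a smaller privacy parameter, usually called epsilon), the larger the noise and the less accurate the result. A quick sketch, reusing the same Laplace-count idea as above with an invented statistic:

```python
import numpy as np

true_count = 4000   # e.g. the number of patients in a dataset with some condition
for epsilon in [10.0, 1.0, 0.1, 0.01]:
    # Typical error of one differentially private release, averaged over many trials.
    errors = [abs(np.random.laplace(scale=1.0 / epsilon)) for _ in range(1000)]
    print(f"epsilon={epsilon:>5}: answer is {true_count} +/- about {np.mean(errors):.1f}")
# Smaller epsilon means stronger privacy but noisier, less useful answers.
```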

Looking ahead

Because of the tension between inferential learning and privacy concerns, there is ultimately a social question about what is more important in what contexts. When data does not contain sensitive information, it is easy to recommend using the most powerful machine learning methods available.

When working with sensitive data, however, it is important to weigh the consequences of privacy leaks, and it may be necessary to sacrifice some machine learning performance to protect the privacy of the people whose data trained the model.

This article was republished from The Conversation, an independent, nonprofit news organization that brings you facts and analysis to help you understand our complex world.

It was written by: Jordan Awan, Purdue University.


Jordan Awan receives funding from the National Science Foundation and the National Institutes of Health. He also serves as a privacy consultant for the federal nonprofit MITRE.



