Tyler Kurpanek, Chris Lum, Bradley Nathanson, Trey Scheid
Mentor: Yu-Xiang Wang
In today's data-driven world, the need to protect individual privacy while maintaining the utility of data analysis has become increasingly crucial. Differential Privacy (DP) emerges as a mathematical framework that provides strong privacy guarantees while allowing meaningful statistical analysis.
At its core, differential privacy ensures that the presence or absence of any individual's data in a dataset does not significantly affect the results of any analysis performed on that dataset. This is achieved by carefully introducing random noise into the computation process, making it virtually impossible to reverse-engineer individual records...
...all while maintaining the big picture!
But what just happened? We added noise to the image of the Mona Lisa by probabilistically flipping each pixel. This way, you can still see the big picture, but each individual pixel has some deniability as to what its original value was.
Try for yourself!
What you see is a Differential Privacy technique called Randomized Response. Individual privacy is protected while maintaining some statistical patterns of the whole dataset. It's a simple method that satisfies the definition of Differential Privacy.
The definition of Differential Privacy states that for any two datasets that differ in exactly one record, the probability of getting any particular output from a private algorithm should be similar. This paves the way for a formal guarantee and a way to quantify the privacy of an algorithm.

A randomized algorithm $M$ is $(\varepsilon, \delta)$-differentially private if for all pairs of adjacent datasets $D$ and $D'$, and for all sets of possible outputs $S$:

$$\Pr[M(D) \in S] \le e^{\varepsilon} \Pr[M(D') \in S] + \delta$$

Where:
- $\varepsilon$ is the privacy budget: smaller values force the two output distributions to be closer, i.e., stronger privacy.
- $\delta$ is the small probability with which the $e^{\varepsilon}$ bound is allowed to fail (pure $\varepsilon$-DP has $\delta = 0$).
Read more: Wikipedia
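Tying this back to the Mona Lisa demo, here is a minimal sketch of per-pixel randomized response in Python. This is an illustration of the technique, not the exact code behind the interactive above:

```python
import numpy as np

def randomized_response(pixels: np.ndarray, epsilon: float) -> np.ndarray:
    """Flip each binary pixel with probability 1 / (1 + e^epsilon).

    Keeping a pixel with probability p = e^eps / (1 + e^eps) makes the ratio
    Pr[report 1 | true pixel is 1] / Pr[report 1 | true pixel is 0] equal to
    e^eps, so each reported pixel is epsilon-differentially private.
    """
    p_keep = np.exp(epsilon) / (1.0 + np.exp(epsilon))
    keep = np.random.rand(*pixels.shape) < p_keep
    return np.where(keep, pixels, 1 - pixels)

# Example: privatize a small binary "image"
image = np.random.randint(0, 2, size=(8, 8))
noisy_image = randomized_response(image, epsilon=1.0)
```

With ε = 1, each pixel is kept with probability of roughly 0.73, which is why the big picture survives while any single pixel remains deniable.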
What's important to note is that Differential Privacy is a property of an algorithm, not a property of the data. Another intuitive Privacy-Enhancing Technology might be to anonymize data by removing personally identifiable information, but this isn't enough to guarantee privacy on its own.
In 2006, Netflix created the Netflix Prize, a competition to improve Netflix's movie recommendation algorithm. The dataset used in the competition contained 100 million ratings from 480,000 users on 17,770 movies, anonymized by removing personally identifiable information. One year later, using IMDb ratings as a reference, two researchers from UT Austin were able to deanonymize 99% of the users in the dataset. This is why the rigor of DP is so important. Have you considered what someone could do with your personal watch history, ratings, browsing data, or worse?
Another simple way to privatize data is to add noise to each of the values you plan to release. Here, we're releasing the raw counts of each category by adding noise to each of the counts.
Raw Data | Introduce Noise | Privatized Data
---|---|---
16 | 0.00 | 16.00
32 | 0.00 | 32.00
24 | 0.00 | 24.00

The noise follows a Laplace distribution with scale proportional to 1/ε; this mechanism satisfies (ε, 0)-differential privacy, meaning the probability of the privacy guarantee failing is 0. (The interactive demo is shown at data scale 1x and ε = 1.0, before any noise has been drawn.) Notice how having more data means that the noise affects the big picture less.
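A minimal sketch of this Laplace mechanism, assuming each record contributes to exactly one category so that adding or removing one record changes each count by at most 1:

```python
import numpy as np

def privatize_counts(counts, epsilon, rng=None):
    """Release counts with Laplace noise of scale 1/epsilon.

    With per-record sensitivity 1, this release satisfies
    (epsilon, 0)-differential privacy.
    """
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon, size=len(counts))
    return np.asarray(counts, dtype=float) + noise

print(privatize_counts([16, 32, 24], epsilon=1.0))
```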
Differential Privacy can be used for more complex queries too, such as training a machine learning model. A data scientist may do this because they don't want the model to leak information about individuals in the training data. The question then becomes:
When are differentially private methods practical and useful?
How effective is differential privacy when applied in practice?
When ChatGPT 3.5 was released, it had been trained on many datasets, both publicly and privately available. Clever prompters were able to extract SSNs for individuals, which the model would produce with perfect accuracy! A differentially private training algorithm guarantees that the model's outputs are not significantly different whether or not your SSN was in the training data, which limits how much the model can reveal about any individual's data.
Here, we've trained two logistic regression models, one privately and one non-privately, on the same task: predicting whether an image is a hotdog or not. We trained these models on the same training data, and both achieve similar test accuracies. Plotted below is each model's confidence that the image provided is a hotdog. Try to identify which image was used in the training set! (Hint) Does either model predict an image with a suspiciously high confidence? Also, notice that all of the private model's predicted probabilities seem relatively unconfident, yet its test accuracy shows promising generalizability! (Answer) Image 2 was in the training set! Notice how the non-private model is exceptionally confident on image 2. An attacker could identify this datum as part of the training set (or incorrectly assume so) and proceed to harm that individual.
Overall, differential privacy is a powerful tool for protecting individual privacy while still allowing for useful data analysis. There exist many different ways to implement differential privacy, which requires asking questions like, "Where should we add noise?" or "Are two neighboring datasets defined by a single entry, or a single user's worth of entries?"
A lot of the research in DP has focused on finding the best ways to add noise while maintaining utility, and our study focuses on how applicable DP actually is in practice.
We sought to assess how feasible it is to apply DP to real-world data analysis tasks. Adding noise often reduces utility, so we recreated four different papers that did not use DP and compared their results to our DP-applied counterparts. We focused on tasks using Intel telemetry data to create realistic, high-volume analyses. We found that for high-privacy settings (small ε, e.g., ε = 1) the utility loss is often too great to be practical.
So why would we want to use DP on telemetry data? Telemetry data is data collected by devices such as CPUs or hard drives to monitor things like temperature, power usage, crash metrics, and much, much more. It is often collected in large volumes and is used for diagnostic analysis to improve user experiences.
Each piece of data is typically attributed to a specific device ID. If somebody could link a device ID to a specific user, they could learn a lot about that user from how they use their device. High use of their GPU from noon to 1 PM? Maybe they're a video editor, or maybe they're playing a video game. Maybe you know a set of devices that are used as AWS servers; knowing how they operate could be valuable to a competitor.
This is why we want to apply DP to telemetry data. We want to collect data in a way that allows us to perform useful analyses without compromising user privacy.
For our study, we looked at four different papers that used a variety of different methods in their analyses. We first recreated the study non-privately with the same volume of data to serve as a baseline. Then, based on the methods used in each paper, we applied differential privacy at a variety of different privacy levels to assess how different the final analysis results were from our baseline.
Below you can look at the results of each paper.
Primary contributor: Chris
Algorithm used: Logistic Regression and Significance Testing
This paper sought to assess whether a certain feature was significantly associated with the same day that an uncorrected error occurred (think blue screen of death). There are many different types of uncorrected errors, so they looked at the top 30. They examined two features: daily max temperature and the presence of a corrected error (an error that the OS manages to resolve). For each feature, they fit a univariate logistic regression model (two features × 30 uncorrected error types = 60 models total) to predict whether an uncorrected error occurred on a given day, then ran a statistical test on the coefficient to assess whether that variable was statistically significant at alpha = 0.05.
We looked at how corrected errors predict uncorrected errors and dropped daily max temperature. We also trained on the top thirty uncorrected errors, dropping one due to long compute times, for a total of 29 logistic regression models. Privacy was applied in the form of differentially private gradient descent, which adds a small amount of noise at each step of gradient descent. This does mean that compute scales linearly with epsilon, since reaching a higher epsilon requires more noisy gradient descent steps, so we stopped at epsilon 1.5 for each of the models.
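To make that concrete, here is a rough, full-batch sketch of differentially private gradient descent for logistic regression. It is illustrative only, not our exact implementation, and the privacy accounting across steps would be handled separately by an accountant:

```python
import numpy as np

def dp_gradient_descent_logreg(X, y, n_steps=100, lr=0.1, clip=1.0,
                               noise_multiplier=1.0, rng=None):
    """Toy full-batch DP gradient descent for logistic regression.

    Each step: compute per-example gradients of the log-loss, clip each to
    L2 norm <= clip, average them, and add Gaussian noise with standard
    deviation noise_multiplier * clip / n. The total privacy cost over all
    steps must be tracked with a privacy accountant; more steps at the same
    noise level spend more budget, i.e. a larger epsilon.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_steps):
        probs = 1.0 / (1.0 + np.exp(-(X @ w)))          # sigmoid predictions
        per_example = (probs - y)[:, None] * X          # per-example gradients
        norms = np.linalg.norm(per_example, axis=1, keepdims=True)
        clipped = per_example * np.minimum(1.0, clip / np.maximum(norms, 1e-12))
        noisy_grad = clipped.mean(axis=0) + rng.normal(0.0, noise_multiplier * clip / n, size=d)
        w -= lr * noisy_grad
    return w
```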
In order to ensure that our alpha would have the same statistical power as the non-private version, we ran a permutation test for each private logistic regression model to empirically estimate its p-value. Because of the long compute times at higher epsilon, we used 200 permutations per private model to compare our result against.
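A sketch of that permutation test, where `fit_private_coef` is a hypothetical helper that trains the private logistic regression on the given labels and returns the coefficient of interest:

```python
import numpy as np

def permutation_p_value(fit_private_coef, X, y, observed_coef, n_perm=200, rng=None):
    """Empirical two-sided p-value against a label-permutation null.

    Each permutation breaks any real relationship between X and y, so the
    refit coefficients form a null distribution for the private estimator.
    """
    rng = np.random.default_rng() if rng is None else rng
    null = np.array([fit_private_coef(X, rng.permutation(y)) for _ in range(n_perm)])
    return (np.sum(np.abs(null) >= abs(observed_coef)) + 1) / (n_perm + 1)
```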
Our main method of comparing against the baseline was calculating the intersection over union of the sets of significant models from the private and non-private runs. For "strong DP" (epsilon = 1), the private run severely undershot the number of significant models, failing to find the correct relationship between corrected and uncorrected errors in roughly a third (0.31) of the models.
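For reference, the intersection over union is taken over the sets of uncorrected-error types each pipeline flags as significant (error names below are placeholders):

```python
def significance_iou(private_significant, nonprivate_significant):
    """Intersection over union of the two sets of significant error types."""
    a, b = set(private_significant), set(nonprivate_significant)
    return len(a & b) / len(a | b) if (a | b) else 1.0

# e.g. significance_iou({"err_A", "err_C"}, {"err_A", "err_B", "err_C"}) ≈ 0.67
```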
Overall, as epsilon increased from zero to 0.5, the intersection over union tended to increase. Between 0.5 and 1.5, however, it plateaued and failed to approach the non-private result.
Due to high compute costs, we were only able to get up to epsilon 1.5 for each of the 29 models. As expected, utility increased from very low epsilon, but around ε = 0.5 it is unclear whether it would continue to increase. At ε = 1, the model does somewhat poorly, as seen in Figure 1. It achieves an intersection over union of around 0.60 and notably identifies a majority of the models as not significant, a result contrary to the non-private model. Overall, this analysis task did not seem to be replicable privately, at least under strict privacy constraints.
It is important to note that this analysis would not be nearly as computationally hungry if we weren't comparing against a non-private model. In a practical setting, we wouldn't need permutation testing to find our p-values empirically; as in the non-private setting, alpha would be a hyperparameter we could select ourselves. This means we may be able to use much higher values of epsilon in practice.
For each of our tasks, the privacy-utility tradeoff is not identical. Each task has a different sensitivity to added noise: some can tolerate large amounts of noise without significantly affecting the utility of the results, while others rely on precision and degrade very quickly with even slight noise. As shown in our combined plot, each task follows a significantly different curve as epsilon increases. One factor not considered here is that the added noise may reduce overfitting and could help a model generalize better.
So how feasible is it to apply DP to telemetry data? Over the course of this project, we found that there's no one-size-fits-all solution. Each of our different methods had different levels of utility lost from differential privacy. Below is our discussion of our findings.
A part of assessing the feasibility of applying differential privacy is necessarily going to be a discussion surrounding the process of applying DP itself. We have gotten together to discuss what went well, what went poorly, and what limitations we had to concede.
In the process of applying differential privacy, we found three things most helpful: the mathematical foundations of DP allow a consistent comparison across methods, DP algorithms are intuitive at a high level, and many papers on different DP algorithms already exist, ready to be implemented.
Comparing across methods was made easy by DP's grounding in its definition and its reliance on epsilon. We could easily compare across tasks and observe that some tasks worked well at an epsilon of 1 and some didn't. This structure gave us confidence that our privacy guarantees were equivalent across tasks.
These mathematical foundations do mean that a lot of the research is theory- and math-oriented, but we found that, at a high level, DP algorithms are intuitive and straightforward. They rely on three main ingredients: adding noise/randomness, bounding sensitivity, and privacy accounting. The specific math of how much noise to add or where to clip might be tough, but boiling down an algorithm is often as simple as knowing where the data gets clipped and where the noise gets added.
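As a minimal example of that recipe, here is a sketch of a privatized mean: clip each value to bound the sensitivity, then add noise calibrated to sensitivity / ε. This is illustrative only, and it assumes the record count n is public and adjacency is replace-one:

```python
import numpy as np

def dp_mean(values, lower, upper, epsilon, rng=None):
    """Clipped mean with Laplace noise: clipping bounds sensitivity, noise gives privacy."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(values)
    clipped = np.clip(values, lower, upper)
    sensitivity = upper - lower          # replacing one record moves the sum by at most this
    noisy_sum = clipped.sum() + rng.laplace(0.0, sensitivity / epsilon)
    return noisy_sum / n                 # n assumed public, so no extra budget is spent on it
```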
Many traditional analysis tasks have already been privatized and written about. Applying differential privacy oneself often comes down to finding a paper detailing the mechanism and molding it to your specific use case. The authors of these papers have been especially generous as well, making themselves available by email to talk about their methods. One small caveat: several times we found minor errors in papers, which occasionally made applying the methods difficult, but overall the methods already existed and we just needed to implement them.
In the process, we also found two main difficulties that hindered our analysis tasks: epsilon is difficult to interpret, and it is hard to quantify utility loss.
Epsilon, as a value in the differential privacy inequality, is straightforward in how it compares two probabilities. The issue is what this really looks like in real life. It is hard to build intuition for what an arbitrarily bad event is and with what probability it would occur. We may know that our epsilons are the same, but what protections does that practically assure us? We know that an epsilon of 10 is bad, but how bad is it really? Sure, e^10 is a massive value, but what is the probability that something terrible actually happens?
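One way we found to make ε slightly more concrete is the worst-case Bayesian reading of the definition: an ε-DP release can shift an adversary's odds about your record (for instance, whether it is in the dataset at all) by at most a factor of e^ε. A quick, purely illustrative calculation:

```python
import numpy as np

prior = 0.001                               # adversary's prior belief about your record
for eps in [0.5, 1.0, 5.0, 10.0]:
    prior_odds = prior / (1 - prior)
    worst_odds = np.exp(eps) * prior_odds   # the e^eps bound on the odds shift
    worst_posterior = worst_odds / (1 + worst_odds)
    print(f"eps={eps:>4}: prior {prior:.1%} -> worst-case posterior {worst_posterior:.1%}")
```

At ε = 10, a 0.1% prior can in the worst case become a roughly 96% posterior, which is one way to see how little the guarantee promises; whether that worst case is ever realized for a given release is exactly the kind of intuition that is hard to build.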
On the other hand, we sought to measure utility loss relative to a baseline, which had its own set of difficulties. For some analysis tasks, the non-private version may not be the ground truth; for example, many deep learning models generalize better when noise is added during training. Establishing what exactly constitutes maximum utility was a long conversation. Secondly, for a given amount of utility loss, it's hard to quantify how bad is bad. In our paper, the logistic regression models had an IOU of around 0.60. This is a task with a fairly solid baseline, but how bad is an IOU of 0.60, really? At a more abstract level, what if being 1% off is the difference between 100 million and 99 million lives saved? It's difficult to have a good intuition for what exactly we're losing.
There were a couple of limitations on our ability to carry out these analyses: a general lack of background knowledge, and the fact that we replicated existing papers rather than performing novel analyses.
Six months ago, differential privacy was a new concept to each of us. None of our backgrounds delved deeply into the rigor of mathematical proofs, telemetry data was new to us, and we suffered from a lack of domain knowledge. Researchers or analysts who wish to make similar comparisons may benefit greatly from more knowledge of differential privacy, the domain in question, or both.
The applicability of our study as a commentary on the feasibility of DP methods must be framed knowing that we replicated papers and did not attempt novel analyses from scratch. Having the guidance of the original paper meant there were some steps we did not attempt, or did not do privately, ourselves. We did not try to tune hyperparameters privately, a task that would require substantial domain knowledge or spending some of the privacy budget to find valuable hyperparameters. Further, we already knew which features we wanted; private EDA might consume plenty of privacy budget itself. One could argue that the analyst implementing a DP algorithm already sees the data and need not consider privacy in their own analysis, but then the question arises: whom are we protecting against?
Additionally, some papers were difficult to replicate due to obscurity in their writing or a general lack of detail. A common pitfall was not knowing exactly which "temperature" a paper was referring to. One of us had to assess several papers before finding one that could be replicated.
Overall, we found mixed results on how feasible applying differential privacy is. Some tasks were hardly affected, while others would lead to very different conclusions. There seems to be no universal solution for applying DP; it is task dependent. Different tasks have different success criteria, and different methods vary in how readily they can be privatized.
The feasibility of applying differential privacy seems to rely heavily on the practitioner's knowledge of both DP methods and their own domain. There is a high barrier to entry with differential privacy. An analyst who is familiar with their domain but completely new to DP would struggle to switch their workflow from non-private methods to private counterparts. Further, if the guarantees of DP aren't adequately understood, there is little incentive to put in the effort of trading utility for privacy. A path toward private analyses across the board cannot be purely bottom-up; experts would need to hold the hand of the typical analyst.
We recreated baseline models and algorithms from previous research papers alongside their private counterparts in a practical setting, providing insight into how these privacy-preserving techniques perform in real-world applications. We are not PhD-level researchers; with more academic rigor, this line of work could lead to more promising findings and a deeper understanding of the privacy-utility balance in applied machine learning. Nevertheless, our work demonstrates that with just a few months of practice and an understanding of differential privacy, it is possible to implement privacy-preserving methods and identify an epsilon that balances privacy and utility. As DP becomes even more accessible, implementation will become faster, improving both performance and computation.
Future research could explore alternative differential privacy methods for our tasks, such as applying Lasso regression via the Functional Mechanism to improve utility while maintaining privacy. Additionally, investigating different privacy accounting regimes, such as Rényi differential privacy or zero-concentrated DP, could provide a more flexible trade-off between privacy and accuracy, optimizing the overall performance of the model.
Future work could focus on privatizing additional data tasks to enhance privacy while maintaining analytical utility. One potential task for future privatization is identifying the owning group for addressing a telemetry-detected issue, which could benefit from group-level differential privacy. This approach would help protect sensitive organizational information while still enabling efficient issue resolution.
We could explore several directions to improve and expand differential privacy applications. One avenue is scaling up computations and applying privatization methods to different domains, enabling broader adoption in diverse fields such as gaming analytics, hardware performance, and behavioral studies. Additionally, investigating tasks with varying sensitivity levels could lead to more nuanced privacy strategies, where higher-sensitivity tasks receive stronger protections while lower-sensitivity tasks maintain higher utility.
Another promising direction is leveraging off-the-shelf differential privacy packages, such as Google's DP library or PySyft, to streamline implementation and improve accessibility. This could facilitate the more widespread adoption and standardization of privacy-preserving methods.
Beyond technical advancements, think-aloud studies and longitudinal research could provide valuable insights into how users interact with differentially private systems in real-world settings. By observing users over time, we can refine privacy mechanisms to better align with practical workflows. Finally, validating utility results through alternative testing methods would help ensure that privacy-preserving models maintain effectiveness across different evaluation metrics, strengthening confidence in their real-world applicability.
Thank you for taking your time to read through our project! If you are interested in continuing our work, feel free to reach out to us or check out our project repository and notes.
Special thanks to our advisor, Dr. Yu-Xiang Wang, for his help and confidence in our work. We also want to thank ENCORE for hosting the workshop "Workshop on Defining Holistic Private Data Science for Practice" which helped greatly with our broad understanding of the state of the field of differential privacy in practice.
Click here to view our project repository!
Click here to catch up on our notes!
TL;DR: It's nuanced.