Much of the research conducted in my lab relies on using sensors and other data streams generated by smartphones, online services and IoT devices to understand human memory. Naturally, people are protective of this data. We have developed a ecosystem that strives to collect data in ways that maximise the control people have over their data both before and after it leaves devices or services (Dennis, Yim, Garrett, Sreekumar & Stone, 2018) and to ensure that they continue to benefit from it well after it has been collected.
From a privacy perspective, the weak link in the research pipeline is analysis. Almost universally researchers are able to see the data on which they conduct analyses. As each additional researcher gains access to the data the opportunity for nefarious use increases. In the Cambridge Analytica case, for instance, facebook provided data to a researcher who appeared legitimate and in fact still maintains he did nothing wrong.
As I argued in a previous blog post, current data analysis languages are not equipped to deal with this problem and it would be difficult if not impossible to retrofit to make them privacy sensitive. As a consequence, we have started developing the Private language.
Private is a declarative language and uses a variable dependency tree to induce the privacy status of many variables. However, it is also necessary to be able to assess the amount of information that would be released by samples from probabilistic variables. The variable dependency tree can be used to establish when certain variables are definitively public or private, but it cannot be used to determine if probabilistic variables that depend on private information are private.
In Dennis et al. (2018), we proposed an algorithm to estimate approximations to the differential privacy of a set of samples generated by a probabilistic program. While this algorithm provides a strong link to previous work in computational privacy, practically it suffers from two major drawbacks. Firstly, it relies on the estimation of the probability density function of the sampled distribution. That requires differentiating the cumulative density function and that requires that one generates several orders of magnitude more samples than one will ultimately release. Given the complexity of the models we are trying to estimate and the size of the data to which they are typically applied that isn’t practical, especially given that these models must all be estimated N+1 times (where N is the number of participants) to assess the impact of removing any one of them. Secondly, the problem with estimation of the derivative is exacerbated when one moves to multiple dimensions – a common scenario in practical data analysis.
To address these issues, we employ a new technique we call manifold privacy. Manifold privacy is inspired by procedures for calculating the correlation dimension common in dynamical systems domains (Grassberger & Procaccia, 1983) and the two sample Kolmogorov–Smirnov test.
The central idea behind differential privacy is that information loss is minimised if the probability of a given output changes little when any participant is removed from the data set. In our procedure then, we start by generating sets of samples from N+1 MCMC chains. One of the chains is exposed to all of the data and the other N chains are exposed to data from all but one of the participants.
Taking inspiration from the correlation dimension, we calculate the set of distances between the samples from the chain exposed to all participants’ data. These distances can then be ordered to construct an empirical cumulative density function (see Figure 1). Next we take the sets of samples corresponding to each participant (i.e. that were generated by removing that participant from the data set) and calculating the distances between those points and the points of the all participant sample. From these distances we calculate a second empirical probability density function.
The maximum difference between these distributions is our statistic. If these were independent samples of two distributions this would be the Kolmogorov Smirnov statistic. Because our length samples are dependent, we can’t employ the KS test. Instead, we set a criterion directly on the difference statistic. The all participant samples are then released if the statistic for all N participants sits below the criterion.
Because the empirical cdfs are derived from lengths and we have O(s^2) lengths where s is the number of samples, relatively small numbers of samples (e.g. 1000) generate quite stable cdf estimates. This property is shared with the correlation dimension and is the reason that the correlation dimension is often favoured in empirical dynamical systems analyses when data is sparse.
Also, the procedure works well as one increases the number of dimensions as lengths can be calculated quite efficiently. In practice, however, we have found that the procedure is quite conservative if all variables in typical models are assessed together. As a consequence, we assess each named variable independently and permute the samples to avoid information being disclosed through the joint distributions. If analysts need to have joint distributions (e.g. when sampling GPS coordinates), then they can construct array valued variables. In this case, the test is conducted over all dimensions, but permutation occurs at the vector level so joint information is preserved.
Manifold privacy is currently implemented in the Private language and we think provides a reasonable trade off between the stronger assurances provided by differential privacy and the demands of producing a relatively computationally efficient approach that is applicable to whatever script the analyst might code.
If you are aware of better approaches or can see ways in which we can improve upon manifold privacy, then please drop us a line.
Dennis, S., Yim, H., Garrett, P., Sreekumar, V., & Stone, B. (2018). A system for collecting and analyzing experience-sampling data. Behavior Research Methods, 1-15.
Grassberger, P., & Procaccia, I. (1983). Measuring the strangeness of strange attractors. Physica D: Nonlinear Phenomena, 9(1-2), 189-208.