Increasingly, the human sciences are employing big, dense data sets to better characterize biological, psychological and social processes. The models required to capture these processes are becoming sophisticated and compute intensive, requiring clusters and distributed data stores to realize. Rather than focusing on the development of models and the interpretation of results, researchers must become familiar with a host of big data technologies.
At the same time, the data that researchers are utilizing is becoming more sensitive. Privacy preservation has entered the global consciousness as a result of incidents such as that involving Facebook and Cambridge Analytica. The power of big data is driving many of the most successful companies at this point in time, and it is clear that academic progress will similarly be related to our ability to utilise private information. But the opportunities for breaches increase as more researchers engage with the data. Today, best practice is to rely on the integrity of researchers and on the policy and legal frameworks surrounding data. This approach is manifestly inadequate, as we saw in the case of Cambridge Analytica, where data was leaked via a legitimate researcher.
The development of data analysis languages has yet to respond to these trends. Most analysis is conducted in languages such as Python or R, which were developed under the assumption that computation would occur locally in serial fashion and that researchers would and should have direct access to the data they are analysing. These languages have been retrofitted, and new languages (such as Julia) have been developed, with mechanisms to allow for large scale parallel computation. But the analyst must still concern themselves with how parallelism will be achieved, and these languages provide no mechanisms to assess what should be released to coders or to enforce these privacy constraints.
Private aims to address these deficiencies. It is a declarative language based on BUGS/JAGS/Stan and Python. Models are specified in a Bayesian generative form and the language employs a Markov Chain Monte Carlo back end (PyMC3) to generate samples of probabilistic variables. Computation is automatically parallelised and the code is analysed to identify variables that depend upon private information. Those that might reveal too much are not released.
In the following sections, we first provide an example using the Private language to provide a feel for how Private programs operate. Then we discuss the thought processes behind Private’s design.
Inferring a distribution of latitudes
The following code demonstrates how one might establish the distribution of the latitudes of a sample of participants:
latitudes = [e.latitude for e in Events if e.hasField("latitude")]
latitudes ~ Normal(mu, sigma)
mu ~ Normal(0, 100)
sigma ~ HalfNormal(10)
In Private, data is never loaded or queried from a database. Rather, the data that is available for analysis in the current environment automatically appears in several lists (Events in this example) to which the analyst has access on the platform in which the language is embedded. Authority to analyse this data must be established by virtue of contractual agreements between the data owners and the analyst on the platform. For instance, in the http://www.unforgettable.me environment one can negotiate with participants for access to their data in a marketplace. Alternatively, a research, commercial or government institution might grant authority to researchers to analyse their data. Authority to access, though, does not mean authority to see. Private data is never made available and therefore cannot be copied. Rather, researchers are provided with the Private language with which to run analyses.
The first line of the code above uses a list comprehension to extract the subset of latitudes for analysis. The events available in this context incorporate dummy data designed to mimic the data available at unforgettable.me. These include many kinds of events, some of which have latitudes and some of which do not. To avoid errors, we use the if clause of the list comprehension to filter out any events that do not have a latitude field. The next line
latitudes ~ Normal(mu, sigma)
specifies our model of the latitudes. In this case, we propose that the latitudes have been generated from a normal distribution with mean mu and standard deviation sigma. As with the deterministic assignment operator (“=”), the probabilistic assignment operator (“~”) is automatically vectorized. The above statement means that each latitude in the latitudes list has a Gaussian distribution. Finally, we define our prior beliefs about the values of mu and sigma before we see the data as follows:
mu ~ Normal(0, 100)
sigma ~ HalfNormal(10)
Now that the code forms a complete model, samples of mu and sigma will be generated. The coder does not need to (and cannot) run the sampler themselves. As soon as all the dependencies required to compute a variable have been defined, computation starts on that variable automatically.
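To make the model concrete, here is a minimal Python sketch of the unnormalised log posterior that the four lines above define. It is only an illustration of the statistics, not how Private or its PyMC3 back end computes anything. The function names and the example latitudes are made up for this sketch, and the second argument of Normal is taken to be a standard deviation, as described above.

```python
import math

def normal_logpdf(x, mean, sd):
    """Log density of Normal(mean, sd) at x, with sd a standard deviation."""
    return -0.5 * math.log(2 * math.pi) - math.log(sd) - 0.5 * ((x - mean) / sd) ** 2

def half_normal_logpdf(x, sd):
    """Log density of HalfNormal(sd) at x; the density is zero below 0."""
    if x < 0:
        return -math.inf
    return math.log(2) + normal_logpdf(x, 0.0, sd)

def log_posterior(mu, sigma, latitudes):
    """Unnormalised log posterior of (mu, sigma) given the latitudes."""
    if sigma <= 0:
        return -math.inf
    # Priors: mu ~ Normal(0, 100) and sigma ~ HalfNormal(10).
    log_prior = normal_logpdf(mu, 0.0, 100.0) + half_normal_logpdf(sigma, 10.0)
    # The vectorised "~" contributes one likelihood term per latitude.
    log_likelihood = sum(normal_logpdf(x, mu, sigma) for x in latitudes)
    return log_prior + log_likelihood

# Hypothetical latitudes, invented for illustration (not participant data).
example_latitudes = [-37.8, -37.7, -36.9, -38.2, -37.5]
```

A sampler explores values of mu and sigma in proportion to the exponential of this quantity, which is why values of mu near the centre of the data score higher than values far from it.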
To examine the results, one can issue the sv command, which stands for show variables. sv shows both the definitions and values of the variables.
> sv
latitudes = [e.latitude for e in Events if e.hasFi    Private
latitudes ~ Normal(mu, sigma)                         Private
mu ~ Normal(0, 100)                                   Private
sigma ~ HalfNormal(10)                                [1.411 … 1.428]
latitudes is deterministically related to Events, which is a private variable, and is therefore identified as private. As a consequence, it is not revealed. mu is a probabilistic variable, so its privacy status must be determined by analysing the samples returned from the sampler. We will discuss how privacy is determined in the next section, but in this case it has been determined that mu is private and its value is not revealed. If we try to calculate a deterministic function based on mu, it will similarly be private:
> mean(mu)
mu is private
sigma, however, does pass the privacy test and therefore is revealed.
> sigma
[1.411 1.432 1.403 ... 1.404 1.404 1.428]
> mean(sigma)
1.3937270281369984
> len(sigma)
2000
Determining the Privacy Status of Variables
To establish the privacy status of the variables in a Private program, we employ the following observations:
A) Any variable that the coder initialises in the code is public. If the coder provided the value of the variable then they already know its value and nothing can be lost by revealing it.
B) Any variable that comes from the data sources associated with the current installation will have a privacy status defined for the current user (e.g. the Events variable used above is private).
C) Any variables that rely only on public variables are public. Note this must be true because if the coder has access to all of the variables on which this variable depends, then they could run the equivalent computation outside the system and generate this variable.
D) Any variable that relies deterministically on a private variable is private. Even if the private variable is combined with a probabilistic variable, the resulting variable can still be used to reveal information. For example, the following case:
PopesYearOfBirth = [s.YearOfBirth for s in Subjects if s.name == "The Pope"]
y ~ Binomial(0.5, 1000)
z = y + PopesYearOfBirth * .00001
will generate samples of z that are integers with .01936 added to them because the Pope was born in 1936, thus revealing potentially private information.
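The leak is easy to reproduce outside Private. The following Python sketch simulates the example with the standard library, reading Binomial(0.5, 1000) as 1000 trials with success probability 0.5 (an assumption about the argument order). The helper recover_year is hypothetical and exists only to show what an attacker could do with a single released sample.

```python
import math
import random

def recover_year(z):
    """Recover the hidden year from one sample of z = y + year * 0.00001."""
    fractional_part = z - math.floor(z)
    return round(fractional_part * 100000)

rng = random.Random(0)  # fixed seed so the demonstration is repeatable
year_of_birth = 1936    # stands in for the private value

# y ~ Binomial(0.5, 1000), simulated as the number of successes in 1000 fair flips.
y_samples = [sum(rng.random() < 0.5 for _ in range(1000)) for _ in range(5)]

# z mixes the random integer with a tiny multiple of the private value.
z_samples = [y + year_of_birth * 0.00001 for y in y_samples]

# Every sample yields the same year, however random y is.
recovered = {recover_year(z) for z in z_samples}
```

Because y is always an integer, the randomness it adds does nothing to hide the fractional part, which is why rule D treats z as private.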
E) Any remaining variables must be subjected to a privacy test. This involves running the sampler with all sets of data with one subject removed and checking that it converges with the chain containing all subjects’ data. Currently, Private uses a form of privacy test called manifold privacy. Manifold privacy is beyond the scope of this blog post. I plan to cover this algorithm in a subsequent blog post.
Rules C and D are applied every time we establish the status of a variable through a privacy test to avoid running unnecessary tests on derived variables.
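Observations A to E can be read as a small propagation procedure over the variable dependency graph. The Python sketch below is one way to express that reading; it is not Private's implementation. The data structures, the status labels and the test_results argument (which stands in for the outcome of the privacy test in E) are all assumptions of the sketch. Since the post does not describe how Private tracks which data a probabilistic variable is conditioned on, the sketch simply lists latitudes as an influence on mu and sigma.

```python
PUBLIC, PRIVATE, UNTESTED = "public", "private", "untested"

def classify(definitions, source_status, test_results):
    """Return {variable: status}.

    definitions maps name -> (kind, influences), where kind is
    "deterministic" or "probabilistic". source_status gives the status of
    data-source variables (rule B). test_results holds privacy-test
    outcomes for probabilistic variables (rule E).
    """
    status = {}

    def resolve(name):
        if name in status:
            return status[name]
        if name in source_status:                      # rule B
            result = source_status[name]
        else:
            kind, influences = definitions[name]
            seen = [resolve(dep) for dep in influences]
            if all(s == PUBLIC for s in seen):         # rules A and C
                result = PUBLIC
            elif kind == "deterministic" and PRIVATE in seen:  # rule D
                result = PRIVATE
            elif kind == "deterministic":
                result = UNTESTED                      # waits on a pending test
            else:                                      # rule E
                result = test_results.get(name, UNTESTED)
        status[name] = result
        return result

    for name in list(definitions) + list(source_status):
        resolve(name)
    return status

# The latitude session, with the test outcomes reported earlier in the post.
source_status = {"Events": PRIVATE}
definitions = {
    "latitudes": ("deterministic", ["Events"]),
    "mu": ("probabilistic", ["latitudes"]),
    "sigma": ("probabilistic", ["latitudes"]),
    "mean_mu": ("deterministic", ["mu"]),
    "mean_sigma": ("deterministic", ["sigma"]),
}
statuses = classify(definitions, source_status, {"mu": PRIVATE, "sigma": PUBLIC})
```

In this sketch mean_mu comes out private and mean_sigma public without any further test, which mirrors the point that rules C and D settle the status of derived variables once a test result is known.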
In addition to the analysis of the variable dependency graph and distributional tests, Private takes further precautions. All chains are initialised with the same seeds each run so that no additional information can be revealed by rerunning the same code (because the same sequence of samples will be produced).
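The effect of fixed seeds can be illustrated with Python's standard random module. This is only an analogy: the draw_chain function is hypothetical and draws from a standard normal rather than running Private's sampler.

```python
import random

def draw_chain(seed, n_samples):
    """Draw a reproducible sequence of samples from a seeded generator."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n_samples)]

# Rerunning with the same seed reproduces the chain exactly,
# so a second run tells the coder nothing new.
first_run = draw_chain(seed=42, n_samples=5)
second_run = draw_chain(seed=42, n_samples=5)
```

If the seed varied between runs, each rerun would return a fresh set of samples, and a coder could pool them to learn more than any single run was meant to reveal.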
Perhaps the most critical difference between Private and languages like R or Python is that Private is declarative rather than procedural. This choice has substantial consequences for both privacy analysis and parallelisation.
Because Private is declarative, each variable has only one privacy status. It is either public or private. In procedural languages (e.g. Jeeves), the privacy status of a variable can change as a function of state. You may initialise a variable to a standard value, in which case it poses no privacy threat. However, the value of the variable may subsequently be altered based on private information, and its value may then become private. In Private, one can establish the privacy status of many deterministic variables by analysing the variable dependency graph rather than at run time, which makes the underlying interpreter code simpler and improves execution efficiency.
The declarative nature of Private also means that automating parallelisation is simplified as there are no sequential dependencies to consider. There are variable dependencies, but these can be extracted syntactically. If, for instance, the following code is executed:
latitudes = [e.latitude for e in Events if e.hasField("latitude")]
longitudes = [e.longitude for e in Events if e.hasField("longitude")]
the Private interpreter identifies that both variables can be calculated, because all of their dependencies have been computed, and therefore calculates them in parallel on separate machines. The coder does not have to identify this opportunity.
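One way to picture this scheduling is to group variables into batches whose dependencies are already satisfied. The Python sketch below does so with the standard library's graphlib module (Python 3.9 or later). The parallel_batches function is hypothetical and says nothing about how Private actually distributes work across machines.

```python
from graphlib import TopologicalSorter

def parallel_batches(dependencies):
    """Group variables into batches that could be computed concurrently.

    dependencies maps each variable to the variables it reads.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        # Everything returned here has all of its dependencies computed.
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches

# Dependencies extracted from the two definitions above.
dependencies = {
    "Events": [],
    "latitudes": ["Events"],
    "longitudes": ["Events"],
}
batches = parallel_batches(dependencies)
```

Here latitudes and longitudes land in the same batch because neither reads the other, so nothing forces one to wait for the other.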
Private is designed to be used in situations in which data is large and computations will often take too long for users to wait for computation to complete in an interactive session. Rather, the shell incrementally makes changes to the variable dependency graph. If a dependency is entered using either = or ~ then it will replace any dependencies that already exist for that variable and all variables that depend on this variable will be recomputed. Control returns immediately to the shell rather than waiting for computation to complete (as would typically be the case in a procedural scripting language), so new alterations to the variable dependency graph including new visualisations can be added while computation is taking place.
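The replace-and-recompute behaviour can be sketched as a small bookkeeping class. This is an illustration under assumptions rather than the Private shell itself: the DependencyGraph class, its method names, and the north and count definitions are hypothetical, and the sketch only reports which variables would need recomputing instead of scheduling any work.

```python
from collections import deque

class DependencyGraph:
    """Tracks definitions and reports what a new definition invalidates."""

    def __init__(self):
        self.definitions = {}   # name -> source text of its definition
        self.dependencies = {}  # name -> set of variables it reads

    def define(self, name, source, depends_on):
        """Add or replace a definition; return the variables to recompute."""
        self.definitions[name] = source
        self.dependencies[name] = set(depends_on)
        return self.stale_after(name)

    def stale_after(self, changed):
        """Return the changed variable plus everything downstream of it."""
        stale, queue = {changed}, deque([changed])
        while queue:
            current = queue.popleft()
            for name, deps in self.dependencies.items():
                if current in deps and name not in stale:
                    stale.add(name)
                    queue.append(name)
        return stale

graph = DependencyGraph()
graph.define("latitudes", "[e.latitude for e in Events]", ["Events"])
graph.define("north", "[l for l in latitudes if l > 0]", ["latitudes"])
graph.define("count", "len(north)", ["north"])

# Replacing the definition of latitudes invalidates everything built on it.
to_recompute = graph.define(
    "latitudes",
    '[e.latitude for e in Events if e.hasField("latitude")]',
    ["Events"],
)
```

Since define returns as soon as the stale set is known, a shell built this way could hand the recomputation to background workers and accept the next definition immediately.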
Many Bayesian languages are declarative or largely declarative in nature (e.g. BUGS, JAGS, Stan), as this is a more natural way to express generative models. However, they can be cumbersome because they typically outsource data loading and manipulation to procedural languages (R, Python or MATLAB) and then use those languages to invoke the sampler on their code. Similarly, samples are typically returned to the procedural language for subsequent analyses and visualisation. In Private, however, data loading is not necessary, data manipulation and visualisation are built into the language, and the shell is incremental. That is, one can code incrementally in the Private shell in the same way that one can run code snippets in R, Python or MATLAB. However, in procedural languages there is always the possibility that the user’s understanding of the current state is inaccurate. Perhaps another section of code should have been run in order to ensure the state was up to date. In Private, the code used to define a variable is retained, and whenever a new definition is added (or an old one edited) the interpreter determines which code must be re-executed to ensure that each variable's value is consistent with the current definitions. This feature should reduce the opportunities for human error.
In a privacy sensitive world, business as usual is not cutting it. New languages are going to be required if we are to realise the full potential of big data without seriously compromising the rights of the people who provide it. Private provides a solution to these challenges.