Bayesian Estimation with Private Data (Real Data Edition)

In the last tutorial, we covered simple examples that aimed to build a conceptual understanding of probabilistic programming and introduce the Private programming language. In this tutorial, we are going to use the Private language to extract information from potentially private experience sampling data.

In this version (the Real Data Edition) we will be using a project called Demo Project (The impact of emotion on episodic memory). To access this project you will need to have an Unforgettable researcher account. Once you have logged in, go to Marketplace, Researcher/Projects. Find the project and click on the Analyze button.

Inferring the probability of rain

In the first part of the tutorial, we are going to estimate the probability of rain given a set of events. To start with, let’s review how we might go about estimating a probability/rate based on True/False observations.

To create 100 samples of a Bernoulli variable with a rate parameter of 0.3, type:

data = Bernoulli(0.3, 100)

Recall that a Bernoulli variable is one which can have one of two values with a given probability or rate. Now type data and you should see your 100 samples:

[0 0 1 1 0 0 1 0 1 1 0 1 1 0 0 0 0 0 0 1 0 0 0 0 1 1 1 1 1 0 1 1 0 0 1 1 0
 1 1 1 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0
 1 1 1 0 1 1 1 0 0 1 0 0 0 0 1 0 1 0 1 0 1 0 1 0 0 0]

You can imagine that each of these is an indication of whether it rained during a specific one-hour period: 1 means it did rain, 0 means it did not.

So, we have the data that we are going to model. Now we want to define the model. We start by describing ‘data’:

data ~ Bernoulli(rate)

rate will be the probability variable. Before Private can start its calculations, we need to also define the prior on rate as follows:

rate ~ Uniform(0,1)

You can see all of the definitions by typing sv. Note rate is now computing because all definitions have been provided.

Events = Private
DemoEvents = [{'AccelerometryCount': 5, 'type': '__App__'}]
data = Bernoulli(0.3, 100) [0.000 … 0.000]
data ~ Bernoulli(rate) [0.000 … 0.000]
rate ~ Uniform(0, 1) Computing

While we wait for Private to finish its calculations, we can define variables to store the mean and standard deviation of the rate.

meanrate = mean(rate)
stdrate = std(rate)

If Private has finished its calculations you’ll see the following when you type sv:

Events = Private
DemoEvents = [{'AccelerometryCount': 5, 'type': '__App__'}]
data = Bernoulli(0.3, 100) [0.000 … 0.000]
meanrate = mean(rate) 0.245
stdrate = std(rate) 0.045
data ~ Bernoulli(rate) [0.000 … 0.000]
rate ~ Uniform(0, 1) [0.313 … 0.287]

The samples of rate are now available and meanrate and stdrate have been calculated. Note meanrate is 0.245, which is a reasonable approximation to the rate that was used to generate the data (0.3), given that stdrate is 0.045.
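As a sanity check outside Private, the same kind of estimate can be sketched in plain Python. With a Uniform(0, 1) prior and Bernoulli data, the posterior of the rate has a known closed form: a Beta distribution whose two parameters are 1 + (number of ones) and 1 + (number of zeros). The function name and the example data below are ours, not part of Private, and Private draws samples rather than using this formula, so its numbers will differ a little from run to run.

```python
import math

def beta_posterior_summary(data):
    """Posterior mean and std of a Bernoulli rate under a Uniform(0, 1) prior."""
    n = len(data)
    k = sum(data)              # number of 1s (rainy hours)
    a = 1 + k                  # Beta shape parameters of the posterior
    b = 1 + (n - k)
    mean = a / (a + b)
    var = (a * b) / ((a + b) ** 2 * (a + b + 1))
    return mean, math.sqrt(var)

# Illustrative data (an assumption): 30 rainy hours out of 100
example = [1] * 30 + [0] * 70
mean_rate, std_rate = beta_posterior_summary(example)
print(round(mean_rate, 3), round(std_rate, 3))
```

With 100 observations the posterior standard deviation comes out at roughly 0.045, which is in line with the stdrate that Private reported above.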

We have now established how to estimate a rate parameter. Next we want to try this out on some data. All Private projects have a variable called Events. Events is a list containing the data associated with that project that was generated by the participants. Type:

Events

and you should see:

Events is private

A private variable like Events is not released (shown) to the researcher, because the value could be used to reveal sensitive information.

In the Demo project, we will be working with fake data and so there are no privacy concerns that we need to worry about. For tutorial purposes, a second variable called DemoEvents is built in. DemoEvents is exactly the same as Events except that it is not private. That makes it easier for us to understand what is going on. We’ll start by estimating the probability of rain with DemoEvents and then do the same thing with Events.

Firstly, let’s take a look at what a typical event looks like. Type:

d0 = DemoEvents[0]
d0

to see the first event in the list. You will see an event like this:

{'AccelerometryCount': 5,
'AccelerometryDataFiles': [{'filepath': '/data/ap-northeast-1:7fe8c602-eac8-4172-99ca-b3e07d27f1e6/2019/06/16/accel_20190616141000Z_b3944396-b87e-4b29-9acc-a05d00ed868f.bin',
'type': 'localfs'},
{'filepath': '/data/ap-northeast-1:7fe8c602-eac8-4172-99ca-b3e07d27f1e6/2019/06/16/accel_20190616142000Z_0983b4e0-8cab-4d29-bd78-e33278c0095f.bin',
'type': 'localfs'},
{'filepath': '/data/ap-northeast-1:7fe8c602-eac8-4172-99ca-b3e07d27f1e6/2019/06/16/accel_20190616143001Z_1503e036-bd86-4a29-87e3-b1ff73ec7929.bin',
'type': 'localfs'},
{'filepath': '/data/ap-northeast-1:7fe8c602-eac8-4172-99ca-b3e07d27f1e6/2019/06/16/accel_20190616144001Z_0e8df858-c8b3-4af7-ba45-020af013dfaa.bin',
'type': 'localfs'},
{'filepath': '/data/ap-northeast-1:7fe8c602-eac8-4172-99ca-b3e07d27f1e6/2019/06/16/accel_20190616145001Z_c2377e54-bc6d-40a3-b634-a325ce0a6c27.bin',
'type': 'localfs'}],
'AccelerometryDataFilesItr': ,
'AudioProcessedCount': 5,
'AudioProcessedDataFiles': [{'filepath': '/data/ap-northeast-1:7fe8c602-eac8-4172-99ca-b3e07d27f1e6/2019/06/16/audio_20190616141007Z_28868a90-f7bd-4ad5-9264-e35c9b36e109.mfcc',
'type': 'localfs'},
{'filepath': '/data/ap-northeast-1:7fe8c602-eac8-4172-99ca-b3e07d27f1e6/2019/06/16/audio_20190616142007Z_53f62074-5bc5-4902-9f5e-841d25e3684f.mfcc',
'type': 'localfs'},
{'filepath': '/data/ap-northeast-1:7fe8c602-eac8-4172-99ca-b3e07d27f1e6/2019/06/16/audio_20190616143008Z_020d149b-8e99-485f-9337-7b43d9096169.mfcc',
'type': 'localfs'},
{'filepath': '/data/ap-northeast-1:7fe8c602-eac8-4172-99ca-b3e07d27f1e6/2019/06/16/audio_20190616144008Z_6c9a97cf-8d2f-454b-8b21-be03f6949968.mfcc',
'type': 'localfs'},
{'filepath': '/data/ap-northeast-1:7fe8c602-eac8-4172-99ca-b3e07d27f1e6/2019/06/16/audio_20190616145008Z_87feceae-29ca-4b1e-82f2-99621a652ebb.mfcc',
'type': 'localfs'}],
'AudioProcessedDataFilesItr': ,
'BatteryCount': 5,
'BatteryDataFiles': [{'filepath': '/data/ap-northeast-1:7fe8c602-eac8-4172-99ca-b3e07d27f1e6/2019/06/16/battery_20190616140000Z_ffc4284d-7aa3-415e-a3dd-b9125c556388.csv',
'type': 'localfs'}],
'BatteryDataFilesItr': ,
'BatteryLevel': 66,
'EndDateTime': '2019-06-16T14:59:59Z',
'EndDateTimeLocal': '2019-06-17T00:59:59Z',
'GpsDataFiles': [{'filepath': '/data/ap-northeast-1:7fe8c602-eac8-4172-99ca-b3e07d27f1e6/2019/06/16/location_ffc4284d-7aa3-415e-a3dd-b9125c556388.csv',
'type': 'localfs'}],
'GpsDataFilesItr': ,
'GpsLocations': [{'lat': -37.7929697,
'lon': 144.9888915,
'time': '2019-06-16 14:01:25Z',
'time_local': '2019-06-17 00:01:25'},
{'lat': -37.7929776,
'lon': 144.988879,
'time': '2019-06-16 14:12:15Z',
'time_local': '2019-06-17 00:12:15'},
{'lat': -37.792762010358274,
'lon': 144.98914819210768,
'time': '2019-06-16 14:22:17Z',
'time_local': '2019-06-17 00:22:17'},
{'lat': -37.7930073,
'lon': 144.9889067,
'time': '2019-06-16 14:32:12Z',
'time_local': '2019-06-17 00:32:12'},
{'lat': -37.7929881,
'lon': 144.9889107,
'time': '2019-06-16 14:42:15Z',
'time_local': '2019-06-17 00:42:15'},
{'lat': -37.7929881,
'lon': 144.9889107,
'time': '2019-06-16 14:52:13Z',
'time_local': '2019-06-17 00:52:13'}],
'Keywords': ['Monday',
'June',
'2019',
'Night',
'Cloudy',
'autumn',
'full',
'audio_home',
'audio_street'],
'Kilometers': 0.072,
'LocationCount': 6,
'MoonAge': 14.0,
'MoonIllumination': 0.99,
'Name': ['["Clifton Hill Primary School, 185, Gold Street, CliftonHill, City '
'of Yarra, Victoria, 3068, Australia", "Clifton Hill Primary '
'School", "185", "Australia", "City of Yarra", "Clifton Hill", '
'"Victoria", "3068", "Gold Street"]'],
'StartDateTime': '2019-06-16T14:00:00Z',
'StartDateTimeLocal': '2019-06-17T00:00:00Z',
'StreetViewImage': 'https://s3-us-west-1.amazonaws.com/unforgettable-dev-usw1/userhome/ap-northeast-1:7fe8c602-eac8-4172-99ca-b3e07d27f1e6/2019/06/08/streetview_full_4a28b74a-1608-448b-b8e4-7831b6d5f858.jpeg',
'StreetViewThumbnail': 'https://s3-us-west-1.amazonaws.com/unforgettable-dev-usw1/userhome/ap-northeast-1:7fe8c602-eac8-4172-99ca-b3e07d27f1e6/2019/06/08/streetview_thumb_4a28b74a-1608-448b-b8e4-7831b6d5f858.jpeg',
'Suggestions': [],
'Temperature': 10.05,
'UserId': '2f50d2dead5a5192945b47889a5c9788f7a4ff70f12979732076114327067b17',
'UserImages': [],
'Weather': 'Mostly Cloudy',
'aws_profile': {'name': 'ume', 'region_name': 'us-west-1'},
'hasAccelerometryDataFiles': True,
'hasAccelerometryDataFilesItr': ,
'hasAudioProcessedDataFiles': True,
'hasAudioProcessedDataFilesItr': ,
'hasGpsLocations': True,
'id': 'ap-northeast-1:7fe8c602-eac8-4172-99ca-b3e07d27f1e6::ffc4284d-7aa3-415e-a3dd-b9125c556388',
'latitude': -37.792988,
'longitude': 144.988911,
'type': '__App__'}

There are a number of different kinds of data available:

type : Unforgettable.me allows you to collect data from many different services and devices. This field indicates that this event came from the Unforgettable.me app. All of the events in our fake data set are App events. The app collects data in hour chunks, so each event represents one hour of data.

UserId: The user associated with this data. This is a random string that is recreated for each data set.

StartDateTime, StartDateTimeLocal, EndDateTime, EndDateTimeLocal : The start and end date and time of the event in Coordinated Universal Time (UTC) and in local time.

latitude and longitude: The GPS coordinates of the event. Note that sometimes it is not possible for the app to get GPS coordinates because the user’s phone is in flight mode or is otherwise unable to locate GPS satellites or a wifi connection.

AccelerometryCount, AudioProcessedCount, BatteryCount, LocationCount: These variables indicate how many samples of each kind of data we collected. For instance, in this event we collected 5 sets of accelerometry samples. In the case of accelerometry and audioprocessed, the count refers to how many data files we collected as these sources sample very rapidly and the events would become very large if we listed every single sample.

From the raw data listed above, a number of additional variables are derived that can also be used:

Name: The GPS coordinates are used to look up the address of the event.

Weather, Temperature: The GPS coordinates and the time are used to look up a weather database.

Keywords: A number of tags are automatically calculated. In particular, tags like audio_home are determined by running a machine learning classifier on the audio samples.

Kilometers: The total distance travelled is calculated from the GPS coordinates.

MoonAge, MoonIllumination: Provided for those times when lycanthrope activity may be a significant covariate in your analysis.

The “Events” and “DemoEvents” variables are both a stream of participant records associated with the query provided by the researcher. Dot notation in Private allows researchers to isolate specific properties of an event for analysis. For example, to extract the latitude of the first event in DemoEvents type:

lat = DemoEvents[0].latitude
lat
-37.792988

In the following exercise, we are going to estimate the probability of rain in our sample. Start by typing “clear” to remove the user defined variables we just created.

We are going to focus on the Weather field, so we want to work with just those events that have information about the weather. There may be __App__ events that do not have a Weather field – typically because the app was not able to establish a GPS lock during that period. We can keep just those events that do have a Weather field using a list comprehension. A list comprehension takes one list and turns it into another list. In this case, we are taking the list of events in DemoEvents and turning it into a list of True/False values.

To do this, type:

rain = ["Rain" in e.Weather for e in DemoEvents if e.hasField("Weather")]

The first expression in the line above (“Rain” in e.Weather) indicates how to calculate the elements of the output. “Rain” in e.Weather specifies that we should take the Weather field of the event e and check whether it contains “Rain”. If it does, that element of the resulting list will be True, otherwise it will be False. The next expression (for e in DemoEvents) indicates that we want to take each event in DemoEvents and assign it to a new variable, which we will call ‘e’ so that we can refer to it in the other parts of the comprehension. The final expression (if e.hasField(“Weather”)) is a condition that indicates which elements of the first list to transform. In this case, we want to keep only those events that have a Weather field – if the event does not have any weather information, it will be dropped.

Also, whenever you see square brackets [ ], this means we have (or want to create) a list.
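The same three-part pattern can be tried in ordinary Python. The events below are made-up dictionaries, and "Weather" in e stands in for Private’s e.hasField("Weather"); both are assumptions for illustration only.

```python
# Hypothetical events; the second one has no Weather field
events = [
    {"Weather": "Light Rain"},
    {"Temperature": 10.05},
    {"Weather": "Mostly Cloudy"},
    {"Weather": "Rain"},
]

# [ output expression      loop            filter             ]
rain = ["Rain" in e["Weather"] for e in events if "Weather" in e]
print(rain)  # [True, False, True]
```

Four events go in but only three values come out, because the event without a Weather field is dropped by the filter before the output expression is evaluated.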

Now type rain and you should see:

[True, False, False, True, False, False, False, False, True, 
False, False, True, False, False, False, False, False, False, True, 
False, False, False, True, False, False, False, False, False, False, 
False, False, False, True, False, True, True, False, False, False, 
False, False, True, False, False, True, False, False, False, True, 
... True, False, False, False]

That is, we have an entry for each event that had Weather as a field in our new list rain that is either True or False depending on whether it rained during that hour period.

To estimate the rate of rain we can define the model as we did before:

rain ~ Bernoulli(rate)
rate ~ Uniform(0,1) 

And then we can see the result:

Events = Private
DemoEvents = [{'AccelerometryCount': 5, 'type': '__App__'}]
rain = ["Rain" in e.Weather for e in DemoEvents if [False, False, False, False, False, False, …]
rain ~ Bernoulli(rate) [False, False, False, False, False, False, …]
rate ~ Uniform(0, 1) [0.064 … 0.067] 

Now add the variable meanrate as follows:

meanrate = mean(rate) 

and type meanrate to see its value:

0.06347268126296493 

Our estimate of the probability of rain is 0.06.

Exercise 1: Using DemoEvents calculate the mean temperature for the samples.

Exercise 2: Using DemoEvents create a list comprehension that indicates whether each event has an audio_car keyword. (Hint: You can use “in” to determine if an element is in a list, for example, 2 in [1,2,3,4,5] will give True. Or for the String data type, “Tree” in [“Tree”, “Plant”, “Flower”] will give True, whereas “Dog” in [“Tree”, “Plant”, “Flower”] will give False.)

Of course, we would not typically be able to see the values of rain, as the rain variable is deterministically related to Events and is therefore a potential privacy exploit. Let’s replace DemoEvents with Events, to see what would normally happen if we were using real data. (Remember, with real data, the Events variable is a private variable, instead of our fake data where our DemoEvents variable has not been set to private).

rain = ["Rain" in e.Weather for e in Events if e.hasField("Weather")]

You should see:

Events = Private
DemoEvents = [{'AccelerometryCount': 5, 'type': '__App__'}]
rain = ["Rain" in e.Weather for e in Events if e.h Private
meanrate = mean(rate) Stale
rain ~ Bernoulli(rate) Private
rate ~ Uniform(0, 1) Computing

Notice the values of rain are not visible because Private has determined that the ‘rain’ variable is to be kept private. After Private has completed the computation of rate, it will start the computations to determine if rate is private. In this example, Private has concluded that releasing the samples of rate is not ok, and so you cannot see them or any values that depend on them (like meanrate) when you type sv:

Events = Private
DemoEvents = [{'AccelerometryCount': 5, 'type': '__App__'}]
rain = ["Rain" in e.Weather for e in Events if e.h Private
meanrate = mean(rate) Private
rain ~ Bernoulli(rate) Private
rate ~ Uniform(0, 1) Private

Well that is good from the point of view of protecting our users’ privacy, but not so good if we want to know the results of the analysis. DON’T PANIC. There are ways that we can change our model and our data to make it less sensitive to each individual’s data. Before we explore those, however, let’s take a look at how Private made its decision.

Computing privacy

Privacy issues will arise if any particular user has data that is substantially different from the rest of the users, because if one participant is unique in the sample, it makes it possible to mathematically reconstruct that individual’s data from the data in the sample. Private checks to make sure the reconstruction is not possible before it releases the information to the user.

How did Private decide what to release? The algorithm starts by assuming all defined variables (i.e. those that aren’t built in) are of unknown privacy – that is, we don’t yet know whether they are private or public. However, some built in variables are private, notably ‘Events’, while many are public. Any variable that depends deterministically on a private variable is set to private. For instance, the variable ‘rain’ is immediately set to private because it depends on Events. If there were any variables that depended on ‘rain’ they would also be set to private. Any deterministic variables that depend only on public variables are set to public. Deterministic variables that depend on a mixture of public and unknown variables remain unknown.
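The rules for deterministic variables can be sketched in Python as a simple propagation over a dependency table. This is only an illustration of the rules as described in the paragraph above: the function name, the dictionaries and the variable demoRain are hypothetical, and this is not Private’s actual implementation.

```python
def propagate(status, deps):
    """Propagate privacy labels through deterministic dependencies.

    status: dict name -> "private" | "public" | "unknown"
    deps:   dict name -> list of names a deterministic variable depends on
    """
    status = dict(status)
    changed = True
    while changed:
        changed = False
        for var, parents in deps.items():
            if status[var] != "unknown":
                continue
            labels = [status[p] for p in parents]
            if "private" in labels:
                status[var] = "private"   # any private parent makes it private
                changed = True
            elif all(label == "public" for label in labels):
                status[var] = "public"    # depends only on public variables
                changed = True
            # a mix of public and unknown parents stays unknown
    return status

# Built-in variables start with a known status, user-defined ones as unknown
status = {"Events": "private", "DemoEvents": "public",
          "rain": "unknown", "demoRain": "unknown",
          "rate": "unknown", "meanrate": "unknown"}
deps = {"rain": ["Events"], "demoRain": ["DemoEvents"], "meanrate": ["rate"]}
result = propagate(status, deps)
print(result["rain"], result["demoRain"], result["meanrate"])
# private public unknown
```

Here meanrate stays unknown because it depends on rate, a probabilistic variable whose status has to be settled by a separate test.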

Determining the privacy status of probabilistic variables like rate is more complicated. To ensure that the data of any individual participant cannot be reconstructed from this information, we must run tests to determine whether the distribution of samples we obtain is sensitive to any of the participants. To do this, Private runs calculations for each participant. Each of these calculations is identical to the original calculation, except that the data for the corresponding participant is omitted. If the distribution when each participant is omitted is very similar to the distribution with all participants included, then very little information about that individual is being revealed by the samples, and the variables are made public. Otherwise, they are made private.
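The leave-one-participant-out idea can be sketched in Python as follows. This is an illustration only. It summarises each distribution by its closed-form posterior mean under a Uniform(0, 1) prior, and the function names and data are made up. The text does not say how Private measures the similarity between distributions, so the sketch only reports the largest shift in the posterior mean rather than making a release decision.

```python
def posterior_mean(observations):
    """Posterior mean of a Bernoulli rate under a Uniform(0, 1) prior."""
    return (1 + sum(observations)) / (2 + len(observations))

def max_leave_one_out_shift(rain, subjects):
    """Largest change in posterior mean when one participant is omitted."""
    full = posterior_mean(rain)
    shifts = []
    for user in set(subjects):
        # Repeat the calculation without this participant's events
        kept = [r for r, s in zip(rain, subjects) if s != user]
        shifts.append(abs(posterior_mean(kept) - full))
    return max(shifts)

# Made-up data: user "c" saw far more rain than the others
rain = [0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0]
subjects = ["a", "a", "a", "a", "b", "b", "b", "b", "c", "c", "c", "c"]
print(max_leave_one_out_shift(rain, subjects))
```

In this toy data set the largest shift comes from omitting user "c", whose events are unlike everyone else’s. That is exactly the situation in which the samples would reveal something about one individual.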

After we have determined the privacy status of the probabilistic variables, we then consider any deterministic variables to see if we might now be able to determine their privacy status.

Whenever a new variable is defined, it is possible that the privacy calculations will change and so the privacy algorithm is run again before you get to see any results.

Back to the rain

So why did Private stop us from seeing the samples? We can use DemoEvents to take a look at the data and provide some insight. The privacy algorithm considers data sets in which each of the users is left out in turn. Problems with privacy will arise if any of the users are markedly different from the rest of the users in terms of the amount of rain they experienced. To investigate, let’s calculate the probability of rain for each of the participants. First, we need a variable that includes all of the unique participants.

We’ll start by extracting the users associated with each event that has a Weather field:

usersWithWeatherInfo = [e.UserId for e in DemoEvents if e.hasField("Weather")]

In the list usersWithWeatherInfo (if you wish to see the list, just type usersWithWeatherInfo), many of the users are repeated because we have multiple events associated with them. To remove the duplicates, we use the set function:

uniqueUsers = set(usersWithWeatherInfo)

Now let’s define a variable which gives the counts of the events with Weather fields for each individual:

numEventsByUser = array([len([e for e in DemoEvents if e.hasField("Weather") and e.UserId == user]) for user in uniqueUsers])

The array() function turns a list of values into an array of values. Operations like addition work differently with lists as opposed to arrays. For instance, if we let:

L1 = [1, 2, 3, 4] 
L2 = [5, 6, 7, 8]

then L1 + L2 equals

[1, 2, 3, 4, 5, 6, 7, 8]

However, if we let:

L1 = array([1, 2, 3, 4]) 
L2 = array([5, 6, 7, 8])

Then L1 + L2 equals:

[ 6. 8. 10. 12.]

In this case, we want to use division, which does not work at all with lists, so we need to convert our lists to arrays.

The len() function gives us the length of the dataset, which is effectively a count of how many data points we have.

Let’s define another variable that gives the counts of the rain events for each individual:

numRainEventsByUser = array([len([e for e in DemoEvents if (e.hasField("Weather") and e.UserId == user) and "Rain" in e.Weather]) for user in uniqueUsers])

Now we look at the proportions of rain by individual:

propsByUser = numRainEventsByUser / numEventsByUser

to see them type:

propsByUser

And you should see:

[0.104 0.08 0.032 0.068 0.032 0.048 0.043 0.116 0.062 0.057 0.069 0.065]

There is a lot of variability in proportions of rain by individual and there are not very many individuals in our dataset (only 12), so eliminating any individual could affect the estimated proportion a great deal – meaning we have a potential privacy exploit!
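For comparison, the same per-user proportions can be computed in a single pass in ordinary Python. The sketch assumes events are dictionaries with a UserId key and an optional Weather key, which is an assumption about the data layout rather than Private’s event type. The function name and sample events are made up.

```python
from collections import Counter

def rain_proportions_by_user(events):
    """Return {user_id: proportion of that user's Weather events with rain}."""
    totals = Counter()
    rainy = Counter()
    for e in events:
        if "Weather" not in e:
            continue                      # skip events without weather info
        totals[e["UserId"]] += 1
        if "Rain" in e["Weather"]:
            rainy[e["UserId"]] += 1
    return {user: rainy[user] / totals[user] for user in totals}

events = [
    {"UserId": "u1", "Weather": "Rain"},
    {"UserId": "u1", "Weather": "Clear"},
    {"UserId": "u2", "Weather": "Mostly Cloudy"},
    {"UserId": "u2"},                     # no Weather field, ignored
]
props = rain_proportions_by_user(events)
print(props)  # {'u1': 0.5, 'u2': 0.0}
```

Only users with at least one Weather event appear in the result, so the division never sees a zero count.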

So what can we do? The most obvious and best approach is to increase the number of participants. Someone trying to breach privacy may attempt to isolate one particular user’s data point by removing the selected user from the data set, and comparing the distribution with that user removed, to the entire distribution with the user included. With more participants, the impact of any given user on the summary statistic distribution will become smaller. So adding more participants means that if any one user was removed from the pool of users, it would be less likely that the individual’s data could be reconstructed (unless you add an outlier person).

Sometimes, however, it isn’t practical to increase the sample size either because there just aren’t more individuals that meet the selection criteria or because time or financial resources have been exhausted.

In these cases, you can define your model differently in order to estimate variables of interest. In particular, you can employ a hierarchical model.

In a hierarchical model, instead of trying to estimate the rate of rain directly, we assume that each participant has a rate associated with them (presumably as a consequence of where they live, whether they pay homage to Halie the goddess of rain etc.). Furthermore, we assume that these rates are in turn drawn from a distribution representing the population of individuals’ rain rates. We can define the model as follows:

rain = ["Rain" in e.Weather for e in Events if e.hasField("Weather")]
subjects = [e.UserId for e in Events if e.hasField("Weather")]
rain[subjects] ~ Bernoulli(rate[subjects])
rate[subjects] ~ Beta(rateHier, 0.1)
rateHier ~ Uniform(0, 1)

Note this code is similar to our previous model, but now we have created a subjects variable that indicates which subject generated each event. When we define rain and rate we index them by subject.

rate[subjects] defines 12 rate variables, one for each subject. Each of these variables is defined as a Beta variable with a mean of rateHier and a standard deviation of 0.1. A Beta variable is a distribution that gives the probability of a probability. We can’t use the uniform distribution here because we want our estimates to change across individuals (the uniform distribution from 0 to 1 always has the same mean and standard deviation).
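A Beta distribution is more commonly written with two shape parameters, usually called alpha and beta. Assuming Private’s Beta(mean, sd) form is the mean and standard deviation parameterisation described above, the usual moment-matching conversion to shape parameters can be sketched in Python. The function name is hypothetical.

```python
def beta_shapes_from_mean_sd(mean, sd):
    """Convert a mean and standard deviation into Beta shape parameters."""
    var = sd ** 2
    # A Beta distribution requires 0 < mean < 1 and var < mean * (1 - mean)
    if not 0 < mean < 1 or var >= mean * (1 - mean):
        raise ValueError("no Beta distribution has this mean and sd")
    nu = mean * (1 - mean) / var - 1      # nu = alpha + beta
    return mean * nu, (1 - mean) * nu

alpha, beta = beta_shapes_from_mean_sd(0.2, 0.1)
print(alpha, beta)
```

With a mean of 0.2 and a standard deviation of 0.1 this gives shape parameters of approximately 3 and 12. A standard deviation of 0.1 is not achievable when the mean is very close to 0 or 1, which is why the sketch raises an error in that case.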

Now when the model is run we get estimates of the hierarchical rate, defined as rateHier.

Events = Private
DemoEvents = [{'AccelerometryCount': 5, 'type': '__App__'}]
rain = ["Rain" in e.Weather for e in Events if e.h Private
subjects = [e.UserId for e in Events if e.hasField Private
rain[subjects] ~ Bernoulli(rate[subjects]) Private
rate[subjects] ~ Beta(rateHier, 0.1) Private
rateHier ~ Uniform(0, 1) [0.063 … 0.137]

If you type:

meanRateHier = mean(rateHier)
meanRateHier

You will see that we get a reasonable estimate of the rate of rain (although quite a bit higher because the influence of the prior is much stronger in hierarchical models):

0.17449149921074988

So did we just get a free lunch? Can we always define a hierarchical model to get around the privacy test? Firstly, it doesn’t always work. If you have too strong an outlier and not enough data you will still be prevented from seeing the results. Secondly, the quantity that you have estimated now is different from the one we were originally estimating. Instead of asking, ‘for all our samples, what is the probability of rain’, in this case we are asking ‘what is the average of the rate of rain for each of our subjects’. In particular, our hierarchical variable has a larger standard deviation than our nonhierarchical model. That is, we can draw our conclusions with less certainty from the hierarchical model. Often though we are interested in the properties of our subjects rather than the samples directly and so the hierarchical model is the right one to use anyway.

Exercise 3: Using Events, estimate the mean temperature. (Hint: This is similar to the example which estimated the rate of rain, except now you will need to use a Normal variable instead of a Bernoulli variable. The Normal takes a mu and sigma as parameters and these will each need to have priors defined. The prior for mu can be a Normal and the prior for sigma should be a HalfNormal – as it must always be positive).

Exercise 4: Using Events, estimate the mean latitude.

Summary

In this tutorial, we started using the Private language to extract information from experience sampling data. You should now be able to see how we can use these kinds of models to answer questions of practical interest. In particular, you now know how we can protect a participant’s privacy while still being able to extract useful information from our analyses. You have also built your first hierarchical Bayesian model.
