Bayesian Estimation with Private Data

In the last tutorial, we covered simple examples that aimed to build the conceptual understanding of probabilistic programming and introduce the Private programming language. In this tutorial, we are going use the Private language to extract information about potentially private experience sampling data.

You can access a Private shell at https://private.mall-lab.com/ but note it will timeout after 2 hours of inactivity.

Inferring the probability of rain

In the first part of the tutorial, we are going to estimate the probability of rain given a set of events. To start with, let’s review how we might go about estimating a probability/rate based on a True/False observations.

To create 100 samples of a Bernoulli variable with a rate parameter of 0.3. type:

data = Bernoulli(0.3, 100)

Recall that a Bernoulli variable is one which can have one of two values with a given probability or rate. Now type data and you should see your 100 samples:

[0 0 1 1 0 0 1 0 1 1 0 1 1 0 0 0 0 0 0 1 0 0 0 0 1 1 1 1 1 0 1 1 0 0 1 1 0
 1 1 1 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0
 1 1 1 0 1 1 1 0 0 1 0 0 0 0 1 0 1 0 1 0 1 0 1 0 0 0]

You can imagine that each of these is an indication of whether it rained during a specific hour period. 1 means it did rain, 0 means it did not.

So, we have the data that we are going to model. Now we want to define the model. We start by describing ‘data’:

data ~ Bernoulli(rate)

rate will be the probability variable. Before Private can start its calculations, we need to also define the prior on rate as follows:

rate ~ Uniform(0,1)

You can see all of the definitions by typing sv. Note r is now computing because all definitions have been provided.

data = Bernoulli(0.3, 100)     [0.000000 … 0.000000]
data ~ Bernoulli(rate)         [0.000000 … 0.000000]
rate ~ Uniform(0, 1)           Computing

While we wait for Private to finish its calculations, we can define variables to store the mean and standard deviation of the rate.

meanrate = mean(rate)
stdrate = std(rate)

If Private has finished its calculations you’ll see the following when you type sv:

data = Bernoulli(0.3, 100)  [0.000 … 0.000]    
meanrate = mean(rate)       0.390    
stdrate = std(rate)         0.047    
data ~ Bernoulli(rate)      [0.000 … 0.000]    
rate ~ Uniform(0, 1)        [0.417 … 0.475]

The samples of rate are now available and meanrate and stdrate have been calculated. Note meanr is 0.39, which is a reasonable approximation to the rate that was used to generate the data 0.3.

We have now established how to estimate a rate parameter. Next we want to try this out on some data. All Private projects have a variable called Events. Events is a list containing the data associated with that project that was generated by the participants. Type:

Events

and you should see:

Events is private

A private variable like Events is not released (shown) to the researcher, because the value could be used to reveal sensitive information.

In the Demo project, we will be working with fake data and so there are no privacy concerns that we need to worry about. For tutorial purposes, a second variable called DemoEvents is built in. DemoEvents is exactly the same as Events except that it is not private. That makes it easier for us to understand what is going on. We’ll start by estimating the probability of rain with DemoEvents and then do the same thing with Events.

Firstly, let’s take a look at what a typical event looks like. Type:

DemoEvents[0]

to see the first event in the list. You will see an event like this:

{'AccelerometryCount': 5,
 'AudioProcessedCount': 3,
 'BatteryCount': 5,
 'BatteryLevel': 88,
 'EndDateTime': '2019-01-11 14:00:00',
 'EndDateTimeLocal': '2019-01-12 01:00:00',
 'Kilometers': 3.496023774372081,
 'LocationCount': 6,
 'MoonAge': 21.015028782921352,
 'MoonIllumination': 0.5602955128363618,
 'StartDateTime': '2019-01-11 13:00:00',
 'StartDateTimeLocal': '2019-01-12 00:00:00',
 'Temperature': 19.939566248607413,
 'UserId': 'd5fac460-bb57-4434-9897-4b7a173da16e',
 'Weather': 'rain',
 'address': u'89 Rodney Motorway\nSandovalbury, SA, 0222',
 'keywords': ['January',
              'Saturday',
              2019,
              'Summer',
              'audio_home',
              'rain',
              'waxing_gibbous'],
 'latitude': -36.55568688296506,
 'longitude': 143.52565783284203,
 'type': 'App'}

This is event is fake but is designed to mimic the kind of data that is available if you run a study on the unforgettable.me system. There are a number of different kinds of data available:

type (line 26 of the output above): Unforgetable.me allows you to collect data from many different services and devices. This field indicates that this event came from the unforgettable.me app. All of the events in our fake data set are App events. The app collects data in hour chunks, so each event represents one hour of data.

UserId (line 14 of the output above): The user associated with this data. This is a random string that is recreated for each data set.

StartDateTime, StartDateTimeLocal, EndDateTime, EndDateTimeLocal (line 5,6 and 11,12 of the output above): The start and end date and time of the event in Universal Time Coordinated (UTC) and local time, respectively.

latitude and longitude (line 24 & 25 of the output above): Are the GPS coordinates of the event. Note sometimes it is not possible for the app to get GPS coordinates because the user’s phone is in flight mode or is otherwise unable to locate GPS satellites or a wifi connection.

AccelerometryCount, AudioProcessedCount, BatteryCount, LocationCount (line 1, 2, 3, & 8 of the output above): These variables indicate how many samples of each kind of data we collected. For instance, in this event we collected 5 sets of accelerometry samples. In the case of accelerometry and audioprocessed, the count refers to how many data files we collected as these sources sample very rapidly and the events would become very large if we listed every single sample.

From the raw data listed above, a number of variables can be derived that can be used also:

address (line 16 of the output above): The GPS coordinates are used to look up the address of the event.

Weather, Temperature (line 15 & 13 of the output above): The GPS coordinates and the time are used to look up a weather database.

keywords (line 17 of the output above): A number of tags are automatically calculated. In particular, tags like audio_home are determined by running a machine learning classifier on the audio samples.

Kilometers (line 7 of the output above): The total distance travelled is calculated from the GPS coordinates.

MoonAge, MoonIllumination (line 9 & 10 of the output above): Provided for those times when lycanthrope activity may be a significant covariate in your analysis.

The “Events” and “DemoEvents” variables are both a stream of participant records associated with the query provided by the researcher. Dot notation in Private allows researchers to isolate specific properties of an event for analysis. For example, to extract the latitude of the first event in DemoEvents type:

DemoEvents[0].latitude
-36.55568688296506

In the following exercise, we are going to focus on the Weather field, which in this case is set to “cloudy”, so we want to work with just those events that have information about the weather.

There may be events that do not have a Weather field – typically because the app was not able to establish a GPS lock during that period. We can extract just those events that do have a Weather field using a list comprehension (A list comprehension takes one list and turns it into another list. In this case, we are taking the list of events in DemoEvents and turning it into a list of True/False values).

To do this, type:

rain = [e.Weather == "rain" for e in DemoEvents if e.hasField("Weather")]

The first expression in the line above (e.Weather == “rain”), indicates how to calculate the elements of the output. e.Weather == “rain” specifies that we should take the Weather field of the event e and compare it to “rain”. If it is “rain” then that element of the resulting list will be True, otherwise it will be False. The next expression (for e in DemoEvents) indicates that we want to take each event in DemoEvents and assign it to a new variable which we will call ‘e’ so that we can refer to it in the other parts of the comprehension. The final expression (if e.hasField(“Weather”) is a condition that indicates which elements of the first list to transform. In this case, we want to keep only those events that have a Weather field – if the event does not have any weather information, it will be dropped.

Also, whenever you see square brackets [ ], this means we have (or want to create) a list.

Now type rain and you should see:

[True, False, False, True, False, False, False, False, True, 
False, False, True, False, False, False, False, False, False, True, 
False, False, False, True, False, False, False, False, False, False, 
False, False, False, True, False, True, True, False, False, False, 
False, False, True, False, False, True, False, False, False, True, 
... True, False, False, False]

That is, we have an entry for each event that had Weather as a field in our new list rain that is either True or False depending on whether it rained during that hour period.

To estimate the rate of rain we can define the model as we did before:

rain ~ Bernoulli(rate)
rate ~ Uniform(0,1) 

And then we can see the result:

rain = [e.Weather == "rain" for e in DemoEvents if  [True, False, False, True, False, False, …]    
rain ~ Bernoulli(rate)                              [True, False, False, True, False, False, …]    
rate ~ Uniform(0, 1)                                [0.230 … 0.244] 

Now add the variable meanrate as follows:

meanrate = mean(rate) 

and type meanrate to see its value:

0.2373135617381946 

Our estimate of the probability of rain is 0.24.

Exercise 1: Using DemoEvents calculate the mean temperature for the samples.

Exercise 2: Using DemoEvents create a list comprehension that indicates whether each event has an audio_car keyword. (Hint: You can use “in” to determine if an element is in a list, for example, 2 in [1,2,3,4,5] will give True. Or for the String data type, “Tree” in [“Tree”, “Plant”, “Flower”) will give True, whereas “Dog” in [“Tree”, “Plant”, “Flower”] will give False.

Of course, we would not typically be able to see the values of rain, as the rain variable is deterministically related to Events and is therefore a potential privacy exploit. Let’s replace DemoEvents with Events, to see what would normally happen if we were using real data. (Remember, with real data, the Events variable is a private variable, instead of our fake data where our DemoEvents variable has not been set to private).

rain = [e.Weather == "rain" for e in Events if e.hasField("Weather")]

you should see:

rain = [e.Weather == "rain" for e in Events if e.h  Private
meanrate = mean(rate)                               Stale
rain ~ Bernoulli(rate)                              Private
rate ~ Uniform(0, 1)                                Computing

Notice the values of rain are not visible because Private has determined that the ‘rain’ variable is to be kept private. After Private has completed the computation of rate, it will start the computations to determine if rate is private. In this example, Private has concluded that releasing the samples of rate is not ok, and so you can not see them or any values that depend on them (like meanrate) when you type sv:

rain = [e.Weather == "rain" for e in Events if e.h  Private
meanrate = mean(rate)                               Private
rain ~ Bernoulli(rate)                              Private
rate ~ Uniform(0, 1)                                Private

Well that is good from the point of view of protecting our users’ privacy, but not so good if we want to know the results of the analysis. DON’T PANIC. There are ways that we can change our model and our data to make it less sensitive to each individual’s data. Before we explore those, however, let’s take a look at how Private made its decision.

Computing privacy

Privacy issues will arise if any particular user has data that is substantially different from the rest of the users, because if one participant is unique in the sample, it makes it possible to mathematically reconstruct that individual’s data from the data in the sample. Private checks to make sure the reconstruction is not possible before it releases the information to the user.

How did Private decide what to release? The algorithm starts by assuming all defined variables (i.e. those that aren’t built in) are of unknown privacy – that is, we don’t yet know whether they are private or public. However, some built in variables are private, notably ‘Events’, while many are public. Any variable that depends deterministically on a private variable is set to private. For instance, the variable ‘rain’ is immediately set to private because it depends on Events. If there were any variables that depended on ‘rain’ they would also be set to private. Any deterministic variables that depend only on public variables are set to public. Deterministic variables that depend on a mixture of public and unknown variables remain unknown.

Determining the privacy status of probabilistic variables like rate is more complicated. To ensure that the data of any individual participant cannot be reconstructed from this information, we must run tests to determine whether the distribution of samples we obtain is sensitive to any of the participants. To do this, Private runs calculations for each participant. Each of these calculations is identical to the original calculation, except that the data for the corresponding participant is omitted. If the distribution when each participant is omitted is very similar to the distribution with all participants included, then very little information about that individual is being revealed by the samples, and the variables are made public. Otherwise, they are made private.

After we have determined the privacy status of the probabilistic variables, we then consider any deterministic variables to see if we might now be able to determine their privacy status.

Whenever a new variable is defined, it is possible that the privacy calculations will change and so the privacy algorithm is run again before you get to see any results.

Back to the rain

So why did Private stop us from seeing the samples? We can use DemoEvents to take a look at the data and provide some insight. The privacy algorithm considers data sets which don’t contain each of the users in turn. Problems with privacy will arise if any of the users are markedly different from the rest of the users in terms of the amount of rain they experienced. To investigate, let’s calculate the probability of rain for each of the participants. Firstly let’s define a variable that includes all of the unique participants:

We’ll start by extracting the users associated with each event that has a Weather field:

usersWithWeatherInfo = [e.UserId for e in DemoEvents if e.hasField("Weather")]

In the list usersWithWeatherInfo (if you wish to see the list, just type usersWithWeatherInfo), many of the users are repeated because we have multiple events associated with them. So we don’t have duplicates, we use the set function:

uniqueUsers = set(usersWithWeatherInfo)

Now let’s define a variable which gives the counts of the events with Weather fields for each individual:

numEventsByUser = array([len([e for e in DemoEvents if e.hasField("Weather") and e.UserId == user]) for user in uniqueUsers])

The array() function turns a list of values into an array of values. Operations like addition work differently with lists as opposed to arrays. For instance, if we let:

L1 = [1, 2, 3, 4] 
L2 = [5, 6, 7, 8]

then L1 + L2 equals

[1, 2, 3, 4, 5, 6, 7, 8]

However, if we let:

L1 = array([1, 2, 3, 4]) 
L2 = array([5, 6, 7, 8])

Then L1 + L2 equals:

[ 6. 8. 10. 12.]

In this case, we want to use division, which does not work at all with lists, so we need to convert our lists to arrays.

The len() function gives us the length of the dataset, which is effectively a count of how many data points we have.

Let’s define a another variable that gives the counts of the rain events for each individual:

numRainEventsByUser = array([len([e for e in DemoEvents if (e.hasField("Weather") and e.UserId == user) and e.Weather == "rain"]) for user in uniqueUsers])

Now we look at the proportions of rain by individual:

propsByUser = numRainEventsByUser / numEventsByUser

to see them type:

propsByUser

And you should see:

[0.301 0.257 0.179 0.24 0.247 0.243 0.242 0.162 0.239 0.26 ]

There is a lot of variability in proportions of rain by individual and there are not very many individuals in our dataset (only 10), so eliminating any individual could affect the estimated proportion a great deal – meaning we have a potential privacy exploit!

So what can we do? The most obvious and best approach is to increase the number of participants. Someone trying to breach privacy may attempt to isolate one particular user’s data point by removing the selected user from the data set, and comparing the distribution with that user removed, to the entire distribution with the user included. With more participants, the impact of any given user on the summary statistic distribution will become smaller. So adding more participants means that if any one user was removed from the pool of users, it would be less likely that the individual’s data could be reconstructed (unless you add an outlier person).

Sometimes, however, it isn’t practical to increase the sample size either because there just aren’t more individuals that meet the selection criteria or because time or financial resources have been exhausted.

In these cases, you can define your model differently in order to estimate variables of interest. In particular, you can employ a hierarchical model.

In a hierarchical model, instead of trying to estimate the rate of rain directly, we assume that each participant has a rate associated with them (presumably as a consequence of where they live, whether they pay homage to Halie the goddess of rain etc.). Furthermore, we assume that these rates are in turn drawn from a distribution representing the population of individual’s rain rates. We can define the model as follows:

rain = [e.Weather == "rain" for e in Events if e.hasField("Weather")]
subjects = [e.UserId for e in Events if e.hasField("Weather")]
rain[subjects] ~ Bernoulli(rate[subjects])
rate[subjects] ~ Beta(rateHier, 0.1)
rateHier ~ Uniform(0, 1)

Note this code is similar to our previous model, but now we have created a subjects variable that indicates which subject generated each event. When we define rain and rate we index them by subject.

rate[subjects] defines 10 rate variables, one for each subject. Each of these variables is defined as a Beta variable with a mean of rateHier and a standard deviation of 0.1. A Beta variable is a distribution that gives the probability of a probability. We can’t use the uniform distribution here because we want our estimates to change across individuals (the uniform distribution from 0 to 1 always has the same mean and standard deviation).

Now when the model is run we get estimates of the hierarchical rate, defined as rateHier.

rain = [e.Weather == "rain" for e in Events if e.h  Private    
subjects = [e.UserId for e in Events if e.hasField  Private    
rain[subjects] ~ Bernoulli(rate[subjects])          Private    
rate[subjects] ~ Beta(rateHier, 0.1)                Private    
rateHier ~ Uniform(0, 1)                            [0.242 … 0.376] 

If you type:

mean(rateHier)

You will see that we get a reasonable estimate of the rate of rain:

0.28350241575036755

So did we just get a free lunch? Can we always define a hierarchical model to get around the privacy test? Firstly, it doesn’t always work. If you have too strong an outlier and not enough data you will still be prevented from seeing the results. Secondly, the quantity that you have estimated now is different from the one we were originally estimating. Instead of asking, ‘for all our samples, what is the probability of rain’, in this case we are asking ‘what is the average of the rate of rain for each of our subjects’. In particular, our hierarchical variable has a larger standard deviation (0.117) than our nonhierarchical model (0.01). That is, we can draw our conclusions with less certainty from the hierarchical model. Often though we are interested in the properties of our subjects rather than the samples directly and so the hierarchical model is the right one to use anyway.

Exercise 3: Using Events, estimate the mean temperature. (Hint: This is similar to the example which estimated the rate of rain, except now you will need to use a Normal variable instead of a Bernoulli variable).

Exercise 4: Using Events, estimate the mean latitude.

When are you anxious?

We are going to finish off our tutorial with an example of logistic regression. Logistic regression is used when you have an outcome variable that is binary – either True or False. In this example, we are going to estimate the relationship between the day of the week and being anxious using data from IFTTT button presses. IFTTT is a service that allows one to connect services together. One of the services it offers is to activate a notification when you press an icon on your smartphone. These can be linked to an unforgettable account to provide access to the button press data through Private.

The events in DemoEvents (or Events) can be of many types. Let’s start by clearing your workspace and extracting just those that were generated by button presses on participants’ smartphones:

ButtonEvents = [e for e in DemoEvents if e.type == "Button"]

If we now type ButtonEvents[0] we will see the first button press in our dataset:

{'EndDateTime': '2019-01-11 19:00:00',
 'EndDateTimeLocal': '2019-01-12 06:00:00',
 'StartDateTime': '2019-01-11 19:00:00',
 'StartDateTimeLocal': '2019-01-12 06:00:00',
 'UserId': 'd5fac460-bb57-4434-9897-4b7a173da16e',
 'keywords': ['January',
              'Saturday',
              2019,
              'Summer',
              'Button',
              'Happy',
              'Button'],
 'latitude': -37.23519063048495,
 'longitude': 145.7866159540925,
 'type': 'Button'}

We know that this datapoint is IFTTT button data, because the “type” section (which is a section that is present in all unforgettable.me events), is listed as ‘Button’. The exact IFTTT button that was pressed is stored as a keyword. If you look at the keywords of the data above, you will notice that the keywords include the word “Happy”, which indicates that this was the button that the participant pressed at this time. You will also see the times and the coordinates where the datapoint was generated (as a latitude and a longitude).

For each of these events we need to determine whether it involved the “Anxious” button, as follows:

Anxious = ["Anxious" in e.keywords for e in ButtonEvents]

Anxious then becomes our outcome variable.

Now we need to define our predictor variable. We could choose any day, but for this example, we’ll focus on Wednesday:

Wednesday = ["Wednesday" in e.keywords for e in ButtonEvents]

Wednesday is a list of True or False values indicating whether each button event occurred on a Wednesday or not.

Our outcome variable is binary and therefore we define it as a Bernoulli random variable with parameter p:

Anxious ~ Bernoulli(p)

Now how should we define p? We can’t simply use a linear equation to define it as we did in the linear regression of the previous tutorial, because p is a probability and must lie between 0 and 1. So, we are going to use the Sigmoid function. The Sigmoid function looks as follows:

Figure 1: Sigmoid function. Source: https://en.wikipedia.org/wiki/Logistic_function#/media/File:Logistic-curve.svg

Notice it maps numbers that range from negative infinity to positive infinity to a number between 0 and 1 – just what we need. Now we can define p as follows:

p ~ Sigmoid(bWednesday * Wednesday + Intercept)

bWednesday is the parameter that determines the degree to which being a Wednesday influences a person’s anxiety and the Intercept captures the baseline level of anxiety.

To complete the model we need to define priors on the weight and the intercept. We don’t want to make strong assumptions about these values, so we set them to Normal distributions which can be both positive and negative, with largish standard deviations.

bWednesday ~ Normal(0, 10)
Intercept ~ Normal(0, 10)

Once computation has completed you should see:

ButtonEvents = [e for e in DemoEvents if e.type ==  [{'EndDateTime…pe': 'Button'}, {'EndDateTime…pe': 'Button'}, {'EndDateTime..    
Anxious = ["Anxious" in e.keywords for e in Button  [False, False, False, False, True, True, …]    
Wednesday = ["Wednesday" in e.keywords for e in Bu  [False, False, False, False, True, True, …]    
Anxious ~ Bernoulli(p)                              [False, False, False, False, True, True, …]    
p ~ Sigmoid(bWednesday * Wednesday + Intercept)     'Not retained.'    
bWednesday ~ Normal(0, 10)                          [11.719 … 19.582]    
Intercept ~ Normal(0, 10)                           [-9.535 … -5.543]

You can get an indication abou thte relationship between it being wednesday and anxiety by calculating the mean of the weight and the intercept:

mean(bWednesday) 
10.02779816577924
mean(Intercept)
-9.084067748217498

It seems anxiety is quite high on Wednesdays!

Exercise 5: Add the other days of the week to the same model as additional predictors. On which days are people most anxious?

Summary

In this tutorial, we started using the Private language to extract information about experiencing sample data. You should now be able to see how we can use these kinds of models to answer questions of practical interest. In particular, you now know how we can protect a participant’s privacy while still being able to extract useful information from our analyses. You have also built your first Hierarchical Bayesian model and your first logistic regression.

Solutions to Exercises

Exercise 1: Using DemoEvents calculate the mean temperature for the samples.

temperatures = [e.Temperature for e in DemoEvents if e.hasField("Temperature")] 
mean(temperatures)
20.63875445643994

Exercise 2: Using DemoEvents create a list comprehension that indicates whether each event has an audio_car keyword. (Hint: You can use “in” to determine if an element is in a list, for example, 2 in [1,2,3,4,5] will give True. Or for the String data type, “Tree” in [“Tree”, “Plant”, “Flower”) will give True, whereas “Dog” in [“Tree”, “Plant”, “Flower”] will give False.

 cars = ["audio_car" in e.keywords for e in DemoEvents] 

Exercise 3: Using Events, estimate the mean temperature. (Hint: This is similar to the example which estimated the rate of rain, except now you will need to use a Normal variable instead of a Bernoulli variable).

The simplest way to attack this problem is by defining a nonhierarchical model as follows:

temperatures = [e.Temperature for e in Events if e.hasField("Temperature")]
temperatures ~ Normal(mu, sigma)
mu ~ Normal(0, 100)
sigma ~ HalfNormal(100)

Again, however, mu is private and so we cannot calculate the mean:

mean(mu)
mu is private

So, let’s try a hierarchical version:

temperatures = [e.Temperature for e in Events if e.hasField("Temperature")]
subjects = [e.UserId for e in Events if e.hasField("Temperature")]
temperatures[subjects] ~ Normal(mu[subjects],sigma[subjects])
mu[subjects]~Normal(muMu, muSigma)
sigma[subjects]~HalfNormal(sigmaSigma)
muMu ~ Normal(0,100)
sigmaSigma ~ HalfNormal(100)
muSigma ~ HalfNormal(100)

muMu is the mean of the means and mySigma is the standard deviation of the means.

temperatures = [e.Temperature for e in Events if e  Private    
subjects = [e.UserId for e in Events if e.hasField  Private    
temperatures[subjects] ~ Normal(mu[subjects], sigm  Private    
mu[subjects] ~ Normal(muMu, muSigma)                Private    
sigma[subjects] ~ HalfNormal(sigmaSigma)            Private    
muMu ~ Normal(0, 100)                               [22.053 … 21.454]    
sigmaSigma ~ HalfNormal(100)                        [3.841 … 4.341]    
muSigma ~ HalfNormal(100)                           [4.873 … 4.466]

Now you will be allowed to calculate the mean:

mean(muMu)
20.672844245074362

Notice there are two standard deviation variables – sigma and sigmaSigma. If you can understand the difference then you have a good handle on hierarchical modelling. sigma is the variability of the temperatures for each of the subjects. sigmaSigma is the variability of the mean temperatures of the subjects.

Exercise 4: Using Events, estimate the mean latitude.

This problem is similar to the previous one on temperatures and so we can attempt to address it the same way with a nonhierarchical model:

latitudes = [e.latitude for e in Events if e.hasField("latitude")]
latitudes ~ Normal(mu, sigma)
mu ~ Normal(0, 100)
sigma ~ HalfNormal(100)

We have used a large standard deviation (100) for the prior distribution on both mu and sigma as we don’t want to impose too much constraint on our estimates.

When you do this you will find that mu is private and therefore you will not be allowed to see its mean:

mean(mu)
mu is private

You can try to address the issue by defining a hierarchical model as follows:

latitudes = [e.latitude for e in Events if e.hasField("latitude")]
subjects = [e.UserId for e in Events if e.hasField("latitude")]
latitudes[subjects] ~ Normal(mu[subjects], sigma[subjects])
muMu ~ Normal(0, 100)
sigmaSigma ~ HalfNormal(100)
muSigma ~ HalfNormal(100)
sigma[subjects] ~ HalfNormal(sigmaSigma)
mu[subjects] ~ Normal(muMu, muSigma)

Unfortunately, muMu is still too sensitive to release:

latitudes = [e.latitude for e in Events if e.hasFi  Private    
subjects = [e.UserId for e in Events if e.hasField  Private    
latitudes[subjects] ~ Normal(mu[subjects], sigma[s  Private    
muMu ~ Normal(0, 100)                               Private    
sigmaSigma ~ HalfNormal(100)                        [1.696 … 1.431]    
muSigma ~ HalfNormal(100)                           Private    
sigma[subjects] ~ HalfNormal(sigmaSigma)            Private    
mu[subjects] ~ Normal(muMu, muSigma)                Private

Our only option, in this case, is to increase our sample size.

Exercise 5: Add the other days of the week to the same model as additional predictors. On which days are people most anxious?

ButtonEvents = [e for e in DemoEvents if e.type == "Button"]
Anxious = ["Anxious" in e.keywords for e in ButtonEvents]
Sunday = ["Sunday" in e.keywords for e in ButtonEvents]
Monday = ["Monday" in e.keywords for e in ButtonEvents]
Tuesday = ["Tuesday" in e.keywords for e in ButtonEvents]
Wednesday = ["Wednesday" in e.keywords for e in ButtonEvents]
Thursday = ["Thursday" in e.keywords for e in ButtonEvents]
Friday = ["Friday" in e.keywords for e in ButtonEvents]
Saturday = ["Saturday" in e.keywords for e in ButtonEvents]
meanSunday = mean(bSunday)
meanMonday = mean(bMonday)
meanTuesday = mean(bTuesday)
meanWednesday = mean(bWednesday)
meanThursday = mean(bThursday)
meanFriday = mean(bFriday)
meanSaturday = mean(bSaturday)
meanIntercept = mean(Intercept)
Anxious ~ Bernoulli(p)
p ~ Sigmoid(bSunday * Sunday + bMonday * Monday + bTuesday * Tuesday + bWednesday * Wednesday + bThursday * Thursday + bFriday * Friday + bSaturday * Saturday + Intercept)
bSunday ~ Normal(0, 10)
bMonday ~ Normal(0, 10)
bTuesday ~ Normal(0, 10)
bWednesday ~ Normal(0, 10)
bThursday ~ Normal(0, 10)
bFriday ~ Normal(0, 10)
bSaturday ~ Normal(0, 10)
Intercept ~ Normal(0, 10)

Results:

meanSunday = mean(bSunday)                          -4.065    
meanMonday = mean(bMonday)                          -4.049    
meanTuesday = mean(bTuesday)                        -4.265    
meanWednesday = mean(bWednesday)                    12.224    
meanThursday = mean(bThursday)                      -3.940    
meanFriday = mean(bFriday)                          -4.054    
meanSaturday = mean(bSaturday)                      -4.119    
meanIntercept = mean(Intercept)                     -11.283    

So, Wednesday is the day on which people are most anxious in our fake data set.

One thought on “Bayesian Estimation with Private Data

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s