-
Notifications
You must be signed in to change notification settings - Fork 0
/
noisymax.tex
253 lines (233 loc) · 11.2 KB
/
noisymax.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
Noisy Max is a simple yet useful data publishing mechanism in
differential privacy~\cite{DR:14:AFDP,DWWZK:18:DVDP}. Consider $n$
queries of the same range, say, the number of patients for $n$
different diseases in a hospital. We are interested in knowing which
has the maximal value among the $n$ queries. A simple
privacy-respecting way to release the information is to add
independent noises to every query results and then return the index of
the maximal noisy results.
Algorithm~\ref{algorithm:noisy-max} shows an instance of Noisy Max by
applying the $\frac{1}{2}$-geometric mechanism to each query with
the discrete range $\{ 0, 1, 2 \}$. The algorithm first computes
a noisy result $\tilde{v}_i$ for $v_i$ using the geometric
mechanism. It then returns the index $r$ of the maximal
$\tilde{v}_r$ among $\{ \tilde{v}_1, \tilde{v}_2, \ldots, \tilde{v}_n
\}$.
% It is important to return indices. Otherwise, this is not DP.
\begin{algorithm}
\begin{algorithmic}[1]
\Require{$0 \leq v_1, v_2, \ldots, v_n \leq 2$}
\Ensure{The index $r$ with the maximal $v_r$ among $v_1, v_2, \ldots, v_n$}
\Function{NoisyMax}{$v_1, v_2, \ldots, v_n$}
\For{each $v_i$}
\Match{$v_i$}
\Comment{apply $\frac{1}{2}$-geometric mechanism to $v_i$}
\lCase{$0$}
{$\tilde{v}_i \leftarrow 0, 1, 2$ with probability
$\frac{2}{3}, \frac{1}{6}, \frac{1}{6}$}
\lCase{$1$}
{$\tilde{v}_i \leftarrow 0, 1, 2$ with probability
$\frac{1}{3}, \frac{1}{3}, \frac{1}{3}$}
\lCase{$2$}
{$\tilde{v}_i \leftarrow 0, 1, 2$ with probability
$\frac{1}{6}, \frac{1}{6}, \frac{2}{3}$}
\EndMatch
\EndFor
\State{find $1 \leq r \leq n$ such that
$\tilde{v}_r \geq \tilde{v}_j$ for every $1 \leq j \leq n$}
\State{\Return {$r$} }
\EndFunction
\end{algorithmic}
\caption{Noisy Max}
\label{algorithm:noisy-max}
\end{algorithm}
To check Algorithm~\ref{algorithm:noisy-max}, consider the Markov
chain in Figure~\ref{figure:hmm-noisy-max}. The possible values of
$v_i$ are shown on the top. In the figure, node labels denote the
values of $v_1$, $v_2$, and $v_3$. For instance, the node labeled
$011$ represents the inputs $(v_1, v_2, v_3) = (0, 1, 1)$. Similarly,
the inputs $(v_1, v_2, v_3) = (1, 2, 0)$ and $(v_1, v_2, v_3) = (2, 0,
2)$ are shown.
The noisy values are the labels of the nodes at the bottom.
Hence the node labeled $0\underline{2}1$ represents the noisy values
$\tilde{v}_1 = 0$,
$\tilde{v}_2 = 2$, and $\tilde{v}_3 = 1$. The underline denotes a
maximal noisy value. For the node labeled $0\underline{2}1$, the index
$2$ is returned (or observed) because $\tilde{v}_2$ has the maximal
value among $\tilde{v}_1$, $\tilde{v}_2$, and $\tilde{v}_3$.
Arrows again denotes probabilistic transitions. Hence the probability
of moving to $(\tilde{v}_1, \tilde{v}_2, \tilde{v}_3) = (0, 2, 1)$
from $(v_1, v_2, v_3) = (0, 1, 1)$ is $\frac{2}{3} \cdot \frac{1}{3}
\cdot \frac{2}{3}$ by the $\frac{1}{2}$-geometric
mechanism. Similarly, the probability of obtaining the same noisy
values from $(v_1, v_2, v_3) = (1, 2, 0)$ is $\frac{1}{3} \cdot
\frac{2}{3} \cdot \frac{1}{3}$.
\begin{figure}
\centering
\resizebox{.8\columnwidth}{!}{
\begin{tikzpicture}[->,>=stealth',shorten >=1pt,auto,node
distance=2cm,node/.style={circle,draw}]
\node at (-2.3, 1) { $\cdots$ };
\node[node,scale=.67] (v011) at (-1.6, 1) { $011$ };
\node at ( -.8, 1) { $\cdots$ };
\node[node,scale=.67] (v120) at ( 0, 1) { $120$ };
\node at ( 1.0, 1) { $\cdots$ };
\node[node,scale=.67] (v202) at ( 2.0, 1) { $202$ };
\node at ( 2.8, 1) { $\cdots$ };
\node at (-1.5, -1) { $\cdots$ };
\node[node, scale=0.67] (021) at (0, -1) { $0\underline{2}1$ };
\node at ( 1.5, -1) { $\cdots$ };
\path
(v011) edge [left,below] node [rotate=310]
{ $\frac{2}{3} \cdot \frac{1}{3} \cdot \frac{2}{3}$ }
(021)
(v120) edge [right,above] node [rotate=270]
{ $\frac{1}{3} \cdot \frac{2}{3} \cdot \frac{1}{3}$ }
(021)
(v202) edge [right,below] node [rotate=45]
{ $\frac{1}{6} \cdot \frac{1}{6} \cdot \frac{1}{3}$ }
(021)
;
\end{tikzpicture}
}
\caption{Hidden Markov Model for Noisy Max}
\label{figure:hmm-noisy-max}
\end{figure}
In the literature of differential privacy~\cite{DR:14:AFDP,DWWZK:18:DVDP},
Noisy Max is proved to effectively protect privacy for neighboring data sets. For instance, the input
nodes at the top of Figure \ref{figure:hmm-noisy-max} labeled $011$ and $021$ are neighbors
so that they have similar probabilities of having same observations. What's more,
in the hidden Markov model, we can compute if some attacker has prior information
about the database distribution, whether privacy is still preserved. The attacker may know
a range of things in prior, such as the probability $p$ of contracting some disease $A$,
some strongly infectious disease $B$ which may infect the whole family, some members in the
data set who are strongly correlated, etc.
\noindent
\textit{Example 7.1} We describe a classic scenario of applying Noisy Max. Assume that there
are $3$ diseases $\{A,B,C\}$ and $2$ members ${John, Alice}$ in the data set. People can observe which disease
infects the most people.
In order to protect privacy, Noisy Max algorithm is applied to the results.
We can also set some rule that if more than one disease infects the most people, the smallest index will be observed.
There's one attacker who has some prior knowledge about some of the diseases and members in the data set:
For disease $A$, usually people will contract the disease with probability $p_1$;
for disease $B$, it is highly contagious so that the family members John and Alice will contract
it at the same time with probability $p_2$ or neither will contract it;
the attacker doesn't have any information about disease $C$. Now we want to figure out,
is there much difference in observations between John contracts disease $C$ and John doesn't with those prior knowledge?
That is, we want to know whether this model satisfies some $\epsilon$-Pufferfish privacy.
We simulate this scenario in our model.
The hidden Markov model is just the same as in Figure \ref{figure:hmm-noisy-max}, except that
this time there are initial distributions among states caused by the attacker's prior knowledge.
We use $\pi$ and $\tau$ to represent the distributions when John contracts $C$ and not.
We can compute under $\pi$ and $\tau$ the probability starting from every node labeled $(i,j,k)$ at the top of Figure \ref{figure:hmm-noisy-max},
where $i$ persons contract $A$, $j$ persons contract $B$ and $k$ persons contract $C$.
For instance, $\pi (0,2,1) = (1-p_1)^2 p_2 \times \frac{1}{2}$. Here the reason for the $\frac{1}{2}$
is that we only assume that John contracts $C$ and thus $k$ could only be $1$ or $2$. Without other information
about disease $C$, a uniform
distribution on $k$ among $\{1,2\}$ is taken. More starting probability could be computed by using the
tables below.
\vspace{0.05\textwidth}
\begin{minipage}{\textwidth}
\centering
\begin{minipage}{0.4\textwidth}
\centering
\makeatletter\def\@captype{table}\makeatother\caption{Probabilities for node labeled $(i,j,k)$ under $\pi$ }
\label{table:pi}
\begin{tabular}{c|ccc}
& 0 & 1 & 2 \\
\hline
$i\vphantom{p^2}$
& $(1-p_1)^2$
& $2p_1(1-p_1)$
& $p_1^2$
\\
$j\vphantom{p^2}$
& $(1-p_2)$
& 0
& $p_2$
\\
$k\vphantom{p^2}$
& 0
& $\frac{1}{2}$
& $\frac{1}{2}$
\end{tabular}
\end{minipage}
\hspace{0.1\textwidth}
\begin{minipage}{0.4\textwidth}
\centering
\makeatletter\def\@captype{table}\makeatother\caption{Probabilities for node labeled $(i,j,k)$ under $\tau$ }
\label{table:tau}
\begin{tabular}{c|ccc}
& 0 & 1 & 2 \\
\hline
$i\vphantom{p^2}$
& $(1-p_1)^2$
& $2p_1(1-p_1)$
& $p_1^2$
\\
$j\vphantom{p^2}$
& $(1-p_2)$
& 0
& $p_2$
\\
$k\vphantom{p^2}$
& $\frac{1}{2}$
& $\frac{1}{2}$
& 0
\end{tabular}
\end{minipage}
\end{minipage}
\vspace{0.05\textwidth}
By setting values to $p_1$ and $p_2$ and using Algorithm \ref{algorithm:pufferfish-check}, we can get a solution
showing whether it satisfies $\epsilon$-Pufferfish privacy under these values.
In our implementation, we set $p_1 = 0.2$ and $p_2 = 0.4$, and use \zpython to check whether
$\ln(2)$-Pufferfish privacy is satisfied. The SMT solver returns a ``sat'' result, together with an observation
sequence $0$. This shows that $\ln(2)$-Pufferfish privacy is violated under the observation sequence $0$.
What's more, we can also set $p_1$ and $p_2$ as variables and slightly change Algorithm \ref{algorithm:pufferfish-check}
so that SMT solver will return a solution of $p_1$ and $p_2$ such that $\epsilon$-Pufferfish privacy
is preserved under these values.
\noindent
\textit{Example 7.2} There's a variant version of Noisy Max, where it follows the usual settings
except at the last step the maximal
noisy result will be returned, as described in Algorithm \ref{algorithm:noisy-max-bad}.
This version has been considered to be wrong(not strictly proved) in the sense that it doesn't preserve differential privacy.
We can use HMM to model this algorithm.
To be specific, the Markov chain part of this model is as same as in Algorithm \ref{algorithm:noisy-max}.
The difference is that the maximal value, instead of corresponding index, is observed in hidden Markov
model. For instance in the noisy-added state labeled $210$,
$2$ as the maximal value will be observed, while in the correct version of Noisy Max the corresponding index $1$ of $\tilde{v}_1$ is observed.
Using our Pufferfish checking algorithm, we can easily check that whether
Algorithm \ref{algorithm:noisy-max-bad} satisfies differential privacy.
One important thing to be noted is that, in some cases where $n$ is small,
differential privacy is still preserved.
For instance our implementation returns ``unsat'' when $n=5$, meaning that
$\ln(2)$-differential privacy is still preserved under this case.
This result doesn't violate the general consideration of the algorithm;
instead, it shows that specific privacy scenarios can be verified using our algorithm.
\begin{algorithm}
\begin{algorithmic}[1]
\Require{$0 \leq v_1, v_2, \ldots, v_n \leq 2$}
\Ensure{The maximal $v_r$ among $v_1, v_2, \ldots, v_n$}
\Function{NoisyMax}{$v_1, v_2, \ldots, v_n$}
\For{each $v_i$}
\Match{$v_i$}
\Comment{apply $\frac{1}{2}$-geometric mechanism to $v_i$}
\lCase{$0$}
{$\tilde{v}_i \leftarrow 0, 1, 2$ with probability
$\frac{2}{3}, \frac{1}{6}, \frac{1}{6}$}
\lCase{$1$}
{$\tilde{v}_i \leftarrow 0, 1, 2$ with probability
$\frac{1}{3}, \frac{1}{3}, \frac{1}{3}$}
\lCase{$2$}
{$\tilde{v}_i \leftarrow 0, 1, 2$ with probability
$\frac{1}{6}, \frac{1}{6}, \frac{2}{3}$}
\EndMatch
\EndFor
\State{find $\tilde{v}_r $ such that $1 \leq r \leq n$,
$\tilde{v}_r \geq \tilde{v}_j$ for every $1 \leq j \leq n$}
\State{\Return {$\tilde{v}_r$} }
\EndFunction
\end{algorithmic}
\caption{Wrong Version of Noisy Max}
\label{algorithm:noisy-max-bad}
\end{algorithm}