-
Notifications
You must be signed in to change notification settings - Fork 387
Aggregate File Creation: Skip packages if payload too small #108
Comments
Maybe combine with #140 |
Alternative suggestion: Pad packages with random dummy keys if too small Thinking a bit further into the hoped for future, when new infections are becoming rare, this might lead to unnecessary (possibly quite long) delays in notifications being sent. Could one instead use random dummy keys to pad every package that is too small to maybe 50 or 100 keys (or possibly padding to a randomised amount so an outsider can't even determine whether there any dummies are in the set). This would impact required data volumes marginally - in the worst case, if 75 dummy keys are added every hour, this would represent only 50 kiB per day (or 3% of your bandwidth example in the architecture document) if I did not miscalculate. Since this would only occur in a low load situation anyway, this should be acceptable. |
Threhold is now: 140 |
I like the padding idea - @MKusber do you think we can do this from a DPP/security perspective? |
Hey @MikeJayDee, after some investigation, @michael-burwig and me decided that we will go for the "shifting" approach instead of the "padding" approach for this release. For clarification: "Shifting" approach:
"Padding" approach:
As you have rightly pointed out, the main disadvantage of the "shifting" approach is that it introduces an artificial delay in the distribution of keys. This means that people will not be warned about a potential COVID exposure as quickly as possible, but that the CWA server will wait with the distribution until enough submissions have been made. Especially as the number of submissions decreases, this delay will become significant. This disadvantage, of course, doesn't exist with the "padding" approach. However, the purpose of this whole "threshold" feature is to preserve peoples anonymity and to not allow anyone to create movement profiles. In order to achieve that, it must be impossible to match a specific temporary exposure key (TEK) to any particular person. In the "shifting" approach, this is achieved by ensuring that a single person's TEKs are always mashed together with enough other people's real TEKs to get lost in the shuffle. In the "padding" approach, this is achieved by mashing a single person's TEKs together with artificially/randomly generated TEKs. In order to make the "padding" approach equally as secure and privacy preserving, we would need to make real TEKs and artificially/randomly generated TEKs absolutely indistinguishable from one another. We decided for the "shifting" approach for this release because we are not 100% certain that we can achieve that - for the following reasons: There are 4 fields in a TEK:
"Key data" is random by nature and can easily be generated - no problem there. "Rolling period" is always 144 - no problem here either. However, in the real world, "transmission risk levels" and "rolling start interval numbers" each follow some statistical distribution. In order to make real and fake TEKs indistinguishable to any reasonably certainty, we would need to make sure, that the fake TEKs follow the same distribution as the real ones. The problem is, that we are not sure what the exact properties of that distribution are (and how they change over time). We can thus not confidently guarantee that one wouldn't be able to distinguish random and real TEKs through statistical analysis. Since data privacy and protection are the foundation of this whole project, we decided to not take this risk and go with the safer but less accurate solution. Now, it might be entirely possible that there is a solution to this problem that we just don't see. But since time is of the essence right now (and we can not wait for a scientifically substantiated analysis of the "padding" approach), we decided to go for the "shifting" approach for the initial release. Anyways. Thank you very much for sharing your idea, @MikeJayDee. We forwarded your idea and the open questions outlined above to our scientific partners and will keep you posted on any updates 👍 |
Thanks a lot for this explanation, which I fully understand. It is probably worth keeping this in mind though and monitor how often packages are actually skipped. I would expect lots of skipping during the night but hopefully not that often during the day, and then not for more than 1 or 2 hours. If you notice that delays are getting long (6 hours or more during the day), one could investigate the padding again - maybe a reasonable distribution for transmission risk levels and rolling start interval number can be calculated after the system has been running for a few weeks. I'm not even sure that a lot of information can be gleaned from a file with padded data if the padded data is not completely out of line with the real data. |
@pithumke I think an acceptable solution for "Padding" would be to expand every "short" TEK list that is uploaded (with less than 14 days) to a "full" 14-day TEK list with dummy key data(*). The "Transmission risk level" is assigned by the apps that your team creates, e.g. here, based on a fixed profile. (*) You may also force every "Rolling start interval number" to the standard 00:00 UTC, in case a TEK that started "during the day" were reported for the first use day in iOS, which I don't know. Android does report all the TEKs starting at 00:00 UTC, even if the first enabling didn't happen at exactly that time. |
@pithumke , as @MikeJayDee mentioned it may be good to re-evaluate whether it would be possible to switch to "padding" solution after some real TEKs have already been generated (perhaps it's already possible as I've seen many Diagnosis Keys on CDN before the app launched, I'm not sure whether they were real, coming from real-world testers, or synthetic though). But as @mh- suggestion shows, I would not expect it to be a big problem to just come up mathematically with proper distribution of Diagnosis Keys indistinguishable from real ones. (@mh- if I understood specification correctly it follows directly from the specification that "Rolling start interval number" corresponds always to 00:00 UTC) To make a right decision between shifting and padding approach we need to consider that:
That's a philosophical / moral problem so I can only share my personal view here: I'd favor "shifting" approach only if there would be certainty that privacy violations associated with "padding" approach would be severe enough to undermine trust in the system, and in turn result in more lives lost. Currently I don't see enough evidence that it would be so big problem for privacy, it would require sophisticated attacker able to distinguish synthetic keys from real ones, and determined enough to gather enough data to construct some movement profiles. It seems quite unlikely, especially considering that there are much more severe issues with privacy inherent to this system, like identifying if some specific person was in contact with infected person just by registering BLE beacons received by this person (this in turn would reveal when and where contact with infected person took place). I think that this threat is unlikely to undermine trust in the system, and threat of exploiting "padding" approach seems to be weaker by at least the order of magnitude. |
Can somebody explain how you can create a movement profile from the gathered data, please? |
@mohe2015 you could e.g. distribute BLE scanners across the city and listen to all BLE beacons transmitted by CWA. In case there are very few infections and the server would e.g. include only one real set of 14 Diagnosis Keys in a package (rest fake to pad this package), then maybe attacker would be able to single-out relevant Diagnosis Keys by analyzing statistics of keys included in this package, and then attacker would be able to reveal whereabouts of infected person in the last 14 days (and of course identify this person based on this info). But attacker would need to operate quite sophisticated network of BLE scanners, probably there are orders of magnitude cheaper methods (just bribe the doctor, hire someone to follow a person around). It's quite abstract privacy concern for me, but maybe someone could come up with cheaper / more practical attack method. |
@kbobrowski As far as I understand the system I think this would work independently of the number of keys if you distribute that many bluetooth scanners. Or am I wrong? |
The key here is I think being able to connect 14 different Diagnosis Keys that belong to the same person. The less Diagnosis Keys in the package the easier it is (it's trivial case if there are only 14 Diagnosis Keys in the package). This attack will be fuzzed by Diagnosis Keys from e.g. 10 people in the same package (140 keys - current limit), assuming these 10 people roughly share time and space. But if they are spatially distributed then it becomes easy again. We can think of this BLE scanners in different ways - it can be rogue app collecting BLE data on user devices, or community effort by users to share received RPIs in order to facilitate identifying infected person whereabouts. I'm not really sure how packaging only 140 keys together helps much, since TEK changes only once per day it would be quite trivial to still divide this package in 14 keys of 10 different people based on e.g. place where they live (in the morning and in the evening the same TEK will be producing RPIs in the same place, and this pattern repeats for 14 days). I feel it's all quite abstract and academic, but I have not gave it much thought, perhaps there are other methods. Maybe I'm missing something since this attack does not require statistical analysis to single-out Diagnosis Keys ;) |
There is some hint about this in solution architecture:
It seems like trying to solve the same issue which "shifting" or "padding" approaches are solving, only when there are much more keys in the database. If attacker is able to connect Diagnosis Keys it would potentially be possible to track whereabouts of infected person during last 14 days, but even without connecting it, it's still possible to track whereabouts during single day since TEKs rotate daily. So these efforts of shuffling TEKs in database or applying "shifting" approach can merely reduce tracking time from 14 days down to 1 day. I'm not sure if it would make big difference if someone's location would be revealed for the last 14 days or for the last 1 day - probably it would deal equal blow to the reputation of the system (that said, I think this scenario is highly unlikely). |
Thanks a lot for the explanation and for looking in the documentation. I agree that this seems to be an unlikely issue if your assumptions are correct. |
@christian-kirschnick @pithumke There is one corner case where it would actually be better if you padded each key with a random number of keys (not a fixed
User A can now check Diagnosis Keys for day 2. If there are exactly 10 keys with Transmission Risk Level 8 on day 2 (and exactly 10 keys with Transmission Risk Level 5 on day 1), this proves that user B was close to user A on day 2. This seems to be a rare corner case. But also changing |
Hey everyone, thank you for the great input and please excuse my belated reply. I will try to cover the topics and questions one by one, but before that, I would like to mention a few things for context:
That's true, nice idea! This would also work with "padding" implementations in both submission and distribution. If you see that the existing padding implementation would still benefit from this, I humbly encourage you to open up an enhancement request or even create a PR.
Yes, we know the algorithm by which the transmission risk levels are calculated, but that algorithm can only output values with a correct statistical distribution if it is fed input values that follow a correct statistical distribution. And we know the statistical distributions of neither inputs nor outputs with absolute certainty yet.
As I just recently found out, that's already happening in both mobile APIs. The TEKs that are generated on the mobile devices are valid from 00:00 to 00:00, regardless of when the app was installed / the API was first activated. If this would not have been the case, simply "fixing" the rolling interval on submission would have caused incorrect exposure notifications, though.
I agree with your assessment that this boils down to a philosophical question. In order to make a choice here, we analyzed the potential privacy implications in detail and decided together with our project partners and stakeholders to go for a mixed approach: to keep the 140-key threshold, but to also add configurable padding (on submission).
I'm afraid, I don't understand how the scenario you described could be prevented from randomizing the number of padded keys. As far as I can see, the insight "User A was close to User B" only stems from the received beacons, but not from analyzing the random keys. On which day did which user upload how many keys and what information can the other user extract from that? Could you please clarify? Again, I would like to thank all participants of this conversation very much for your valuable questions, ideas and comments. However, this discussion is beginning to cover more and more topics in parallel (some of which might have even become "obsolete" since #609) and it's all happening inside a PR that was already closed 20 days ago (which reduces participation and exposure). Because of this, I want to ask you to please open up separate issues with your respective questions/concerns/proposals/ideas so we can keep discussions focused and visible. Thanks! - Pit |
Hi @pithumke - thanks a lot for your detailed reply.
Maybe I don't fully understand what you mean by inputs and outputs in this case. However, as I explained here, I don't think this added "key noise" will actually prevent an attacker - who recorded RPIs in multiple locations - from tracking the movement of spatially distributed Diagnosis Key reporters, because they cannot match those additional random keys to any RPIs, and they would anyhow correlate the Diagnosis Keys based on the location where they recorded matching RPIs, not by analyzing statistical properties of the Diagnosis keys..
I don't think it could have caused incorrect exposure notifications, because fixing the time to 00:00 means expanding the validity time frame backwards, and before the time when a new TEK is generated from SecureRandom, nobody could have sent RPIs derived from that new TEK.
On day 2, user A can initially not be certain that the reporting user was actually user B, because they didn't see each other (user B was hiding in this example). User A will only get a warning caused by an anonymous Diagnosis Key, which should be unlinkable to Diagnosis Keys on other days. The scenario now assumes that only a single user has reported a Diagnosis Key for day 1 and day 2 with unique Transmission Risk Levels from the RKI profile (as mentioned above in this comment). Having only few reporting users might not be the intended scenario, but that's what we are apparently seeing right now, and that's why you did the work on the padding in the first place. And again, I agree that this is a corner case. Having a fixed multiplication factor will allow user A to deduce that there was only 1 user who reported both the matching Diagnosis Keys on day 1 and day 2. |
I agree with @mh- , as I can see no scenario where padding with a fixed multiplication factor increases privacy, since a valid TEK can always be correlated to a captured RPI, regardless of the amount of fake keys. Moreover by using a padding factor of 10, the number of required submissions gets reduced by the same factor. Thus only 14 real keys are required for reaching the threshold of 140 entries. This is step down compared to the previous requirement of 50 real keys in the beginning of this issue-discussion. |
Consider User B using the app for 14 days before becoming infected. He uploads his 14 real keys, they get padded to 140 real and fake keys. Let's say he's the only User to upload on that day. As the threshold of 140 is reached, it gets published by the end of the day. User A sees this data set. He is aware of the padding scheme and sees that there are exactly 10 keys for each day. i.e. 10 keys for day 1, 10 keys for day 2, and so on. He can deduce that this data set consists of 14 real keys that belong to the same user. In theory, he could track User B across multiple days. |
For reference, Google also now recommends padding with a random number of keys, which they call "jitter": google/exposure-notifications-internals#9 |
Change the aggregate file creation logic in a way, in which packages are skipped in case the payload (amount of keys) is too small. The threshold should be configurable, and defaulted to 50 for now.
The text was updated successfully, but these errors were encountered: