This repository has been archived by the owner on May 16, 2023. It is now read-only.

Aggregate File Creation: Skip packages if payload too small #108

Closed
christian-kirschnick opened this issue May 15, 2020 · 21 comments · Fixed by #406

@christian-kirschnick
Member

Change the aggregate file creation logic so that packages are skipped when the payload (number of keys) is too small. The threshold should be configurable and default to 50 for now.
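A minimal sketch of what this could look like, assuming an assembler class with an injected threshold (all names below are illustrative, not the actual server classes):

```java
import java.util.List;
import java.util.Optional;

/** Illustrative sketch only; names are hypothetical, not actual cwa-server classes. */
public class HourlyPackageAssembler {

  private final int minimumKeyCount; // configurable threshold, defaulting to 50 for now

  public HourlyPackageAssembler(int minimumKeyCount) {
    this.minimumKeyCount = minimumKeyCount;
  }

  /** Returns an empty result when fewer keys than the threshold were submitted, i.e. the package is skipped. */
  public <K> Optional<List<K>> assemble(List<K> submittedKeys) {
    if (submittedKeys.size() < minimumKeyCount) {
      return Optional.empty(); // payload too small: do not publish this package
    }
    return Optional.of(submittedKeys);
  }
}
```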

@christian-kirschnick added this to the Beta milestone May 15, 2020
@pithumke
Member

Maybe combine with #140

@MikeJayDee

Alternative suggestion: Pad packages with random dummy keys if too small

Thinking a bit further into the hoped-for future, when new infections become rare, this might lead to unnecessary (possibly quite long) delays in notifications being sent. Could one instead use random dummy keys to pad every package that is too small up to maybe 50 or 100 keys (or possibly pad to a randomised amount so an outsider can't even determine whether any dummies are in the set)?

This would impact required data volumes only marginally: in the worst case, if 75 dummy keys are added every hour, that would amount to roughly 50 KiB per day (or about 3% of the bandwidth example in the architecture document), if I did not miscalculate. Since this would only occur in a low-load situation anyway, it should be acceptable.

@christian-kirschnick
Member Author

Threshold is now: 140

@christian-kirschnick
Member Author

I like the padding idea - @MKusber do you think we can do this from a DPP/security perspective?

@pithumke
Member

pithumke commented May 31, 2020

Hey @MikeJayDee, after some investigation, @michael-burwig and I decided that we will go for the "shifting" approach instead of the "padding" approach for this release. For clarification:

"Shifting" approach:

  • If a package is too small (number of keys submitted in a timeframe is below a configurable threshold), then it will not be distributed (there will be zero keys in that package)
  • The keys that would have been distributed in that run are "shifted" to the next distribution run
  • Rinse and repeat until a package can be assembled that contains enough keys (above the configurable threshold)

"Padding" approach:

  • If a package is too small, then it will be padded with random / fake keys until the threshold is matched (or exceeded)

As you have rightly pointed out, the main disadvantage of the "shifting" approach is that it introduces an artificial delay in the distribution of keys. This means that people will not be warned about a potential COVID exposure as quickly as possible, but that the CWA server will delay distribution until enough submissions have been made. Especially as the number of submissions decreases, this delay will become significant.

This disadvantage, of course, doesn't exist with the "padding" approach. However, the purpose of this whole "threshold" feature is to preserve people's anonymity and to not allow anyone to create movement profiles. In order to achieve that, it must be impossible to match a specific temporary exposure key (TEK) to any particular person. In the "shifting" approach, this is achieved by ensuring that a single person's TEKs are always mashed together with enough other people's real TEKs to get lost in the shuffle. In the "padding" approach, this is achieved by mashing a single person's TEKs together with artificially/randomly generated TEKs. In order to make the "padding" approach equally secure and privacy preserving, we would need to make real TEKs and artificially/randomly generated TEKs absolutely indistinguishable from one another. We decided on the "shifting" approach for this release because we are not 100% certain that we can achieve that - for the following reasons:

There are 4 fields in a TEK:

  • Key data
  • Transmission risk level
  • Rolling start interval number
  • Rolling period

"Key data" is random by nature and can easily be generated - no problem there. "Rolling period" is always 144 - no problem here either. However, in the real world, "transmission risk levels" and "rolling start interval numbers" each follow some statistical distribution. In order to make real and fake TEKs indistinguishable to any reasonably certainty, we would need to make sure, that the fake TEKs follow the same distribution as the real ones. The problem is, that we are not sure what the exact properties of that distribution are (and how they change over time). We can thus not confidently guarantee that one wouldn't be able to distinguish random and real TEKs through statistical analysis.

Since data privacy and protection are the foundation of this whole project, we decided to not take this risk and go with the safer but less accurate solution.

Now, it might be entirely possible that there is a solution to this problem that we just don't see. But since time is of the essence right now (and we can not wait for a scientifically substantiated analysis of the "padding" approach), we decided to go for the "shifting" approach for the initial release.

Anyways. Thank you very much for sharing your idea, @MikeJayDee. We forwarded your idea and the open questions outlined above to our scientific partners and will keep you posted on any updates 👍

@MikeJayDee

Thanks a lot for this explanation, which I fully understand. It is probably worth keeping this in mind, though, and monitoring how often packages are actually skipped. I would expect lots of skipping during the night, but hopefully not that often during the day, and then not for more than 1 or 2 hours.

If you notice that delays are getting long (6 hours or more during the day), one could investigate the padding again - maybe a reasonable distribution for transmission risk levels and rolling start interval numbers can be calculated after the system has been running for a few weeks. I'm not even sure that a lot of information can be gleaned from a file with padded data if the padded data is not completely out of line with the real data.

@mh-

mh- commented Jun 19, 2020

@pithumke I think an acceptable solution for "Padding" would be to expand every "short" TEK list that is uploaded (with less than 14 days) to a "full" 14-day TEK list with dummy key data(*).
This will take care of the "Rolling start interval number".

The "Transmission risk level" is assigned by the apps that your team creates, e.g. here, based on a fixed profile.
So I think you can safely assume that you exactly know the statistical properties and simply imitate them for the padding data.

(*) You may also force every "Rolling start interval number" to the standard 00:00 UTC, in case iOS reports a TEK that started "during the day" for the first day of use (which I don't know). Android does report all TEKs as starting at 00:00 UTC, even if the first enabling didn't happen at exactly that time.
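A rough sketch of what this could look like on the server side (class and method names are made up; the 144 intervals per day and 16-byte key data follow from the Exposure Notification format):

```java
import java.security.SecureRandom;
import java.util.ArrayList;
import java.util.List;

/** Sketch of the "expand short uploads to 14 days" idea; names are made up, not actual cwa-server code. */
public class UploadPadder {

  private static final int INTERVALS_PER_DAY = 144; // ten-minute intervals, one full day
  private static final SecureRandom RANDOM = new SecureRandom();

  /**
   * Fills in dummy key data for every day of the 14-day window that is not covered by the
   * upload, so that every submission looks like a full 14-day TEK list. All dummy keys get
   * rolling start interval numbers on 00:00 UTC boundaries, just like the real ones.
   */
  public List<byte[]> dummiesForMissingDays(List<Integer> coveredStartIntervals, int newestStartInterval) {
    List<byte[]> dummies = new ArrayList<>();
    for (int day = 0; day < 14; day++) {
      int startInterval = newestStartInterval - day * INTERVALS_PER_DAY;
      if (!coveredStartIntervals.contains(startInterval)) {
        byte[] dummyKeyData = new byte[16];
        RANDOM.nextBytes(dummyKeyData);
        dummies.add(dummyKeyData);
      }
    }
    return dummies;
  }
}
```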

@kbobrowski

kbobrowski commented Jun 19, 2020

@pithumke, as @MikeJayDee mentioned, it may be good to re-evaluate whether it would be possible to switch to the "padding" solution after some real TEKs have already been generated (perhaps it's already possible, as I've seen many Diagnosis Keys on the CDN before the app launched; I'm not sure whether they were real, coming from real-world testers, or synthetic, though). But as @mh-'s suggestion shows, I would not expect it to be a big problem to come up mathematically with a distribution of Diagnosis Keys indistinguishable from real ones.

(@mh-, if I understood the specification correctly, it follows directly from it that the "Rolling start interval number" always corresponds to 00:00 UTC.)

To make the right decision between the shifting and padding approaches, we need to consider that:

  • the padding approach will surely result in more lives saved and fewer people getting the virus, since contacts will be notified faster
  • the shifting approach may result in better privacy protection (avoiding the creation of movement profiles)

That's a philosophical / moral problem, so I can only share my personal view here: I'd favor the "shifting" approach only if there were certainty that the privacy violations associated with the "padding" approach would be severe enough to undermine trust in the system and, in turn, result in more lives lost.

Currently I don't see enough evidence that it would be such a big problem for privacy: it would require a sophisticated attacker, able to distinguish synthetic keys from real ones and determined enough to gather enough data to construct movement profiles. It seems quite unlikely, especially considering that there are much more severe privacy issues inherent to this system, like identifying whether a specific person was in contact with an infected person just by registering the BLE beacons received by that person (which in turn would reveal when and where the contact with the infected person took place). I think that this threat is unlikely to undermine trust in the system, and the threat of exploiting the "padding" approach seems to be weaker by at least an order of magnitude.

@mohe2015

Can somebody explain how you can create a movement profile from the gathered data, please?

@kbobrowski

kbobrowski commented Jun 19, 2020

@mohe2015 You could, e.g., distribute BLE scanners across the city and listen to all BLE beacons transmitted by CWA. If there are very few infections and the server includes, e.g., only one real set of 14 Diagnosis Keys in a package (the rest being fake padding), then the attacker might be able to single out the relevant Diagnosis Keys by analyzing the statistics of the keys included in that package, and would then be able to reveal the whereabouts of the infected person over the last 14 days (and of course identify that person based on this information). But the attacker would need to operate a quite sophisticated network of BLE scanners; there are probably orders-of-magnitude cheaper methods (just bribe the doctor, or hire someone to follow a person around). It's quite an abstract privacy concern for me, but maybe someone could come up with a cheaper / more practical attack method.

@mohe2015

mohe2015 commented Jun 19, 2020

@kbobrowski As far as I understand the system, I think this would work independently of the number of keys if you distribute that many Bluetooth scanners. Or am I wrong?

@kbobrowski

kbobrowski commented Jun 19, 2020

The key here, I think, is being able to connect the 14 different Diagnosis Keys that belong to the same person. The fewer Diagnosis Keys in the package, the easier it is (it's a trivial case if there are only 14 Diagnosis Keys in the package). This attack will be fuzzed by Diagnosis Keys from, e.g., 10 people in the same package (140 keys, the current limit), assuming these 10 people roughly share time and space. But if they are spatially distributed, then it becomes easy again.

We can think of these BLE scanners in different ways: it could be a rogue app collecting BLE data on user devices, or a community effort by users to share received RPIs in order to facilitate identifying an infected person's whereabouts. I'm not really sure how packaging only 140 keys together helps much; since a TEK changes only once per day, it would still be quite trivial to divide this package into 14 keys each for 10 different people based on, e.g., the place where they live (in the morning and in the evening the same TEK will be producing RPIs in the same place, and this pattern repeats for 14 days). I feel it's all quite abstract and academic, but I have not given it much thought; perhaps there are other methods.

Maybe I'm missing something, since this attack does not require statistical analysis to single out Diagnosis Keys ;)

@kbobrowski

There is some hint about this in the solution architecture:

It could be possible to identify temporary exposure keys that belong together, i.e. belong to the same user, because they are posted together which results in them being in a sequential order in the database. To prevent this, the aggregated files are shuffled, e.g. by using an ORDER BY RANDOM on the database while fetching the data for the corresponding file.

It seems to be trying to solve the same issue that the "shifting" and "padding" approaches are solving, only when there are many more keys in the database. If an attacker is able to connect Diagnosis Keys, it would potentially be possible to track the whereabouts of an infected person during the last 14 days; but even without connecting them, it's still possible to track whereabouts during a single day, since TEKs rotate daily. So these efforts of shuffling TEKs in the database or applying the "shifting" approach can merely reduce the tracking time from 14 days down to 1 day. I'm not sure whether it would make a big difference if someone's location were revealed for the last 14 days or only for the last day; it would probably deal an equal blow to the reputation of the system (that said, I think this scenario is highly unlikely).
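For illustration, the shuffling described in the quoted excerpt amounts to something like the following sketch (table and column names are hypothetical; RANDOM() is PostgreSQL syntax):

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

/** Sketch only; table and column names are hypothetical, not the actual cwa-server schema. */
public class ShuffledKeyFetcher {

  /**
   * Fetches key data in random order so that keys which were uploaded together (and therefore
   * inserted sequentially) do not end up next to each other in the exported file.
   */
  public List<byte[]> fetchShuffled(Connection connection) throws SQLException {
    List<byte[]> keys = new ArrayList<>();
    String sql = "SELECT key_data FROM diagnosis_key ORDER BY RANDOM()"; // RANDOM() as in PostgreSQL
    try (Statement statement = connection.createStatement();
         ResultSet resultSet = statement.executeQuery(sql)) {
      while (resultSet.next()) {
        keys.add(resultSet.getBytes("key_data"));
      }
    }
    return keys;
  }
}
```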

@mohe2015

Thanks a lot for the explanation and for looking into the documentation. I agree that this seems to be an unlikely issue if your assumptions are correct.

@mh-

mh- commented Jun 22, 2020

@christian-kirschnick @pithumke There is one corner case where it would actually be better if you padded each key with a random number of keys (not a fixed random-key-padding-multiplier: 10 as it has been implemented now):

  • User B uploads Diagnosis Keys and hopes that the keys cannot be correlated across multiple days
  • User A had a long meeting with only User B on day 1, and can therefore identify user B when the app warns user A about a possible infection on day 1.
  • On day 2, user B went to a place close to user A, where user B was not allowed to be, so he/she hid from user A and hoped that user A would not find out.
  • User A is also warned about a possible infection on day 2.

User A can now check Diagnosis Keys for day 2. If there are exactly 10 keys with Transmission Risk Level 8 on day 2 (and exactly 10 keys with Transmission Risk Level 5 on day 1), this proves that user B was close to user A on day 2.

This seems to be a rare corner case. But changing random-key-padding-multiplier from a fixed value of 10 to a random value in the range of e.g. 1 to 20 for each day doesn't seem to hurt either.
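A sketch of that suggestion, with the per-day multiplier drawn from a range instead of being fixed (class names and the 16-byte key size are illustrative):

```java
import java.security.SecureRandom;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

/** Sketch of a per-day random padding multiplier instead of a fixed factor of 10; names are made up. */
public class RandomizedDayPadding {

  private static final SecureRandom RANDOM = new SecureRandom();

  /**
   * With a fixed multiplier of 10, observing "exactly 10 keys with risk level X on a given day"
   * reveals that a single person reported for that day. Drawing the multiplier from a range
   * (here 1..20) once per day hides the number of real reporters behind the padding.
   */
  public List<byte[]> padDay(List<byte[]> realKeysOfDay) {
    int multiplier = ThreadLocalRandom.current().nextInt(1, 21);
    List<byte[]> padded = new ArrayList<>(realKeysOfDay);
    for (int k = 0; k < realKeysOfDay.size(); k++) {
      for (int i = 1; i < multiplier; i++) {
        byte[] fakeKeyData = new byte[16];
        RANDOM.nextBytes(fakeKeyData);
        padded.add(fakeKeyData);
      }
    }
    return padded;
  }
}
```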

@pithumke
Member

pithumke commented Jun 22, 2020

Hey everyone,

thank you for the great input and please excuse my belated reply. I will try to cover the topics and questions one by one, but before that, I would like to mention a few things for context:
Our initial thinking regarding the "padding" approach was to either include it in the distribution run or to put it into a DB script which we could just run every now and then. Because of this, we had to rule out a lot of options regarding random key generation (because of the statistical analysis issues explained above). It did not cross our minds at the time (around a month ago) to simply add the "padding" to the submission logic - which is where we have implemented it now. By simply "duplicating" submitted diagnosis keys N times and only randomizing the key data of N-1 of them, we can avoid the "statistical analysis" issues of randomly generated data entirely. So, thanks for the active discussions and tremendous help in shaping the "random key padding" implementation that we have today in PR #609.
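In code terms, the padding-on-submission approach described above amounts to roughly the following sketch (names are illustrative; see PR #609 for the actual implementation):

```java
import java.security.SecureRandom;
import java.util.ArrayList;
import java.util.List;

/** Sketch of padding-on-submission; names are illustrative, see PR #609 for the real implementation. */
public class SubmissionPadding {

  private static final SecureRandom RANDOM = new SecureRandom();

  /** The four TEK fields; only the key data is randomized for the padded copies. */
  public record SubmittedKey(byte[] keyData, int transmissionRiskLevel,
                             int rollingStartIntervalNumber, int rollingPeriod) {}

  /**
   * For every submitted key, keeps the original and adds (multiplier - 1) copies whose key data
   * is randomized while all other fields are taken from the real key. Since nothing but the
   * (inherently random) key data is generated, no statistical distribution has to be guessed.
   */
  public List<SubmittedKey> pad(List<SubmittedKey> submitted, int multiplier) {
    List<SubmittedKey> result = new ArrayList<>();
    for (SubmittedKey real : submitted) {
      result.add(real);
      for (int i = 1; i < multiplier; i++) {
        byte[] randomKeyData = new byte[16];
        RANDOM.nextBytes(randomKeyData);
        result.add(new SubmittedKey(randomKeyData, real.transmissionRiskLevel(),
            real.rollingStartIntervalNumber(), real.rollingPeriod()));
      }
    }
    return result;
  }
}
```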

@mh-

I think an acceptable solution for "Padding" would be to expand every "short" TEK list that is uploaded (with less than 14 days) to a "full" 14-day TEK list with dummy key data(*).
This will take care of the "Rolling start interval number".

That's true, nice idea! This would also work with "padding" implementations in both submission and distribution. If you see that the existing padding implementation would still benefit from this, I humbly encourage you to open up an enhancement request or even create a PR.

@mh-

The "Transmission risk level" is assigned by the apps that your team creates, e.g. here, based on a fixed profile.
So I think you can safely assume that you exactly know the statistical properties and simply imitate them for the padding data.

Yes, we know the algorithm by which the transmission risk levels are calculated, but that algorithm can only output values with a correct statistical distribution if it is fed input values that follow a correct statistical distribution. And we know the statistical distributions of neither inputs nor outputs with absolute certainty yet.

@mh-

(*) You may also force every "Rolling start interval number" to the standard 00:00 UTC, in case a TEK that started "during the day" were reported for the first use day in iOS, which I don't know. Android does report all the TEKs starting at 00:00 UTC, even if the first enabling didn't happen at exactly that time.

As I just recently found out, that's already happening in both mobile APIs. The TEKs that are generated on the mobile devices are valid from 00:00 to 00:00, regardless of when the app was installed / the API was first activated. If this had not been the case, though, simply "fixing" the rolling interval on submission would have caused incorrect exposure notifications.

@kbobrowski

That's a philosophical / moral problem, so I can only share my personal view here: I'd favor the "shifting" approach only if there were certainty that the privacy violations associated with the "padding" approach would be severe enough to undermine trust in the system and, in turn, result in more lives lost.

I agree with your assessment that this boils down to a philosophical question. In order to make a choice here, we analyzed the potential privacy implications in detail and decided together with our project partners and stakeholders to go for a mixed approach: to keep the 140-key threshold, but to also add configurable padding (on submission).

@mh-

There is one corner case where it would actually be better if you padded each key with a random number of keys (not a fixed random-key-padding-multiplier: 10 as it has been implemented now):
[...]
User A can now check Diagnosis Keys for day 2. If there are exactly 10 keys with Transmission Risk Level 8 on day 2 (and exactly 10 keys with Transmission Risk Level 5 on day 1), this proves that user B was close to user A on day 2.

I'm afraid I don't understand how the scenario you described could be prevented by randomizing the number of padded keys. As far as I can see, the insight "User A was close to User B" only stems from the received beacons, but not from analyzing the random keys. On which day did which user upload how many keys, and what information can the other user extract from that? Could you please clarify?

Again, I would like to thank all participants of this conversation very much for your valuable questions, ideas and comments. However, this discussion is beginning to cover more and more topics in parallel (some of which might have even become "obsolete" since #609) and it's all happening inside a PR that was already closed 20 days ago (which reduces participation and exposure). Because of this, I want to ask you to please open up separate issues with your respective questions/concerns/proposals/ideas so we can keep discussions focused and visible. Thanks! - Pit

@mh-

mh- commented Jun 22, 2020

Hi @pithumke - thanks a lot for your detailed reply.
I would like to clarify potential misunderstandings:

The "Transmission risk level" is assigned by the apps that your team creates, e.g. here, based on a fixed profile.
So I think you can safely assume that you exactly know the statistical properties and simply imitate them for the padding data.

Yes, we know the algorithm by which the transmission risk levels are calculated, but that algorithm can only output values with a correct statistical distribution if it is fed input values that follow a correct statistical distribution. And we know the statistical distributions of neither inputs nor outputs with absolute certainty yet.

Maybe I don't fully understand what you mean by inputs and outputs in this case.
My understanding (from fixing corona-warn-app/cwa-app-android#678) is that the inputs are the age of the TEK in days when submitted, and the outputs are a fixed function (RKI profile) of said age: 5, 6, 8, 8, 8, 5, 3, 1, 1, 1, 1, ...
I agree that it's only easy to imitate this at the time when keys are submitted - and then the simplest implementation is what you do right now. That said, you could just as well run a script later that does the same (duplicating each key with random key bytes).
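For illustration, the fixed profile described above could be expressed like this (values taken from the list in this comment; the behaviour for ages beyond the listed values is an assumption):

```java
/** Sketch of the fixed age-to-risk mapping ("RKI profile") described above; not actual app code. */
public class TransmissionRiskProfile {

  // Index = age of the TEK in days at submission time; values as listed in this comment.
  private static final int[] RISK_BY_AGE = {5, 6, 8, 8, 8, 5, 3, 1, 1, 1, 1};

  /** Returns the transmission risk level for a key that is ageInDays old when submitted. */
  public int riskLevelFor(int ageInDays) {
    if (ageInDays < 0 || ageInDays >= RISK_BY_AGE.length) {
      return 1; // assumption for illustration: ages beyond the listed values stay at the lowest level
    }
    return RISK_BY_AGE[ageInDays];
  }
}
```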

However, as I explained here, I don't think this added "key noise" will actually prevent an attacker - who recorded RPIs in multiple locations - from tracking the movement of spatially distributed Diagnosis Key reporters, because they cannot match those additional random keys to any RPIs, and they would correlate the Diagnosis Keys based on the locations where they recorded matching RPIs anyway, not by analyzing statistical properties of the Diagnosis Keys.

(*) You may also force every "Rolling start interval number" to the standard 00:00 UTC, in case iOS reports a TEK that started "during the day" for the first day of use (which I don't know). Android does report all TEKs as starting at 00:00 UTC, even if the first enabling didn't happen at exactly that time.

As I just recently found out, that's already happening in both mobile APIs. The TEKs that are generated on the mobile devices are valid from 00:00 to 00:00, regardless of when the app was installed / the API was first activated. If this had not been the case, though, simply "fixing" the rolling interval on submission would have caused incorrect exposure notifications.

I don't think it could have caused incorrect exposure notifications, because fixing the time to 00:00 means expanding the validity time frame backwards, and before the time when a new TEK is generated from SecureRandom, nobody could have sent RPIs derived from that new TEK.

There is one corner case where it would actually be better if you padded each key with a random number of keys (not a fixed random-key-padding-multiplier: 10 as it has been implemented now):
[...]
User A can now check Diagnosis Keys for day 2. If there are exactly 10 keys with Transmission Risk Level 8 on day 2 (and exactly 10 keys with Transmission Risk Level 5 on day 1), this proves that user B was close to user A on day 2.

I'm afraid I don't understand how the scenario you described could be prevented by randomizing the number of padded keys. As far as I can see, the insight "User A was close to User B" only stems from the received beacons, but not from analyzing the random keys. On which day did which user upload how many keys, and what information can the other user extract from that? Could you please clarify?

On day 2, user A cannot initially be certain that the reporting user was actually user B, because they didn't see each other (user B was hiding in this example). User A will only get a warning caused by an anonymous Diagnosis Key, which should be unlinkable to Diagnosis Keys on other days.

The scenario now assumes that only a single user has reported a Diagnosis Key for day 1 and day 2, with unique Transmission Risk Levels from the RKI profile (as mentioned above in this comment). Having only a few reporting users might not be the intended scenario, but that's what we are apparently seeing right now, and that's why you did the work on the padding in the first place. And again, I agree that this is a corner case.

Having a fixed multiplication factor will allow user A to deduce that there was only 1 user who reported both the matching Diagnosis Keys on day 1 and day 2.

@markuspi

I agree with @mh-, as I can see no scenario where padding with a fixed multiplication factor increases privacy, since a valid TEK can always be correlated with a captured RPI, regardless of the number of fake keys.

Moreover, by using a padding factor of 10, the number of required submissions is reduced by the same factor. Thus only 14 real keys are required to reach the threshold of 140 entries. This is a step down compared to the requirement of 50 real keys at the beginning of this issue discussion.

@markuspi

Consider User B using the app for 14 days before becoming infected. He uploads his 14 real keys, and they get padded to 140 real and fake keys. Let's say he's the only user to upload on that day. As the threshold of 140 is reached, the package gets published by the end of the day.

User A sees this data set. He is aware of the padding scheme and sees that there are exactly 10 keys for each day, i.e. 10 keys for day 1, 10 keys for day 2, and so on. He can deduce that this data set consists of 14 real keys that belong to the same user. In theory, he could track User B across multiple days.

@mh-

mh- commented Jun 24, 2020

@markuspi I think you can't get 14 keys out of the Google / Apple EN framework - at most 13 keys, because they will not release today's still-valid key - but it's a problem nevertheless. As requested by @pithumke, I created a new issue.

@mh-

mh- commented Jul 31, 2020

For reference, Google also now recommends padding with a random number of keys, which they call "jitter": google/exposure-notifications-internals#9
