Interoperable Private Attribution (IPA) #9
Section 4.1 states that the "aim is for IPA to be compatible with the Privacy Preserving Measurement (PPM) specification". Does that mean you intend to express IPA as a VDAF? |
Yes, our intention is to work towards that. There are a few major components, some of which likely require expression as a new VDAF, but we hope to leverage the existing work with prio3 and/or poplar1 where possible. |
Consider me a strong +1 in support of this getting on the agenda. |
In case it helps, I spent a bunch of time working through the math of IPA in some detail at: https://educatedguesswork.org/posts/vaccine-tracking/. I was interested in another application, but if you found the ElGamal blinding and shuffling a bit hard, this might help. |
@eriktaubeneck It seems like permissions are required to view the draft on google docs. Is a publicly accessible version of the draft available anywhere else? |
@ansuz document is back up, though it is now read-only. |
While the proposal mentions:
It would be nice for the proposal to directly compare itself with such prior work (and also the Privacy CG's Private Click Measurement). As it is, it is not immediately clear as to what the motivations are behind this proposal rather than furthering work on developing the other proposals. |
Is there a reason all the questions on the doc were removed + ability to comment revoked? |
Presumably because the internet is currently extremely angry about this? |
The document was completely defaced, fully deleted with "suggestions" and replaced with vulgarities. As such, the document is now "read-only" access. |
Ah, I see, sorry to hear that. Is the plan to move it to GitHub? Last time I read it there were a few undefined terms and the flow was not entirely clear to me, would be good to get clarifying answers. |
@benjaminsavage Does this lead you to consider docs unsuitable as a collaboration tool for this work or do you think it can be avoided going forward? I don't want to continue advocating for their use if the latter is not the case. |
|
Having looked at only the non-technical presentation, I have a couple of comments / questions.
|
@medicinalcocaine3434 the document was using suggested edits and comments; it was through suggested edits that the document was defaced. @santirely if you take a look at the technical proposal, you'll find details on your question. Briefly:
We are proposing a new read-only API, which the browser/OS would expose.
Any website/app is able to write a match key, so it's not dependent on any set of companies. However, the more cross-device coverage a given company's match key has, the more accurate attribution that uses that match key will be.
We are proposing that any site/app be able to reference any match key. Match keys are not shared, and the ability to reference them is not controlled by the companies that set them. |
Ok, that's interesting. A couple follow ups:
This is true but a significant portion of the value created by the proposal is cross-device tracking, and these companies adopting the solution would be important for that to actually work. I was aiming at something like: What if other smaller apps / publishers could pool their match keys in a way that benefits everyone? For example, a gaming studio like say Epic will have tons of mobile and CTV match keys but almost no web based ones. Opposite is true for someone like The New York Times. Could they create a sort of coop there?
This seems ideal, but isn't that a potential drawback for someone that can provide cross-device attribution by itself? What's Meta's or Google's incentive to be the world's match key providers there? |
The proposal also seems to assume that the user is logged in to at least one match key provider. What happens if the user is not? Does the browser make up a device identifier? Does it encrypt null? (Presumably collisions could abound there) or does it return an error to the calling script? |
To give a specific example, say I'm a new user. I install Firefox for the first time and see a full screen advert for Pocket and go "wow this looks great" and decide to sign up for the premium service. Since my browser is in a completely fresh state I won't have any match keys set up at all. How is Mozilla going to know if buying pocket was a good idea or not? |
In the absence of 3rd party cookies, cross-site (including cross-device) attribution won't be possible "by itself". It will require a new purpose-constrained API. Every company's incentive to participate will be to power its own attribution (with the side effect of enabling all attribution). |
This is still yet to be determined. One idea is that the device could generate a random match key, which would at least default to "same device attribution".
This would entirely depend on which match key providers the source sites and trigger sites are using. If those sites are using match key providers that the user is not logged into, then the attribution would likely be missed. 100% coverage has never been possible, but this API is designed to create as much coverage as possible, without enabling user-level tracking. |
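The "device generates a random match key" idea floated above could be sketched roughly as follows. This is only an illustration, not part of the proposal: the `MatchKeyStore` class, its method names, and the 64-bit key width are all assumptions, and the step where the key is encrypted before it leaves the device is omitted.

```python
import secrets

MATCH_KEY_BITS = 64  # assumed width, chosen only for illustration


class MatchKeyStore:
    """Hypothetical per-device store of match keys, indexed by provider."""

    def __init__(self):
        self._keys = {}
        self._device_default = None

    def set_match_key(self, provider, key):
        # Any site/app may write a match key for its own provider name.
        self._keys[provider] = key

    def resolve(self, provider):
        # Use the provider's key if one was set on this device.
        if provider in self._keys:
            return self._keys[provider]
        # Otherwise fall back to a random key generated once per device,
        # which yields "same device attribution" only.
        if self._device_default is None:
            self._device_default = secrets.randbits(MATCH_KEY_BITS)
        return self._device_default
```

With this fallback, a fresh browser with no logged-in provider still produces a stable key, so a source event and a trigger event on the same device can be joined, while events on other devices cannot.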
So what you're saying is that Meta for example would be OK sharing their "match keys" with say Twitter (since Meta's reach is significantly higher, why would Twitter use its own?) because that way the advertiser can also use the match key (and Meta wouldn't be able to match its conversions with the advertiser otherwise). And since they can't decide to share only with the advertiser, they'd be fine with sharing with everyone else. Seems pretty far-fetched to be honest. Also, isn't Meta's or Google's reach enough that they can match against advertiser's first party logged in data and get even better results? |
If Twitter wants to use Facebook's match keys, that means all the users they show adverts to also need to be logged in to Facebook. This means Twitter needs to incentivise its users to log in to Facebook which directly benefits Facebook. That's the entire reason Facebook has come up with this proposal - in order for it to work the vast majority of the internet needs to have been identified in some fashion by Facebook, so everyone that uses it will pass their users via Facebook and let them slurp up their data. You can say "well it works with any provider" but to work effectively it needs a major provider and Google are off doing their own thing so ... |
I don't see how that's true. If Twitter is the publisher, then it can ask their advertisers to register trigger events referencing only twitter's match keys. There is no need for the facebook match keys in that scenario, especially since the ads are running on twitter and so the lack of twitter user ID implies no possible match with advertiser target events. For non-FB smaller publishers, the proposal provides an important theoretical benefit, in that a publisher can choose to register source events leveraging facebook, twitter, and other third-party keys. In an IPA implementation supporting multiple match keys, this ultimately benefits the publisher as it increases potential match rates with advertiser data. It remains to be seen what the motivation could be for a match key provider to act as such, but presumably "having an advertising business" would be one motivating reason. I believe that making the match keys usable by other parties in that context is more fair than doing the opposite. |
I agree it doesn't make much sense, but in the hypothetical that they did want to only use someone else's match key (for whatever reason) then I think my point still stands
I don't think fairness comes in to many business decisions. It costs Facebook nothing to allow its competitors to use its match keys, and if everyone relies on them they gain a position of power over the discourse, even if it's just an implicit one. To be clear, just because I feel like this proposal further entrenches the big players in the ad business by relying on centralised identity services doesn't mean I think it's a bad proposal. As long as the cryptographic stuff works and the ad networks are somehow coerced into dropping their other tracking methods, this is a big step up. But on the other hand if the crypto stuff has a hidden weakness in it and Facebook run one of the "trusted" servers, this is a terrible idea. |
A few comments in response to the thread so far:
As @eriktaubeneck said - I like the idea of the device just generating a random matchkey. That way the API seamlessly defaults to "same-device-only" attribution, which is at least at par with other proposals.
The reason we proposed allowing any company to benefit from match keys set by any other participant, was specifically to try to avoid any kind of system which could be abused by large established players. As @santirely mentions, this would give them a lot of leverage to, as he says, choose to only share access with businesses who "play well with them". We opted for an "open reference" proposal specifically to avoid this type of risk.
As @eriktaubeneck points out, browsers and mobile operating systems are rapidly clamping down on "tracking". Various regulations are doing the same. This means that all businesses (even those with a large footprint) are steadily losing the ability to accurately count the number of conversions attributable to advertising. In a theoretical future world where cookies and device identifiers are all gone, and fingerprinting is impossible, having a "large footprint" will be useless from the perspective of counting conversions which occur off-network on other apps and websites. In such a world, if the only option available for counting conversions is a highly private one, like IPA, then I believe businesses who sell ads will use it (they won't have a choice). In that world, they'll have two options: Each entity will have to weigh these alternatives. For a business with a "large footprint" of users who sign in across multiple devices, here is how I think these choices will look: I posit that there exist businesses for whom the calculus is in favor of option (ii), more accurate measurement being more beneficial than everyone having less accurate measurement.
I think all parties (including "large footprint" entities) would have similar incentives pushing them in this direction. We've also put a lot of time and thought into trying to ensure there isn't coupling between entities. We think we can design the system in such a way that we do not require collaboration. That is, we want a system where any advertiser who runs ads across N platforms can independently specify which match keys they want to use, without needing those platforms to all agree with them, or to all agree on something.
First of all, I assume that Facebook / Google / any ad-tech company will never be trusted to operate a helper server =). This will be enforced by browsers. They'll have to decide which public keys they are willing to use to encrypt reports. I cannot imagine a world in which Firefox would trust Facebook enough to encrypt these events using Facebook's public key =). I'm assuming we will see non-profits with strong privacy reputations operating the servers, or possibly the types of organizations which operate Apple's "Private Relay" service. Secondly: Yes, exactly. This proposed system would be a big step up for privacy compared to the status quo mechanisms used to count conversions. I have no expectation that browsers and mobile operating systems will stop trying to clamp down on fingerprinting. Actually, if anything I expect them to accelerate those efforts. I also expect to see more and more regulation along these lines. That the math works out, and we have a strong privacy guarantee is the key. This is why we are trying to work out in the open - we think that's the best way to find all the problems / issues, and to get help finding solutions to them. We've already benefitted tremendously from outside input. @betuldurak found a really clever attack that a malicious helper node could do. I'm really grateful to her for telling us about it! We're working on finding a solution as we speak. I think the path towards standardization looks like a bunch of iterations out in the open, publishing papers, getting feedback, addressing problems, repeat. I hope that we can eventually converge on a design that is super solid. I wouldn't expect browser vendors to feel comfortable shipping an API like this unless a bunch of independent academics were all convinced that it met our design goals. |
Agreed on the approach =) What's the best way to follow along with the proposed solution(s) that you're working on to address @betuldurak's attack? Is the attack documented anywhere? |
https://educatedguesswork.org/posts/ipa-overview/#appendix%3A-linear-relation-attacks perhaps. We've initiated a few discussions with cryptographers; nothing public as yet. |
What role do regulatory requirements such as GDPR / ePrivacy in Europe play in the solution discovery & design from your perspective? That is one aspect I rarely read about in these proposals, yet I believe that this should be an integral part of the problem definition and solution design. Looking at IPA specifically for example, I believe that data protection authorities might categorize the match key as personal data (https://gdpr.eu/eu-gdpr-personal-data/) and storing it on the user's device would therefore require user consent. Would you just accept that as a given, or could solutions be more tailored towards regulatory requirements (in a sense: try to discover solutions that do not require user consent, to not end up modeling 30% of conversions that are lost due to tracking opt-outs)? |
You're totally right @csharrison. My thinking is that given each site owner can independently choose which match-key providers they work with, any provider who does something nasty like choosing a uniform value of the match key, will instantly lose trust and develop a bad reputation - leading to nobody using them again. |
@csharrison another approach would be to structure the inclusion of matchkeys in the report as a key:value of provider:matchkey, instead of just a set of matchkeys, i.e.
Then, at query time, you could tell the aggregators: "only join on provider1.com". The aggregators could respect that choice in the joining, but still account for budgeting against the full set of matchkeys. This still wouldn't fully solve the abuse scenario, however, because if someone were to simply set a uniform matchkey, that would likely still disrupt the budget accounting and contribution capping (in which case you'd still need reputational effects which @benjaminsavage proposes.) |
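To make the provider:matchkey idea concrete, here is a minimal cleartext sketch. It is not the protocol itself: in IPA the join would run inside the MPC on values the helpers cannot read, and the event layout (`id`, `match_keys`) and function names below are assumptions made up for illustration.

```python
from collections import defaultdict


def join_on_provider(source_events, trigger_events, provider):
    """Pair source and trigger events that share the chosen provider's match key."""
    by_key = defaultdict(list)
    for src in source_events:
        key = src["match_keys"].get(provider)
        if key is not None:
            by_key[key].append(src)
    pairs = []
    for trg in trigger_events:
        key = trg["match_keys"].get(provider)
        # Events without a key for this provider simply never match.
        for src in by_key.get(key, []):
            pairs.append((src["id"], trg["id"]))
    return pairs


def budget_keys(event):
    """Every (provider, match key) pair on the event counts toward budgeting,
    regardless of which provider the query chose to join on."""
    return set(event["match_keys"].items())
```

A query that says "only join on provider1.com" would call `join_on_provider(..., "provider1.com")`, while budget accounting would still look at the full set returned by `budget_keys`.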
Another question for the IPA proposal. The document mentions it should be possible for third parties to make requests on behalf of other 1Ps. I agree this is a good feature. One attack I didn't see mentioned is malicious parties crafting fake data in the hope of stealing budget from the 1P, by pretending to query on behalf of the 1P. There are many mitigations for this, but it would be good to spell them out. The most obvious one is that if the match key space is high entropy enough, this is just straight up difficult. However, I don't know if we want to design something more robust such that e.g. 1Ps need to attest to working with certain 3Ps up front. |
@csharrison agreed that this is underspecified, and this would be a great area to get more clarity on. [Administrative side note, I opened a request to get a repo specifically for IPA so we can have issues dedicated to specific topics, and even put together pull requests for docs outlining more details in these areas as they emerge.] A few thoughts specific to this question:
In the case where the 3P has actual events.
In the case where the 3P doesn't have actual events, but is just trying to disrupt some 1P's budget, the high entropy match key space would work.
In the first scenario, it should be possible for the 1P to prevent a 3P from getting actual events by not sending them or installing their "pixel" code. In the second scenario, it seems like a high entropy match key is enough (say 64-bit) where it would be far too expensive to run a query that would actually have meaningful impact. Let's suppose (very conservatively) that it only takes 1ms to generate a fake event - to cover 0.4% (1/256) of the space it would take over 2M years of compute time to generate all those events. And that's not even starting to think about actually running that query... That said, if a 1P wants to work with more than one 3P, then we do probably need a way for that 1P to assign specific portions of its budget across those different 3Ps, which may necessitate the attestation design you mention. |
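The back-of-the-envelope estimate above can be checked with a few lines of arithmetic. The inputs are the ones stated in the comment (64-bit match keys, 1/256 of the space, 1 ms per fake event); the year length of 365.25 days is an assumption.

```python
MATCH_KEY_BITS = 64
FRACTION_COVERED = 1 / 256           # about 0.4% of the match key space
SECONDS_PER_EVENT = 1e-3             # 1 ms to generate one fake event
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60

# Number of fake events needed to cover that fraction of the space: 2**56.
events_needed = 2 ** MATCH_KEY_BITS * FRACTION_COVERED

# Sequential compute time, ignoring the cost of actually running the query.
years = events_needed * SECONDS_PER_EVENT / SECONDS_PER_YEAR

print(f"{events_needed:.3e} events, about {years / 1e6:.2f} million years")
```

This gives roughly 7.2e16 events and a bit under 2.3 million years of single-threaded generation time, consistent with the "over 2M years" figure.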
Thanks, yeah. This one issue is getting very cumbersome haha.
It's worth thinking through this scenario to see if we could detect / tolerate this. As far as I understand things, it is notoriously difficult for 1Ps to make configuration changes on their sites, so if we are relying on that to deter cheaters it's not ideal.
Hm, this made me look back at the IPA doc to see how privacy budgeting is done. For a given report in the MPC system, how do we know the site it is associated with for purposes of budget? The issue I am hoping we can avoid is something like a re-randomization attack where a 3P gets real events for advertiser A but can somehow use the match key to steal budget from advertiser B while having the report look new. I think we need to make sure the budget keys are not tamperable basically. I think I agree with you about the high entropy protecting us a great deal from the "guessing" attack. If we can show that's the worst an adversary can do I might be comfortable with it. |
I agree if we're talking about a cheater using information from site A to impact something about site B. However, if site A is willing to give a "cheater" the ability to execute JS on their site, how can we prevent anything beyond that?
This is a good question. I have a few ideas here, but there are some tradeoffs. I'll open an issue for this specifically once the other repo is created. |
Great point. There might be some nuance here with iframes, but even still there is a detection problem that we should think through. This goes back to the general problem of supporting multiple separate reporting origins though which we should flesh out. It ends up being a complicated coordination problem (and possible denial-of-service vector) if everyone has to share a single budget. Needing to involve the advertiser in it makes this even tougher. One more question about IPA behavior. I want to confirm that attribution across multiple queries works correctly. Here's an example: imagine there's a single advertiser selling a single product, and is running 3 different campaigns for it. They want to use IPA to measure the relative performance of their 3 campaigns. This is my understanding of how this would work in IPA. The advertiser will send 3 queries to the system:
If IPA treats these queries completely independently, then attribution does not take into account source events from separate queries. That is, a hypothetical user journey like {Campaign 1 source event, Campaign 2 source event, Campaign 3 source event, trigger} will end up contributing a count to each of the three queries above, causing double counting. One way to make this work would be to first run "global attribution" with the union of the events in all the queries, and then evaluate each query separately from the pool of globally attributed sources/triggers. I couldn't tell if this was how the protocol was intended to work though. |
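The double-counting concern can be illustrated with a small cleartext sketch. It assumes last-touch attribution and a made-up event layout (`match_key`, `campaign`, `ts`); the real protocol would perform the equivalent logic inside the MPC.

```python
from collections import Counter


def last_touch(sources, triggers):
    """Credit each trigger to the latest earlier source with the same match key."""
    credit = Counter()
    for trg in triggers:
        candidates = [
            s for s in sources
            if s["match_key"] == trg["match_key"] and s["ts"] < trg["ts"]
        ]
        if candidates:
            winner = max(candidates, key=lambda s: s["ts"])
            credit[winner["campaign"]] += 1
    return credit


# One user sees all three campaigns, then converts once.
sources = [
    {"match_key": 7, "campaign": "Campaign 1", "ts": 1},
    {"match_key": 7, "campaign": "Campaign 2", "ts": 2},
    {"match_key": 7, "campaign": "Campaign 3", "ts": 3},
]
triggers = [{"match_key": 7, "ts": 4}]

# Three independent queries: each campaign gets credit for the same trigger.
independent = Counter()
for campaign in ("Campaign 1", "Campaign 2", "Campaign 3"):
    subset = [s for s in sources if s["campaign"] == campaign]
    independent += last_touch(subset, triggers)

# One "global attribution" pass over the union of events: one credit in total.
global_credit = last_touch(sources, triggers)
```

Here `independent` credits the single conversion three times (once per campaign), while `global_credit` assigns it only to the last-touched campaign.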
Here's how I've been thinking about this: When a report collector makes an IPA query, it will cost them some amount of money. You have to pay the MPC helper nodes for the compute you use. This implies the existence of some kind of registration process whereby a site / app signs up to run IPA queries, proves ownership of the app / site, and inputs an associated payment instrument. So I am assuming all IPA queries will be authenticated server-to-server calls. Authentication parameters must be provided to run the query. As such, it should be impossible for anyone but the 1st party, or their legitimate delegate to run queries. If a delegate abuses their permissions, the 1st party should be able to revoke their permission to run IPA queries on their behalf. |
In the event an advertiser wants to evaluate the relative performance of 3 campaigns (which they might have purchased from different ad-sellers) I assume that they would NOT issue three separate queries as you’ve shown. This would wind up hitting their differential privacy budget three times for the same set of trigger events. They’d be far better off running a single query with all of the source events from all three campaigns, and all of the trigger events. This would make much better use of their budget, as well as enable “global attribution”, where we can avoid double counting. To be clear, I understand this is a significant departure from how things work today. Today Facebook ads manager shows just an FB view of things. In an IPA world, it would not be possible to show them this. It wouldn’t be an efficient use of their privacy budget. It would be much more similar to the mobile app ecosystem where advertisers utilize 3rd party “mobile measurement partners” that give them a unified view across all their ad buying channels, preferring to view reporting there and eschewing platform-specific reporting channels. |
I think I might be missing something. Is this use-case possible to achieve with IPA:
i.e. I want a break-out that says: My thought from the doc was this is accomplished via carefully sending relevant source events, but it seems like there is some other way this should be done. Here is the relevant piece from the doc:
Now that is specific to a source query, but I assumed you'd do the same for trigger queries like the one I described. |
Our wording in the doc may not have been super clear - there are two different cases to consider here. The first case is the one you mention, you would want to issue a single query, with all the events. It would be something like the following SQL query:
select
  source_event.campaign_id
  , count(trigger_event.event_id)
  , sum(trigger_event.value)
from
  source_events as source_event
  join trigger_events as trigger_event
    on <matchkeys and attribution logic>
group by
  source_event.campaign_id
The second case is where there are multiple distinct products involved, such as:
In this case, since these queries can be constructed entirely independently, the advertiser running the query should be able to bifurcate them appropriately and run the same query as above, without having an effect on the results. In that case, having less data should be more efficient, and also not exhaust unnecessary privacy budget. It would also prevent the need for more complicated attribution logic in the MPC (since you'd only want attribution within that appropriate mapping.) |
Thanks @eriktaubeneck , I think I missed the piece where we can annotate source events by their relevant campaign ID. I wasn't sure if that was supported. |
I guess I will follow-up: how much extra information can we pack into the events? One of the benefits of creating queries as a "bag of relevant events" is that we can use arbitrarily complex information to structure the queries. Once the splitting has to happen within the protocol though, it becomes harder, especially with MPC. Could you imagine us supporting many dimensions of features beyond campaign IDs in IPA? |
I haven't thought about assigning multiple |
If we are allowed to set I will need to think a bit (and understand more about IPA) if there's any benefit to multiple |
I agree with @eriktaubeneck. Let me just elaborate a bit: Since the source_event is generated in-context, you'll know all the relevant queries in which you might like to use it. Perhaps a country breakdown, an age-and-gender breakdown, a placement breakdown, etc. I'm assuming that the "group_id" can be added server-side at will, and the events can be utilized in multiple queries. So an advertiser who wishes to get multiple breakdowns for their conversions would have to decide how much of their privacy budget to spend per-breakdown, then could issue multiple queries using the same events. As for could it contribute to 171 buckets? I think the answer depends entirely on the privacy budget. |
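Deciding "how much of their privacy budget to spend per-breakdown" could be as simple as the sketch below. It assumes basic sequential composition, where the epsilons of the individual queries add up to the total spent; the function name, the weights, and the total epsilon are hypothetical, and the actual budgeting rules for IPA are not settled in this thread.

```python
def split_budget(total_epsilon, weights):
    """Split a total epsilon across breakdown queries in proportion to weights."""
    total_weight = sum(weights.values())
    return {
        name: total_epsilon * weight / total_weight
        for name, weight in weights.items()
    }


# Example: spend half the budget on the country breakdown and a quarter each
# on the other two, reusing the same events in all three queries.
allocation = split_budget(
    total_epsilon=1.0,
    weights={"country": 2, "age_and_gender": 1, "placement": 1},
)
```

A smaller share means more noise relative to the signal in that breakdown, which is why the number of buckets an event can usefully contribute to depends on the budget.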
I agree that this seems like a good goal worth shooting for. At the moment, I'd be happy to start with something simple (like last touch or even credit over N touches), but with the flexibility to get more complicated. |
To be clear, this example is using last-touch attribution. It's just that we want to sum up not just campaign counts but also other features, so we can know things like "how many attributed events had feature X", "how many attributed events had feature Y", etc. @benjaminsavage yes, privacy budget :) Actually this batching queries thing gives you a super power, budget wise, which is exactly what Criteo takes advantage of in their competition. See also this thread |
A quote from that thread:
I agree. I would really like for IPA to support queries where the source_events span multiple source sites. I think this is a key use-case for ad-networks that show ads across the open web. We discuss this possible extension in our IPA proposal in the "business privacy grain" section. It's really hard though, and we haven't yet worked through all the issues with this. In particular, it requires careful design to ensure a malicious helper node cannot violate the "Vegas Rule". Reading through that thread, the use-case is really about training ML, not reporting. Rather than trying to get hundreds of independent breakdowns out of the API, it would probably be more efficient (from a DP perspective) to just train an ML model in MPC, and emit a trained model (with DP noise added). We allude to this as a possible future extension: link. This would have the added benefit of being able to model the interaction effects between these features. |
I don't think business privacy grain is necessary here. The thread there is about combining reports across publishers e.g. for a given advertiser. My understanding is that IPA supports this by default (and uses, in that example, the advertiser privacy unit).
Yes, I used this mostly as an example, to understand the limitations of IPA. Obviously if we can train models directly in IPA it will probably be more efficient, but supporting the Criteo competition setting is a decent litmus test on how powerful the reporting use-case is. As far as I understand, supporting a setting like this could allow us to do logistic regression in a pretty privacy-efficient way. Oh let me cc @alois-bissuel since I am bringing up some Criteo stuff :) |
Having read the IPA proposal, I have the following question concerning the addition of differentially private noise to aggregated trigger values. How does one compute (estimate) the sensitivity of the trigger value in a setting like IPA, where the range of the trigger values is never revealed to the aggregators? Do you in this case apply local differential privacy, meaning the trigger sites that generate these trigger events add properly sampled local differential privacy noise to each trigger value before encrypting it and submitting it to the MPC network? |
Hi @juanli16 - and thanks for the question! Our current thinking is that the API caller would provide some kind of "zero knowledge proof" along with each trigger event, proving that the trigger value lies within a given range. The actual range would also need to be an API param, as the MPC would need to add noise proportional to that value. This param value would need to align with the zero knowledge proofs provided. I would not imagine adding any local differential privacy. |
The solution @benjaminsavage references is the one presented in PRIO. It's important to clarify though, this requires a global bound on trigger values (for example we could pick the range [0,2^16) and use 16 bits.) Depending on the cryptographic details of the MPC, we may also be able to limit that just by limiting the secret shared values (and avoid the extra work of a zero-knowledge proof.) To address the main point of your question, @juanli16, the range is in fact revealed to the aggregators (it is a global constant), but individual values are not revealed to the aggregators. And because the range is a known quantity, the distribution from which to draw the DP noise is also known. |
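As a rough illustration of why a known global range is enough to pick the noise distribution, here is a cleartext sketch. Several things are assumptions rather than statements about IPA: the Laplace mechanism is used as a stand-in for whatever noise the MPC would actually add, each user is assumed to contribute at most one trigger value, and the function names are made up.

```python
import random

VALUE_BITS = 16
MAX_TRIGGER_VALUE = 2 ** VALUE_BITS - 1  # values lie in [0, 2^16)


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is Laplace-distributed with location 0 and that scale.
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def noisy_sum(values, epsilon, rng=None):
    """Sum bounded trigger values and add noise calibrated to the global range."""
    if rng is None:
        rng = random.Random()
    if any(not 0 <= v <= MAX_TRIGGER_VALUE for v in values):
        # Stand-in for the range proof / bounded secret shares discussed above.
        raise ValueError("trigger value outside the global range")
    # Assumption: one trigger value per user, so one user changes the sum
    # by at most MAX_TRIGGER_VALUE.
    sensitivity = MAX_TRIGGER_VALUE
    return sum(values) + laplace_noise(sensitivity / epsilon, rng)
```

The noise scale depends only on the public range and epsilon, never on the individual values, which matches the point that the range is revealed while the values are not.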
Thanks @benjaminsavage and @eriktaubeneck for the clarifications! I was starting to arrive at the same conclusion after inspecting the current state of this repo: raw-ipa. I do have 2 follow-up questions if you would indulge me.
|
A few thoughts:
|
Is this still where we should open issues on IPA? |
Hi @alextcone - IPA related issues can be filed here: https://github.com/patcg-individual-drafts/ipa/issues |
@benjaminsavage, @martinthomson, and I have been working on a proposal, "Interoperable Private Attribution (IPA)" that addresses the aggregate attribution measurement use case, similar to those listed in #8.
We'd love to have this considered and discussed at the January PATCG meeting, for consideration in maturing it further through collaboration among this community group.