event_id generation based on UUID principle #809
Comments
Hi @AkhtemWays, this is an interesting topic of discussion, which I had not considered before. Can you give any more details about how often you are seeing duplicate IDs? How do you know the duplicate IDs were created by enrich, and not by the trackers that sent the events?
I can't answer this, because the design decision pre-dates when I joined Snowplow! But... I had never considered it a bad decision. To the best of my understanding, the Java implementation of We do often see duplicate event IDs downstream of enrich. But in our experience those duplicates arise from either:
When I ran a GROUP BY query on event_id and ordered the results by count, I found as many as 45 rows sharing the same event_id in the DWH.
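For anyone who wants to repeat this check, here is a minimal sketch of that kind of duplicate count. It runs against an in-memory SQLite database rather than a real warehouse; the table name `events` and the helper `find_duplicate_event_ids` are hypothetical, and only the `event_id` column comes from this thread.

```python
import sqlite3
import uuid


def find_duplicate_event_ids(event_ids):
    """Return (event_id, count) pairs for IDs seen more than once, most frequent first."""
    conn = sqlite3.connect(":memory:")
    try:
        # Hypothetical stand-in for the warehouse table: one row per loaded event.
        conn.execute("CREATE TABLE events (event_id TEXT)")
        conn.executemany(
            "INSERT INTO events (event_id) VALUES (?)",
            [(event_id,) for event_id in event_ids],
        )
        # Group by event_id and keep only the IDs that appear on more than one row.
        rows = conn.execute(
            "SELECT event_id, COUNT(*) AS n FROM events "
            "GROUP BY event_id HAVING COUNT(*) > 1 "
            "ORDER BY n DESC, event_id"
        ).fetchall()
    finally:
        conn.close()
    return rows


if __name__ == "__main__":
    a, b = str(uuid.uuid4()), str(uuid.uuid4())
    sample = [a, a, a, b, b, str(uuid.uuid4())]
    for event_id, n in find_duplicate_event_ids(sample):
        print(event_id, n)
```

A count like this only shows that an ID appears on several rows. It does not say which stage of the pipeline introduced the repeats.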
I am not surprised that you see duplicate events in the DWH. But I think you are looking in the wrong place for the problem if you think it's because of our UUID generator. If you find two events with the same event id, then interesting questions to look at next are:
If you investigate the duplicate IDs in your DWH further, I am sure you will find there are other explanations, unrelated to how we generate UUIDs.
To add to @istreeter's comments above: although event ID duplicates are possible, genuine collisions are rare unless duplicate events are being sent. We are unlikely to introduce any technology (e.g., Twitter Snowflake) to produce truly globally unique IDs, as this is very computationally expensive and we cannot rely on sources of server information (e.g., worker and shard numbers) originating from the client.
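To put a rough number on "rare", here is a back-of-the-envelope sketch using the standard birthday-bound approximation. It assumes the IDs are random (version 4) UUIDs with 122 random bits drawn from a good random source; that is an assumption for illustration, not something confirmed in this thread, and the function name `collision_probability` is hypothetical.

```python
import math

# Assumption: version 4 UUIDs. Of the 128 bits, 6 are fixed version/variant bits,
# which leaves 122 random bits.
RANDOM_BITS = 122


def collision_probability(n_events, bits=RANDOM_BITS):
    """Approximate probability of at least one collision among n_events random IDs."""
    # Birthday bound: p ~= 1 - exp(-n * (n - 1) / (2 * 2**bits)).
    # math.expm1 keeps precision when the probability is extremely small.
    exponent = -n_events * (n_events - 1) / (2.0 * 2.0 ** bits)
    return -math.expm1(exponent)


if __name__ == "__main__":
    for n in (10**6, 10**9, 10**12):
        print(f"{n:>16,d} events -> p(collision) ~ {collision_probability(n):.3e}")
```

Under these assumptions, even a billion events gives a collision probability on the order of 10^-19, so dozens of rows sharing one event_id cannot plausibly come from random collisions alone.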
Project:
snowplow/enricher/common
Version:
master (latest)
Expected behavior:
Generate a universally unique ID across the entire pipeline every time.
Actual behavior:
Generates duplicate event_id values at the enrichment stage.
Steps to reproduce:
Create an enriched event using the class com.snowplowanalytics.snowplow.enrich/common/enrichments/EnrichmentManager.scala and its method setupEnrichedEvent, then proceed with the case where an EnrichedEvent is returned.