-
Notifications
You must be signed in to change notification settings - Fork 130
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BUG] src_eff not used for ordering in satellites #253
Comments
Hi! Thanks for your question and report. Essentially, it's there to provide the option to include bi-temporality in Satellites; but remember Satellites are still raw vault objects, so we do not apply any business logic to them until we get to the business vault or presentation layer. Hope this helps. This is not a planned feature as it goes against standards and doesn't fit the philosophy of Satellites, so I will close this. Thank you |
Hi @DVAlexHiggs , thanks for taking time to reply on this. So basically, current sat macro can't support multiple attribute changes sent from source within same batch (load_dt), in a proper order manner? This implementation will cause bunch of duplicates in these cases. |
It's not the job of the satellite to apply business rules for which records are effective or not - it's a raw vault object - i.e. we load everything we can get. The satellite will load what it is given if there are hashdiff changes - these are not true duplicates. The This is an integral part of the Data Vault standards and how it works, so I suggest exploring this topic further to deepen your understanding. I hope this helps! If I haven't fully understood what you need, please feel free to share more details—I'm happy to assist further |
Thanks again @DVAlexHiggs , but I am not sure if we fully understand each other here :) No additional logic is added or expected, just pure ordering thing. Once sat needs to know what is the latest_record from current data, it will take last state of attributes for a BK, ordered by LOAD_DT. But what if BK had multiple changes in the same LOAD_DT? I don't think it would harm data vault 2.0 standards, if "ordered by LOAD_DT desc" is expanded as "ordered by LOAD_DT, src_eff desc" - so we can get really last record from current set. When no additional ordering parameter is used, than it's a random selection of a record from multiple ones with same LOAD_DT. And it's not really precise, and can harm hashdiff comparison... So only thing I'm talking about is potentially adding src_eff (which can represent apply_dt or sequence_date from src) to order and get latest record from current set, and first record in incoming source data. Then instead of this (which has test2 attribute/hashdiff as duplicate): ...test2 can be recognized as trully latest record from 20240102 load_dt, and as first record in 20240103, so it won't be loaded twice: |
Ok thank you for further clarification; I need to verify a few things and double check that I understand your point. I'll get back to you with a more in-depth response as soon as I can! Thank you for your patience |
EDIT(see below for the original post) @mrjovana Sorry - I realise now that the issue is because the satellite has already been loaded with two records for the same PK with the same LDTS - this is what you were saying about the records with the same LDTS. The problem with this is that this is not an expected situation for Satellites; if you're loading multiple records in one batch with different To summarise, it looks like the issue is stemming from the LDTS being a date and not a datetime which means the
ExamplesI ran some sample scenarios through our test suite and found that it works as expected when we have expected LDTS values, please see below. Example 1Using your first example, this works: Satellite (Pre-load)
Stage (To be loaded)
Outcome (Satellite)
Example 2This is another example where we try to load a duplicate record (same HASHDIFF, PK combination). If we don't have the .001 nanoseconds in the Satellite (Pre-load)
Stage (To be loaded)
Outcome (Satellite)
ClosingAm I now correct in my understanding? Sorry for the back and forth and confusion here! I am determined to ensure we are aligned and arrive at a solution though! Thank you. ORIGINAL REPLYHello again, @mrjovana, I’ve reviewed the back-end code and reproduced your initial example in our test suite. Here's your original scenario: Satellite (Pre-load)
Stage (To be loaded)
Loaded Satellite (Current Implementation)
Loaded Satellite (Expected Implementation)
After internal discussions, I can confirm this is indeed a bug. I’d like to provide some clarification regarding the confusion around the My initial misunderstanding stemmed from interpreting your request as a need to use the Additionally, I misunderstood your scenario as having duplicate records with identical LDTS values, suggesting that Regarding the bug itself, I believe it is simpler than initially thought: if a record being loaded has a HASHDIFF identical to the latest existing HASHDIFF in the Satellite, it should not be loaded—unless it is part of a known "flip-flop" sequence. This is just a non-brainer really and are how Satellites are supposed to work. In your case, since the HASHDIFF of the record (test2/BBB) matches the latest record in the Satellite, it should not be loaded again. However, the test3/CCC record should indeed be loaded. Does this all make sense? Have I understood correctly? Thank you for your patience, and your detailed example was really helpful in identifying and understanding this issue. |
Describe the bug
Hello. Not sure if this is a bug or should be a feature, but aren't src_eff supposed to serve as additional ordering parameter? Like for getting latest_records and first_record_in_set. This way, without it, records that should not supposed to be there are loaded.
Example data sets:
sat_example:
id attr src_eff load_dt
1 test1 01.01.2024 12:00 20240102
1 test2 01.01.2024 15:00 20240102
staged incoming data:
id attr src_eff load_dt
1 test2 02.01.2024 11:00 20240103
1 test3 02.01.2024 16:00 20240103
Result:
sat_example:
id attr src_eff load_dt
1 test1 01.01.2024 12:00 20240102
1 test2 01.01.2024 15:00 20240102
1 test2 02.01.2024 11:00 20240103
1 test3 02.01.2024 16:00 20240103
Environment
dbt version: 1.8.7
automate_dv version: 0.11.0
Database/Platform: Big Query
To Reproduce
Steps to reproduce the behavior:
Expected behavior
test2 attribute shouldn't be loaded cause it's a last state of 20240102 load_dt and also first record in 20240103 (if it's ordered by load_dt and src_eff both)
I would say this should be expected behaviour.
Any feedback very appreciated :)
The text was updated successfully, but these errors were encountered: