fix: Write a valid final state message at the end of each stream sync #1164
Ah, there are probably some tests that tally message types that will have to be updated.
@laurentS - I looked through the codebase for other references context, and yes, I think your implementation looks correct. As @edgarrmondragon notes, this might break other tests which are counting the number of state messages emitted.
I did evaluate whether this should be run at the very end of
I did also evaluate if this should be calling a private member, but I think this is appropriate given that we define
Anyway - thanks for submitting this. I'll let @edgarrmondragon take it from here in regards to official review/approval, etc.
Codecov Report
@@            Coverage Diff             @@
##             main    #1164      +/-   ##
==========================================
+ Coverage   83.52%   83.57%   +0.05%
==========================================
  Files          42       42
  Lines        3872     3873       +1
  Branches      657      657
==========================================
+ Hits         3234     3237       +3
+ Misses        474      473       -1
+ Partials      164      163       -1
I might be missing something, but I have one concern. If we update the state too frequently while going through a stream in descending order, there is a risk that we send a state update before we have actually reached our expected final STATE.
A "random" example: we are updating issue comments on GitHub, which we can only navigate in descending order. Say our initial state is 2022-10-01, and today is 2022-11-09. We want to go through the whole stream until we reach data from 2022-10-01. Only then do we want to update the state to the new value of 2022-11-09, i.e. today. If we send a state update too early and the stream fails before actually finishing, we would have no way of knowing, and would then assume that we got all the data we needed up to 2022-11-09. No?
@ericboucher - your example is exactly the reason why the SDK decoupled sending the state message from finalizing the state message. If we've written the internal logic correctly, it should be (near) impossible to send an invalid or too-soon / too-frequent state message. Any bookmarks that aren't yet resumable should have the progress-marker tag which distinguishes between resumable and non-resumable bookmarks. If you see flaws, do call them out. But if functioning correctly, this "should" be safe. 🙂
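To make the resumable vs. non-resumable distinction concrete, here is a minimal sketch using the dates from the example above. It is not the SDK's actual code: the helper name `resume_point`, the `updated_at` key, and the exact dictionary layout are assumptions for illustration, modeled only on the `progress_markers` and `replication_key_value` keys mentioned in this thread.

```python
from typing import Optional


def resume_point(stream_state: dict) -> Optional[str]:
    """Return the bookmark a later run may safely resume from.

    Hypothetical helper: values nested under "progress_markers" are treated
    as in-flight progress and are ignored when choosing a resume point.
    """
    return stream_state.get("replication_key_value")


# State as it might be emitted mid-sync while walking a stream in
# descending order (assumed layout, illustrative values).
in_flight = {
    "replication_key": "updated_at",
    "replication_key_value": "2022-10-01",
    "progress_markers": {
        "replication_key": "updated_at",
        "replication_key_value": "2022-11-09",
    },
}

# If the sync fails at this point, the next run still starts from the
# old, finalized bookmark rather than from the in-flight marker.
print(resume_point(in_flight))
```

Under these assumptions, an early STATE message is harmless as long as the newest value stays under `progress_markers` until the stream has finished.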
I spent a few hours tracking state messages today 😵‍💫 and, at least for the use case I was interested in (
I can't speak for other use cases, but for this one, I think we're good :)
Lgtm!
I will try to add a useful test case around this
@laurentS do you feel confident writing a test for this change? It's probably OK if you don't, given the comments above.
but it would also be helpful to come up with some more general tests for taps using the sdk, to validate messages they produce.
@edgarrmondragon I looked around the sdk testing code a bit, but I feel that the most relevant test would be something like what I outlined in @kgpayne's discussion. For this PR, we would need to run
Otherwise, I feel comfortable with merging as is. The worst-case scenario is that taps might send a duplicate final STATE message with the same content, which should not hurt.
I agree 👍. Thanks @laurentS!
@laurentS can you merge
I won't get to my computer until Tuesday at best. Feel free to take over from here if you want/can. Otherwise I'll finish it next week.
Thanks for syncing the branch @ericboucher!
Refer to this slack thread.
This PR adds a call to issue a final (valid) state message at the end of each stream sync.
In some cases, messages sent by the tap were invalid. For instance, if state_partitioning_keys is overridden in a stream, the stream would never issue a valid state message, making incremental syncs impossible.
Here is an example of the last message issued by tap-github when running on the issue_comments stream (which has state_partitioning_keys overridden).
progress_markers should not be present in the output, and replication_key_value should have been promoted one level up.
There are probably some other use cases where the final state message is never sent.
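The promotion described above can be sketched as follows. This is an illustration under stated assumptions, not the SDK's implementation and not the tap's real output: the helper names `finalize_bookmark` and `finalize_stream_state`, the `partitions`/`context` layout, and all sample values are hypothetical.

```python
import copy


def finalize_bookmark(bookmark: dict) -> dict:
    """Return a copy with progress markers promoted one level up.

    Hypothetical helper: "progress_markers" is removed, and any replication
    key fields it held overwrite the top-level ones.
    """
    finalized = copy.deepcopy(bookmark)
    markers = finalized.pop("progress_markers", {})
    for key in ("replication_key", "replication_key_value"):
        if key in markers:
            finalized[key] = markers[key]
    return finalized


def finalize_stream_state(stream_state: dict) -> dict:
    """Finalize a stream's own bookmark and each per-partition bookmark."""
    finalized = finalize_bookmark(stream_state)
    if "partitions" in finalized:
        finalized["partitions"] = [
            finalize_bookmark(partition) for partition in finalized["partitions"]
        ]
    return finalized


# Illustrative, made-up state for a partitioned stream.
unfinalized = {
    "partitions": [
        {
            "context": {"org": "example-org", "repo": "example-repo"},
            "progress_markers": {
                "replication_key": "updated_at",
                "replication_key_value": "2022-11-09",
            },
        }
    ]
}

print(finalize_stream_state(unfinalized))
```

The helpers copy rather than mutate their input, so the unfinalized state stays available for comparison in a test.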
I will try to add a useful test case around this, but it would also be helpful to come up with some more general tests for taps using the sdk, to validate messages they produce.
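As a starting point for such a general check, here is a rough sketch that scans a tap's line-delimited JSON output and reports streams whose last STATE message still carries progress markers. It is only an assumption about how such a test could look: the function name `streams_with_unfinalized_state` is hypothetical, and the `value` / `bookmarks` / `partitions` layout is assumed rather than taken from the SDK's test suite.

```python
import json
from typing import Iterable, List


def streams_with_unfinalized_state(lines: Iterable[str]) -> List[str]:
    """Return stream names whose final bookmark still has progress markers."""
    last_state = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        message = json.loads(line)
        # Only the last STATE message matters for the next incremental run.
        if message.get("type") == "STATE":
            last_state = message.get("value", {})
    if last_state is None:
        return []

    offenders = []
    for name, bookmark in last_state.get("bookmarks", {}).items():
        # Check the stream-level bookmark and every partition bookmark.
        candidates = [bookmark, *bookmark.get("partitions", [])]
        if any("progress_markers" in candidate for candidate in candidates):
            offenders.append(name)
    return offenders


# Made-up tap output: one RECORD followed by a STATE that was never finalized.
sample_output = [
    json.dumps({"type": "RECORD", "stream": "issue_comments", "record": {"id": 1}}),
    json.dumps(
        {
            "type": "STATE",
            "value": {
                "bookmarks": {
                    "issue_comments": {
                        "partitions": [
                            {"progress_markers": {"replication_key_value": "2022-11-09"}}
                        ]
                    }
                }
            },
        }
    ),
]

print(streams_with_unfinalized_state(sample_output))
```

A test could run a tap, capture stdout, and assert that this function returns an empty list.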
📚 Documentation preview 📚: https://meltano-sdk--1164.org.readthedocs.build/en/1164/