Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Write a valid final state message at the end of each stream sync #1164

Merged
merged 3 commits into from
Nov 10, 2022

Conversation

laurentS
Copy link
Contributor

@laurentS laurentS commented Nov 9, 2022

Refer to this slack thread.

This PR adds a call to issue a final (valid) state message at the end of each stream sync.

In some cases, messages sent by the tap were invalid. For instance, if state_partitioning_keys is overridden in a stream, the stream would never issue a valid state message, making incremental syncs impossible.

Here is an example of the last message issued by tap-github when running on the issue_comments stream (which has state_partitioning_keys overridden). progress_markers should not be present in the output, and replication_key_value should have been promoted one level up.

{"type": "STATE", "value": {"bookmarks": {
    "tempStream": {},
    "repositories": {"partitions": [{"context": {"org": "nextcloud", "repo": "server", "repo_id": 60243197}}]},
    "issue_comments": {
        "partitions": [{"context": {"org": "nextcloud", "repo": "server"}, 
        "replication_key_signpost": "2022-11-09T16:44:57.220041+00:00",
        "starting_replication_value": "2022-11-07",
        "progress_markers": {"Note": "Progress is not resumable if interrupted.", "replication_key": "updated_at", "replication_key_value": "2022-11-09T16:41:17Z"}
        }]
    }
}}}

There are probably some other use cases where the final state message is never sent.

I will try to add a useful test case around this, but it would also be helpful to come up with some more general tests for taps using the sdk, to validate messages they produce.


📚 Documentation preview 📚: https://meltano-sdk--1164.org.readthedocs.build/en/1164/

@edgarrmondragon edgarrmondragon changed the title Write a valid final state message at the end of each stream sync fix: Write a valid final state message at the end of each stream sync Nov 9, 2022
@edgarrmondragon
Copy link
Collaborator

Ah, there's probably some tests that tally message types that will have to be updated.

@aaronsteers
Copy link
Contributor

aaronsteers commented Nov 9, 2022

@laurentS - I looked through the codebase for other references context, and yes, I think your implementation looks correct.

As @edgarrmondragon notes, this might break other tests which are counting number of state messages emitted.

I did evaluate whether this should be run at the very end of sync_all(), outside the streams loop, but your implementation is better (inside the loop) because it provides earlier notification to the target per stream.

I did also evaluate if this should be calling a private member, but I think this is appropriate given that we define sync_all() as @final.

Anyway - thanks for submitting this. I'll let @edgarrmondragon take from here in regards to official review/approval, etc.

@codecov
Copy link

codecov bot commented Nov 9, 2022

Codecov Report

Merging #1164 (eb6e7a7) into main (0a7f086) will increase coverage by 0.05%.
The diff coverage is 100.00%.

❗ Current head eb6e7a7 differs from pull request most recent head c795e65. Consider uploading reports for the commit c795e65 to get more accurate results

@@            Coverage Diff             @@
##             main    #1164      +/-   ##
==========================================
+ Coverage   83.52%   83.57%   +0.05%     
==========================================
  Files          42       42              
  Lines        3872     3873       +1     
  Branches      657      657              
==========================================
+ Hits         3234     3237       +3     
+ Misses        474      473       -1     
+ Partials      164      163       -1     
Impacted Files Coverage Δ
singer_sdk/tap_base.py 68.66% <100.00%> (+0.14%) ⬆️
singer_sdk/target_base.py 86.44% <0.00%> (+0.93%) ⬆️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@ericboucher
Copy link
Contributor

I might be missing something but I have one concern.

If we update the state too frequently while we are going through a stream in descending order, there is a risk that we send a state update before we actually reached our expected final "STATE".

"Random" example, we are updating issue comments on GitHub, which we can only navigate in descending order. Say our initial state is 2022-10-01, and today is 2022-11-09. We want to go through the whole stream until we reach data from 2022-10-01. Only then do we want to update the state to our new one of "2022-11-09" aka today.

If we send a state update too early, and the stream fails before actually finishing, we would have no way to know, and would then assume that we got all the data we needed until 2022-11-09. No?

@aaronsteers
Copy link
Contributor

aaronsteers commented Nov 9, 2022

@ericboucher - your example is exactly the reason why the SDK decoupled sending the state message from finalizing the state message.

If we've written the internal logic correctly, it should be (near) impossible to send an invalid or too-soon / too-frequent state message. Any bookmarks that aren't yet resumable should have the progress-marker tag which distinguishes between resumable and non-resumable bookmarks.

If you see flaws, do call them out. But if functioning correctly, this "should" be safe. 🙂

@laurentS
Copy link
Contributor Author

laurentS commented Nov 9, 2022

I spent a few hours tracking state messages today 😵‍💫 and, at least for the use case I was interested in (tap-github's issue_comments stream, which behaves like @ericboucher describes above), the sdk seems to behave correctly (with the fix in this PR):

  • while it does emit STATE messages as it progresses, those messages are invalid (see the example I copied at the top of this PR: the replication_key and replication_key_value are under progress_marker, which is the wrong place) so these have no effect
  • at the very end, the call to finalize_state_progress_markers fixes the state object to a valid format (in memory).
  • what was missing was actually sending this valid state across the line to the target. This is what the PR adds.

I can't speak for other use cases, but for this one, I think we're good :)

Copy link
Contributor

@ericboucher ericboucher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lgtm!

Copy link
Collaborator

@edgarrmondragon edgarrmondragon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will try to add a useful test case around this

@laurentS do you feel confident writing a test for this change? It's probably OK if you don't, given the comments above.

but it would also be helpful to come up with some more general tests for taps using the sdk, to validate messages they produce.

I agree. @kgpayne is workshopping some ideas: #1162

@laurentS
Copy link
Contributor Author

@laurentS do you feel confident writing a test for this change? It's probably OK if you don't, given the comments above.

@edgarrmondragon I looked around the sdk testing code a bit, but I feel that the most relevant test would be something like what I outlined in @kgpayne 's discussion.

For this PR, we would need to run tap.sync_all() on an unsorted stream with overridden state partitioning keys to actually test the behaviour, and verify that the last message sent is a valid STATE message. Maybe we can add that to SampleTapCountries, if you think that's useful.

Otherwise, I feel comfortable with merging as is, the worst case scenario is that taps might send a duplicate final STATE message with the same content, which should not hurt.

@edgarrmondragon
Copy link
Collaborator

Otherwise, I feel comfortable with merging as is, the worst case scenario is that taps might send a duplicate final STATE message with the same content, which should not hurt.

I agree 👍.

Thanks @laurentS!

@edgarrmondragon edgarrmondragon enabled auto-merge (squash) November 10, 2022 19:20
@edgarrmondragon
Copy link
Collaborator

@laurentS can you merge main into your PR branch?

@laurentS
Copy link
Contributor Author

@laurentS can you merge main into your PR branch?

I won't get to my computer until Tuesday at best. Feel free to take over from here if you want/can. Otherwise I'll finish it next week.

@edgarrmondragon edgarrmondragon enabled auto-merge (squash) November 10, 2022 22:53
@edgarrmondragon edgarrmondragon merged commit ff3d65e into meltano:main Nov 10, 2022
@edgarrmondragon
Copy link
Collaborator

edgarrmondragon commented Nov 10, 2022

Thanks for syncing the branch @ericboucher!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants