Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Invalid SCHEMA messages are produced for deselected streams #212

Open
laurentS opened this issue May 24, 2023 · 3 comments · Fixed by #224
Open

Invalid SCHEMA messages are produced for deselected streams #212

laurentS opened this issue May 24, 2023 · 3 comments · Fixed by #224
Labels
bug Something isn't working

Comments

@laurentS
Copy link
Contributor

In #193, a set of traffic_* streams were added to the tap, with a customised metadata property, which deselects them if no catalog was passed as input to the tap.

Unfortunately, when running the tap with poetry run tap-github --config /tmp/tmpmt8fq0pn/tmp7896kkwh.json --test=schema

with this config (which does not seem to matter much, the main thing being the test=schema cli option):

{"metrics_log_level": "error", "auth_token": "<mytoken>", "additional_auth_tokens": [], "rate_limit_buffer": 1000, "start_date": "2021-05-24 13:44:42.693145", "skip_parent_streams": true, "repositories": []}

the tap issues invalid SCHEMA messages like:

{
"type": "SCHEMA", 
"stream": "traffic_pageviews",
"schema": {"properties": {}, "type": "object"},
"key_properties": ["repo", "org", "timestamp"]
}

Specifically, properties is empty, so downstream targets cannot lookup the key_properties.

The line that causes the problem is here https://github.com/MeltanoLabs/tap-github/pull/193/files#diff-06dc9c6115cbc069ce355913de0c101fedf6956d6f6b4873c5112434596934d3R2260

I have not dug into the details yet, but it looks like the schema production does not correctly take the selection metadata into account.

Pinging @edgarrmondragon as you suggested that code, and you might have a fix for it :)

I also think the sdk should not allow a tap to produce invalid records like this. Is there a way to test against it without causing too much overhead? Obviously, we could validate each record before sending it out, but that might be a bit heavy ;)
Interestingly there's a test for this _test_replication_keys_in_schema but it does not validate against the schema messages that are sent.

@laurentS laurentS added the bug Something isn't working label May 24, 2023
@edgarrmondragon
Copy link
Member

edgarrmondragon commented May 25, 2023

@laurentS thanks for reporting this. The latest SDK was shipped with Stream.selected_by_default, which has the same use cases in mind. That involved adding unit tests in the SDK, so it may be more robust. We can give that a try.

Also, I don't think this behavior is spilling into normal sync runs since tap-github --test=schema ignores any selection.

PS: meltano/sdk#1698 might help

@laurentS
Copy link
Contributor Author

Thanks @edgarrmondragon for these details. selected_by_default seems to have appeared in sdk v0.27 but this tap cannot upgrade past v0.22 due to meltano/sdk#1704 so I guess we'll have to wait for meltano/sdk#1708 to be merged.

selected_by_default seems to be a good way to go though, and this bug is not immediately blocking for us. I'll wait on #209 to propose a fix for this.

@laurentS
Copy link
Contributor Author

laurentS commented Oct 2, 2023

I'm reopening this issue, as we are still seeing the bug with tap-github on the Traffic* streams.
The bug is happening for streams that have selected_by_default = False. For instance:

{"type": "SCHEMA", "stream": "traffic_clones", "schema": {"properties": {"repo": {"type": ["string", "null"]}, "org": {"type": ["string", "null"]}, "repo_id": {"type": ["integer", "null"]}, "timestamp": {"format": "date-time", "type": ["string", "null"]}, "count": {"type": ["integer", "null"]}, "uniques": {"type": ["integer", "null"]}}, "type": "object"}, "key_properties": ["repo", "org", "timestamp"], "bookmark_properties": ["timestamp"]}

is generated with selected_by_default = True but if that option is flipped to False then we get:

{"type": "SCHEMA", "stream": "traffic_clones", "schema": {"properties": {}, "type": "object"}, "key_properties": ["repo", "org", "timestamp"], "bookmark_properties": ["timestamp"]}

The first SCHEMA message is valid, the second one isn't. This is with singer-sdk = 0.32.0.

I think the problem is that the test case added in #1698 does not cover the use case described here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
Status: No status
Development

Successfully merging a pull request may close this issue.

2 participants