Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: optimize tables/schema operations #57

Merged
merged 12 commits into from
Jun 14, 2023
Merged

Conversation

pnadolny13
Copy link
Contributor

@pnadolny13 pnadolny13 commented Jun 13, 2023

Closes #29

  • disable altering for now until we optimize that better. This was also done in target-postgres. I opened feat: support altering efficiently #58 to track adding support back in once it efficient.
  • caches table columns and schemas within a sink object. If a schema or key properties change for a stream then a new sink object is created so this is safe to cache and does not need to be invalidated.
  • I like the prepare_schema method was used with IF NOT EXISTS as a quick way to avoid these weird column casing and reserved word errors. Its slow to let it constantly retry creating so I removed it by fixing the schema_exists logic to not accidentally try to recreate it again. This required schema names to be conformed before passing to the prepare_schema method.
  • UPDATE: there was an issue with the way schema messages were processed that caused github streams to constantly re-initialize new sinks and drain old ones. For the Squared CI test it was doing this 150+ times for a 1 day set of data. I put in a fix here by overriding get_sink to remove _sdc_ columns before comparing because the existing sink schema has already been post processed at that point and the new incoming schema has not. This should be pushed down to the SDK but for now this worked to get my CI tests down to the original transferwise timing. cc @kgpayne

@pnadolny13
Copy link
Contributor Author

This improved the Squared CI tests enough to finish https://github.com/meltano/squared/actions/runs/5260307866 but the github streams took 27 and 33 mins whereas they took about 7 mins with the transferwise variant. Theres a lot of schema message like I mentioned in the description so I wonder if the logic that diffs the schema against existing ones is accidentally thinking that every schema message is a new schema so its reinitializing a new sink object and draining the previous one, causing lost of wasted time. I'll need to test this theory out.

@pnadolny13 pnadolny13 requested a review from kgpayne June 14, 2023 11:57
@kgpayne
Copy link
Collaborator

kgpayne commented Jun 14, 2023

@pnadolny13 looks great 👍 Would be good to create issues for these caches in the SDK.

Base automatically changed from handle_colons to main June 14, 2023 14:46
@pnadolny13 pnadolny13 merged commit 5693b6f into main Jun 14, 2023
@pnadolny13 pnadolny13 deleted the dont_allow_altering branch June 14, 2023 14:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Optimize column altering checks
2 participants