Clean up unused _airbyte_tmp tables #7011
Hi,
We've had a few failed syncs from a fairly big database; we're currently looking at 600+ tmp tables. This makes refreshing the schema in DataGrip/DBeaver painfully slow, and makes autocompletion much less useful.
@ChristopheDuong could we add a script in normalization to always delete tables whose names start with `_airbyte_tmp`?
Of course, if normalization itself fails, the tmp tables will still be left behind. Adding this to normalization could be an easy fix, but it doesn't solve the underlying problem.
It's not really normalization's job to clean up the `_airbyte_tmp` tables.
This cleanup task seems to me like something that should be orchestrated by the platform, because it can be needed in several scenarios (failed syncs, canceled syncs, unexpected crashes).
My suggestion is that we add a destination::cleanup task to the protocol and have the platform call it after each job. We could also allow users to trigger it manually.
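Concretely, such a cleanup task would boil down to the destination enumerating and dropping its own leftovers. A minimal sketch of the SQL involved, assuming a destination that exposes a standard `information_schema` (the `destination::cleanup` call itself is a proposal here, not part of the current protocol):

```sql
-- Hypothetical cleanup pass: find leftover Airbyte tmp tables.
-- Review the result before dropping anything, and only run while no sync is active.
SELECT table_schema, table_name
FROM information_schema.tables
WHERE table_name LIKE '%\_airbyte\_tmp%' ESCAPE '\';

-- Generate one DROP statement per leftover table, for the platform to execute.
SELECT 'DROP TABLE "' || table_schema || '"."' || table_name || '";'
FROM information_schema.tables
WHERE table_name LIKE '%\_airbyte\_tmp%' ESCAPE '\';
```

The point of generating the statements rather than dropping directly is that the platform (or the user) can inspect the list first and skip tables belonging to an in-flight sync.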
Is there a world where the destination tmp naming is a bit smarter, so that the next time the destination runs it looks at leftover artifacts and deletes whatever wasn't deleted in the previous run?
Yes. However, if a sync is canceled, the messy state will persist until the next job runs.
That's still a lot better than accumulating 600 unused tables from 6 months ago :)
Also, it doesn't mean we shouldn't do our best to clean up on exit (cancel, etc.).
I have the same problem with unused tmp tables. I canceled one of my scheduled jobs because it had been running for a while, and these tables have been there ever since. Can I safely clean them up by running SQL? I also found this discussion, and I think it's okay to delete them manually.
For us, it was over 3,500 of these tables. I do think cleanup needs to happen here (I understand this is slightly out of scope for this issue, but resumability is worth considering before tables are purged wholesale on cancel/failure). For anyone needing a temporary workaround, you can use a SQL script to delete them from within BigQuery (for folks who aren't comfortable in the CLI, anyway):
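The script referenced here didn't survive in this thread; a sketch of what such a script can look like, using BigQuery's scripting support over `INFORMATION_SCHEMA` (`my_project` and `my_dataset` are placeholders — double-check the match pattern against your own tables before running):

```sql
-- BigQuery scripting: drop every table in the dataset whose name contains
-- the Airbyte tmp marker. Placeholders: my_project, my_dataset.
FOR tbl IN (
  SELECT table_name
  FROM `my_project.my_dataset.INFORMATION_SCHEMA.TABLES`
  WHERE table_name LIKE '%_airbyte_tmp%'
)
DO
  EXECUTE IMMEDIATE
    FORMAT('DROP TABLE `my_project.my_dataset.%s`', tbl.table_name);
END FOR;
```

Running the `SELECT` part on its own first is a cheap way to sanity-check exactly which tables will be dropped.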
Keep in mind this won't be as fast as deleting them through the CLI. Either way, I think we should really consider ways to use already-synced data to minimize duplicate syncing where possible, and then clean up temp tables that are incomplete, unsafe, or blank.
If I may suggest a quick band-aid for this (it requires dev work, though):
Neither solution would block proper cleanup logic for the exception cases, but they might be quicker and easier to implement, and would solve 90% of the pain experienced in this thread.
Any update on this issue? I just found a schema in our Snowflake warehouse with roughly 11,000 undeleted _airbyte_tmp tables. This caused a DBT error during the normalization step.
The new version of the Snowflake destination connector no longer creates tmp tables. If you update your Snowflake connector to the latest version, you can delete all those _airbyte_tmp tables.
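For anyone on the newer Snowflake connector who wants to clear the backlog, one approach (a sketch, with `MY_DB` and `MY_SCHEMA` as placeholders; inspect the generated statements before executing any of them) is to generate the `DROP` statements from `information_schema`:

```sql
-- Snowflake: generate one DROP statement per leftover tmp table.
-- Placeholders: MY_DB, MY_SCHEMA. Review the output, then run it manually.
SELECT 'DROP TABLE MY_DB.MY_SCHEMA."' || table_name || '";' AS drop_stmt
FROM MY_DB.information_schema.tables
WHERE table_schema = 'MY_SCHEMA'
  AND table_name ILIKE '%_AIRBYTE_TMP%';
```

Only run this while no sync against that schema is in progress, so you don't drop a tmp table that a live job is still writing to.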
We are also encountering this problem; we have over 6,000 such tables in a BigQuery destination.
If you have upgraded to the latest BigQuery destination connector and you are using the staging inserts loading method, you can delete the leftover `_airbyte_tmp` tables.
Thanks for the clarification.
The same happens with the MongoDB destination; we have zombie collections with the tmp prefix. Not sure if this happens after a failed job or after an Airbyte server crash.
Same here! Our GitHub -> BigQuery connection creates a ton of `_airbyte_tmp` tables.
The latest version of the BigQuery destination does not create _tmp tables anymore.
What version of the BigQuery destination are you using?
We are still using an older version of the connector.
The newer versions do not delete the existing tmp tables for you. Other destinations will stop creating tmp tables as they are updated.
This should be fixed. |
Tell us about the problem you're trying to solve
When syncs are canceled, encounter exceptions, or fail in some other unexpected way, it is possible to end up with a "messy" state on destinations where "zombie" _airbyte_tmp tables are kept around, sometimes with multiple _tmp tables for the same stream. See this user's question on Slack:
https://airbytehq.slack.com/archives/C01MFR03D5W/p1627380195366300
or this other user worrying about it here:
https://airbytehq.slack.com/archives/C01MFR03D5W/p1632830821410100
Describe the solution you’d like
A way to clean up _tmp tables that may be confusing to users.
Describe the alternative you’ve considered or used
Ask people to safely delete them manually if no sync is currently running.