Clean up unused _airbyte_tmp tables #7011

Closed
ChristopheDuong opened this issue Oct 13, 2021 · 25 comments
Labels
area/connectors Connector related issues autoteam frozen Not being actively worked on team/destinations Destinations team's backlog type/enhancement New feature or request

Comments

@ChristopheDuong
Contributor

ChristopheDuong commented Oct 13, 2021

Tell us about the problem you're trying to solve

When syncs are canceled, encounter exceptions, or something else unexpected happens, it is possible to end up with a "messy" state on destinations, where "zombie" _airbyte_tmp tables are left behind...

See this user's question on Slack:
https://airbytehq.slack.com/archives/C01MFR03D5W/p1627380195366300
with multiple _tmp tables for the same stream:
[screenshot: multiple _airbyte_tmp tables for the same stream]

or this other user worrying about it here:
https://airbytehq.slack.com/archives/C01MFR03D5W/p1632830821410100

I can't find any documentation on best practices for tidying these tables. Is there any potential downside to deactivating the sync, deleting the old temp tables, and then re-syncing? I'm at my wit's end (details in issue #5435)!

Describe the solution you’d like

A way to clear up _tmp tables that may be confusing to users.

Describe the alternative you’ve considered or used

Ask people to safely delete them if no sync is currently running.
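
For example, a minimal sketch of that manual cleanup, assuming a Postgres destination and a target schema named public (both are placeholders; adjust to your setup). It only generates the DROP statements so they can be reviewed before anything is dropped:

-- Generate DROP TABLE statements for leftover Airbyte temp tables (review before running).
-- Assumes a Postgres destination and the target schema "public"; adjust to your setup.
SELECT format('DROP TABLE %I.%I;', schemaname, tablename) AS drop_statement
FROM pg_tables
WHERE schemaname = 'public'
  AND tablename LIKE '\_airbyte\_tmp\_%'  -- backslashes keep _ from acting as a single-character wildcard
ORDER BY tablename;
-- Confirm no sync is currently running, then execute the generated statements.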

@ChristopheDuong ChristopheDuong added the type/enhancement New feature or request label Oct 13, 2021
@sherifnada sherifnada added the area/connectors Connector related issues label Nov 15, 2021
@lizdeika
Contributor

lizdeika commented Jan 6, 2022

Hi,
We have 128 such leftover tables from failed syncs in a single schema, and we have more than one schema.

@segv

segv commented Jan 6, 2022

We've had a few failed syncs from a fairly big DB; we're currently looking at 600+ tmp tables. This makes refreshing the schema in DataGrip/DBeaver painfully slow, and makes autocompletion much less useful.

@marcosmarxm
Member

@ChristopheDuong could we add a script in normalization to always delete tables whose names start with _airbyte_tmp*? Let me know what you think about it.

@marcosmarxm
Member

Of course, if normalization fails, the tmp tables will still be there. Adding this to normalization could be an easy solution, but it doesn't solve the whole problem.

@ChristopheDuong
Contributor Author

@ChristopheDuong could we add a script in normalization to always delete tables whose names start with _airbyte_tmp*? Let me know what you think about it.

It's not really normalization's job to clean up the _airbyte_tmp tables that are "internal" tables from destination connectors. But sure, you can do that.

@grishick
Contributor

grishick commented Apr 26, 2022

This cleanup seems to me like something that should be orchestrated by the platform, because it can be needed in any of these scenarios:

  • destination connector failed and normalization was not run
  • normalization started but failed
  • normalization succeeded, but for some reason failed to delete the *_tmp tables

My suggestion is that we add a destination::cleanup task to the protocol and have the platform call it after each job. We can also allow users to manually trigger it.

@michel-tricot
Contributor

Is there a world where the destination tmp naming is a bit smarter, and the next time the destination runs it looks at leftover artifacts and deletes what wasn't deleted in the previous run?

@grishick
Contributor

Is there a world where the destination tmp naming is a bit smarter, and the next time the destination runs it looks at leftover artifacts and deletes what wasn't deleted in the previous run?

yes, however, if the sync is canceled, the messy state will persist until the next job runs.

@ChristopheDuong
Contributor Author

yes, however, if the sync is canceled, the messy state will persist until the next job runs.

That's still a lot better than accumulating 600 unused tables from 6 months ago :)

@michel-tricot
Contributor

Also, it doesn't mean we shouldn't do our best to clean up on exit (cancel, etc.).

@peter279k
Contributor

peter279k commented Aug 11, 2022

I also have the same problem with unused _airbyte_tmp_* tables.

I canceled one of the scheduled jobs because it had been running for quite a while, and these tables have been there since then.

Can I safely clean them up by running SQL?

I also found this discussion and I think it's okay to delete them manually.

@NAjustin
Contributor

For us, it was over 3,500 of these tables.

I do think that cleanup needs to happen here, but it also seems silly that _airbyte_tmp_* tables are completely ignored after a failed job. For example, I just had a HubSpot connector that synced several billion rows of data, but because an object was configured that wasn't enabled on this account (feedback_submissions), a 429 error was thrown and the whole job was aborted. The temp tables were left behind, but they could have been used to resume a subsequent sync once the issue was corrected, instead of restarting a large sync from scratch.

(I understand this is slightly out of scope for this issue, but resumability is worth considering before tables are purged wholesale on cancel/failure.)

For anyone needing a temporary workaround, you can use a SQL script like this to delete them from within BigQuery (for folks who aren't comfortable in the CLI anyway):

DECLARE dry_run BOOL DEFAULT true; -- CHANGE TO false TO ACTUALLY DROP!
DECLARE pattern STRING DEFAULT r"\_airbyte\_tmp\_%"; -- raw string only needs one backslash per _ (so it isn't treated as a single-character wildcard)
DECLARE project STRING DEFAULT "project-name";
DECLARE dataset STRING DEFAULT "DatasetName";
DECLARE retention_period STRING DEFAULT "7 DAY";
DECLARE tbls ARRAY<STRING> DEFAULT [];
DECLARE i INT64 DEFAULT 0;

-- Collect the fully qualified names of matching tables older than the retention period.
EXECUTE IMMEDIATE """
  SELECT ARRAY_AGG(project_id || "." || dataset_id || "." || table_id ORDER BY table_id ASC)
  FROM `""" || project || "." || dataset || """.__TABLES__`
  WHERE type = 1
    AND table_id LIKE @pattern
    AND TIMESTAMP_MILLIS(last_modified_time) < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL """ || retention_period || """)
""" INTO tbls USING pattern AS pattern;

SELECT ARRAY_LENGTH(tbls); -- Returns the number of tables matched.

IF NOT dry_run THEN
  WHILE tbls[SAFE_OFFSET(i)] IS NOT NULL DO
    EXECUTE IMMEDIATE "DROP TABLE `" || tbls[OFFSET(i)] || "`";
    SET i = i + 1;
  END WHILE;
END IF;

Keep in mind this won't be as fast as deleting them through the bq CLI or an environment where you can call the drops asynchronously (this will only run at about 2 tables per second or so), but it gives non-programmer types a way to see the list of tables matching the criteria by last change date (look at the results of the first query while in dry_run mode) or actually drop the matching tables (by changing dry_run to false).

Either way, I think we should really consider ways to use synced data to minimize duplicate syncing where possible, and then clean up temp tables that are incomplete, unsafe, or blank.

@Zatte

Zatte commented Sep 22, 2022

If I may suggest a quick band-aid for this (it requires dev work, though):

  1. Temporary tables could be created with a default table expiration of 2 days (maybe even configurable).
  2. Temporary tables could reside in their own dataset (where the default table expiration can be configured on a case-by-case basis).

Neither solution would block proper cleanup logic in case of exceptions, but both might be quicker and easier to implement, and they would solve 90% of the pain experienced in this thread.

@BravoDeltaBD

BravoDeltaBD commented Feb 3, 2023

Any update on this issue? I just found a schema in our Snowflake warehouse with roughly 11'000 undeleted _airbyte_tmp tables. This resulted in the following DBT error during the Normalization step:

Too many schemas in schema  "FOO_DB"."BAR_SCHEMA"! dbt can only get
        information about schemas with fewer than 10000 objects.

@grishick
Contributor

grishick commented Feb 3, 2023

The new version of the Snowflake destination connector no longer creates tmp tables. If you update your Snowflake connector to the latest version, you can delete all those _airbyte_tmp tables.
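
For anyone doing that cleanup in Snowflake, a minimal sketch along the same lines as the BigQuery script above (FOO_DB and BAR_SCHEMA below are placeholders taken from the error message earlier in the thread; substitute your own). It only lists the DROP statements so they can be reviewed first:

-- Generate DROP TABLE statements for leftover Airbyte temp tables in Snowflake (review before running).
-- FOO_DB and BAR_SCHEMA are placeholders; substitute your own database and schema.
SELECT 'DROP TABLE "' || table_catalog || '"."' || table_schema || '"."' || table_name || '";' AS drop_statement
FROM FOO_DB.INFORMATION_SCHEMA.TABLES
WHERE table_schema = 'BAR_SCHEMA'
  AND table_type = 'BASE TABLE'
  AND UPPER(table_name) LIKE '\\_AIRBYTE\\_TMP\\_%' ESCAPE '\\'  -- escaped so _ is matched literally
ORDER BY table_name;
-- Make sure no sync is running against this schema, then execute the generated statements.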

@TimothyZhang7
Contributor

TimothyZhang7 commented Feb 7, 2023

We are also encountering this problem; we have over 6,000 such tables in a BigQuery destination.
Any update on this issue?

@grishick
Contributor

grishick commented Feb 7, 2023

If you have upgraded to the latest BigQuery destination connector and you are using the staging inserts loading method, you can delete the _airbyte_tmp tables.

@TimothyZhang7
Contributor

If you have upgraded to the latest BigQuery destination connector and you are using the staging inserts loading method, you can delete the _airbyte_tmp tables.

Thanks for the clarification.

@ABitShift

The same happens with the MongoDB destination; we have zombie collections named airbyte_tmp_id3_Source_Stream.

Not sure if this happens after a failed job or after an Airbyte server crash.
Would it be Airbyte's responsibility to delete these zombie collections if the server crashes (reconciliation)?

@lhvubtqn

lhvubtqn commented Mar 17, 2023

Same here! Our GitHub -> BigQuery connection creates a ton of _airbyte_tmp tables whenever it fails. Really need to fix this issue!

@grishick
Contributor

The latest version of the BigQuery destination does not create _tmp tables anymore.

@grishick
Contributor

Same here! Our GitHub -> BigQuery connection creates a ton of _airbyte_tmp tables whenever it fails. Really need to fix this issue!

What version of the BigQuery destination are you using?

@lhvubtqn

Same here! Our GitHub -> BigQuery connection creates a ton of _airbyte_tmp tables whenever it fails. Really need to fix this issue!

What version of the BigQuery destination are you using?

We are still using version 1.2.9. Will try upgrading to 1.2.17. Thank you!

@grishick
Contributor

grishick commented Mar 22, 2023

  • destination-bigquery stopped creating _airbyte_tmp tables starting version 1.2.13
  • destination-snowflake stopped creating _airbyte_tmp tables starting version 0.4.46
  • destination-redshift stopped creating _airbyte_tmp tables starting version 0.3.55

The newer versions do not delete _airbyte_tmp tables created by previous versions, so if you still see any of these tables after the upgrade, feel free to drop them.

Other destinations will stop creating _airbyte_tmp tables before they are promoted to GA.
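
For Redshift (or any other destination that exposes information_schema), a minimal sketch of that cleanup; the schema name airbyte_schema below is a placeholder, and the query only prints DROP statements for review:

-- Generate DROP TABLE statements for leftover Airbyte temp tables (review before running).
-- "airbyte_schema" is a placeholder; substitute the schema your connection writes to.
SELECT 'DROP TABLE "' || table_schema || '"."' || table_name || '";' AS drop_statement
FROM information_schema.tables
WHERE table_schema = 'airbyte_schema'
  AND table_type = 'BASE TABLE'
  AND LEFT(table_name, LENGTH('_airbyte_tmp_')) = '_airbyte_tmp_'  -- literal prefix match, no LIKE escaping needed
ORDER BY table_name;
-- Check that no sync is currently running, then execute the generated statements.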

@bleonard bleonard added the frozen Not being actively worked on label Mar 22, 2024
@cgardens
Contributor

cgardens commented Apr 3, 2024

This should be fixed.

@cgardens cgardens closed this as completed Apr 3, 2024