
Temporary table seems to be destroyed and recreated in the middle of a run when table expiry is reached, resulting in lost data #99

TrishGillett opened this issue Sep 20, 2024 · 2 comments

TrishGillett commented Sep 20, 2024

Currently, when a temporary table is created, it is set to expire one day in the future.
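
For context, here's roughly how that looks with the google-cloud-bigquery client (a minimal sketch, not the target's actual code; the table name and schema are placeholders):

```python
import datetime

from google.cloud import bigquery

client = bigquery.Client()

# Placeholder staging table; in the target this would be the upsert/dedupe temp table.
table = bigquery.Table(
    "my-project.my_dataset.tmp_upsert_stage",
    schema=[bigquery.SchemaField("id", "STRING")],
)

# Expiration is an absolute timestamp; one day out matches the behaviour described above.
table.expires = datetime.datetime.now(datetime.timezone.utc) + datetime.timedelta(days=1)
table = client.create_table(table)
```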

I've run into a problem using this target in this scenario:

  • I'm using this target with settings that call for a temporary table (upsert + dedupe_before_upsert)
  • I need to run some large extraction tasks that will take well over 24 hours.

Observations:

  • I was querying the temp table periodically during the run as a way to monitor progress. Shortly after the extraction crossed the 24 hour mark, I noticed that the number of rows in the temporary table had dropped from a couple million to a couple thousand, and the oldest _sdc_extracted_at in the table was now just after the 24 hour mark. Based on the expiration behaviour described above, I'm assuming the table expired, was deleted, and was then automatically recreated.
  • The data extracted in the first 24 hours had gone missing, but no errors were thrown and the extraction continued on.

There are a couple things here that could be opportunities for enhancements:

  • Can we make the max lifespan of the temporary table (effectively the time limit on the task) configurable? Or better, would it be possible to extend the expiry on the fly when the table is getting close to expiring but the job is still in progress? (A rough sketch follows this list.)
  • Can we make it throw an error if a temp table that's still in use gets deleted?
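
To make the first point concrete, here's a rough sketch of what extending the expiry on the fly could look like with google-cloud-bigquery (the helper name and thresholds are hypothetical, not something the target does today):

```python
import datetime

from google.cloud import bigquery


def extend_expiration_if_needed(
    client: bigquery.Client,
    table_id: str,
    min_remaining: datetime.timedelta = datetime.timedelta(hours=1),
    extension: datetime.timedelta = datetime.timedelta(days=1),
) -> None:
    """Push the temp table's expiration out if it is close to expiring."""
    table = client.get_table(table_id)
    now = datetime.datetime.now(datetime.timezone.utc)
    if table.expires is not None and table.expires - now < min_remaining:
        table.expires = now + extension
        # Only the `expires` field is sent in the update request.
        client.update_table(table, ["expires"])
```

The target could call something like this before each batch is written, so the table never expires while a run is still in progress.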

I'm open to trying to contribute towards these changes, but would appreciate getting alignment from a maintainer on the approach first. 🙏

@TrishGillett TrishGillett changed the title Overwrite table seem to be destroyed and recreated in the middle of a run when table expiry is reached, resulting in lost data Temporary table seems to be destroyed and recreated in the middle of a run when table expiry is reached, resulting in lost data Sep 21, 2024
AlejandroUPC (Contributor) commented

Mmm, I think the best option here is to make the expiration date configurable so you can set it to a very high value (one you're sure won't be reached), and maybe also add a param that ensures the temp table is deleted after completion (once all the sinks are drained)? Would this work?
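
Sketching what I mean (both the expiration_hours knob and the drop-after-drain hook are hypothetical, not current behaviour):

```python
import datetime

from google.cloud import bigquery


def create_temp_table(client: bigquery.Client, table_id: str, schema, expiration_hours: int = 24):
    """Create the staging table with a configurable expiration."""
    table = bigquery.Table(table_id, schema=schema)
    table.expires = datetime.datetime.now(datetime.timezone.utc) + datetime.timedelta(
        hours=expiration_hours
    )
    return client.create_table(table)


def drop_temp_table(client: bigquery.Client, table_id: str) -> None:
    """Explicitly drop the staging table once all the sinks are drained."""
    client.delete_table(table_id, not_found_ok=True)
```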

TrishGillett commented Oct 8, 2024

Hey @AlejandroUPC! I think that could be part of the answer, although personally I would also love to see runs fail loudly in the case where the table disappears mid-run. That would be reassuring for me since I could set the limit to something that I think should be long enough (as opposed to something absurdly long) and trust that I'll be notified if it turns out to be too short. It would also be useful to other users since they'd be informed if they're encountering this issue and need to use the (as yet hypothetical :P) custom time limit setting.

I'm picturing something like: could we make it so the temp table is created before extraction begins, and any time we intend to write to it we do an existence check first, failing the run if it doesn't exist (sketched below)? (Apologies if my mental model is off here; I'm new to the internals of this target and making some guesses.)
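
Just to illustrate, something along these lines (the function name and where it would be called from are guesses on my part):

```python
from google.api_core.exceptions import NotFound
from google.cloud import bigquery


def assert_temp_table_exists(client: bigquery.Client, table_id: str) -> None:
    """Fail the run loudly if the staging table has vanished, e.g. because it expired mid-run."""
    try:
        client.get_table(table_id)
    except NotFound:
        raise RuntimeError(
            f"Temporary table {table_id} no longer exists; it may have hit its "
            "expiration while the run was still in progress."
        )
```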
