
State gets moved forward without data being written when batch load fails #101

Open
loveeklund-osttra opened this issue Oct 3, 2024 · 3 comments

Comments

@loveeklund-osttra
Contributor

I think it is the behaviour described here:
https://github.com/z3z1ma/target-bigquery/blob/9d1d0b08606a716a5a36f53b3388cbd6055535a8/target_bigquery/target.py#L544C9-L549C79

I suspect what happened is that one of my workers failed on a bad row while the others were able to write out their data, resulting in state being moved forward without any data from the bad sink being written.
What is the upside vs. downside referenced in that comment? Is it that data gets read from the source but not written to the target?
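
To make sure I understand the trade-off, here is how I would illustrate it (my own sketch, not the actual target.py code):

```python
# My own illustration of the fail_fast trade-off, not the actual
# target.py code.
def drain_worker_errors(errors: list[Exception], fail_fast: bool) -> None:
    for exc in errors:
        if fail_fast:
            # Upside: state can never advance past unwritten data.
            # Downside: a single bad row aborts the whole run.
            raise exc
        # Upside: the healthy sinks keep flowing.
        # Downside: the error is merely logged, and state can move
        # forward without the failed sink's data ever being written.
        print(f"worker error swallowed: {exc!r}")
```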

@loveeklund-osttra
Contributor Author

I've looked into this some more, and I don't think it is fail_fast that is causing the issue. From what I've been able to see, the issue arises when this load_job fails for some reason: https://github.com/z3z1ma/target-bigquery/blob/main/target_bigquery/batch_job.py#L63
I've also discovered that you can get a "silent" error, where a load_job fails without any error being raised; I've seen this happen when the load_job fails on the last load. This unfortunately also moves the state forward. I'd really appreciate some help looking into this, as I don't fully understand all the parts of the target, in particular the workers and where errors are and aren't caught. I'll see if I can figure it out and come up with a fix, but if I can't, we'll have to stop using this target :(
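
For anyone following along, here is a minimal sketch of the "silent" failure mode using the google-cloud-bigquery client directly (table name and payload are placeholders; this is not the target's actual code path):

```python
# Placeholder table/payload; illustrates the failure mode, not the
# target's actual code.
from io import BytesIO

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
)

# The upload returns a LoadJob handle; the load itself runs
# asynchronously on the BigQuery side.
job = client.load_table_from_file(
    BytesIO(b'{"id": "not-an-int"}\n'),
    "my-project.my_dataset.my_table",
    job_config=job_config,
)

# If the worker moves on (or requeues) without ever calling
# job.result(), a failed load raises nothing in the target process,
# and the pipeline emits state as if the batch had been written.
job.result()  # blocks until done; raises on failure
```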

@loveeklund-osttra
Contributor Author

Accidentally closed the issue...

I think I somewhat understand what happens: it's something with the parallelization not waiting properly when the job gets requeued in BatchJobWorker.run. I'll try to get some more details soon.
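
To make the suspected race concrete, here is a rough reconstruction with made-up names (submit_load_job is a hypothetical stand-in; this is not the actual BatchJobWorker code):

```python
# Rough reconstruction of the suspected race; submit_load_job and the
# queue layout are made up, not the actual BatchJobWorker code.
import queue

def submit_load_job(payload: dict) -> None:
    """Hypothetical stand-in for starting and awaiting a load job."""
    raise RuntimeError("load job failed")

def worker_run(jobs: "queue.Queue[dict]") -> None:
    payload = jobs.get()
    try:
        submit_load_job(payload)
    except Exception:
        # Requeue-and-return: run() finishes normally even though
        # nothing was written. A coordinator that only waits for the
        # workers to exit can then emit state before the retry ever runs.
        jobs.put(payload)

jobs: "queue.Queue[dict]" = queue.Queue()
jobs.put({"records": ["..."]})
worker_run(jobs)  # returns cleanly, yet the batch was never loaded
```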

@loveeklund-osttra
Contributor Author

loveeklund-osttra commented Oct 9, 2024

If you want to replicate the error, check out this commit: https://github.com/loveeklund-osttra/target-bigquery/tree/308859d93da38135a30433edb523c970f4bdb371
Install the tap-testsource and run `meltano run tap-testsource target-bigquery`; you should see that your job doesn't fail and your state gets moved forward, even though the BigQuery load job fails.

I've tried the other loading methods as well, and all of them hit the same error except gcs_stage, which actually does fail, because it triggers the load of data into BigQuery in cleanup rather than in the worker's run.

I added some logging statements to get some clarity into why it fails, and I think the problem is that the requeueing logic in BatchJobWorker.run causes the pipeline not to wait properly for the job to finish.

I'm going to see if I can fix it by removing the retrying logic from the workers' run methods.
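
Roughly what I have in mind, as a sketch (MAX_ATTEMPTS and start_load_job are hypothetical names, not a patch against the actual worker code):

```python
# Sketch of the fix idea: retry inline and block on job.result(), so
# run() only returns once the load truly succeeded or the error has
# propagated. MAX_ATTEMPTS and start_load_job are hypothetical.
MAX_ATTEMPTS = 3

def run_load_with_retries(start_load_job) -> None:
    last_error = None
    for _ in range(MAX_ATTEMPTS):
        job = start_load_job()  # e.g. client.load_table_from_file(...)
        try:
            job.result()  # wait for BigQuery to report success/failure
            return
        except Exception as exc:
            last_error = exc
    # Raising (instead of silently requeueing) stops the target from
    # emitting state for data that never landed.
    raise RuntimeError("load job failed after retries") from last_error
```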
