-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add option to always keep temporary files #199
Conversation
see here for how this currently works I think this more general solution looks better and more useful so feel free to remove the old logic. I wonder if it would be worth extending the results schema to include a |
Yeah I wrote the current implementation because I was frustrated with losing my files if the optimisation errored out before it gets to the copy step. I agree that my implementation is delightfully general and useful but unfortunately it has the small downside of not working 😅 The temporary directories themselves are preserved but not their contents. I'll keep working on it when I get a chance but I need to focus on the workshop at the moment. |
Is this closer to 0.1.3 or 0.2.0? This might be very useful when working on the step that interacts with ForceBalance if it can bypass the need to generate data each run. |
Yeah this will be super useful for debugging but it's a ways off. It ran into the usual problem - unit tests weren't capturing the bugs it introduced. So it's probably a post-0.2.0 feature. The goal was not to do any caching, just to keep the files around - BespokeFit already has a caching feature for the QC step, it's probably simpler to extend that to optimisation. |
Thinking about this some more I think it might be best to reverse how this works, what about doing the work in a local folder whose name is the celery task-id and then removing the files depending on the exposed settings? We would then add the celery task-id to the relevant stage schema so that when inspecting a running task users can query the task id of the current stage and then check the folder and see the progress. In cases like #237 this would help users check if the task is actually running or stuck. |
I think that's the approach this PR takes? I (tried to) refactor the |
for more information, see https://pre-commit.ci
3100534
to
8d8a773
Compare
OK I think I mostly got this working!! It seems to keep all the optimizer files at least. It does not keep the QC files, but that's apparently just how QCEngine works - I tried keeping the QCE scratch space but there's nothing very interesting in it. I don't think other steps produce any files so I think it works. I'd like to name folders after their IDs like you suggesting @jthorton, but I don't know what the best way to do that would be. Do you know if there's some place you could stick a |
Thats great this will help a lot. I think that we will have to change each workers call to the delay function to include a task id like the example here and also store this in the task schema under a new id field, then the task and the results folder have a common ID. We can then have worker function use the new temporary cd. Maybe these changes would be better handled in another PR as its changing the scope of this one by updating the schemas. Happy to take that on! |
That makes sense to me! If you could take a crack at that, that'd be great - I think we should merge this, #239, and #241, release 0.2.1 next week, and then we can remove the old If you could also formally review/approve this PR that'd be great! |
I had a look into setting the schema id to match the task and it turns out we already do this! So we could add it to this PR, the task id is extracted and set here this would just then need to be passed to the temp cd here and then we can remove the lines at the bottom which copy the tree as this would all be handled by the new tmp cd function. I tested this locally and see that it works I can watch the ForceBalance optimisation output in real-time which is great! |
I shoulda just tried out the task ID thing before commenting everywhere 😅 I love how using the task ID consolidates all the logic - this'll make removing the old setting easy in the future. We just need to remember to tempcd to the task id whenever we write temporary files to disk. I'll merge this on Monday unless you find something else to fix! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this @Yoshanuikabundi, looks fantastic!
Description
Trouble shooting bespoke fit is very frustrating, in part because temporary files are aggressively cleaned up. This PR attempts to add a setting,
BEFLOW_KEEP_TMP_FILES
, that ensures that temporary files are kept (though perhaps not in an obvious location)Todos
Notable points that this PR has either accomplished or will accomplish.
temporary_cd()
implementation using settingtemporary_cd()
Questions
Status