Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add option to always keep temporary files #199

Merged
merged 17 commits into from
Mar 30, 2023
Merged

Conversation

Yoshanuikabundi
Copy link
Contributor

@Yoshanuikabundi Yoshanuikabundi commented Oct 11, 2022

Description

Trouble shooting bespoke fit is very frustrating, in part because temporary files are aggressively cleaned up. This PR attempts to add a setting, BEFLOW_KEEP_TMP_FILES, that ensures that temporary files are kept (though perhaps not in an obvious location)

Todos

Notable points that this PR has either accomplished or will accomplish.

  • Add setting
  • Add new temporary_cd() implementation using setting
  • Replace uses of temporary_cd()
  • See if it works
  • Investigate other changes needed to keep files
  • Try to keep temporary file creation inside the requested working directory
  • Rebase after Support OpenFF Toolkit v0.11+ #198 is merged

Questions

  • Question1

Status

  • Ready to go

@codecov
Copy link

codecov bot commented Oct 11, 2022

Codecov Report

Merging #199 (9642a26) into main (2af0a1a) will increase coverage by 0.13%.
The diff coverage is 97.43%.

Additional details and impacted files

@jthorton
Copy link
Contributor

see here for how this currently works I think this more general solution looks better and more useful so feel free to remove the old logic.

I wonder if it would be worth extending the results schema to include a file_path to the location of files so users can work out which directory corresponds to which optimisation?

@Yoshanuikabundi
Copy link
Contributor Author

Yeah I wrote the current implementation because I was frustrated with losing my files if the optimisation errored out before it gets to the copy step. I agree that my implementation is delightfully general and useful but unfortunately it has the small downside of not working 😅 The temporary directories themselves are preserved but not their contents. I'll keep working on it when I get a chance but I need to focus on the workshop at the moment.

@mattwthompson
Copy link
Member

Is this closer to 0.1.3 or 0.2.0? This might be very useful when working on the step that interacts with ForceBalance if it can bypass the need to generate data each run.

@Yoshanuikabundi
Copy link
Contributor Author

Yeah this will be super useful for debugging but it's a ways off. It ran into the usual problem - unit tests weren't capturing the bugs it introduced. So it's probably a post-0.2.0 feature.

The goal was not to do any caching, just to keep the files around - BespokeFit already has a caching feature for the QC step, it's probably simpler to extend that to optimisation.

@jthorton
Copy link
Contributor

Thinking about this some more I think it might be best to reverse how this works, what about doing the work in a local folder whose name is the celery task-id and then removing the files depending on the exposed settings? We would then add the celery task-id to the relevant stage schema so that when inspecting a running task users can query the task id of the current stage and then check the folder and see the progress. In cases like #237 this would help users check if the task is actually running or stuck.

@Yoshanuikabundi
Copy link
Contributor Author

I think that's the approach this PR takes? I (tried to) refactor the temporary_cd function to create a temporary directory, and then only clean it up if the appropriate setting is on. If the setting is off, the directory is created but not cleaned up. Naming things after the task id is a great idea but I've gotta get this working first 🙃

@Yoshanuikabundi
Copy link
Contributor Author

OK I think I mostly got this working!! It seems to keep all the optimizer files at least. It does not keep the QC files, but that's apparently just how QCEngine works - I tried keeping the QCE scratch space but there's nothing very interesting in it. I don't think other steps produce any files so I think it works.

I'd like to name folders after their IDs like you suggesting @jthorton, but I don't know what the best way to do that would be. Do you know if there's some place you could stick a with temporary_cd(f"id-{id}"): or something that would cover all the workers? My very limited understanding of celery suggests that might not be possible... I'll have a crack at seeing if I can come up with an alternative I like tomorrow :)

@jthorton
Copy link
Contributor

jthorton commented Mar 22, 2023

I'd like to name folders after their IDs

Thats great this will help a lot. I think that we will have to change each workers call to the delay function to include a task id like the example here and also store this in the task schema under a new id field, then the task and the results folder have a common ID. We can then have worker function use the new temporary cd. Maybe these changes would be better handled in another PR as its changing the scope of this one by updating the schemas. Happy to take that on!

@Yoshanuikabundi Yoshanuikabundi marked this pull request as ready for review March 23, 2023 04:31
@Yoshanuikabundi
Copy link
Contributor Author

Yoshanuikabundi commented Mar 23, 2023

That makes sense to me! If you could take a crack at that, that'd be great - I think we should merge this, #239, and #241, release 0.2.1 next week, and then we can remove the old BEFLOW_OPTIMIZER_KEEP_FILES setting and name temporary folders after the task ID in the next release. Something something breaking changes something something lets just get a release that supports toolkit 0.11+ out.

If you could also formally review/approve this PR that'd be great!

@Yoshanuikabundi Yoshanuikabundi added this to the v0.2.1 milestone Mar 23, 2023
@jthorton
Copy link
Contributor

I had a look into setting the schema id to match the task and it turns out we already do this! So we could add it to this PR, the task id is extracted and set here this would just then need to be passed to the temp cd here and then we can remove the lines at the bottom which copy the tree as this would all be handled by the new tmp cd function. I tested this locally and see that it works I can watch the ForceBalance optimisation output in real-time which is great!

@Yoshanuikabundi
Copy link
Contributor Author

I shoulda just tried out the task ID thing before commenting everywhere 😅 I love how using the task ID consolidates all the logic - this'll make removing the old setting easy in the future. We just need to remember to tempcd to the task id whenever we write temporary files to disk.

I'll merge this on Monday unless you find something else to fix!

Copy link
Contributor

@jthorton jthorton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this @Yoshanuikabundi, looks fantastic!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants