[Cloud Deployment IVa] EFS creation and mounting #1018
Conversation
@@ -15,7 +15,7 @@ def read_requirements(file):
 extras_require = defaultdict(list)
-extras_require["full"] = ["dandi>=0.58.1", "hdf5plugin"]
+extras_require["full"] = ["dandi>=0.58.1", "hdf5plugin", "boto3"]
Looks like this got missed in the pyproject.toml refactor
import os
import time

import boto3

from neuroconv.tools.aws import submit_aws_batch_job

_RETRY_STATES = ["RUNNABLE", "PENDING", "STARTING", "RUNNING"]
This is one of those bug fixes I mentioned sneaking into this PR - after running the tests so many times, I did see some edge cases show up in the waiting times.
I think a final follow-up in the series (once everything works together) could handle this more elegantly with some common utils for such things.
job = None
max_retries = 10
retry = 0
while retry < max_retries:
    job_description_response = batch_client.describe_jobs(jobs=[job_id])
    assert job_description_response["ResponseMetadata"]["HTTPStatusCode"] == 200

    jobs = job_description_response["jobs"]
    assert len(jobs) == 1
This is a repeated example of the pattern of waiting until a desired outcome is reached; a sketch of a shared helper for this is below.
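As a sketch of what such a shared wait utility might look like (the helper name, parameters, and defaults here are hypothetical, not part of this PR):

import time

def wait_for_condition(fetch, is_done, max_retries=10, interval_in_seconds=60):
    """Poll `fetch()` until `is_done(result)` is True or the retries run out (hypothetical helper)."""
    for _ in range(max_retries):
        result = fetch()  # e.g. a wrapper around batch_client.describe_jobs(jobs=[job_id])
        if is_done(result):  # e.g. check the returned job status against the desired state
            return result
        time.sleep(interval_in_seconds)
    raise TimeoutError("The desired outcome was not reached within the allotted retries.")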
efs_client.delete_mount_target(MountTargetId=mount_target["MountTargetId"])

time.sleep(60)
efs_client.delete_file_system(FileSystemId=efs_id)
In the future I plan to be more elegant with pytest cleanup, possibly via pytest_sessionfinish in a conftest.py.
It might also clean up any job definitions that start with 'test_neuroconv'.
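A rough sketch of that cleanup idea (hypothetical, not part of this PR; assumes AWS credentials are available in the test environment and ignores pagination):

# conftest.py
import boto3

def pytest_sessionfinish(session, exitstatus):
    """Deregister any leftover AWS Batch job definitions created by the test suite."""
    batch_client = boto3.client("batch")
    response = batch_client.describe_job_definitions(status="ACTIVE")
    for job_definition in response["jobDefinitions"]:
        if job_definition["jobDefinitionName"].startswith("test_neuroconv"):
            batch_client.deregister_job_definition(jobDefinition=job_definition["jobDefinitionArn"])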
if efs_id is not None:
    volumes = [
        {
            "name": "neuroconv_batch_efs_mounted",
The name identifier here is entirely local (it does not need to match the 'name' tagged on the actual EFS volume) - see the sketch below.
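For context, a sketch of how that local name is wired together in the job definition (the container path and surrounding call are illustrative assumptions, not necessarily the PR's exact values):

volumes = [
    {
        "name": "neuroconv_batch_efs_mounted",  # purely local identifier within the job definition
        "efsVolumeConfiguration": {"fileSystemId": efs_id},  # the actual EFS file system being mounted
    }
]
mount_points = [
    {
        "sourceVolume": "neuroconv_batch_efs_mounted",  # must match the local 'name' above
        "containerPath": "/mnt/efs",  # assumed mount path inside the container
        "readOnly": False,
    }
]
# Both would then be passed via containerProperties when calling batch_client.register_job_definition(...)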
if efs_id is not None:
    job_definition_name += f"_{efs_id}"
Turned out to be quite important to include the file system ID in the definition name, to avoid reusing a previous definition created without the EFS mount configured.
I have toyed with the idea of making the job definition name a hash of all the 'unique' configuration aspects, but it is also important for it to be readable - a follow-up towards the end might apply a bunch of human-readable tags to this effect.
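For illustration, a minimal sketch of that hashed-but-still-readable naming idea (the helper and hashing choice are hypothetical, not part of this PR):

import hashlib
import json

def build_job_definition_name(base_name: str, configuration: dict) -> str:
    """Append a short hash of the 'unique' configuration aspects to a readable base name."""
    serialized_configuration = json.dumps(configuration, sort_keys=True)
    configuration_hash = hashlib.sha1(serialized_configuration.encode("utf-8")).hexdigest()[:8]
    return f"{base_name}_{configuration_hash}"

# e.g. build_job_definition_name("neuroconv_batch", {"efs_id": efs_id, "docker_image": docker_image})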
"minvCpus": 0, # Note: if not zero, will always keep an instance running in active state on standby | ||
"maxvCpus": 8, # Note: not currently exposing control over this since these are mostly I/O intensive |
@h-mayorquin This is a very important detail to be aware of
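For reference, a rough sketch of where these settings sit in the boto3 call (the environment name, networking values, and instance settings are placeholders, not the PR's actual configuration):

batch_client.create_compute_environment(
    computeEnvironmentName="neuroconv_batch_environment",  # placeholder name
    type="MANAGED",
    state="ENABLED",
    computeResources={
        "type": "EC2",  # assumed provisioning model
        "minvCpus": 0,  # zero means no instance idles on standby between jobs
        "maxvCpus": 8,  # kept modest since the jobs are mostly I/O bound
        "instanceTypes": ["optimal"],
        "subnets": ["subnet-..."],  # placeholder
        "securityGroupIds": ["sg-..."],  # placeholder
        "instanceRole": "ecsInstanceRole",  # placeholder
    },
)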
reason="MISCONFIGURATION:COMPUTE_ENVIRONMENT_MAX_RESOURCE", | ||
state="RUNNABLE", | ||
maxTimeSeconds=minimum_time_to_kill_in_seconds, | ||
action="CANCEL", | ||
), | ||
dict( | ||
reason="MISCONFIGURATION:JOB_RESOURCE_REQUIREMENT", |
Another bug fix I'm sneaking in here - previous testing suites ran under an existing compute environment that did not have this, but now it is actually set up properly (and tested in practice, though I have no clue how to make a proper automated test for it) and will end any zombie jobs.
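For reference, a sketch of how such entries are passed when creating the job queue (the queue name, priority, compute environment, and timeout value are placeholders):

minimum_time_to_kill_in_seconds = 3600  # placeholder value
batch_client.create_job_queue(
    jobQueueName="neuroconv_batch_queue",  # placeholder
    priority=1,
    computeEnvironmentOrder=[{"order": 1, "computeEnvironment": "neuroconv_batch_environment"}],
    jobStateTimeLimitActions=[
        dict(
            reason="MISCONFIGURATION:COMPUTE_ENVIRONMENT_MAX_RESOURCE",
            state="RUNNABLE",
            maxTimeSeconds=minimum_time_to_kill_in_seconds,
            action="CANCEL",
        ),
        dict(
            reason="MISCONFIGURATION:JOB_RESOURCE_REQUIREMENT",
            state="RUNNABLE",
            maxTimeSeconds=minimum_time_to_kill_in_seconds,
            action="CANCEL",
        ),
    ],
)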
@h-mayorquin This is ready for review now.
Evidence of passing tests in CI: https://github.com/catalystneuro/neuroconv/actions/runs/10726561768/job/29746896174
LGTM after discussion.
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1018      +/-   ##
==========================================
+ Coverage   90.32%   90.36%   +0.04%
==========================================
  Files         129      129
  Lines        7996     7999       +3
==========================================
+ Hits         7222     7228       +6
+ Misses        774      771       -3

Flags with carried forward coverage won't be shown.
Breaks up various parts of #393, starting with EFS handling.