Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Cloud Deployment IVa] EFS creation and mounting #1018

Merged
merged 37 commits into from
Sep 12, 2024
Merged

Conversation

CodyCBakerPhD
Copy link
Member

breaks up various parts of #393, starting with EFS handling

@CodyCBakerPhD CodyCBakerPhD self-assigned this Aug 19, 2024
@CodyCBakerPhD CodyCBakerPhD changed the title [Cloud Deployment III] EFS creation and mounting [Cloud Deployment IVa] EFS creation and mounting Aug 19, 2024
@@ -15,7 +15,7 @@ def read_requirements(file):


extras_require = defaultdict(list)
extras_require["full"] = ["dandi>=0.58.1", "hdf5plugin"]
extras_require["full"] = ["dandi>=0.58.1", "hdf5plugin", "boto3"]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this got missed in the pyproject.toml refactor

import os
import time

import boto3

from neuroconv.tools.aws import submit_aws_batch_job

_RETRY_STATES = ["RUNNABLE", "PENDING", "STARTING", "RUNNING"]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is one of those bug fixes I mentioned sneaking into this PR - after running the tests so many times I did see some edge cases show up in the waiting times

I think a final follow-up in the series (once everything works together) could be more elegant with some common utils for such things

Comment on lines +41 to +49
job = None
max_retries = 10
retry = 0
while retry < max_retries:
job_description_response = batch_client.describe_jobs(jobs=[job_id])
assert job_description_response["ResponseMetadata"]["HTTPStatusCode"] == 200

jobs = job_description_response["jobs"]
assert len(jobs) == 1
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a repeated example of the wait condition until a desired outcome is reached

efs_client.delete_mount_target(MountTargetId=mount_target["MountTargetId"])

time.sleep(60)
efs_client.delete_file_system(FileSystemId=efs_id)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The the future I plan to be more elegant with pytest cleanup, possibly via pytest_sessionfinish in a conftest.py

It might also cleanup any job definitions that start with 'test_neuroconv'

if efs_id is not None:
volumes = [
{
"name": "neuroconv_batch_efs_mounted",
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name identifier here is entirely local (does not need to match the 'name' tagged on the actual EFS volume)

Comment on lines +539 to +540
if efs_id is not None:
job_definition_name += f"_{efs_id}"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turned out to be quite important to include the filesystem ID in the definition name to avoid reusing a previous definition created without the EFS mount configured

I have toyed with the idea of making the job definition a hash of all the 'unique' configuration aspects, but it is also important for it to be readable - a follow-up towards the end might apply a bunch of human readable tags to this effect

Comment on lines +324 to +325
"minvCpus": 0, # Note: if not zero, will always keep an instance running in active state on standby
"maxvCpus": 8, # Note: not currently exposing control over this since these are mostly I/O intensive
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@h-mayorquin This is a very important detail to be aware of

Comment on lines +417 to +423
reason="MISCONFIGURATION:COMPUTE_ENVIRONMENT_MAX_RESOURCE",
state="RUNNABLE",
maxTimeSeconds=minimum_time_to_kill_in_seconds,
action="CANCEL",
),
dict(
reason="MISCONFIGURATION:JOB_RESOURCE_REQUIREMENT",
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another bug fix I'm sneaking in here - previous testing suites ran under an existing compute environment that did not have this, but now it is actually setup properly (and tested in practice but no clue how to make a proper test for it) and will end any zombie jobs

@CodyCBakerPhD CodyCBakerPhD marked this pull request as ready for review September 6, 2024 13:50
@CodyCBakerPhD
Copy link
Member Author

@h-mayorquin This is ready for review now

Evidence of passing tests in CI: https://github.com/catalystneuro/neuroconv/actions/runs/10726561768/job/29746896174

Copy link
Collaborator

@h-mayorquin h-mayorquin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM after discussion.

@CodyCBakerPhD CodyCBakerPhD enabled auto-merge (squash) September 12, 2024 07:12
@CodyCBakerPhD CodyCBakerPhD merged commit 78c1177 into main Sep 12, 2024
35 checks passed
@CodyCBakerPhD CodyCBakerPhD deleted the efs_mounting branch September 12, 2024 10:48
Copy link

codecov bot commented Dec 11, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 90.36%. Comparing base (81a022d) to head (b175c60).
Report is 64 commits behind head on main.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1018      +/-   ##
==========================================
+ Coverage   90.32%   90.36%   +0.04%     
==========================================
  Files         129      129              
  Lines        7996     7999       +3     
==========================================
+ Hits         7222     7228       +6     
+ Misses        774      771       -3     
Flag Coverage Δ
unittests 90.36% <ø> (+0.04%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

see 3 files with indirect coverage changes

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants