-
Notifications
You must be signed in to change notification settings - Fork 21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Fail Job Attachments action when consecutive transfer rates drop below threshold #33
Conversation
f5461ad
to
c9fcd29
Compare
c18bc00
to
6940aa8
Compare
# A transfer rate below 1 Kb/s is considered concerning or potentially stalled. | ||
TRANSFER_RATE_THRESHOLD = 1000 # bytes/s | ||
# Each progress report callback takes 5 mins, so 3 reports amount to 15 mins in total | ||
LOW_TRANSFER_COUNT_THRESHOLD = 3 | ||
low_transfer_count = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So our circuit breaker will kick in when there are 15 minutes of 1Kb/s throughput on average?
What data do we have to support this hard-coded threshold?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In order to determine reasonable threshold values, a simulation analysis was performed in a new Jupyter notebook.
# rates, and if the count exceeds the spcified threshold, cancels the download and | ||
# fails the current (sync_input_job_attachments) action. | ||
nonlocal low_transfer_count | ||
transfer_rate = job_attachment_download_status.transferRate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not in scope for this PR, but I see we used camel casing here instead of the Python convetion of snake case (transfer_rate
)
717a54a
to
695d169
Compare
4055339
to
44dc5cd
Compare
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Setting Thresholds For Anomaly Detection in Downlaod Speed\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are some typos in this document, I'd recommend running it through a word doc.
Setting Thresholds For Anomaly Detection in Downlaod Speed
-> Setting Thresholds For Anomaly Detection in Download Speed
Also, this file name has a typo, should be download_slowdown_detection.ipynb
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've made too many typos, my apologies. I will correct them in the next revision.
@@ -752,12 +752,50 @@ def sync_asset_inputs( | |||
if self._asset_sync is None: | |||
return | |||
|
|||
def progress_handler(job_upload_status: ProgressReportMetadata) -> bool: | |||
# A transfer rate below 1 Kb/s is considered concerning or potentially stalled. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit, I think you mean A transfer rate below 50 Kb/s...
state=ActionState.FAILED, | ||
fail_message=( | ||
"Input syncing failed due to successive low transfer rates (< 1 Kb/s). " | ||
"The transfer rate was below the threshold for the last three checks." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I think this should last *two* checks
and < 50 kb/s
right? Maybe we should just put the hard-coded values in here?
44dc5cd
to
141c5bf
Compare
141c5bf
to
21d813c
Compare
54341b6
to
ec7af9a
Compare
…p below threshold Signed-off-by: Gahyun Suh <[email protected]>
ec7af9a
to
238294c
Compare
What was the problem/requirement? (What/Why)
During the SYNC_INPUT_JOB_ATTACHMENTS session actions, there might be a potential for the download operation to get stuck or stall due to unforeseen issues. Currently, there is no mechanism to monitor or detect such scenarios. This poses a risk, especially when downloading many/large files or using slow/unstable network.
What was the solution? (How)
on_downloading_files
callback function, which is passed to the AssetSyncsync_inputs
function. This callback is intended to be called at regular intervals during the download, acting as a sort of heartbeat from Job Attachments library.sync_inputs
should be stalled:on_downloading_files
callbacks, the transfer rate falls below a certain threshold (e.g., 50 MB/s), it is considered concerning or potentially stalled.What is the impact of this change?
Worker agent can now detect and halt potentially stuck SYNC_INPUT_JOB_ATTACHMENTS session actions. This ensures that resources are not wasted on prolonged stalled operations, and that users are timely notified about the issues.
How was this change tested?
hatch run lint && hatch run test
Was this change documented?
Yes.
Is this a breaking change?
No.