Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Fail Job Attachments action when consecutive transfer rates drop below threshold #33

Closed
wants to merge 1 commit into from

Conversation

gahyusuh
Copy link
Contributor

@gahyusuh gahyusuh commented Sep 15, 2023

What was the problem/requirement? (What/Why)

During the SYNC_INPUT_JOB_ATTACHMENTS session actions, there might be a potential for the download operation to get stuck or stall due to unforeseen issues. Currently, there is no mechanism to monitor or detect such scenarios. This poses a risk, especially when downloading many/large files or using slow/unstable network.

What was the solution? (How)

  • Leverages the on_downloading_files callback function, which is passed to the AssetSync sync_inputs function. This callback is intended to be called at regular intervals during the download, acting as a sort of heartbeat from Job Attachments library.
  • We use following conditions to determine if the sync_inputs should be stalled:
    • If, in a certain number of (e.g., 2) consecutive on_downloading_files callbacks, the transfer rate falls below a certain threshold (e.g., 50 MB/s), it is considered concerning or potentially stalled.
    • In order to determine threshold values, a simulation analysis was performed using a jupyter notebook.

What is the impact of this change?

Worker agent can now detect and halt potentially stuck SYNC_INPUT_JOB_ATTACHMENTS session actions. This ensures that resources are not wasted on prolonged stalled operations, and that users are timely notified about the issues.

How was this change tested?

  • hatch run lint && hatch run test
  • Run an end-to-end test (submitting a non-rendering job bundle --> running a CMF against my local backend) ensuring that there were no issues/errors in the flow.

Was this change documented?

Yes.

Is this a breaking change?

No.

@gahyusuh gahyusuh force-pushed the gahyusuh/job_att_heartbeat_check branch from f5461ad to c9fcd29 Compare September 15, 2023 19:30
@jusiskin jusiskin added the enhancement New feature or request label Sep 15, 2023
@gahyusuh gahyusuh force-pushed the gahyusuh/job_att_heartbeat_check branch from c18bc00 to 6940aa8 Compare September 18, 2023 19:07
Comment on lines 755 to 772
# A transfer rate below 1 Kb/s is considered concerning or potentially stalled.
TRANSFER_RATE_THRESHOLD = 1000 # bytes/s
# Each progress report callback takes 5 mins, so 3 reports amount to 15 mins in total
LOW_TRANSFER_COUNT_THRESHOLD = 3
low_transfer_count = 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So our circuit breaker will kick in when there are 15 minutes of 1Kb/s throughput on average?

What data do we have to support this hard-coded threshold?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In order to determine reasonable threshold values, a simulation analysis was performed in a new Jupyter notebook.

# rates, and if the count exceeds the spcified threshold, cancels the download and
# fails the current (sync_input_job_attachments) action.
nonlocal low_transfer_count
transfer_rate = job_attachment_download_status.transferRate
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not in scope for this PR, but I see we used camel casing here instead of the Python convetion of snake case (transfer_rate)

@gahyusuh gahyusuh force-pushed the gahyusuh/job_att_heartbeat_check branch 8 times, most recently from 717a54a to 695d169 Compare September 21, 2023 19:18
@gahyusuh gahyusuh marked this pull request as ready for review September 21, 2023 19:18
@gahyusuh gahyusuh requested a review from a team as a code owner September 21, 2023 19:18
@gahyusuh gahyusuh force-pushed the gahyusuh/job_att_heartbeat_check branch 2 times, most recently from 4055339 to 44dc5cd Compare September 22, 2023 14:14
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setting Thresholds For Anomaly Detection in Downlaod Speed\n",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some typos in this document, I'd recommend running it through a word doc.

Setting Thresholds For Anomaly Detection in Downlaod Speed -> Setting Thresholds For Anomaly Detection in Download Speed

Also, this file name has a typo, should be download_slowdown_detection.ipynb

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made too many typos, my apologies. I will correct them in the next revision.

@@ -752,12 +752,50 @@ def sync_asset_inputs(
if self._asset_sync is None:
return

def progress_handler(job_upload_status: ProgressReportMetadata) -> bool:
# A transfer rate below 1 Kb/s is considered concerning or potentially stalled.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit, I think you mean A transfer rate below 50 Kb/s...

state=ActionState.FAILED,
fail_message=(
"Input syncing failed due to successive low transfer rates (< 1 Kb/s). "
"The transfer rate was below the threshold for the last three checks."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I think this should last *two* checks and < 50 kb/s right? Maybe we should just put the hard-coded values in here?

@gahyusuh gahyusuh force-pushed the gahyusuh/job_att_heartbeat_check branch from 44dc5cd to 141c5bf Compare September 26, 2023 18:29
marofke
marofke previously approved these changes Sep 26, 2023
@gahyusuh gahyusuh force-pushed the gahyusuh/job_att_heartbeat_check branch from 141c5bf to 21d813c Compare October 5, 2023 15:19
@gahyusuh gahyusuh force-pushed the gahyusuh/job_att_heartbeat_check branch 2 times, most recently from 54341b6 to ec7af9a Compare October 24, 2023 01:26
@gahyusuh gahyusuh force-pushed the gahyusuh/job_att_heartbeat_check branch from ec7af9a to 238294c Compare November 6, 2023 22:42
@gahyusuh gahyusuh closed this Mar 21, 2024
@gahyusuh gahyusuh deleted the gahyusuh/job_att_heartbeat_check branch March 21, 2024 16:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants