Enable retry on network failures #14

Open
anandhu-eng opened this issue Dec 12, 2024 · 4 comments
Labels
enhancement New feature or request

Comments

@anandhu-eng
Contributor


Migrated from mlcommons/mlperf-automations_archived#11
Originally created by @arjunsuresh on Fri, 01 Nov 2024 10:44:17 GMT


We often see CM script runs failing due to network failures like this. It would be good to add a retry mechanism for such failures to improve the user experience and reduce failures in automated runs.

@anandhu-eng anandhu-eng added the enhancement New feature or request label Dec 12, 2024
@anandhu-eng
Contributor Author


Migrated from mlcommons/mlperf-automations_archived#11 (comment)
Originally created by @arjunsuresh on Thu, 07 Nov 2024 21:09:14 GMT


git clone failures

Cloning into 'inference'...
error: RPC failed; curl 92 HTTP/2 stream 5 was not closed cleanly: CANCEL (err 8)
error: 8093 bytes of body are still expected
fetch-pack: unexpected disconnect while reading sideband packet
fatal: early EOF
fatal: fetch-pack: invalid index-pack output
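
A retry around this kind of failure could look roughly like the sketch below. It is only an illustration, not code from the CM scripts: the function names, the attempt count, and the backoff delays are hypothetical, and treating the substrings from the log above as "transient" is an assumption.

```python
import subprocess
import time

# Substrings copied from the log above. Treating them as signs of a
# transient network problem is an assumption, not a documented git contract.
TRANSIENT_GIT_ERRORS = (
    "RPC failed",
    "early EOF",
    "unexpected disconnect",
)


def is_transient_git_error(stderr):
    """Return True if stderr contains one of the known transient markers."""
    return any(marker in stderr for marker in TRANSIENT_GIT_ERRORS)


def clone_with_retry(url, dest, attempts=3, delay=2.0,
                     run=subprocess.run, sleep=time.sleep):
    """Run `git clone`, retrying only when the failure looks transient.

    Returns the number of the attempt that succeeded. `run` and `sleep`
    are injectable so the logic can be exercised without a network.
    """
    cmd = ["git", "clone", url, dest]
    for attempt in range(1, attempts + 1):
        result = run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return attempt
        # Give up on the last attempt or on errors that a retry cannot fix.
        if attempt == attempts or not is_transient_git_error(result.stderr):
            raise RuntimeError(
                f"git clone failed after {attempt} attempt(s): "
                f"{result.stderr.strip()}"
            )
        # Exponential backoff: delay, 2*delay, 4*delay, ...
        sleep(delay * 2 ** (attempt - 1))
```

Retrying only on recognised messages keeps permanent errors, such as a wrong URL, failing fast instead of being repeated.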

@anandhu-eng
Contributor Author


Migrated from mlcommons/mlperf-automations_archived#11 (comment)
Originally created by @arjunsuresh on Tue, 05 Nov 2024 13:45:20 GMT


The failure below shows up frequently in our GitHub Actions runs. We are trying a fix: adding --dns 8.8.8.8 --dns 8.8.4.4 to the docker run command.

2024-11-05T13:38:34.8624636Z E: Failed to fetch http://archive.ubuntu.com/ubuntu/pool/main/s/systemd/libsystemd0_245.4-4ubuntu3.24_amd64.deb  Could not resolve 'archive.ubuntu.com'
2024-11-05T13:38:34.8629875Z E: Failed to fetch http://archive.ubuntu.com/ubuntu/pool/main/a/argon2/libargon2-1_0~20171227-0.2_amd64.deb  Could not resolve 'archive.ubuntu.com'
2024-11-05T13:38:34.8634850Z E: Failed to fetch http://archive.ubuntu.com/ubuntu/pool/main/l/lvm2/libdevmapper1.02.1_1.02.167-1ubuntu1_amd64.deb  Could not resolve 'archive.ubuntu.com'
2024-11-05T13:38:34.8639786Z E: Failed to fetch http://archive.ubuntu.com/ubuntu/pool/main/i/iptables/libip4tc2_1.8.4-3ubuntu2.1_amd64.deb  Could not resolve 'archive.ubuntu.com'
2024-11-05T13:38:34.8642386Z E: Failed to fetch http://archive.ubuntu.com/ubuntu/pool/main/k/kmod/libkmod2_27-1ubuntu2.1_amd64.deb  Could not resolve 'archive.ubuntu.com'
2024-11-05T13:38:34.8644257Z E: Failed to fetch http://archive.ubuntu.com/ubuntu/pool/main/s/systemd/systemd-timesyncd_245.4-4ubuntu3.24_amd64.deb  Could not resolve 'archive.ubuntu.com'
2024-11-05T13:38:34.8646140Z E: Failed to fetch http://archive.ubuntu.com/ubuntu/pool/main/s/systemd/systemd_245.4-4ubuntu3.24_amd64.deb  Could not resolve 'archive.ubuntu.com'
2024-11-05T13:38:34.8648397Z E: Failed to fetch http://archive.ubuntu.com/ubuntu/pool/main/l/lvm2/dmsetup_1.02.167-1ubuntu1_amd64.deb  Could not resolve 'archive.ubuntu.com'
2024-11-05T13:38:34.8650427Z E: Failed to fetch http://archive.ubuntu.com/ubuntu/pool/main/g/gobject-introspection/libgirepository-1.0-1_1.64.1-1~ubuntu20.04.1_amd64.deb  Could not resolve 'archive.ubuntu.com'
2024-11-05T13:38:34.8652805Z E: Failed to fetch http://archive.ubuntu.com/ubuntu/pool/main/g/gobject-introspection/gir1.2-glib-2.0_1.64.1-1~ubuntu20.04.1_amd64.deb  Could not resolve 'archive.ubuntu.com'
2024-11-05T13:38:34.8654823Z E: Failed to fetch http://archive.ubuntu.com/ubuntu/pool/main/d/dbus-python/python3-dbus_1.2.16-1build1_amd64.deb  Could not resolve 'archive.ubuntu.com'
2024-11-05T13:38:34.8656630Z E: Failed to fetch http://archive.ubuntu.com/ubuntu/pool/main/p/pygobject/python3-gi_3.36.0-1_amd64.deb  Could not resolve 'archive.ubuntu.com'
2024-11-05T13:38:34.8658027Z E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?
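
For illustration, the --dns workaround mentioned above could be applied programmatically along these lines. This is a sketch under assumptions: the helper name is hypothetical, it assumes the docker command is held as an argument list, and the image and command in the example are placeholders rather than what the workflow actually runs.

```python
def with_dns_flags(docker_cmd, dns_servers=("8.8.8.8", "8.8.4.4")):
    """Return a copy of a `docker run ...` argument list with --dns flags.

    The flags are placed right after `docker run`, before the image name,
    because docker treats everything after the image as the container command.
    """
    if list(docker_cmd[:2]) != ["docker", "run"]:
        raise ValueError("expected a command starting with 'docker run'")
    flags = []
    for server in dns_servers:
        flags += ["--dns", server]
    return list(docker_cmd[:2]) + flags + list(docker_cmd[2:])


if __name__ == "__main__":
    # Placeholder image and command, used only to show the resulting line.
    cmd = ["docker", "run", "--rm", "ubuntu:20.04", "apt-get", "update"]
    print(" ".join(with_dns_flags(cmd)))
```

This only changes which resolvers the container uses; it does not retry anything, so it complements a retry mechanism rather than replacing it.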

@anandhu-eng
Contributor Author


Migrated from mlcommons/mlperf-automations_archived#11 (comment)
Originally created by @arjunsuresh on Mon, 04 Nov 2024 11:44:16 GMT


@anandhu-eng I think we should enable it by default and give users an ENV variable to turn it off for any reason. But first we need to list the places where we need this; some of them are below. We should probably try it in one place and, if it works as expected, move on to the rest.

  1. git clone
  2. System util installation
  3. pip package installation
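
The "on by default, ENV variable to turn it off" idea could be sketched as follows. The variable name CM_DISABLE_NETWORK_RETRY, the helper names, and the default of three attempts are all hypothetical; the thread does not specify any of them.

```python
import os
import subprocess
import time

# Hypothetical variable name; the thread only says "an ENV variable".
DISABLE_VAR = "CM_DISABLE_NETWORK_RETRY"


def retry_attempts(env=None, default=3):
    """Retries are on by default; a truthy DISABLE_VAR means one attempt."""
    env = os.environ if env is None else env
    value = env.get(DISABLE_VAR, "").strip().lower()
    disabled = value in ("1", "true", "yes", "on")
    return 1 if disabled else default


def call_with_retry(step, attempts, delay=1.0, sleep=time.sleep,
                    retry_on=(OSError, subprocess.CalledProcessError)):
    """Call step() until it succeeds or attempts run out.

    The last error is re-raised unchanged so callers see the real failure.
    """
    for attempt in range(1, attempts + 1):
        try:
            return step()
        except retry_on:
            if attempt == attempts:
                raise
            sleep(delay)


def pip_install(package):
    """One of the listed places: a pip install wrapped in the retry helper."""
    return call_with_retry(
        lambda: subprocess.run(
            ["python3", "-m", "pip", "install", package], check=True
        ),
        attempts=retry_attempts(),
    )
```

Keeping the on/off decision in one function means each of the listed places (git clone, system util installation, pip package installation) can adopt the retry one at a time without duplicating the ENV handling.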

@anandhu-eng
Contributor Author


Migrated from mlcommons/mlperf-automations_archived#11 (comment)
Originally created by @anandhu-eng on Mon, 04 Nov 2024 09:46:42 GMT


Hi @arjunsuresh, this would be useful. Should it be on by default, or should it be controlled through an env variable? I'm wondering whether there is a case where a user would want to turn it off.
