E2E for MXJob is the flaky test #1743
Comments
/kind cancel |
@tenzen-y: The label(s) In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/kind e2e-test-failure |
Yeah. This is the cause of failure in most integration tests. We need to root-cause the problem. Thanks @tenzen-y for creating this. |
It seems to be hitting connection errors. Maybe we need to fix the sample code.
mxjob-mnist-ci-test-worker-0 mxnet INFO:root:start with arguments Namespace(add_stn=False, batch_size=64, disp_batches=100, dtype='float32', gc_threshold=0.5, gc_type='none', gpus=None, image_shape='1, 28, 28', initializer='default', kv_store='dist_sync', load_epoch=None, loss='', lr=0.05, lr_factor=0.1, lr_step_epochs='10', macrobatch_size=0, model_prefix=None, mom=0.9, monitor=0, network='mlp', num_classes=10, num_epochs=1, num_examples=1000, num_layers=None, optimizer='sgd', profile_server_suffix='', profile_worker_suffix='', save_period=1, test_io=0, top_k=0, use_imagenet_data_augmentation=0, warmup_epochs=5, warmup_strategy='linear', wd=0.0001)
mxjob-mnist-ci-test-worker-0 mxnet DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): web.archive.org:443
mxjob-mnist-ci-test-worker-0 mxnet DEBUG:urllib3.connectionpool:https://web.archive.org:443 "GET /web/20160828233817/http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz HTTP/1.1" 302 0
mxjob-mnist-ci-test-worker-0 mxnet DEBUG:urllib3.connectionpool:https://web.archive.org:443 "GET /web/20161113155520/http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz HTTP/1.1" 200 28881
mxjob-mnist-ci-test-worker-0 mxnet incubator-mxnet/example/image-classification/train_mnist.py:38: DeprecationWarning: The binary mode of fromstring is deprecated, as it behaves surprisingly on unicode inputs. Use frombuffer instead
mxjob-mnist-ci-test-worker-0 mxnet label = np.fromstring(flbl.read(), dtype=np.int8)
mxjob-mnist-ci-test-worker-0 mxnet DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): web.archive.org:443
mxjob-mnist-ci-test-worker-0 mxnet Traceback (most recent call last):
mxjob-mnist-ci-test-worker-0 mxnet File "/usr/local/lib/python3.7/dist-packages/urllib3/connection.py", line 175, in _new_conn
mxjob-mnist-ci-test-worker-0 mxnet (self._dns_host, self.port), self.timeout, **extra_kw
mxjob-mnist-ci-test-worker-0 mxnet File "/usr/local/lib/python3.7/dist-packages/urllib3/util/connection.py", line 95, in create_connection
mxjob-mnist-ci-test-worker-0 mxnet raise err
mxjob-mnist-ci-test-worker-0 mxnet File "/usr/local/lib/python3.7/dist-packages/urllib3/util/connection.py", line 85, in create_connection
mxjob-mnist-ci-test-worker-0 mxnet sock.connect(sa)
mxjob-mnist-ci-test-worker-0 mxnet ConnectionRefusedError: [Errno 111] Connection refused
mxjob-mnist-ci-test-worker-0 mxnet
mxjob-mnist-ci-test-worker-0 mxnet During handling of the above exception, another exception occurred:
mxjob-mnist-ci-test-worker-0 mxnet
mxjob-mnist-ci-test-worker-0 mxnet Traceback (most recent call last):
mxjob-mnist-ci-test-worker-0 mxnet File "/usr/local/lib/python3.7/dist-packages/urllib3/connectionpool.py", line 710, in urlopen
mxjob-mnist-ci-test-worker-0 mxnet chunked=chunked,
mxjob-mnist-ci-test-worker-0 mxnet File "/usr/local/lib/python3.7/dist-packages/urllib3/connectionpool.py", line 386, in _make_request
mxjob-mnist-ci-test-worker-0 mxnet self._validate_conn(conn)
mxjob-mnist-ci-test-worker-0 mxnet File "/usr/local/lib/python3.7/dist-packages/urllib3/connectionpool.py", line 1040, in _validate_conn
mxjob-mnist-ci-test-worker-0 mxnet conn.connect()
mxjob-mnist-ci-test-worker-0 mxnet File "/usr/local/lib/python3.7/dist-packages/urllib3/connection.py", line 358, in connect
mxjob-mnist-ci-test-worker-0 mxnet self.sock = conn = self._new_conn()
mxjob-mnist-ci-test-worker-0 mxnet File "/usr/local/lib/python3.7/dist-packages/urllib3/connection.py", line 187, in _new_conn
mxjob-mnist-ci-test-worker-0 mxnet self, "Failed to establish a new connection: %s" % e
mxjob-mnist-ci-test-worker-0 mxnet urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x404f5c92d0>: Failed to establish a new connection: [Errno 111] Connection refused
mxjob-mnist-ci-test-worker-0 mxnet
mxjob-mnist-ci-test-worker-0 mxnet During handling of the above exception, another exception occurred:
mxjob-mnist-ci-test-worker-0 mxnet
mxjob-mnist-ci-test-worker-0 mxnet Traceback (most recent call last):
mxjob-mnist-ci-test-worker-0 mxnet File "/usr/local/lib/python3.7/dist-packages/requests/adapters.py", line 450, in send
mxjob-mnist-ci-test-worker-0 mxnet timeout=timeout
mxjob-mnist-ci-test-worker-0 mxnet File "/usr/local/lib/python3.7/dist-packages/urllib3/connectionpool.py", line 786, in urlopen
mxjob-mnist-ci-test-worker-0 mxnet method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
mxjob-mnist-ci-test-worker-0 mxnet File "/usr/local/lib/python3.7/dist-packages/urllib3/util/retry.py", line 592, in increment
mxjob-mnist-ci-test-worker-0 mxnet raise MaxRetryError(_pool, url, error or ResponseError(cause))
mxjob-mnist-ci-test-worker-0 mxnet urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='web.archive.org', port=443): Max retries exceeded with url: /web/20160828233817/http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x404f5c92d0>: Failed to establish a new connection: [Errno 111] Connection refused'))
mxjob-mnist-ci-test-worker-0 mxnet
mxjob-mnist-ci-test-worker-0 mxnet During handling of the above exception, another exception occurred:
mxjob-mnist-ci-test-worker-0 mxnet
mxjob-mnist-ci-test-worker-0 mxnet Traceback (most recent call last):
mxjob-mnist-ci-test-worker-0 mxnet File "incubator-mxnet/example/image-classification/train_mnist.py", line 97, in <module>
mxjob-mnist-ci-test-worker-0 mxnet fit.fit(args, sym, get_mnist_iter)
mxjob-mnist-ci-test-worker-0 mxnet File "/mxnet/incubator-mxnet/example/image-classification/common/fit.py", line 182, in fit
mxjob-mnist-ci-test-worker-0 mxnet (train, val) = data_loader(args, kv)
mxjob-mnist-ci-test-worker-0 mxnet File "incubator-mxnet/example/image-classification/train_mnist.py", line 56, in get_mnist_iter
mxjob-mnist-ci-test-worker-0 mxnet 'train-labels-idx1-ubyte.gz', 'train-images-idx3-ubyte.gz')
mxjob-mnist-ci-test-worker-0 mxnet File "incubator-mxnet/example/image-classification/train_mnist.py", line 39, in read_data
mxjob-mnist-ci-test-worker-0 mxnet with gzip.open(download_file(base_url+image, os.path.join('data',image)), 'rb') as fimg:
mxjob-mnist-ci-test-worker-0 mxnet File "/mxnet/incubator-mxnet/example/image-classification/common/util.py", line 42, in download_file
mxjob-mnist-ci-test-worker-0 mxnet r = requests.get(url, stream=True)
mxjob-mnist-ci-test-worker-0 mxnet File "/usr/local/lib/python3.7/dist-packages/requests/api.py", line 75, in get
mxjob-mnist-ci-test-worker-0 mxnet return request('get', url, params=params, **kwargs)
mxjob-mnist-ci-test-worker-0 mxnet File "/usr/local/lib/python3.7/dist-packages/requests/api.py", line 61, in request
mxjob-mnist-ci-test-worker-0 mxnet return session.request(method=method, url=url, **kwargs)
mxjob-mnist-ci-test-worker-0 mxnet File "/usr/local/lib/python3.7/dist-packages/requests/sessions.py", line 529, in request
mxjob-mnist-ci-test-worker-0 mxnet resp = self.send(prep, **send_kwargs)
mxjob-mnist-ci-test-worker-0 mxnet File "/usr/local/lib/python3.7/dist-packages/requests/sessions.py", line 645, in send
mxjob-mnist-ci-test-worker-0 mxnet r = adapter.send(request, **kwargs)
mxjob-mnist-ci-test-worker-0 mxnet File "/usr/local/lib/python3.7/dist-packages/requests/adapters.py", line 519, in send
mxjob-mnist-ci-test-worker-0 mxnet raise ConnectionError(e, request=request)
mxjob-mnist-ci-test-worker-0 mxnet requests.exceptions.ConnectionError: HTTPSConnectionPool(host='web.archive.org', port=443): Max retries exceeded with url: /web/20160828233817/http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x404f5c92d0>: Failed to establish a new connection: [Errno 111] Connection refused')) |
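Editor's sketch, not part of the original thread: the traceback above shows a single `requests.get` call in `download_file` failing with `Connection refused` and no retry at the application level. One possible way to make such a download less flaky is to retry with exponential backoff. The code below is a minimal sketch under assumptions: the `retry` helper, its parameters, and this `download_file` body are hypothetical, it uses the standard library's `urllib.request` rather than `requests`, and it is not claimed to be what #1754 changed.

```python
import os
import time
import urllib.request


def retry(fn, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn() until it succeeds, backing off exponentially on OSError.

    Hypothetical helper: ConnectionRefusedError and urllib.error.URLError
    are both OSError subclasses, so transient network failures are retried.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except OSError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...


def download_file(url, local_path, attempts=5):
    """Hypothetical stand-in for the example's download helper.

    Returns local_path, downloading the file first if it is not cached.
    """
    if os.path.exists(local_path):
        return local_path
    os.makedirs(os.path.dirname(local_path) or ".", exist_ok=True)

    def fetch():
        # Write to a temporary name so a failed attempt leaves no partial file
        # under the final name.
        tmp = local_path + ".part"
        with urllib.request.urlopen(url, timeout=30) as resp, open(tmp, "wb") as out:
            out.write(resp.read())
        os.replace(tmp, local_path)
        return local_path

    return retry(fetch, attempts=attempts)
```

Retrying only hides short outages. If the remote host refuses connections for longer than the backoff window, the test would still fail, so baking the dataset into the test image is another option worth considering.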
This flaky test seems to be fixed by #1754. |
/close |
@tenzen-y: Closing this issue. In response to this:
/kind bug
The E2E test for MXJob is flaky. We should fix it.
https://github.com/kubeflow/training-operator/actions/runs/4009038764/jobs/6883927326
/cc @johnugeorge