E2E for MXJob is the flaky test #1743

Closed
tenzen-y opened this issue Jan 25, 2023 · 8 comments

Comments

@tenzen-y
Member

/kind bug

The E2E test for MXJob is flaky. We should fix it.

INFO     root:utils.py:36 


MXJob is running
INFO     root:utils.py:47 mxjob-mnist-ci-test            Running              2023-01-25 19:02:12+00:00
INFO     root:utils.py:47 mxjob-mnist-ci-test            Running              2023-01-25 19:02:12+00:00
INFO     root:utils.py:47 mxjob-mnist-ci-test            Running              2023-01-25 19:02:12+00:00
INFO     root:utils.py:47 mxjob-mnist-ci-test            Running              2023-01-25 19:02:12+00:00
INFO     root:utils.py:47 mxjob-mnist-ci-test            Running              2023-01-25 19:02:12+00:00
INFO     root:utils.py:47 mxjob-mnist-ci-test            Running              2023-01-25 19:02:12+00:00
INFO     root:utils.py:47 mxjob-mnist-ci-test            Failed               2023-01-25 19:03:34+00:00
=========================== short test summary info ============================
FAILED sdk/python/test/e2e/test_e2e_mxjob.py::test_sdk_e2e - RuntimeError: MXJob default/mxjob-mnist-ci-test is Failed. MXJob conditions: [{'last_transition_time': datetime.datetime(2023, 1, 25, 19, 1, 57, tzinfo=tzlocal()),
 'last_update_time': datetime.datetime(2023, 1, 25, 19, 1, 57, tzinfo=tzlocal()),
 'message': 'MXJob mxjob-mnist-ci-test is created.',
 'reason': 'MXJobCreated',
 'status': 'True',
 'type': 'Created'}, {'last_transition_time': datetime.datetime(2023, 1, 25, 19, 2, 12, tzinfo=tzlocal()),
 'last_update_time': datetime.datetime(2023, 1, 25, 19, 2, 12, tzinfo=tzlocal()),
 'message': 'MXJob mxjob-mnist-ci-test is running.',
 'reason': 'MXJobRunning',
 'status': 'False',
 'type': 'Running'}, {'last_transition_time': datetime.datetime(2023, 1, 25, 19, 3, 34, tzinfo=tzlocal()),
 'last_update_time': datetime.datetime(2023, 1, 25, 19, 3, 34, tzinfo=tzlocal()),
 'message': 'mxjob mxjob-mnist-ci-test is failed because 1 Worker replica(s) '
            'failed.',
 'reason': 'MXJobFailed',
 'status': 'True',
 'type': 'Failed'}]
============= 1 failed, 32 passed, 6 skipped in 692.86s (0:11:32) ==============
Error: Process completed with exit code 1.

https://github.com/kubeflow/training-operator/actions/runs/4009038764/jobs/6883927326
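
To read a conditions list like the one in the failure above programmatically, a small helper can pick out the most recent condition whose `status` is `'True'`. This is only a sketch over plain dicts shaped like that output. `latest_true_condition` is a hypothetical helper, not part of the training-operator SDK, and the sample data is abbreviated from the log with UTC assumed as the timezone.

```python
from datetime import datetime, timezone
from typing import Optional


def latest_true_condition(conditions: list) -> Optional[dict]:
    """Return the newest condition with status 'True', or None if there is none."""
    active = [c for c in conditions if c.get("status") == "True"]
    return max(active, key=lambda c: c["last_transition_time"], default=None)


# Abbreviated copy of the conditions reported in the failure (UTC assumed).
conditions = [
    {"type": "Created", "status": "True", "reason": "MXJobCreated",
     "last_transition_time": datetime(2023, 1, 25, 19, 1, 57, tzinfo=timezone.utc)},
    {"type": "Running", "status": "False", "reason": "MXJobRunning",
     "last_transition_time": datetime(2023, 1, 25, 19, 2, 12, tzinfo=timezone.utc)},
    {"type": "Failed", "status": "True", "reason": "MXJobFailed",
     "last_transition_time": datetime(2023, 1, 25, 19, 3, 34, tzinfo=timezone.utc)},
]

current = latest_true_condition(conditions)
print(current["type"], current["reason"])
```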

/cc @johnugeorge

@tenzen-y
Member Author

/kind cancel

@google-oss-prow

@tenzen-y: The label(s) kind/cancel cannot be applied, because the repository doesn't have them.

In response to this:

/kind cancel

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tenzen-y
Member Author

/kind e2e-test-failure

@johnugeorge
Member

Yeah. This is the cause of failure in most integration tests. We need to root-cause the problem.

Thanks @tenzen-y for creating this.

@tenzen-y
Member Author

The worker seems to hit connection errors while downloading the MNIST dataset from web.archive.org. Maybe we need to fix the sample code.

mxjob-mnist-ci-test-worker-0 mxnet INFO:root:start with arguments Namespace(add_stn=False, batch_size=64, disp_batches=100, dtype='float32', gc_threshold=0.5, gc_type='none', gpus=None, image_shape='1, 28, 28', initializer='default', kv_store='dist_sync', load_epoch=None, loss='', lr=0.05, lr_factor=0.1, lr_step_epochs='10', macrobatch_size=0, model_prefix=None, mom=0.9, monitor=0, network='mlp', num_classes=10, num_epochs=1, num_examples=1000, num_layers=None, optimizer='sgd', profile_server_suffix='', profile_worker_suffix='', save_period=1, test_io=0, top_k=0, use_imagenet_data_augmentation=0, warmup_epochs=5, warmup_strategy='linear', wd=0.0001)
mxjob-mnist-ci-test-worker-0 mxnet DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): web.archive.org:443
mxjob-mnist-ci-test-worker-0 mxnet DEBUG:urllib3.connectionpool:https://web.archive.org:443 "GET /web/20160828233817/http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz HTTP/1.1" 302 0
mxjob-mnist-ci-test-worker-0 mxnet DEBUG:urllib3.connectionpool:https://web.archive.org:443 "GET /web/20161113155520/http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz HTTP/1.1" 200 28881
mxjob-mnist-ci-test-worker-0 mxnet incubator-mxnet/example/image-classification/train_mnist.py:38: DeprecationWarning: The binary mode of fromstring is deprecated, as it behaves surprisingly on unicode inputs. Use frombuffer instead
mxjob-mnist-ci-test-worker-0 mxnet   label = np.fromstring(flbl.read(), dtype=np.int8)
mxjob-mnist-ci-test-worker-0 mxnet DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): web.archive.org:443
mxjob-mnist-ci-test-worker-0 mxnet Traceback (most recent call last):
mxjob-mnist-ci-test-worker-0 mxnet   File "/usr/local/lib/python3.7/dist-packages/urllib3/connection.py", line 175, in _new_conn
mxjob-mnist-ci-test-worker-0 mxnet     (self._dns_host, self.port), self.timeout, **extra_kw
mxjob-mnist-ci-test-worker-0 mxnet   File "/usr/local/lib/python3.7/dist-packages/urllib3/util/connection.py", line 95, in create_connection
mxjob-mnist-ci-test-worker-0 mxnet     raise err
mxjob-mnist-ci-test-worker-0 mxnet   File "/usr/local/lib/python3.7/dist-packages/urllib3/util/connection.py", line 85, in create_connection
mxjob-mnist-ci-test-worker-0 mxnet     sock.connect(sa)
mxjob-mnist-ci-test-worker-0 mxnet ConnectionRefusedError: [Errno 111] Connection refused
mxjob-mnist-ci-test-worker-0 mxnet 
mxjob-mnist-ci-test-worker-0 mxnet During handling of the above exception, another exception occurred:
mxjob-mnist-ci-test-worker-0 mxnet 
mxjob-mnist-ci-test-worker-0 mxnet Traceback (most recent call last):
mxjob-mnist-ci-test-worker-0 mxnet   File "/usr/local/lib/python3.7/dist-packages/urllib3/connectionpool.py", line 710, in urlopen
mxjob-mnist-ci-test-worker-0 mxnet     chunked=chunked,
mxjob-mnist-ci-test-worker-0 mxnet   File "/usr/local/lib/python3.7/dist-packages/urllib3/connectionpool.py", line 386, in _make_request
mxjob-mnist-ci-test-worker-0 mxnet     self._validate_conn(conn)
mxjob-mnist-ci-test-worker-0 mxnet   File "/usr/local/lib/python3.7/dist-packages/urllib3/connectionpool.py", line 1040, in _validate_conn
mxjob-mnist-ci-test-worker-0 mxnet     conn.connect()
mxjob-mnist-ci-test-worker-0 mxnet   File "/usr/local/lib/python3.7/dist-packages/urllib3/connection.py", line 358, in connect
mxjob-mnist-ci-test-worker-0 mxnet     self.sock = conn = self._new_conn()
mxjob-mnist-ci-test-worker-0 mxnet   File "/usr/local/lib/python3.7/dist-packages/urllib3/connection.py", line 187, in _new_conn
mxjob-mnist-ci-test-worker-0 mxnet     self, "Failed to establish a new connection: %s" % e
mxjob-mnist-ci-test-worker-0 mxnet urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x404f5c92d0>: Failed to establish a new connection: [Errno 111] Connection refused
mxjob-mnist-ci-test-worker-0 mxnet 
mxjob-mnist-ci-test-worker-0 mxnet During handling of the above exception, another exception occurred:
mxjob-mnist-ci-test-worker-0 mxnet 
mxjob-mnist-ci-test-worker-0 mxnet Traceback (most recent call last):
mxjob-mnist-ci-test-worker-0 mxnet   File "/usr/local/lib/python3.7/dist-packages/requests/adapters.py", line 450, in send
mxjob-mnist-ci-test-worker-0 mxnet     timeout=timeout
mxjob-mnist-ci-test-worker-0 mxnet   File "/usr/local/lib/python3.7/dist-packages/urllib3/connectionpool.py", line 786, in urlopen
mxjob-mnist-ci-test-worker-0 mxnet     method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
mxjob-mnist-ci-test-worker-0 mxnet   File "/usr/local/lib/python3.7/dist-packages/urllib3/util/retry.py", line 592, in increment
mxjob-mnist-ci-test-worker-0 mxnet     raise MaxRetryError(_pool, url, error or ResponseError(cause))
mxjob-mnist-ci-test-worker-0 mxnet urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='web.archive.org', port=443): Max retries exceeded with url: /web/20160828233817/http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x404f5c92d0>: Failed to establish a new connection: [Errno 111] Connection refused'))
mxjob-mnist-ci-test-worker-0 mxnet 
mxjob-mnist-ci-test-worker-0 mxnet During handling of the above exception, another exception occurred:
mxjob-mnist-ci-test-worker-0 mxnet 
mxjob-mnist-ci-test-worker-0 mxnet Traceback (most recent call last):
mxjob-mnist-ci-test-worker-0 mxnet   File "incubator-mxnet/example/image-classification/train_mnist.py", line 97, in <module>
mxjob-mnist-ci-test-worker-0 mxnet     fit.fit(args, sym, get_mnist_iter)
mxjob-mnist-ci-test-worker-0 mxnet   File "/mxnet/incubator-mxnet/example/image-classification/common/fit.py", line 182, in fit
mxjob-mnist-ci-test-worker-0 mxnet     (train, val) = data_loader(args, kv)
mxjob-mnist-ci-test-worker-0 mxnet   File "incubator-mxnet/example/image-classification/train_mnist.py", line 56, in get_mnist_iter
mxjob-mnist-ci-test-worker-0 mxnet     'train-labels-idx1-ubyte.gz', 'train-images-idx3-ubyte.gz')
mxjob-mnist-ci-test-worker-0 mxnet   File "incubator-mxnet/example/image-classification/train_mnist.py", line 39, in read_data
mxjob-mnist-ci-test-worker-0 mxnet     with gzip.open(download_file(base_url+image, os.path.join('data',image)), 'rb') as fimg:
mxjob-mnist-ci-test-worker-0 mxnet   File "/mxnet/incubator-mxnet/example/image-classification/common/util.py", line 42, in download_file
mxjob-mnist-ci-test-worker-0 mxnet     r = requests.get(url, stream=True)
mxjob-mnist-ci-test-worker-0 mxnet   File "/usr/local/lib/python3.7/dist-packages/requests/api.py", line 75, in get
mxjob-mnist-ci-test-worker-0 mxnet     return request('get', url, params=params, **kwargs)
mxjob-mnist-ci-test-worker-0 mxnet   File "/usr/local/lib/python3.7/dist-packages/requests/api.py", line 61, in request
mxjob-mnist-ci-test-worker-0 mxnet     return session.request(method=method, url=url, **kwargs)
mxjob-mnist-ci-test-worker-0 mxnet   File "/usr/local/lib/python3.7/dist-packages/requests/sessions.py", line 529, in request
mxjob-mnist-ci-test-worker-0 mxnet     resp = self.send(prep, **send_kwargs)
mxjob-mnist-ci-test-worker-0 mxnet   File "/usr/local/lib/python3.7/dist-packages/requests/sessions.py", line 645, in send
mxjob-mnist-ci-test-worker-0 mxnet     r = adapter.send(request, **kwargs)
mxjob-mnist-ci-test-worker-0 mxnet   File "/usr/local/lib/python3.7/dist-packages/requests/adapters.py", line 519, in send
mxjob-mnist-ci-test-worker-0 mxnet     raise ConnectionError(e, request=request)
mxjob-mnist-ci-test-worker-0 mxnet requests.exceptions.ConnectionError: HTTPSConnectionPool(host='web.archive.org', port=443): Max retries exceeded with url: /web/20160828233817/http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x404f5c92d0>: Failed to establish a new connection: [Errno 111] Connection refused'))
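
One way to make a sample less sensitive to transient failures like the `Connection refused` in the traceback above is to retry the dataset download with exponential backoff. The following is a minimal sketch, not the fix applied in the repository. `fetch_url`, `download_with_retry`, their parameters, and the injectable `fetch` and `sleep` callables are all hypothetical names introduced here for illustration, and the sketch uses only the standard library rather than `requests`.

```python
import time
import urllib.request
from typing import Callable


def fetch_url(url: str, timeout: float = 30.0) -> bytes:
    """Download url and return the response body (standard library only)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def download_with_retry(
    url: str,
    fetch: Callable[[str], bytes] = fetch_url,
    attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Call fetch(url), retrying on OSError with exponential backoff.

    ConnectionRefusedError and urllib.error.URLError are both OSError subclasses.
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(attempts):
        try:
            return fetch(url)
        except OSError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...
    raise AssertionError("unreachable")
```

Retrying only helps with short outages. If the remote host stays unreachable for the whole CI run, the test still fails, so removing the network dependency from the sample would be the more robust option.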

@tenzen-y
Member Author

This flaky test seems to be fixed by #1754.

@tenzen-y
Member Author

/close

@google-oss-prow

@tenzen-y: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
