This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

nnictl resume cannot open GUI #5803

Open
jimmy133719 opened this issue Aug 26, 2024 · 0 comments


jimmy133719 commented Aug 26, 2024

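As the title says, the GUI cannot be opened after I resume the experiment.
I observed that the GUI only opens if the output log stays at "Web portal URLs: ..." (that is, if the command keeps running). Here the command continues past that line and shuts the experiment down, so the GUI is unreachable. I can't find a way to keep the "nnictl resume ID" command running.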

Complete log of command:

[2024-08-26 13:10:28] Creating experiment, Experiment ID: z39sirw8
[2024-08-26 13:10:28] Starting web server...
[2024-08-26 13:10:29] INFO (main) Start NNI manager
[2024-08-26 13:10:29] INFO (RestServer) Starting REST server at port 8080, URL prefix: "/"
[2024-08-26 13:10:29] INFO (RestServer) REST server started.
[2024-08-26 13:10:29] INFO (NNIDataStore) Datastore initialization done
[2024-08-26 13:10:29] Setting up...
[2024-08-26 13:10:30] INFO (NNIManager) Resuming experiment: z39sirw8
[2024-08-26 13:10:30] INFO (NNIManager) Setup training service...
[2024-08-26 13:10:30] INFO (NNIManager) Setup tuner...
[2024-08-26 13:10:31] INFO (NNIManager) Number of current submitted trials: 621, where 0 is resuming.
[2024-08-26 13:10:31] INFO (NNIManager) Change NNIManager status from: INITIALIZED to: RUNNING
[2024-08-26 13:10:31] Web portal URLs: http://127.0.0.1:8080 http://172.17.0.2:8080
[2024-08-26 13:10:31] Stopping experiment, please wait...
[2024-08-26 13:10:31] Saving experiment checkpoint...
[2024-08-26 13:10:31] Stopping NNI manager, if any...
[2024-08-26 13:10:31] INFO (ShutdownManager) Initiate shutdown: REST request
[2024-08-26 13:10:31] INFO (RestServer) Stopping REST server.
[2024-08-26 13:10:31] ERROR (ShutdownManager) Error during shutting down NniManager: TypeError: Cannot read properties of undefined (reading 'getBufferedAmount')
at TunerServer.sendCommand (/usr/local/lib/python3.8/dist-packages/nni_node/core/tuner_command_channel.js:60:26)
at NNIManager.stopExperimentTopHalf (/usr/local/lib/python3.8/dist-packages/nni_node/core/nnimanager.js:303:25)
at NNIManager.stopExperiment (/usr/local/lib/python3.8/dist-packages/nni_node/core/nnimanager.js:292:20)
at /usr/local/lib/python3.8/dist-packages/nni_node/common/globals/shutdown.js:49:23
at Array.map (<anonymous>)
at ShutdownManager.shutdown (/usr/local/lib/python3.8/dist-packages/nni_node/common/globals/shutdown.js:47:51)
at ShutdownManager.initiate (/usr/local/lib/python3.8/dist-packages/nni_node/common/globals/shutdown.js:22:18)
at /usr/local/lib/python3.8/dist-packages/nni_node/rest_server/restHandler.js:366:40
at Layer.handle [as handle_request] (/usr/local/lib/python3.8/dist-packages/nni_node/node_modules/express/lib/router/layer.js:95:5)
at next (/usr/local/lib/python3.8/dist-packages/nni_node/node_modules/express/lib/router/route.js:144:13)
[2024-08-26 13:10:31] INFO (NNIManager) Change NNIManager status from: RUNNING to: STOPPING
[2024-08-26 13:10:31] INFO (NNIManager) Stopping experiment, cleaning up ...
[2024-08-26 13:10:31] INFO (ShutdownManager) Shutdown complete.
[2024-08-26 13:10:31] INFO (RestServer) REST server stopped.
[2024-08-26 13:10:31] Experiment stopped.
root@e44bc2dd4409:/workspace/MediaPipePyTorch# Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.8/dist-packages/nni/__main__.py", line 85, in <module>
main()
File "/usr/local/lib/python3.8/dist-packages/nni/__main__.py", line 58, in main
dispatcher = MsgDispatcher(url, tuner, assessor)
File "/usr/local/lib/python3.8/dist-packages/nni/runtime/msg_dispatcher.py", line 71, in __init__
super().__init__(command_channel_url)
File "/usr/local/lib/python3.8/dist-packages/nni/runtime/msg_dispatcher_base.py", line 47, in __init__
self._channel.connect()
File "/usr/local/lib/python3.8/dist-packages/nni/runtime/tuner_command_channel/channel.py", line 58, in connect
self._channel.connect()
File "/usr/local/lib/python3.8/dist-packages/nni/runtime/command_channel/websocket/channel.py", line 23, in connect
self._ensure_conn()
File "/usr/local/lib/python3.8/dist-packages/nni/runtime/command_channel/websocket/channel.py", line 75, in _ensure_conn
self._conn.connect()
File "/usr/local/lib/python3.8/dist-packages/nni/runtime/command_channel/websocket/connection.py", line 65, in connect
self._ws = _wait(_connect_async(self._url))
File "/usr/local/lib/python3.8/dist-packages/nni/runtime/command_channel/websocket/connection.py", line 121, in _wait
return future.result()
File "/usr/lib/python3.8/concurrent/futures/_base.py", line 444, in result
return self.__get_result()
File "/usr/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
raise self._exception
File "/usr/local/lib/python3.8/dist-packages/nni/runtime/command_channel/websocket/connection.py", line 135, in _connect_async
return await websockets.connect(url, max_size=None) # type: ignore
File "/usr/local/lib/python3.8/dist-packages/websockets/legacy/client.py", line 655, in await_impl_timeout
return await self.await_impl()
File "/usr/local/lib/python3.8/dist-packages/websockets/legacy/client.py", line 659, in await_impl
_transport, _protocol = await self._create_connection()
File "/usr/lib/python3.8/asyncio/base_events.py", line 1033, in create_connection
raise OSError('Multiple exceptions: {}'.format(
OSError: Multiple exceptions: [Errno 111] Connect call failed ('127.0.0.1', 8080), [Errno 99] Cannot assign requested address
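
The final OSError suggests that nothing was accepting connections on 127.0.0.1:8080 when the dispatcher tried to connect, which fits the REST server having been stopped at 13:10:31. As a rough way to confirm whether the port is reachable after running the resume command, a minimal sketch using only the Python standard library is below. The helper name port_is_open is hypothetical and not part of NNI; the host and port are taken from the log above.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds.

    Hypothetical helper for diagnosis only; it is not part of NNI.
    """
    try:
        # create_connection raises OSError (e.g. ConnectionRefusedError,
        # i.e. Errno 111 on Linux) when nothing is listening on the port.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Host and port as they appear in the "Web portal URLs" log line.
    host, port = "127.0.0.1", 8080
    state = "open" if port_is_open(host, port) else "closed"
    print(f"{host}:{port} is {state}")
```

If this reports the port as closed right after the resume command returns, the web portal has most likely already been shut down together with the REST server, so the browser has nothing to connect to.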
