Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Webserver and yagna autostart #69

Merged
merged 3 commits into from
Sep 20, 2023
Merged

Conversation

approxit
Copy link
Contributor

What I've done:

  • Upgraded yagna autostart/stop in webserver - yagna is checked / started at webserver start, and stopped at webserver stop (if yagna was managed by webserver in the first place)
  • Enabled auto yagna payment fund, as its required each time after yagna restart, due to usage of goerli instead of rinkeby
  • Upgraded a little logs of few things to have them more explicit
  • Upgraded webserver autostart/stop in node provider - webserver is checked at before any webserver call, started if needed, and stopped after all nodes are terminated
  • Reformatted the code

Notable remarks:

  • Auto yagna payment fund should be tested / changed while using mainnet, as the is no reason to force the user to wait to collect testnet funds
  • As node provider is used by ray only for atomic cluster actions, we have no control over up and down commands, hence we don't know the true intent. This arises in webserver stopping, node termination is not a last commands that is ran at down. I've assumed that 10s delay from

if await self._check_if_yagna_is_running():
logger.info("Yagna service is already running")
else:
await self._run_yagna_service()
await self._run_yagna_payment_fund() # FIXME
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yagna payment fund is needed only when yagna is launched the first time - or - when it depletes testnet fund (be it ETH or tGLM)

on a regular restart, only yagna payment init is currently required (and, actually, could be done without if we ported the mechanism currently used in yapapi)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to payment init.

@@ -19,6 +25,9 @@ def __init__(self, base_url: URL) -> None:

self._session = requests.Session()

if not is_running_on_golem_network():
self._start_webserver()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wondering aloud... is it sensible to put such long-blocking operations into a constructor? my intuition suggests this should be a separate step...

plus, it sounds somewhat unintuitive in general to start a server when you instiantiate a client ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well... It sounded better at first, but good point. Moving this code to node provider fixes the problem of ray logs while starting webserver.

@@ -19,6 +25,9 @@ def __init__(self, base_url: URL) -> None:

self._session = requests.Session()

if not is_running_on_golem_network():
self._start_webserver()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on a separate note - question - what if the webserver startup fails when it's required? wouldn't we want to raise an explicit exception?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_start_webserver() already handles explicit exception for that case.

start_new_session=True,
)

for _ in range(10):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpicking but possibly such magic values as the number of retries and the interval between them should be extracted to module-level constants ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, added configurable and time based waiting.

@@ -192,3 +203,62 @@ def _make_request(
raise RayOnGolemClientValidationError(
"Couldn't validate response data",
) from e

def _start_webserver(self) -> None:
with cli_logger.group("Ray On Golem webserver"):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the name of the logging group should most likely be a constant defined somewhere...

url=settings.URL_SELF_SHUTDOWN,
request_data=models.SelfShutdownRequestData(),
response_model=models.SelfShutdownResponseData,
error_message="Couldn't send self shutdown request",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
error_message="Couldn't send self shutdown request",
error_message="Couldn't send a self-shutdown request",

)

if response.shutdown_state == ShutdownState.NOT_ENABLED:
cli_logger.print("No need to stop webserver, as it was ran externally")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
cli_logger.print("No need to stop webserver, as it was ran externally")
cli_logger.print("No need to stop the webserver, as it was started externally")

cli_logger.print("No need to stop webserver, as it was ran externally")
return
elif response.shutdown_state == ShutdownState.CLUSTER_NOT_EMPTY:
cli_logger.print("No need to stop webserver, as cluster is not empty")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
cli_logger.print("No need to stop webserver, as cluster is not empty")
cli_logger.print("No need to stop the webserver, as the cluster is not empty")

cli_logger.print("No need to stop webserver, as cluster is not empty")
return

cli_logger.print("Requesting webserver done, will stop soon")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe this way?

Suggested change
cli_logger.print("Requesting webserver done, will stop soon")
cli_logger.print("Webserver shutdown request done, will stop soon")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a missing word, but I'm keeping symmetric logs text with line Requesting webserver shutdown... to keep logs more readable.

def _is_webserver_running(self) -> bool:
try:
response = self._session.get(
str(self._base_url / URL_HEALTH_CHECK.lstrip("/")), timeout=2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd define this timeout also as some constant on a module level...

(or, we could have them as parameters provided to the class's constructor)

@@ -14,6 +14,12 @@ class NodeState(Enum):
stopping = "stopping"


class ShutdownState(Enum):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe not now, but if, in the future, we want to support various transitions of the server/cluster state, this might benefit from being a state machine...

Comment on lines 114 to 115
app = create_application(args.port, args.self_shutdown)
web.run_app(app, port=args.port, print=None)
Copy link
Contributor

@shadeofblue shadeofblue Sep 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, if we're already using the app's setitem/getitem interface to store arbitrary additional properties, we could store the port there as well ... and then, there would be just one place where we define the server's port...

Suggested change
app = create_application(args.port, args.self_shutdown)
web.run_app(app, port=args.port, print=None)
app = create_application(args.port, args.self_shutdown)
web.run_app(app, port=app["port"], print=None)

plus, (not now, just a thought) -> if we're allowing the port to be chosen, maybe we should also allow override of the host argument?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, good point, added.

But regarding host - I've never changed this from 0.0.0.0 to anything else in my whole career. Even more - every time where host was not 0.0.0.0 it causes some problems. Having host in that way is enough for my taste.

from ray_on_golem.exceptions import RayOnGolemError
from ray_on_golem.server.settings import YAGNA_APPKEY
from ray_on_golem.utils import run_subprocess

logger = logging.getLogger(__name__)

YAGNA_APPNAME = "ray-on-golem"
YAGNA_API_URL = URL("http://127.0.0.1:7465")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't we want to allow this to be overridden? by an environment variable perhaps?


async def _stop_yagna_service(self):
if self._yagna_process is None:
logger.info("No need to stop Yagna service, as it was ran externally")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
logger.info("No need to stop Yagna service, as it was ran externally")
logger.info("No need to stop Yagna service, as it was started externally")

shutdown_state = ShutdownState.WILL_SHUTDOWN

if shutdown_state == ShutdownState.WILL_SHUTDOWN:
logger.info("Received self shutdown request, exiting in 10 seconds...")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again, a twice-hardcoded magic number...

Copy link
Contributor

@shadeofblue shadeofblue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added my comments

Copy link
Contributor

@shadeofblue shadeofblue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe it looks good now

@approxit approxit merged commit ed39eb0 into main Sep 20, 2023
@approxit approxit deleted the approxit/webserver-and-yagna-autostart branch September 20, 2023 14:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants