Requestor doesn't find available provider #427

Closed
johny-b opened this issue Jun 2, 2021 · 9 comments

@johny-b
Contributor

johny-b commented Jun 2, 2021

I demoed this on today's call. I will attach the logs.
Tell me if you need to run it on your own and I will prepare the code.

  1. I have a subnet with 4 providers
  2. I try to create 5 clusters
  3. 4 clusters are created, one is still waiting. To be exact, I'm waiting in this loop:
cluster = await golem.run_service(service_wrapper.service_cls)
while not cluster.instances:
    print(f"WAITING FOR cluster.instances for {service_wrapper}")
    await asyncio.sleep(1)
  4. I stop one of the clusters. It stops correctly (the logs show "Terminated agreement", "Accepted invoice", etc.).
  5. I expect the 5th cluster to start on the now-released provider, but it doesn't start.

I'm pretty sure that at least once I've seen it starting, so maybe it is not "doesn't start" but "it takes veeery long for it to start".
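
A bounded version of that wait would make it easier to tell "never starts" from "starts very late". This is a minimal sketch, assuming only that the cluster object exposes an `instances` list as in the loop above; `wait_for_instances` and the stand-in objects in `demo` are hypothetical helpers, not part of yapapi:

```python
import asyncio
from types import SimpleNamespace


async def wait_for_instances(cluster, timeout, poll_interval=1.0):
    """Poll `cluster.instances` until it is non-empty or `timeout` seconds pass.

    Returns True if instances appeared, False if the timeout expired first.
    """

    async def _poll():
        while not cluster.instances:
            await asyncio.sleep(poll_interval)

    try:
        await asyncio.wait_for(_poll(), timeout=timeout)
        return True
    except asyncio.TimeoutError:
        return False


async def demo():
    # Stand-ins for real clusters (assumption): any object with `instances` works.
    stuck = SimpleNamespace(instances=[])
    started = SimpleNamespace(instances=["instance-1"])
    return (
        await wait_for_instances(stuck, timeout=0.2, poll_interval=0.05),
        await wait_for_instances(started, timeout=0.2, poll_interval=0.05),
    )


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

With a real cluster, the timeout would be set to however long we are willing to call "very long", and a `False` result could be logged together with the elapsed time.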

@johny-b
Contributor Author

johny-b commented Jun 2, 2021

log.log

@azawlocki
Contributor

Hints from @johny-b for reproducing this issue: try to create a number of clusters (of any service) larger than the number of available providers

@mateuszsrebrny
Contributor

@johny-b @azawlocki should this be blocking the beta.2 release?
Please add a timeframe label when you find out / decide :)

@johny-b
Contributor Author

johny-b commented Jun 10, 2021

@mateuszsrebrny
Afaik erigon is delayed, so if the SDK is considering a patch release within the next week, then definitely not.
Either way, we could release erigon without this fix, but it is something that (imho) will have to be addressed in our tutorial/docs.

@johny-b
Contributor Author

johny-b commented Jun 10, 2021

@azawlocki
I have the same results with

  • single provider
  • two clusters
  • erigon-agnostic code (Service based on a blender image)

@johny-b
Contributor Author

johny-b commented Jun 11, 2021

From my POV issue #461 is quite harmless now because of this issue.

If this issue is fixed, and #461 is not, we could

  1. create more clusters than providers
  2. stop the pending clusters & forget about them
  3. stop some of the working clusters and get a lot of instances on the stopped & forgotten clusters

@azawlocki
Contributor

@johny-b I've created a test requestor script and ran it on a two-provider goth network.

The script tries to create three clusters, each with a single instance. 10s after two clusters are started, one cluster is stopped.

With yagna-0.7.0, the third cluster is never started after that -- that's the issue you describe.

cluster 1: running on provider-2; cluster 2: running on provider-1; cluster 3: no instances; 
cluster 1: running on provider-2; cluster 2: running on provider-1; cluster 3: no instances; 
Stopping one cluster...
shutting down...
cluster 1: stopping on provider-2; cluster 2: running on provider-1; cluster 3: no instances; 
[2021-06-18T12:19:31.118+0200 INFO yapapi.services] <SimpleService: 97daf0bff19f4f01acb56ef595beeb81> decomissioned
[2021-06-18T12:19:31.121+0200 INFO yapapi.summary] Task finished by provider 'provider-2', task data: Service: SimpleService
[2021-06-18T12:19:31.140+0200 INFO yapapi.summary] Terminated agreement with provider-2
[2021-06-18T12:19:31.257+0200 INFO yapapi.summary] Received proposals from 2 providers so far
cluster 1: terminated on provider-2; cluster 2: running on provider-1; cluster 3: no instances; 
cluster 1: terminated on provider-2; cluster 2: running on provider-1; cluster 3: no instances; 
[2021-06-18T12:19:33.220+0200 INFO yapapi.summary] Accepted invoice from 'provider-2', amount: 0.003685869118333334
cluster 1: terminated on provider-2; cluster 2: running on provider-1; cluster 3: no instances; 
cluster 1: terminated on provider-2; cluster 2: running on provider-1; cluster 3: no instances; 
cluster 1: terminated on provider-2; cluster 2: running on provider-1; cluster 3: no instances; 
... # and so on, and so on ...

With yagna-0.6.7-beta.1, the third cluster is started within seconds after one of the clusters previously running is stopped.

cluster 1: running on provider-2; cluster 2: no instances; cluster 3: running on provider-1; 
Stopping one cluster...
shutting down...
cluster 1: stopping on provider-2; cluster 2: no instances; cluster 3: running on provider-1; 
[2021-06-18T12:37:04.662+0200 INFO yapapi.services] <SimpleService: 88b52acc3d27498fbab1aea13253dcbf> decomissioned
[2021-06-18T12:37:04.664+0200 INFO yapapi.summary] Task finished by provider 'provider-2', task data: Service: SimpleService
[2021-06-18T12:37:04.700+0200 INFO yapapi.summary] Terminated agreement with provider-2
cluster 1: terminated on provider-2; cluster 2: no instances; cluster 3: running on provider-1; 
[2021-06-18T12:37:05.689+0200 INFO yapapi.summary] Accepted invoice from 'provider-2', amount: 0.005001666844166666
cluster 1: terminated on provider-2; cluster 2: no instances; cluster 3: running on provider-1; 
cluster 1: terminated on provider-2; cluster 2: no instances; cluster 3: running on provider-1; 
[2021-06-18T12:37:08.317+0200 INFO yapapi.summary] Received proposals from 2 providers so far
[2021-06-18T12:37:08.375+0200 INFO yapapi.summary] Received proposals from 2 providers so far
[2021-06-18T12:37:08.381+0200 INFO yapapi.summary] Received proposals from 2 providers so far
cluster 1: terminated on provider-2; cluster 2: no instances; cluster 3: running on provider-1; 
[2021-06-18T12:37:08.714+0200 INFO yapapi.summary] Agreement proposed to provider 'provider-1'
...
cluster 1: terminated on provider-2; cluster 2: no instances; cluster 3: running on provider-1; 
[2021-06-18T12:37:24.735+0200 INFO yapapi.summary] Agreement proposed to provider 'provider-2'
[2021-06-18T12:37:24.832+0200 INFO yapapi.summary] Agreement confirmed by provider 'provider-2'
[2021-06-18T12:37:25.617+0200 INFO yapapi.services] <SimpleService: bc8715e7ff5c4fee8f34582835d33fd8> commissioned
starting...
[2021-06-18T12:37:25.618+0200 INFO yapapi.summary] Task started on provider 'provider-2', task data: Service: SimpleService
cluster 1: terminated on provider-2; cluster 2: starting on provider-2; cluster 3: running on provider-1; 

The difference is that with yagna-0.6.7-beta.1, after one cluster is stopped and its agreement is terminated, yapapi finds a new proposal from the provider that ran the stopped cluster and can sign a new agreement.

With yagna-0.7.0, yapapi does not find any proposals after the first agreement is terminated.
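
To check a longer log for the same pattern, something like the following could help. It is only a sketch: it assumes the `[timestamp LEVEL logger] message` line format seen in the excerpts above, and the function names are made up for illustration. It collects the messages logged after the first "Terminated agreement" line and reports whether a new agreement was proposed afterwards.

```python
import re

# Matches lines like:
# [2021-06-18T12:19:31.140+0200 INFO yapapi.summary] Terminated agreement with provider-2
EVENT_RE = re.compile(r"^\[(?P<ts>\S+) (?P<level>\w+) (?P<logger>[\w.]+)\] (?P<msg>.*)$")


def events_after_termination(lines):
    """Return the messages logged after the first 'Terminated agreement' line."""
    seen_termination = False
    after = []
    for line in lines:
        match = EVENT_RE.match(line.strip())
        if not match:
            continue  # skip the script's own status lines
        msg = match.group("msg")
        if seen_termination:
            after.append(msg)
        elif msg.startswith("Terminated agreement"):
            seen_termination = True
    return after


def new_agreement_proposed(lines):
    """True if an agreement was proposed after the first termination."""
    return any(
        msg.startswith("Agreement proposed")
        for msg in events_after_termination(lines)
    )


if __name__ == "__main__":
    sample = [
        "[2021-06-18T12:37:04.700+0200 INFO yapapi.summary] Terminated agreement with provider-2",
        "[2021-06-18T12:37:08.714+0200 INFO yapapi.summary] Agreement proposed to provider 'provider-1'",
    ]
    print(new_agreement_proposed(sample))
```

Note that the yagna-0.7.0 excerpt still shows a "Received proposals from 2 providers so far" line after the termination, so the sketch keys on "Agreement proposed" rather than on proposal counts.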

Perhaps this is something the core team should look at? @tworec, what do you think?


Here's the test script:

#!/usr/bin/env python3
import asyncio
from datetime import datetime, timedelta

from yapapi import Golem, windows_event_loop_fix
from yapapi.services import Service
from yapapi.log import enable_default_logger, log_summary, log_event_repr, pluralize
from yapapi.payload import vm

col_green = "\033[32;1m"
col_cyan = "\033[36;1m"
col_yellow = "\033[33;1m"
col_magenta = "\033[35;1m"
col_default = "\033[0m"


def cluster_color(cluster_num):
    return [col_green, col_cyan, col_yellow][cluster_num % 3]


class SimpleService(Service):

    next_num: int = 1

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.num = SimpleService.next_num
        self.color = cluster_color(self.num)
        SimpleService.next_num += 1

    @staticmethod
    async def get_payload():
        return await vm.repo(
            image_hash="8b11df59f84358d47fc6776d0bb7290b0054c15ded2d6f54cf634488",
            min_mem_gib=0.5,
            min_storage_gib=2.0,
        )

    async def start(self):
        print(f"{self.color}starting...{col_default}")
        self._ctx.run("/bin/echo", "START")
        await (yield self._ctx.commit())

    async def run(self):
        print(f"{self.color}running...{col_default}")

        import itertools
        for n in itertools.count(1):
            await asyncio.sleep(3)
            self._ctx.run("/bin/echo", "RUN", str(n))
            await (yield self._ctx.commit())

    async def shutdown(self):
        print(f"{self.color}shutting down...{col_default}")
        await asyncio.sleep(1)
        self._ctx.run("/bin/echo", "SHUTDOWN")
        await (yield self._ctx.commit())


async def main(subnet_tag):

    async with Golem(budget=1.0, subnet_tag=subnet_tag) as golem:

        clusters = [
            await golem.run_service(SimpleService),
            await golem.run_service(SimpleService),
            await golem.run_service(SimpleService),
        ]

        two_clusters_started = None
        three_clusters_started = None
        one_cluster_stopped = False

        while True:
            await asyncio.sleep(1)
            status = ""
            for n, cluster in enumerate(clusters):
                color = cluster_color(n+1)
                status += f"{color}cluster {n+1}: "
                if cluster.instances:
                    status += ", ".join(
                        f"{s.state.value} on {s.provider_name}" for s in cluster.instances
                    )
                else:
                    status += "no instances"
                status += col_default + "; "
            print(status)

            have_instances = len([c for c in clusters if c.instances])
            if have_instances == 2 and not two_clusters_started:
                two_clusters_started = datetime.now()
            elif have_instances == 3 and not three_clusters_started:
                three_clusters_started = datetime.now()

            if (
                two_clusters_started and not one_cluster_stopped and
                datetime.now() - two_clusters_started > timedelta(seconds=10)
            ):
                print("Stopping one cluster...")
                [c for c in clusters if c.instances][0].stop()
                one_cluster_stopped = True

            if (
                three_clusters_started and
                datetime.now() - three_clusters_started > timedelta(seconds=10)
            ):
                break

        print("Stopping all clusters...")
        for c in clusters:
            c.stop()


if __name__ == "__main__":
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--subnet", default="devnet-beta.2", help="Subnet name; default: %(default)s"
    )
    args = parser.parse_args()

    now = datetime.now().strftime("%Y-%m-%d_%H.%M.%S")
    log_file = f"simple-service-yapapi-{now}.log"

    # This is only required when running on Windows with Python prior to 3.8:
    windows_event_loop_fix()

    # Write debug-level logs for all APIs to the timestamped file defined above
    enable_default_logger(
        log_file=log_file,
        debug_activity_api=True,
        debug_market_api=True,
        debug_payment_api=True,
    )

    loop = asyncio.get_event_loop()
    task = loop.create_task(main(subnet_tag=args.subnet))

    try:
        loop.run_until_complete(task)
    except KeyboardInterrupt:
        print(
            f"{col_yellow}"
            "Shutting down gracefully, please wait a short while "
            "or press Ctrl+C to exit immediately..."
            f"{col_default}"
        )
        task.cancel()
        try:
            loop.run_until_complete(task)
            print(
                f"{col_yellow}Shutdown completed, thank you for waiting!{col_default}"
            )
        except (asyncio.CancelledError, KeyboardInterrupt):
            pass

@johny-b
Contributor Author

johny-b commented Jun 30, 2021

@azawlocki
This is already solved - I think with the latest yagna?

@mateuszsrebrny
Contributor

ok, closing then 👯
