Requestor doesn't find available provider #427

johny-b · 2021-06-02T14:02:41Z

I demoed this on today's call. I will attach the logs.
Tell me if you need to run it on your own and I will prepare the code.

I have a subnet with 4 providers
I try to create 5 clusters
4 clusters are created, one is still waiting. To be exact, I'm waiting in this loop:

cluster = await golem.run_service(service_wrapper.service_cls)
while not cluster.instances:
    print(f"WAITING FOR cluster.instances for {service_wrapper}")
    await asyncio.sleep(1)

I stop one of the clusters. It is correctly stopped (e.g. "Terminated agreement", "Accepted invoice") etc.
I expect the 5th cluster to start on the now-released provider, but it doesn't start.

I'm pretty sure that at least once I've seen it starting, so maybe it is not "doesn't start" but "it takes veeery long for it to start".
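
To tell "doesn't start" apart from "takes very long to start", the waiting loop above could be given a time limit and made to report how long it waited. This is only a minimal sketch: `wait_for_instances` is a hypothetical helper, not part of yapapi, and it assumes nothing about the cluster object beyond the `instances` attribute used in the loop above.

```python
import asyncio
import time


async def wait_for_instances(cluster, timeout: float, poll_interval: float = 1.0) -> float:
    """Poll `cluster.instances` until it is non-empty and return the seconds waited.

    Hypothetical helper (not a yapapi API). Raises asyncio.TimeoutError if no
    instance shows up within `timeout` seconds.
    """
    started = time.monotonic()

    async def _poll():
        # Same condition as the original loop, minus the print.
        while not cluster.instances:
            await asyncio.sleep(poll_interval)

    # wait_for cancels the polling coroutine once the timeout expires.
    await asyncio.wait_for(_poll(), timeout)
    return time.monotonic() - started
```

In the loop above this would be called as `await wait_for_instances(cluster, timeout=600)`, where the timeout value is arbitrary. A `TimeoutError` then marks the "never started" case, and a large return value marks the "started, but slowly" case.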


johny-b · 2021-06-02T14:03:45Z

log.log

azawlocki · 2021-06-08T11:14:44Z

Hints from @johny-b for reproducing this issue: try to create a number of clusters (of any service) larger than the number of available providers

mateuszsrebrny · 2021-06-10T12:12:11Z

@johny-b @azawlocki should this be blocking the beta.2 release?
Please add a timeframe label when you find out / decide :)

johny-b · 2021-06-10T12:18:28Z

@mateuszsrebrny
AFAIK erigon is delayed, so if the SDK team is considering a patch release within the next week, then definitely not.
Either way, we could release erigon without this fix - but this will be something that (imho) will have to be addressed in our tutorial/docs.

johny-b · 2021-06-10T15:03:53Z

@azawlocki
I have the same results with

single provider
two clusters
erigon-agnostic code (Service based on a blender image)

johny-b · 2021-06-11T13:41:57Z

From my POV issue #461 is quite harmless now because of this issue.

If this issue is fixed, and #461 is not, we could

create more clusters than providers
stop the pending clusters & forget about them
stop some of the working clusters and get a lot of instances on the stopped & forgotten clusters

azawlocki · 2021-06-18T10:51:59Z

@johny-b I've created a test requestor script and run it on a two-provider goth network.

The script tries to create three clusters, each with a single instance. 10s after two clusters are started, one cluster is stopped.

With yagna-0.7.0, the third cluster is never started after that -- that's the issue you describe.

cluster 1: running on provider-2; cluster 2: running on provider-1; cluster 3: no instances; 
cluster 1: running on provider-2; cluster 2: running on provider-1; cluster 3: no instances; 
Stopping one cluster...
shutting down...
cluster 1: stopping on provider-2; cluster 2: running on provider-1; cluster 3: no instances; 
[2021-06-18T12:19:31.118+0200 INFO yapapi.services] <SimpleService: 97daf0bff19f4f01acb56ef595beeb81> decomissioned
[2021-06-18T12:19:31.121+0200 INFO yapapi.summary] Task finished by provider 'provider-2', task data: Service: SimpleService
[2021-06-18T12:19:31.140+0200 INFO yapapi.summary] Terminated agreement with provider-2
[2021-06-18T12:19:31.257+0200 INFO yapapi.summary] Received proposals from 2 providers so far
cluster 1: terminated on provider-2; cluster 2: running on provider-1; cluster 3: no instances; 
cluster 1: terminated on provider-2; cluster 2: running on provider-1; cluster 3: no instances; 
[2021-06-18T12:19:33.220+0200 INFO yapapi.summary] Accepted invoice from 'provider-2', amount: 0.003685869118333334
cluster 1: terminated on provider-2; cluster 2: running on provider-1; cluster 3: no instances; 
cluster 1: terminated on provider-2; cluster 2: running on provider-1; cluster 3: no instances; 
cluster 1: terminated on provider-2; cluster 2: running on provider-1; cluster 3: no instances; 
... # and so on, and so on ...

With yagna-0.6.7-beta.1, the third cluster is started within seconds after one of the clusters previously running is stopped.

cluster 1: running on provider-2; cluster 2: no instances; cluster 3: running on provider-1; 
Stopping one cluster...
shutting down...
cluster 1: stopping on provider-2; cluster 2: no instances; cluster 3: running on provider-1; 
[2021-06-18T12:37:04.662+0200 INFO yapapi.services] <SimpleService: 88b52acc3d27498fbab1aea13253dcbf> decomissioned
[2021-06-18T12:37:04.664+0200 INFO yapapi.summary] Task finished by provider 'provider-2', task data: Service: SimpleService
[2021-06-18T12:37:04.700+0200 INFO yapapi.summary] Terminated agreement with provider-2
cluster 1: terminated on provider-2; cluster 2: no instances; cluster 3: running on provider-1; 
[2021-06-18T12:37:05.689+0200 INFO yapapi.summary] Accepted invoice from 'provider-2', amount: 0.005001666844166666
cluster 1: terminated on provider-2; cluster 2: no instances; cluster 3: running on provider-1; 
cluster 1: terminated on provider-2; cluster 2: no instances; cluster 3: running on provider-1; 
[2021-06-18T12:37:08.317+0200 INFO yapapi.summary] Received proposals from 2 providers so far
[2021-06-18T12:37:08.375+0200 INFO yapapi.summary] Received proposals from 2 providers so far
[2021-06-18T12:37:08.381+0200 INFO yapapi.summary] Received proposals from 2 providers so far
cluster 1: terminated on provider-2; cluster 2: no instances; cluster 3: running on provider-1; 
[2021-06-18T12:37:08.714+0200 INFO yapapi.summary] Agreement proposed to provider 'provider-1'
...
cluster 1: terminated on provider-2; cluster 2: no instances; cluster 3: running on provider-1; 
[2021-06-18T12:37:24.735+0200 INFO yapapi.summary] Agreement proposed to provider 'provider-2'
[2021-06-18T12:37:24.832+0200 INFO yapapi.summary] Agreement confirmed by provider 'provider-2'
[2021-06-18T12:37:25.617+0200 INFO yapapi.services] <SimpleService: bc8715e7ff5c4fee8f34582835d33fd8> commissioned
starting...
[2021-06-18T12:37:25.618+0200 INFO yapapi.summary] Task started on provider 'provider-2', task data: Service: SimpleService
cluster 1: terminated on provider-2; cluster 2: starting on provider-2; cluster 3: running on provider-1;

The difference is that with yagna-0.6.7-beta.1, after one cluster is stopped and its agreement is terminated, yapapi finds a new proposal from the provider that ran the stopped cluster and can sign a new agreement.

With yagna-0.7.0, yapapi does not find any proposals after the first agreement is terminated.

Perhaps this is something the core team should look at? @tworec, what do you think?

Here's the test script:

#!/usr/bin/env python3
import asyncio
from datetime import datetime, timedelta

from yapapi import windows_event_loop_fix
from yapapi import Golem
from yapapi.services import Service
from yapapi.log import enable_default_logger
from yapapi.payload import vm

col_green = "\033[32;1m"
col_cyan = "\033[36;1m"
col_yellow = "\033[33;1m"
col_magenta = "\033[35;1m"
col_default = "\033[0m"


def cluster_color(cluster_num):
    return [col_green, col_cyan, col_yellow][cluster_num % 3]


class SimpleService(Service):

    next_num: int = 1

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.num = SimpleService.next_num
        self.color = cluster_color(self.num)
        SimpleService.next_num += 1

    @staticmethod
    async def get_payload():
        return await vm.repo(
            image_hash="8b11df59f84358d47fc6776d0bb7290b0054c15ded2d6f54cf634488",
            min_mem_gib=0.5,
            min_storage_gib=2.0,
        )

    async def start(self):
        print(f"{self.color}starting...{col_default}")
        self._ctx.run("/bin/echo", "START")
        await (yield self._ctx.commit())

    async def run(self):
        print(f"{self.color}running...{col_default}")

        import itertools
        for n in itertools.count(1):
            await asyncio.sleep(3)
            self._ctx.run("/bin/echo", "RUN", str(n))
            await (yield self._ctx.commit())

    async def shutdown(self):
        print(f"{self.color}shutting down...{col_default}")
        await asyncio.sleep(1)
        self._ctx.run("/bin/echo", "SHUTDOWN")
        await (yield self._ctx.commit())


async def main(subnet_tag):

    async with Golem(budget=1.0, subnet_tag=subnet_tag) as golem:

        clusters = [
            await golem.run_service(SimpleService),
            await golem.run_service(SimpleService),
            await golem.run_service(SimpleService),
        ]

        two_clusters_started = None
        three_clusters_started = None
        one_cluster_stopped = False

        while True:
            await asyncio.sleep(1)
            status = ""
            for n, cluster in enumerate(clusters):
                color = cluster_color(n+1)
                status += f"{color}cluster {n+1}: "
                if cluster.instances:
                    status += ", ".join(
                        f"{s.state.value} on {s.provider_name}" for s in cluster.instances
                    )
                else:
                    status += "no instances"
                status += col_default + "; "
            print(status)

            have_instances = len([c for c in clusters if c.instances])
            if have_instances == 2 and not two_clusters_started:
                two_clusters_started = datetime.now()
            elif have_instances == 3 and not three_clusters_started:
                three_clusters_started = datetime.now()

            if (
                two_clusters_started and not one_cluster_stopped and
                datetime.now() - two_clusters_started > timedelta(seconds=10)
            ):
                print("Stopping one cluster...")
                [c for c in clusters if c.instances][0].stop()
                one_cluster_stopped = True

            if (
                three_clusters_started and
                datetime.now() - three_clusters_started > timedelta(seconds=10)
            ):
                break

        print("Stopping all clusters...")
        for c in clusters:
            c.stop()


if __name__ == "__main__":
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--subnet", default="devnet-beta.2", help="Subnet name; default: %(default)s"
    )
    args = parser.parse_args()

    now = datetime.now().strftime("%Y-%m-%d_%H.%M.%S")
    log_file = f"simple-service-yapapi-{now}.log"

    # This is only required when running on Windows with Python prior to 3.8:
    windows_event_loop_fix()

    enable_default_logger(
        log_file=log_file,
        debug_activity_api=True,
        debug_market_api=True,
        debug_payment_api=True,
    )

    loop = asyncio.get_event_loop()
    task = loop.create_task(main(subnet_tag=args.subnet))

    try:
        loop.run_until_complete(task)
    except KeyboardInterrupt:
        print(
            f"{col_yellow}"
            "Shutting down gracefully, please wait a short while "
            "or press Ctrl+C to exit immediately..."
            f"{col_default}"
        )
        task.cancel()
        try:
            loop.run_until_complete(task)
            print(
                f"{col_yellow}Shutdown completed, thank you for waiting!{col_default}"
            )
        except (asyncio.CancelledError, KeyboardInterrupt):
            pass
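
The two runs above were compared by reading the status lines by eye. As a rough sketch, the same check could be automated by parsing the `cluster N: ...;` lines that the script prints. The helper names `parse_status` and `replacement_started` are hypothetical. The code assumes only the status-line format shown in the output above and ignores the yapapi log lines interleaved with it.

```python
import re

# ANSI color codes emitted by the test script (col_green, col_default, ...).
ANSI_RE = re.compile(r"\x1b\[[0-9;]*m")
# One "cluster N: <status>;" entry within a status line.
STATUS_RE = re.compile(r"cluster (\d+): ([^;]+);")


def parse_status(line):
    """Return {cluster_number: status_text}; empty dict for non-status lines."""
    plain = ANSI_RE.sub("", line)
    return {int(num): text.strip() for num, text in STATUS_RE.findall(plain)}


def replacement_started(lines):
    """True if a cluster that had no instances gets one after another cluster terminated."""
    terminated_seen = False
    pending = set()  # clusters last seen with "no instances"
    for line in lines:
        for num, text in parse_status(line).items():
            if text == "no instances":
                pending.add(num)
            elif text.startswith("terminated"):
                terminated_seen = True
            elif num in pending:
                if terminated_seen:
                    return True
                # Got an instance before anything terminated: no longer pending.
                pending.discard(num)
    return False
```

Fed the captured output of a run, for example `replacement_started(open("run.txt"))` with a hypothetical file name, this returns `False` for output like the yagna-0.7.0 excerpt and `True` for output like the yagna-0.6.7-beta.1 excerpt.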

johny-b · 2021-06-30T07:33:35Z

@azawlocki
This is already solved - I think with the latest yagna?

mateuszsrebrny · 2021-06-30T11:54:19Z

ok, closing then 👯

johny-b added the apps-dispatch label Jun 7, 2021

azawlocki self-assigned this Jun 8, 2021

This was referenced Jun 8, 2021

unexpected 500 in console logs #439

Closed

pending node didn't start being started even when a provider became available golemfactory/yagna-service-erigon#16

Closed

mateuszsrebrny added the beta.2.patch.1 label Jun 15, 2021

johny-b mentioned this issue Jun 18, 2021

Extract code from yagna-service-erigon, add docs & examples golemfactory/yapapi-service-manager#1

Merged

mateuszsrebrny added beta.2.patch.2 and removed beta.2.patch.1 labels Jun 23, 2021

mateuszsrebrny closed this as completed Jun 30, 2021

mateuszsrebrny removed the beta.2.patch.2 label Jun 30, 2021
