test_statistical_profiling flaky #4974

Open
jrbourbeau opened this issue Jun 24, 2021 · 0 comments
Labels: flaky test (Intermittent failures on CI.)

Comments

@jrbourbeau (Member)

Seen over in #4971 in this CI build

Details:
```
================================== FAILURES ===================================
_________________________ test_statistical_profiling __________________________

outer_args = (), kwargs = {}, result = None
coro = <function gen_cluster.<locals>._.<locals>.test_func.<locals>.coro at 0x000002382A9FA040>

    def test_func(*outer_args, **kwargs):
        result = None
        workers = []
        with clean(timeout=active_rpc_timeout, **clean_kwargs) as loop:

            async def coro():
                with dask.config.set(config):
                    s = False
                    for _ in range(60):
                        try:
                            s, ws = await start_cluster(
                                nthreads,
                                scheduler,
                                loop,
                                security=security,
                                Worker=Worker,
                                scheduler_kwargs=scheduler_kwargs,
                                worker_kwargs=worker_kwargs,
                            )
                        except Exception as e:
                            logger.error(
                                "Failed to start gen_cluster: "
                                f"{e.__class__.__name__}: {e}; retrying",
                                exc_info=True,
                            )
                            await asyncio.sleep(1)
                        else:
                            workers[:] = ws
                            args = [s] + workers
                            break
                    if s is False:
                        raise Exception("Could not start cluster")
                    if client:
                        c = await Client(
                            s.address,
                            loop=loop,
                            security=security,
                            asynchronous=True,
                            **client_kwargs,
                        )
                        args = [c] + args
                    try:
                        future = func(*args, *outer_args, **kwargs)
                        if timeout:
                            future = asyncio.wait_for(future, timeout)
                        result = await future
                        if s.validate:
                            s.validate_state()
                    finally:
                        if client and c.status not in ("closing", "closed"):
                            await c._close(fast=s.status == Status.closed)
                        await end_cluster(s, workers)
                        await asyncio.wait_for(cleanup_global_workers(), 1)

                    try:
                        c = await default_client()
                    except ValueError:
                        pass
                    else:
                        await c._close(fast=True)

                    def get_unclosed():
                        return [c for c in Comm._instances if not c.closed()] + [
                            c
                            for c in _global_clients.values()
                            if c.status != "closed"
                        ]

                    try:
                        start = time()
                        while time() < start + 60:
                            gc.collect()
                            if not get_unclosed():
                                break
                            await asyncio.sleep(0.05)
                        else:
                            if allow_unclosed:
                                print(f"Unclosed Comms: {get_unclosed()}")
                            else:
                                raise RuntimeError("Unclosed Comms", get_unclosed())
                    finally:
                        Comm._instances.clear()
                        _global_clients.clear()

                    return result

>           result = loop.run_sync(
                coro, timeout=timeout * 2 if timeout else timeout
            )

distributed\utils_test.py:976:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
C:\Miniconda3\envs\dask-distributed\lib\site-packages\tornado\ioloop.py:530: in run_sync
    return future_cell[0].result()
distributed\utils_test.py:935: in coro
    result = await future
C:\Miniconda3\envs\dask-distributed\lib\asyncio\tasks.py:494: in wait_for
    return fut.result()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

c = <Client: No scheduler connected>
s = <Scheduler: "tcp://127.0.0.1:56127" workers: 0 cores: 0, tasks: 0>
a = <Worker: 'tcp://127.0.0.1:56128', 0, Status.closed, stored: 0, running: 0/1, ready: 0, comm: 0, waiting: 0>
b = <Worker: 'tcp://127.0.0.1:56130', 1, Status.closed, stored: 0, running: 0/2, ready: 0, comm: 0, waiting: 0>

    @pytest.mark.flaky(reruns=10, reruns_delay=5, condition=MACOS)
    @gen_cluster(client=True)
    async def test_statistical_profiling(c, s, a, b):
        futures = c.map(slowinc, range(10), delay=0.1)
        await wait(futures)

        profile = a.profile_keys["slowinc"]
>       assert profile["count"]
E       assert 0

distributed\tests\test_worker.py:1079: AssertionError
---------------------------- Captured stderr call -----------------------------
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO -   Scheduler at:     tcp://127.0.0.1:56127
distributed.scheduler - INFO -   dashboard at:           127.0.0.1:56126
distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:56128
distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:56128
distributed.worker - INFO -          dashboard at:            127.0.0.1:56129
distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:56127
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -                Memory:                   7.00 GiB
distributed.worker - INFO -       Local Directory: D:\a\distributed\distributed\dask-worker-space\worker-5u5l3qhr
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:56130
distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:56130
distributed.worker - INFO -          dashboard at:            127.0.0.1:56131
distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:56127
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          2
distributed.worker - INFO -                Memory:                   7.00 GiB
distributed.worker - INFO -       Local Directory: D:\a\distributed\distributed\dask-worker-space\worker-cit4w7h1
distributed.worker - INFO - -------------------------------------------------
distributed.scheduler - INFO - Register worker <WorkerState 'tcp://127.0.0.1:56128', name: 0, memory: 0, processing: 0>
distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:56128
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Register worker <WorkerState 'tcp://127.0.0.1:56130', name: 1, memory: 0, processing: 0>
distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:56130
distributed.core - INFO - Starting established connection
distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:56127
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:56127
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Receive client connection: Client-fc8681fe-d536-11eb-8624-000d3ae50775
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Remove client Client-fc8681fe-d536-11eb-8624-000d3ae50775
distributed.scheduler - INFO - Remove client Client-fc8681fe-d536-11eb-8624-000d3ae50775
distributed.scheduler - INFO - Close client connection: Client-fc8681fe-d536-11eb-8624-000d3ae50775
distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:56128
distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:56130
distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://127.0.0.1:56128', name: 0, memory: 0, processing: 0>
distributed.core - INFO - Removing comms to tcp://127.0.0.1:56128
distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://127.0.0.1:56130', name: 1, memory: 0, processing: 0>
distributed.core - INFO - Removing comms to tcp://127.0.0.1:56130
distributed.scheduler - INFO - Lost all workers
distributed.scheduler - INFO - Scheduler closing...
distributed.scheduler - INFO - Scheduler closing all comms
============================== warnings summary ===============================
```
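
For what it's worth, my read on the flakiness (a guess, not a confirmed root cause): the worker's statistical profiler samples thread stacks on a fixed interval, so if every `slowinc` task happens to finish between samples, `a.profile_keys["slowinc"]["count"]` can legitimately still be 0 when the assertion runs. Below is a minimal sketch of a hardened variant of the test, assuming the same helpers the failing test uses (`gen_cluster` and `slowinc` from `distributed.utils_test`, `wait` from `distributed`); the test name and the polling/deadline approach are mine, not the project's actual fix:

```python
from time import time

from distributed import wait
from distributed.utils_test import gen_cluster, slowinc


@gen_cluster(client=True)
async def test_statistical_profiling_hardened(c, s, a, b):
    # Hypothetical hardened variant: keep submitting short batches of slowinc
    # tasks until the sampling profiler records at least one hit for
    # "slowinc", instead of asserting right after a single batch.
    deadline = time() + 10  # stay under gen_cluster's own timeout
    i = 0
    while not a.profile_keys["slowinc"]["count"]:
        assert time() < deadline, "profiler never sampled slowinc"
        futures = c.map(slowinc, range(i, i + 10), delay=0.1)
        await wait(futures)
        i += 10  # fresh inputs each batch so the tasks actually re-run
```

The sampling cadence itself is configurable; if I am reading distributed's default config right (treat the exact key names as an assumption), `distributed.worker.profile.interval` (~10ms) controls how often stacks are sampled and `distributed.worker.profile.cycle` (~1s) how often samples are aggregated, so lowering the interval for this test would be another way to shrink the race window.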
jrbourbeau added the flaky test (Intermittent failures on CI.) label on Jun 24, 2021
fjetter mentioned this issue on Feb 16, 2022