Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update comparison logic for worker state #3321

Merged
merged 4 commits into from
Feb 11, 2020
Merged

Conversation

rockwellw
Copy link
Contributor

@rockwellw rockwellw commented Dec 13, 2019

Addresses #3256 , based on @TomAugspurger 's comment.

@TomAugspurger
Copy link
Member

Thanks. Do you think this will actually fix #3256? This being the cause was a shot in the dark.

And do you know if / how WorkerState.__eq__ is defined? I don't see one at a glance. Does implementing __slots__ give us one?

@rockwellw
Copy link
Contributor Author

This may just address a symptom and not the root problem, but it seems like a good starting point considering the log messages @bnaul was seeing.

@TomAugspurger
Copy link
Member

Ok. Do you know how __eq__ is implemented ok WorkerState?

@TomAugspurger
Copy link
Member

TomAugspurger commented Dec 16, 2019

I think WorkerState.__eq__ may actually be using identity. I would have expected these to be equal, if we were comparing attributes

In [15]: from distributed.scheduler import WorkerState

In [16]: a = WorkerState()

In [17]: b = WorkerState()

In [18]: a == b
Out[18]: False

This may not be a problem if WorkerState instances are expected to be singletons per worker, which I suspect they are. But it does mean that this likely isn't fixing the underlying issue.

@StephanErb
Copy link
Contributor

There is another occurrence of the same identity comparison here

if ts.processing_on is not ws:
To reduce surprises, I guess both should be fixed.

While I have no proof, I could totally imagine that race conditions between add_worker and remove_worker break the single instance assumption.

@rockwellw rockwellw changed the title Check equality, not identity Update comparison logic for worker state Jan 27, 2020
@bnaul
Copy link
Contributor

bnaul commented Jan 27, 2020

I think WorkerState.__eq__ may actually be using identity. I would have expected these to be equal, if we were comparing attributes

@TomAugspurger it looks like identity() has the info that we would want to compare against, do you think something like this might do the trick?

class WorkerState():
    def __eq__(self, value):
        if hasattr(value, 'identity'):
            return self.identity() == value.identity()
        else:
            return False

@mrocklin
Copy link
Member

Rather than conditionals I usually do something like ...

return type(self) == type(other) and self.identity() == other.identity()

Also, if we're going to implement __eq__ then we'll probably also have to implement __hash__ (should be more or less the same)

@bnaul
Copy link
Contributor

bnaul commented Jan 27, 2020

Rather than conditionals I usually do something like ...

return type(self) == type(other) and self.identity() == other.identity()

Probably not at all important for this case but doesn't this pattern kind of break subclassing? How about isinstance(other, type(self))?

Also, if we're going to implement __eq__ then we'll probably also have to implement __hash__ (should be more or less the same)

Do we currently rely on WorkerStates being hashable? It seems to me like they're mutable so wouldn't we want hash to be None?

@mrocklin
Copy link
Member

If they're of different classes then can they be equal? (but yeah, in truth it doesn't matter)

Fair enough on hashability (assuming that it doesn't come up, which it may)

@bnaul
Copy link
Contributor

bnaul commented Feb 10, 2020

@mrocklin you were right that hashability is needed so we added that as well and changed __eq__ to rely on __hash__ instead. it seems to me like just name+host is enough to uniquely specify the WorkerState but if you think we should add more attributes we can do that.

@TomAugspurger anecdotally we haven't seen #3256 since we started running with this change, but I wouldn't say we've conclusively proven it's resolved since unfortunately that's a random unpredictable failure. regardless this seems like the "right" check to be performing and should help w/ possible race conditions as pointed out by @StephanErb, so my vote would be to merge this once everyone's happy with it and close #3256 until someone observes that same behavior again.

@TomAugspurger
Copy link
Member

Has anyone seen this CI failure before?

=================================== FAILURES ===================================

____________________________ test_pickle_functions _____________________________

    def test_pickle_functions():

        def make_closure():

            value = 1

    

            def f(x):  # closure

                return x + value

    

            return f

    

        def funcs():

            yield make_closure()

            yield (lambda x: x + 1)

            yield partial(add, 1)

    

        for func in funcs():

            wr = weakref.ref(func)

            func2 = loads(dumps(func))

            wr2 = weakref.ref(func2)

            assert func2(1) == func(1)

            del func, func2

            gc.collect()

>           assert wr() is None

E           AssertionError: assert <function test_pickle_functions.<locals>.funcs.<locals>.<lambda> at 0x7f0ab6f535f0> is None

E            +  where <function test_pickle_functions.<locals>.funcs.<locals>.<lambda> at 0x7f0ab6f535f0> = <weakref at 0x7f0a8cc05350; to 'function' at 0x7f0ab6f535f0 (<lambda>)>()

distributed/protocol/tests/test_pickle.py:47: AssertionError`

I've restarted that build.

@mrocklin
Copy link
Member

mrocklin commented Feb 11, 2020 via email

@TomAugspurger
Copy link
Member

OK, thanks. We'll merge this and keep an eye out for that failure again.

Thanks @rockwellw!

@TomAugspurger TomAugspurger merged commit f561e96 into dask:master Feb 11, 2020
@rockwellw rockwellw deleted the patch-1 branch February 11, 2020 18:43
return hash((self.name, self.host))

def __eq__(self, other):
return type(self) == type(other) and hash(self) == hash(other)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for being late to the party. There is a bug in here: It is not guaranteed that two WorkerState objects are actually the same even though there hash value is the same. Hashing is not a bijective function.

I think we need to go the extra route here and check for (self.name, self.host) == (other.name, other.host)

Copy link
Contributor

@StephanErb StephanErb Feb 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TomAugspurger @bnaul please see above.

While I totally agree that hash collisions are very very unlikely, Dask has a massive install base with sometimes hundreds to thousands of workers per cluster. So as far as we know the once in a million event could happen next Tuesday. 🤷‍♂

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants