Update comparison logic for worker state #3321
Conversation
Thanks. Do you think this will actually fix #3256? This being the cause was a shot in the dark. And do you know if / how …
This may just address a symptom and not the root problem, but it seems like a good starting point considering the log messages @bnaul was seeing.
Ok. Do you know how …
I think:

```python
In [15]: from distributed.scheduler import WorkerState

In [16]: a = WorkerState()

In [17]: b = WorkerState()

In [18]: a == b
Out[18]: False
```

This may not be a problem if WorkerState instances are expected to be singletons per worker, which I suspect they are. But it does mean that this likely isn't fixing the underlying issue.
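To illustrate the point: Python's default `__eq__` falls back to identity, so two attribute-identical objects compare unequal unless the class defines value-based equality. A minimal sketch with hypothetical stand-in classes (not the real WorkerState):

```python
# Hypothetical stand-ins for WorkerState; illustrate default vs. value equality.
class PlainState:
    def __init__(self, name, host):
        self.name = name
        self.host = host


class ComparableState(PlainState):
    def __eq__(self, other):
        # Value-based equality: compare the identifying attributes.
        if not isinstance(other, ComparableState):
            return NotImplemented
        return (self.name, self.host) == (other.name, other.host)

    def __hash__(self):
        return hash((self.name, self.host))


a, b = PlainState("w1", "10.0.0.1"), PlainState("w1", "10.0.0.1")
print(a == b)  # False: object.__eq__ compares identity (a is b)

c, d = ComparableState("w1", "10.0.0.1"), ComparableState("w1", "10.0.0.1")
print(c == d)  # True: equality is now based on the attributes
```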
There is another occurrence of the same identity comparison here: `distributed/scheduler.py`, line 2560 in d747f63.
While I have no proof, I could totally imagine that race conditions between …
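A hypothetical sketch of how such a race could surface (illustrative names, not the scheduler's actual bookkeeping): if a worker reconnects and a fresh state object is created for the same address, an identity check against a stale reference fails even though both objects describe the same worker.

```python
# FakeWorkerState is an illustrative stand-in, not distributed's WorkerState.
class FakeWorkerState:
    def __init__(self, name, host):
        self.name, self.host = name, host


workers = {}
old = FakeWorkerState("w1", "10.0.0.1")
workers["10.0.0.1"] = old

# The worker reconnects; a fresh state object replaces the dict entry while
# an in-flight operation still holds the stale reference.
workers["10.0.0.1"] = FakeWorkerState("w1", "10.0.0.1")

print(workers["10.0.0.1"] is old)  # False: identity says "different worker"
print((workers["10.0.0.1"].name, workers["10.0.0.1"].host)
      == (old.name, old.host))     # True: by value it is the same worker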
@TomAugspurger it looks like …
Rather than conditionals I usually do something like ...
Also, if we're going to implement …
Probably not at all important for this case, but doesn't this pattern kind of break subclassing? How about …
Do we currently rely on …
If they're of different classes, then can they be equal? (but yeah, in truth it doesn't matter) Fair enough on hashability (assuming that it doesn't come up, which it may)
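The trade-off discussed above can be made concrete. A small sketch (illustrative names) contrasting the strict `type(self) == type(other)` guard with an `isinstance`-based one, and showing the `__eq__`/`__hash__` contract: defining `__eq__` without `__hash__` makes a class unhashable in Python 3.

```python
# Illustrative sketch of the two equality-guard styles discussed above.
class Base:
    def __init__(self, key):
        self.key = key

    def __eq__(self, other):
        # Strict guard: a subclass instance never equals a Base instance.
        # An isinstance(other, Base) guard would allow it instead.
        return type(self) == type(other) and self.key == other.key

    def __hash__(self):
        # Required alongside __eq__: defining __eq__ alone sets __hash__
        # to None in Python 3, making instances unhashable.
        return hash(self.key)


class Sub(Base):
    pass


print(Base("x") == Base("x"))  # True: same type, same key
print(Base("x") == Sub("x"))   # False: the type check rejects the subclass
print({Base("x"), Base("x")})  # hashable, so usable in sets and as dict keys
```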
Define `__eq__` and `__hash__` for WorkerState
@mrocklin you were right that hashability is needed, so we added that as well and changed …

@TomAugspurger anecdotally we haven't seen #3256 since we started running with this change, but I wouldn't say we've conclusively proven it's resolved, since unfortunately that's a random, unpredictable failure. Regardless, this seems like the "right" check to be performing and should help with possible race conditions as pointed out by @StephanErb, so my vote would be to merge this once everyone's happy with it and close #3256 until someone observes that same behavior again.
Has anyone seen this CI failure before?

```
=================================== FAILURES ===================================
____________________________ test_pickle_functions _____________________________

    def test_pickle_functions():
        def make_closure():
            value = 1

            def f(x):  # closure
                return x + value

            return f

        def funcs():
            yield make_closure()
            yield (lambda x: x + 1)
            yield partial(add, 1)

        for func in funcs():
            wr = weakref.ref(func)
            func2 = loads(dumps(func))
            wr2 = weakref.ref(func2)
            assert func2(1) == func(1)
            del func, func2
            gc.collect()
>           assert wr() is None
E           AssertionError: assert <function test_pickle_functions.<locals>.funcs.<locals>.<lambda> at 0x7f0ab6f535f0> is None
E            +  where <function test_pickle_functions.<locals>.funcs.<locals>.<lambda> at 0x7f0ab6f535f0> = <weakref at 0x7f0a8cc05350; to 'function' at 0x7f0ab6f535f0 (<lambda>)>()

distributed/protocol/tests/test_pickle.py:47: AssertionError
```

I've restarted that build.
Not recently, but yes, it used to be an intermittent failure.
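For context, the failing assertion relies on every strong reference to the function being gone after `gc.collect()`, at which point the weakref should resolve to None. A minimal sketch of that pattern (not the test itself):

```python
# Minimal sketch of the weakref pattern the test exercises.
import gc
import weakref


def make_func():
    return lambda x: x + 1


func = make_func()
wr = weakref.ref(func)
assert wr() is func  # resolves while a strong reference exists

del func
gc.collect()         # collect any reference cycles
assert wr() is None  # no strong references remain, so the referent is gone
# Any surviving strong reference (e.g. one captured by a traceback or an
# interactive shell) would keep the function alive and trip this assertion,
# which is one way such a test can fail intermittently.
```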
OK, thanks. We'll merge this and keep an eye out for that failure again. Thanks @rockwellw!
```python
def __hash__(self):
    return hash((self.name, self.host))

def __eq__(self, other):
    return type(self) == type(other) and hash(self) == hash(other)
```
Sorry for being late to the party. There is a bug in here: it is not guaranteed that two WorkerState objects are actually the same even though their hash value is the same. Hashing is not an injective function.
I think we need to go the extra route here and check for `(self.name, self.host) == (other.name, other.host)`.
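To make the collision concern concrete, here is a sketch (with a hypothetical class name) exploiting CPython's well-known quirk that `hash(-1) == hash(-2) == -2`, which makes two distinct attribute tuples hash identically:

```python
# HashEqState is a hypothetical stand-in. CPython reserves -1 as a C-level
# error code, so hash(-1) == hash(-2) == -2, and tuples built from colliding
# elements hash identically as well.
class HashEqState:
    def __init__(self, name, host):
        self.name, self.host = name, host

    def __hash__(self):
        return hash((self.name, self.host))

    def __eq__(self, other):
        # Buggy: a hash collision makes two different workers compare equal.
        return type(self) == type(other) and hash(self) == hash(other)


a = HashEqState(-1, "h")
b = HashEqState(-2, "h")
print(hash(a) == hash(b))  # True in CPython: the attribute tuples collide
print(a == b)              # True, which is wrong: these are different workers

# Comparing the identifying attributes directly gives the right answer:
print((a.name, a.host) == (b.name, b.host))  # False
```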
@TomAugspurger @bnaul please see above.
While I totally agree that hash collisions are very, very unlikely, Dask has a massive install base with sometimes hundreds to thousands of workers per cluster. So as far as we know, the once-in-a-million event could happen next Tuesday. 🤷‍♂️
Addresses #3256, based on @TomAugspurger's comment.