-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SAI-4661: fix race condition in singleton-per-collection StateWatcher creation #151
SAI-4661: fix race condition in singleton-per-collection StateWatcher creation #151
Conversation
collectionWatches.compute( | ||
collection, | ||
(k, v) -> { | ||
if (v == null) { | ||
v = new StatefulCollectionWatch(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@magibney its not clear to me whether current code has any race condition here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some explanation on the most relevant section of the PR code.
if (!collectionWatches.watchedCollections().contains(coll)) { | ||
// This collection is no longer interesting, stop watching. | ||
StatefulCollectionWatch scw = collectionWatches.statefulWatchesByCollectionName.get(coll); | ||
if (scw == null || scw.associatedWatcher != this) { | ||
// Collection no longer interesting, or we have been replaced by a different watcher. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The substance of the fix is in these three lines. Since StateWatcher instance do a deferred check against collectionWatches
in order to determine whether to exit, it's possible that by the time a StateWatcher checks here, the watch it was created to serve has been removed, and a new watch inserted (and new StateWatcher created) before this check. In that case it's possible to accumulate persistent StateWatcher instances that all notify callbacks on the same event.
The fix is for the StateWatcher to check whether it is the instance responsible for the current entry in collectionWatches
, and exit if not. The rest of the changes are basically just to ensure that The new StatefulCollectionWatch.associatedWatcher
field is never null once it's been inserted in the map.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe will do f2f session for it.
the watch it was created to serve has been removed, and a new watch inserted
When/where we remove watch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The above 3 lines is where the StateWatcher
is "removed", passively, by having the process()
method return early without resetting itself. There are a number of places where DocCollectionWatcher
s are started, including waitForState()
, etc. Not actually sure which of these is the likely culprit, but the test definitely reproduces the problematic behavior we were observing, in a way that should not be possible to do (and is fixed by this change).
int iterations = 10; | ||
for (int i = 0; i < extraWatchers; i++) { | ||
// add and remove a bunch of watchers | ||
DocCollectionWatcher w = (coll) -> false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume generally we want to return true here. Wondering what happens if we return true here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it depends. Some watches are one-off (like in waitForState
), but there's at least one watcher that's persistent for the life of the core: ZkController.UnloadCoreOnDeletedWatcher
. In this test, because the point is that we add-and-remove quickly, returning true
wouldn't change anything ... it would just be another way of removing the watcher.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We just want to make sure with prs collection and restart, we see this behavior?
try { | ||
reader.registerDocCollectionWatcher("c1", w); | ||
} finally { | ||
reader.removeDocCollectionWatcher("c1", w); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see test is removing watch here - would be good to create test with prs collection with n shards and then restart. I'm bit trying to simulate our solr node behavior
@@ -642,8 +647,10 @@ private void constructState(Set<String> changedCollections) { | |||
|
|||
/** Refresh collections. */ | |||
private void refreshCollections() { | |||
for (String coll : collectionWatches.watchedCollections()) { | |||
new StateWatcher(coll).refreshAndWatch(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch on this @magibney !
I traced the code logic and design a little bit. The principle design for removal of watches (ie NOT re-install the watch to ZK after event trigger) appears to tie to the absence of collection name in collectionWatches.watchedCollections()
in StateWatcher#process
(which you pointed out later)
IMHO, this new StateWatcher() call is a bit of an oddball, it skips the checks on collectionWatches
(unlike 2 other places that instantiate the StateWatcher
, which check collectionWatches
and only create new instance if there's no entry for the target collection), hence this StateWatcher fails to remove itself at https://github.com/cowpaths/fullstory-solr/pull/151/files#diff-9b1918d641f78a8c245306b47ffd7a4554a12771ed4cb42293a411537fe7b54dL1334.
I cannot find documentation stating that there could ever be only one StateWatcher
instance per collection per zkReader - or perhaps I missed it, but such could have been implied by the current design of collectionWatches
.
I'm unsure of the expected behaviors of this refreshCollections
method (javadoc seems lacking 😅 ), it might simply want to re-fetch the collection states instead of installing a new zk watch? It could be using new StateWatcher(coll).refreshAndWatch();
out of convenience, but it installs an extra zk watch for a collection already exists in collectionWatches
.
Or if this also wants to install a new "replacement" watch (of a "stale" watch, though such watch could still be working if we see duplicates of event notification as described in https://issues.apache.org/jira/browse/SOLR-11535), then i think @magibney 's solution makes perfect sense - instead of having only collectionWatches
field to keep track of what collections have zk watch installed, we also need to know exactly which StateWatcher
instance is being used.
@magibney i 💯 agree with your description as in https://issues.apache.org/jira/browse/SOLR-11535?focusedCommentId=17770574&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17770574 . I'm wondering though (which your are probably also working on it), how did such race condition happen during our restart. AFAIK, the ONLY place that could remove the collection from Therefore instantiation of And if we look at invocation of So a race condition for Another thing that DID happen, which is That alone probably did not cause a "runaway" zkCallback, but perhaps did exacerbate the problem? 🤔 |
@magibney so i debugged a little bit and it's true that there are several zk watch installation from the |
} | ||
|
||
DocCollection state = clusterState.getCollectionOrNull(collection); | ||
if (stateWatcher.onStateChanged(state) == true) { | ||
removeDocCollectionWatcher(collection, stateWatcher); | ||
if (docCollectionWatcher.onStateChanged(state) == true) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice rename here! the original stateWatcher
was confusing as it's not really a StateWatcher
😓
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, like 9 different concepts of "state" and "watcher" going on here :-)
@@ -372,6 +372,11 @@ private StatefulCollectionWatch compute( | |||
|
|||
private static class StatefulCollectionWatch extends CollectionWatch<DocCollectionWatcher> { | |||
private DocCollection currentState; | |||
private volatile StateWatcher associatedWatcher; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add some description for this field, there are DocCollectionWatcher
and StateWatcher
here, which can be confusing for devs unfamiliar with code (OR forgetful dev like me 😅 ).
Perhaps it's good to mention that this is the underlying ZK level watcher that we tracking. And we need this to stop re-installing old ZK StateWatcher
?
@magibney @patsonluk Thanks for working/debugging/fixing/explaning this issue thoroughly. Was bit thinking more about this and thought we should be registering https://github.com/cowpaths/fullstory-solr/pull/154/files |
…er creation back-ported from active upstream PR as of commit 405e1a8
adcfe3a
to
001d506
Compare
switched the base branch to fs/branch_9_3 and backported the in-progress upstream fix. (Will merge this to upstream branch_9x soon, so only doing fs/branch_9_3 here). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! I think the change proposed is safe and straightforward!
I agree that we would want to do more cleanups later on (on refresh and remove unused methodsd) or use the other proposal @hiteshk25 suggested 👍🏼 https://github.com/cowpaths/fullstory-solr/pull/154/files
No tests yet; will be tricky to test this, esp because I'm not 100% sure where the race condition actually comes in. Buuut I can definitely see several opportunities for race conditions, and this PR will fix the issue by leaning on the deduplication of
collectionWatches
and slightly modifying the logic of how StateWatcher determines whether to expire or reset the watch.