Be more resilient to partial network partitions #8720
Conversation
```java
// we need custom serialization logic because org.apache.lucene.util.Version is not serializable
// nocommit - do we want to push this down to lucene?
private void writeObject(java.io.ObjectOutputStream out)
```
I don't think this is going to happen - lucene opted out of Serializable for a long time now I don't think we should add it back. I'd rather drop our dependency on it to be honest!
I'm all for controlling the serialization better (i.e., having versioning support) but that's a hard thing. At the moment we can't serialize transport related exceptions so imho we should fix this first. Since the version's variables are final, I had to move this to the disco node level.
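For illustration only, here is a rough sketch of the kind of custom serialization being discussed, assuming it sits in the enclosing class (e.g. DiscoveryNode) and that a non-final version field is acceptable; the Version object is replaced by its numeric id on the wire and resolved again on read:

```java
// illustrative sketch, not the PR's code
private transient Version version; // can't be default-serialized: it wraps org.apache.lucene.util.Version

private void writeObject(java.io.ObjectOutputStream out) throws java.io.IOException {
    out.defaultWriteObject();  // writes all other non-transient, serializable fields
    out.writeInt(version.id);  // replace the Version object with its stable numeric id
}

private void readObject(java.io.ObjectInputStream in)
        throws java.io.IOException, ClassNotFoundException {
    in.defaultReadObject();
    version = Version.fromId(in.readInt()); // resolve the id back to a Version constant
}
```

Whether `id`/`fromId` are the right hooks here is an assumption; the point is only that the non-serializable Lucene dependency can be kept off the Java-serialized form.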
left a bunch of comments - I love the test :)
@s1monw I pushed an update based on the feedback. Note that I also modified the CancelableThreads a bit when extracting it. It's not identical.
@bleskes I looked at the two commits. I like the second one but not the first one. I think your changes to the creation model make things even more complex and error prone than they already are. The engine should not be started / not started / stopped with some in-between state that might or might not be cleaned up. I think we should have a dedicated class - one we might even be able to use without all the guice stuff - that initializes everything it needs in the constructor. It might even create a new instance if we update settings etc. and tear down everything during that time. That is much cleaner, and then you can do the start/stop logic cleanup on top.
force-pushed from b2e63d4 to 4237210
force-pushed from 4237210 to fb9f6f7
```java
/**
 */
@SuppressWarnings("deprecation")
public class Version implements Serializable {
```
is this still needed?
Do you mean removing the Serializable? Yeah - it isn't actually serializable anyway, because it depends on org.apache.lucene.util.Version.
```
@@ -754,6 +733,16 @@ public void performRecoveryPrepareForTranslog() throws ElasticsearchException {
        engine.start();
    }

    /** called if recovery has to be restarted after network error / delay */
    public void performRecoveryRestart() {
```
I still wonder if it is needed to stop the engine - can't we just replay the translog more than once?
just replaying the translog is possible, but that would mean we'd have to make sure the primary doesn't flush in the meantime (the 10s before we retry). Right now, a disconnect releases all resources on the primary and we restart from scratch. I still feel this is the right way to go, but happy to hear alternatives.
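As a rough sketch of what such a restart hook could look like (the field and exception names are assumptions; engine.stop() is the new method mentioned in the commit message), the idea is to stop the engine without releasing the store so the retried recovery can reuse the already-synced files:

```java
/** illustrative sketch - called if recovery has to be restarted after a network error / delay */
public void performRecoveryRestart() {
    synchronized (mutex) { // assumed shard-level lock
        if (state != IndexShardState.RECOVERING) {
            throw new IndexShardNotRecoveringException(shardId, state);
        }
        // stop the engine but keep the underlying store, putting the engine back
        // into its pre-start state; most files won't be copied again on retry
        // because they were just synced
        engine.stop();
    }
}
```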
```
@@ -412,4 +428,117 @@ private void validateIndexRecoveryState(RecoveryState.Index indexState) {
        assertThat(indexState.percentBytesRecovered(), greaterThanOrEqualTo(0.0f));
        assertThat(indexState.percentBytesRecovered(), lessThanOrEqualTo(100.0f));
    }

    @Test
    @TestLogging("indices.recovery:TRACE")
```
is this needed?
I left some more comments
@s1monw I pushed another update. Responded to some comments as well.
force-pushed from e3551aa to b3da055
This commit adds a test that simulates disconnecting nodes and dropping requests during the various stages of recovery, and solves all the issues that were raised by it. In short:

1) Ongoing recoveries will be scheduled for retry upon network disconnect. The default retry period is 5s (cross node connections are checked every 10s by default).

2) Sometimes the disconnect happens after the target engine has started (but the shard is still in recovery). For simplicity, I opted to restart the recovery from scratch (where little to no files will be copied again, because they were just synced). To do so I had to add a stop method to the internal engine, which doesn't free the underlying store (putting the engine back into its pre-start state).

3) To protect against dropped requests, a Recovery Monitor was added that fails a recovery if no progress has been made in the last 30m (by default), which is equivalent to the long timeouts we use in recovery requests.

4) When a shard fails on a node, we try to assign it to another node. If no such node is available, the shard will remain unassigned, causing the target node to clean any in-memory state for it (files on disk remain). At the moment the shard will remain unassigned until another cluster state change happens, which will re-assign it to the node in question, but if no such change happens the shard will stay stuck as unassigned. The commit adds an extra delayed reroute in such cases to make sure the shard will be reassigned.
force-pushed from b3da055 to ba64f52
@s1monw I rebased and squashed against the latest master. Would be great if you could give it another round.
cool stuff LGTM
This commit adds a test that simulates disconnecting nodes and dropping requests during the various stages of recovery, and solves all the issues that were raised by it. In short:

1) Ongoing recoveries will be scheduled for retry upon network disconnect. The default retry period is 5s (cross node connections are checked every 10s by default).

2) Sometimes the disconnect happens after the target engine has started (but the shard is still in recovery). For simplicity, I opted to restart the recovery from scratch (where little to no files will be copied again, because they were just synced).

3) To protect against dropped requests, a Recovery Monitor was added that fails a recovery if no progress has been made in the last 30m (by default), which is equivalent to the long timeouts we use in recovery requests (see the sketch below).

4) When a shard fails on a node, we try to assign it to another node. If no such node is available, the shard will remain unassigned, causing the target node to clean any in-memory state for it (files on disk remain). At the moment the shard will remain unassigned until another cluster state change happens, which will re-assign it to the node in question, but if no such change happens the shard will stay stuck as unassigned. The commit adds an extra delayed reroute in such cases to make sure the shard will be reassigned.

5) Moved all recovery related settings to the RecoverySettings.

Closes elastic#8720
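A minimal sketch of the monitor idea from point 3, with invented names: a periodic task compares the recovery's last access time against a timeout and fails the recovery when nothing has touched it for too long.

```java
// illustrative sketch with invented names - not the PR's exact code
interface OngoingRecovery {
    long lastAccessTimeMillis();  // refreshed every time a recovery request arrives
    void fail(String reason);     // stop the recovery and mark it as failed
}

class RecoveryMonitorTask implements Runnable {
    private final OngoingRecovery recovery;
    private final long timeoutMillis;  // 30m by default, per the commit message

    RecoveryMonitorTask(OngoingRecovery recovery, long timeoutMillis) {
        this.recovery = recovery;
        this.timeoutMillis = timeoutMillis;
    }

    @Override
    public void run() {
        long idle = System.currentTimeMillis() - recovery.lastAccessTimeMillis();
        if (idle > timeoutMillis) {
            // nothing has touched this recovery for too long - assume requests were dropped
            recovery.fail("no activity for " + idle + "ms");
        }
    }
}
```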
elastic#8720 introduced a timeout mechanism for ongoing recoveries, based on a last access time variable. In the many iterations on that PR the update of the access time was lost. This adds it back, including a test that should have been there in the first place.
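The fix described here amounts to refreshing that access time whenever a recovery request is handled, so a monitor like the one sketched above only fires when requests genuinely stop arriving. A sketch, again with invented names:

```java
// illustrative sketch: every incoming recovery request refreshes the access time
class RecoveryAccessTracker {
    private volatile long lastAccessTimeMillis = System.currentTimeMillis();

    /** call from every recovery request handler (file chunk, translog ops, finalize, ...) */
    void markAccessed() {
        lastAccessTimeMillis = System.currentTimeMillis();  // the update that had gone missing
    }

    long lastAccessTimeMillis() {
        return lastAccessTimeMillis;
    }
}
```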
When a node is experiencing network issues, the master detects it and removes the node from the cluster. That causes all ongoing recoveries from and to that node to be stopped and a new location to be found for the relevant shards. However, in the case of a partial network partition, where there are connectivity issues between the source and target node of a recovery but not between those nodes and the current master, things may go wrong. While the nodes successfully restore the connection, the ongoing recoveries may have encountered issues.
This PR adds a test that simulates disconnecting nodes and dropping requests during the various stages of recovery, and solves all the issues that were raised by it. In short:

1) Ongoing recoveries are scheduled for retry upon network disconnect; the default retry period is 5s (see the sketch after this list).

2) If the disconnect happens after the target engine has started (but the shard is still in recovery), the recovery is restarted from scratch; little to no files will be copied again because they were just synced.

3) To protect against dropped requests, a Recovery Monitor fails a recovery if no progress has been made in the last 30m (by default).

4) When a failed shard can't be reassigned to another node, an extra delayed reroute makes sure it will eventually be reassigned instead of staying stuck as unassigned.

5) All recovery related settings were moved to RecoverySettings.
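As a minimal sketch of point 1, using invented names and a plain JDK scheduler rather than the actual recovery code: a disconnect reschedules the recovery after a short delay instead of failing it outright. The 5s/10s relationship is taken from the PR description; everything else is an assumption.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// illustrative sketch: instead of failing an ongoing recovery when the
// connection drops, reschedule it after a short delay
class RecoveryRetryPolicy {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final long retryDelayMillis = 5_000; // 5s default, per the PR description

    void onRecoveryError(Runnable restartRecovery, Throwable error, boolean causedByDisconnect) {
        if (causedByDisconnect) {
            // cross node connections are checked roughly every 10s, so retrying after 5s
            // gives the connection a chance to come back before anything more drastic happens
            scheduler.schedule(restartRecovery, retryDelayMillis, TimeUnit.MILLISECONDS);
        } else {
            throw new RuntimeException("recovery failed", error);
        }
    }
}
```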
I'd love it if extra focus was given to the engine changes while reviewing - I'm not 100% familiar with the implications of the code for the underlying Lucene state.
There is also one nocommit regarding the Java serializability of the Version object (used by DiscoveryNode). We rely on Java serialization for exceptions, and this makes ConnectTransportException unserializable because of its DiscoveryNode field. This can be fixed in another change, but we need to discuss how.
One more todo left - add a reference to the resiliency page