-
Notifications
You must be signed in to change notification settings - Fork 3.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HBASE-26323 Introduce a Snapshot Procedure #3920
Conversation
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
1 similar comment
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
b0f826b
to
04a8772
Compare
🎊 +1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general this is what I expected, thanks for the great work.
Of course we still need to polish some implementation details, left some comments, PTAL.
future.completeExceptionally( | ||
new SnapshotCreationException("Snapshot '" + snapshot.getName() + | ||
"' wasn't completed in expectedTime:" + expectedTimeout + " ms", snapshotDesc)); | ||
if (resp.hasProcId()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better add some comments to explain this is for keeping compatibility with old server implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I'll address it ASAP.
getProcedureResult(resp.getProcId(), future, 0); | ||
addListener(future, new SnapshotProcedureBiConsumer(snapshotDesc.getTableName())); | ||
} else { | ||
long expectedTimeout = resp.getExpectedTimeout(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better extract the code to an separated method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I'll address it ASAP.
@@ -437,10 +437,13 @@ message IsCleanerChoreEnabledResponse { | |||
|
|||
message SnapshotRequest { | |||
required SnapshotDescription snapshot = 1; | |||
optional uint64 nonce_group = 2 [default = 0]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why we do not have this in the past?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have no idea. Maybe the previous snapshot framework is safe enough to avoid repeated execution. But I think we should have this for the snapshot procedure so I added here.
|
||
// just to pass the unit tests for all 3.x versions, | ||
// the minimum version maybe needs to be modified later | ||
if (VersionInfoUtil.currentClientHasMinimumVersion(2, 10)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think here we could always use the same logic? Just always return the procId, it is the client's choice to whether to make use of it. And we could just test whether we have nonceGroup and nonce directly, so we do not need to test the client version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. That's a good idea. I will address it ASAP.
|
||
private TableDescriptor loadTableDescriptorSnapshot(MasterProcedureEnv env) throws IOException { | ||
TableDescriptor htd = env.getMasterServices().getTableDescriptors().get(snapshotTable); | ||
if (htd == null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this possible?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It exists in the old code. I can't think of a situation where this would happen. Actually I tried removing this check and it seemed that it didn't make any difference. The unit tests still passed. This is just to keep the same as the old code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mind giving a pointer in the current code base? We could check the history.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (htd == null) { | ||
throw new IOException("TableDescriptor missing for " + snapshotTable); | ||
} | ||
if (htd.getMaxFileSize() == -1 && this.snapshot.getMaxFileSize() > 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to treat the maxFileSize specially?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It exists in the old code. I can't think of a situation where this would happen. Actually I tried removing this check and it seemed that it didn't make any difference. The unit tests still passed. This is just to keep the same as the old code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mind giving a pointer in the current code base? We could check the history.
try { | ||
prepareSnapshotEnv(env); | ||
} catch (IOException e) { | ||
LOG.error("Failed replaying {}, mark procedure as failed", this, e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this possible?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haven't encountered it so far. But we may need to be prepared for filesystem failure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, here you just call prepareSnapshotEnv... I do not think this is correct, in afterReplay typically you should not do the same thing with a normal step? I was thinking we will use this method to restore the in progress snapshot procedures map in SnapshotManager...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are some variables such as workingDir, workingDirFs or snapshotManifest, they are runtime variables and maybe we don't need to persist them to the procedure store, just reconstruct them after replay. Based on this consideration, I call the prepareSnapshotEnv in this method.
The SnapshotManager has not been initialized when we replay procedures. As we discussed before, the solution is changing the startup order of master or using a map to track the unfinished snapshot procedure and pass the map when initializing SnapshotManager. I am trying the second solution and have not finished this work. The code is a little ugly, because I have to put some special logic to the procedure-v2 framework which is what I want to avoid. So I am thinking about the first solution too ....
.getAssignmentManager().getTableRegions(snapshotTable, false); | ||
List<ServerName> servers = env | ||
.getMasterServices().getServerManager().getOnlineServersList(); | ||
return env.getMasterServices().getLoadBalancer() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to use LoadBalancer here? I suppose we could use something like the SplitWALManager, where we could configure the max concurrency number for a region server, and if there are more tasks then the number_of_region_servers * max_concurrency, we just do not schedule all the tasks, once there are tasks finished, we will schedule new tasks. This could make the tasks more evenly distributed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a better idea to distribute the tasks. I'll address it ASAP.
region.getRegionNameAsString(), regionNode.getProcedure())); | ||
throw new ProcedureSuspendedException(); | ||
} | ||
if (!regionNode.getState().matches(RegionState.State.OPEN)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this possible if we do not have a TRSP for this region?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The regionserver can be crashed in any time. So I think we may need to be prepared for this. What do you think ?
return null; | ||
} else { | ||
// can not send request to remote server, we will try another server | ||
ServerName newTargetServer = env.getMasterServices().getServerManager().randomSelect(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we'd better just fail here and let the upper layer to reschedule a procedure for verifying?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good idea. But I did not find a good way to let the parent procedure find that the child procedure failed to execute (As I mentioned in the doc, I think it's hard to communicate between parent procedure and child procedure), So I have to use some tricks to let the veiry procedure handle dispatch failure and keep retrying until succeed. Do you have any ideas about how to let the parent procedure find that the child procedure failed to execute ?
Really appreciate for taking time reviewing this @Apache9 The last commit fails some unit tests and I have fixed most of them. Currently I am working on fixing the left. I will finish this work ASAP. Thanks. |
…e to a seperated method
34c4ca7
to
8617d7c
Compare
🎊 +1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
closed this PR and started a new One |
No description provided.