HBASE-27551 Add config options to delay assignment to retain last region location #4945

Merged: 5 commits, Jan 24, 2023. Showing changes from 2 commits.
In ServerManager: @@ -406,7 +406,7 @@ private void checkIsDead(final ServerName serverName, final String what)
* Assumes onlineServers is locked.
* @return ServerName with matching hostname and port.
*/
- private ServerName findServerWithSameHostnamePortWithLock(final ServerName serverName) {
+ public ServerName findServerWithSameHostnamePortWithLock(final ServerName serverName) {
ServerName end =
ServerName.valueOf(serverName.getHostname(), serverName.getPort(), Long.MAX_VALUE);
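(Visibility widened from private to public so that TransitRegionStateProcedure, below, can probe whether a region's previous host has re-registered.)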

In TransitRegionStateProcedure: @@ -31,6 +31,7 @@
import org.apache.hadoop.hbase.client.RetriesExhaustedException;
import org.apache.hadoop.hbase.master.MetricsAssignmentManager;
import org.apache.hadoop.hbase.master.RegionState.State;
import org.apache.hadoop.hbase.master.ServerManager;
import org.apache.hadoop.hbase.master.procedure.AbstractStateMachineRegionProcedure;
import org.apache.hadoop.hbase.master.procedure.MasterProcedureEnv;
import org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure;
@@ -107,6 +108,23 @@ public class TransitRegionStateProcedure

private static final Logger LOG = LoggerFactory.getLogger(TransitRegionStateProcedure.class);

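/** Whether to delay assignment, waiting for a region's previous RS to come back online, to force retaining the last region location */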
public static final String FORCE_REGION_RETAINMENT = "hbase.master.scp.retain.assignment.force";

public static final boolean DEFAULT_FORCE_REGION_RETAINMENT = false;

/** The wait time in millis before checking again if the region's previous RS is back online */
public static final String FORCE_REGION_RETAINMENT_WAIT =
  "hbase.master.scp.retain.assignment.force.wait";

public static final long DEFAULT_FORCE_REGION_RETAINMENT_WAIT = 500;
Contributor: I first read this config as how long we should wait for this server to come online. However, the code uses this config to sleep for this period between checks, so the overall wait time is FORCE_REGION_RETAINMENT_WAIT * FORCE_REGION_RETAINMENT_RETRIES. From the operator's perspective, I think it is better to manage this with an overall wait time.

Contributor Author: A single overall wait time would be less efficient: we would forcibly wait for the whole of the given timeout, even if the RS comes back online much sooner than that. Maybe better to leave it as is, but put comments on this constant to avoid confusion?

Contributor: Documentation/comments are definitely good to have here. Is it costly to check the RS's availability? Maybe hbase.master.scp.retain.assignment.force.wait can be set to a small value, so configuring it might not be that important.
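[Per the first hunk above, the availability check is ServerManager#findServerWithSameHostnamePortWithLock, a lookup against the master's in-memory onlineServers map, so an individual probe should indeed be cheap.]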


/**
 * The number of times to check if the region's previous RS is back online, before giving up and
 * proceeding with assignment on a new RS
 */
public static final String FORCE_REGION_RETAINMENT_RETRIES =
  "hbase.master.scp.retain.assignment.force.retries";

public static final long DEFAULT_FORCE_REGION_RETAINMENT_RETRIES = 600;
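For a sense of scale with the defaults: assignment of a region can be delayed by up to DEFAULT_FORCE_REGION_RETAINMENT_RETRIES * DEFAULT_FORCE_REGION_RETAINMENT_WAIT = 600 * 500 ms = 300,000 ms (five minutes) before giving up on the previous RS.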

private TransitionType type;

private RegionStateTransitionState initialState;
Expand All @@ -126,6 +144,14 @@ public class TransitRegionStateProcedure

private boolean isSplit;

private boolean forceRegionRetainment;
Contributor: None of the fields below should be stored here: when reloading, we use the default constructor to create the procedure and the deserialize method to restore its fields. So if you want to store them here, you need to serialize them, or implement the afterReplay method to initialize them...

Contributor Author: Thanks for pointing this out. Since these are all config readings, I decided to go with the afterReplay overriding option.
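A minimal sketch of what that afterReplay override could look like, based on the constructor reads shown below (an illustration, not necessarily the final commit's exact code):

  @Override
  protected void afterReplay(MasterProcedureEnv env) {
    // On master restart the procedure is rebuilt via the default constructor and
    // deserialize(), so these unserialized, config-backed fields must be re-read here.
    forceRegionRetainment = env.getMasterConfiguration().getBoolean(FORCE_REGION_RETAINMENT,
      DEFAULT_FORCE_REGION_RETAINMENT);
    forceRegionRetainmentWait = env.getMasterConfiguration().getLong(FORCE_REGION_RETAINMENT_WAIT,
      DEFAULT_FORCE_REGION_RETAINMENT_WAIT);
    forceRegionRetainmentRetries = env.getMasterConfiguration()
      .getLong(FORCE_REGION_RETAINMENT_RETRIES, DEFAULT_FORCE_REGION_RETAINMENT_RETRIES);
    serverManager = env.getMasterServices().getServerManager();
  }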

Contributor: Looking at other configs in TRSP, I would prefer we store these configs in AssignmentManager, just like:

  private final boolean shouldAssignRegionsWithFavoredNodes;
  private final int assignDispatchWaitQueueMaxSize;
  private final int assignDispatchWaitMillis;
  private final int assignMaxAttempts;
  private final int assignRetryImmediatelyMaxAttempts;

private ServerManager serverManager;
Contributor: ServerManager can be gotten from MasterProcedureEnv, so we do not need to store it here.

Contributor Author: The problem here is that we don't keep any MasterProcedureEnv ref as a class attribute.

Contributor: When executing you will always have a MasterProcedureEnv...
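A sketch of the reviewer's alternative (the changed signature is hypothetical, for illustration): every execute-path call receives the env, so the field could be dropped and the manager fetched on demand:

  private void checkAndWaitForOriginalServer(MasterProcedureEnv env, ServerName lastHost)
    throws ProcedureSuspendedException {
    // No cached ServerManager field needed; an env is in scope on every call.
    ServerManager serverManager = env.getMasterServices().getServerManager();
    boolean isOnline = serverManager.findServerWithSameHostnamePortWithLock(lastHost) != null;
    // ... rest as in the method below ...
  }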


private long forceRegionRetainmentWait;

private long forceRegionRetainmentRetries;

public TransitRegionStateProcedure() {
}

@@ -163,6 +189,17 @@ protected TransitRegionStateProcedure(MasterProcedureEnv env, RegionInfo hri,
}
evictCache =
env.getMasterConfiguration().getBoolean(EVICT_BLOCKS_ON_CLOSE_KEY, DEFAULT_EVICT_ON_CLOSE);

forceRegionRetainment = env.getMasterConfiguration().getBoolean(FORCE_REGION_RETAINMENT,
DEFAULT_FORCE_REGION_RETAINMENT);

forceRegionRetainmentWait = env.getMasterConfiguration().getLong(FORCE_REGION_RETAINMENT_WAIT,
DEFAULT_FORCE_REGION_RETAINMENT_WAIT);

forceRegionRetainmentRetries = env.getMasterConfiguration()
.getLong(FORCE_REGION_RETAINMENT_RETRIES, DEFAULT_FORCE_REGION_RETAINMENT_RETRIES);

serverManager = env.getMasterServices().getServerManager();
}

protected TransitRegionStateProcedure(MasterProcedureEnv env, RegionInfo hri,
@@ -188,6 +225,25 @@ protected boolean waitInitialized(MasterProcedureEnv env) {
return am.waitMetaLoaded(this) || am.waitMetaAssigned(this, getRegion());
}

private void checkAndWaitForOriginalServer(ServerName lastHost)
throws ProcedureSuspendedException {
boolean isOnline = serverManager.findServerWithSameHostnamePortWithLock(lastHost) != null;
long retries = 0;
while (!isOnline && retries < forceRegionRetainmentRetries) {
Contributor: In this way the procedure will hang here for a very long time without releasing the procedure worker. Better to suspend the procedure and reschedule it again later.

Contributor Author: Yes, thanks for the suggestion; I changed this in the last commit.

[A sketch of the suspend pattern follows this method.]

try {
Thread.sleep(forceRegionRetainmentWait);
} catch (InterruptedException e) {
throw new ProcedureSuspendedException();
}
retries++;
isOnline = serverManager.findServerWithSameHostnamePortWithLock(lastHost) != null;
}
LOG.info(
  "{} is true. We waited {} ms for host {} to come back online. "
    + "Did host come back online? {}",
  FORCE_REGION_RETAINMENT, (retries * forceRegionRetainmentWait), lastHost, isOnline);
}
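For reference, the suspend-and-reschedule pattern the reviewer asks for usually takes this shape in HBase procedures (a sketch under that assumption, not the final commit's code):

  // Instead of Thread.sleep() on the worker thread: arm a timeout and suspend.
  // The framework's timeout handler later makes the procedure runnable again,
  // freeing the worker in the meantime.
  setTimeout(Math.toIntExact(forceRegionRetainmentWait));
  setState(ProcedureProtos.ProcedureState.WAITING_TIMEOUT);
  skipPersistence();
  throw new ProcedureSuspendedException();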

private void queueAssign(MasterProcedureEnv env, RegionStateNode regionNode)
throws ProcedureSuspendedException {
boolean retain = false;
Expand All @@ -200,9 +256,15 @@ private void queueAssign(MasterProcedureEnv env, RegionStateNode regionNode)
regionNode.setRegionLocation(assignCandidate);
} else if (regionNode.getLastHost() != null) {
retain = true;
LOG.info("Setting lastHost as the region location {}", regionNode.getLastHost());
LOG.info("Setting lastHost {} as the location for region {}", regionNode.getLastHost(),
regionNode.getRegionInfo().getEncodedName());
regionNode.setRegionLocation(regionNode.getLastHost());
}
if (regionNode.getRegionLocation() != null && forceRegionRetainment) {
LOG.warn("{} is set to true. This may delay regions re-assignment "
+ "upon RegionServers crashes or restarts.", FORCE_REGION_RETAINMENT);
checkAndWaitForOriginalServer(regionNode.getRegionLocation());
}
}
LOG.info("Starting {}; {}; forceNewPlan={}, retain={}", this, regionNode.toShortString(),
forceNewPlan, retain);
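[Net effect of this hunk: when hbase.master.scp.retain.assignment.force is enabled and a target location was chosen (the assign candidate or the last host), queueAssign waits for that server to report back in, up to the configured retries, before the assignment proceeds.]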
In the retain-assignment-on-restart test (org.apache.hadoop.hbase.master): @@ -17,6 +17,8 @@
*/
package org.apache.hadoop.hbase.master;

import static org.apache.hadoop.hbase.master.assignment.TransitRegionStateProcedure.FORCE_REGION_RETAINMENT;
import static org.apache.hadoop.hbase.master.assignment.TransitRegionStateProcedure.FORCE_REGION_RETAINMENT_WAIT;
import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertNotEquals;
import static org.junit.Assert.assertTrue;
@@ -228,6 +230,79 @@ public void testRetainAssignmentOnSingleRSRestart() throws Exception {
}
}

/**
* Tests that assignments are force-retained upon an RS restart, even when the master triggers an SCP
*/
@Test
public void testForceRetainAssignment() throws Exception {
UTIL.getConfiguration().setBoolean(FORCE_REGION_RETAINMENT, true);
UTIL.getConfiguration().setLong(FORCE_REGION_RETAINMENT_WAIT, 2000);
setupCluster();
HMaster master = UTIL.getMiniHBaseCluster().getMaster();
SingleProcessHBaseCluster cluster = UTIL.getHBaseCluster();
List<JVMClusterUtil.RegionServerThread> threads = cluster.getLiveRegionServerThreads();
assertEquals(NUM_OF_RS, threads.size());
int[] rsPorts = new int[NUM_OF_RS];
for (int i = 0; i < NUM_OF_RS; i++) {
rsPorts[i] = threads.get(i).getRegionServer().getServerName().getPort();
}

// We don't have to use SnapshotOfRegionAssignmentFromMeta. We use it here because AM used to
// use it to load all user region placements
SnapshotOfRegionAssignmentFromMeta snapshot =
new SnapshotOfRegionAssignmentFromMeta(master.getConnection());
snapshot.initialize();
Map<RegionInfo, ServerName> regionToRegionServerMap = snapshot.getRegionToRegionServerMap();
for (ServerName serverName : regionToRegionServerMap.values()) {
boolean found = false; // Test only, no need to optimize
for (int k = 0; k < NUM_OF_RS && !found; k++) {
found = serverName.getPort() == rsPorts[k];
}
assertTrue(found);
}

// Server to be restarted
ServerName deadRS = threads.get(0).getRegionServer().getServerName();
List<RegionInfo> deadRSRegions = snapshot.getRegionServerToRegionMap().get(deadRS);
LOG.info("\n\nStopping {} server", deadRS);
cluster.stopRegionServer(deadRS);

LOG.info("\n\nSleeping a bit");
Thread.sleep(1000);

LOG.info("\n\nStarting region server {} second time with the same port", deadRS);
cluster.getConf().setInt(ServerManager.WAIT_ON_REGIONSERVERS_MINTOSTART, 3);
cluster.getConf().setInt(HConstants.REGIONSERVER_PORT, deadRS.getPort());
cluster.startRegionServer();

ensureServersWithSamePort(master, rsPorts);

// Wait till master is initialized and all regions are assigned
for (TableName TABLE : TABLES) {
UTIL.waitTableAvailable(TABLE);
}
UTIL.waitUntilNoRegionsInTransition(60000);

snapshot = new SnapshotOfRegionAssignmentFromMeta(master.getConnection());
snapshot.initialize();
Map<RegionInfo, ServerName> newRegionToRegionServerMap = snapshot.getRegionToRegionServerMap();
assertEquals(regionToRegionServerMap.size(), newRegionToRegionServerMap.size());
for (Map.Entry<RegionInfo, ServerName> entry : newRegionToRegionServerMap.entrySet()) {
ServerName oldServer = regionToRegionServerMap.get(entry.getKey());
ServerName currentServer = entry.getValue();
LOG.info(
"Key=" + entry.getKey() + " oldServer=" + oldServer + ", currentServer=" + currentServer);
assertEquals(entry.getKey().toString(), oldServer.getAddress(), currentServer.getAddress());

if (deadRS.getPort() == oldServer.getPort()) {
// The restarted RS's startcode won't be the same
assertNotEquals(oldServer.getStartcode(), currentServer.getStartcode());
} else {
assertEquals(oldServer.getStartcode(), currentServer.getStartcode());
}
}
}

private void setupCluster() throws Exception, IOException, InterruptedException {
// Set Zookeeper based connection registry since we will stop master and start a new master
// without populating the underlying config for the connection.