Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HBASE-24518 : waitForNamespaceOnline() should return false if any region is offline #1869

Merged
merged 1 commit into from
Jun 17, 2020
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -1226,15 +1226,15 @@ private void finishActiveMasterInitialization(MonitoredTask status) throws IOExc
* and we will hold here until operator intervention.
*/
@VisibleForTesting
public boolean waitForMetaOnline() throws InterruptedException {
public boolean waitForMetaOnline() {
return isRegionOnline(RegionInfoBuilder.FIRST_META_REGIONINFO);
}

/**
* @return True if region is online and scannable else false if an error or shutdown (Otherwise
* we just block in here holding up all forward-progess).
*/
private boolean isRegionOnline(RegionInfo ri) throws InterruptedException {
private boolean isRegionOnline(RegionInfo ri) {
RetryCounter rc = null;
while (!isStopped()) {
RegionState rs = this.assignmentManager.getRegionStates().getRegionState(ri);
Expand Down Expand Up @@ -1265,16 +1265,16 @@ private boolean isRegionOnline(RegionInfo ri) throws InterruptedException {
* Check hbase:namespace table is assigned. If not, startup will hang looking for the ns table
* <p/>
* This is for rolling upgrading, later we will migrate the data in ns table to the ns family of
* meta table. And if this is a new clsuter, this method will return immediately as there will be
* meta table. And if this is a new cluster, this method will return immediately as there will be
* no namespace table/region.
* @return True if namespace table is up/online.
*/
private boolean waitForNamespaceOnline() throws InterruptedException, IOException {
private boolean waitForNamespaceOnline() throws IOException {
TableState nsTableState =
MetaTableAccessor.getTableState(getConnection(), TableName.NAMESPACE_TABLE_NAME);
if (nsTableState == null || nsTableState.isDisabled()) {
// this means we have already migrated the data and disabled or deleted the namespace table,
// or this is a new depliy which does not have a namespace table from the beginning.
// or this is a new deploy which does not have a namespace table from the beginning.
return true;
}
List<RegionInfo> ris =
Expand All @@ -1285,7 +1285,9 @@ private boolean waitForNamespaceOnline() throws InterruptedException, IOExceptio
}
// Else there are namespace regions up in meta. Ensure they are assigned before we go on.
for (RegionInfo ri : ris) {
isRegionOnline(ri);
if (!isRegionOnline(ri)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually isRegionOnline() is waiting in a loop until this region's status become opened and that server is online. So there is no bug as such right? isRegionOnline() might return false iff the server is being stopped

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question. So even if one region is offline you will not wait for namespace correct? That means that indicates some case where regions are gong down? What if a split region was opened and that has status offline? Even then we wont wait for namespace?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isRegionOnline might also return false if region doesn't come online after all retry attempts with exponential backoff are completed (rare but valid case)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW our RetryCounterFactory is with infinite retry. maxAttempts are Integer.MAX_VALUE. Also we don't account for a case whether RetryCounterFactory finishes the retry attempts.
There is no harm in adding this fix. But should we see whether the RetryCounterFactory usage is correct also?
Am I missing some thing?

Copy link
Contributor Author

@virajjasani virajjasani Jun 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see, for infinite retries, it is not a problem but then ideally we should change the signature of isRegionOnline() because boolean is of no use. Also, it is being used in similar manner for meta region also, and that's why I wanted to make namespace region open logic look same.

e.g

  public boolean waitForMetaOnline() throws InterruptedException {
    return isRegionOnline(RegionInfoBuilder.FIRST_META_REGIONINFO);
  }

However, now I understand the main purpose of isRegionOnline() which is to keep retrying infinite times and only when server is going down, return false. And with the current logic, waitForNamespaceOnline() returns true no matter what isRegionOnline() returns and hence, instead of returning from HMaster active initializations, it proceeds further.

    if (!waitForNamespaceOnline()) {
      return;
    }
    status.setStatus("Starting cluster schema service");
    initClusterSchemaService();
....
....
....

We should return from the above if condition ideally. Good to maintain same logic for namespace and meta regions initializations. Thought?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ramkrish86 from this code block, we just return false and we return back from finishActiveMasterInitialization() without any further initialization because mostly this will happen when HM server is going down so we expect finishActiveMasterInitialization() to return instead of going further with any further init:

finishActiveMasterInitialization():

    if (!waitForNamespaceOnline()) {
      return;
    }

waitForNamespaceOnline():

      if (!isRegionOnline(ri)) {
        return false;
      }

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ya seeing the meta case this seems fine. Just a unification. +1.

return false;
}
}
return true;
}
Expand Down