
[#2270] Improvement(IT): increase retry interval of container status check #2271

Closed

Conversation

@mchades (Contributor) commented Feb 20, 2024

What changes were proposed in this pull request?

Increase the retry interval of the container status check from 5s to 10s.

Why are the changes needed?

As described in issue #2270, the retry interval is too short to wait for HDFS initialization to complete.

Fix: #2270

Does this PR introduce any user-facing change?

no

How was this patch tested?

existing tests

@mchades self-assigned this Feb 20, 2024

@justinmclean (Member)

You need to run ./gradlew :integration-test:spotlessApply to fix some formatting issues.

@justinmclean (Member)

Running locally I'm getting errors like:
CatalogHiveIT > initializationError FAILED
java.lang.RuntimeException: Failed to connect to Hive Metastore
at com.datastrato.gravitino.catalog.hive.HiveClientPool.newClient(HiveClientPool.java:96)
at com.datastrato.gravitino.catalog.hive.HiveClientPool.newClient(HiveClientPool.java:39)
at com.datastrato.gravitino.utils.ClientPoolImpl.get(ClientPoolImpl.java:126)
at com.datastrato.gravitino.utils.ClientPoolImpl.run(ClientPoolImpl.java:57)
at com.datastrato.gravitino.utils.ClientPoolImpl.run(ClientPoolImpl.java:52)
at com.datastrato.gravitino.integration.test.container.HiveContainer.checkContainerStatus(HiveContainer.java:111)
at com.datastrato.gravitino.integration.test.container.HiveContainer.start(HiveContainer.java:62)
at com.datastrato.gravitino.integration.test.container.ContainerSuite.startHiveContainer(ContainerSuite.java:86)
at com.datastrato.gravitino.integration.test.catalog.hive.CatalogHiveIT.startup(CatalogHiveIT.java:144)

@mchades (Contributor, Author) commented Feb 20, 2024

Running locally I'm getting errors like:
CatalogHiveIT > initializationError FAILED
java.lang.RuntimeException: Failed to connect to Hive Metastore
at com.datastrato.gravitino.catalog.hive.HiveClientPool.newClient(HiveClientPool.java:96)
...

It seems to be caused by the Hive server check that runs after the HDFS DataNode check. Can you provide the content of integration-test/build/integration-test.log?

@justinmclean (Member)

Here's one after a couple of failures, but not run to completion.

integration-test.log

@mchades (Contributor, Author) commented Feb 20, 2024

Here's one after a couple of failures, but not run to completion.

integration-test.log

Line 124 of the log shows that the Hive container started successfully:

2024-02-20 17:44:57 INFO HiveContainer:83 - Hive container startup success!

It seems the logs of the failed execution may have been overwritten, but based on the stack trace you provided earlier, the failure appears to be unrelated to the changes in this PR.

nRetry,
retryLimit,
sleepTimeMillis);
Thread.sleep(sleepTimeMillis);
Contributor

Is 5 seconds not enough to ensure that the Trino server has started?

Contributor Author

It's not a Trino server, it's HDFS initialization.

From the logs, it does seem that 5s is not enough. You can take a look at the description in the issue.

Contributor

I see.

Contributor

Can we use exponential backoff, like 1, 2, 4, and so on? I believe it will work better than a fixed interval.

Contributor

Taking 10 seconds to complete a check is too slow.
If the service becomes ready in the second second, the program still has to wait the full 10 seconds.
I suggest changing the interface to:
```
checkContainerStatus(int timeout)
```
Check every second, continuing until timeout or success.

@mchades (Contributor, Author) commented Feb 21, 2024

Frequent checking can result in additional performance overhead and stress on the target process. However, setting an overall timeout is a good suggestion.

From the log information in the issue description, it is evident that HDFS completed initialization 33 seconds after the container started, but the detection mechanism returned a failure at 28 seconds.

Considering that this method includes three checks (DataNode status, Hive connection, HDFS connection), I recommend setting the overall timeout to 60s and increasing the retry interval exponentially. If the total time spent on retries exceeds the specified overall timeout, it should be deemed a failed initialization.

What do you think of this approach? @yuqi1129 @diqiu50
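For concreteness, here is a minimal sketch of the retry loop I have in mind (hypothetical names, not the actual Gravitino code): exponential backoff across multiple checks, capped by an overall timeout.

```java
import java.util.function.BooleanSupplier;

public class RetrySketch {
  // Repeatedly runs all checks with exponentially growing sleeps until every
  // check passes or the overall timeout is exceeded. Names are illustrative only.
  static boolean waitUntilReady(long timeoutMillis, BooleanSupplier... checks)
      throws InterruptedException {
    long sleepMillis = 1_000; // 1s, then 2s, 4s, 8s, ...
    long elapsedMillis = 0;
    while (elapsedMillis < timeoutMillis) {
      boolean allPassed = true;
      for (BooleanSupplier check : checks) {
        if (!check.getAsBoolean()) {
          allPassed = false;
          break;
        }
      }
      if (allPassed) {
        return true;
      }
      Thread.sleep(sleepMillis);
      elapsedMillis += sleepMillis;
      sleepMillis *= 2;
    }
    return false; // treated as a failed initialization
  }
}
```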

Contributor

I agree on the whole, but I'm afraid 60 seconds may not be enough in some cases; can we make it larger?

@mchades requested a review from yuqi1129 on February 21, 2024 01:49

@jerryshao (Contributor)

@yuqi1129 @diqiu50 can you please help to review?

@yuqi1129 (Contributor)

@mchades
Could you please update the PR?

@mchades (Contributor, Author) commented Feb 26, 2024

It is ready for review now. Please take a look when you have time, thx~ @yuqi1129 @diqiu50

@mchades requested a review from diqiu50 on February 26, 2024 02:30
totalSleepTimeMillis += sleepTimeMillis;
}
}
return totalSleepTimeMillis < timeoutMillis;
Contributor

If the check procedure succeeds, I believe we should return the value of check.getAsBoolean().

Contributor Author

This is outside the for loop; calling check.getAsBoolean() here would perform an additional check.

BTW, there are multiple checkers here. Which checker's check.getAsBoolean() are you referring to?

protected abstract boolean checkContainerStatus(int retryLimit);
protected abstract boolean checkContainerStatus(int timeoutMillis);

protected boolean checkContainerStatusWithRetry(int timeoutMillis, BooleanSupplier... checker) {
Contributor

Do we support multiple checks? If so, should we perform them serially?

Contributor Author

Yes. For the Hive container, for example, we need to check the HDFS DataNode status, the HMS client connection, and the HDFS client connection.
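As a self-contained illustration (the check method names below are assumptions, not the real HiveContainer API), the varargs form lets a container pass all of its probes in one call:

```java
import java.util.function.BooleanSupplier;

class MultiCheckSketch {
  // Stub probes standing in for the real DataNode / HMS / HDFS checks.
  boolean checkHdfsDataNode() { return true; }
  boolean checkHiveConnection() { return true; }
  boolean checkHdfsConnection() { return true; }

  // Evaluates every supplied check in order; fails fast on the first failure.
  boolean checkAll(BooleanSupplier... checks) {
    for (BooleanSupplier check : checks) {
      if (!check.getAsBoolean()) {
        return false;
      }
    }
    return true;
  }

  boolean isReady() {
    return checkAll(this::checkHdfsDataNode, this::checkHiveConnection, this::checkHdfsConnection);
  }
}
```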

break;
}

sleepTimeMillis = sleepTimeMillis * (int) Math.pow(2, nRetry++);
Contributor

  1. The sleep time is too long; this computation grows too rapidly: 5, 10, 20, 40 ...
  2. Use the shift operator instead of Math.pow.

Contributor Author

  1. The sleep time is too long; this computation grows too rapidly: 5, 10, 20, 40 ...

How about 1, 2, 4, 8, 16 ...? What are the issues with rapid growth? What is your opinion?

  2. Use the shift operator instead of Math.pow.

How does it differ from the current implementation?
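For reference, a small sketch comparing the two ways of computing the power-of-two multiplier (variable names are assumptions, not the PR's exact code); the shift form stays in integer arithmetic, while Math.pow goes through double and a cast:

```java
public class BackoffStepDemo {
  public static void main(String[] args) {
    int baseSleepMillis = 1_000;
    for (int nRetry = 0; nRetry < 5; nRetry++) {
      // Shift operator: 1000, 2000, 4000, 8000, 16000 — pure int arithmetic.
      int byShift = baseSleepMillis << nRetry;
      // Math.pow: same values here, but computed via double and cast back to int.
      int byPow = baseSleepMillis * (int) Math.pow(2, nRetry);
      System.out.println(byShift + " vs " + byPow);
    }
  }
}
```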

@charliecheng630 (Contributor) commented Mar 23, 2024

@mchades Can you please confirm whether this PR is related to the issue I encountered? I can build successfully in my local environment, but it fails in the CI environment.
[screenshot of the CI failure]

https://github.com/datastrato/gravitino/actions/runs/8402223775/job/23011468726?pr=2661

If so, can we review this PR again? @yuqi1129 @diqiu50

@mchades (Contributor, Author) commented Mar 24, 2024

@charliecheng630 I think your guess is correct. The following is the key log from the CI failure; it indicates that the current 50s is not enough for HDFS to complete initialization.

2024-03-23 14:08:34 ERROR HiveContainer:73 - stdout: HDFS is not ready

2024-03-23 14:08:34 INFO HiveContainer:79 - Hive container is not ready, recheck(5/5) after 10000ms
2024-03-23 14:08:59 INFO HiveContainer:157 - Hive container status: isHiveContainerReady=false, isHiveConnectSuccess=true, isHdfsConnectSuccess=true
2024-03-23 14:08:59 INFO CommandExecutor:51 - Sending command "bash -c /home/runner/work/gravitino/gravitino/distribution/package/bin/gravitino.sh stop" to localhost
2024-03-23 14:09:03 INFO AbstractIT:168 - Tearing down Gravitino Server

However, extending the detection time is not a good practice. I think we should first investigate why HDFS initialization takes so long and whether that duration is reasonable.

Perhaps we can save the process logs from the container first, so that we can investigate further.
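If the containers are managed with Testcontainers (an assumption on my part), something like the following could keep the in-container process output around for later inspection; the log file path is hypothetical and depends on the image:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.testcontainers.containers.GenericContainer;
import org.testcontainers.containers.output.Slf4jLogConsumer;

class ContainerLogCapture {
  private static final Logger LOG = LoggerFactory.getLogger(ContainerLogCapture.class);

  // Streams the container's stdout/stderr into the test log and copies an
  // in-container log file out before the container is torn down.
  static void captureLogs(GenericContainer<?> container) {
    container.followOutput(new Slf4jLogConsumer(LOG));
    // "/tmp/hdfs-namenode.log" is a placeholder; the real HDFS log location
    // depends on how the image lays out its logs.
    container.copyFileFromContainer("/tmp/hdfs-namenode.log", "build/hdfs-namenode.log");
  }
}
```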

@mchades (Contributor, Author) commented Apr 11, 2024

It may be fixed by #2871.

@mchades closed this Apr 11, 2024
Successfully merging this pull request may close these issues:

[Improvement] The interval of container status check is too short