
[#2270] Improvement(IT): increase retry interval of container status check #2271

Closed

Conversation

@mchades (Contributor) commented Feb 20, 2024

What changes were proposed in this pull request?

Increase the retry interval of the container status check from 5s to 10s.

Why are the changes needed?

As described in issue #2270, the retry interval is too short to wait for HDFS initialization to complete.

Fix: #2270

Does this PR introduce any user-facing change?

no

How was this patch tested?

existing tests

@mchades self-assigned this Feb 20, 2024

@justinmclean (Member)

You need to run ./gradlew :integration-test:spotlessApply to fix some formatting issues.

@justinmclean (Member)

Running locally I'm getting errors like:
CatalogHiveIT > initializationError FAILED
java.lang.RuntimeException: Failed to connect to Hive Metastore
at com.datastrato.gravitino.catalog.hive.HiveClientPool.newClient(HiveClientPool.java:96)
at com.datastrato.gravitino.catalog.hive.HiveClientPool.newClient(HiveClientPool.java:39)
at com.datastrato.gravitino.utils.ClientPoolImpl.get(ClientPoolImpl.java:126)
at com.datastrato.gravitino.utils.ClientPoolImpl.run(ClientPoolImpl.java:57)
at com.datastrato.gravitino.utils.ClientPoolImpl.run(ClientPoolImpl.java:52)
at com.datastrato.gravitino.integration.test.container.HiveContainer.checkContainerStatus(HiveContainer.java:111)
at com.datastrato.gravitino.integration.test.container.HiveContainer.start(HiveContainer.java:62)
at com.datastrato.gravitino.integration.test.container.ContainerSuite.startHiveContainer(ContainerSuite.java:86)
at com.datastrato.gravitino.integration.test.catalog.hive.CatalogHiveIT.startup(CatalogHiveIT.java:144)

@mchades (Contributor, Author) commented Feb 20, 2024

Running locally I'm getting errors like:
CatalogHiveIT > initializationError FAILED
java.lang.RuntimeException: Failed to connect to Hive Metastore
at com.datastrato.gravitino.catalog.hive.HiveClientPool.newClient(HiveClientPool.java:96)
...

It seems to be caused by the Hive server check that runs after the HDFS DataNode check. Can you provide the content of integration-test/build/integration-test.log?

@justinmclean (Member)

Here's one after a couple of failures, but not run to completion.

integration-test.log

@mchades (Contributor, Author) commented Feb 20, 2024

Here's one after a couple of failures, but not run to completion.

integration-test.log

Line 124 of the log shows that the Hive container started successfully:

2024-02-20 17:44:57 INFO HiveContainer:83 - Hive container startup success!

It seems the logs of the failed execution may have been overwritten, but based on the stack trace you provided earlier, the failure appears to be unrelated to the changes in this PR.

nRetry,
retryLimit,
sleepTimeMillis);
Thread.sleep(sleepTimeMillis);
Contributor

Is 5 seconds not enough to ensure that the Trino server has started?

Contributor Author

It's not a Trino server, it's HDFS initialization.

From the logs, it does seem that 5s is not enough. You can take a look at the description in the issue.

Contributor

I see.

Contributor

Can we use exponential backoff, like 1, 2, 4, and so on? I believe it will work better than a fixed interval.

Contributor

Taking 10 seconds to complete a check is too slow.
If the service becomes ready in the second second, the program still has to wait the full 10 seconds.
I suggest changing the interface to:
```
checkContainerStatus(int timeout)
```
Check every second, continuing until timeout or success.

@mchades (Contributor, Author) commented Feb 21, 2024

Frequent checking can result in additional performance overhead and stress on the target process. However, setting an overall timeout is a good suggestion.

From the log information in the issue description, it is evident that HDFS completed initialization 33 seconds after the container started, but the detection mechanism returned a failure at 28 seconds.

Considering that this method includes three checks (DataNode status, Hive connection, HDFS connection), I recommend setting the overall timeout to 60s and increasing the retry interval exponentially. If the total time spent on retries exceeds the specified overall timeout, it should be deemed a failed initialization.

What do you think of this approach? @yuqi1129 @diqiu50
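For concreteness, here is a minimal sketch of the retry loop I have in mind (hypothetical names, not the actual Gravitino code): exponential backoff across multiple checks, capped by an overall timeout.

```java
import java.util.function.BooleanSupplier;

public class RetrySketch {
  // Repeatedly runs all checks with exponentially growing sleeps until every
  // check passes or the overall timeout is exceeded. Names are illustrative only.
  static boolean waitUntilReady(long timeoutMillis, BooleanSupplier... checks)
      throws InterruptedException {
    long sleepMillis = 1_000; // 1s, then 2s, 4s, 8s, ...
    long elapsedMillis = 0;
    while (elapsedMillis < timeoutMillis) {
      boolean allPassed = true;
      for (BooleanSupplier check : checks) {
        if (!check.getAsBoolean()) {
          allPassed = false;
          break;
        }
      }
      if (allPassed) {
        return true;
      }
      Thread.sleep(sleepMillis);
      elapsedMillis += sleepMillis;
      sleepMillis *= 2;
    }
    return false; // treated as a failed initialization
  }
}
```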

Contributor

I agree on the whole, but I'm afraid 60 seconds may not be enough in some cases; can we make it larger?

@mchades requested a review from yuqi1129 on February 21, 2024 01:49

@jerryshao (Contributor)

@yuqi1129 @diqiu50 can you please help to review?

@yuqi1129 (Contributor)

@mchades
Could you please update the PR?

@mchades (Contributor, Author) commented Feb 26, 2024

It is ready for review now. Please take a look when you have time, thx~ @yuqi1129 @diqiu50

@mchades requested a review from diqiu50 on February 26, 2024 02:30
totalSleepTimeMillis += sleepTimeMillis;
}
}
return totalSleepTimeMillis < timeoutMillis;
Contributor

If the check procedure succeeds, I believe we should return the value of check.getAsBoolean().

Contributor Author

This is outside the for loop; calling check.getAsBoolean() here would perform an additional check.

BTW, there are multiple checkers here. Which checker's check.getAsBoolean() are you referring to?

protected abstract boolean checkContainerStatus(int retryLimit);
protected abstract boolean checkContainerStatus(int timeoutMillis);

protected boolean checkContainerStatusWithRetry(int timeoutMillis, BooleanSupplier... checker) {
Contributor

Do we support multiple checks? If so, should we perform them serially?

Contributor Author

Yes. For the Hive container, for example, we need to check the HDFS DataNode status, the HMS client connection, and the HDFS client connection.
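As a self-contained illustration (the check method names below are assumptions, not the real HiveContainer API), the varargs form lets a container pass all of its probes in one call:

```java
import java.util.function.BooleanSupplier;

class MultiCheckSketch {
  // Stub probes standing in for the real DataNode / HMS / HDFS checks.
  boolean checkHdfsDataNode() { return true; }
  boolean checkHiveConnection() { return true; }
  boolean checkHdfsConnection() { return true; }

  // Evaluates every supplied check in order; fails fast on the first failure.
  boolean checkAll(BooleanSupplier... checks) {
    for (BooleanSupplier check : checks) {
      if (!check.getAsBoolean()) {
        return false;
      }
    }
    return true;
  }

  boolean isReady() {
    return checkAll(this::checkHdfsDataNode, this::checkHiveConnection, this::checkHdfsConnection);
  }
}
```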

break;
}

sleepTimeMillis = sleepTimeMillis * (int) Math.pow(2, nRetry++);
Contributor

  1. The sleep time is too long; this computation grows too rapidly: 5, 10, 20, 40 ...
  2. Use the shift operator instead of Math.pow.

Contributor Author

  1. The sleep time is too long; this computation grows too rapidly: 5, 10, 20, 40 ...

How about 1, 2, 4, 8, 16 ...? What are the issues with rapid growth? What is your opinion?

  2. Use the shift operator instead of Math.pow.

How does it differ from the current implementation?
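For reference, a small sketch comparing the two ways of computing the power-of-two multiplier (variable names are assumptions, not the PR's exact code); the shift form stays in integer arithmetic, while Math.pow goes through double and a cast:

```java
public class BackoffStepDemo {
  public static void main(String[] args) {
    int baseSleepMillis = 1_000;
    for (int nRetry = 0; nRetry < 5; nRetry++) {
      // Shift operator: 1000, 2000, 4000, 8000, 16000 — pure int arithmetic.
      int byShift = baseSleepMillis << nRetry;
      // Math.pow: same values here, but computed via double and cast back to int.
      int byPow = baseSleepMillis * (int) Math.pow(2, nRetry);
      System.out.println(byShift + " vs " + byPow);
    }
  }
}
```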

@charliecheng630 (Contributor) commented Mar 23, 2024

@mchades Can you please confirm whether this PR is related to the issue I encountered? I can build successfully in my local environment, but it fails in the CI environment.
[screenshot of the CI failure]

https://github.com/datastrato/gravitino/actions/runs/8402223775/job/23011468726?pr=2661

If so, can we review this PR again? @yuqi1129 @diqiu50

@mchades (Contributor, Author) commented Mar 24, 2024

@charliecheng630 I think your guess is correct. The following is the key log from the CI failure; it indicates that the current 50s is not enough for HDFS to complete initialization.

2024-03-23 14:08:34 ERROR HiveContainer:73 - stdout: HDFS is not ready

2024-03-23 14:08:34 INFO HiveContainer:79 - Hive container is not ready, recheck(5/5) after 10000ms
2024-03-23 14:08:59 INFO HiveContainer:157 - Hive container status: isHiveContainerReady=false, isHiveConnectSuccess=true, isHdfsConnectSuccess=true
2024-03-23 14:08:59 INFO CommandExecutor:51 - Sending command "bash -c /home/runner/work/gravitino/gravitino/distribution/package/bin/gravitino.sh stop" to localhost
2024-03-23 14:09:03 INFO AbstractIT:168 - Tearing down Gravitino Server

However, extending the detection time is not a good practice. I think we should first investigate why HDFS initialization takes so long and whether that duration is reasonable.

Perhaps we can save the process logs from the container first, so that we can investigate further.
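If the containers are managed with Testcontainers (an assumption on my part), something like the following could keep the in-container process output around for later inspection; the log file path is hypothetical and depends on the image:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.testcontainers.containers.GenericContainer;
import org.testcontainers.containers.output.Slf4jLogConsumer;

class ContainerLogCapture {
  private static final Logger LOG = LoggerFactory.getLogger(ContainerLogCapture.class);

  // Streams the container's stdout/stderr into the test log and copies an
  // in-container log file out before the container is torn down.
  static void captureLogs(GenericContainer<?> container) {
    container.followOutput(new Slf4jLogConsumer(LOG));
    // "/tmp/hdfs-namenode.log" is a placeholder; the real HDFS log location
    // depends on how the image lays out its logs.
    container.copyFileFromContainer("/tmp/hdfs-namenode.log", "build/hdfs-namenode.log");
  }
}
```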

@mchades (Contributor, Author) commented Apr 11, 2024

It may be fixed by #2871.

@mchades closed this Apr 11, 2024
Successfully merging this pull request may close these issues:

[Improvement] The interval of container status check is too short