[#2270] Improvement(IT): increase retry interval of container status check #2271
Conversation
You need to run
Running locally, I'm getting errors like:
It seems to be caused by the Hive server check after the HDFS DataNode check. Can you provide the content of
Here's one after a couple of failures, but not run to completion.
Line 124 of the log shows that the Hive container has started successfully:
It seems the logs of the failed execution may have been overwritten, but based on the stack trace you provided earlier, the failure appears to be unrelated to the changes in this PR.
nRetry,
retryLimit,
sleepTimeMillis);
Thread.sleep(sleepTimeMillis);
Is 5 seconds not enough to ensure that the Trino server has started?
It's not the Trino server; it's HDFS initialization.
From the logs, it does seem that 5s is not enough. You can take a look at the description in the issue.
I see.
Can we use exponential values like 1, 2, 4, and so on? I believe that will work better than a fixed interval.
Taking 10 seconds to complete a check is too slow. If the service becomes healthy in the second second, the program still has to wait the full 10 seconds.
I suggest changing the interface to:
```
checkContainerStatus(int timeout)
```
and checking every second, continuing until timeout or success.
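A minimal sketch of that suggestion, assuming the readiness probe is passed in as a `BooleanSupplier`; the method name and the one-second interval follow the comment above, and everything else is illustrative rather than the actual implementation:

```java
import java.util.function.BooleanSupplier;

// Sketch only: poll the supplied check once per second until it succeeds
// or the overall timeout elapses.
static boolean checkContainerStatus(int timeoutSeconds, BooleanSupplier check) {
  long deadline = System.currentTimeMillis() + timeoutSeconds * 1000L;
  while (System.currentTimeMillis() < deadline) {
    if (check.getAsBoolean()) {
      return true;
    }
    try {
      Thread.sleep(1000); // fixed one-second interval, as suggested
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }
  return false;
}
```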
Frequent checking can result in additional performance overhead and stress on the target process. However, setting an overall timeout is a good suggestion.
From the log information in the issue description, it is evident that HDFS completed initialization after 33 seconds of container startup, but the detection mechanism returned a failure at 28 seconds.
Considering that this method includes three checks (DataNode status, Hive connection, HDFS connection), I recommend setting the overall timeout to 60s and increasing the retry interval exponentially. If the total time spent on retries exceeds the specified overall timeout, it should be deemed a failed initialization.
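For concreteness, a small sketch of the schedule this implies, assuming the interval starts at 1s and doubles on each retry (both values are assumptions, not the PR's actual numbers):

```java
// Sketch: print the retry schedule for a doubling interval under a 60s overall timeout.
public class BackoffSchedule {
  public static void main(String[] args) {
    long timeoutMillis = 60_000;
    long sleepMillis = 1_000;
    long totalMillis = 0;
    while (totalMillis < timeoutMillis) {
      System.out.printf("sleep %2ds, cumulative %2ds%n", sleepMillis / 1000, (totalMillis + sleepMillis) / 1000);
      totalMillis += sleepMillis;
      sleepMillis *= 2; // 1s, 2s, 4s, 8s, ...
    }
  }
}
```

This prints intervals of 1, 2, 4, 8, 16, and 32 seconds (cumulative 1, 3, 7, 15, 31, 63), so checks would run around the 31s and 63s marks, bracketing the ~33s HDFS initialization observed in the issue.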
Agree on the whole, but I'm afraid 60 seconds is not enough in some cases. Can we make it larger? @mchades
totalSleepTimeMillis += sleepTimeMillis;
}
}
return totalSleepTimeMillis < timeoutMillis;
If the check procedure is successful, I believe we should return the value of `check.getAsBoolean()`.
Here, outside the `for` loop, calling `check.getAsBoolean()` would perform an additional check.
BTW, there are multiple checkers here. Which checker's `check.getAsBoolean()` are you referring to?
protected abstract boolean checkContainerStatus(int retryLimit);
protected abstract boolean checkContainerStatus(int timeoutMillis);

protected boolean checkContainerStatusWithRetry(int timeoutMillis, BooleanSupplier... checker) {
Do we support multiple checks? If so, should we run them serially?
Yes. For the Hive container, for example, we need to check the HDFS DataNode status, the HMS client connection, and the HDFS client connection.
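A sketch of how the varargs checkers might be combined and run serially inside the retry loop; the helper below and its usage are illustrative, and the method references (`checkHdfsDataNode`, `checkHmsConnection`, `checkHdfsConnection`) are hypothetical stand-ins for the real checks:

```java
import java.util.function.BooleanSupplier;

// Sketch: all checkers must pass in a single round; any failure waits and
// retries until the overall timeout budget is spent.
static boolean checkContainerStatusWithRetry(long timeoutMillis, BooleanSupplier... checkers) {
  long sleepMillis = 1_000;
  long totalSleepMillis = 0;
  while (totalSleepMillis < timeoutMillis) {
    boolean allPassed = true;
    for (BooleanSupplier checker : checkers) { // checks run serially, in order
      if (!checker.getAsBoolean()) {
        allPassed = false;
        break;
      }
    }
    if (allPassed) {
      return true;
    }
    try {
      Thread.sleep(sleepMillis);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
    totalSleepMillis += sleepMillis;
    sleepMillis *= 2; // grow the interval between rounds
  }
  return false;
}

// Hypothetical usage for the Hive container:
// checkContainerStatusWithRetry(60_000, this::checkHdfsDataNode, this::checkHmsConnection, this::checkHdfsConnection);
```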
break;
}

sleepTimeMillis = sleepTimeMillis * (int) Math.pow(2, nRetry++);
- The sleep time is too long; this computation grows too rapidly: 5, 10, 20, 40, ...
- Use the shift operator instead of `Math.pow`.
- The sleep time is too long; this computation grows too rapidly: 5, 10, 20, 40, ...

How about 1, 2, 4, 8, 16, ...? What are the issues with rapid growth? What is your opinion?

- Use the shift operator instead of `Math.pow`.

How does it differ from the current implementation?
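For reference, a minimal comparison of the two forms under discussion; the starting value is illustrative. For a non-negative `n`, `1 << n` equals `(int) Math.pow(2, n)`, so the shift form yields the same sequence while staying in integer arithmetic:

```java
// Sketch: the power form and the shift form produce identical values.
public class BackoffForms {
  public static void main(String[] args) {
    int sleepA = 5_000; // updated with Math.pow, as quoted in the diff above
    int sleepB = 5_000; // updated with a shift
    for (int nRetry = 0; nRetry < 4; nRetry++) {
      sleepA = sleepA * (int) Math.pow(2, nRetry);
      sleepB = sleepB * (1 << nRetry);
      System.out.println(sleepA + " == " + sleepB); // 5000, 10000, 40000, 320000
    }
  }
}
```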
...on-test/src/test/java/com/datastrato/gravitino/integration/test/container/BaseContainer.java
@mchades Can you please confirm whether this PR is related to the issue I encountered? I can build successfully in the local environment, but it fails in the CI environment. https://github.com/datastrato/gravitino/actions/runs/8402223775/job/23011468726?pr=2661
@charliecheng630 I think your guess is correct. The following is the key log of the CI failure; it indicates that the current 50s is not enough for HDFS to complete initialization.
However, extending the detection time is not a good practice, and I think we should first investigate why HDFS initialization takes so long and whether that time cost is reasonable. Perhaps we can save the process logs from the container first, so that we can investigate further.
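A sketch of one way to capture container output for that investigation, assuming the containers are managed with Testcontainers; the class, file, and path names are placeholders, and this only captures stdout/stderr, so daemon log files written inside the container would still need to be copied out separately:

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.testcontainers.containers.GenericContainer;
import org.testcontainers.containers.output.Slf4jLogConsumer;

// Sketch: forward live container output to the test logger, and dump the
// accumulated output to a file for post-mortem analysis after a failed check.
public class ContainerLogCapture {
  private static final Logger LOG = LoggerFactory.getLogger(ContainerLogCapture.class);

  static void attachLogConsumer(GenericContainer<?> container) {
    // Stream the container's stdout/stderr into SLF4J so it shows up in the CI log.
    container.followOutput(new Slf4jLogConsumer(LOG));
  }

  static void dumpLogs(GenericContainer<?> container, String fileName) throws java.io.IOException {
    // getLogs() returns everything the container has written to stdout/stderr so far.
    Path dir = Files.createDirectories(Paths.get("build", "container-logs"));
    Files.write(dir.resolve(fileName), container.getLogs().getBytes(StandardCharsets.UTF_8));
  }
}
```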
It may be fixed by #2871.
What changes were proposed in this pull request?
Increase the retry interval of the container status check from 5s to 10s.
Why are the changes needed?
As described in issue #2270, the retry interval is too short to wait for HDFS initialization to complete.
Fix: #2270
Does this PR introduce any user-facing change?
no
How was this patch tested?
existing tests