[Bug]: [1215 main tke regression] sysbench update report 'FATAL: Worker threads failed to initialize within 30 seconds'. #20765

Closed
Ariznawlll opened this issue Dec 15, 2024 · 10 comments

Labels: kind/bug (Something isn't working), no-pr-linked (Issue Closed without PR), phase/testing, severity/s0 (Extreme impact: Cause the application to break down and seriously affect the use)

@Ariznawlll

Is there an existing issue for the same bug?

  • I have checked the existing issues.

Branch Name

main

Commit ID

0d92298

Other Environment Information

- Hardware parameters:
- OS type:
- Others:

Actual Behavior

job url: https://github.com/matrixorigin/mo-nightly-regression/actions/runs/12330700463/job/34421856358

image

log: https://grafana.ci.matrixorigin.cn/explore?panes=%7B%22ONj%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-main-nightly-0d9229833-20241214%5C%22%7D%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221734214924000%22,%22to%22:%221734215289000%22%7D%7D%7D&schemaVersion=1&orgId=1

Expected Behavior

No response

Steps to Reproduce

trigger tke workflow

Additional information

No response

@Ariznawlll Ariznawlll added kind/bug Something isn't working needs-triage severity/s0 Extreme impact: Cause the application to break down and seriously affect the use labels Dec 15, 2024
@Ariznawlll Ariznawlll added this to the 2.1.0 milestone Dec 15, 2024
@heni02 heni02 assigned reusee and unassigned matrix-meow Dec 17, 2024
@reusee

reusee commented Dec 17, 2024

Same issue as #18725.

@reusee

reusee commented Dec 19, 2024

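The initialization timeout is not necessarily related to the problem of too many S3 connections.
Take mo-main-nightly-e791f9335-20241217 as an example: https://github.com/matrixorigin/mo-nightly-regression/actions/runs/12375243410/job/34540441574

The connection count exceeded 10,000 at two points in time, one at UTC 12-17 17:00 and one at 12-18 06:00: https://grafana.ci.matrixorigin.cn/goto/JxW1H1INR?orgId=1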

Screenshot From 2024-12-19 11-44-47

The tests running during those two periods were all TPCH, and none of them failed:

Screenshot From 2024-12-19 11-45-38

Screenshot From 2024-12-19 11-46-05

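The logs also contain no "cannot assign requested address" errors: https://grafana.ci.matrixorigin.cn/goto/_5zUH1IHR?orgId=1

Conclusion: the error "Worker threads failed to initialize within 30 seconds", which is reported on the test-tool side, is not equivalent to the problem of too many S3 connections.
Too many S3 connections can cause this error, but when the error occurs, too many S3 connections are not necessarily the cause.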

@reusee

reusee commented Dec 19, 2024

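One fix for the problem of too many S3 connections was not included in the run above: 3c10894
Analysis of the connection-count problem has to wait until this run finishes: https://github.com/matrixorigin/mo-nightly-regression/actions/runs/12395187099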

@reusee

reusee commented Dec 19, 2024

Below, mo-main-nightly-e791f9335-20241217 is used as a case study for the "Worker threads failed to initialize within 30 seconds" problem: https://github.com/matrixorigin/mo-nightly-regression/actions/runs/12375243410/job/34540441574

The error was first reported at UTC 12-17 22:17:32: https://github.com/matrixorigin/mo-nightly-regression/actions/runs/12375243410/job/34560148100

Screenshot From 2024-12-19 11-54-59

Logs for the corresponding time period: https://grafana.ci.matrixorigin.cn/goto/_ZFSD1SHR?orgId=1

Screenshot From 2024-12-19 11-56-39

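With "INFO" lines and "use of closed network connection" filtered out: https://grafana.ci.matrixorigin.cn/goto/EKncvJIHR?orgId=1

The remaining errors are the following:

WARN: export/merge.go, packet for query is too large. Try adjusting the Config.MaxAllowedPacket
The log merge task failed; this should have no impact.

ERROR: mo-service/debug.go, cpu profiling already in use
An error from starting the CPU profile twice; this should have no impact.

ERROR: morpc/backend.go, hakeeper-client-backend, read loop stopped
A HAKeeper error, which looks serious.

Searching for HAKeeper errors in other time periods: https://grafana.ci.matrixorigin.cn/goto/QbO2d1INR?orgId=1 . They appear to occur frequently, so they are not specific to this failure.

In summary, the logs show nothing abnormal.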
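
The filtering step above was done in Grafana. As a rough offline equivalent, here is a minimal sketch that drops INFO lines and the "use of closed network connection" noise and counts what remains by level. It assumes a simplified `LEVEL: message` line format, which is not the actual MatrixOne log format. The function name and the sample lines are hypothetical: the three WARN/ERROR messages are taken from this comment, and the INFO line and the noise line are made up for illustration.

```python
from collections import Counter

# Substrings treated as noise, matching the filter described in the comment.
NOISE = ("use of closed network connection",)


def remaining_levels(lines):
    """Count non-INFO, non-noise lines by level.

    Assumes each line looks like 'LEVEL: message' (a simplification).
    """
    counts = Counter()
    for line in lines:
        level, _, message = line.partition(":")
        level = level.strip().upper()
        if level == "INFO":
            continue  # drop INFO lines
        if any(noise in message for noise in NOISE):
            continue  # drop known-noisy errors
        counts[level] += 1
    return counts


# Illustrative sample only; not copied from the real logs.
sample = [
    "INFO: made-up informational line",
    "ERROR: morpc/backend.go, hakeeper-client-backend, read loop stopped",
    "ERROR: made-up prefix: use of closed network connection",
    "WARN: export/merge.go, packet for query is too large",
    "ERROR: mo-service/debug.go, cpu profiling already in use",
]

print(dict(remaining_levels(sample)))
```

Counting by level makes it easy to compare the failing window against a healthy one and see whether any error class is actually specific to the failure.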

@reusee

reusee commented Dec 19, 2024

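My guess is that the load simply went up, the server could not respond in time, and the client timed out.

To verify this, the client timeout needs to be changed.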
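
To illustrate the hypothesis, here is a toy simulation, not sysbench's actual implementation: each worker thread takes some time to initialize (standing in for connecting to a loaded server), and the caller waits for all of them against a single deadline. The function name, delays, and timeouts are hypothetical and chosen only to keep the example fast.

```python
import threading
import time


def workers_ready_in_time(init_delays, timeout):
    """Return True if every simulated worker initializes before the deadline."""
    ready = [threading.Event() for _ in init_delays]

    def worker(delay, event):
        time.sleep(delay)  # stands in for slow initialization under load
        event.set()

    threads = [
        threading.Thread(target=worker, args=(delay, event), daemon=True)
        for delay, event in zip(init_delays, ready)
    ]
    for thread in threads:
        thread.start()

    # One shared deadline for all workers, like a global init timeout.
    deadline = time.monotonic() + timeout
    for event in ready:
        remaining = max(0.0, deadline - time.monotonic())
        if not event.wait(remaining):
            return False  # at least one worker missed the deadline
    return True


# Same slow worker, two different timeouts.
print(workers_ready_in_time([0.01, 0.3], timeout=0.05))
print(workers_ready_in_time([0.01, 0.3], timeout=1.0))
```

With the short timeout the slow worker misses the deadline, and with the longer one the same workload succeeds. That is the behavior the timeout change is meant to test for.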

@reusee

reusee commented Dec 24, 2024

The client timeout has been increased; continuing to observe.

@reusee

reusee commented Dec 27, 2024

Continuing to observe.

@reusee

reusee commented Jan 1, 2025

Continuing to observe.

@reusee reusee assigned heni02 and unassigned reusee Jan 3, 2025
@reusee

reusee commented Jan 3, 2025

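https://github.com/matrixorigin/mo-nightly-regression/actions/workflows/nightly-regression-tke-new.yaml
After the timeout was changed, the timeout error no longer appears, which preliminarily confirms the earlier guess. If further optimization is needed later, a separate issue can be opened.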

@heni02

heni02 commented Jan 10, 2025

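Added --thread-init-timeout=180 to the sysbench invocation, raising the default of 30 seconds to 180 seconds. After two weeks of continued observation the problem has not recurred. Closed.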
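
For reference, a minimal sketch of how the flag might be added when the sysbench command line is assembled programmatically. Only `--thread-init-timeout=180` comes from this comment. The test name `oltp_update_index`, the thread count, and the helper name are assumptions for illustration, and the connection options the real nightly job passes are omitted.

```python
import shlex


def sysbench_args(test, threads, thread_init_timeout=180):
    """Build a sysbench argv with a longer worker-initialization timeout."""
    return [
        "sysbench",
        test,  # assumed test name; the real job may use a different one
        f"--threads={threads}",
        f"--thread-init-timeout={thread_init_timeout}",  # sysbench default is 30
        "run",
    ]


# Print the shell-quoted command rather than executing it.
print(shlex.join(sysbench_args("oltp_update_index", threads=16)))
```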

@heni02 heni02 closed this as completed Jan 10, 2025
@matrix-meow matrix-meow reopened this Jan 10, 2025
@matrix-meow matrix-meow added the no-pr-linked Issue Closed without PR label Jan 10, 2025