-
Notifications
You must be signed in to change notification settings - Fork 13.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FLINK-14816] Add thread dump feature for taskmanager #10228
Conversation
Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community Automated ChecksLast check on commit c1577c7 (Sat Nov 16 02:58:01 UTC 2019) Warnings:
Mention the bot in a comment to re-run the automated checks. Review Progress
Please see the Pull Request Review Guide for a full explanation of the review process. The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer of PMC member is required Bot commandsThe @flinkbot bot supports the following commands:
|
hi, @tisonkun please cc, thanks. |
I'm not sure whether or not it is a desirable feature to implement. Could you share the user case a bit? @lamber-ken |
It's hard to get the thread dump of tm when job is hanging, because flink tasks are deployed on yarn cluster. For example, when fixing the deadlock of elasticserch-connector, it needs to jump to the tm machine |
I think that this could be an interesting feature to various users, for debugging. I found myself debugging deadlocks on remote clusters as well before. I cannot review the UI changes, but some comments on the backend are below. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
There are some checkstyle failures.
-
Can you rename
DUMP
toTHREAD_DUMP
in all cases? Dump is much too generic.
flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/FileType.java
Outdated
Show resolved
Hide resolved
flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/TaskExecutor.java
Outdated
Show resolved
Hide resolved
Thanks for your explanation. I think this is a valuable feature! @flinkbot approve-until consensus |
hi, @StephanEwen. I had update the pr as your suggestion, thanks. |
You're welcome. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shouldn't be modifying the old UI. There's no reasonable way for us to review this change.
Given that no issues were found in 1.9 in regards to the new UI, we could even think about removing the old one altogether.
hi, I build the project local, the thread dump function works ok on both old ui and new ui. as your reply, need I revert the old ui? |
Okay, then let's drop the changes to files from the old UI. |
Now that the old Web UI is dropped, can we rebase this on the latest master? |
551b4df
to
53cbeb6
Compare
Done. |
45449fd
to
1a3ab9e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for creating this PR @lamber-ken and sorry that it took so long to get back to you. I've given your PR a round and it looks already very good. I had few comments. In particular, it would be good to add a test for this feature. Moreover, it would be awesome if you could rebase this PR onto the latest master because it has diverged a bit.
@vthinkxie could you help reviewing the web ui changes? I can only try it out.
/** | ||
* the thread dump type for taskmanager | ||
*/ | ||
THREAD_DUMP |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of introducing a new FileType
, how about introducing a separate TaskExecutorGateway#requestThreadDump
method? This would give the operation a more expressive name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, I think it's best to keep it as it is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I disagree since we are not requesting a file here. We are only using the file transfer mechanism to send the data.
The file transfer aspect is another thing we should discuss. Wouldn't it be possible and simpler to directly send the stringified stack trace instead of using the blob service?
...ava/org/apache/flink/runtime/rest/messages/taskmanager/TaskManagerThreadDumpFileHeaders.java
Outdated
Show resolved
Hide resolved
case THREAD_DUMP: | ||
return putTransientBlobStream(JvmUtils.threadDumpStream(), fileType); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be good to add a test for the thread dump feature (e.g. testing that TaskExecutorGateway#requestThreadDump()
will return the blob key to a thread dump file.
List<InputStream> streams = Arrays | ||
.stream(threadMxBean.dumpAllThreads(true, true)) | ||
.map((v) -> v.toString().getBytes(StandardCharsets.UTF_8)) | ||
.map(ByteArrayInputStream::new) | ||
.collect(Collectors.toList()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How will the output of threadMxBean.dumpAllThreads
look like? Will there be one or two line breaks between individual ThreadInfo
instances?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, as follows:
- Definition
ThreadMXBean#dumpAllThreads
public ThreadInfo[] dumpAllThreads(boolean lockedMonitors, boolean lockedSynchronizers);
- Each ThreadInfo output
"Monitor Ctrl-Break" Id=5 RUNNABLE
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:375)
at java.net.Socket.connect(Socket.java:589)
at java.net.Socket.connect(Socket.java:538)
at java.net.Socket.<init>(Socket.java:434)
at java.net.Socket.<init>(Socket.java:211)
at com.intellij.rt.execution.application.AppMainV2$1.run(AppMainV2.java:59)
Hi @lamber-ken , the UI of TM logs has been changed according to https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=147427143 |
@lamber-ken if you want, then I can also help with rebasing this PR. Just let me know. I really would like to include this feature into the upcoming Flink 1.11 release :-) |
hi @tillrohrmann @vthinkxie, welcome : ) Thanks for reviewing again, will ping you again when finished. |
45a8d72
to
c0a3c6e
Compare
Hi @tillrohrmann @vthinkxie , please review again, thanks |
Hi @lamber-ken |
hi @vthinkxie, thread dump is not a real file, we shouldn't place it on log list
|
@lamber-ken @vthinkxie I would actually argue exactly the other way around. Why does the stack trace is being treated like a file if it isn't a file? Wouldn't it be much simpler to expose the stack trace (maybe in its stringified version) directly without going through the indirection of the blob service? I don't think that the stringified stack trace will be larger than 10MB. Another argument for changing the cluster RPC is that https://issues.apache.org/jira/browse/FLINK-13550 would need a similar mechanism to obtain the stack trace for a set of tasks. One could use the same RPC to obtain it. I will create a simple mock to show you what I mean. |
@lamber-ken @vthinkxie I've created a draft PR #11887 which is based on this one here. The difference is that we don't use the blob cache service to transmit the thread dump from the |
@tillrohrmann Thanks for doing that, I'm ok closing this 👍 |
What is the purpose of the change
Add thread dump feature for taskmanager, so use can get thread information easily.
Brief change log
Verifying this change
This change is already covered by existing tests.
Does this pull request potentially affect one of the following parts:
@Public(Evolving)
: (yes / no) noDocumentation