[FLINK-14816] Add thread dump feature for taskmanager #10228

Closed
wants to merge 1 commit into from

lamberken
Member

What is the purpose of the change

Add a thread dump feature for the taskmanager, so users can get thread information easily.

image

Brief change log

  • add thread dump util
  • update taskmanager web

Verifying this change

This change is already covered by existing tests.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (yes / no) no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no) no
  • The serializers: (yes / no / don't know) no
  • The runtime per-record code paths (performance sensitive): (yes / no / don't know) no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes / no / don't know) no
  • The S3 file system connector: (yes / no / don't know) no

Documentation

  • Does this pull request introduce a new feature? (yes / no) yes
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented) no

@flinkbot
Collaborator

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit c1577c7 (Sat Nov 16 02:58:01 UTC 2019)

Warnings:

  • No documentation files were touched! Remember to keep the Flink docs up to date!
  • This pull request references an unassigned Jira ticket. According to the code contribution guide, tickets need to be assigned before starting with the implementation work.

Mention the bot in a comment to re-run the automated checks.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.


The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands
The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

@flinkbot
Collaborator

flinkbot commented Nov 16, 2019

CI report:

Bot commands
The @flinkbot bot supports the following commands:
  • @flinkbot run travis re-run the last Travis build
  • @flinkbot run azure re-run the last Azure build

@lamberken
Member Author

hi, @tisonkun

please cc, thanks.

@tisonkun
Member

I'm not sure whether or not it is a desirable feature to implement. Could you share the use case a bit? @lamber-ken

@lamberken
Member Author

I'm not sure whether or not it is a desirable feature to implement. Could you share the use case a bit? @lamber-ken

It's hard to get a thread dump of a TM when a job is hanging, because Flink tasks are deployed on a YARN cluster.

For example, when fixing the deadlock in the elasticsearch-connector, one has to log in to the TM machine
to take a jstack dump; see elasticsearch-connector-deadlock.

@StephanEwen
Contributor

I think that this could be an interesting feature to various users, for debugging. I found myself debugging deadlocks on remote clusters as well before.

I cannot review the UI changes, but some comments on the backend are below.

@StephanEwen StephanEwen left a comment
Contributor

  • There are some checkstyle failures.

  • Can you rename DUMP to THREAD_DUMP in all cases? Dump is much too generic.

@tisonkun
Member

I'm not sure whether or not it is a desirable feature to implement. Could you share the use case a bit? @lamber-ken

It's hard to get a thread dump of a TM when a job is hanging, because Flink tasks are deployed on a YARN cluster.

For example, when fixing the deadlock in the elasticsearch-connector, one has to log in to the TM machine
to take a jstack dump; see elasticsearch-connector-deadlock.

Thanks for your explanation. I think this is a valuable feature!

@flinkbot approve-until consensus

@lamberken
Member Author

Hi @StephanEwen, I have updated the PR as you suggested, thanks.

@lamberken
Member Author

I'm not sure whether or not it is a desirable feature to implement. Could you share the use case a bit? @lamber-ken

It's hard to get a thread dump of a TM when a job is hanging, because Flink tasks are deployed on a YARN cluster.
For example, when fixing the deadlock in the elasticsearch-connector, one has to log in to the TM machine
to take a jstack dump; see elasticsearch-connector-deadlock.

Thanks for your explanation. I think this is a valuable feature!

@flinkbot approve-until consensus

You're welcome.

@zentol zentol left a comment
Contributor

We shouldn't be modifying the old UI. There's no reasonable way for us to review this change.
Given that no issues were found in 1.9 in regards to the new UI, we could even think about removing the old one altogether.

@lamberken
Member Author

lamberken commented Nov 21, 2019

We shouldn't be modifying the old UI. There's no reasonable way for us to review this change.
Given that no issues were found in 1.9 in regards to the new UI, we could even think about removing the old one altogether.

Hi, I built the project locally, and the thread dump function works on both the old UI and the new UI.

Given your reply, should I revert the changes to the old UI?

@StephanEwen
Contributor

Okay, then let's drop the changes to files from the old UI.

@StephanEwen
Contributor

Now that the old Web UI is dropped, can we rebase this on the latest master?

@lamberken
Member Author

Now that the old Web UI is dropped, can we rebase this on the latest master?

Done.

@lamberken lamberken force-pushed the flink-14816 branch 2 times, most recently from 45449fd to 1a3ab9e on January 4, 2020 19:31
@tillrohrmann tillrohrmann self-assigned this Apr 14, 2020
@tillrohrmann tillrohrmann left a comment
Contributor

Thanks a lot for creating this PR @lamber-ken, and sorry that it took so long to get back to you. I've given your PR a round of review and it already looks very good. I had a few comments. In particular, it would be good to add a test for this feature. Moreover, it would be awesome if you could rebase this PR onto the latest master because it has diverged a bit.

@vthinkxie could you help review the web UI changes? I can only try it out.

Comment on lines +35 to +38
/**
* the thread dump type for taskmanager
*/
THREAD_DUMP
Contributor

Instead of introducing a new FileType, how about introducing a separate TaskExecutorGateway#requestThreadDump method? This would give the operation a more expressive name.

Member Author

IMO, I think it's best to keep it as it is.

Contributor

I disagree since we are not requesting a file here. We are only using the file transfer mechanism to send the data.

The file transfer aspect is another thing we should discuss. Wouldn't it be possible and simpler to directly send the stringified stack trace instead of using the blob service?

Comment on lines 909 to 949
case THREAD_DUMP:
return putTransientBlobStream(JvmUtils.threadDumpStream(), fileType);
Contributor

I think it would be good to add a test for the thread dump feature (e.g. testing that TaskExecutorGateway#requestThreadDump() will return the blob key to a thread dump file).

Comment on lines +45 to +49
List<InputStream> streams = Arrays
.stream(threadMxBean.dumpAllThreads(true, true))
.map((v) -> v.toString().getBytes(StandardCharsets.UTF_8))
.map(ByteArrayInputStream::new)
.collect(Collectors.toList());
Contributor

What will the output of threadMxBean.dumpAllThreads look like? Will there be one or two line breaks between individual ThreadInfo instances?

Member Author

Hi, as follows:

  1. Definition of ThreadMXBean#dumpAllThreads
public ThreadInfo[] dumpAllThreads(boolean lockedMonitors, boolean lockedSynchronizers);
  2. Output of each ThreadInfo
"Monitor Ctrl-Break" Id=5 RUNNABLE
	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:375)
	at java.net.Socket.connect(Socket.java:589)
	at java.net.Socket.connect(Socket.java:538)
	at java.net.Socket.<init>(Socket.java:434)
	at java.net.Socket.<init>(Socket.java:211)
	at com.intellij.rt.execution.application.AppMainV2$1.run(AppMainV2.java:59)
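
The question about line breaks between ThreadInfo entries can be checked with a small standalone sketch. This is not the PR's JvmUtils code: the class name ThreadDumpSketch and its dumpAllThreads method are made up for illustration. It concatenates ThreadInfo#toString() for every thread returned by ThreadMXBean#dumpAllThreads without adding any separator, so the spacing in the printed result is exactly what the JDK produces.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.Arrays;
import java.util.stream.Collectors;

// Hypothetical helper for illustration only; not part of Flink.
public class ThreadDumpSketch {

    /** Returns the stringified ThreadInfo of every live thread, concatenated as-is. */
    public static String dumpAllThreads() {
        ThreadMXBean threadMxBean = ManagementFactory.getThreadMXBean();
        // Only request lock details when the JVM can collect them;
        // otherwise dumpAllThreads throws UnsupportedOperationException.
        boolean monitors = threadMxBean.isObjectMonitorUsageSupported();
        boolean synchronizers = threadMxBean.isSynchronizerUsageSupported();
        ThreadInfo[] infos = threadMxBean.dumpAllThreads(monitors, synchronizers);
        return Arrays.stream(infos)
                .map(ThreadInfo::toString)
                // No separator is added here, so any blank lines come from toString() itself.
                .collect(Collectors.joining());
    }

    public static void main(String[] args) {
        System.out.print(dumpAllThreads());
    }
}
```

On the JDKs I have looked at, ThreadInfo#toString() already ends with an empty line and prints only the first few stack frames of each thread, so a complete dump may need to format ThreadInfo#getStackTrace() itself. Both points are worth verifying on the target JVM.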

@vthinkxie
Contributor

Thanks a lot for creating this PR @lamber-ken, and sorry that it took so long to get back to you. I've given your PR a round of review and it already looks very good. I had a few comments. In particular, it would be good to add a test for this feature. Moreover, it would be awesome if you could rebase this PR onto the latest master because it has diverged a bit.

@vthinkxie could you help review the web UI changes? I can only try it out.

Hi @lamber-ken, the UI of the TM logs has been changed according to https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=147427143
Could you rebase the frontend part?

@tillrohrmann
Contributor

@lamber-ken if you want, then I can also help with rebasing this PR. Just let me know. I really would like to include this feature into the upcoming Flink 1.11 release :-)

@lamberken
Member Author

lamberken commented Apr 20, 2020

hi @tillrohrmann @vthinkxie, welcome : )

Thanks for reviewing again, will ping you again when finished.

@lamberken lamberken force-pushed the flink-14816 branch 2 times, most recently from 45a8d72 to c0a3c6e on April 20, 2020 15:55
@lamberken
Member Author

Hi @tillrohrmann @vthinkxie, please review again, thanks.

image

@lamberken lamberken requested a review from tillrohrmann April 22, 2020 05:17
@vthinkxie
Contributor

Hi @lamber-ken
could the thread dump be part of the log list?
cc @jinglining

@lamberken
Member Author

Hi @lamber-ken
could the thread dump be part of the log list?
cc @jinglining

Hi @vthinkxie, the thread dump is not a real file, so we shouldn't place it in the log list.

// TaskExecutor#requestLogList
@Override
public CompletableFuture<Collection<LogInfo>> requestLogList(Time timeout) {
	return CompletableFuture.supplyAsync(() -> {
		final String logDir = taskManagerConfiguration.getTaskManagerLogDir();
		if (logDir != null) {
			final File[] logFiles = new File(logDir).listFiles();

			if (logFiles == null) {
				throw new CompletionException(new FlinkException(String.format("There isn't a log file in TaskExecutor’s log dir %s.", logDir)));
			}

			return Arrays.stream(logFiles)
					.filter(File::isFile)
					.map(logFile -> new LogInfo(logFile.getName(), logFile.length()))
					.collect(Collectors.toList());
		}
		return Collections.emptyList();
	}, ioExecutor);
}

@tillrohrmann
Contributor

@lamber-ken @vthinkxie I would actually argue exactly the other way around. Why is the stack trace being treated like a file if it isn't a file? Wouldn't it be much simpler to expose the stack trace (maybe in its stringified version) directly without going through the indirection of the blob service? I don't think that the stringified stack trace will be larger than 10MB.

Another argument for changing the cluster RPC is that https://issues.apache.org/jira/browse/FLINK-13550 would need a similar mechanism to obtain the stack trace for a set of tasks. One could use the same RPC to obtain it.

I will create a simple mock to show you what I mean.

@tillrohrmann
Contributor

@lamber-ken @vthinkxie I've created a draft PR #11887 which is based on this one here. The difference is that we don't use the blob cache service to transmit the thread dump from the TaskExecutor. Moreover, I changed the return type of the handler from text to JSON. This would allow the handler to be extended in the future. Please take a look and let me know what you think.
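
The direct approach described in this comment could look roughly like the following. This is only an illustrative sketch and not the code from #11887: DirectThreadDumpSketch, ThreadEntry and this requestThreadDump are hypothetical names, and a plain CompletableFuture stands in for Flink's RPC layer. The point is that the thread dump travels as an ordinary structured return value, with no blob key and no intermediate file, which a REST handler could then serialize to JSON.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical sketch for illustration only; not Flink's actual API.
public class DirectThreadDumpSketch {

    /** Hypothetical payload entry: one thread's name plus its stringified ThreadInfo. */
    public static final class ThreadEntry {
        public final String threadName;
        public final String stringifiedInfo;

        ThreadEntry(String threadName, String stringifiedInfo) {
            this.threadName = threadName;
            this.stringifiedInfo = stringifiedInfo;
        }
    }

    /** Hypothetical stand-in for an RPC method that returns the dump directly. */
    public static CompletableFuture<List<ThreadEntry>> requestThreadDump() {
        return CompletableFuture.supplyAsync(() -> {
            List<ThreadEntry> entries = new ArrayList<>();
            // (false, false) skips lock details and is supported on every JVM.
            ThreadInfo[] infos =
                    ManagementFactory.getThreadMXBean().dumpAllThreads(false, false);
            for (ThreadInfo info : infos) {
                entries.add(new ThreadEntry(info.getThreadName(), info.toString()));
            }
            return entries;
        });
    }

    public static void main(String[] args) {
        List<ThreadEntry> entries = requestThreadDump().join();
        System.out.println(entries.size() + " threads captured");
    }
}
```

Keeping one entry per thread, rather than a single concatenated string, leaves room to add fields later without changing the shape of the response.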

@lamberken
Member Author

lamberken commented Apr 23, 2020

@lamber-ken @vthinkxie I've created a draft PR #11887 which is based on this one here. The difference is that we don't use the blob cache service to transmit the thread dump from the TaskExecutor. Moreover, I changed the return type of the handler from text to JSON. This would allow the handler to be extended in the future. Please take a look and let me know what you think.

@tillrohrmann Thanks for doing that, I'm ok closing this 👍

@lamberken lamberken closed this Apr 23, 2020
8 participants