Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachprod: add pprof command #62796

Merged
merged 1 commit into from
Mar 31, 2021
Merged

Conversation

stevendanna
Copy link
Collaborator

The new pprof command allows to user to collect CPU and heap profiles
from nodes in a roachprod cluster using pprof's HTTP endpoints.

Profiles are collected in parallel and stored in the current working
directory of the machine running pprof. The command can also
optionally open the profiles using go tool pprof.

Some usage examples:

# Capture CPU profile for all nodes in the cluster
roachprod pprof CLUSTERNAME
# Capture CPU profile for the first node in the cluster for 60 seconds
roachprod pprof CLUSTERNAME:1 --duration 60s
# Capture a Heap profile for the first node in the cluster
roachprod pprof CLUSTERNAME:1 --heap
# Same as above
roachprod pprof-heap CLUSTERNAME:1

Fixes #62309

Note that I've skipped the pprofurl command in the original issue. I noticed that the adminurl
command has an option to append any string you'd like to the URL, which is probably most of what one
needs from that command in most cases. But, I can add it if others feel differently.

Release note: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@stevendanna stevendanna force-pushed the ssd/roachprod-pprof branch 3 times, most recently from ff99b82 to 0a5f420 Compare March 30, 2021 16:49
Copy link
Contributor

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is awesome, thank you!

Note that I've skipped the pprofurl command in the original issue.

Probably fine, we can add it later if anyone really needs it. I just had to dig around a bit to figure it out, but people are likely to just use the pprof command anyway.

Reviewed 2 of 2 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @stevendanna)


pkg/cmd/roachprod/main.go, line 1517 at r1 (raw file):

	Use:     "pprof <cluster>",
	Args:    cobra.ExactArgs(1),
	Aliases: []string{"pprof-cpu", "pprof-profile", "pprof-heap"},

If we're using --heap for the heap profile then maybe that's sufficient instead of cluttering up the command namespace?


pkg/cmd/roachprod/main.go, line 1551 at r1 (raw file):

		mu := &syncutil.Mutex{}
		pprofPath := fmt.Sprintf("debug/pprof/%s?seconds=%d", profType, int(pprofOptions.duration.Seconds()))
		httpClient := httputil.NewClientWithTimeout(5 * pprofOptions.duration)

5x seems a bit excessive, maybe just add on a fixed 10 seconds or something?


pkg/cmd/roachprod/main.go, line 1560 at r1 (raw file):

				scheme = "https"
			}
			outputFile := fmt.Sprintf("pprof-%s-%s-%04d-%d.out", profType, c.Name, i, timeutil.Now().Unix())

We should probably generate the timestamp once outside of the loop, to ensure we don't get files with different timestamps if they happen right as the clock ticks. Also, I'd probably use the time before the index in the filename so that all profiles for the same run are grouped together.


pkg/cmd/roachprod/main.go, line 1562 at r1 (raw file):

			outputFile := fmt.Sprintf("pprof-%s-%s-%04d-%d.out", profType, c.Name, i, timeutil.Now().Unix())
			outputDir := filepath.Dir(outputFile)
			file, err := ioutil.TempFile(outputDir, ".pprof")

I don't think we end up closing this file. Not that it matters that much since we'll just be exiting the process anyway, but it's the polite thing to do and the linter will probably yell at you for it anyway.


pkg/cmd/roachprod/main.go, line 1602 at r1 (raw file):

		for _, s := range outputFiles {
			fmt.Printf("Created %s\n", s)

Maybe just move this into the function above, and output as the files are ready?


pkg/cmd/roachprod/main.go, line 1617 at r1 (raw file):

			for i, file := range outputFiles {
				port := pprofOptions.startingPort + i
				cmd := exec.Command("go", "tool", "pprof",

Will ending the main process (e.g. with ctrl-C) also reap the child processes? If not, we may have to do something along these lines:

cmd.SysProcAttr = &syscall.SysProcAttr{
    Pdeathsig: syscall.SIGKILL,
}

Just to make sure we don't get background processes hogging the ports.

@stevendanna stevendanna force-pushed the ssd/roachprod-pprof branch 2 times, most recently from ea9efbc to d348ad1 Compare March 30, 2021 17:28
Copy link
Collaborator Author

@stevendanna stevendanna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)


pkg/cmd/roachprod/main.go, line 1517 at r1 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

If we're using --heap for the heap profile then maybe that's sufficient instead of cluttering up the command namespace?

I removed pprof-cpu and pprof-profile, but I left pprof-heap just because it kinda amuses me to allow that. Perhaps in the morning I'll remove it to, I'll think about it. :D


pkg/cmd/roachprod/main.go, line 1551 at r1 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

5x seems a bit excessive, maybe just add on a fixed 10 seconds or something?

The worry I had with that is that I believe that timeout includes downloading the response, so if you've generated a very large profile, it might take a long time to download, especially if you have slower home internet.


pkg/cmd/roachprod/main.go, line 1560 at r1 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

We should probably generate the timestamp once outside of the loop, to ensure we don't get files with different timestamps if they happen right as the clock ticks. Also, I'd probably use the time before the index in the filename so that all profiles for the same run are grouped together.

Good idea. Done.


pkg/cmd/roachprod/main.go, line 1562 at r1 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

I don't think we end up closing this file. Not that it matters that much since we'll just be exiting the process anyway, but it's the polite thing to do and the linter will probably yell at you for it anyway.

Good catch. Interestingly the linter doesn't seem to catch this. Perhaps we should add a linter for this. We now close in the defer and we also close before the rename in the happy path.


pkg/cmd/roachprod/main.go, line 1602 at r1 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Maybe just move this into the function above, and output as the files are ready?

Unfortunately ParallelE write progress to the terminal by updating the same line repeatedly which means that if we write it while that is still progressing, we either need to write them to stderr despite them not being errors or we lose the output line in most cases.


pkg/cmd/roachprod/main.go, line 1617 at r1 (raw file):

Will ending the main process (e.g. with ctrl-C) also reap the child processes?

Ending the main process in general will not reap the children, but ctrl-C at the terminal will work. Control-C at the terminal sends SIGINT to all process in the foreground process group which will include all the go pprof processes. Since this is an "interactive" type feature I thought this would probably be sufficient, but happy to go further and handle signals more gracefully in general.

Pdeathsig doesn't work on macOS AFAIK. So I think there our best option might be a signal handler for the common signals we expect to kill us.

Copy link
Contributor

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 1 of 1 files at r2, 1 of 1 files at r3.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @stevendanna)


pkg/cmd/roachprod/main.go, line 1551 at r1 (raw file):

Previously, stevendanna (Steven Danna) wrote…

The worry I had with that is that I believe that timeout includes downloading the response, so if you've generated a very large profile, it might take a long time to download, especially if you have slower home internet.

Right, makes sense. I suppose we'd really want to use a timeout until the first byte of the body or something, but let's not complicate it, this is fine.


pkg/cmd/roachprod/main.go, line 1562 at r1 (raw file):

Previously, stevendanna (Steven Danna) wrote…

Good catch. Interestingly the linter doesn't seem to catch this. Perhaps we should add a linter for this. We now close in the defer and we also close after the rename in the happy path.

Huh, that's odd, would've expected it to pick up on that.


pkg/cmd/roachprod/main.go, line 1602 at r1 (raw file):

Previously, stevendanna (Steven Danna) wrote…

Unfortunately ParallelE write progress to the terminal by updating the same line repeatedly which means that if we write it while that is still progressing, we either need to write them to stderr despite them not being errors or we lose the output line in most cases.

Ah right, make sense.


pkg/cmd/roachprod/main.go, line 1617 at r1 (raw file):

Previously, stevendanna (Steven Danna) wrote…

Will ending the main process (e.g. with ctrl-C) also reap the child processes?

Ending the main process will not reap the children, but ctrl-C at the terminal will work. Control-C at the terminal sends SIGINT to all process in the foreground process group which will include all the go pprof processes. Since this is an "interactive" type feature I thought this would probably be sufficient, but happy to go further and handle signals more gracefully in general.

Because Pdeathsig doesn't work on macOS AFAIK so I think there our best option might be a signal handler for the common signals we expect to kill us.

As long as Control-C does the right thing that's good enough for me, thanks!

The new pprof command allows to user to collect CPU and heap profiles
from nodes in a roachprod cluster using pprof's HTTP endpoints.

Profiles are collected in parallel and stored in the current working
directory of the machine running pprof. The command can also
optionally open the profiles using `go tool pprof`.

Some usage examples:

    # Capture CPU profile for all nodes in the cluster
    roachprod pprof CLUSTERNAME
    # Capture CPU profile for the first node in the cluster for 60 seconds
    roachprod pprof CLUSTERNAME:1 --duration 60s
    # Capture a Heap profile for the first node in the cluster
    roachprod pprof CLUSTERNAME:1 --heap
    # Same as above
    roachprod pprof-heap CLUSTERNAME:1

Fixes cockroachdb#62309

Release note: None
@stevendanna stevendanna force-pushed the ssd/roachprod-pprof branch from d348ad1 to 1d2b57e Compare March 31, 2021 14:40
@stevendanna
Copy link
Collaborator Author

bors r=erikgrinaker

@craig
Copy link
Contributor

craig bot commented Mar 31, 2021

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Mar 31, 2021

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Mar 31, 2021

Build succeeded:

@craig craig bot merged commit 00e8d9c into cockroachdb:master Mar 31, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

roachprod: commands for pprof profiling
3 participants