Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for Python profiling of Server requests #3502

Merged
merged 2 commits into from
Jul 31, 2023

Conversation

webbnh
Copy link
Member

@webbnh webbnh commented Jul 24, 2023

This PR adds support for performing execution profiling of Pbench Server API requests.

The profiling is performed by the Python Standard Library cProfile package. Profiling is controlled via Gunicorn pre- and post-request hook functions, which are installed via a Gunicorn configuration file (which is actually Python code...pretty cool).

How to use it: if the PB_PROFILE_DUMP_FILE environment variable is defined when shell.py executes, it adds the configuration file to the Gunicorn command line, and everything else is handled automagically.

Profiling data is collected only during a single API request, starting when that request is invoked and ending when the request is complete. If the environment variable is defined to a file path, then the profile data is dumped to that file when the request completes. The data from each request overwrites the data from the previous request.

I recommend defining PB_PROFILE_DUMP_FILE to /srv/pbench/public_html/server_profile.prof: this results in the profile data being placed where it can easily be fetched by pointing a browser at https://<server_address>/server_profile.prof. Once downloaded, the profile data can be nicely visualized using SnakeViz.

Note that if PB_PROFILE_DUMP_FILE is defined instead to an empty string, the top time-consuming functions are dumped to the log, instead. Unfortunately, the list appears twice in the log -- I don't know why -- but, since it's so easy to use the file dump and so much more effective to use SnakeViz's "icicle graph", I'm not overly concerned about it.

There is no turn-key method at present for setting PB_PROFILE_DUMP_FILE in the canned server (the easiest way seems to be editing the server/lib/systemd/pbench-server.service file before the container is built or editing /etc/systemd/system/pbench-server.service inside the container when it's running). But, this support was quite valuable in tracking down the issues with upload timeouts, so I figured I would just capture it in its current state, which is quite usable once it is enabled.

PBENCH-1228

@webbnh webbnh added this to the v0.73 milestone Jul 24, 2023
@webbnh webbnh self-assigned this Jul 24, 2023
@webbnh webbnh force-pushed the server_profiling branch from a2b8c01 to e91454b Compare July 25, 2023 00:35
lib/pbench/profiling.conf.py Show resolved Hide resolved
worker object (ain't Python grand!?), and we use the worker's log.)
"""
worker.profile = cProfile.Profile()
worker.log.info(f"PROFILING {worker.pid}: {req.uri}")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm generally bothered by the "single API" profiling, since it relies on "stopping the ride" at just the right time. A more useful alternative might be to provide a way to specify a full or base URI to profile, e.g., /api/v1/upload/.* and only profile on a match.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll grant that it's not a fully general solution (but it was sufficiently useful to my needs). Using an environment variable to control it means restarting the server to enable/disable the profiling, which, in my mind, relegates it to use in a controlled/test environment...where "stopping the ride" is really not a problem.

We can imagine lots of ways to improve the controls over it, but I would defer them until we have an actual need.

Comment on lines 69 to 72
s = StringIO()
ps = pstats.Stats(worker.profile, stream=s).sort_stats("cumulative")
ps.print_stats(30) # Limit the number of lines output
worker.log.info(f"[{worker.pid}] [INFO] {s.getvalue()}")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Despite the limitations in format, this gives the advantage of profiling data for every call in an accessible place. The [INFO], by the way, seems unnecessary bulk. Hopefully pstats.Stats identifies the URI?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I concur that there is value to being able to put it in the journal-log, which is why I've retained that as a choice. I just don't see the value of having it there and in the external file...so, as the code stands, it's one or the other. Also, the output in the log is only the top 30 functions in terms of the cumulative time attributed to them including their subroutine calls; unfortunately, the "tail" is pretty long (between 980 and 1646), so 30 functions doesn't capture much of it...but posting more than 30 will make a mess of the log...which makes the external file much more attractive.

I've revised the logging to remove much of the unnecessary bulk. It was there in the code that I was cribbing from, so I kept it. (That output format seems to be consistent with other Gunicorn messages, but I concur that it doesn't serve this purpose well, so I've removed it.)

pstats.Stats does not identify the URI, but the log message in one of the previous lines does, so the context is available in the log. (I considered adding it to this message, but I don't think it actually accomplishes anything over having it in an adjacent message.)

@webbnh webbnh force-pushed the server_profiling branch from e91454b to 00501a4 Compare July 28, 2023 22:34
@webbnh webbnh marked this pull request as ready for review July 29, 2023 04:16
Copy link
Member Author

@webbnh webbnh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is now "ready for review".

lib/pbench/profiling.conf.py Show resolved Hide resolved
worker object (ain't Python grand!?), and we use the worker's log.)
"""
worker.profile = cProfile.Profile()
worker.log.info(f"PROFILING {worker.pid}: {req.uri}")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll grant that it's not a fully general solution (but it was sufficiently useful to my needs). Using an environment variable to control it means restarting the server to enable/disable the profiling, which, in my mind, relegates it to use in a controlled/test environment...where "stopping the ride" is really not a problem.

We can imagine lots of ways to improve the controls over it, but I would defer them until we have an actual need.

Comment on lines 69 to 72
s = StringIO()
ps = pstats.Stats(worker.profile, stream=s).sort_stats("cumulative")
ps.print_stats(30) # Limit the number of lines output
worker.log.info(f"[{worker.pid}] [INFO] {s.getvalue()}")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I concur that there is value to being able to put it in the journal-log, which is why I've retained that as a choice. I just don't see the value of having it there and in the external file...so, as the code stands, it's one or the other. Also, the output in the log is only the top 30 functions in terms of the cumulative time attributed to them including their subroutine calls; unfortunately, the "tail" is pretty long (between 980 and 1646), so 30 functions doesn't capture much of it...but posting more than 30 will make a mess of the log...which makes the external file much more attractive.

I've revised the logging to remove much of the unnecessary bulk. It was there in the code that I was cribbing from, so I kept it. (That output format seems to be consistent with other Gunicorn messages, but I concur that it doesn't serve this purpose well, so I've removed it.)

pstats.Stats does not identify the URI, but the log message in one of the previous lines does, so the context is available in the log. (I considered adding it to this message, but I don't think it actually accomplishes anything over having it in an adjacent message.)

@webbnh webbnh requested review from dbutenhof and ndokos July 29, 2023 04:17
Copy link
Member

@dbutenhof dbutenhof left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The duplication of log output is a concern. (Are you really generating two dump files, also, and just not noticing?) But since this is something that needs to be explicitly enabled with a server restart, and despite the significant shortcomings of the current approach provides at worst a viable template for valuable information going forward, I'm OK with just pushing it through and polishing the floors later when we get around to it... 😁

@webbnh webbnh merged commit 7682537 into distributed-system-analysis:main Jul 31, 2023
@webbnh webbnh deleted the server_profiling branch July 31, 2023 22:07
@webbnh
Copy link
Member Author

webbnh commented Jul 31, 2023

The duplication of log output is a concern.

At this point, I would call it a "mystery" rather than a "concern" (and, I suspect that it is actually a feature of the Python logger which we just weren't expecting to see show up like that).

The two entries in the log have different prefixes and different formatting (i.e., one of them has all the text (including newlines) in a single log entry while the other logs the same text with each line in a separate log message). This seems to me like they both come from the same call in the source code but go through different handlers in the logger stack whose output ends up redirected (unexpectedly, perhaps) to the same place.

Are you really generating two dump files, also, and just not noticing?

That is an excellent question...I don't know how we could tell.... 😆

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

3 participants