Are reference cycles a performance problem? #4987
FWIW, in certain "flaky" timed-out CI runs I've seen log messages claiming that we spend >98% of the time in GC. I'll watch out for those tests and post links if I find any. |
As long as those are live cycles, it shouldn't be a problem. It's probably good to try to break those cycles when a task is removed from the graph, but IIRC this is/was already the case (e.g. explicitly clear dependents and dependencies in the removed task).
I don't know. Cycles will probably be difficult to avoid in async code, I think. One tidbit: if you know you're keeping a traceback or exception somewhere, you can help break cycles by calling |
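A minimal sketch of that kind of cycle-breaking, assuming the stdlib traceback.clear_frames() helper and the exception's __traceback__ attribute are the relevant tools (the exact call isn't spelled out above):

```python
import traceback

def scrub_exception(exc: BaseException, keep_traceback: bool = True) -> None:
    """Break frame/traceback cycles for an exception we intend to keep around."""
    tb = exc.__traceback__
    if tb is None:
        return
    if keep_traceback:
        # Keep the traceback object (so it can still be formatted later) but
        # clear the locals of the frames it references; those locals are what
        # usually point back at the exception and close the cycle.
        traceback.clear_frames(tb)
    else:
        # Or drop the traceback entirely.
        exc.__traceback__ = None

# e.g. call scrub_exception(task.exception) when a task is retired
# (`task` here is a hypothetical object, not a distributed API).
```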
Any suggestions on how to track down what is causing excessive GC use? |
First, the GC will run whether you have reference cycles or not (it needs to run precisely to detect dead reference cycles). |
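As a starting point for tracking that down (a sketch using only stdlib gc facilities, not advice given in this thread), the collector can be made to report what it is doing:

```python
import gc

# Print a line for every collection (generation, unreachable objects found,
# elapsed time); CPython emits these itself when DEBUG_STATS is set.
gc.set_debug(gc.DEBUG_STATS)

# ... run the suspect workload ...

# Per-generation totals: how many collections ran and how much they collected.
for gen, stats in enumerate(gc.get_stats()):
    print(f"gen {gen}: collections={stats['collections']} "
          f"collected={stats['collected']} uncollectable={stats['uncollectable']}")

gc.set_debug(0)  # turn the debug output back off
```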
@crusaderky, @jrbourbeau and I discussed this today. We agreed that the main mystery is why the logs from GC debug mode mostly show an elapsed time of 0.0000s (the longest recorded was 0.1061s), yet the py-spy profiles make it look like pauses are in the 1-2 second range. (Note that the logs were recorded without py-spy running, since the combo made things non-functionally slow.) We need to figure out which is true, or, if both are true, why. Some next steps:
|
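One way to reconcile the 0.0000s debug-log timings with the 1-2 second py-spy pauses would be to time each collection in-process via the stdlib gc.callbacks hook; a minimal sketch:

```python
import gc
import time

_starts = {}

def _gc_timer(phase, info):
    # phase is "start" or "stop"; info carries the generation being collected.
    gen = info["generation"]
    if phase == "start":
        _starts[gen] = time.perf_counter()
    else:
        pause = time.perf_counter() - _starts.pop(gen, time.perf_counter())
        if pause > 0.05:  # only log pauses longer than 50 ms
            print(f"GC gen {gen} pause: {pause * 1000:.1f} ms, "
                  f"collected={info['collected']}, uncollectable={info['uncollectable']}")

gc.callbacks.append(_gc_timer)
```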
You're looking for collections of the oldest generation ("gc: collecting generation 2"). Your logs only show 3 instances of that, AFAIK. By the way: it may well be that the reason for long GC runs is not the cost of garbage collection, but some costly finalizers (for example if a finalizer does some IO). |
Doing an audit of our finalizers and `__del__` methods makes a lot of sense to me.
|
The finalizers are a good idea. I would have thought they'd show up in the py-spy profiles, but maybe because of how they're called within the GC process, they don't show up in the interpreter's normal call stack? |
I think at one point we discussed tracking what objects are being GC'd. Has that been explored? If we are concerned about how specific objects are being cleaned up, this would be useful information when identifying which cleanup methods are problematic. |
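A minimal stdlib-only sketch of that kind of tracking, using gc.DEBUG_SAVEALL to keep would-be-collected objects around long enough to count them by type:

```python
import gc
from collections import Counter

gc.set_debug(gc.DEBUG_SAVEALL)  # collected objects land in gc.garbage instead of being freed
gc.collect()
gc.garbage.clear()              # start from a clean slate

# ... run the suspect workload ...

gc.collect()
print(Counter(type(obj).__name__ for obj in gc.garbage).most_common(20))

gc.set_debug(0)
gc.garbage.clear()
```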
There was an interesting issue in |
@dhirschfeld thanks a ton for these links; they're very helpful. Something like python-trio/trio#1805 (comment) is what we'll want to do to debug this. Also, I wouldn't be surprised if our issue is similar to this Trio one (raising exceptions within frames that hold references to other exceptions/frames/traceback objects), with a similar solution. I've been suspicious of this since seeing how so many of the objects being collected in the debug logs are Cells, Tracebacks, Frames, etc. Intuitively you'd read that sort of code and think "how does adding a |
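To make the pattern concrete, a small self-contained sketch (not code from Trio or distributed) of how storing a caught exception creates a frame/traceback cycle, and why a lone del makes it go away:

```python
class Result:
    """Toy stand-in for a Future-like object (hypothetical)."""
    def __init__(self):
        self.error = None

def run(result):
    try:
        raise ValueError("boom")
    except ValueError as exc:
        # Cycle: result.error -> exc -> exc.__traceback__ -> this frame ->
        # frame local `result` -> result.  Refcounting alone can never free
        # it; only a cyclic GC pass can.
        result.error = exc
    # The frame outlives the call because the traceback references it, and it
    # still holds `result` in its locals.

def run_with_del(result):
    try:
        raise ValueError("boom")
    except ValueError as exc:
        result.error = exc
    # Removing the frame's own reference to `result` breaks one edge of the
    # loop, so plain refcounting cleans everything up once the caller drops
    # `result`; this is why a lone `del` can "fix" a GC problem.
    del result

run(Result())           # leaves a cycle that only gc.collect() will reclaim
run_with_del(Result())  # reclaimed immediately by refcounting
```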
I've gotten some statistics about how turning off GC affects performance.

First, disabling GC on the scheduler gives a 14-37% speedup on our shuffle workload on a 100-worker Coiled cluster. The range is because I've accidentally discovered that this workload runs slower and slower each subsequent time it re-runs on the same cluster, in a concerningly linear way. The doubled lines are with and without

So at trial 1, disabling GC is 14% faster. By trial 10, it's 37% faster. This slowdown is really significant when GC is on: after 10 repetitions, the same operation is 75% slower. FWIW I tried adding a

I also recorded both memory info and the change in CPU times before each trial (@crusaderky). We can see that both user and idle CPU time are higher when GC is active. Also, scheduler memory grows linearly in both cases, but grows much faster when GC is off (so just turning off GC entirely by default certainly does not seem like a viable option). Also reminds me of #3898.

Unless anyone has ideas about what's causing the slowdowns with each repetition, my next steps will be:
|
I'd say two things are particularly of concern:
|
That said, I don't understand the idle time on your graphs. Basically, your numbers show that user time == elapsed time, but idle time also grows a lot (and a lot larger than elapsed time!). |
@pitrou this could just be because the scheduler node has 4 CPUs, but is single-threaded. So 1s of extra wall time will usually give a minimum of 3s of idle time, I'd think. Maybe I'll rerun with 1 CPU for the scheduler so these numbers are more meaningful. |
The growth of memory signals to me that maybe we are genuinely leaking objects. We might consider comparing the counts of types of objects that are around both with and without GC to see if there is some notable difference. For example, maybe we find that lots of TaskState objects exist without GC turned on.

We can probably do this either with the gc module, or with some fancier solution like https://pythonhosted.org/Pympler/muppy.html

This is just a thought here. Other folks here know more than I do.
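A minimal sketch of the gc-module flavour of that comparison (the helper names are made up; Pympler's muppy would give richer summaries):

```python
import gc
from collections import Counter

def type_census():
    # Count live, GC-tracked objects by type name.
    return Counter(type(obj).__name__ for obj in gc.get_objects())

def census_diff(before, after, top=20):
    growth = Counter(after)
    growth.subtract(before)
    return growth.most_common(top)

# Take a census before and after a workload, once with GC enabled and once
# with gc.disable(), and compare which types keep growing (e.g. TaskState).
before = type_census()
# ... run the workload ...
after = type_census()
print(census_diff(before, after))
```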
|
I reran with:
For both workloads, disabling GC gives both better absolute performance and significantly better performance for repeated runs. Looking at only the CPU where the scheduler is running, idle time is very low and consistent between runs. These are all run with

@mrocklin I'm working now on trying to figure out what new objects are being created/leaked between runs. Though I'm currently just starting with a local cluster to figure out what information to collect. |
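One stdlib way to look for those objects, sketched here as an assumption about approach rather than what was actually run, is to diff tracemalloc snapshots taken before and after a trial:

```python
import tracemalloc

tracemalloc.start(25)  # keep up to 25 frames of allocation traceback

baseline = tracemalloc.take_snapshot()
# ... run one trial of the workload ...
snapshot = tracemalloc.take_snapshot()

# Show where the memory that survived the trial was allocated.
for stat in snapshot.compare_to(baseline, "traceback")[:10]:
    print(stat)
    for line in stat.traceback.format()[-3:]:
        print("   ", line)
```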
Based on dask/distributed#4987, ran SF1k w/ GC disabled on the scheduler process, and observed a total benchmark runtime improvement of about 5s. This was consistent across many runs.
the progressive increase of runtime over runs could be explained by a progressive increase of python objects that gc.collect() has to go through at every iteration - read: leak. Do you have a measure of that? Does it correlate with runtime?
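A cheap per-trial measurement of exactly that (a sketch, not data already collected in this thread):

```python
import gc
import time

def gc_pressure_sample():
    # How many tracked objects a full collection has to traverse, and how
    # long such a collection takes right now.
    n_tracked = len(gc.get_objects())
    start = time.perf_counter()
    gc.collect()
    return n_tracked, time.perf_counter() - start

# Call this after every trial and plot both numbers against the trial's
# runtime to see whether they grow together.
```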
|
A few of us (@gjoseph92 @crusaderky @fjetter @jrbourbeau) just met on this topic and came up with a few questions. First, some context. The dask scheduler constructs a massive self-referential data structure. We're curious how this structure interacts with the garbage collector, and we're looking for some general guidance from people like @pitrou (or anyone he may know).
|
In general this is the kind of structure that GC is designed to handle. It's hard to know when a large self-referential object can be cleaned up. So this is what GC checks for. Maybe this is already obvious, but thought it deserved stating to set expectations. |
Is it expensive to walk through such a structure? Are there ways where we can ask the GC to ignore this structure? How about with Cython? |
@gjoseph92 I would be curious if the pauses that you see are only in the oldest generation. I suspect that we could learn some things by tweaking some of the parameters that @pitrou mentions above |
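A sketch of the kind of parameter tweaking meant here, using the stdlib knobs (the numbers are illustrative, not recommendations from this thread):

```python
import gc

print(gc.get_threshold())  # defaults: (700, 10, 10)

# Make young-generation collections rarer and full (oldest-generation)
# collections much rarer, to test whether the long pauses track gen-2 runs.
gc.set_threshold(10_000, 50, 100)

# Or take manual control entirely:
# gc.disable(); ...; gc.collect()  # collect only at points we choose
```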
The same as with a massive non-self-referential data structure. The fact that the data structure is self-referential is irrelevant for the cost of walking the structure.
Basically, it's O(number of objects referenced by the structure).
Not really; there's gc.freeze, but it's very coarse-grained, as you'll see from the description.
Neither. Cython doesn't change how the GC works. |
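For reference, a sketch of what using gc.freeze (Python 3.7+) around the scheduler's long-lived state might look like; whether it actually helps here is exactly the open question:

```python
import gc

# ... build the long-lived, self-referential state (tasks, workers, ...) ...

gc.collect()  # clean up construction garbage first
gc.freeze()   # move everything currently tracked into the permanent
              # generation, so future collections skip it entirely

# Objects created after this point are still tracked and collected normally;
# gc.unfreeze() moves the frozen objects back into the oldest generation.
```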
It's possible to tell Cython not to GC certain Cython extension objects. I'm less clear on whether that will work in our use case or whether it would be a good idea. I should add that Cython extension types already disable features like the

A separate, but also interesting, idea in Cython (that I believe I've raised before) is to use Cython

More generally, while we have seen some pains around GC, I don't think we have narrowed it down to particular objects (though please correct me if things have changed here; I may be a bit out of date). I think this is very important for determining where effort is best spent. For example we might think

To put a finer point on this, we have seen that turning off GC seems to help during |
FYI @wence-, this thread, and especially #4987 (comment), might be interesting to you with the profiling you're doing. |
@gjoseph92 noticed that, under some profiling conditions, turning off garbage collection had a significant impact on scheduler performance. I'm going to include some notes from him in the summary below
Notes from Gabe
See #4825 for initial discussion of the problem. It also comes up on #4881 (comment).
I've also run these with GC debug mode on (gjoseph92/dask-profiling-coiled@c0ea2aa1) and looked at GC logs. Interestingly GC debug mode generally reports GC as taking zero time:
Some of those logs are here: https://rawcdn.githack.com/gjoseph92/dask-profiling-coiled/61fc875173a5b2f9195346f2a523cb1d876c48ad/results/cython-shuffle-gc-debug-noprofiling-ecs-prod-nopyspy.txt?raw=true
The types of objects being listed as collectable are interesting (cells, frames, tracebacks, asyncio Futures/Tasks, SelectorKey), since those are the sorts of things you might expect to create cycles. It's also interesting that there are already ~150k objects in the oldest generation before the computation has even started, and ~300k (and growing) once it's been running for a little bit.
I've also tried turning off:
But none of those affected the issue.
What I wanted to do next was use refcycle or objgraph or a similar tool to try to see what's causing the cycles. Or possibly use tracemalloc + GC hooks to try to log where the objects that were being collected were initially created.
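That tracemalloc-plus-GC-hooks idea might look roughly like this (a sketch of the approach, not code that was actually written):

```python
import gc
import tracemalloc
from collections import Counter

tracemalloc.start(10)
gc.set_debug(gc.DEBUG_SAVEALL)  # keep would-be-collected objects in gc.garbage

def report_cycle_origins(top=10):
    gc.collect()
    by_site = Counter()
    for obj in gc.garbage:
        tb = tracemalloc.get_object_traceback(obj)
        if tb is not None:
            frame = tb[-1]  # most recent frame: where the object was allocated
            by_site[f"{frame.filename}:{frame.lineno}"] += 1
    gc.garbage.clear()
    for site, count in by_site.most_common(top):
        print(count, site)
```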
I notice that we have reference cycles in our scheduler state
Should we be concerned about our use of reference cycles?
cc @jakirkham @pitrou