Are reference cycles a performance problem? #4987
FWIW, in certain "flaky" timed-out CI runs I've seen log messages claiming that we spend >98% of the time in GC. I'll watch out for those tests and post links if I find any. |
As long as those are live cycles, it shouldn't be a problem. It's probably good to try to break those cycles when a task is removed from the graph, but IIRC this is/was already the case (e.g. explicitly clear dependents and dependencies in the removed task).
I don't know. Cycles will probably be difficult to avoid in async code, I think. One tidbit: if you know you're keeping a traceback or exception somewhere, you can help break cycles by calling |
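A minimal sketch of that kind of cycle-breaking, assuming the stdlib traceback.clear_frames() helper and the exception's __traceback__ attribute are the relevant tools (the exact call isn't spelled out above):

```python
import traceback

def scrub_exception(exc: BaseException, keep_traceback: bool = True) -> None:
    """Break frame/traceback cycles for an exception we intend to keep around."""
    tb = exc.__traceback__
    if tb is None:
        return
    if keep_traceback:
        # Keep the traceback object (so it can still be formatted later) but
        # clear the locals of the frames it references; those locals are what
        # usually point back at the exception and close the cycle.
        traceback.clear_frames(tb)
    else:
        # Or drop the traceback entirely.
        exc.__traceback__ = None

# e.g. call scrub_exception(task.exception) when a task is retired
# (`task` here is a hypothetical object, not a distributed API).
```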
Any suggestions on how to track down what is causing excessive GC use? |
First, the GC will run whether you have reference cycles or not (it needs to run precisely to detect dead reference cycles). |
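As a starting point for tracking that down (a sketch using only stdlib gc facilities, not advice given in this thread), the collector can be made to report what it is doing:

```python
import gc

# Print a line for every collection (generation, unreachable objects found,
# elapsed time); CPython emits these itself when DEBUG_STATS is set.
gc.set_debug(gc.DEBUG_STATS)

# ... run the suspect workload ...

# Per-generation totals: how many collections ran and how much they collected.
for gen, stats in enumerate(gc.get_stats()):
    print(f"gen {gen}: collections={stats['collections']} "
          f"collected={stats['collected']} uncollectable={stats['uncollectable']}")

gc.set_debug(0)  # turn the debug output back off
```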
@crusaderky, @jrbourbeau and I discussed this today. We agreed that the main mystery is why the logs from GC debug mode mostly show an elapsed time of 0.0000s (the longest recorded was 0.1061s), yet the py-spy profiles make it look like pauses are in the 1-2 second range. (Note that the logs were recorded without py-spy running, since the combo made things non-functionally slow.) We need to figure out which is true, or, if both are true, why. Some next steps:
|
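One way to reconcile the 0.0000s debug-log timings with the 1-2 second py-spy pauses would be to time each collection in-process via the stdlib gc.callbacks hook; a minimal sketch:

```python
import gc
import time

_starts = {}

def _gc_timer(phase, info):
    # phase is "start" or "stop"; info carries the generation being collected.
    gen = info["generation"]
    if phase == "start":
        _starts[gen] = time.perf_counter()
    else:
        pause = time.perf_counter() - _starts.pop(gen, time.perf_counter())
        if pause > 0.05:  # only log pauses longer than 50 ms
            print(f"GC gen {gen} pause: {pause * 1000:.1f} ms, "
                  f"collected={info['collected']}, uncollectable={info['uncollectable']}")

gc.callbacks.append(_gc_timer)
```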
You're looking for collections of the oldest generation ("gc: collecting generation 2"). Your logs only show 3 instances of that, AFAIK. By the way: it may well be that the reason for long GC runs is not the cost of garbage collection, but some costly finalizers (for example if a finalizer does some IO). |
Doing an audit of our finalizers and `__del__` methods makes a lot of sense to me.
|
The finalizers are a good idea. I would have thought they'd show up in the py-spy profiles, but maybe because of how they're called within the GC process, they don't show up in the interpreter's normal call stack? |
I think at one point we discussed tracking what objects are being GC'd. Has that been explored? If we are concerned about how specific objects are being cleaned up, this would be useful information when identifying which cleanup methods are problematic. |
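A minimal stdlib-only sketch of that kind of tracking, using gc.DEBUG_SAVEALL to keep would-be-collected objects around long enough to count them by type:

```python
import gc
from collections import Counter

gc.set_debug(gc.DEBUG_SAVEALL)  # collected objects land in gc.garbage instead of being freed
gc.collect()
gc.garbage.clear()              # start from a clean slate

# ... run the suspect workload ...

gc.collect()
print(Counter(type(obj).__name__ for obj in gc.garbage).most_common(20))

gc.set_debug(0)
gc.garbage.clear()
```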
There was an interesting issue in |
@dhirschfeld thanks a ton for these links; they're very helpful. Something like python-trio/trio#1805 (comment) is what we'll want to do to debug this. Also, I wouldn't be surprised if our issue is similar to this Trio one (raising exceptions within frames that hold references to other exceptions/frames/traceback objects), with a similar solution. I've been suspicious of this since seeing how so many of the objects being collected in the debug logs are Cells, Tracebacks, Frames, etc. Intuitively you'd read that sort of code and think "how does adding a |
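To make the pattern concrete, a small self-contained sketch (not code from Trio or distributed) of how storing a caught exception creates a frame/traceback cycle, and why a lone del makes it go away:

```python
class Result:
    """Toy stand-in for a Future-like object (hypothetical)."""
    def __init__(self):
        self.error = None

def run(result):
    try:
        raise ValueError("boom")
    except ValueError as exc:
        # Cycle: result.error -> exc -> exc.__traceback__ -> this frame ->
        # frame local `result` -> result.  Refcounting alone can never free
        # it; only a cyclic GC pass can.
        result.error = exc
    # The frame outlives the call because the traceback references it, and it
    # still holds `result` in its locals.

def run_with_del(result):
    try:
        raise ValueError("boom")
    except ValueError as exc:
        result.error = exc
    # Removing the frame's own reference to `result` breaks one edge of the
    # loop, so plain refcounting cleans everything up once the caller drops
    # `result`; this is why a lone `del` can "fix" a GC problem.
    del result

run(Result())           # leaves a cycle that only gc.collect() will reclaim
run_with_del(Result())  # reclaimed immediately by refcounting
```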
I've gotten some statistics about how turning off GC affects performance.

First, disabling GC on the scheduler gives a 14-37% speedup on our shuffle workload on a 100-worker Coiled cluster. The range is because I've accidentally discovered that this workload runs slower and slower each subsequent time it re-runs on the same cluster, in a concerningly linear way. The doubled lines are with and without

So at trial 1, disabling GC is 14% faster. By trial 10, it's 37% faster. This slowdown is really significant when GC is on: after 10 repetitions, the same operation is 75% slower. FWIW I tried adding a

I also recorded both memory info and the change in CPU times before each trial (@crusaderky). We can see that both user and idle CPU time are higher when GC is active. Also, scheduler memory grows linearly in both cases, but grows much faster when GC is off (so just turning off GC entirely by default certainly does not seem like a viable option). Also reminds me of #3898.

Unless anyone has ideas about what's causing the slowdowns with each repetition, my next steps will be:
|
I'd say two things are particularly of concern:
|
That said, I don't understand the idle time on your graphs. Basically, your numbers show that user time == elapsed time, but idle time also grows a lot (and a lot larger than elapsed time!). |
@pitrou this could just be because the scheduler node has 4 CPUs, but is single-threaded. So 1s of extra wall time will usually give a minimum of 3s of idle time, I'd think. Maybe I'll rerun with 1 CPU for the scheduler so these numbers are more meaningful. |
The growth of memory signals to me that maybe we are genuinely leaking objects. We might consider comparing the counts of types of objects that are around both with and without GC to see if there is some notable difference. For example, maybe we find that lots of TaskState objects exist without GC turned on.

We can probably do this either with the gc module, or with some fancier solution like https://pythonhosted.org/Pympler/muppy.html

This is just a thought here. Other folks here know more than I do.
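A minimal sketch of the gc-module flavour of that comparison (the helper names are made up; Pympler's muppy would give richer summaries):

```python
import gc
from collections import Counter

def type_census():
    # Count live, GC-tracked objects by type name.
    return Counter(type(obj).__name__ for obj in gc.get_objects())

def census_diff(before, after, top=20):
    growth = Counter(after)
    growth.subtract(before)
    return growth.most_common(top)

# Take a census before and after a workload, once with GC enabled and once
# with gc.disable(), and compare which types keep growing (e.g. TaskState).
before = type_census()
# ... run the workload ...
after = type_census()
print(census_diff(before, after))
```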
|
I reran with:
For both workloads, disabling GC gives both better absolute performance and significantly better performance for repeated runs. Looking at only the CPU where the scheduler is running, idle time is very low and consistent between runs. These are all run with

@mrocklin I'm working now on trying to figure out what new objects are being created/leaked between runs. Though I'm currently just starting with a local cluster to figure out what information to collect. |
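One stdlib way to look for those objects, sketched here as an assumption about approach rather than what was actually run, is to diff tracemalloc snapshots taken before and after a trial:

```python
import tracemalloc

tracemalloc.start(25)  # keep up to 25 frames of allocation traceback

baseline = tracemalloc.take_snapshot()
# ... run one trial of the workload ...
snapshot = tracemalloc.take_snapshot()

# Show where the memory that survived the trial was allocated.
for stat in snapshot.compare_to(baseline, "traceback")[:10]:
    print(stat)
    for line in stat.traceback.format()[-3:]:
        print("   ", line)
```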
Based on dask/distributed#4987, ran SF1k w/ GC disabled on the scheduler process, and observed a total benchmark runtime improvement of about 5s. This was consistent across many runs.
the progressive increase of runtime over runs could be explained by a progressive increase of python objects that gc.collect() has to go through at every iteration - read: leak. Do you have a measure of that? Does it correlate with runtime?
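A cheap per-trial measurement of exactly that (a sketch, not data already collected in this thread):

```python
import gc
import time

def gc_pressure_sample():
    # How many tracked objects a full collection has to traverse, and how
    # long such a collection takes right now.
    n_tracked = len(gc.get_objects())
    start = time.perf_counter()
    gc.collect()
    return n_tracked, time.perf_counter() - start

# Call this after every trial and plot both numbers against the trial's
# runtime to see whether they grow together.
```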
|
A few of us (@gjoseph92 @crusaderky @fjetter @jrbourbeau) just met on this topic and came up with a few questions. First, some context. The dask scheduler constructs a massive self-referential data structure. We're curious how this structure interacts with the garbage collector, and we're looking for some general guidance from people like @pitrou (or anyone he may know).
|
In general this is the kind of structure that GC is designed to handle. It's hard to know when a large self-referential object can be cleaned up. So this is what GC checks for. Maybe this is already obvious, but thought it deserved stating to set expectations. |
Is it expensive to walk through such a structure? Are there ways where we can ask the GC to ignore this structure? How about with Cython? |
@gjoseph92 I would be curious if the pauses that you see are only in the oldest generation. I suspect that we could learn some things by tweaking some of the parameters that @pitrou mentions above |
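A sketch of the kind of parameter tweaking meant here, using the stdlib knobs (the numbers are illustrative, not recommendations from this thread):

```python
import gc

print(gc.get_threshold())  # defaults: (700, 10, 10)

# Make young-generation collections rarer and full (oldest-generation)
# collections much rarer, to test whether the long pauses track gen-2 runs.
gc.set_threshold(10_000, 50, 100)

# Or take manual control entirely:
# gc.disable(); ...; gc.collect()  # collect only at points we choose
```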
The same as with a massive non-self-referential data structure. The fact that the data structure is self-referential is irrelevant for the cost of walking the structure.
Basically, it's O(number of objects referenced by the structure).
Not really; there's gc.freeze, but it's very coarse-grained, as you'll see from the description.
Neither. Cython doesn't change how the GC works. |
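For reference, a sketch of what using gc.freeze (Python 3.7+) around the scheduler's long-lived state might look like; whether it actually helps here is exactly the open question:

```python
import gc

# ... build the long-lived, self-referential state (tasks, workers, ...) ...

gc.collect()  # clean up construction garbage first
gc.freeze()   # move everything currently tracked into the permanent
              # generation, so future collections skip it entirely

# Objects created after this point are still tracked and collected normally;
# gc.unfreeze() moves the frozen objects back into the oldest generation.
```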
It's possible to tell Cython not to GC certain Cython extension objects. I'm less clear on whether that will work in our use case or whether it would be a good idea. I should add that Cython extension types already disable features like the

A separate, but also interesting, idea in Cython (that I believe I've raised before) is to use Cython

More generally, while we have seen some pains around GC, I don't think we have narrowed it down to particular objects (though please correct me if things have changed here; I may be a bit out of date). I think this is very important for determining where effort is best spent. For example we might think

To put a finer point on this, we have seen that turning off GC seems to help during |
FYI @wence-, this thread, and especially #4987 (comment), might be interesting to you with the profiling you're doing. |
@gjoseph92 noticed that, under some profiling conditions, turning off garbage collection had a significant impact on scheduler performance. I'm going to include some notes from him in the summary below
Notes from Gabe
See #4825 for initial discussion of the problem. It also comes up on #4881 (comment).
I've also run these with GC debug mode on (gjoseph92/dask-profiling-coiled@c0ea2aa1) and looked at GC logs. Interestingly GC debug mode generally reports GC as taking zero time:
Some of those logs are here: https://rawcdn.githack.com/gjoseph92/dask-profiling-coiled/61fc875173a5b2f9195346f2a523cb1d876c48ad/results/cython-shuffle-gc-debug-noprofiling-ecs-prod-nopyspy.txt?raw=true
The types of objects being listed as collectable are interesting (cells, frames, tracebacks, asyncio Futures/Tasks, SelectorKey), since those are the sorts of things you might expect to create cycles. It's also interesting that there are already ~150k objects in the oldest generation before the computation has even started, and ~300k (and growing) once it's been running for a little bit.
I've also tried turning off:
But none of those affected the issue.
What I wanted to do next was use refcycle or objgraph or a similar tool to try to see what's causing the cycles. Or possibly use tracemalloc + GC hooks to try to log where the objects that were being collected were initially created.
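That tracemalloc-plus-GC-hooks idea might look roughly like this (a sketch of the approach, not code that was actually written):

```python
import gc
import tracemalloc
from collections import Counter

tracemalloc.start(10)
gc.set_debug(gc.DEBUG_SAVEALL)  # keep would-be-collected objects in gc.garbage

def report_cycle_origins(top=10):
    gc.collect()
    by_site = Counter()
    for obj in gc.garbage:
        tb = tracemalloc.get_object_traceback(obj)
        if tb is not None:
            frame = tb[-1]  # most recent frame: where the object was allocated
            by_site[f"{frame.filename}:{frame.lineno}"] += 1
    gc.garbage.clear()
    for site, count in by_site.most_common(top):
        print(count, site)
```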
I notice that we have reference cycles in our scheduler state
Should we be concerned about our use of reference cycles?
cc @jakirkham @pitrou