Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update GC docs for incremental collection. #1379

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

markshannon
Copy link
Member

@markshannon markshannon commented Aug 23, 2024

Copy link
Member

@brandtbucher brandtbucher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking the time to do this. A couple of notes:

(Also, @savannahostrowski: this PR diff is a good summary of the differences between the GC implementation you've been working with and the current devguide article, which is outdated.)



To collect all unreachable cycles in the heap, the garbage collector must scan the
whole heap. This whole heap scan is called a cycle.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe consider a different term here (and throughout below) since "cycle" already has an important meaning here and it can be confusing if we overload it:

Suggested change
whole heap. This whole heap scan is called a cycle.
whole heap. This whole heap scan is called a full collection.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe instead of cycle use "full scavenge".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A collection is an increment, so that's not a good term.
Cycle is overloaded, so that's not great either.
"Scavenge" is the least ambiguous, but obscure.

Anyone have any other suggestions? I'll go with "scavenge" if not.

* All any objects reachable from those objects that have not yet been scanned this cycle.

Any objects surviving this collection are moved to the old generation.
ollection from cycles.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ollection from cycles.

* The oldest fraction of the old generation
* All any objects reachable from those objects that have not yet been scanned this cycle.

Any objects surviving this collection are moved to the old generation.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe clarify that the objects that started in the old generation are considered "youngest of the old" instead of "oldest of the old" now (there's probably a better way of phrasing it).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It isn't really the oldest, it is the "least recently scanned". I'll rework this section.

implementation for the default build uses incremental collection with two
generations.

The purpose of generations is to take advantage of what is known as the weak
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The purpose of generations is to take advantage of what is known as the weak
Generational garbage collection takes advantage of what is known as the weak

two generations: young and old. Every new object starts in the young generation.


To collect all unreachable cycles in the heap, the garbage collector must scan the
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps add detection as distinct from collection.

Suggested change
To collect all unreachable cycles in the heap, the garbage collector must scan the
To detect and collect all unreachable cycles in the heap, the garbage collector must scan the

To collect all unreachable cycles in the heap, the garbage collector must scan the
whole heap. This whole heap scan is called a cycle.

In order to limit the time each garbage collection takes, the previous algorithm
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
In order to limit the time each garbage collection takes, the previous algorithm
To limit the time each garbage collection takes, the detection and collection algorithm


* The young generation
* The oldest fraction of the old generation
* All any objects reachable from those objects that have not yet been scanned this cycle.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* All any objects reachable from those objects that have not yet been scanned this cycle.
* All objects reachable from those objects that have not yet been scanned this cycle.



To collect all unreachable cycles in the heap, the garbage collector must scan the
whole heap. This whole heap scan is called a cycle.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe instead of cycle use "full scavenge".

internals/garbage-collector.rst Outdated Show resolved Hide resolved
When all objects in the heap have been scanned a cycle ends, and all objects are
considered unscanned again.

In order to collect all unreachable cycles, each increment must contain all of
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make sure that I understand this section on unreachable cycles, I think you're saying that we want to ensure that we fully capture the unreachable cycle because we want to ensure that the cycle is either fully gc'd or not, to avoid partial processing, which could be problematic later on?

If so, maybe it's worth mentioning explicitly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't scan the full cycle at once, we cannot collect it. It is otherwise safe.

The old generational collector would scan part cycles all the time; it just delayed the collection of the cycle, at worst until a full collection.

With the incremental collector, if we only scan part of a cycle, it may never be collected. Which would be a problem.

@markshannon
Copy link
Member Author

Thanks for the comments and suggestions.

Survivors are moved to the back of the scanned list. The old part of increment is taken
from the front of the unscanned list.

When a cycle starts, no objects in the heap are considered to have been scanned.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using cycle of a "reference cycle" and a "gc execution" it's a bit confusing, can we use other terminology?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've used "full scavenge"

In order to make sure that the whole of any unreachable cycle is contained in an
increment, all unscanned objects reachable from any object in the increment must
be included in the increment.
Thus, to form a complete increment we perform a transitive closure over reachable,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As much as I enjoy proper terminology myself. Many people reading this document may not immediately know what a transitive closure is, could you add a brief explanation in parentheses or just explain the process?


* The young generation
* The least recently scanned fraction of the old generation.
* All objects reachable from those objects that have not yet been scanned this cycle.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as my other comments about "cycle"

unscanned objects from the initial increment.
We can exclude scanned objects, as they must have been reachable when scanned.
If a scanned object becomes part of an unreachable cycle after being scanned, it
will not be collected this cycle, but it will be collected next cycle.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can benefit from a concrete simple example showing how this can happen and how will be eventually cleaned. Either in English or with some diagram or pseudo code

generations. Every collection operates on the entire heap.
two generations: young and old. Every new object starts in the young generation.

To detect and collect all unreachable cycles in the heap, the garbage collector
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps use objects instead of cycles here.

Suggested change
To detect and collect all unreachable cycles in the heap, the garbage collector
To detect and collect all unreachable objects in the heap, the garbage collector

of three parts:

* The young generation
* The least recently scanned fraction of the old generation.
Copy link
Collaborator

@willingc willingc Aug 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* The least recently scanned fraction of the old generation.
* The old generation's least recently scanned heap found in the scanned objects list


* The young generation
* The least recently scanned fraction of the old generation.
* All objects reachable from those objects that have not yet been scanned this cycle.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* All objects reachable from those objects that have not yet been scanned this cycle.
* All objects reachable from those objects that have not yet been scanned.

* The least recently scanned fraction of the old generation.
* All objects reachable from those objects that have not yet been scanned this cycle.

Any objects surviving this collection are moved to the old generation.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Any objects surviving this collection are moved to the old generation.
Any young generation objects surviving this collection are moved to the old generation, and reachable objects in the old generation remain in the old generation.

* All objects reachable from those objects that have not yet been scanned this cycle.

Any objects surviving this collection are moved to the old generation.
The old generation is composed of two lists, scanned and unscanned.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The old generation is composed of two lists, scanned and unscanned.
The old generation is composed of two lists: scanned increments and unscanned.

@mr-bronson
Copy link

Please note the inconsistency in argument or variable names:

This devguide speaks of threshold_0 and threshold_1 (note the underscore), but also threshold2 (no underscore). If the aim is to match the main gc documentation, the underscores should be omitted.

If a scanned object becomes part of an unreachable cycle after being scanned, it
will not be collected this cycle, but it will be collected next cycle.

The GC implementation for the free-threaded build does not use incremental collection.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This probably should be called out in an Important block.

will not be collected this cycle, but it will be collected next cycle.

The GC implementation for the free-threaded build does not use incremental collection.
Every collection operates on the entire heap.

In order to decide when to run, the collector keeps track of the number of object
allocations and deallocations since the last collection. When the number of
allocations minus the number of deallocations exceeds ``threshold_0``,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
allocations minus the number of deallocations exceeds ``threshold_0``,
allocations minus the number of deallocations exceeds ``threshold0``,

Comment on lines 415 to 420
collection starts. ``threshold_1`` determines the fraction of the old
collection that is included in the increment.
The fraction is inversely proportional to ``threshold_1``,
as historically a larger ``threshold_1`` meant that old generation
collections were performed less frequency.
``threshold2`` is ignored.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct as mentioned in @mr-bronson's message. Good catch @mr-bronson. 😄

Suggested change
collection starts. ``threshold_1`` determines the fraction of the old
collection that is included in the increment.
The fraction is inversely proportional to ``threshold_1``,
as historically a larger ``threshold_1`` meant that old generation
collections were performed less frequency.
``threshold2`` is ignored.
collection starts. ``threshold1`` determines the fraction of the old
collection that is included in the increment.
The fraction is inversely proportional to ``threshold1``,
as historically a larger ``threshold1`` meant that old generation
collections were performed less frequency.
``threshold2`` is ignored.

Copy link
Collaborator

@willingc willingc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good to me! Thanks @markshannon.

@pablogsal
Copy link
Member

@willingc
Copy link
Collaborator

Garbage collection doc has been moved to the CPython repo. Unless there is an objection, I recommend closing this devguide issue.

@@ -399,6 +439,10 @@ more information. These thresholds can be examined using the
The content of these generations can be examined using the
``gc.get_objects(generation=NUM)`` function and collections can be triggered
specifically in a generation by calling ``gc.collect(generation=NUM)``.
Prior to 3.13, there we three generations. For that reason the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Prior to 3.13, there we three generations. For that reason the
Prior to 3.14, there we three generations. For that reason the

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants