
server: dynamically increase compaction concurrency during downloads #116633

Merged
merged 3 commits into cockroachdb:master from or/download-concurrency on Dec 18, 2023

Conversation

dt
Member

@dt dt commented Dec 17, 2023

Early in an online restore, a workload may not be able to utilize
all CPU due to the high latency of accessing still-remote data, meaning
that many or most queries spend most of their time waiting rather than
executing. During this phase, increasing the compaction concurrency can
expedite downloading the data, reducing that latency and improving
workload performance.

Later in the download phase, however, once more of the most accessed data
has been downloaded, the workload itself may be able to execute fast enough
that it requires the majority of available CPU. At this point, excessive
CPU usage by download compactions will actually degrade workload
performance.

Thus it is desirable to maximize compaction concurrency while CPU is available
but reduce it when CPU becomes scarce.

This change introduces an additional goroutine in the download call that
monitors CPU usage and adjusts compaction concurrency up or down when
CPU usage is below 70% or above 80%, respectively.

Release note: none.
Epic: none.
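For illustration, here is a minimal sketch of the adjustment loop described above; the helper names, signatures, and polling interval are assumptions for the sketch rather than the actual code in this PR:

```go
package downloadctl

import (
	"context"
	"time"
)

// engine is a stand-in for the storage engine interface; the real method is
// the delta-based concurrency adjustment added in this PR.
type engine interface {
	// AdjustCompactionConcurrency changes the compaction concurrency by
	// delta and returns the new value (never below 1).
	AdjustCompactionConcurrency(delta int64) uint64
}

// monitorDownloadConcurrency runs for the duration of a download, raising
// compaction concurrency while CPU usage stays below 70% and lowering it
// again once usage rises above 80%. cpuUsage is a placeholder for whatever
// CPU-utilization sampler the node exposes (a value in [0, 1]).
func monitorDownloadConcurrency(ctx context.Context, e engine, cpuUsage func() float64) {
	var added int64
	t := time.NewTicker(15 * time.Second) // polling interval is illustrative
	defer t.Stop()
	defer func() {
		// Give back whatever extra concurrency we granted once the download
		// (or the surrounding context) finishes.
		if added > 0 {
			e.AdjustCompactionConcurrency(-added)
		}
	}()
	for {
		select {
		case <-ctx.Done():
			return
		case <-t.C:
		}
		switch u := cpuUsage(); {
		case u < 0.70:
			e.AdjustCompactionConcurrency(1)
			added++
		case u > 0.80 && added > 0:
			e.AdjustCompactionConcurrency(-1)
			added--
		}
	}
}
```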

dt added 2 commits December 17, 2023 13:31
These are still executed only at the compaction concurrency limit, but
having more of them in the queue allows us to utilize higher
compaction concurrency if it is configured.

Release note: none.
Epic: none.
This adds a new method to the Engine API to increase or decrease the
compaction concurrency by some amount. It is added alongside the
existing API for setting the compaction concurrency to a specific
value, as it allows callers to increase or decrease concurrency
without needing to know the current value.

Release note: none.
Epic: none.
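For context, an illustrative shape of that Engine API addition; the exact names and signatures in pkg/storage may differ slightly:

```go
package enginesketch

// Engine sketch: the delta-based adjuster sits alongside the existing
// absolute setter, so callers don't have to read the current value first.
type Engine interface {
	// SetCompactionConcurrency sets the compaction concurrency to an
	// absolute value (the pre-existing API).
	SetCompactionConcurrency(n uint64) uint64

	// AdjustCompactionConcurrency changes the compaction concurrency by
	// delta (positive or negative) and returns the resulting value.
	AdjustCompactionConcurrency(delta int64) uint64
}
```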
@dt dt requested review from adityamaru and msbutler December 17, 2023 16:01
@dt dt requested a review from a team as a code owner December 17, 2023 16:01
@dt dt requested a review from jbowens December 17, 2023 16:01

blathers-crl bot commented Dec 17, 2023

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Member

This change is Reviewable

Member

@RaduBerinde RaduBerinde left a comment


The API doesn't work well with SetCompactionConcurrency: it can error out and/or restore the incorrect value.

Should we maybe make this specific to download compactions? Like the compaction limit would be X concurrent non-download compactions + Y concurrent download compactions, and this API would just adjust Y.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @jbowens, and @msbutler)

@dt
Member Author

dt commented Dec 18, 2023

it can error out and/or restore the incorrect value.

I thought about this a bit, which is why I also added the CAS loop checking that it isn't setting it <1.

I think I'm OK with the sorta meh interaction though, at least until pebble exposes a separate download compaction limit and we switch this to control that. It can't reduce to below 1 -- it'll fail and log if it attempts that, e.g. if you'd manually reduced compaction concurrency by more than it had upped it. And if you manually upped it to some bigger number, well, this will reduce that big number by the amount it had automatically upped it before. Presumably when you upped it you picked a number accounting for the fact that it was temporarily doing extra (download) compactions? 🤷

Also, I think there's already an implicit assumption that crdb_internal.set_compaction_concurrency has external coordination around mutual exclusion, since its undo behavior isn't CAS'ed, so it already restores to "incorrect" values.

@dt
Member Author

dt commented Dec 18, 2023

My inclination is to file something to separate the limits as you suggest so we can come back to this later, but not to let potential interactions with an internal, undocumented built-in hold up iterating on online restore, especially since the interactions aren't that worrying?

@RaduBerinde
Member

Sure. One thing I would do here is to cap the number at 1 instead of causing an error in the download (e.g. if something else sets the compaction limit lower).

@dt dt force-pushed the or/download-concurrency branch from b74d996 to 66ebeba on December 18, 2023 02:59
@dt
Member Author

dt commented Dec 18, 2023

cap the number to 1 instead of causing an error to the download

done.
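For reference, the clamping behavior described above, sketched as an atomic compare-and-swap loop; this assumes the concurrency is held in an atomic counter and is not a copy of the actual pkg/storage/pebble.go implementation:

```go
package concurrencysketch

import "sync/atomic"

// adjustClamped applies delta to the stored compaction concurrency but never
// lets the result drop below 1, rather than returning an error to the download.
func adjustClamped(cur *atomic.Uint64, delta int64) uint64 {
	for {
		old := cur.Load()
		next := int64(old) + delta
		if next < 1 {
			next = 1 // cap at 1 instead of failing
		}
		if cur.CompareAndSwap(old, uint64(next)) {
			return uint64(next)
		}
	}
}
```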

Collaborator

@msbutler msbutler left a comment


Ideally, we could unit test the loop that adjusts compaction concurrency, but since this is a prototype, I'm ok delaying the unit test until we develop a more principled method for adjusting concurrency.

@itsbilal
Member

I'm good with this as a stopgap for now. However, I think a better longer-term solution would be to have a separate download compaction concurrency in Pebble (it could be defined as "additional download compaction concurrency"), which would then be used only by downloads and not by other compactions. Pebble already has a Download() method, so theoretically this change doesn't need to be in Cockroach at all.
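Sketched purely as an illustration of that longer-term idea: the download-specific field below is hypothetical and did not exist in Pebble at the time of this PR, while MaxConcurrentCompactions mirrors Pebble's existing regular-compaction limit.

```go
package pebblesketch

// Options shows the hypothetical shape of a separate download concurrency knob.
type Options struct {
	// MaxConcurrentCompactions limits regular compactions.
	MaxConcurrentCompactions func() int

	// MaxConcurrentDownloads would be an additional allowance used only by
	// compactions issued from Download(), so downloads never compete with
	// regular compactions for slots.
	MaxConcurrentDownloads func() int
}
```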

Member

@RaduBerinde RaduBerinde left a comment


:lgtm:

Please file an issue to keep track of this; ideally we would have a bucket for Online Restore issues related to making the feature as robust as possible.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @adityamaru, @dt, and @jbowens)


pkg/storage/pebble.go line 933 at r4 (raw file):

}

// AdjustCompactionConcurrency adjusts the compaction concurrency up or down by

[nit] move/copy this comment to the interface (the comment there is not very useful)

@dt dt force-pushed the or/download-concurrency branch from 66ebeba to acff1f2 on December 18, 2023 17:34
@dt
Member Author

dt commented Dec 18, 2023

#116679

@dt
Member Author

dt commented Dec 18, 2023

TFTRs!

bors r+

@craig
Contributor

craig bot commented Dec 18, 2023

Build succeeded:

@craig craig bot merged commit 6fae6a7 into cockroachdb:master Dec 18, 2023
8 of 9 checks passed
@dt dt deleted the or/download-concurrency branch December 18, 2023 19:54