Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

crosscluster/physical: split PCR procs dest side instead #135637

Merged
merged 1 commit into from
Nov 19, 2024

Conversation

dt
Copy link
Member

@dt dt commented Nov 18, 2024

This allows the dest clsuter to choose to split less when it has fewer nodes.

Release note: none.
Epic: none.

@dt dt requested a review from a team as a code owner November 18, 2024 22:02
Copy link

blathers-crl bot commented Nov 18, 2024

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@dt
Copy link
Member Author

dt commented Nov 18, 2024

I think this should be fine in mixed verisons without any sort of cross-cluster flags:

  • If src is newer than dst, we just fall back to one proc per src node.
  • If dst is newer, it just doesn't re-split since it is already at target size.

This allows the dest clsuter to choose to split less when it has fewer nodes.

Release note: none.
Epic: none.
@@ -320,14 +310,6 @@ func buildReplicationStreamSpec(
return nil, err
}

// If more partitions were requested, try to repartition the spans.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

glad to see this logic moved to the consumer.


// If we have fewer partitions than we have nodes, try to repartition the
// topology to have more partitions.
topology = repartitionTopology(topology, len(sqlInstanceIDs)*8)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any reason you wanted to remove this tunable cluster setting? Like, if we encounter another pcr oom, it may be nice to have it.

Copy link
Member Author

@dt dt Nov 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The OOM was in smaller dest clusters than src clusters where we'd create >8 procs per node, whereas now we are using the src cluster size directly here, so we no longer risk that specific behavior. The setting also wasn't settable in CC where the sys tenant is inaccessible so I think having the code tune itself, as we do now, rather than having a tuning knob, is what we need anyway.

out.Partitions = make([]streamclient.PartitionInfo, 0, targetPartCount)
// For each partition in the input, put some number of copies of it into the
// output each containing some fraction of its spans.
for _, p := range in.Partitions {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you removed the round robin behavior from the previous repartition algorithm. why? I'm referring to this code block from the old func:

	for x, sp := range partitions[part].Spans {
			repartitioned[x%parts].Spans = append(repartitioned[x%parts].Spans, sp)
		}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We like adjacent spans in the same proc -- we added a whole new planning mode to get it! -- and round-robin seems antithetical to that, so I think I was just misguided when I added it initially. If I'm wrong we can being it back.

@dt
Copy link
Member Author

dt commented Nov 19, 2024

TFTR!

bors r+

@craig craig bot merged commit 2b243f2 into cockroachdb:master Nov 19, 2024
22 of 23 checks passed
dt added a commit to dt/cockroach that referenced this pull request Nov 25, 2024
Release note: none.
Epic: none.

This regressed in cockroachdb#135637 which was assigning all conusmer sub-partitions the whole partition due
to sharing the original token.
dt added a commit to dt/cockroach that referenced this pull request Nov 26, 2024
Release note: none.
Epic: none.

This regressed in cockroachdb#135637 which was assigning all conusmer sub-partitions the whole partition due
to sharing the original token.
dt added a commit to dt/cockroach that referenced this pull request Nov 26, 2024
Release note: none.
Epic: none.

This regressed in cockroachdb#135637 which was assigning all conusmer sub-partitions the whole partition due
to sharing the original token.
@dt dt deleted the pcr-shard-dest-side branch November 26, 2024 16:20
craig bot pushed a commit that referenced this pull request Nov 26, 2024
135564: build/teamcity: add changes to enable openmetrics in nightly roachtests r=nameisbhaskar a=sambhav-jain-16

There was a regression in #135239.
The change was reverted and this PR aims to fix the regressions

Epic: https://cockroachlabs.atlassian.net/browse/CRDB-41852

Release note: None

135596: security: bugfix, ensure cert expiry metrics reflect reloaded certs r=angles-n-daemons a=angles-n-daemons

security: bugfix, ensure cert expiry metrics reflect reloaded certs

The PR #130110 added certificate TTL metrics alongside our existing expiration metrics. Prior to that change, the certificate metrics values were updated on each metrics load. Afterwards, new metrics objects were created for each load of certificates.

This created a bug in that the new expiration values would not be found in any of the system exhaust (metrics scrape or tsdb) because the registered metrics objects were the ones created on startup.

This new change instead allows the metrics to close the whole CertificateManager object, so that they only need to be created once, and therefore the initial registration of metrics reflects persistently valid values.

Release note (bug fix): security.certificate.* metrics will now be updated if a node loads new certificates while running.

Epic: none
Fixes: #135093

136122: crosscluster/physical: update tokens when altering topology r=dt a=dt

Release note: none.
Epic: none.

This regressed in #135637 which was assigning all conusmer sub-partitions the whole partition due to sharing the original token.

136182: kvserver: use leader leases in various flow control tests r=kvoli a=arulajmani

See individual commits for details. 

Co-authored-by: Sambhav Jain <[email protected]>
Co-authored-by: Brian Dillmann <[email protected]>
Co-authored-by: David Taylor <[email protected]>
Co-authored-by: Arul Ajmani <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants