Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvserver/reports: replication_constraint_stats use voter constraints #84727

Merged
merged 1 commit into from
Sep 2, 2022

Conversation

aliher1911
Copy link
Contributor

@aliher1911 aliher1911 commented Jul 20, 2022

Previously replication_constraint_stats were using zone constraints regardless
of the fact that num_voters and a separate voter constraints were provided.
This was causing report to have violations because range had less voters than
replicas, but still were trying to check if we have as many voters as
full replica constraints.
This was causing ambiguity in reports as there were no violation and no way
to figure out which constraint is being violated.
This patch changes report to use voter constraints if they are provided or
use voters and non voters if voter count is configured lower than replica
count, but voter constraints are not provided.

Release note (bug fix): table system.replication_constraint_stats is not
showing erroneous voter contraint violations when num_voters is configured.

Release justification: bugfix to an issue which also needs backporting.

Fixes #84695

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@aliher1911 aliher1911 requested a review from andreimatei July 20, 2022 16:42
@aliher1911
Copy link
Contributor Author

@andreimatei what do you think? I don't want to polish all the formatting, tests etc if that looks completely off.

@aliher1911 aliher1911 force-pushed the constraints_stats_report branch 2 times, most recently from a212342 to c10a9e1 Compare July 20, 2022 19:11
@aliher1911 aliher1911 marked this pull request as ready for review July 20, 2022 21:58
@aliher1911 aliher1911 requested a review from a team as a code owner July 20, 2022 21:58
Copy link
Contributor

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: The negative present tense in the release note is weird. Say that a bug causing bogus constraint violations was fixed, and don't say anything about "voters" and "num_voter" - these things won't make sense to users.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911 and @andreimatei)


pkg/kv/kvserver/reports/constraint_stats_report.go line 435 at r1 (raw file):

			}
			constraints = zone.Constraints
			// If we have number of voters configured separately, then we need to use

I've changed my mind from what I think I said on Slack: for this report, I think we should always consider voters and non-voters (excluding VOTER_DEMOTING and LEARNER; I think excluding these ephemeral guys makes sense). For the replication_stats report we ignore non-voters (for the over-replicated and under-replicated ranges counts); for that table, fitting non-voters in those columns didn't make sense, which is why we need new columns. But for this report here, there's no equivalent argument.

So, I think we should do both:

  • check the union of voters and non-voters against constraints
  • check voters against voter_constraints if voter_constraints is set

The report allows for different types of violations - so I'm thinking we would now introduce a 2nd one for the voter_constraints.


pkg/kv/kvserver/reports/reporter.go line 207 at r1 (raw file):

		return storeDescs
	}
	// This resolver is similar to the one above but uses different filter. This

I kinda doubt that adding a new argument to getStoresFromGossip would really make backporting harder. Even if it would, this does not look to me like the kind of patch that needs to be developed with backporting in mind; it's quite small. So, as a matter of principle, let's get a patch that we like without any constraints and then adapt the backport, or don't backport at all in favor of original work on the old branch. The beauty standards on an old branch can be lower.

@aliher1911
Copy link
Contributor Author

Something in these lines:

[email protected]:26257/movr> select * from system.replication_constraint_stats;
  zone_id | subzone_id |       type       |         config         | report_id |        violation_start        | violating_ranges
----------+------------+------------------+------------------------+-----------+-------------------------------+-------------------
      104 |          0 | constraint       | +region=europe-west2:1 |         1 | NULL                          |                0
      104 |          0 | constraint       | +region=us-east1:2     |         1 | NULL                          |                0
      104 |          0 | constraint       | +region=us-west1:2     |         1 | NULL                          |                0
      104 |          0 | voter_constraint | +region=us-east1:2     |         1 | 2022-07-21 09:58:37.871234+00 |               13
(4 rows)

seem pretty easy to add actually. I like this.

@aliher1911 aliher1911 force-pushed the constraints_stats_report branch from c10a9e1 to c4d230b Compare July 21, 2022 17:33
@aliher1911 aliher1911 requested a review from andreimatei July 21, 2022 18:08
@aliher1911 aliher1911 force-pushed the constraints_stats_report branch from c4d230b to c62568b Compare July 21, 2022 20:09
Copy link
Contributor

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911 and @andreimatei)


pkg/kv/kvserver/reports/constraint_stats_report.go line 347 at r2 (raw file):

type constraintConformanceVisitor struct {
	cfg                        *config.SystemConfig
	voterStoreResolver         StoreResolver

let's remove the filtering by replica type from StoreResolver. Let's also change it's interface to take a store id and return a store descriptor, rather than take a range descriptor and return the store descriptors in bulk.


pkg/kv/kvserver/reports/constraint_stats_report.go line 461 at r3 (raw file):

func voterNonVoterPredicate(r roachpb.ReplicaDescriptor) bool {
	switch r.GetType() {
	case roachpb.VOTER_FULL, roachpb.VOTER_INCOMING, roachpb.VOTER_OUTGOING, roachpb.VOTER_DEMOTING_NON_VOTER, roachpb.VOTER_DEMOTING_LEARNER, roachpb.NON_VOTER:

you could have this delegate to voterPredicate


pkg/kv/kvserver/reports/constraint_stats_report.go line 470 at r3 (raw file):

func voterPredicate(r roachpb.ReplicaDescriptor) bool {
	switch r.GetType() {
	case roachpb.VOTER_FULL, roachpb.VOTER_INCOMING, roachpb.VOTER_OUTGOING, roachpb.VOTER_DEMOTING_NON_VOTER, roachpb.VOTER_DEMOTING_LEARNER:

I'm not entirely sure about how we should treat VOTER_DEMOTING_* (and VOTER_OUTGOING, if they still exist). I think it's not entirely black and white.
What you have now, I think, will trivially produce violations when a voter is moved from a node to another (within the same locality, say) - you're gonna have a VOTER_DEMOTING and a VOTER_INCOMING at the same time, and that will be in violation of constraints that precisely place voters in regions. I think that'd be bad.
I don't think ignoring VOTER_DEMOTING is entirely satisfying either, particularly in the case of negative constraints. As these become more important (perhaps for domiciling, whatever), simply lying about data not being where it's not supposed to be seems hard to defend.
So I think we should take a page out of both of the other reports, and treat the two quorums separately. Run the constraint checker against each quorum, individually, and declare the range to be in violation if either quorum is in violation.


pkg/kv/kvserver/reports/constraint_stats_report.go line 484 at r3 (raw file):

	constraints []zonepb.ConstraintsConjunction,
	storeResolver StoreResolver,
	replicaFilter func(rDesc roachpb.ReplicaDescriptor) bool,

I think it'd be better if the coupling between Constraint and filter1 and VoterConstraint and filter2 would be coupled at the bottom of the stack, in this function. So make (a single call to) this countRange report both violations for a range. That's also what the function name suggests.


pkg/kv/kvserver/reports/critical_localities_report.go line 400 at r3 (raw file):

func criticalReplicasPred(r roachpb.ReplicaDescriptor) bool {
	switch r.GetType() {
	case roachpb.VOTER_FULL, roachpb.VOTER_OUTGOING, roachpb.VOTER_DEMOTING_NON_VOTER, roachpb.VOTER_DEMOTING_LEARNER:

This one wants INCOMING too. A locality is critical if it's critical for the incoming quorum or the outgoing quorum.

But I think this whole thing is fragile. We're making assumptions here about which descriptors CanMakeProgress will inquire about. Let's pass to processLocalityForRange a function that can resolve any store ID (by delegating to v.storeResolver. Nobody needs to care what kind of replica that store corresponds to, or indeed whether it corresponds to a replica at all.


pkg/kv/kvserver/reports/reporter.go line 193 at r3 (raw file):

	allStores := stats.storePool.GetStores()
	// Used to be voter descriptors

sup with this comment


pkg/kv/kvserver/reports/reporter.go line 511 at r3 (raw file):

}

// StoreResolver is a function resolving a range to a store descriptor for each

comment is stale


pkg/kv/kvserver/reports/constraint_stats_report_test.go line 904 at r3 (raw file):

	}
	allLocalities := expandLocalities(nodeLocalities)
	nodeResolver := func(rs []roachpb.ReplicaDescriptor) []roachpb.StoreDescriptor {

I think it's still storeResolver, isn't it?


pkg/kv/kvserver/reports/critical_localities_report_test.go line 134 at r2 (raw file):

				// actually need to rethink report if it should only look on voters
				// and incoming voters but also should it include outgoing?
				// {object: "t6", locality: "region=reg1,az=az2", atRiskRanges: 1},

This is getting fixed right?
Btw, I think in situations like this it's better to keep the test in, with a bad expectation, and a comment about the bug. Then, if the bug is fixed, the test will fail and remind you to clean up the comment.

@aliher1911 aliher1911 force-pushed the constraints_stats_report branch from c62568b to b6bcb3a Compare July 22, 2022 09:37
Copy link
Contributor Author

@aliher1911 aliher1911 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911 and @andreimatei)


pkg/kv/kvserver/reports/constraint_stats_report.go line 461 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

you could have this delegate to voterPredicate

Done.


pkg/kv/kvserver/reports/constraint_stats_report.go line 484 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I think it'd be better if the coupling between Constraint and filter1 and VoterConstraint and filter2 would be coupled at the bottom of the stack, in this function. So make (a single call to) this countRange report both violations for a range. That's also what the function name suggests.

Done.


pkg/kv/kvserver/reports/critical_localities_report.go line 400 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

This one wants INCOMING too. A locality is critical if it's critical for the incoming quorum or the outgoing quorum.

But I think this whole thing is fragile. We're making assumptions here about which descriptors CanMakeProgress will inquire about. Let's pass to processLocalityForRange a function that can resolve any store ID (by delegating to v.storeResolver. Nobody needs to care what kind of replica that store corresponds to, or indeed whether it corresponds to a replica at all.

Looks like original filtering of ranges was a byproduct of sharing rangeResolver in the first place. So having a bare resolver is the correct way here as the decision is made inside canMakeProgress().


pkg/kv/kvserver/reports/reporter.go line 207 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I kinda doubt that adding a new argument to getStoresFromGossip would really make backporting harder. Even if it would, this does not look to me like the kind of patch that needs to be developed with backporting in mind; it's quite small. So, as a matter of principle, let's get a patch that we like without any constraints and then adapt the backport, or don't backport at all in favor of original work on the old branch. The beauty standards on an old branch can be lower.

Done.


pkg/kv/kvserver/reports/constraint_stats_report_test.go line 904 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I think it's still storeResolver, isn't it?

Done.


pkg/kv/kvserver/reports/critical_localities_report_test.go line 134 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

This is getting fixed right?
Btw, I think in situations like this it's better to keep the test in, with a bad expectation, and a comment about the bug. Then, if the bug is fixed, the test will fail and remind you to clean up the comment.

It works now.
How would we get broken test without making build fail or do you mean keep test breaking while PR is being reviewed?

@aliher1911 aliher1911 force-pushed the constraints_stats_report branch from b6bcb3a to e26750e Compare July 22, 2022 10:33
Copy link
Contributor Author

@aliher1911 aliher1911 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911 and @andreimatei)


pkg/kv/kvserver/reports/reporter.go line 193 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

sup with this comment

Done.


pkg/kv/kvserver/reports/reporter.go line 511 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

comment is stale

Done.

@aliher1911 aliher1911 force-pushed the constraints_stats_report branch 2 times, most recently from 71ae3ae to aee49a4 Compare July 22, 2022 10:55
Copy link
Contributor Author

@aliher1911 aliher1911 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei)


pkg/kv/kvserver/reports/constraint_stats_report.go line 435 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I've changed my mind from what I think I said on Slack: for this report, I think we should always consider voters and non-voters (excluding VOTER_DEMOTING and LEARNER; I think excluding these ephemeral guys makes sense). For the replication_stats report we ignore non-voters (for the over-replicated and under-replicated ranges counts); for that table, fitting non-voters in those columns didn't make sense, which is why we need new columns. But for this report here, there's no equivalent argument.

So, I think we should do both:

  • check the union of voters and non-voters against constraints
  • check voters against voter_constraints if voter_constraints is set

The report allows for different types of violations - so I'm thinking we would now introduce a 2nd one for the voter_constraints.

Done.


pkg/kv/kvserver/reports/constraint_stats_report.go line 347 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

let's remove the filtering by replica type from StoreResolver. Let's also change it's interface to take a store id and return a store descriptor, rather than take a range descriptor and return the store descriptors in bulk.

Done.


pkg/kv/kvserver/reports/constraint_stats_report.go line 470 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I'm not entirely sure about how we should treat VOTER_DEMOTING_* (and VOTER_OUTGOING, if they still exist). I think it's not entirely black and white.
What you have now, I think, will trivially produce violations when a voter is moved from a node to another (within the same locality, say) - you're gonna have a VOTER_DEMOTING and a VOTER_INCOMING at the same time, and that will be in violation of constraints that precisely place voters in regions. I think that'd be bad.
I don't think ignoring VOTER_DEMOTING is entirely satisfying either, particularly in the case of negative constraints. As these become more important (perhaps for domiciling, whatever), simply lying about data not being where it's not supposed to be seems hard to defend.
So I think we should take a page out of both of the other reports, and treat the two quorums separately. Run the constraint checker against each quorum, individually, and declare the range to be in violation if either quorum is in violation.

I think that voter_incoming vs voter_outgoing is a fair concern, and we can address that by ignoring VOTER_INCOMING when we evaluate voter_constraints.
For data domiciling part, we are lying currently, but that would be a separate concern from the bug we are trying to fix. We can include both voter incoming and non voter replicas for now, but not sure where learner replica falls, because it already has data if we are worried about that.

Copy link
Contributor

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @ajstorm, @andreimatei, @kvoli, @lidorcarmel, @nvanbenschoten, @RichardJCai, and @tbg)


pkg/kv/kvserver/reports/constraint_stats_report.go line 470 at r3 (raw file):

but not sure where learner replica falls

Good point.

What you have now, I think, will trivially produce violations when a voter is moved from a node to another (within the same locality, say) - you're gonna have a VOTER_DEMOTING and a VOTER_INCOMING at the same time, and that will be in violation of constraints that precisely place voters in regions. I think that'd be bad.

I think I was wrong about this, wasn't I? The positive constraints we have are not "exactly this many replicas should be in a locality"; they're "at least this many replicas". The negative constraints don't permit replica counts, I think; they only permit saying "no replicas should be in this locality". So, for positive constraints, having more replicas never hurts (it hurts for the overreplication report, but we have a story there), and for negative constraints, any replica is bad - including learners.

So here's my current thought about what we should do:

  1. For negative constraints, consider all replicas (including learners, etc), on the argument that negative constraints are set for domiciling (*).
  2. For voter constraints (including the constraints field in case num_voters is not set), I think we need to consider the two quorums separately (so VOTER + VOTER_INCOMING and VOTER + VOTER_DEMOTING/OUTGOING), and declare the range to be in violation if either one of the quorums is in violation. This is on the argument that voter constraints are used for survivability, and the range is not getting the intended survivability if either quorum is in violation (this is analogous to the way the critical localities report works, and how the unavailability report works).
  3. For the overall constraints (excluding the case where num_voters is not set), I think we need to look at VOTER + VOTER_INCOMING + NON_VOTER (i.e. exclude LEARNER and VOTER_DEMOTING/VOTER_OUTGOING). These constraints are set for having a replica ready to serve reads nearby, so the argument for excluding LEARNERs is that we don't route reads to learners (for good reason). The argument for excluding VOTER_DEMOTING is more debatable, but arguably, since these replicas are supposed to go away momentarily, we might as well stop considering them for the specific purpose of follower reads. Specifically, I think we should match the behavior of the DistSender, which doesn't route reads to either VOTER_DEMOTING_LEARNER or VOTER_DEMOTING_NON_VOTER. But, actually, I think that this policy is not great - I think we should route to VOTER_DEMOTING_NON_VOTER (but not to VOTER_DEMOTING_LEARNER), on the argument that the former will stick around. cc @nvanbenschoten , @aayushshah15 , @tbg - do you see any problems with routing to VOTER_DEMOTING_NON_VOTER?

Please verify that what I'm saying makes sense to you. And let's also bring @aayushshah15 and @lidorcarmel and @kvoli into this.

(*) @ajstorm @RichardJCai I know super-regions are a thing now, and their purpose is data domiciling, right? How exactly do these super-regions translate into zone config constraints? I assume negative constraints are not used, and instead we use positive constraints, which sum up to num_replicas, right? If that's the case, and we want to use the constraint conformance report for domiciling (which I think we should), should we consider that, in the case where there is a set of constraints that adds to num_replicas, there is an implicit negative constraint for every locality not included in that set?
Let's make this concrete with an example:
Let's say we have a zone with num_replicas=5 and a constraint that reads {+region=r1:2, +region=r2:2, +region=r3=1} (so each replica is precisely placed). Now, if a range has 6 replicas, with the 6th one being in region=r4, should we declare this to be a violation of the aforementioned constraint (in addition to being a violation of the replication factor, which is reported separately in the replication_stats report)? Similarly if the constraint had been {+super_region=sr1} and there's a replica outside of sr1.

@aliher1911 aliher1911 force-pushed the constraints_stats_report branch 2 times, most recently from af16b3f to cca6211 Compare August 1, 2022 21:17
Copy link
Contributor Author

@aliher1911 aliher1911 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @ajstorm, @andreimatei, @kvoli, @lidorcarmel, @nvanbenschoten, @RichardJCai, and @tbg)


pkg/kv/kvserver/reports/constraint_stats_report.go line 470 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

but not sure where learner replica falls

Good point.

What you have now, I think, will trivially produce violations when a voter is moved from a node to another (within the same locality, say) - you're gonna have a VOTER_DEMOTING and a VOTER_INCOMING at the same time, and that will be in violation of constraints that precisely place voters in regions. I think that'd be bad.

I think I was wrong about this, wasn't I? The positive constraints we have are not "exactly this many replicas should be in a locality"; they're "at least this many replicas". The negative constraints don't permit replica counts, I think; they only permit saying "no replicas should be in this locality". So, for positive constraints, having more replicas never hurts (it hurts for the overreplication report, but we have a story there), and for negative constraints, any replica is bad - including learners.

So here's my current thought about what we should do:

  1. For negative constraints, consider all replicas (including learners, etc), on the argument that negative constraints are set for domiciling (*).
  2. For voter constraints (including the constraints field in case num_voters is not set), I think we need to consider the two quorums separately (so VOTER + VOTER_INCOMING and VOTER + VOTER_DEMOTING/OUTGOING), and declare the range to be in violation if either one of the quorums is in violation. This is on the argument that voter constraints are used for survivability, and the range is not getting the intended survivability if either quorum is in violation (this is analogous to the way the critical localities report works, and how the unavailability report works).
  3. For the overall constraints (excluding the case where num_voters is not set), I think we need to look at VOTER + VOTER_INCOMING + NON_VOTER (i.e. exclude LEARNER and VOTER_DEMOTING/VOTER_OUTGOING). These constraints are set for having a replica ready to serve reads nearby, so the argument for excluding LEARNERs is that we don't route reads to learners (for good reason). The argument for excluding VOTER_DEMOTING is more debatable, but arguably, since these replicas are supposed to go away momentarily, we might as well stop considering them for the specific purpose of follower reads. Specifically, I think we should match the behavior of the DistSender, which doesn't route reads to either VOTER_DEMOTING_LEARNER or VOTER_DEMOTING_NON_VOTER. But, actually, I think that this policy is not great - I think we should route to VOTER_DEMOTING_NON_VOTER (but not to VOTER_DEMOTING_LEARNER), on the argument that the former will stick around. cc @nvanbenschoten , @aayushshah15 , @tbg - do you see any problems with routing to VOTER_DEMOTING_NON_VOTER?

Please verify that what I'm saying makes sense to you. And let's also bring @aayushshah15 and @lidorcarmel and @kvoli into this.

(*) @ajstorm @RichardJCai I know super-regions are a thing now, and their purpose is data domiciling, right? How exactly do these super-regions translate into zone config constraints? I assume negative constraints are not used, and instead we use positive constraints, which sum up to num_replicas, right? If that's the case, and we want to use the constraint conformance report for domiciling (which I think we should), should we consider that, in the case where there is a set of constraints that adds to num_replicas, there is an implicit negative constraint for every locality not included in that set?
Let's make this concrete with an example:
Let's say we have a zone with num_replicas=5 and a constraint that reads {+region=r1:2, +region=r2:2, +region=r3=1} (so each replica is precisely placed). Now, if a range has 6 replicas, with the 6th one being in region=r4, should we declare this to be a violation of the aforementioned constraint (in addition to being a violation of the replication factor, which is reported separately in the replication_stats report)? Similarly if the constraint had been {+super_region=sr1} and there's a replica outside of sr1.

So I gave it a go with having separate sets of replicas for inclusion and exclusion sets. And also comparing incoming and outgoing subsets to evaluate consensus.
It looks slightly mad as we can have a conjunction that includes both + and - constraints and they are checked against different sets of replicas.

I think names might need some improvements, but what do you think about general concept?

Copy link
Contributor

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @ajstorm, @aliher1911, @andreimatei, @kvoli, @lidorcarmel, @nvanbenschoten, @RichardJCai, and @tbg)


pkg/kv/kvserver/reports/constraint_stats_report.go line 470 at r3 (raw file):

Previously, aliher1911 (Oleg) wrote…

So I gave it a go with having separate sets of replicas for inclusion and exclusion sets. And also comparing incoming and outgoing subsets to evaluate consensus.
It looks slightly mad as we can have a conjunction that includes both + and - constraints and they are checked against different sets of replicas.

I think names might need some improvements, but what do you think about general concept?

I'm liking what I'm seeing :)


pkg/kv/kvserver/reports/constraint_stats_report.go line 350 at r7 (raw file):

	cType        ConstraintType
	conjunctions []zonepb.ConstraintsConjunction
	// If any of subsets of replicas generated by predicates are not sa...

...


pkg/kv/kvserver/reports/constraint_stats_report.go line 351 at r7 (raw file):

	conjunctions []zonepb.ConstraintsConjunction
	// If any of subsets of replicas generated by predicates are not sa...
	inclusionPredicates []replicaPredicate

I think the binding between these predicates and the constraint type would be better done at a lower level, in countRange(), instead of visitNewZone. The mapping between the two is static, I think, so I don't think you need the flexibility of arbitrary associations of constraints and predicates that this struct gives you.


pkg/kv/kvserver/reports/constraint_stats_report.go line 451 at r7 (raw file):

						cType:               Constraint,
						conjunctions:        zone.Constraints,
						inclusionPredicates: []replicaPredicate{voterNonVoterPredicate},

shouldn't this be {incomingConsensus, outgoingConsensus} ?


pkg/kv/kvserver/reports/constraint_stats_report.go line 460 at r7 (raw file):

						cType:               Constraint,
						conjunctions:        zone.Constraints,
						inclusionPredicates: []replicaPredicate{voterNonVoterPredicate},

I was suggesting that this should exclude VOTER_DEMOTING/OUTGOING, as per the DistSender follower read routing logic. If you do that, please add a comment to the respective policy in the DistSender, noting that it should be kept in sync with this code.


pkg/kv/kvserver/reports/constraint_stats_report.go line 495 at r7 (raw file):

	}
	switch r.GetType() {
	case roachpb.NON_VOTER, roachpb.VOTER_INCOMING:

VOTER_INCOMING is covered by voterPredicate


pkg/kv/kvserver/reports/constraint_stats_report.go line 532 at r7 (raw file):

) {
	storesFromPredicates := func(ps []replicaPredicate) [][]roachpb.StoreDescriptor {
		storeVariants := make([][]roachpb.StoreDescriptor, len(ps))

storeGroups perhaps?


pkg/kv/kvserver/reports/constraint_stats_report.go line 557 at r7 (raw file):

func getViolations(
	ctx context.Context,
	requiredStoreDescs [][]roachpb.StoreDescriptor,

I don't think the names requiredSD and prohibitedSD make sense. Maybe
requiredSD -> ``relevantReplicas prohibitedSD` -> `allReplicas`

Have you verified that we actually allow positive and negative constraints in the same conjunction? Cause if we don't, I'd suggest splitting this function into getViolationsPositive and getViolationsNegative so they can take a single [][]StoreDescriptor.
Even if we can mix negative with positive like that, I'd still consider lifting the for _, c := range conjunction.Constraints { into the caller, and restricting this function to a single Constraint.


pkg/kv/kvserver/reports/constraint_stats_report.go line 571 at r7 (raw file):

			}
			// We need to check all replica variants against constraint and ensure
			// none of them violates it. This case represents incoming and outgoing

It's not clear what "this case" is referring to. Say that there might be multiple groups because the two quorums usually need checking separately.


pkg/kv/kvserver/reports/critical_localities_report_test.go line 134 at r2 (raw file):

Previously, aliher1911 (Oleg) wrote…

It works now.
How would we get broken test without making build fail or do you mean keep test breaking while PR is being reviewed?

I meant you can keep a test exercising buggy code by having the test expect the buggy result. The test will pass until someone fixes the code.

@aliher1911 aliher1911 force-pushed the constraints_stats_report branch from cca6211 to a8878fe Compare August 12, 2022 21:56
Copy link
Contributor Author

@aliher1911 aliher1911 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @ajstorm, @aliher1911, @andreimatei, @kvoli, @lidorcarmel, @nvanbenschoten, @RichardJCai, and @tbg)


pkg/kv/kvserver/reports/constraint_stats_report.go line 451 at r7 (raw file):

{incomingConsensus, outgoingConsensus}
I think we wanted to also include non voters here for replication constraints.


pkg/kv/kvserver/reports/constraint_stats_report.go line 557 at r7 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I don't think the names requiredSD and prohibitedSD make sense. Maybe
requiredSD -> ``relevantReplicas prohibitedSD` -> `allReplicas`

Have you verified that we actually allow positive and negative constraints in the same conjunction? Cause if we don't, I'd suggest splitting this function into getViolationsPositive and getViolationsNegative so they can take a single [][]StoreDescriptor.
Even if we can mix negative with positive like that, I'd still consider lifting the for _, c := range conjunction.Constraints { into the caller, and restricting this function to a single Constraint.

I don't see any obvious restrictions on mixing types of constraints together in our docs. The restriction there says you can use positive constraint with explicit replica count and you can't use negative with it.
I tried constraints = '{"+node2,-node3"}' and it is accepted and in tests they are passed a single list down to the checks so they aren't treated as separate with their own expected replica counts.

With adding an extra loop I actually broke the semantics a bit because previously it was bailing once a single constraint in conjunction is violated. But then again if the violation is defined in terms of constraints, we don't we expose all, but only the first found one.

@aliher1911 aliher1911 force-pushed the constraints_stats_report branch from a8878fe to d19a1eb Compare August 13, 2022 19:00
Copy link
Contributor Author

@aliher1911 aliher1911 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @ajstorm, @andreimatei, @kvoli, @lidorcarmel, @nvanbenschoten, @RichardJCai, and @tbg)


pkg/kv/kvserver/reports/constraint_stats_report.go line 470 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I'm liking what I'm seeing :)

Ok, I rewrote that nested loop a bit and moved binding of predicates into policy. Based on what we have just replicas or num voters < replicas we will apply one of the constraints you listed above. So we really have 3 variants there. One for voter constraints and two for constraints depending on num voter configuration.


pkg/kv/kvserver/reports/constraint_stats_report.go line 350 at r7 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

...

Done.


pkg/kv/kvserver/reports/constraint_stats_report.go line 351 at r7 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I think the binding between these predicates and the constraint type would be better done at a lower level, in countRange(), instead of visitNewZone. The mapping between the two is static, I think, so I don't think you need the flexibility of arbitrary associations of constraints and predicates that this struct gives you.

Done.


pkg/kv/kvserver/reports/constraint_stats_report.go line 460 at r7 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I was suggesting that this should exclude VOTER_DEMOTING/OUTGOING, as per the DistSender follower read routing logic. If you do that, please add a comment to the respective policy in the DistSender, noting that it should be kept in sync with this code.

Done.


pkg/kv/kvserver/reports/constraint_stats_report.go line 495 at r7 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

VOTER_INCOMING is covered by voterPredicate

Done.


pkg/kv/kvserver/reports/constraint_stats_report.go line 532 at r7 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

storeGroups perhaps?

Ditched this part all together.


pkg/kv/kvserver/reports/constraint_stats_report.go line 571 at r7 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

It's not clear what "this case" is referring to. Say that there might be multiple groups because the two quorums usually need checking separately.

Done.


pkg/kv/kvserver/reports/critical_localities_report_test.go line 134 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I meant you can keep a test exercising buggy code by having the test expect the buggy result. The test will pass until someone fixes the code.

Done.

@aliher1911 aliher1911 force-pushed the constraints_stats_report branch from d19a1eb to 2da376c Compare August 14, 2022 16:02
@aliher1911 aliher1911 requested a review from andreimatei August 14, 2022 16:15
@aliher1911 aliher1911 self-assigned this Aug 14, 2022
@aliher1911 aliher1911 dismissed andreimatei’s stale review August 17, 2022 09:03

addressed via reviewable

@aliher1911
Copy link
Contributor Author

@andreimatei ping

Copy link
Contributor

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @ajstorm, @aliher1911, @andreimatei, @kvoli, @lidorcarmel, @nvanbenschoten, @RichardJCai, and @tbg)


pkg/kv/kvserver/reports/constraint_stats_report.go line 460 at r7 (raw file):

Previously, aliher1911 (Oleg) wrote…

Done.

I don't see any changes to the DistSender.


pkg/kv/kvserver/reports/constraint_stats_report.go line 474 at r9 (raw file):

}

// replicaConstraintsAllVoters are applied to replica constraints when number

these policies are used only in one place, right? I think it'd be better if they were inlined there, in visitSameZone(). They need to be read together with the constraint.


pkg/kv/kvserver/reports/constraint_stats_report.go line 491 at r9 (raw file):
Do you have a position the the last part of this comment :

I know super-regions are a thing now, and their purpose is data domiciling, right? How exactly do these super-regions translate into zone config constraints? I assume negative constraints are not used, and instead we use positive constraints, which sum up to num_replicas, right? If that's the case, and we want to use the constraint conformance report for domiciling (which I think we should), should we consider that, in the case where there is a set of constraints that adds to num_replicas, there is an implicit negative constraint for every locality not included in that set?
Let's make this concrete with an example:
Let's say we have a zone with num_replicas=5 and a constraint that reads {+region=r1:2, +region=r2:2, +region=r3=1} (so each replica is precisely placed). Now, if a range has 6 replicas, with the 6th one being in region=r4, should we declare this to be a violation of the aforementioned constraint (in addition to being a violation of the replication factor, which is reported separately in the replication_stats report)? Similarly if the constraint had been {+super_region=sr1} and there's a replica outside of sr1.

While reflecting again, I had come back to this idea. Conceptually, positive overall constraints can have two purposes, I think:

  1. Specify that a replica should be nearby for follower reads.
  2. Control the localities that data is in and the localities where data is not. This may be the intent (although not necessarily) when the contraints target all of num_replicas (otherwise, if there are unconstrained replicas, the user is clearly saying that domiciling is not important).

What we have implemented here addresses 1). I'm thinking we should do something about 2) also.


pkg/kv/kvserver/reports/constraint_stats_report.go line 492 at r9 (raw file):

	predicates: map[zonepb.Constraint_Type][]replicaPredicate{
		zonepb.Constraint_REQUIRED: {
			isIncomingAndNonVoter,

I think this is right, but please explain the rationale in a comment. Say that, when


pkg/kv/kvserver/reports/constraint_stats_report.go line 564 at r9 (raw file):

			v.prevConstraints)
	} else {
		v.countRange(ctx, r, v.prevZoneKey, Constraint, replicaConstraintsAllVoters, v.prevConstraints)

do you think this case should also be in effect when num_voters == num_replicas && voter_constraints == nil ?


pkg/kv/kvserver/reports/constraint_stats_report.go line 573 at r9 (raw file):

}

func isIncomingAndNonVoter(r roachpb.ReplicaDescriptor) bool {

OrNonVoter ?

and s/Incoming/InIncomingQuorum


pkg/kv/kvserver/reports/constraint_stats_report.go line 582 at r9 (raw file):

func isNonVoter(r roachpb.ReplicaDescriptor) bool {
	switch r.Type {

return r.Type == NON_VOTER


pkg/kv/kvserver/reports/constraint_stats_report.go line 594 at r9 (raw file):

}

func isIncoming(r roachpb.ReplicaDescriptor) bool {

I think inIncomingQuorum/inOutgoingQuorum would be better names

@aliher1911 aliher1911 force-pushed the constraints_stats_report branch from 2da376c to c094bb2 Compare August 20, 2022 14:51
Copy link
Contributor Author

@aliher1911 aliher1911 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @ajstorm, @aliher1911, @andreimatei, @kvoli, @lidorcarmel, @nvanbenschoten, @RichardJCai, and @tbg)


pkg/kv/kvserver/reports/constraint_stats_report.go line 460 at r7 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I don't see any changes to the DistSender.

The logic is not in DistSender itself. It delegates filtering to ReplicaSlice which in turn delegates to roachpb.CheckCanReceiveLease. Putting comment in DistSender is far away from the actual filter. Not sure how helpful that would be.


pkg/kv/kvserver/reports/constraint_stats_report.go line 474 at r9 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

these policies are used only in one place, right? I think it'd be better if they were inlined there, in visitSameZone(). They need to be read together with the constraint.

They are static though. Does it make sense to construct the map and fill it with data on every invocation rather than keep them pre made upfront. I'm strongly against doing "if a fill this filter here, if !b add another one there approach" as I have to run program in my head to understand what's going on and I'm bad at running programs in my head, I delegate that to silicon.


pkg/kv/kvserver/reports/constraint_stats_report.go line 491 at r9 (raw file):

in and the localities where data is not
Wait! We are talking about positive constraints here. Why would they control where data is not? I think that should be solely negative constraint area. I think positive constraints could be I want some replica to be here, but the rest of them could be just anywhere. For any counters we always check that we have at least n replicas within constraints, if we have more that's fine. So any replica residing in location that we didn't list is valid as we should have 0 or more there.


pkg/kv/kvserver/reports/constraint_stats_report.go line 564 at r9 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

do you think this case should also be in effect when num_voters == num_replicas && voter_constraints == nil ?

This case is handled implicitly here. We discard num_voters by setting them to 0 if they are equal to num_replicas and we can call countRange with nil voter constraints because it'll just iterate nil slice as noop. At least that was the intent!


pkg/kv/kvserver/reports/constraint_stats_report.go line 582 at r9 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

return r.Type == NON_VOTER

Done.

Copy link
Contributor Author

@aliher1911 aliher1911 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @ajstorm, @aliher1911, @andreimatei, @kvoli, @lidorcarmel, @nvanbenschoten, @RichardJCai, and @tbg)


pkg/kv/kvserver/reports/constraint_stats_report.go line 492 at r9 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I think this is right, but please explain the rationale in a comment. Say that, when

Done.


pkg/kv/kvserver/reports/constraint_stats_report.go line 573 at r9 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

OrNonVoter ?

and s/Incoming/InIncomingQuorum

Done.

Copy link
Contributor

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please leave an empty line in replies between the text you're quoting with > quoted text and your response. Otherwise it all renders as quoted.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911 and @andreimatei)


pkg/kv/kvserver/reports/constraint_stats_report.go line 460 at r7 (raw file):

Previously, aliher1911 (Oleg) wrote…

The logic is not in DistSender itself. It delegates filtering to ReplicaSlice which in turn delegates to roachpb.CheckCanReceiveLease. Putting comment in DistSender is far away from the actual filter. Not sure how helpful that would be.

ok, then perhaps CheckCanReceiveLease is a better place for a linking comment. Or, in fact, can the report use CheckCanReceiveLease?


pkg/kv/kvserver/reports/constraint_stats_report.go line 491 at r9 (raw file):

Wait! We are talking about positive constraints here. Why would they control where data is not?

If the positive constraints constrain all the replica, then it stands to reason that no replicas are elsewhere. In practice, I think that's how we'll do domiciling - I doubt we'd add a new negative constraint to everything when a new region is added to the cluster.
How are super-regions implemented?


pkg/kv/kvserver/reports/constraint_stats_report.go line 564 at r9 (raw file):

Previously, aliher1911 (Oleg) wrote…

This case is handled implicitly here. We discard num_voters by setting them to 0 if they are equal to num_replicas and we can call countRange with nil voter constraints because it'll just iterate nil slice as noop. At least that was the intent!

ack


pkg/kv/kvserver/reports/constraint_stats_report.go line 524 at r10 (raw file):
I'd move this to where it's used... You've said:

I'm strongly against doing "if a fill this filter here, if !b add another one there approach" as I have to run program in my head to understand what's going on and I'm bad at running programs in my head, I delegate that to silicon.

which I think meant to explain why you don't want to do that, but I don't understand it. Instead of having a comment explaining when this is used later, if you inline this, then I can see when it's used :)


pkg/kv/kvserver/reports/constraint_stats_report.go line 559 at r10 (raw file):

		predicates: map[zonepb.Constraint_Type][]replicaPredicate{
			zonepb.Constraint_REQUIRED: {
				isInIncomingQuorumOrNonVoter, isInOutgoingQuorumOrNonVoter,

is the OrNonVoter part right, given that this is about VoterConstraints (particularly in the case when v.prevNumVoters != 0) ?

Copy link
Contributor Author

@aliher1911 aliher1911 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911 and @andreimatei)


pkg/kv/kvserver/reports/constraint_stats_report.go line 460 at r7 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

ok, then perhaps CheckCanReceiveLease is a better place for a linking comment. Or, in fact, can the report use CheckCanReceiveLease?

I put notes in both places for descendants.


pkg/kv/kvserver/reports/constraint_stats_report.go line 491 at r9 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Wait! We are talking about positive constraints here. Why would they control where data is not?

If the positive constraints constrain all the replica, then it stands to reason that no replicas are elsewhere. In practice, I think that's how we'll do domiciling - I doubt we'd add a new negative constraint to everything when a new region is added to the cluster.
How are super-regions implemented?

Yeah that sounds reasonable this way. But this will change semantics of reports slightly.
We need to check if all replicas have at least one match. That should be doable. I'll need to change the code slightly so that all the rule matching could be tested without the need for zone visitor, tables and what not.


pkg/kv/kvserver/reports/constraint_stats_report.go line 524 at r10 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I'd move this to where it's used... You've said:

I'm strongly against doing "if a fill this filter here, if !b add another one there approach" as I have to run program in my head to understand what's going on and I'm bad at running programs in my head, I delegate that to silicon.

which I think meant to explain why you don't want to do that, but I don't understand it. Instead of having a comment explaining when this is used later, if you inline this, then I can see when it's used :)

I inlined that and maps are created and populated for every range now. :D


pkg/kv/kvserver/reports/constraint_stats_report.go line 559 at r10 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

is the OrNonVoter part right, given that this is about VoterConstraints (particularly in the case when v.prevNumVoters != 0) ?

Oops, I think it's a copy-pasta error. Didn't have non-voters before refactoring.

@aliher1911 aliher1911 force-pushed the constraints_stats_report branch from c094bb2 to cb09aca Compare August 31, 2022 09:06
Copy link
Contributor Author

@aliher1911 aliher1911 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a test to easily test any changes to report that we are discussing. It is not "feature complete" as it just checks there are no violations. But I'll update it soon to include report check.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911 and @andreimatei)

Copy link
Contributor Author

@aliher1911 aliher1911 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911 and @andreimatei)


pkg/kv/kvserver/reports/constraint_stats_report.go line 491 at r9 (raw file):

Previously, aliher1911 (Oleg) wrote…

Yeah that sounds reasonable this way. But this will change semantics of reports slightly.
We need to check if all replicas have at least one match. That should be doable. I'll need to change the code slightly so that all the rule matching could be tested without the need for zone visitor, tables and what not.

So you were basically referring to this part of allocator:
https://github.com/cockroachdb/cockroach/blob/e44fbae27196ffaca386f4687fcc40de0ed0da39/pkg/kv/kvserver/allocator/allocatorimpl/allocator_scorer.go#L1778-L1777
It looks like it is doing what you are saying, allowing unconstraint replicas if we don't have enough "specific" ones to cover numReplicas.

Copy link
Contributor Author

@aliher1911 aliher1911 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911 and @andreimatei)


pkg/kv/kvserver/reports/constraint_stats_report.go line 491 at r9 (raw file):

Previously, aliher1911 (Oleg) wrote…

So you were basically referring to this part of allocator:
https://github.com/cockroachdb/cockroach/blob/e44fbae27196ffaca386f4687fcc40de0ed0da39/pkg/kv/kvserver/allocator/allocatorimpl/allocator_scorer.go#L1778-L1777
It looks like it is doing what you are saying, allowing unconstraint replicas if we don't have enough "specific" ones to cover numReplicas.

I tried to create a test to see how it should behave and I don't think I fully understand. If we have constraint as a list of tags, then all replicas must comply - we don't have a problem with non listed localities. If we specify exact number of replicas per constraint, then there could be a ambiguity with what to do with "unaccounted" replicas, but I don't think there's a way to specify what to do with "remainder" in existing system.
@andreimatei can you give an example if you have something in mind?

Previously replication_constraint_stats were using zone constraints regardless
of the fact that num_voters and a separate voter constraints were provided.
This was causing report to have violations because range had less voters than
replicas, but still were trying to check if we have as many voters as
full replica constraints.
This was causing ambiguity in reports as there were no violation and no way
to figure out which constraint is being violated.
This patch changes report to use voter constraints if they are provided or
use voters and non voters if voter count is configured lower than replica
count, but voter constraints are not provided.

Release note (bug fix): table system.replication_constraint_stats is not
showing erroneous voter contraint violations when num_voters is configured.

Release justification: bugfix to an issue which also needs backporting.
@aliher1911 aliher1911 force-pushed the constraints_stats_report branch from cb09aca to c8c29f7 Compare September 1, 2022 18:49
Copy link
Contributor

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aliher1911 and @andreimatei)


pkg/kv/kvserver/reports/constraint_stats_report.go line 491 at r9 (raw file):

If we specify exact number of replicas per constraint, then there could be a ambiguity with what to do with "unaccounted" replicas, but I don't think there's a way to specify what to do with "remainder" in existing system.

Can you explain the ambiguous part some more?
I'm thinking that, when there are supposed to be no unconstrained replicas, every replica that doesn't meet any of the sub-constraints is reported as a violation. I think there is subtlety in deciding when there are unconstrained replicas, though, and I think that might be a per-tag/locality tier concept, not a per-constraint one; see below.

Let's take the example above: num_replicas=5 and a constraint that reads {+region=r1:2, +region=r2:2, +region=r3=1}.
Let's consider a few cases described as maps from locality to the number of replicas in that locality, annotated with what I think should be reported for them:
r1:2,r2:2,r3:1,r4:1: the replica in r4 violates the constraint. The range is also over-replicated, which is reflected in replication_stats.
r1:2,r2:2,r4:1: the replica in r4 violates the constraint, and another violation is that there are not enough replicas in r3. I forgot how we report multiple violations of the same constraint.
r1:2,r2:1,r4:1: the replica in r4 violates the constraint, and another violation is that there are not enough replicas in r3, and another violation is that there are not enough replicas in r2. The range is also under-replicated.

I think a funky case is when the sub-constraints are not mutually-exclusive. E.g. {+ssd:1,+intel:1}. Now, if you have a replica on a store that has ssd,intel, and another replica on a store with no tags, I guess there's no violation here. But, the code you're pointing to I think would consider that second replica invalid, if I read it correctly. I think this is one of many examples about how our constraint system was not thought through.
Perhaps we should restrict our reporting to tags and localities for which the sub-constraints are clearly mutually-exclusive. So, that would be tags and localities in constraints that don't specify the number of replicas, and tags and localities that appear in every sub-constraint. In other words, I think for every tag and locality tier we should decide whether there are replicas that are unconstrained with respect to it or not.
So, for example, with the constraint {+region=r1,+zone=A,+ssd,+intel:1, +region=r1,+zone=A,+ssd:1, +region=r2,+ssd:1}, we would report violations for replicas that are not in regions r1 or r2, and for replicas that are not on ssd, but not for replicas that are not on intel or on zone=A. The reason being that all replicas are constrained with respect to their region and with respect to ssd, but not with respect to other tags or locality tiers.

Btw, seeing how things are complicated, if you want not report the violations we're discussing here, that's now fine by me. Although I'd prefer we continue working through it.

@aliher1911
Copy link
Contributor Author

I think we should merge this as a bugfix and create an issue where we could discuss this further.
Otherwise we'll end up with scope creep as I think some of the problems we are discussing are coming from our very flexible system of constraints.

@aliher1911
Copy link
Contributor Author

bors r=andreimatei

@craig
Copy link
Contributor

craig bot commented Sep 2, 2022

Build succeeded:

@blathers-crl
Copy link

blathers-crl bot commented Sep 2, 2022

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from c8c29f7 to blathers/backport-release-21.2-84727: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 21.2.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

system.replication_constraint_stats incorrectly reports violations when voter_constraints are defined
3 participants