Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add readIndex check in readyz #16792

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

siyuanfoundation
Copy link
Contributor

@siyuanfoundation siyuanfoundation marked this pull request as draft October 17, 2023 23:02
}
}

func (s *EtcdServer) GetReadIndex(ctx context.Context) (uint64, error) {
Copy link
Member

@chaochn47 chaochn47 Oct 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we re-use LinearizableReadNotify to achieve the same semantic of readIndex?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is more clear even though there is some duplicate code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@siyuanfoundation Is there any counter argument of this above comment? Could you please clarify?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GetReadIndex and LinearizableReadNotify sends messages to different channels, and waits on different notifiers. The former needs a index returned from the notifier, while the latter only needs the err.

Copy link
Member

@chaochn47 chaochn47 Nov 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, misread that the new readIndexLoop won't wait for the applied index catch up.

Right after read index is received in linearizableReadLoop, readIndexNotifier would notify ReadIndex that listens on the channel. I think that simplifies the implementation, wdyt?

In this way, we can reduce the new background go routine readIndexLoop, avoid readMu lock twice and don't need to add index into the notifier struct.

@serathius
Copy link
Member

Please merge the e2e tests before this.

@chaochn47
Copy link
Member

chaochn47 commented Oct 19, 2023

@siyuanfoundation DCO check failed. Understood the corrupt alarm e2e test was authored by @serathius but you have to sign off this commit to pass the check.

git rebase HEAD~2 --signoff

@chaochn47
Copy link
Member

Please merge the e2e tests before this.

@serathius Did you mean separating the e2e test to another PR for validating existing livez / readyz checks behaviors?

ReadIndex check changes and related test case could stay in the current PR, correct?

/cc @siyuanfoundation

@chaochn47
Copy link
Member

Please merge the e2e tests before this.

@serathius Did you mean separating the e2e test to another PR for validating existing livez / readyz checks behaviors?

ReadIndex check changes and related test case could stay in the current PR, correct?

/cc @siyuanfoundation

@siyuanfoundation Existing livez readyz e2e test cases are added. Could you please rebase on top of main and resolve the comments?

@siyuanfoundation siyuanfoundation marked this pull request as ready for review October 30, 2023 23:10
{
name: "Alive ok",
healthCheckURL: "/livez",
readIndexError: fmt.Errorf("failed to get read index"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove this error?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This error is here is to test livez is ok even when server cannot get read index.

}
}

func (s *EtcdServer) GetReadIndex(ctx context.Context) (uint64, error) {
Copy link
Member

@chaochn47 chaochn47 Nov 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, misread that the new readIndexLoop won't wait for the applied index catch up.

Right after read index is received in linearizableReadLoop, readIndexNotifier would notify ReadIndex that listens on the channel. I think that simplifies the implementation, wdyt?

In this way, we can reduce the new background go routine readIndexLoop, avoid readMu lock twice and don't need to add index into the notifier struct.

@siyuanfoundation
Copy link
Contributor Author

@chaochn47
The problem with sharing the same loop is a slow apply loop would block readIndex request when there is a pending read, which makes it not much better than just using linearizableRead. So readIndex and linearizableRead has to be in separate loops.

}

nextnr := newNotifier()
s.readMu.Lock()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about adding a separate lock dedicated for read index so adding this loop has no impact on the linearizableReadLoop?

nc := s.readIndexNotifier
s.readMu.RUnlock()

// signal linearizable loop for current notify if it hasn't been already
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// signal linearizable loop for current notify if it hasn't been already
// signal readIndexLoop for current notify if it hasn't been already

Comment on lines +88 to +89
// pass some values in the notifier
uint64Val uint64
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// pass some values in the notifier
uint64Val uint64
readIndex uint64

It's an unexported pkg local struct, I think explicitly naming it is better than generic.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

notifier was meant to be generic. Not sure if we should add fields here.

@chaochn47
Copy link
Member

@chaochn47 The problem with sharing the same loop is a slow apply loop would block readIndex request when there is a pending read, which makes it not much better than just using linearizableRead.

Makes sense.

So readIndex and linearizableRead has to be in separate loops.

It may not be necessary if requestCurrentIndex can be used for read index check while the current server side 7s timeout is too long for the check. This function is likely to be refactored if we go down this route.

Comment on lines +246 to +247
`[+]serializable_read ok`,
`[+]data_corruption ok`,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to validate other checks, just read_index.

@@ -799,8 +794,9 @@ func (s *EtcdServer) linearizableReadLoop() {
nr := s.readNotifier
s.readNotifier = nextnr
s.readMu.Unlock()

confirmedIndex, err := s.requestCurrentIndex(leaderChangedNotifier, requestId)
ctx, cancel := context.WithTimeout(context.Background(), s.Cfg.ReqTimeout())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why add timeout here?


func readIndexCheck(srv ServerHealth) func(ctx context.Context) error {
return func(ctx context.Context) error {
_, err := srv.GetReadIndex(ctx)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing a critical loop like linearizableReadLoop seems like a bad idea for backport. What's wrong with directly just calling requestCurrentIndex?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just need to have a public function.

func (s *EtcdServer) RequestCurrentIndex() (uint64, error) {
  requestId := s.reqIDGen.Next()
  leaderChangedNotifier := s.leaderChanged.Receive()
  retun s.requestCurrentIndex(leaderChangedNotifier, requestId)
}

What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would problematic to use requestCurrentIndex, because each health check would start a new instance of requestCurrentIndex. In that function, the index responses are read from the single readStateC channel, and if there are multiple requestCurrentIndex instances running, some calls may never get to see the response.

Even though the requestId is unique, but requestCurrentIndex ignores the response if id does not match. So if the response falls to a different requestCurrentIndex , the right owner would not see it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see. requestCurrentIndex was designed to be only called from single routine like linearizableReadLoop. Nice catch!

Let me think about this more. Need to think on how to minimize the change to linearizableReadLoop for a safer backport.
cc @ahrtr for their ideas

@k8s-ci-robot
Copy link

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot
Copy link

@siyuanfoundation: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-etcd-verify e998d9a link true /test pull-etcd-verify
pull-etcd-unit-test-amd64 e998d9a link true /test pull-etcd-unit-test-amd64
pull-etcd-unit-test-arm64 e998d9a link true /test pull-etcd-unit-test-arm64

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

4 participants