receive: multiple failure cases in non-trivial receiver topologies #4359

Closed
bill3tt opened this issue Jun 18, 2021 · 1 comment · Fixed by #4368
Comments

@bill3tt
Contributor

bill3tt commented Jun 18, 2021

Thanos, Prometheus and Golang version used:

  • Thanos: HEAD
  • Prometheus: v2.26.0
  • Golang: go version go1.16.4 linux/amd64

Object Storage Provider: N/A

What happened:

While working on #4326, we discovered four failure cases when deploying non-trivial receive topologies based on the post-split proposal. As a result, the Thanos deployment does not behave as expected.

Issues

1: Request Replica > Ingestor Replication Factor

Following the post-split proposal, receive instances can be configured to operate in router, ingestor, or router & ingestor mode.

[arch1: architecture diagram of a single router in front of three ingestor instances (receive-i1, receive-i2, receive-i3)]

In the above example, if router is configured with a replication factor of 3, we would expect the data to be replicated to all ingestor instances.

Instead, we observe that router fails to replicate the data with quorum not reached errors like the following:

Routing logs (formatted for clarity)

09:44:41 receive-d1: level=error name=receive-d1 ts=2021-06-18T08:44:41.943443895Z caller=handler.go:352 component=receive component=receive-handler 
      err="3 errors: 
          replicate write request for endpoint receive_distributor_ingestor_mode-receive-i1:9091: quorum not reached: 
          2 errors: forwarding request to endpoint receive_distributor_ingestor_mode-receive-i2:9091: rpc error: code = InvalidArgument 
          desc = replica count exceeds replication factor; forwarding request to endpoint receive_distributor_ingestor_mode-receive-i1:9091: rpc error: code = AlreadyExists 
          desc = store locally for endpoint receive_distributor_ingestor_mode-receive-i1:9091: conflict; 
           
          replicate write request for endpoint receive_distributor_ingestor_mode-receive-i2:9091: quorum not reached: 
          2 errors: forwarding request to endpoint receive_distributor_ingestor_mode-receive-i3:9091: rpc error: code = InvalidArgument 
          desc = replica count exceeds replication factor; forwarding request to endpoint receive_distributor_ingestor_mode-receive-i2:9091: rpc error: code = AlreadyExists 
          desc = store locally for endpoint receive_distributor_ingestor_mode-receive-i2:9091: conflict; 
           
          replicate write request for endpoint receive_distributor_ingestor_mode-receive-i3:9091: quorum not reached: 
          2 errors: forwarding request to endpoint receive_distributor_ingestor_mode-receive-i1:9091: rpc error: code = InvalidArgument 
          desc = replica count exceeds replication factor; forwarding request to endpoint receive_distributor_ingestor_mode-receive-i3:9091: rpc error: code = AlreadyExists 
          desc = store locally for endpoint receive_distributor_ingestor_mode-receive-i3:9091: conflict" msg="internal server error"

This happens because when router constructs downstream requests, the replica number n is supplied in the request.

When single ingestor instances are spun up, their default receive.replication-factor is 1. When they receive a request from router carrying a higher replica number, they return errBadReplica responses.

The result is that the first downstream request succeeds, but all subsequent ones fail.
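The check described above can be reduced to a few lines. This is a minimal sketch with hypothetical names (validateReplica is not the real function; the actual check lives in the receive handler), assuming replica numbers in requests are 1-indexed, so a standalone ingestor with the default factor of 1 accepts only the first replica:

```go
package main

import (
	"errors"
	"fmt"
)

// errBadReplica mirrors the error the handler maps to the
// "replica count exceeds replication factor" InvalidArgument response
// seen in the logs above.
var errBadReplica = errors.New("replica count exceeds replication factor")

// validateReplica is a hypothetical reduction of the check an ingestor
// performs on an incoming write request: the replica number carried in
// the request must not exceed the instance's own
// --receive.replication-factor.
func validateReplica(replica, replicationFactor uint64) error {
	if replica > replicationFactor {
		return errBadReplica
	}
	return nil
}

func main() {
	// A standalone ingestor defaults to replication factor 1, while the
	// router replicates with factor 3 and sends replicas 1, 2, and 3.
	for replica := uint64(1); replica <= 3; replica++ {
		fmt.Printf("replica %d: %v\n", replica, validateReplica(replica, 1))
	}
}
```

Only replica 1 passes the check, matching the observed behaviour that the first downstream request succeeds and the rest fail.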

2: SingleNodeHashring GetN where n > 0

If a single ingestor instance receives a request, and the replica number n is > 0, the request will fail with insufficientNodesError.

The ingestor correctly believes there is only one node in its hashring (itself) and rejects all requests that indicate they are for larger hashrings.

3: SimpleHashring GetN where n > len(hashring)

Similarly to the above, if a router or router & ingestor is configured with a hashring of more than one instance and receives a request where n is greater than the number of instances, it will return an insufficientNodesError.
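Failure cases 2 and 3 share one mechanism and can be sketched together. This is an illustrative reduction, not the real hashring implementation (type and field names are assumptions, and the real GetN selects the node by hashing the series rather than indexing by n):

```go
package main

import "fmt"

// insufficientNodesError stands in for the error described above,
// returned when a hashring is asked for a replica index it cannot serve.
type insufficientNodesError struct{ have, want uint64 }

func (e *insufficientNodesError) Error() string {
	return fmt.Sprintf("insufficient nodes; have %d, want %d", e.have, e.want)
}

// simpleHashring sketches a hashring over a fixed set of endpoints.
// Treating replica numbers as zero-indexed, the n-th replica needs at
// least n+1 nodes.
type simpleHashring []string

func (s simpleHashring) GetN(tenant string, n uint64) (string, error) {
	if n >= uint64(len(s)) {
		return "", &insufficientNodesError{have: uint64(len(s)), want: n + 1}
	}
	// Illustration only: the real implementation hashes the series to
	// pick the node instead of indexing by n.
	return s[n], nil
}

func main() {
	// A standalone ingestor behaves like a one-element hashring, so any
	// request asking for replica n > 0 is rejected.
	single := simpleHashring{"receive-i1:9091"}
	_, err := single.GetN("tenant", 1)
	fmt.Println(err)
}
```

The same bound produces failure case 3: a multi-node hashring rejects any n that exceeds its size.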

4: Replicated requests can't be replicated again.

Let's consider the following architecture, which is valid under the split proposal.

[arch2: architecture diagram of router1 forwarding to router2, which forwards to the ingestor instances]

If router1 and router2 both have receive.replication-factor set to 2, we would expect the data to be replicated three times across all ingestor instances.

Instead, what we find is that the data is only replicated twice.

This happens because when router2 receives the request from router1, the storepb.WriteRequest replica field is > 0, which indicates that this request has already been replicated (which is true).

Since this request has already been replicated, router2 will forward this request to a maximum of one downstream hashring instance (src).

This mechanism prevents infinite loops in fully-connected Router & Ingestor topologies, but prevents us from forming functional trees of depth n.
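The fanout rule above can be sketched as a single decision (hypothetical function name; the real logic lives in the forwarding path referenced by the src link):

```go
package main

import "fmt"

// fanout sketches the forwarding decision: a request whose replica field
// is already set (> 0) is treated as replicated and forwarded to at most
// one downstream node, while a fresh request (replica == 0) is fanned out
// replicationFactor times.
func fanout(replica, replicationFactor uint64) uint64 {
	if replica > 0 {
		return 1 // already replicated: forward to a single node
	}
	return replicationFactor
}

func main() {
	// router1 receives a fresh request and replicates it twice.
	fmt.Println(fanout(0, 2))
	// router2 receives an already-replicated request and forwards it to
	// one node only, so the tree never reaches three copies.
	fmt.Println(fanout(1, 2))
}
```

This shows why the loop-prevention guard caps replication at the first router level and breaks trees of depth n.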

Integration Tests

Three additional e2e integration tests that exercise the above failure cases have been added: #4362

@bill3tt bill3tt changed the title receive: 'quorum not reached' when --receive.replication-factor > 1 receive: multiple failure cases in non-trivial receiver topologies Jun 23, 2021
@bill3tt
Contributor Author

bill3tt commented Jun 23, 2021

We think we have a fix for the above behaviour, but the ergonomics and explainability for end users will be poor. The plan is to propose this PR and discuss it at the next Thanos contributor hours.
