query: add endpointset flow #4421
@hitanshu-mehta just an FYI, your first link is broken; it should be https://thanos.io/tip/proposals-accepted/202101-endpoint-discovery.md/
Changed it. Thank you!
Today has been a day of really big PRs 😅
I'm afraid that I'm not currently familiar enough with storeset.go to be able to review this PR with any level of detail or confidence.
If there is a lot of code in common, and we plan on deprecating --store and --rule eventually, I'm wondering whether we can make storeset a special case of endpointset or vice versa?
I have skimmed it. It looks good. I need to spare more time to review the details.
@hitanshu-mehta This looks good, but I failed to find where we actually use the new EndpointSet. Will there be a follow-up PR? If so, we should mention this. Even better, let's utilize it in this PR.
Earlier I opened #4282 PR, where a new endpoint flow was baked into
Yes, there will be a follow-up PR, or I will rebase and update my old PR. My plan is to follow this deprecation plan, so I am making changes according to it. Please let me know if I can improve something :)
I love it, LGTM, some suggestions only.
BUT! I would remove storeset and start using new endpointset (with old flags purely, don't change that yet) in this PR.
Reasons:
- We can verify that the endpointset we are working on now works as designed
- We don't commit code we don't use
- Reduce migration duration
WDYT, @hitanshu-mehta? Up for the challenge? I am happy to work with you and do reviews every day so this can land for Thanos v0.22.0, which we plan to cut on Monday.
pkg/query/endpointset.go
Outdated
}

// StrictStatic returns true if the endpoint has been statically defined and it is under a strict mode.
func (es *grpcEndpointSpec) StrictStatic() bool {
- func (es *grpcEndpointSpec) StrictStatic() bool {
+ func (es *grpcEndpointSpec) IsStrictStatic() bool {
Are we sure we are not creating something for tests? I think I did that for storeset. Otherwise we can remove yea
Yes. We only have one implementation which is also used in tests.
// endpointSetNodeCollector is a metric collector reporting the number of available storeAPIs for Querier.
// A Collector is required as we want atomic updates for all 'thanos_store_nodes_grpc_connections' series.
// TODO(hitanshu-mehta) Currently,only collecting metrics of storeAPI. Make this struct generic.
- // TODO(hitanshu-mehta) Currently,only collecting metrics of storeAPI. Make this struct generic.
+ // TODO(hitanshu-mehta) Currently, only collecting metrics of storeAPI. Make this struct generic.
Why can't we make this struct generic? What about using the storeSet node collector instead of creating new code that has to be refactored in a new PR?
This PR was already very big, so I thought it would be better to do this in a separate PR.
Is it fine if I make this struct generic in this PR?
Up to you. As long as there is no duplicated storeset vs endpointset code, all good.
pkg/query/endpointset.go
Outdated
}

if er.HasMetricMetadataAPI() {
	level.Info(e.logger).Log("msg", "adding new MetricMetadataAPI to query endpointset", "address", addr)
Hmm, we are not adding anything here ;p just logging. I think it might confuse people if some component has all 5 APIs implemented: users will think we added 5 components, but we added one. It may take more code, but I would combine everything into one message: added component with A, B, C APIs.
WDYT? 🤗
Yes, sounds good :)
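The single-message idea from this thread could look roughly like the sketch below. It is only an illustration: `endpointAPIs`, `apiNames`, and `addedMessage` are hypothetical names, not code from this PR, and the real implementation would log through the component's logger rather than build a string.

```go
package main

import (
	"fmt"
	"strings"
)

// endpointAPIs is a hypothetical stand-in for the set of APIs an endpoint advertises.
type endpointAPIs struct {
	Store, Rules, Exemplars, Targets, MetricMetadata bool
}

// apiNames lists the advertised APIs in a fixed order so the message is stable.
func apiNames(a endpointAPIs) []string {
	var names []string
	if a.Store {
		names = append(names, "storeAPI")
	}
	if a.Rules {
		names = append(names, "rulesAPI")
	}
	if a.Exemplars {
		names = append(names, "exemplarsAPI")
	}
	if a.Targets {
		names = append(names, "targetsAPI")
	}
	if a.MetricMetadata {
		names = append(names, "MetricMetadataAPI")
	}
	return names
}

// addedMessage builds one message per component instead of one message per API.
func addedMessage(component string, a endpointAPIs) string {
	return fmt.Sprintf("adding new %s with [%s]", component, strings.Join(apiNames(a), " "))
}

func main() {
	fmt.Println(addedMessage("sidecar", endpointAPIs{Store: true, Rules: true}))
}
```

With one line per component, a sidecar exposing five APIs shows up once in the logs, which is the outcome asked for above.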
@bwplotka Agree with the reasons. I will make the required changes :)
Up for the challenge!! Thank you for the help, and let's merge it by Monday 🚀
8e23c4a to c2afd79
Signed-off-by: Hitanshu Mehta <[email protected]>
Signed-off-by: Hitanshu Mehta <[email protected]>
Signed-off-by: Hitanshu Mehta <[email protected]>
Signed-off-by: Hitanshu Mehta <[email protected]>
Signed-off-by: Hitanshu Mehta <[email protected]>
c2afd79 to 171231a
	return nil, err
}

srv := grpc.NewServer()
opt.semgrep.go.grpc.security.grpc-server-insecure-connection.grpc-server-insecure-connection: Found an insecure gRPC connection. This allows for a connection without encryption to this server. A malicious attacker could tamper with the gRPC message, which could compromise the machine.
(at-me in a reply with help or ignore)
Signed-off-by: Hitanshu Mehta <[email protected]>
0c01cdb to b7fc894
Signed-off-by: Hitanshu Mehta <[email protected]>
b7fc894 to 291bcdd
Super nice work. I have small comments only, otherwise LGTM. Let's go before release 💪🏽
func (es *grpcEndpointSpec) Metadata(ctx context.Context, client *endpointClients) (*endpointMetadata, error) {
	resp, err := client.info.Info(ctx, &infopb.InfoRequest{}, grpc.WaitForReady(true))
	if err != nil {
		// Call Info method of StoreAPI, this way querier will be able to discovery old components not exposing InfoAPI.
👍🏽
pkg/query/endpointset.go
Outdated
metadata, err := es.getMetadataUsingStoreAPI(ctx, client.store)
if err != nil {
	return nil, errors.Wrapf(err, "fetching info from %s", es.addr)
}
return metadata, nil
- metadata, err := es.getMetadataUsingStoreAPI(ctx, client.store)
- if err != nil {
- 	return nil, errors.Wrapf(err, "fetching info from %s", es.addr)
- }
- return metadata, nil
+ metadata, merr := es.getMetadataUsingStoreAPI(ctx, client.store)
+ if merr != nil {
+ 	return nil, errors.Wrapf(merr, "fallback fetching info from %s after err: %v", es.addr, err)
+ }
+ return metadata, nil
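The point of the suggestion is that the fallback error should not shadow the original Info error. A minimal, self-contained sketch of that pattern follows. It swaps in the standard library's `fmt.Errorf` with `%w` in place of `errors.Wrapf` from `github.com/pkg/errors`, and `fetchWithFallback`, `errInfo`, and `errStore` are hypothetical names used only for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors used only to drive the example.
var (
	errInfo  = errors.New("info API failed")
	errStore = errors.New("store API failed")
)

// fetchWithFallback tries primary first and falls back to fallback on error.
// If both fail, the returned error wraps the fallback error and also mentions
// the primary error, so neither failure is lost.
func fetchWithFallback(addr string, primary, fallback func() (string, error)) (string, error) {
	meta, err := primary()
	if err == nil {
		return meta, nil
	}
	meta, ferr := fallback()
	if ferr != nil {
		return "", fmt.Errorf("fallback fetching info from %s after err: %v: %w", addr, err, ferr)
	}
	return meta, nil
}

func main() {
	fail := func(e error) func() (string, error) {
		return func() (string, error) { return "", e }
	}
	_, err := fetchWithFallback("1.2.3.4:10901", fail(errInfo), fail(errStore))
	fmt.Println(err)
	fmt.Println(errors.Is(err, errStore))
}
```

Using a separate variable for the fallback error (`merr` in the suggestion, `ferr` here) is what keeps the first error available for the message.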
pkg/query/endpointset.go
Outdated
type EndpointSet struct {
	logger log.Logger

	// Endpoint specifications can change dynamically. If some store is missing from the list, we assuming it is no longer
- // Endpoint specifications can change dynamically. If some store is missing from the list, we assuming it is no longer
+ // Endpoint specifications can change dynamically. If some component is missing from the list, we assume it is no longer
pkg/query/endpointset.go
Outdated
	logger log.Logger

	// Endpoint specifications can change dynamically. If some store is missing from the list, we assuming it is no longer
	// accessible and we close gRPC client for it.
- // accessible and we close gRPC client for it.
+ // accessible and we close gRPC client for it, unless it is strict.
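The behaviour described by this struct comment can be sketched as a small reconciliation step. This is an assumption-laden illustration, not the PR's code: `endpointSpec` and `staleAddrs` are hypothetical, and the real update loop also handles health checks and metadata.

```go
package main

import (
	"fmt"
	"sort"
)

// endpointSpec is a hypothetical, simplified stand-in for an endpoint specification.
type endpointSpec struct {
	Addr   string
	Strict bool // statically configured and in strict mode
}

// staleAddrs returns the sorted addresses of active endpoints that are missing
// from the latest specs and are not strict, i.e. the ones whose gRPC client
// would be closed.
func staleAddrs(active map[string]endpointSpec, latest []endpointSpec) []string {
	wanted := make(map[string]struct{}, len(latest))
	for _, s := range latest {
		wanted[s.Addr] = struct{}{}
	}
	var stale []string
	for addr, s := range active {
		if _, ok := wanted[addr]; ok || s.Strict {
			continue // still listed, or strict: keep the connection
		}
		stale = append(stale, addr)
	}
	sort.Strings(stale)
	return stale
}

func main() {
	active := map[string]endpointSpec{
		"a:10901": {Addr: "a:10901"},
		"b:10901": {Addr: "b:10901", Strict: true},
		"c:10901": {Addr: "c:10901"},
	}
	latest := []endpointSpec{{Addr: "a:10901"}}
	fmt.Println(staleAddrs(active, latest))
}
```

In this sketch only `c:10901` is reported as stale: `a:10901` is still listed and `b:10901` is strict.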
pkg/query/endpointset.go
Outdated
gRPCInfoCallTimeout: 5 * time.Second,
endpoints: make(map[string]*endpointRef),
endpointStatuses: make(map[string]*EndpointStatus),
unhealthyEndpointTimeout: unhealthyStoreTimeout,
- unhealthyEndpointTimeout: unhealthyStoreTimeout,
+ unhealthyEndpointTimeout: unhealthyEndpointTimeout,
pkg/query/endpointset.go
Outdated
reg *prometheus.Registry,
endpointSpecs func() []EndpointSpec,
dialOpts []grpc.DialOption,
unhealthyStoreTimeout time.Duration,
- unhealthyStoreTimeout time.Duration,
+ unhealthyEndpointTimeout time.Duration,
pkg/query/endpointset.go
Outdated
	}
}

// TODO(bwplotka): Consider moving storeRef out of this package and renaming it, as it also supports rules API.
We can remove it now?
Sure
	clients.store = storepb.NewStoreClient(er.cc)
	er.StoreClient = clients.store
} else {
	er.clients.store = nil
Why do we need this?
Because when we see the endpoint for the first time, we assume it exposes the StoreAPI (which may not be true for some rulers) and create a StoreAPI client, since as a fallback we might have to call the Info method of the StoreAPI.
In this step, I set it to nil if we find out that the StoreAPI is not exposed.
right, some comment would be nice here then
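A possible shape for that comment, shown on a simplified model of the logic explained above. The types and the `update` method here are hypothetical stand-ins and do not match the real `endpointRef` in the PR.

```go
package main

import "fmt"

// storeClient and endpointRef are hypothetical, simplified stand-ins.
type storeClient struct{ addr string }

type endpointRef struct {
	addr  string
	store *storeClient
}

// newEndpointRef optimistically creates a store client, because the fallback
// metadata call may need the StoreAPI Info method before we know what the
// endpoint actually exposes.
func newEndpointRef(addr string) *endpointRef {
	return &endpointRef{addr: addr, store: &storeClient{addr: addr}}
}

// update applies what the endpoint's metadata reported.
func (er *endpointRef) update(hasStoreAPI bool) {
	if hasStoreAPI {
		if er.store == nil {
			er.store = &storeClient{addr: er.addr}
		}
		return
	}
	// The optimistic client was only needed for the fallback Info call.
	// Drop it so nothing treats this endpoint (e.g. a ruler) as a store.
	er.store = nil
}

func main() {
	ruler := newEndpointRef("ruler:10901")
	ruler.update(false)
	fmt.Println(ruler.store == nil)
}
```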
	}
	return endpointSpec
},
expectedStores: 4, // sidecar + querier + receiver + storeGW
Nice work on those 👍🏽
Thank you :)
Signed-off-by: Hitanshu Mehta <[email protected]>
👍🏽
Signed-off-by: Hitanshu Mehta <[email protected]>
Head branch was pushed to by a user without write access
@bwplotka Yes, there was a bug, and I have fixed it. The e2e tests now pass, but the unit and documentation checks fail. I tried but was not able to find the reason :( Can you please help?
With this PR, I get a bunch of "Duplicate store address is provided" spam:
Rgp 31 13:08:57 XXX thanos_endpoint[166897]: level=info ts=2021-08-31T13:08:57.843619303Z caller=endpointset.go:360 component=endpointset msg="adding new sidecar with [storeAPI rulesAPI exemplarsAPI targetsAPI MetricMetadataAPI]" address=1.2.3.4:10901 extLset="{dc=\"aaaa\", www=\"1\"}"
...
Rgp 31 13:09:02 XXX thanos_endpoint[166897]: level=warn ts=2021-08-31T13:09:02.833746591Z caller=query.go:637 msg="Duplicate store address is provided" addr=1.2.3.4:10901
I have both --store and --rule pointing to the same IP addresses (Sidecar nodes). Removing --rule fixes this problem. Presumably, this happens because the same endpoint spec is added for all flags. Maybe we can simply improve the logging so that the message is printed only if duplicate nodes have been specified with the same flag? The deduplication should still happen at the end, but without any logging.
Perhaps we could fix this before merging, @hitanshu-mehta?
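One way to read this proposal is sketched below: warn only when an address repeats within a single flag, and deduplicate silently across flags. `flagAddrs` and `dedupe` are hypothetical names, and the warning text is invented for the example rather than taken from query.go.

```go
package main

import "fmt"

// flagAddrs is a hypothetical grouping of the addresses passed to one flag.
type flagAddrs struct {
	Flag  string
	Addrs []string
}

// dedupe returns the unique addresses in first-seen order, plus warnings for
// addresses repeated within the same flag. The same address appearing under
// different flags is deduplicated without a warning.
func dedupe(groups []flagAddrs) (unique, warnings []string) {
	seen := make(map[string]bool)
	for _, g := range groups {
		seenInFlag := make(map[string]bool)
		for _, addr := range g.Addrs {
			if seenInFlag[addr] {
				warnings = append(warnings,
					fmt.Sprintf("duplicate address %s provided for %s", addr, g.Flag))
				continue
			}
			seenInFlag[addr] = true
			if seen[addr] {
				continue // already added via another flag: stay silent
			}
			seen[addr] = true
			unique = append(unique, addr)
		}
	}
	return unique, warnings
}

func main() {
	unique, warnings := dedupe([]flagAddrs{
		{Flag: "--store", Addrs: []string{"1.2.3.4:10901", "1.2.3.4:10901"}},
		{Flag: "--rule", Addrs: []string{"1.2.3.4:10901"}},
	})
	fmt.Println(unique)
	fmt.Println(warnings)
}
```

Here the repeated `--store` entry produces one warning, while the overlap between `--store` and `--rule` is merged quietly, matching the setup described in the report.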
Signed-off-by: Hitanshu Mehta <[email protected]>
LGTM, great work 👍
Signed-off-by: Hitanshu Mehta <[email protected]>
a99d374 to 2f8ae5e
🎉
🎉 Hi, could it support file SD?
* Create endpoint flow Signed-off-by: Hitanshu Mehta <[email protected]>
* add unit test for endpointSet Signed-off-by: Hitanshu Mehta <[email protected]>
* lint fixes Signed-off-by: Hitanshu Mehta <[email protected]>
* fix typo Signed-off-by: Hitanshu Mehta <[email protected]>
* remove code smells Signed-off-by: Hitanshu Mehta <[email protected]>
* start using endpointset instead of storeset Signed-off-by: Hitanshu Mehta <[email protected]>
* remove storeset Signed-off-by: Hitanshu Mehta <[email protected]>
* minor nits Signed-off-by: Hitanshu Mehta <[email protected]>
* Fix failing e2e tests Signed-off-by: Hitanshu Mehta <[email protected]>
* improve logging Signed-off-by: Hitanshu Mehta <[email protected]>
* fix comment Signed-off-by: Hitanshu Mehta <[email protected]>
Changes
--endpoint flag in Querier #4282 and also added --endpoint flag. But the new endpoint flow is baked into the existing storeset flow. This can make the migration process difficult, as discussed here. So I have added the new endpoint flow in a separate file.
storeset.go but I have tried to make it more generic.
Verification