Consul name resolution in-memory cache #3121
Conversation
First quick pass - I am not very familiar with blocking queries, could you please help me understand how this PR works?
Also, how does the in-memory cache get purged? And is there a way to limit its size?
Thanks for picking this up :) The way this works: whenever a request comes in to resolve an app-id, the registry is checked for an entry. If one doesn't exist, a watcher routine is kicked off for the service and the resolver falls back to the vanilla route (i.e. it requests the service from the consul agent directly). The watcher routine uses HTTP long polling (consul "blocking queries") against the consul agent to track the set of healthy nodes for the given service/app-id. Whenever there is a change in state, the agent responds with the new set of healthy nodes, the registry entry is updated, and the routine continues on to the next long poll. Any given key (i.e. app-id) in the registry is perpetual so long as its watcher routine is running; if a routine fails, it removes its key from the registry. Finally, the current implementation has no way of limiting the cache size.
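For readers unfamiliar with blocking queries, here is a minimal sketch of the kind of long-poll loop described above, using the `hashicorp/consul/api` client. The function and how results are fed into the registry are illustrative assumptions, not the PR's actual code.

```go
package consulnr

import (
	"log"
	"time"

	consul "github.com/hashicorp/consul/api"
)

// watchService long-polls the consul health endpoint for one service and
// hands the latest set of healthy entries to update on every state change.
func watchService(client *consul.Client, service string, update func([]*consul.ServiceEntry)) {
	var lastIndex uint64
	for {
		// WaitIndex turns this into a blocking query: consul holds the request
		// open (up to WaitTime) until the service's index moves past lastIndex.
		entries, meta, err := client.Health().Service(service, "", true, &consul.QueryOptions{
			WaitIndex: lastIndex,
			WaitTime:  10 * time.Minute,
		})
		if err != nil {
			log.Printf("blocking query for %q failed: %v", service, err)
			time.Sleep(time.Second) // crude backoff before retrying
			continue
		}
		if meta.LastIndex != lastIndex {
			lastIndex = meta.LastIndex
			update(entries) // e.g. write the healthy nodes into the registry
		}
	}
}
```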
@a-elsheikh thanks for explaining! This seems like a very interesting approach. I do have some concerns, however:
@ItalyPaleAle - to answer both questions: as far as I'm aware there is no way to invoke the consul health API to target multiple services/app-ids; it's one service per request. With that said, the number of goroutines used by this pattern is equal to the number of unique services/app-ids the service/sidecar is resolving, so in that context I don't see how goroutines would be expensive unless an app is resolving a very large number of distinct app-ids. I do, however, appreciate the concern about the lack of an upper bound (or even a time-based expiry) on the key set, especially considering that it also scales the number of connections; such is the cost of long polling. Of course the impact of all this depends on the deployment type, i.e. edge/k8s. What do you think about adding a configurable upper bound combined with an LRU policy on the key set, so that by default the sidecar only tracks n services? FYI @berndverst
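A bounded registry of the kind proposed here could look roughly like the sketch below, using the `hashicorp/golang-lru/v2` package. The type names, capacity, and value type are illustrative assumptions, not part of the PR.

```go
package consulnr

import (
	consul "github.com/hashicorp/consul/api"
	lru "github.com/hashicorp/golang-lru/v2"
)

// boundedRegistry caches healthy service entries for at most `size` app-ids,
// evicting the least recently used entry once the limit is reached.
type boundedRegistry struct {
	cache *lru.Cache[string, []*consul.ServiceEntry]
}

func newBoundedRegistry(size int) (*boundedRegistry, error) {
	c, err := lru.New[string, []*consul.ServiceEntry](size)
	if err != nil {
		return nil, err
	}
	return &boundedRegistry{cache: c}, nil
}

func (r *boundedRegistry) addOrUpdate(appID string, entries []*consul.ServiceEntry) {
	// Eviction happens here; stopping the evicted key's watcher would hook in
	// via an eviction callback in a fuller implementation.
	r.cache.Add(appID, entries)
}

func (r *boundedRegistry) get(appID string) ([]*consul.ServiceEntry, bool) {
	return r.cache.Get(appID)
}
```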
Sadly, I don't think we can assume that an app talks to only a small number of other apps; that's a dangerous assumption IMHO. Also, apps can scale horizontally, possibly to dozens or hundreds of instances. In this case, I am more concerned about the usage of TCP sockets, and the potential for TCP port exhaustion on the consul container :( I think an LRU policy, or even just a timeout, may help. A few other options:
@a-elsheikh unfortunately there are too many details to be figured out still, and getting this PR merged in 1.12 (technically already well beyond code freeze) was a stretch. With that said, we are committed to helping you get this PR merged in time for Dapr 1.13.
@ItalyPaleAle you're right, a growing number of clients against the consul agent would be problematic. I've struggled to find any documentation around the gRPC streaming (this) - I think it's only intended for consul->consul comms. It would actually solve both problems, as we could use a single goroutine to subscribe to all the changes and update the cache. I'll keep an eye out, but let me know if you're aware of any docs/spec for a gRPC API on the consul agents. The next best thing: I've found a query on the HTTP health API that allows us to track state changes for multiple services with a single call. The dataset it returns is limited, so it can only act as a notification, and we'll need another request to resolve the target nodes for any changed services; this sacrifices some latency but should still be a much better outcome.
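For illustration, one HTTP health endpoint that covers all services in a single blocking call is `/v1/health/state/:state`. The sketch below uses the Go client; how the PR actually diffs the result and re-resolves changed services is simplified away here and the function is an assumption of this write-up, not the PR's code.

```go
package consulnr

import (
	"time"

	consul "github.com/hashicorp/consul/api"
)

// changedServices blocks until the health-check state changes anywhere, then
// returns the names of services present in the snapshot plus the new index.
func changedServices(client *consul.Client, lastIndex uint64) ([]string, uint64, error) {
	// A single blocking query over all health checks, regardless of service.
	checks, meta, err := client.Health().State(consul.HealthAny, &consul.QueryOptions{
		WaitIndex: lastIndex,
		WaitTime:  10 * time.Minute,
	})
	if err != nil {
		return nil, lastIndex, err
	}
	seen := map[string]struct{}{}
	names := make([]string, 0, len(checks))
	for _, c := range checks {
		if c.ServiceName == "" { // node-level checks carry no service name
			continue
		}
		if _, ok := seen[c.ServiceName]; !ok {
			seen[c.ServiceName] = struct{}{}
			names = append(names, c.ServiceName)
		}
	}
	// Callers would diff this against the watched key set and re-resolve
	// only the services they actually care about.
	return names, meta.LastIndex, nil
}
```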
That sounds like an interesting idea! If the refreshing is done in the background, the latency impact should not make a huge difference anyway.
Thanks for making these changes, it looks good :)
I have left some comments. Could you also please try running the tests for this package with the `-race` flag and confirm there are no race conditions?
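For reference, such a run scoped to this package would be something like `go test -race ./nameresolution/consul/...` (the package path is assumed from the file names in this PR).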
if _, ok := serviceKeys[service]; !ok {
	r.registry.addOrUpdate(service, nil)
}
default:
The problem with this approach is that you are racing with the producer. If the producer adds to the channel faster than you consume them, you will end the loop sooner. Consider maybe making `registrationChannel` accept a slice, so you can send all keys to add in bulk, in a single message over the channel?
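A sketch of that suggested shape (names are illustrative, not taken from the PR): each channel message carries a whole batch of keys, so a single receive gets everything the producer queued and there is no race over how many individual messages are pending.

```go
package consulnr

// The producer sends every pending key as one message:
//
//	registrations <- []string{"app-a", "app-b"}
//
// and the consumer drains a whole batch per receive.
func drainRegistrations(registrations <-chan []string, add func(appID string)) {
	for batch := range registrations {
		for _, appID := range batch {
			add(appID)
		}
	}
}
```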
Sorry, there's a mistake in this comment - I've rephrased both possibilities of the produce/consume statement and addressed each below.
If the producer adds to the channel faster than you consume them, the loop will not end.
This is only possible if the user is constantly resolving new IDs; in that bizarre case caching would be irrelevant anyway.
If the consumer reads from the channel faster than the producer writes, you will end the loop sooner.
This is obviously the expected behaviour.
The registration channel is only written to when a new service/app-id that is not already in the registry is being resolved - there is really no batching concept here, so it doesn't quite make sense to make it a slice channel. To elaborate (a sketch of this flow follows the lists below):
ResolveID (sidecar invoke)
- Resolver checks registry for key
- Key does not exist and is written to channel
- Resolver invokes consul agent directly
Watcher routine (perpetual)
- Watcher picks up key from the channel and cancels long poll
- Watcher adds the key to the registry
- Watcher checks for any other new keys on channel
- Watcher invokes long poll with new set of keys
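A condensed Go sketch of that interplay. The names and signatures are illustrative stand-ins for the PR's actual types, and the blocking query itself is reduced to a callback.

```go
package consulnr

import "context"

// resolveID is the hot path: it only signals the watcher when it sees an
// app-id the registry does not yet track, then resolves directly anyway.
func resolveID(reg map[string]bool, newKeys chan<- string, appID string, direct func(string) (string, error)) (string, error) {
	if !reg[appID] {
		select {
		case newKeys <- appID: // hand the key to the watcher routine
		default: // channel full: the watcher will catch it on a later resolve
		}
	}
	return direct(appID) // vanilla route: ask the consul agent directly
}

// watchLoop is the perpetual routine: it drains newly registered keys and
// then runs a blocking query over the full key set, repeating forever.
func watchLoop(ctx context.Context, newKeys <-chan string, longPoll func(context.Context, map[string]bool)) {
	keys := map[string]bool{}
	for ctx.Err() == nil {
		// Pick up any keys queued since the last poll, without blocking.
		for drained := false; !drained; {
			select {
			case k := <-newKeys:
				keys[k] = true
			default:
				drained = true
			}
		}
		if len(keys) == 0 {
			select { // nothing to watch yet: wait for the first key
			case k := <-newKeys:
				keys[k] = true
			case <-ctx.Done():
				return
			}
			continue
		}
		// In the PR the arrival of a new key cancels this call so the key set
		// can be extended immediately; here it simply returns on each change.
		longPoll(ctx, keys)
	}
}
```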
nameresolution/consul/watcher.go (outdated diff)
func (r *resolver) runWatchPlan(p *watchPlan, services []string, ctx context.Context, watchTask chan bool) {
	defer func() {
		recover()
Why is there a need for a `recover()`? We normally try to avoid panicking entirely.
The `recover()` here was to cater for any transport panics that might occur from within the consul libs. It's clear to me now that this was a false expectation, as panics are not equivalent to exceptions in behaviour and there are no explicit panics referenced within the consul client API :) will remove
@ItalyPaleAle thanks for raising this one - I didn't know this feature existed, it's awesome. It found races in the test mocks where invocation counts are being checked and waited on; in any case I've solved those now, so it's a clean run every time 👍
@a-elsheikh thank you for your patience while we're reviewing this! With the 1.12 release constantly delayed, and Bernd's OOF for a while, it's taking me longer to review these PRs :( I have a few cosmetic changes. I cannot push to your branch, so can you please merge this?

curl -L "https://github.com/dapr/components-contrib/commit/346bcb17a2b8934ef48558503d179261ba65a98e.patch" | git am

After that... At this stage this PR looks really good, and it's almost ready to be merged :) The only thing missing is the ability for the component to perform a clean shutdown, which we implement for most components. You can see an example in the mDNS resolver!
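For illustration, a clean shutdown for a watcher-based resolver usually looks something like the sketch below; the field names and the exact interface Dapr expects are assumptions here, not taken from the PR or the mDNS resolver.

```go
package consulnr

import (
	"context"
	"sync"
)

type consulResolver struct {
	cancel  context.CancelFunc // cancels the watcher's blocking queries
	watcher sync.WaitGroup     // tracks the background goroutine(s)
	once    sync.Once
}

// Close stops the background watcher and waits for it to exit, so the
// sidecar can shut down without leaking goroutines or connections.
func (r *consulResolver) Close() error {
	r.once.Do(func() {
		if r.cancel != nil {
			r.cancel()
		}
		r.watcher.Wait()
	})
	return nil
}
```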
@ItalyPaleAle thanks for the update, response below 👍
I've applied the patch and it looks good, but I'm not sure I agree with replacing the "watcher started" mutex+bool with an atomic bool - the following code is on a hot path, so it makes sense to only synchronize if we think the watcher is not running; otherwise we should be happy reading the bool and continuing. I've also benchmarked this for good measure, and on parallel runs using the atomic is an order of magnitude slower. Of course this only applies if a sidecar is making parallel invocations, and we are talking ns deltas, so perhaps

Perfect - I'll use this as an opportunity to add an optional deregister on close; it's much needed!
@@ -54,7 +54,8 @@ As of writing the configuration spec is fixed to v1.3.0 of the consul api
 | Tags | `[]string` | Configures any tags to include if/when registering services |
 | Meta | `map[string]string` | Configures any additional metadata to include if/when registering services |
 | DaprPortMetaKey | `string` | The key used for getting the Dapr sidecar port from consul service metadata during service resolution, it will also be used to set the Dapr sidecar port in metadata during registration. If blank it will default to `DAPR_PORT` |
-| SelfRegister | `bool` | Controls if Dapr will register the service to consul. The name resolution interface does not cater for an "on shutdown" pattern so please consider this if using Dapr to register services to consul as it will not deregister services. |
+| SelfRegister | `bool` | Controls if Dapr will register the service to consul on startup. If unset it will default to `false` |
+| SelfDeregister | `bool` | Controls if Dapr will deregister the service from consul on shutdown. If unset it will default to `false` |
I think when we tried to do that (to solve #2490) it turned out to be a bad idea: it un-registered all instances of an app, not just the one that was going offline.
I would prefer if this wasn't even offered as an option, as it can cause confusion to users.
(And with #2490 merged in 1.12, this isn't needed anyways)
The previous code had a race condition: if 2 requests came in at the same time and the watcher wasn't running, both goroutines could read and write the "started" flag concurrently. The general rule is that if there are 2 goroutines accessing the same value, and at least one of them is doing a write, then you must use either a lock or an atomic value. You can read more here: https://go.dev/doc/articles/race_detector
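As an illustration of that rule (not the PR's actual code), the started flag can be checked and set with an atomic compare-and-swap so that concurrent resolves start the watcher exactly once:

```go
package consulnr

import "sync/atomic"

// watcherGate shows the rule above in miniature: the flag is read and written
// by many goroutines, so it must be atomic (or guarded by a mutex), otherwise
// `go test -race` will flag it.
type watcherGate struct {
	started atomic.Bool
}

// ensure starts the watcher exactly once even under concurrent resolves:
// CompareAndSwap only succeeds for the first caller to flip false -> true.
func (g *watcherGate) ensure(start func()) {
	if g.started.CompareAndSwap(false, true) {
		start()
	}
}
```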
I left a comment on the code. A few months ago I did some investigation and found that deregister removed the entire service, so if your app is scaled horizontally, it un-registers the other instances too. Because of that, I think this option is "dangerous", as users may not be aware of its behavior when the app is scaled horizontally. In Dapr 1.12, #2490 was merged, so we should be good with unregistering!
@ItalyPaleAle can you share any details of this investigation? The behaviour you're describing is contrary both to what I'm observing and to the documentation. The only way I can see that happening is if you were using the older version of Dapr which (by default) did not generate unique IDs for the registrations and instead used the AppID, thus giving every instance of a service on each node the same ID; this is no longer the case thanks to #1802. Even with that said, I still don't see how that behaviour could have happened using agent/service/deregister
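For context on why unique registration IDs matter here: the agent deregister endpoint takes a service ID, not a service name, so with per-instance IDs only one registration is removed. A hedged sketch with the Go client follows; the function and example ID are made up for illustration.

```go
package consulnr

import (
	"fmt"

	consul "github.com/hashicorp/consul/api"
)

// deregisterInstance removes a single registration by its unique service ID.
// Instances registered under the same name but different IDs are untouched.
func deregisterInstance(client *consul.Client, serviceID string) error {
	// e.g. serviceID = "my-app-6f3c2a10" (unique per instance since #1802)
	if err := client.Agent().ServiceDeregister(serviceID); err != nil {
		return fmt.Errorf("deregistering %s: %w", serviceID, err)
	}
	return nil
}
```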
@a-elsheikh I wrote about that here: #2490 (comment) I can't remember the details exactly; it's been a while :(
This was the PR I created as a test: #2594. In there you can still see the commit where I added the "Deregister" and then removed it because of those issues.
@ItalyPaleAle I reviewed the comments but it's still not clear how this was tested - our test environment also uses horizontally scaled apps and the behaviour looks fine 🤔 Can you try re-testing with this implementation please 🙏
To be fully honest, I don't remember; it was a long time ago! OK, let's merge this PR. We can do more tests later, and if we see something off in the behavior of unregistering, we can disable it later on. Thanks for your contributions :) One more ask: can you please update our docs to document the new options for Consul? See dapr/docs
@ItalyPaleAle you're most welcome - I'll get the docs updated soon! Thanks for the help 🙏
Description
Added service cache to consul nr component
Clone of ancient PR 963 in order to work around the DCO check
Issue reference
#934
Checklist
Please make sure you've completed the relevant tasks for this PR, out of the following list: