-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
validation: Add config validation mode. (#499) #863
Conversation
…nd add a factory method to it for the ClusterManager. This is preparatory to adding a ValidationClusterManagerFactory, which creates ValidationClusterManagers.
This adds a --mode command-line flag. "--mode=serve" is the default and can be omitted; in that mode Envoy runs as it always has. With "--mode=validate" Envoy checks that the configuration file in --config-file is valid, then exits. "Valid" means that it performs the JSON schema check as usual, but also initializes as much of its internal state as possible, exiting with any errors as it would under "--mode=serve". When initialization is complete, instead of listening for traffic, Envoy prints an "OK" message to stderr and exits, returning success. In validation mode, to the extent possible, Envoy avoids interfering with its environment. For example, the hot-restart process is skipped, so validation can safely be run on the same machine as a serving Envoy. Validation mode also doesn't send any upstream traffic or open any listeners. It *does* attempt to read any files referenced in the config, like certs and private keys, so validation will fail if those files aren't located at the expected paths. A future "lite validation" mode, that mocks out the filesystem as well as the network, is not yet implemented as of this patch.
@mattklein123: As we talked about a while ago, this ended up having a fair number of tentacles mocking out different parts of the class structure. Comments on how you'd organize it differently are more than welcome. Remember I'm still reasonably new to the language, so if anything's surprising in here, there's no subtle genius, just wrongness. :) |
Nit: why |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a useful feature, thanks @rlazarus. Before going too far into the review, I have an architectural question, which I actually feel bad about bringing up at such a late stage, given this looks like it's basically complete.
I'm wondering whether it would have been better to set a global dry_run
flag and condition on the existing code paths for server initialization rather than create the bifurcated initialization in this PR? I feel it's going to be an ongoing maintenance issue to keep the two in sync. But, I haven't explored this other alternative, so please let me know what your thoughts are.
include/envoy/server/options.h
Outdated
@@ -11,6 +11,31 @@ | |||
namespace Server { | |||
|
|||
/** | |||
* Whether to run Envoy in serving mode, or in config-validation mode at one of two levels (in which |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: s/config-validation/configuration validation/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
* Default mode: Regular Envoy serving process. Configs are validated in the normal course of | ||
* initialization, but if all is well we proceed to serve traffic. | ||
*/ | ||
Serve, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's fine in the enum here, but for the command-line, I'm in agreement with @PiotrSikora, it's probably cleaner to have a --dry-run
and --dry-run-lite
flags, I don't think anyone will ever set --mode=serve
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, my thinking is that we might end up with any number of distinct modes for running Envoy -- "full validation" and "lite validation" are two, but conceptually --hot-restart-version
is another: if we implemented that today, it might be a value for --mode
. These all have two important things in common: that they significantly modify the semantics of what it means to run the binary (you expect drastically different things to happen) and, more importantly to the flag question, that it doesn't make sense to specify more than one.
We can grow a lot of boolean flags and document them all with "set either zero or one of these flags because they're mutually exclusive" or we can make that implicit with an enum-valued flag, and I argue that's much clearer. I agree nobody will ever explicitly set --mode serve
, but I think it's still semantically a default state.
I'm 100% open to different names (e.g. I don't feel strongly about "validate" over "dry run"). I'm also willing to be outvoted on an enum vs. a fleet of booleans, but it was an intentional stance.
(cc @PiotrSikora because this is in reply to your comment too)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does hot restart version do that isn't done by --mode=serve
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It prints the hot restart version and exits immediately (https://github.com/lyft/envoy/blob/master/source/server/options_impl.cc#L63) without serving, validating, or doing anything else.
source/exe/main.cc
Outdated
@@ -33,10 +35,32 @@ class ProdComponentFactory : public ComponentFactory { | |||
|
|||
} // Server | |||
|
|||
int validateConfigAndExit(OptionsImpl& options, Server::ProdComponentFactory& component_factory, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would be better placed in an anonymous namespace to make it local to the translation unit (similar to a static function, but the C++ way is to use namespaces).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
source/exe/main.cc
Outdated
options.serviceClusterName(), options.serviceNodeName()); | ||
|
||
if (options.mode() != Server::Mode::Serve) { | ||
return validateConfigAndExit(options, component_factory, local_info); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be clearer to call it just validateConfig
, as the AndExit
implies to me that it's going to do a ::exit(something);
, which it doesn't, it returns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point -- the name was left over from an earlier version where that's what it did. Done.
source/exe/main.cc
Outdated
|
||
Server::ValidationInstance server(options, restarter, stats_store, restarter.accessLogLock(), | ||
component_factory, local_info); | ||
std::cerr << "configuration '" << options.configPath() << "' OK" << std::endl; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Since this isn't an error, I think stdout
might be more appropriate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. (My original thinking was to have success and failure outputs go the same place, and this is now the only reference to stdout in non-test code -- but that's probably fine.)
source/exe/main.cc
Outdated
@@ -33,10 +35,32 @@ class ProdComponentFactory : public ComponentFactory { | |||
|
|||
} // Server | |||
|
|||
int validateConfigAndExit(OptionsImpl& options, Server::ProdComponentFactory& component_factory, | |||
const LocalInfo::LocalInfoImpl& local_info) { | |||
Server::ValidationHotRestart restarter; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you need this restarter only for its logLock()
and accessLogLock()
? Could these be just independent log objects you create here instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Server interface exposes it via hotRestart()
.
FWIW, @rlazarus and I discussed this a while back and decided to go with the "mock" solution. The general thinking was that plumbing a dry_run param through the code would make the primary server logic harder to read and more complicated and the flag would have to go into a lot of places since this testing is doing semantic testing, not just JSON validation. In some sense, it's doing mocking without using gmock. I agree with @htuch that this is ugly, but on the other hand I think we need to choose between 3 ugly options: 1) Do this. 2) plumb dry_run through a ton of places. 3) Actually use gmock and the test mocks in prod code. I still think #1 is the least ugly, but let's discuss tomorrow if we need to all get on the same page. |
One other random thought for discussion tomorrow. Assuming we keep this approach, I think it would be good to switch the ExampleConfig test over from gmock to use these "mocks." This will help us get better test coverage: Eventually also probably the router test tool that @hennna wrote. |
What's the strategy the maintenance ? I haven't looked into the code (so feel free to say "go look st the code :)" ). If I were to introduce a new feature, is there a structure in place that can be followed to keep adding more stuff to the validation logic ? How is RDS, CDS and SDS validated ? (I am going totally by the PR description. ) On a side note, it seems that all you want to accomplish here is to ensure that Envoy initializes without perturbing the environment. From that perspective, reads of any form are okay right (from files, dns, etc). It's the writes that complicate things (pushing stats, and making requests). Rather than validating from inside Envoy, why can't you validate externally observable behavior (e.g. expect connection to foo.bar.com with so and so headers) using an external sink/request gen ? (This is not same as using gmock in prod code). You could emulate the external environment without Envoy knowing about it (like the nginx perl tests?). Not sure if this covers the use cases you are looking for. |
@rlazarus @htuch was there any offline conversation about this? Is it ready for wider review? I also need to sync with @junr03 tomorrow on whether this meets our internal needs / what might be missing (somewhat related to questions from @rshriram). In general I think if we convert example config tests away from mocks and to this stuff I'm more comfortable (haven't looked at whether that was done yet or not). |
@mattklein123 I need to give this another pass with a broad "it's probably the cleanest way we have to handle this" notion in mind. Will try make this happen today. |
* Default mode: Regular Envoy serving process. Configs are validated in the normal course of | ||
* initialization, but if all is well we proceed to serve traffic. | ||
*/ | ||
Serve, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does hot restart version do that isn't done by --mode=serve
?
@@ -140,6 +145,16 @@ class ClusterManagerFactory { | |||
virtual ~ClusterManagerFactory() {} | |||
|
|||
/** | |||
* Allocate a cluster manager from configuration JSON. | |||
*/ | |||
virtual ClusterManagerPtr clusterManagerFromJson(const Json::Object& config, Stats::Store& stats, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we remove the other methods from this interface now that we have this? E.g. can allocateConPool
just be part of ClusterManagerImpl
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think not easily -- that distinction is there for testability, see https://github.com/lyft/envoy/blob/master/test/common/upstream/cluster_manager_impl_test.cc#L38. (I'm actually wasn't originally 100% sure that the ClusterManagerFactory is morally speaking the right place for, well, a ClusterManager factory function -- but I successfully talked myself into it. I could be talked back out of it, if there's a better place.)
@@ -39,6 +39,7 @@ AsyncClient::Request* AsyncClientImpl::send(MessagePtr&& request, AsyncClient::C | |||
const Optional<std::chrono::milliseconds>& timeout) { | |||
AsyncRequestImpl* async_request = | |||
new AsyncRequestImpl(std::move(request), *this, callbacks, timeout); | |||
async_request->initialize(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want consistency in AsyncStreamImpl
with the initialization pattern? I.e. it might be surprising to someone using both AsyncRequestImpl
and AsyncStreamImpl
that they have slightly different ways to kick off a request after construction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AsyncStreamImpl doesn't currently do any real work in its ctor, though. I could add an initialize()
method but it would only be for consistency's sake.
source/exe/main.cc
Outdated
Server::ValidationHotRestart restarter; | ||
Logger::Registry::initialize(options.logLevel(), restarter.logLock()); | ||
Stats::HeapRawStatDataAllocator alloc; | ||
Stats::ThreadLocalStoreImpl stats_store(alloc); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need TLS during validation? Or is it possible to complete all the initialization+validation on just the entry thread? If so, we could use a simpler stats_store
, e.g. IsolatedStoreImpl
.
source/exe/main.cc
Outdated
int validateConfig(OptionsImpl& options, Server::ProdComponentFactory& component_factory, | ||
const LocalInfo::LocalInfoImpl& local_info) { | ||
Server::ValidationHotRestart restarter; | ||
Logger::Registry::initialize(options.logLevel(), restarter.logLock()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The export of the log and access locks isn't part of the HotRestart
interface. I think you can both simplify here by just using locally defined lock objects in both cases and also removing locking from the Server::ValidationHotRestart
implementation, can you take a look to see if this is the case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done -- I had had it that way for consistency between the regular and validation HotRestarts, but you're right that this is simpler.
source/exe/main.cc
Outdated
Event::Libevent::Global::initialize(); | ||
OptionsImpl options(argc, argv, Server::SharedMemory::version(), spdlog::level::warn); | ||
Server::ProdComponentFactory component_factory; | ||
LocalInfo::LocalInfoImpl local_info(Network::Utility::getLocalAddress(), options.serviceZone(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question; do we want to rely on local address or should this be something you can pass in? I'm imagining using Envoy in a config validation pipeline that isn't run directly on the target server.
source/exe/main.cc
Outdated
options.serviceClusterName(), options.serviceNodeName()); | ||
|
||
if (options.mode() != Server::Mode::Serve) { | ||
return validateConfig(options, component_factory, local_info); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want to add a if (option.mode() == Server::Mode::Validate) { NOT_IMPLEMENTED; }
? That would avoid someone asking for light mode and getting back full validation behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Assuming you mean == Server::Mode::ValidateLight
.)
In practice I just never set that enum value -- in options_impl.cc if you pass anything other than --mode serve
or validate
you get the "I've never heard of that mode" error. But a belt-and-suspenders check here seems reasonable; done.
Event::Dispatcher& dispatcher() override { return handler_.dispatcher(); } | ||
Network::DnsResolver& dnsResolver() override { return dns_resolver_; } | ||
bool draining() override { | ||
throw EnvoyException("ValidationInstance::draining() not implemented"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a super common pattern with some boilerplate. What do you reckon about a #define
(I know, the shock, the horror) in the spirit of MOCK_METHOD
, which does all the exception generation and name mangling etc? I think this will make it more readable/maintainable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could do, or I could just replace all the exceptions with NOT_IMPLEMENTED;
, which should do the right thing now that our stack traces are friendly -- and would also save me from falling into that exception trap I fell into. I think I've just talked myself into doing it that way, any objections?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either way works, I have a slight preference for whatever is less verbose.
Configuration::InitialImpl initial_config(*config_json); | ||
|
||
for (uint32_t i = 0; i < std::max(1U, options.concurrency()); i++) { | ||
workers_.emplace_back(new Worker(thread_local_, options.fileFlushIntervalMsec())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I was wonder if we can get away with no threading during validation above, but as I look through this method I see there's a fair bit of stuff going on related to this. @mattklein123 Do you think it's possible in Envoy to perform all config validation without bringing up any additional threads?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After a quick chat with @mattklein123, I think we can do this more simply ala https://github.com/lyft/envoy/blob/master/test/config_test/config_test.cc#L25. We shouldn't need to start any workers, bring up thread-local state, use threaded variants of things like the stats_store_
etc. You really just need to provide the mocks that are in config_test
with your classes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And as part of this change please just make config test use your mocks per other comments. Will provide much better test coverage (though might require the "light" version also which doesn't read files).
* If we finish initialization, and reach the point where an ordinary Envoy run would begin serving | ||
* requests, the validation is considered successful. | ||
*/ | ||
class ValidationInstance : Logger::Loggable<Logger::Id::main>, public Instance { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need to inherit from the implementation, or can it just be a bare minimum implementation that exports out methods similar to what is used in https://github.com/lyft/envoy/blob/master/test/config_test/config_test.cc?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't inherit from the implementation -- Server::Instance
is the interface, as distinct from Server::InstanceImpl
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, the C++ style is great, I wouldn't have picked this for a PR from someone new to the language :)
@rlazarus can we move this PR forward? Would like to try to get it merged or close for now. |
@mattklein123 Yes -- I've already been working with @htuch today to get it out the door ASAP. |
These don't do anything except test that the mocked out methods are mocked out. They're purely to keep coverage from dropping.
@htuch Assuming Travis smiles upon this, it's ready for another look. Nits more than welcome. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM modulo a few small comments.
CdsApiPtr ValidationClusterManagerFactory::createCds(const Json::Object& config, | ||
ClusterManager& cm) { | ||
// Create the CdsApiImpl... | ||
CdsApiPtr cds = ProdClusterManagerFactory::createCds(config, cm); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you even need to assign to cds
here since it's unused later?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope, good point. Done.
ValidationInstance server(options, stats_store, access_log_lock, component_factory, local_info); | ||
std::cout << "configuration '" << options.configPath() << "' OK" << std::endl; | ||
server.shutdown(); | ||
return 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer using EXIT_SUCCESS
/EXIT_FAILURE
here for clarity, or just a plain bool
, making validateConfig
not tied to the exit status and perform the mapping to exit code in main.cc
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, done.
|
||
/** | ||
* validateConfig() takes over from main() for a config-validation run of Envoy. It returns an exit | ||
* code for the process: 0 if the config is valid, 1 if invalid. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above comment on the return value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Upstream::ProdClusterManagerFactory cluster_manager_factory( | ||
server.runtime(), server.stats(), server.threadLocal(), server.random(), server.dnsResolver(), | ||
server.sslContextManager(), server.dispatcher(), server.localInfo()); | ||
MainImpl config(server, cluster_manager_factory); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not support multiple constructors (including the old one), so can avoid the boilerplate here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I follow -- by multiple constructors do you mean offer MainImpl(Server::Instance&)
as well as the new MainImpl(Server::Instance&, ClusterManagerFactory&)
? That might work but I'm not sure about the lifetimes -- the MainImpl
only holds a reference to the ClusterManagerFactory
(and needs it in initialize()
, not in its ctor) so I don't immediately see a clean way but maybe I'm missing it.
If the concern is just that these tests are noisy now, I agree -- if I refactored them to use a test fixture, so that we only had to say that once and the individual tests were more compact, would that suit you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, whatever you think is best here. I'm just the enemy of boiler plate.
@@ -0,0 +1,58 @@ | |||
#include "envoy/json/json_object.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It occurred to me that another way you could get coverage is to write sufficient example configs to exercise all paths, but I think what you have here is fine if it doesn't regress coverage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah -- once we've replaced the example config tests with this, adding more example configs sounds like a good way of upping the coverage. But we might as well get there first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rlazarus Optionally in this PR you could also update the docs in https://github.com/lyft/envoy/tree/master/docs to talk about this tool. I say optionally as I think that would be a fine in a followup PR, up to you.
@rlazarus Can you improve the coverage here so bazel.coverage passes again? |
Frustrating: Merging master seems to have taken us from just above the code coverage setpoint to just below. I'll see if I can effectively cover the NOT_IMPLEMENTED lines with a death test, because there's not much else to do. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ready to ship as soon as coverage is fixed.
Right -- that's what @htuch and I are looking at, I pushed that commit so he could patch it in. |
If it helps I found reference to manually calling __gcov_flush although at the time I was investigating the signal handler code wasn't in place yet. It might be possible to easily integrate __gcov_flush call now that the custom signal handlers are installed. |
…y NOT_IMPLEMENTED lines that coverage doesn't count
This reverts commit 18fb3c2. The tests were only added for the coverage metric, and they don't increase the coverage metric.
@rlazarus Looks like some minor bit rot to fix. |
This adds a
--mode
command-line flag.--mode=serve
is the default and can be omitted; in that mode Envoy runs as it always has.With
--mode=validate
Envoy checks that the configuration file in --config-file is valid, then exits. "Valid" means that it performs the JSON schema check as usual, but also initializes as much of its internal state as possible, exiting with any errors as it would under--mode=serve
. When initialization is complete, instead of listening for traffic, Envoy prints an "OK" message to stderr and exits, returning success.In validation mode, to the extent possible, Envoy avoids interfering with its environment. For example, the hot-restart process is skipped, so validation can safely be run on the same machine as a serving Envoy. Validation mode also doesn't send any upstream traffic or open any listeners.
It does attempt to read any files referenced in the config, like certs and private keys, so validation will fail if those files aren't located at the expected paths. A future "lite validation" mode, that mocks out the filesystem as well as the network, is not yet implemented as of this patch.