Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Agent remote configuration #76

Closed
3 of 7 tasks
jalvz opened this issue Apr 18, 2019 · 94 comments
Closed
3 of 7 tasks

Agent remote configuration #76

jalvz opened this issue Apr 18, 2019 · 94 comments

Comments

@jalvz
Copy link
Contributor

jalvz commented Apr 18, 2019

Overview

Following #4 , agents need to be able to poll apm-server for configuration changes received upstream from Kibana, apply them, and log the result with status (success | failure), failure cause (if any), timestamp, setting name and value.

We will start providing support for TRANSACTION_SAMPLE_RATE.

Requirements

At minimum, agents need to agree on:

  • name and default value for config polling interval setting (RUM can have a different default).
  • exact message pattern and log level.
  • top 2-5 settings that should follow sampling rate, in order of priority.

APM Server API

  • Server will expose a /config/v1/agents endpoint, for agents to GET with a service.name URL query parameter (required), and service.environment (optional)

  • Agents might send a request with a If-None-Match header, to which Server will respond with a 304 - not modified response; or with 200, a response body with the configuration, and an Etag header.

  • Example
    curl -v -H "If-None-Match:1" "http://localhost:8200/config/v1/agents?service.environment=prod&service.name=opbeans"

*   Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 8200 (#0)
> GET /config?service.environment=prod&service.name=opbeans HTTP/1.1
> Host: localhost:8200
> User-Agent: curl/7.61.0
> Accept: */*
> If-None-Match:1
> 
< HTTP/1.1 200 OK
< Cache-Control: max-age=0
< Content-Type: application/json
< Etag: 2
< Date: Thu, 11 Apr 2019 14:08:32 GMT
< Content-Length: 27
< 
{
  "transaction_sample_rate": 0.7
}

Update 23/04

Configuration settings are taken from the environment variable names, without the ELASTIC_APM_ prefix, and lower case.

Update 06/05

As per comment #76 (comment), the Server will accept query parameters both in the URL and in the body of a POST request.
If different values for the same attribute are provided as POST and GET, request will be 400-rejected; different attributes will be joint.

Other notes that slipped in the initial description:

  • As pointed out in Agent remote configuration #76 (comment), agents should also align the error handling behaviour. For instance, if a config update can't be applied, should fallback to the last good value, to the agents default value, or to the value that the process started with?

  • Regarding service.environment, if none is passed in the query, only config updates without service environment will match. Likewise, if one is passed, only config updates with that value will match (and not config updates without value). In other words, a missing service environment is treated like any other (with a value of "" if you want to see it that way).

Update 29/05

  • The Server will send all attributes as strings.

Update 01/07

  • The Server caches the agent configuration for 30s by default (changeable via apm-server config) and sets the expiration time via Cache-Control: max-age header in every successful response. For failing agent requests the header will be set with a max-age: 300 (5 mins) since querying again after 30s doesn't make sense. Decision was made to set to 5 mins instead of not setting or setting to 0 so agents don't need to put their own logic and can differentiate between server not supporting remote config and failures. More details on this in [ACM] Optimization / caching apm-server#2220.

Updates for 7.3

Status

@elastic/apm-agent-devs please link your implementation issues

Let me know if you have any questions.

@felixbarny
Copy link
Member

Are there plans to create configurations for a more fine-grained subset of agents, based on more than service_name and service_environment possible even via global labels? If so, it might make sense to send the metadata along with the request.

@SergeyKleyman
Copy link
Contributor

A minor comment: the name for sampling rate option is "TRANSACTION_SAMPLE_RATE" (https://docs.google.com/spreadsheets/d/1JJjZotapacA3FkHc2sv_0wiChILi3uKnkwLTjtBmxwU/) so I would suggest to use "transaction_sample_rate" in response JSON for consistency.

@SergeyKleyman
Copy link
Contributor

Should agents be aligned on other aspects of error handling and not just error reporting? For example, if some of the options fail to pass validation should the agent reject the whole configuration or should the agent use the valid options and find some replacements for the invalid ones (use default values, etc.)

@jalvz
Copy link
Contributor Author

jalvz commented Apr 23, 2019

name for sampling rate option is "TRANSACTION_SAMPLE_RATE"

thanks @SergeyKleyman , updated with a note.

Should agents be aligned on other aspects of error handling

Yes, good point. I did highlight error reporting because it might have an impact in Kibana (eg. someone could create dashboards based on it), so I think it is important for the feature as a whole.
You should discuss error handling and align to the the extent of what is reasonable, but it is essentially up to you.

Are there plans to create configurations for a more fine-grained subset of agents

It was discussed briefly. We can ado that, yes, but it doesn't have the highest priority atm. Any ideas here ofc are welcome.

@felixbarny
Copy link
Member

It was discussed briefly. We can ado that, yes, but it doesn't have the highest priority atm. Any ideas here ofc are welcome.

Sure, we don't have to start with that right away. My point is that even if we want to add support for it later, we should already start sending the metadata instead of query parameters now.

@axw
Copy link
Member

axw commented Apr 25, 2019

Sending metadata instead of query parameters does seem like it would make things more extensible. A newer server could start taking advantage of metadata agents already send, without any upgrade of agents. Is this difficult for anyone to implement?

@hmdhk
Copy link
Contributor

hmdhk commented Apr 25, 2019

At least for the RUM agent being able to get the configuration without sending a body is preferable. Furthermore I think the extensibility provided by sending metadata is not very useful when the available configuration options is limited to a small number as is the case for the RUM agent.

@jalvz
Copy link
Contributor Author

jalvz commented Apr 26, 2019

We can accept both, if that makes happy everyone.

@axw
Copy link
Member

axw commented Apr 29, 2019

@jalvz that sounds fine to me, if it's not too onerous on the server.

Could we just say that you can specify any metadata as a query parameter, using the same JSON path? e.g. where a backend agent might use

GET /config
{
  "service": {
    "name": "foo",
    "environment": "bar"
  }
}

the RUM agent would use

GET /config?service.name=foo&service.environment=bar

i.e. with dots instead of _, so we can directly map the query params to a nested object in the metadata. Then, for completeness, we would state that query parameters would be overlaid on the request body (if any).

@jalvz
Copy link
Contributor Author

jalvz commented May 2, 2019

@elastic/apm-agent-devs can I get you to comment on:

top 2-5 settings that should follow sampling rate

Some interesting ones are LOG_LEVEL, METRICS_INTERVAL, SPAN_FRAMES_MIN_DURATION, TRANSACTION_MAX_SPANS, SERVER_TIMEOUT, INSTRUMENT... what would you like to see next?

@felixbarny
Copy link
Member

I'd like to see that users can enter any configuration options via text input.

@jalvz
Copy link
Contributor Author

jalvz commented May 2, 2019

what do you win with that?
(agents have to implement dynamic loading on their side, so we still need that list regardless)

@felixbarny
Copy link
Member

That lets users define settings which are agent specific or were added recently so that the config UI does not know about them yet.

(agents have to implement dynamic loading on their side, so we still need that list regardless)

Not sure what we are actually talking about here. Do you mean 2-5 settings the UI should support next or the settings that the agents should support when reading the /config endpoint? I think the agents should generically recognize all the options they support from the contents from the /config endpoint. Therefore, we don't need to specify a comprehensive list of the options at the API level.

Besides that, support for the ACTIVE flag would be important as well so that agents can be disabled at runtime.

@jalvz
Copy link
Contributor Author

jalvz commented May 2, 2019

That lets users define settings which are agent specific

UI can support settings specific for each agent any case, no need blank text input just for that.

or were added recently so that the config UI does not know about them yet

I'll argue that with existing release cycles this is not too bad (plus: with a settings-aware UI we give users some incentive to upgrade!)

Not sure what we are actually talking about here

How the user will know what settings are reloadable by each agent? That is what I meant above, unless you all implement dynamic reloading for all settings for the next version, users need to know what works and what not.

ACTIVE flag would be important

This is the one named INSTRUMENT, no?

Other counter arguments:

  • Giving feedback to users is not trivial (and left out in first phase), so it is important that UI can do validation upfront, which is much harder that way (eg: validate numeric ranges for specific settings).

  • What to do if user enters a typo? let it go? Validate against a mapping of known settings? do we need a DYM for that?

  • Generally a text input is more prone to mistakes, requires more documentation detailing the expected behaviour and is harder to use as users have no guidance whatsoever

@felixbarny
Copy link
Member

How the user will know what settings are reloadable by each agent?

I think the agent configuration options documentation should state for each config whether it's reloadable. The Java agent already does that: https://www.elastic.co/guide/en/apm/agent/java/current/configuration.html (dynamic true/false). So each setting with dynamic=true should work out of the box.

I do get the point that validation is easier when restricting the options. What do you think about validating the values for known options but still allowing to set options the UI does not yet know about via a free-text input? Users already have the possibility to set options as free text in the configuration files and environment variables so the problem of misspelled options and invalid values is not entirely new. The impact may be more severe and harder to debug with central CM, however

Even if we choose to restrict the options via the UI, I wouldn't restrict it on the API level.

Question: should we define a common format which the agents provide and the UI uses in order to suggest certain settings, along with type and validation metadata?

ACTIVE flag would be important

This is the one named INSTRUMENT, no?

ACTIVE and INSTRUMENT are two different things. I suspect that most agents can't make INSTRUMENT reloadable, in contrast to ACTIVE.

Another question: Should agents query the /config endpoint on startup so that it's possible to configure non-reloadable options? Probably not, right? At least not for the first iteration. Because we would have to block the startup and we need the metadata to call /config but the metadata also contains values from the configuration (like service_name).

@jalvz
Copy link
Contributor Author

jalvz commented May 2, 2019

I wouldn't restrict it on the API level

No, we're not doing that

should we define a common format which the agents provide

I think everything that helps UI to do validation is a great idea 👍

Should agents query the /config endpoint on startup

Agreed, better not for now

@jalvz jalvz added this to the 7.3 milestone May 2, 2019
@SergeyKleyman
Copy link
Contributor

I'd like to see that users can enter any configuration options via text input.

This will be very useful for supportability.
I have a small suggestion - maybe we can hide this section of UI under "Advanced" sub-view or something like that?

@SergeyKleyman
Copy link
Contributor

If we decide to go with configuration options via text input another point we need to consider is what is the expected outcome when the same option is provided twice
(with different values) - once via structured UI and once via text input?

@jalvz
Copy link
Contributor Author

jalvz commented May 6, 2019

Alright, lets keep those arguments for the next iteration.

I added an update to the initial description, please everyone go trough it and let me know if you have anything against.

@felixbarny
Copy link
Member

@SergeyKleyman that got me thinking: What does the APM Server return when multiple configurations match the current agent? Is it guaranteed that only one can ever match? If so, how? If not, how are configurations merged?

Also, how should the configurations from APM Server overrule the other configuration sources? The current hierarchy for the Java agent is (from highest to lowest precedence):

  • Process arguments
  • Environment variables
  • Configuration file

As per comment #76 (comment), the Server will accept query parameters both in the URL and in the body of a POST request.

The contents of the post would be of type application/json and not application/x-www-form-urlencoded, right?

@watson
Copy link
Contributor

watson commented May 7, 2019

Regarding what we can configure in the future, the only config options that the agents currently all agree on are (with "agree on" I mean where we agree both on the config option name and how its value should be interpreted):

  • API_REQUEST_SIZE
  • API_REQUEST_TIME
  • CAPTURE_BODY
  • CAPTURE_HEADERS
  • SECRET_TOKEN (not a candidate for remote configuration unless we want the /config endpoint to be accessible without a secret token)
  • SERVICE_NAME (obviously not a candidate for remote configuration)
  • SERVICE_VERSION (obviously not a candidate for remote configuration)
  • TRANSACTION_MAX_SPANS
  • TRANSACTION_SAMPLE_RATE
  • VERIFY_SERVER_CERT (not really a candidate for remote configuration I think)

Unless the UI allows a user to send some config only to Node.js agent while some other config only to Java agents, we have to currently limit our selves to these config options.

There are a few other options that are not implemented by all agents, but where the subset of agents who have implemented them do all agree. Those might be candidates as well.

@watson
Copy link
Contributor

watson commented May 7, 2019

Regarding the discussion about being able to remotely set the ACTIVE flag: This will be a one-time thing.

Once ACTIVE is set to false, the agent will be disabled and therefore not poll /config again and hence never see if we set it back to true. So I think this belongs in the same category as SECRET_TOKEN and VERIFY_SERVER_CERT, where you can kind of only set it once, but then never again and hence isn't really that use-full in a remote config scenario.

@SergeyKleyman
Copy link
Contributor

Should we have a separate issue to align on meaning of ACTIVE? Judging by https://discuss.elastic.co/t/unable-to-disable-apm-via-environment-variable/180558 some users expect to have a configuration setting to completely disable the agent - even communication to APM Server should be disabled.

@felixbarny
Copy link
Member

The question is whether we need to align on this in the first iteration (is it marked as beta?) and what it means in terms of documentation and consistency. If feasible, we could have the spec somewhat flexible and define that agents SHOULD disable themselves but MAY fall back to the other configuration sources if the APM Server does not respond with 200.

@axw
Copy link
Member

axw commented Jul 10, 2019

I think we should be consistent with this in the first release, and do the simplest thing: start the agent as we do today, with locally-defined config, and update when a positive response is received.

If we later go with disabling until first positive response, I suggest we align again to avoid confusing users of multiple agents.

@eyalkoren
Copy link
Contributor

Trying to put it all together (blocking/non-blocking, active/non-active), can we agree on this:
Current implementation:

  • Block startup
  • initialize all local configurations on startup (environment, config file etc.)
  • Stop blocking the app
  • Do healthcheck
    • if healthcheck succeeds - try to get remote config
      • if succeeds - apply configurations

Future implementation:

  • Block startup
  • initialize all local configurations on startup (environment, config file etc.)
  • When done- set active/recording to false
  • Stop blocking the app
  • Do healthcheck
    • If healthcheck fails - keep inactive
    • if healthcheck succeeds - try to get remote config
      • if succeeds - apply configurations and switch to active (in the future active will be one of the configs coming from this source)
      • if fails - switch to active

@felixbarny
Copy link
Member

felixbarny commented Jul 10, 2019

Most agents don't do healthchecks. A failing healthcheck should also not lead to the agent not polling for the configuration as the APM Server might be temporarily down when the agent starts.

The Java agent only performs a health check on startup so it can write to the logs whether the APM Server is available (also logs version number) on startup with is useful for debugging purposes.

What would be relatively easy to do in the Java agent is to initialize the ApmServerConfigurationSource with active=false. As soon as the configuration is read from remote, that setting would be overridden. If the Server responds with a 404, the settings for this configuration source would be overridden with an empty map. I'm not sure how feasible that is for other agents as they might not work with this configuration source concept where each config source has a map of config keys to values. As ApmServerConfigurationSource has the highest precedence, initially setting active=false there would temporarily disable the agent.

The caveat of this is

The only counter-argument would be that the agents only retry after 5 min in case the APM Server is not available. That would yield in a sub-optimal getting-started experience for someone who starts the APM Server shortly after starting the agent-monitored application.

@SergeyKleyman
Copy link
Contributor

@felixbarny As I mentioned above you can mitigate "only retry after 5 min" by polling much more frequently during the first X minutes after agent startup.

@eyalkoren
Copy link
Contributor

Most agents don't do healthchecks.

Then where applies

A failing healthcheck should also not lead to the agent not polling for the configuration as the APM Server might be temporarily down when the agent starts.

Right, it shouldn't. The current implementation in the Java agent doesn't fit well anymore with polling. If the purpose of healthcheck is to know whether an APM server is available, it makes no sense to fail and start polling some other endpoint. What we can do is remove the healthcheck and rely only on the remote config for that, or switch to remote config polling only after successful healthcheck

@SergeyKleyman
Copy link
Contributor

@eyalkoren

it makes no sense to fail and start polling some other endpoint.

Why? I assume by "other endpoint" you refer to more than one URL in "server-URLs" configuration, right? If so why doesn't it make sense for agent to try and switch to the next URL and try to communicate with it instead (get configuration, send data ,etc.)?

@eyalkoren
Copy link
Contributor

@SergeyKleyman no, I meant if you fail to get a proper response for the healthcheck (localhost:9200/), it doesn't make sense to start polling a different path. It is not about rotating between different servers.

@eyalkoren
Copy link
Contributor

eyalkoren commented Jul 10, 2019

I'll try to rephrase:
You can have two states- connected and not_connected. The initial state is not_connected.
In this state, you rotate over your server URLs until you get a valid connection (using healthcheck), in which case you switch to connected and start a polling cycle for remote config (and become active). Whenever the state changes back to not_connected (because remote config fails or sending events fails) - you stop remote config polling (and become inactive) and go back to your not_connected state, looking for a connection.

@felixbarny
Copy link
Member

I feel like introducing state management overcomplicates things and optimizes for an edge case. In the end, we still want to poll for remote config even if the APM Server is currently unavailable. Whether to try to get the remote config right after a failed healthcheck or to wait for a while does not seem like a significant enough improvement to warrant the complexity.

Again, the healthcheck in the Java agent is only executed once on startup for the purposes of having a log line which contains the APM Server version and the status of the connection which helps with debugging/support. It's a completely different concern and should therefore not influence remote configuration polling.

@eyalkoren
Copy link
Contributor

I feel like introducing state management overcomplicates things and optimizes for an edge case.

I definitely not think state management overcomplicates it and not sure why you refer to it as edge cases. Connection state seems the most reasonable way for me to maintain communication with one of unknown number of servers, where being required to switch shouldn't be considered as edge case.

The healthcheck is not important here, anything that you can poll on would fit. The problem with the new remote config resource is that you cannot rely on it because older servers don't have it. Otherwise, it would be enough for this purpose as well.

@beniwohli
Copy link
Contributor

beniwohli commented Jul 10, 2019

Another case that just came up on my side while testing: how do we handle deleted configs?

Scenario:

  • user creates remote config for an app/environment
  • agent gets config, applies it
  • user deletes config again
  • agent gets 404 response

What do we do now: Revert to local configuration value? Revert to default? Do nothing? Everything but "do nothing" would require somewhat major changes in my implementation, as we'd need to keep track of which options have been changed via remote config, and what their local/default value was.

/edit: we discussed this in the agents meeting of July 10. Felix explained that his configuration system has multiple configuration sources with a precedence hierarchy, so removing the remote config source (which has the highest precedence) automatically reverts to the "local" source with the highest precedence.

We agreed that this is a good guideline implementation for other agents to follow, but that it is not feasible to implement this for 7.4 for most agents. As such, agents should document known limitations of their own implementations in the docs, and update those docs once the limitations have been fixed.

@axw
Copy link
Member

axw commented Jul 12, 2019

Regarding the treatment of default 300s max-age for errors: I'm beginning to think that the server should respond with the same max-age for 404s as for 200s. At the moment the agent will go to sleep for 5 minutes before checking again if there's new config, but once it's got config it'll start querying every 30 seconds. Is there a reason not to make them the same?

@SergeyKleyman
Copy link
Contributor

@axw Maybe we should use Null object pattern and instead of using a special status code (404) server can return the empty configuration?

@simitt
Copy link
Contributor

simitt commented Jul 12, 2019

@axw the reason is we agreed on returning 5mins for all error cases. We could change to 30s again for non-server side errors. (But we should make this decision soon).

@felixbarny
Copy link
Member

I'm +1 for changing the interval to 30s in case of 404s. That makes it much quicker to apply the config when creating a new config for a particular service.

The downside is that agents will poll quite frequently even if the user never even intends to use central config. But I don't think this will cause much overhead or problems.

@simitt
Copy link
Contributor

simitt commented Jul 12, 2019

The downside is that agents will poll quite frequently even if the user never even intends to use central config. But I don't think this will cause much overhead or problems.

In such cases I'd expect ACM to be turned off in the server, or the cache expiration time to be increased.

@axw
Copy link
Member

axw commented Jul 15, 2019

the reason is we agreed on returning 5mins for all error cases.

@simit Yeah and I agree with that, however after starting implementation it occurred to me that this isn't really an error case at all. Agents always have to ask for config, and the fact that there isn't any is an expected/normal outcome.

We could change to 30s again for non-server side errors. (But we should make this decision soon).

I think we should change to 30s for 404 specifically, but not for all client errors (not 403.)

Maybe we should use Null object pattern and instead of using a special status code (404) server can return the empty configuration?

@SergeyKleyman I'd be fine with this too. I'm not sure there's much if any value in the 404 status code, since agents essentially need to treat it the same as a 200.

@Qard
Copy link

Qard commented Jul 17, 2019

So what exactly is the consensus for the mapping of status codes to polling interval? 404 is 30s and everything else is 5m?

@felixbarny
Copy link
Member

It's always the Cache-Control: max-age=<value in sec> value unless that is not present or valid. In that case, the default is 5 min. The only case when the APM Server will return max-age=300 (5min) is when there is an internal server error. In other cases, it returns max-age=30 (30s)

@axw
Copy link
Member

axw commented Jul 17, 2019

@elastic/apm-agent-devs also, in case you've not seen the apm-server PR linked above, the server will now respond with status 200 OK.

IIANM, the response body in this case is currently empty. I think it should be {}, so agents don't need to special-case this type of response. I think the server should also return an Etag in the empty case, so that the agents can next time send it and get back a 304 indicating that there's been no change.

@simitt
Copy link
Contributor

simitt commented Jul 17, 2019

I also kept updating the initial description of this Issue, last edit see section Update for 7.3 so the agreements are summarized (as this Issue thread has grown quite big).

@felixbarny this is not entirely true, as there could also be a 400 or 405 in which cases server also returns max-age=300. The server only returns max-age=30 for succesful responses.
For a 403 there is no Etag, as the request is already terminated with a killswitch checking that the feature is not enabled.

@graphaelli
Copy link
Member

@bmorelli25 mentioned that we refer to this feature differently in the UI, docs, and configs:

  • UI: APM Agent Configuration (BETA)
  • Agents: Central Configuration
  • Server: APM Agent configuration in Kibana

Since APM Agent Configuration isn't specific enough in the agent's context, I propose we sync server with the agents, changing server to APM Agent Central Configuration, and leave the UI as is. If we find users are confused during the beta period, we could add Central in the UI as well. Thoughts, particularly @alvarolobato @tbragin?

@alvarolobato
Copy link
Contributor

@graphaelli sounds good to me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests