Agent remote configuration #76
Comments
Are there plans to create configurations for a more fine-grained subset of agents, based on more than |
A minor comment: the name for sampling rate option is "TRANSACTION_SAMPLE_RATE" (https://docs.google.com/spreadsheets/d/1JJjZotapacA3FkHc2sv_0wiChILi3uKnkwLTjtBmxwU/) so I would suggest to use "transaction_sample_rate" in response JSON for consistency. |
Should agents be aligned on other aspects of error handling and not just error reporting? For example, if some of the options fail to pass validation should the agent reject the whole configuration or should the agent use the valid options and find some replacements for the invalid ones (use default values, etc.) |
thanks @SergeyKleyman , updated with a note.
Yes, good point. I did highlight error reporting because it might have an impact in Kibana (e.g. someone could create dashboards based on it), so I think it is important for the feature as a whole.
It was discussed briefly. We can do that, yes, but it doesn't have the highest priority atm. Any ideas here ofc are welcome. |
Sure, we don't have to start with that right away. My point is that even if we want to add support for it later, we should already start sending the metadata instead of query parameters now. |
Sending metadata instead of query parameters does seem like it would make things more extensible. A newer server could start taking advantage of metadata agents already send, without any upgrade of agents. Is this difficult for anyone to implement? |
At least for the RUM agent, being able to get the configuration without sending a body is preferable. Furthermore, I think the extensibility provided by sending metadata is not very useful when the available configuration options are limited to a small number, as is the case for the RUM agent. |
We can accept both, if that makes everyone happy. |
@jalvz that sounds fine to me, if it's not too onerous on the server. Could we just say that you can specify any metadata as a query parameter, using the same JSON path? e.g. where a backend agent might use
the RUM agent would use
i.e. with dots instead of _, so we can directly map the query params to a nested object in the metadata. Then, for completeness, we would state that query parameters would be overlaid on the request body (if any). |
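A rough sketch of the mapping proposed above, not the server's actual implementation: dotted query parameter names are expanded into a nested object, which is then overlaid on the request body (if any). The function names `query_to_nested` and `overlay` are made up for illustration, and the sketch assumes no parameter is used both as a leaf and as a parent (e.g. `service=x&service.name=y`).

```python
from copy import deepcopy
from urllib.parse import parse_qsl


def query_to_nested(query: str) -> dict:
    """Expand 'service.name=opbeans' style parameters into a nested dict."""
    nested: dict = {}
    for key, value in parse_qsl(query, keep_blank_values=True):
        node = nested
        *parents, leaf = key.split(".")
        for part in parents:
            # Assumption: a parent segment never already holds a plain value.
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


def overlay(body: dict, params: dict) -> dict:
    """Return a copy of body with params laid on top, merging nested objects."""
    merged = deepcopy(body)
    for key, value in params.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = overlay(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged
```

With this, a backend agent's body and a RUM agent's query string end up in the same nested shape on the server side, so matching logic only has to deal with one representation.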
@elastic/apm-agent-devs can I get you to comment on:
Some interesting ones are LOG_LEVEL, METRICS_INTERVAL, SPAN_FRAMES_MIN_DURATION, TRANSACTION_MAX_SPANS, SERVER_TIMEOUT, INSTRUMENT... what would you like to see next? |
I'd like to see that users can enter any configuration options via text input. |
what do you win with that? |
That lets users define settings which are agent specific or were added recently so that the config UI does not know about them yet.
Not sure what we are actually talking about here. Do you mean 2-5 settings the UI should support next or the settings that the agents should support when reading the Besides that, support for the |
The UI can support settings specific to each agent in any case; there is no need for a blank text input just for that.
I'll argue that with existing release cycles this is not too bad (plus: with a settings-aware UI we give users some incentive to upgrade!)
How will the user know which settings are reloadable by each agent? That is what I meant above: unless you all implement dynamic reloading for all settings in the next version, users need to know what works and what does not.
This is the one named INSTRUMENT, no? Other counter arguments:
|
I think the agent configuration options documentation should state for each config whether it's reloadable. The Java agent already does that: https://www.elastic.co/guide/en/apm/agent/java/current/configuration.html (dynamic true/false). So each setting with
I do get the point that validation is easier when restricting the options. What do you think about validating the values for known options but still allowing to set options the UI does not yet know about via a free-text input? Users already have the possibility to set options as free text in the configuration files and environment variables so the problem of misspelled options and invalid values is not entirely new. The impact may be more severe and harder to debug with central CM, however
Even if we choose to restrict the options via the UI, I wouldn't restrict it on the API level.
Question: should we define a common format which the agents provide and the UI uses in order to suggest certain settings, along with type and validation metadata?
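To make the "validate known options, pass unknown ones through" idea concrete, here is a minimal sketch. Everything in it is hypothetical: `OptionSpec`, `KNOWN_OPTIONS` and `check_settings` are illustrative names, and the range checks are assumptions rather than an agreed format between agents and the UI.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class OptionSpec:
    """Hypothetical per-option metadata an agent could publish for the UI."""
    parse: Callable[[str], object]      # converts the raw text value
    is_valid: Callable[[object], bool]  # range or enum check
    dynamic: bool                       # whether the agent can reload it at runtime


# Illustrative entries only; the limits are assumptions, not a spec.
KNOWN_OPTIONS: Dict[str, OptionSpec] = {
    "transaction_sample_rate": OptionSpec(float, lambda v: 0.0 <= v <= 1.0, True),
    "transaction_max_spans": OptionSpec(int, lambda v: v >= 0, True),
}


def check_settings(settings: Dict[str, str]) -> Tuple[Dict[str, str], List[str]]:
    """Validate known options; let unknown options through untouched."""
    accepted: Dict[str, str] = {}
    problems: List[str] = []
    for name, raw in settings.items():
        spec = KNOWN_OPTIONS.get(name)
        if spec is None:
            accepted[name] = raw  # unknown to the UI: free-text pass-through
            continue
        try:
            value = spec.parse(raw)
        except ValueError:
            problems.append(f"{name}: cannot parse {raw!r}")
            continue
        if spec.is_valid(value):
            accepted[name] = raw
        else:
            problems.append(f"{name}: value {raw!r} is out of range")
    return accepted, problems
```

The `dynamic` flag is where a UI could surface whether a setting is reloadable for a given agent, which ties in with the documentation point above.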
Another question: Should agents query the |
No, we're not doing that
I think everything that helps UI to do validation is a great idea 👍
Agreed, better not for now |
This will be very useful for supportability. |
If we decide to go with configuration options via text input another point we need to consider is what is the expected outcome when the same option is provided twice |
Alright, let's keep those arguments for the next iteration. I added an update to the initial description, please everyone go through it and let me know if you have anything against it. |
@SergeyKleyman that got me thinking: What does the APM Server return when multiple configurations match the current agent? Is it guaranteed that only one can ever match? If so, how? If not, how are configurations merged? Also, how should the configurations from APM Server overrule the other configuration sources? The current hierarchy for the Java agent is (from highest to lowest precedence):
The contents of the post would be of type |
Regarding what we can configure in the future, the only config options that the agents currently all agree on are (with "agree on" I mean where we agree both on the config option name and how its value should be interpreted):
Unless the UI allows a user to send some config only to the Node.js agent and some other config only to Java agents, we currently have to limit ourselves to these config options. There are a few other options that are not implemented by all agents, but where the subset of agents that have implemented them do all agree. Those might be candidates as well. |
Regarding the discussion about being able to remotely set the Once |
Should we have a separate issue to align on meaning of |
The question is whether we need to align on this in the first iteration (is it marked as beta?) and what it means in terms of documentation and consistency. If feasible, we could have the spec somewhat flexible and define that |
I think we should be consistent with this in the first release, and do the simplest thing: start the agent as we do today, with locally-defined config, and update when a positive response is received. If we later go with disabling until first positive response, I suggest we align again to avoid confusing users of multiple agents. |
Trying to put it all together (blocking/non-blocking, active/non-active), can we agree on this:
Future implementation:
|
Most agents don't do healthchecks. A failing healthcheck should also not lead to the agent not polling for the configuration as the APM Server might be temporarily down when the agent starts. The Java agent only performs a health check on startup so it can write to the logs whether the APM Server is available (it also logs the version number), which is useful for debugging purposes. What would be relatively easy to do in the Java agent is to initialize the The caveat of this is
|
@felixbarny As I mentioned above you can mitigate "only retry after 5 min" by polling much more frequently during the first X minutes after agent startup. |
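A tiny sketch of that mitigation, under assumed numbers: only the 300s figure comes from this thread; the warm-up window and warm-up delay are placeholders, and `next_poll_delay` is a made-up name.

```python
def next_poll_delay(seconds_since_start: float,
                    server_delay: float = 300.0,
                    warmup_window: float = 120.0,
                    warmup_delay: float = 10.0) -> float:
    """Poll faster shortly after startup, then follow the server's delay."""
    if seconds_since_start < warmup_window:
        # Never wait longer than the server asked for, even during warm-up.
        return min(warmup_delay, server_delay)
    return server_delay
```

The trade-off is a short burst of extra requests per agent start in exchange for picking up config quickly when the server was briefly unavailable.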
Then where applies
Right, it shouldn't. The current implementation in the Java agent doesn't fit well anymore with polling. If the purpose of healthcheck is to know whether an APM server is available, it makes no sense to fail and start polling some other endpoint. What we can do is remove the healthcheck and rely only on the remote config for that, or switch to remote config polling only after successful healthcheck |
Why? I assume by "other endpoint" you refer to more than one URL in the "server-URLs" configuration, right? If so, why doesn't it make sense for the agent to switch to the next URL and try to communicate with it instead (get configuration, send data, etc.)? |
@SergeyKleyman no, I meant if you fail to get a proper response for the healthcheck ( |
I'll try to rephrase: |
I feel like introducing state management overcomplicates things and optimizes for an edge case. In the end, we still want to poll for remote config even if the APM Server is currently unavailable. Whether to try to get the remote config right after a failed healthcheck or to wait for a while does not seem like a significant enough improvement to warrant the complexity. Again, the healthcheck in the Java agent is only executed once on startup for the purposes of having a log line which contains the APM Server version and the status of the connection which helps with debugging/support. It's a completely different concern and should therefore not influence remote configuration polling. |
I definitely don't think state management overcomplicates it, and I'm not sure why you refer to it as an edge case. Connection state seems to me the most reasonable way to maintain communication with one of an unknown number of servers, where being required to switch shouldn't be considered an edge case. The healthcheck is not important here; anything that you can poll on would fit. The problem with the new remote config resource is that you cannot rely on it because older servers don't have it. Otherwise, it would be enough for this purpose as well. |
Another case that just came up on my side while testing: how do we handle deleted configs? Scenario:
What do we do now: Revert to local configuration value? Revert to default? Do nothing? Everything but "do nothing" would require somewhat major changes in my implementation, as we'd need to keep track of which options have been changed via remote config, and what their local/default value was.
/edit: we discussed this in the agents meeting of July 10. Felix explained that his configuration system has multiple configuration sources with a precedence hierarchy, so removing the remote config source (which has the highest precedence) automatically reverts to the "local" source with the highest precedence. We agreed that this is a good guideline implementation for other agents to follow, but that it is not feasible to implement this for 7.4 for most agents. As such, agents should document known limitations of their own implementations in the docs, and update those docs once the limitations have been fixed. |
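For agents that want to follow that guideline, a minimal sketch of the precedence idea could look like the following. This is not the Java agent's code; the class name, the three source names and their order are assumptions for illustration.

```python
from typing import Dict, Optional


class LayeredConfig:
    """Config lookup across sources, highest precedence first (hypothetical)."""

    ORDER = ("remote", "local", "default")

    def __init__(self, default: Dict[str, str],
                 local: Optional[Dict[str, str]] = None) -> None:
        self._sources = {
            "remote": {},
            "local": dict(local or {}),
            "default": dict(default),
        }

    def replace_remote(self, settings: Dict[str, str]) -> None:
        """Swap in the latest remote settings wholesale; {} means none exist."""
        self._sources["remote"] = dict(settings)

    def get(self, key: str) -> str:
        # The first source that defines the key wins.
        for name in self.ORDER:
            if key in self._sources[name]:
                return self._sources[name][key]
        raise KeyError(key)
```

Because the remote layer is replaced as a whole on every positive response, a deleted remote config needs no bookkeeping: the next lookup simply falls through to the local or default value.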
Regarding the treatment of default 300s max-age for errors: I'm beginning to think that the server should respond with the same max-age for 404s as for 200s. At the moment the agent will go to sleep for 5 minutes before checking again if there's new config, but once it's got config it'll start querying every 30 seconds. Is there a reason not to make them the same? |
@axw Maybe we should use the Null Object pattern: instead of using a special status code (404), the server could return an empty configuration? |
@axw the reason is we agreed on returning 5mins for all error cases. We could change to 30s again for non-server side errors. (But we should make this decision soon). |
I'm +1 for changing the interval to 30s in case of 404s. That makes it much quicker to apply the config when creating a new config for a particular service. The downside is that agents will poll quite frequently even if the user never even intends to use central config. But I don't think this will cause much overhead or problems. |
In such cases I'd expect ACM to be turned off in the server, or the cache expiration time to be increased. |
@simit Yeah and I agree with that, however after starting implementation it occurred to me that this isn't really an error case at all. Agents always have to ask for config, and the fact that there isn't any is an expected/normal outcome.
I think we should change to 30s for 404 specifically, but not for all client errors (not 403.)
@SergeyKleyman I'd be fine with this too. I'm not sure there's much if any value in the 404 status code, since agents essentially need to treat it the same as a 200. |
So what exactly is the consensus for the mapping of status codes to polling interval? 404 is 30s and everything else is 5m? |
It's always the |
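For reference, a minimal sketch of how an agent might derive its next polling delay from the Cache-Control: max-age header mentioned in the issue description (Update 01/07), rather than from the status code. The function name and the fallback used when the header is missing or unparsable are assumptions.

```python
from typing import Mapping


def poll_delay_seconds(headers: Mapping[str, str], fallback: float = 300.0) -> float:
    """Return max-age from Cache-Control, or fallback if it is absent or invalid."""
    value = ""
    for name, raw in headers.items():
        if name.lower() == "cache-control":  # header names are case-insensitive
            value = raw
            break
    for directive in value.split(","):
        key, _, arg = directive.strip().partition("=")
        if key.strip().lower() == "max-age":
            arg = arg.strip().strip('"')
            if arg.isascii() and arg.isdigit():
                return float(arg)
    return fallback  # assumption: header missing or malformed
```

Driving the interval from the header keeps the 30s versus 5m decision on the server, so it can change without agent releases.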
@elastic/apm-agent-devs also, in case you've not seen the apm-server PR linked above, the server will now respond with status 200 OK. IIANM, the response body in this case is currently empty. I think it should be |
I also kept updating the initial description of this Issue, last edit see section @felixbarny this is not entirely true, as there could also be a |
@bmorelli25 mentioned that we refer to this feature differently in the UI, docs, and configs:
Since |
@graphaelli sounds good to me. |
Overview
Following #4, agents need to be able to poll apm-server for configuration changes received upstream from Kibana, apply them, and log the result with status (success|failure), failure cause (if any), timestamp, setting name and value.
We will start providing support for TRANSACTION_SAMPLE_RATE.
Requirements
At minimum, agents need to agree on:
APM Server API
Server will expose a /config/v1/agents endpoint, for agents to GET with a service.name URL query parameter (required), and service.environment (optional).
Agents might send a request with an If-None-Match header, to which Server will respond with a 304 - not modified response; or with 200, a response body with the configuration, and an Etag header.
Example
curl -v -H "If-None-Match:1" "http://localhost:8200/config/v1/agents?service.environment=prod&service.name=opbeans"
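The same conditional request can be sketched from the agent side. This is a hedged illustration, not any agent's real implementation: `build_config_request` and `apply_response` are made-up names, the response body is assumed to be a flat JSON object of settings, and an empty 200 body is treated as "no settings".

```python
import json
from typing import Optional, Tuple
from urllib.parse import urlencode
from urllib.request import Request

State = Tuple[Optional[str], dict]  # (last Etag seen, last settings applied)


def build_config_request(server_url: str, service_name: str,
                         environment: Optional[str] = None,
                         etag: Optional[str] = None) -> Request:
    """Build the GET request; If-None-Match is sent only once an Etag is known."""
    params = {"service.name": service_name}
    if environment is not None:
        params["service.environment"] = environment
    url = f"{server_url.rstrip('/')}/config/v1/agents?{urlencode(params)}"
    request = Request(url)
    if etag is not None:
        request.add_header("If-None-Match", etag)
    return request


def apply_response(status: int, etag_header: Optional[str],
                   body: bytes, current: State) -> State:
    """Return the (etag, settings) pair to keep after one poll."""
    if status == 304:
        return current  # not modified: keep what we have
    if status == 200:
        settings = json.loads(body) if body.strip() else {}
        return etag_header, settings
    return current  # any other status: keep the last known state
```

Sending the request (for example with `urllib.request.urlopen`) and scheduling the next poll are left out on purpose, since those parts depend on each agent's transport and threading model.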
Update 23/04
Configuration settings are taken from the environment variable names, without the ELASTIC_APM_ prefix, and in lower case.
Update 06/05
As per comment #76 (comment), the Server will accept query parameters both in the URL and in the body of a POST request.
If different values for the same attribute are provided as POST and GET, the request will be rejected with a 400; different attributes will be merged.
Other notes that slipped in the initial description:
As pointed out in Agent remote configuration #76 (comment), agents should also align the error handling behaviour. For instance, if a config update can't be applied, should the agent fall back to the last good value, to the agent's default value, or to the value that the process started with?
Regarding service.environment, if none is passed in the query, only config updates without service environment will match. Likewise, if one is passed, only config updates with that value will match (and not config updates without value). In other words, a missing service environment is treated like any other (with a value of "" if you want to see it that way).
Update 29/05
Update 01/07
Cache-Control: max-age header in every successful response. For failing agent requests the header will be set with a max-age: 300 (5 mins) since querying again after 30s doesn't make sense. The decision was made to set it to 5 mins instead of not setting it or setting it to 0, so agents don't need to put in their own logic and can differentiate between the server not supporting remote config and failures. More details on this in [ACM] Optimization / caching apm-server#2220.
Updates for 7.3
403 if ACM not enabled in APM Server (apm-server.kibana.enabled: false) (Fix config handler apm-server#2386)
Etag header in double quotes and expect the same for the If-None-Match header (Enclose etag and IfNoneMatch header in double quotes. apm-server#2407)
json if supported and return minimal error response body if no security token is used for communicating with the APM Server (Ensure Kibana client reconnects. apm-server#2421)
200 with empty body instead of 404 if configuration has not been found ([ACM] Return 200 instead of 404 for missing config. apm-server#2458)
Etag from _source.settings and return flattened key:value pairs in case nested config options are stored ([ACM] Create Etag from settings struct. apm-server#2441).
Status
@elastic/apm-agent-devs please link your implementation issues
Let me know if you have any questions.