Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

X-Opaque-ID contains UUID causing ES deduplication to fail #120124

Closed
pgomulka opened this issue Dec 1, 2021 · 54 comments
Closed

X-Opaque-ID contains UUID causing ES deduplication to fail #120124

pgomulka opened this issue Dec 1, 2021 · 54 comments
Assignees
Labels
blocker bug Fixes for quality problems that affect the customer experience Feature:elasticsearch impact:high Addressing this issue will have a high level of impact on the quality/strength of our product. loe:small Small Level of Effort Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc v7.17.0 v8.0.0

Comments

@pgomulka
Copy link

pgomulka commented Dec 1, 2021

ES is originally using X-Opaque-ID to trace origin of the request, to help identify slow search request or the use of deprecated functionality.
There was an expectation that this value would contain something in a form of user-id. No specification was given on how this id will look like. We documented this in both slow log and deprecation use cases.

The value of the X-Opaque-ID is used when de-duplicating deprecated log messages (together with a key of the deprecated feature). That means if the X-Opaque-ID is always unique, then there will be no de-duplication of those messages.
This will cause deprecation log to grow and since we enabled by default deprecation log indexing, it might use significant resources of the cluster.

The problem:
Kibana is using the UUID in a form of 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' for the X-Opaque-ID to help diagnose conceptual origin of the HTTP request. From what I understand, it can also pass the value provided by a user on a request to Kibana?
But because the value is mostly unique, it generates a lot of duplicated deprecation logs.

I wonder if we could use trace-id for UUID and use something like a UserID for the X-Opaque-Id

@pgomulka pgomulka added bug Fixes for quality problems that affect the customer experience v8.0.0 v7.10.0 labels Dec 1, 2021
@botelastic botelastic bot added the needs-team Issues missing a team label label Dec 1, 2021
@lukeelmers lukeelmers added Feature:elasticsearch Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc labels Dec 1, 2021
@elasticmachine
Copy link
Contributor

Pinging @elastic/kibana-core (Team:Core)

@botelastic botelastic bot removed the needs-team Issues missing a team label label Dec 1, 2021
@pgomulka
Copy link
Author

pgomulka commented Dec 1, 2021

we also discussed this on a core-infra sync today.
We realised that maybe two separate use cases are mixed here. The tracing of the request across the stack and the identification of the origin of the request.
For the tracing we would prefer traceparent/traceid be used. We need to expand the documentation on this. Here is the PR that added the support for this header https://github.com/elastic/elasticsearch/pull/74210/files
The traceparent can have a generated UUIDs (following the defined syntax)

For origin of the slow request/depreacated feature request we would prefer if X-Opaque-ID contain an id from a finite set so that we could deduplicate deprecatation logs. As we documented the intended use case is to set a user-id (no format enforced) so that an app/user could be found by a user.
For Kibana, maybe an X-Opaque-ID could be something like a call-site. A class name, method? and for a traceparent a UUID could be used?

@droberts195
Copy link
Contributor

For Kibana, maybe an X-Opaque-ID could be something like a call-site. A class name, method?

Call-site would be useful, but in some cases it would be nice to also know the object that made the call.

Recently we have had a number of support cases where something in Kibana has been submitting a search to Elasticsearch that uses enormous resources, for example SIEM detection rules. In this case it would be nice to not just find out that the call site is "detection rules", but to find out the name of the particular rule that was involved.

I think this change would be in keeping with the original intent of this issue. For both deprecation log de-duplication and identification of the source of heavy searches it's useless to have a UUID for each search done by a particular SIEM detection rule. And in both cases it's useful to know not just that the source of the search was a SIEM detection rule, but which one.

Maybe simply call-site could be used most of the time, with call-site + object name to be used in a select few cases where we anticipate needing to identify the object.

@lukeelmers
Copy link
Member

For origin of the slow request/depreacated feature request we would prefer if X-Opaque-ID contain an id from a finite set so that we could deduplicate deprecate log

I completely understand the need to deduplicate the deprecation logs and the challenges that unique opaque IDs cause, but TBH I'm a little confused on this -- don't we call the ID "opaque" exactly because Elasticsearch shouldn't be making any assumptions about its contents?

What makes this challenging for Kibana is that we are doing more with x-opaque-id than just adding request uuid. We now have an execution context service which will be enabled by default starting in 8.1. The service allows plugins to create a context object that is propagated to ES via the x-opaque-id header.

This means that x-opaque-id can now look something like this (quoting an example from #119932):

Then, in Elasticsearch slow logs you might find a hint that a request was initiated by my-plugin-ui-feature feature of my-app application and proxied through my-plugin-server-feature features of my-app-server app:

x-opaque-id: ${requestId};kibana:my-app:my-plugin-ui-feature:42;my-app-server:my-plugin-server-feature:42

All this is to say, if we wanted to change the value Kibana passes via x-opaque-id, it would also affect our implementation of the execution context service. But I'll defer to @mshustov for his thoughts on that.

@rudolf
Copy link
Contributor

rudolf commented Dec 2, 2021

I think it's worth discussing what we want to do long term. But we need to find a solution for 7.16/7.17

I think we have the following options:

  1. Remove x-opaque-id from Kibana. Either without any replacement tracing header or by switching to the traceparent header. The Kibana Core can explore the effort and any potential problems that might come out of this solution.
  2. When receiving the x-elastic-product-origin: 'kibana' header, Elasticsearch ignores the x-opaque-id header when deduplicating logs. Would this be a viable alternative?

@mshustov
Copy link
Contributor

mshustov commented Dec 2, 2021

There was an expectation that this value would contain something in a form of user-id. No specification was given on how this id will look like. We documented this in both slow log and deprecation use cases.

Since this format's been documented for a while, I agree that it's a bug. In #71019 Kibana didn't require x-opaque-id to be a user unique identifier and as it's noticed Kibana falls back to a random uuid value if x-opaque-id is not provided.

We now have an execution context service which will be enabled by default starting in 8.1. The service allows plugins to create a context object that is propagated to ES via the x-opaque-id header.
This means that x-opaque-id can now look something like this

In the long term, we are going to migrate from (mis-)used x-opaque-id header to semantically correct baggage header #101708. The draft status of the baggage header specification is the main blocker for this effort. If it's a critical problem we can consider using another header for execution context propagation.

Remove x-opaque-id from Kibana.

We can't just remove x-opaque-id support as it will be a breaking change. Kibana already documents server.requestId settings https://www.elastic.co/guide/en/kibana/7.16/settings.html However, there are no requirements for x-opaque-id content.

What I'd propose is:

  1. document requirements to x-opaque-id content
  2. stop generating uuid in Kibana if x-opaque-id is not provided
  3. decide what information Kibana should provide via x-opaque-id header.
    @droberts195 suggested Call-site would be useful, but in some cases it would be nice to also know the object that made the call. . We can provide a static kibana value if execution_context is disabled, or a more specific kibana enitity value if execution_context is enabled. execution_context can hint whether a request is originated from a Kibana visualization or Kibana background task/alert. In either way, we get rid of uuid value in x-opaque-id header which should stabilize the deprecation logging. WDYT? @droberts195 @pgomulka

As a side note: we need to document x-opaque-id content format requirement for other Elastic products. For example, other elasticsearch-js client doesn't document this requirement either https://www.elastic.co/guide/en/elasticsearch/client/javascript-api/current/observability.html#_x_opaque_id_support

@pgomulka
Copy link
Author

pgomulka commented Dec 3, 2021

document requirements to x-opaque-id content
stop generating uuid in Kibana if x-opaque-id is not provided
decide what information Kibana should provide via x-opaque-id header.
@droberts195 suggested Call-site would be useful, but in some cases it would be nice to also know the object that made the call. . We can provide a static kibana value if execution_context is disabled, or a more specific kibana enitity value if execution_context is enabled. execution_context can hint whether a request is originated from a Kibana visualization or Kibana background task/alert. In either way, we get rid of uuid value in x-opaque-id header which should stabilize the deprecation logging. WDYT?

this sounds great to me, because this is exactly what ES would expect on that header at the moment.
I will work on improving the documentation for x-opaque-id and traceparent/trace.id fields

@droberts195
Copy link
Contributor

We can provide a static kibana value if execution_context is disabled, or a more specific kibana enitity value if execution_context is enabled.

I guess this relates to #101587 right? (Sorry, I don't know much about prior work in this area.)

In that case I think what I was asking for is covered, but an important point is that all the sub-teams whose Kibana apps might submit heavy searches to Elasticsearch actually take advantage of the execution_context functionality. It probably needs a publicity drive when it's ready for use to ensure that all the people who work on Kibana apps for security, APM, alerts, ML, etc, are all aware that heavy searches should be submitted with an execution_context that clearly identifies where they came from.

Anyway, I've subscribed to that other issue and I don't think it needs covering any more on this issue.

@exalate-issue-sync exalate-issue-sync bot added impact:needs-assessment Product and/or Engineering needs to evaluate the impact of the change. loe:small Small Level of Effort labels Dec 7, 2021
@lukeelmers lukeelmers added impact:high Addressing this issue will have a high level of impact on the quality/strength of our product. loe:medium Medium Level of Effort and removed impact:needs-assessment Product and/or Engineering needs to evaluate the impact of the change. loe:small Small Level of Effort labels Dec 7, 2021
@exalate-issue-sync exalate-issue-sync bot added loe:small Small Level of Effort and removed loe:medium Medium Level of Effort labels Dec 14, 2021
@pgayvallet
Copy link
Contributor

Some technical pointers for the implementation

The requestId is set to the executionContext in the HTTP server, here:

const requestId = getRequestId(request, config.requestId);
const parentContext = executionContext?.getParentContextFrom(request.headers);
if (parentContext) executionContext?.set(parentContext);
executionContext?.setRequestId(requestId);

We'll probably have to modify getRequestId to return undefined instead of a random uuid when the opaqueId can't be retrieved from the request (either not present or because of the configuration)

This ID is also accessible from the KibanaRequest instance, via the id property

/**
* A identifier to identify this request.
*
* @remarks
* Depending on the user's configuration, this value may be sourced from the
* incoming request's `X-Opaque-Id` header which is not guaranteed to be unique
* per request.
*/
public readonly id: string;

This is then used to set the header of the request against ES in two places:

  • In the cluster client

if (isRealRequest(request)) {
const requestHeaders = ensureRawRequest(request).headers;
const requestIdHeaders = isKibanaRequest(request) ? { 'x-opaque-id': request.id } : {};

This is used to set the headers parameter when instantiating a child Client instance, in ClusterClient#asScoped.

  • In our custom transport

const opaqueId = getExecutionContext();
if (opaqueId && !opts.opaqueId) {
// rewrites headers['x-opaque-id'] if it presents
opts.opaqueId = opaqueId;
}

which is retrieving the value from the ExecutionContextService:

() => this.executionContextClient?.getAsHeader()

private getAsHeader(): string | undefined {
if (!this.enabled) return;
// requestId may not be present in the case of FakeRequest
const requestId = this.requestIdStore.getStore()?.requestId ?? 'unknownId';
const executionContext = this.contextStore.getStore()?.toString();
const executionContextStr = executionContext ? `;kibana:${executionContext}` : '';
return `${requestId}${executionContextStr}`;
}

As the usage from the ClusterClient is directly accessing the request, we'll likely need a way to differentiate a request having an ID retrieved from the x-opaque-id header coming from the client from a request that have a generated ID.

From a quick grep, I couldn't really find any usage of KibanaRequest.id apart from the ones used in the ES service's code.

We will probably need to either make this KibanaRequest.id value optional, and only populate it when coming from a proper x-opaque-id header, or add a boolean flag to identify if the value was autogenerated or not.

It also makes me wonder why exactly we decided to set a default value to id, given we also have KibanaRequest.uuid which is supposed to be the real technical identifier of the object.

I would personally make KibanaRequest.id optional AND rename it to opaqueId, to reflect more precisely what exactly this value is about.

Then, we'll also need to modify ExecutionContextService.getAsHeader to better handle the case where the requestId / opaqueId is undefined (it actually use a unknownId constant in that case, and we can probably do slightly better now that undefined will be a 'normal' value)

@lukeelmers @mshustov you probably have a better understanding of the whole problematic than I do. WDYT?

@mshustov
Copy link
Contributor

mshustov commented Jan 13, 2022

we'll likely need a way to differentiate a request having an ID retrieved from the x-opaque-id header coming from the client from a request that have a generated ID.

Why is it necessary if we aren't going to generate request id anymore? This value from ClusterClient will be overwritten by getExecutionContext, I'm not even sure we need this logic in getScopedHeaders.

It also makes me wonder why exactly we decided to set a default value to id, given we also have KibanaRequest.uuid which is supposed to be the real technical identifier of the object.

I suppose it's because an end-user can set any x-opaque-id value, so we can't rely on the format of the value.

I would personally make KibanaRequest.id optional AND rename it to opaqueId, to reflect more precisely what exactly this value is about.

We can do it since elasticsearch-js uses this naming as well. And there are no other consumers in the Kibana codebase anyway.

Then, we'll also need to modify ExecutionContextService.getAsHeader to better handle the case where the requestId / opaqueId is undefined (it actually use a unknownId constant in that case, and we can probably do slightly better now that undefined will be a 'normal' value)

Not sure we can drop it completely, but we can use something more compact, like -. Note, that in the long term we will move execution_context from x-opaque-id to baggage header with proper semantic value. We are blocked by elastic/apm-agent-nodejs#2102

@pgayvallet
Copy link
Contributor

This value from ClusterClient will be overwritten by getExecutionContext, I'm not even sure we need this logic in getScopedHeaders.

if execution_context.enabled is false, getAsHeader returns undefined,

private getAsHeader(): string | undefined {
if (!this.enabled) return;

which makes the ES client use the header value specified by getScopedHeaders.

If we modify the execution context service to return a static value when it's disabled, then we wouldn't need to update the logic in getScopedHeaders. but in that case, if we accept that the x-opaque-id header that will be sent to ES is always managed by the EC service, shouldn't we remove the x-opaque-id handling from ClusterClient#getScopedHeaders altogether?

@lukeelmers
Copy link
Member

I don't think having these logs that are anyway hidden by upgrade assistant adds much value, so I think we could consider not creating deprecation logs for elastic stack components at all. I'm not sure what implications made us decide against this approach initially.

I tend to agree, it feels like they aren't all that useful, but there could be some context I am missing. @jakelandis & team, what do y'all think?

@jakelandis

This comment has been minimized.

@jakelandis
Copy link
Contributor

jakelandis commented Jan 19, 2022

What do you have in mind with "hiding" them? I don't think having these logs that are anyway hidden by upgrade assistant adds much value, so I think we could consider not creating deprecation logs for elastic stack components at all. I'm not sure what implications made us decide against this approach initially.

Apologies... I was mis-understanding how x-elastic-product-origin was implemented. I see now that Kibana does indeed "hide" these, but only after they have been written to the index and hidden only via the UI. My intent is not re-open those decisions and hiding via the UI (mostly) satisfies my concerns around the 60 depredations expected to be shipped (and will voice those lesser concerns via email edit submitted elastic/elasticsearch#82833 to handle unbounded growth of the log).

...back on topic ...

Change the Elasticsearch deduplication logic to ignore x-opaque-id when performing deduplication only when the x-elastic-product-origin: kibana header is present.

If x-elastic-product-origin from Kibana is always hidden from the users view, then why only ignore x-opaque-id for duplication. Wouldn't a simpler option be skip logging the deprecations entirely ? Wouldn't that have the same effect per the user's perspective ? (again, not trying to reopen old decisions, just trying to understand).

@pgomulka
Copy link
Author

If x-elastic-product-origin from Kibana is always hidden from the users view, then why only ignore x-opaque-id for duplication. Wouldn't a simpler option be skip logging the deprecations entirely ? Wouldn't that have the same effect per the user's perspective ? (again, not trying to reopen old decisions, just trying to understand).

I don't think we considered skipping logging the deprecations originating from elastic entirely last time (when the issue was originally created). I personally would be worried about this idea. How else would we track what deprecated features are still used within the stack (kibana for instance?) We could obviously push that responsibility on the clients to store the warning headers with a deprecations. It sound to me, like this would mean clients (like kibana) would have to effectively implement and maintain their own deprecation log.

back on the issue and the discussion summary.
I understand that fixing 60deprecations in short timeframe is not possible
and option 1 is also less favourable because it requires more changes.
I am fine with option 2 (elasticsearch not using x-opaque-id for throttling deprecations from kibana only) if the 7.17 timing is the most important.

However, I don't think we considered pushing back with releasing 7.17 and 8.0. Would that even be possible to get more time and implement option 1. If I understand it right, the only downside of this option is the short deadline for 7.17. I don't think traceparent could affect cloud proxy. We already support traceparent in elasticsearch.
The only missing bits for option 1 are:

  • in ES is traceparent in audit logs which we estimate as low effort
  • the documentation changes in kibana (regarding correlations of ES-kibana audit logs).
  • and kibana not using generated UUID in x-opaque-id (already in progress for 7.17)

@igore-kupczynski Just to confirm. Do you think there could be any problem with cloud proxy if kibana start sending traceparent to elasticsearch?

@pgayvallet
Copy link
Contributor

pgayvallet commented Jan 20, 2022

However, I don't think we considered pushing back with releasing 7.17 and 8.0. Would that even be possible to get more time and implement option 1. If I understand it right, the only downside of this option is the short deadline for 7.17.

Unfortunately it's not. There are some significant technical implications for option 1 that we discovered yesterday while discussing with @mshustov and @jportner

During the investigations of #123197, we discovered a few things that would be problematic if we'd switch to using the traceId value from the traceparent header for audit logging correlation:

At the moment, Kibana is already sending a traceparent header to elasticsearch, then, as your said, this traceparent header is used by ES to retrieve the traceId and other tracing values from Kibana since elastic/elasticsearch#74210

However, this header is sent by APM, meaning that this is done only if APM is enabled on Kibana. When APM is disabled, there's no traceparent header circulating.

Initially the traceparent support was supposed to be ONLY for log correlation. As Audit Logging was using the x-opaque-id header that was (is atm) always populated with a unique value, having traceparent be optional depending on the configuration was not an issue. If we switch AL's implementation to use traceId instead of xOpaqueId, this header being optional becomes a problem.

We can't really just change our ClusterClient implementation to send a traceparent header, as this would (potentially?) conflict with the one APM is sending. We can't also just let APM handle this, as when APM is disabled, this header would not be sent, therefor breaking the whole audit logging correlation. Our only options would be to either manually generate / send this header when APM is disabled, or (if that's possible) disable APM from handling this header, and always generating it manually to make sure that it is send regardless of Kibana's configuration. Both are non-trivial.

Note that we also got a divergence between 8.1 and older versions, as #118466, which is doing some work in this area, was not backported to either 8.0 or 7.17 (as it was not covering a bug, but a feature), which would also increase the complexity of doing works in the 7.17 branch (even it's that not blocking, just pointing this out).

and kibana not using generated UUID in x-opaque-id (already in progress for 7.17)

This PR is on hold until we agree on a technical solution as it currently breaks audit logging.

@rudolf rudolf added the blocker label Jan 20, 2022
@rudolf
Copy link
Contributor

rudolf commented Jan 20, 2022

Before we dismiss dropping Kibana deprecation logs completely as an option. Kibana is already relying only on deprecation headers for development/testing. When instrumenting CI to detect deprecations we needed a way to trace deprecations back to the calling site and this was impossible using the logs. Looking for the returned header meant we could find out the exact query and stack trace where the query comes from and log all the context. The only shortcoming is that the deprecation headers don't include a level such as WARNING/CRITICAL.

@pgayvallet
Copy link
Contributor

pgayvallet commented Jan 20, 2022

@jportner given that the default configuration for the APM agent on Kibana is enabled: true, contextPropagationOnly: true, therefor correctly sending the traceparent header, do you think it would be acceptable to document that audit logging correlation (which would now be based on trace-id) would require Kibana to have APM context propagation enabled to function (meaning that APM's context propagation should not be explicitly disabled by the customer)?

If this is acceptable, it would allow us to always delegate the traceparent handling to APM, therefor removing most of the complexity described in #120124 (comment). In that case, the steps described by @pgomulka in #120124 (comment) would be sufficient, in addition to :

Note that if we go that way, we would kinda be forced to keep this enabled: true, contextPropagationOnly: true configuration for APM as the default.

@rudolf
Copy link
Contributor

rudolf commented Jan 20, 2022

What's the risk of not making APM agent configurable at all? Do users have a good reason to disable it?

@pgayvallet
Copy link
Contributor

Do users have a good reason to disable it?

Performances. APM has some impact on performances, even if the contextPropagationOnly mode is supposed to be negligible, given the data collected in #112973.

So no, I don't see any good reason to disable APM totally and not just enable it in contextPropagationOnly mode (which is the default anyway). But maybe @mshustov has a better idea.

What's the risk of not making APM agent configurable at all

Introducing a breaking change in a minor, and a X.last, version. I feel like we may want to avoid that.

@mshustov
Copy link
Contributor

given that the default configuration for the APM agent on Kibana is enabled: true, contextPropagationOnly: true, therefor correctly sending the traceparent header, do you think it would be acceptable to document that audit logging correlation (which would now be based on trace-id) would require Kibana to have APM context propagation enabled to function (meaning that APM's context propagation should not be explicitly disabled by the customer)?

Worth noting that contextPropagationOnly support was added in #112973 which is not backported to 8.0, 7.17.

Introducing a breaking change in a minor, and a X.last, version. I feel like we may want to avoid that.

AFAIK We've never documented elastic.apm settings of kibana.yml see #99328, so changing the settings format is not a breaking change.

Do users have a good reason to disable it?

Users may want to disable the APM agent to reduce runtime overhead, the overhead of sending additional tracing headers. @trentm do you see any other potential problems?

@jportner
Copy link
Contributor

@jportner given that the default configuration for the APM agent on Kibana is enabled: true, contextPropagationOnly: true, therefor correctly sending the traceparent header, do you think it would be acceptable to document that audit logging correlation (which would now be based on trace-id) would require Kibana to have APM context propagation enabled to function (meaning that APM's context propagation should not be explicitly disabled by the customer)?

If this is acceptable, it would allow us to always delegate the traceparent handling to APM, therefor removing most of the complexity described in #120124 (comment).

Agreed, I think this makes sense, we just need to make sure audit logging docs are updated accordingly.

Note that if we go that way, we would kinda be forced to keep this enabled: true, contextPropagationOnly: true configuration for APM as the default.

Yes.

@pgayvallet
Copy link
Contributor

pgayvallet commented Jan 20, 2022

So, to summarize, if we want to go ahead with option 1:

ES side

  • Add traceparent support on audit logs to include the trace-id field to those logs
    • on 8.1, backported to 8.0 and 7.17

Kibana side

@pgomulka @jakelandis if we go that way, we gonna need the changes on ES to be available for testing to make sure that audit log correlation is still working after our changes before merging/backporting #123197

@mshustov @jportner @rudolf before we confirm this plan as being an option, does it seem correct to you, maybe I missed anything?

@mshustov
Copy link
Contributor

@pgayvallet Kibana's side looks correct. However, this to specify that APM agent must be enabled (on contextPropagationOnly mode or not) for correlation to work looks like a significant change for Kibana. Don't we need Product approval for this? Maybe we should consider the fact Kibana uses the apm agent as an implementation detail and do not allow customers to disable it?

@pgayvallet
Copy link
Contributor

Maybe we should consider the fact Kibana uses the apm agent as an implementation detail and do not allow customers to disable it?

Given the minimal impact of the agent enabled in contextPropagationOnly, I think it would be reasonable. Now, that's additional changes to perform (and to backport) for 7.17. Do we think this is a necessity, or are we fine with just documenting audit logs as requiring the agent to be enabled to function?

As a sidenote, I see that @pgomulka started working on elastic/elasticsearch#82855, so if we're going with option 2 for 7.17, we would have more time to properly address option 1 in 8.1, which is why I would really like to have @pgomulka and @jakelandis 's opinion on the direction we want to take.

@jportner
Copy link
Contributor

jportner commented Jan 20, 2022

before we confirm this plan as being an option, does it seem correct to you, maybe I missed anything?

Seems correct to me, though I'm unclear if audit logs would be missing the trace.id if we did not backport #118466. If that's the case, I think that we need to backport, it's not optional.

Now, that's additional changes to perform (and to backport) for 7.17. Do we think this is a necessity, or are we fine with just documenting audit logs as requiring the agent to be enabled to function?

I think that it's a necessity. Otherwise someone could have audit logging enabled (with correlation) in 7.16, then upgrade to 7.17 and lose correlation without any warning. We can't realistically expect our users to read through all the docs / release notes to catch something like this, and we don't have the upgrade assistant to surface this info in 7.16.


Another question: are we certain that the APM agent supports contextPropagationOnly in 7.17/8.0? Or is that only supported in 8.1+? The PR where the spec was added is not clear.

@pgayvallet
Copy link
Contributor

Otherwise someone could have audit logging enabled (with correlation) in 7.16, then upgrade to 7.17 and lose correlation without any warning

I get that. To be honest my opinion is that given that APM instrumentation isn't even documented anywhere, such risks are null, but I can't argue that this is technically possible.

Now, forcing the agent to be enabled could also be considered a breaking change, because in the same scenario, a user with APM disabled in 7.16 would upgrade to 7.17 to discover that the agent is now enabled. So in that case, I'm unsure we can safely do that in 7.17.

are we certain that the APM agent supports contextPropagationOnly in 7.17/8.0?

Not atm. This is why we need to backport #112973

@jakelandis
Copy link
Contributor

Now that I understand that users will never see these deprecation warnings, my initial reservations are dampened.

The issue really comes down to writing more bytes to an index than are expected and/or necessary. Those bytes will likely never be seen by users and are of minimal size (small messages with much repetition). So the problem we are trying to solve is how to avoid trivial levels of overhead writing these messages and minimal bloat in clusters that remain on 7.last for long periods of time.

I think we should solve the problem, but not sure if it is really a blocker to the release.

My initial reservations with the proposed fix :

x-opaque-id is used in an unexpected way, it requires ES to behave specially for Kibana, and it allows Kibana to call into deprecated APIs with fewer consequences

"it allows Kibana to call into deprecated APIs with fewer consequences" - this ship has already sailed and this de-duplication issue should focus on just the de-duplication aspect (sorry if forked the conversation and didn't pick up what was being said)

"x-opaque-id is used in an unexpected way" - There is an argument to be made that ES should not have ntroduced caching semantics to what is intended to be an opaque-id. In this case (and others) we tend to use the x-opaque-id as more of a client-id but it really isn't.

"it requires ES to behave specially for Kibana" - reservation holds

We considered ignoring the x-opaque-id header for all requests with the x-elastic-product-origin header, e.g. requests coming from all Elastic stack components. However, this could make it harder to trace the origin of deprecation logs from stack components which are correctly using the x-opaque-id header.

For me this option (or the like) is back on the table. If we already hide all stack component deprecation for the users, then making it harder to trace the deprecation from stack components is moot point. I would vote for this option... if x-elastic-product-origin use that for caching else use x-opaque-id.

However, @rjernst and @pgomulka are the code owners and have the final say for this part of ES.

@trentm
Copy link
Member

trentm commented Jan 20, 2022

[@mshustov]

Users may want to disable the APM agent to reduce runtime overhead, the overhead of sending additional tracing headers. @trentm do you see any other potential problems?

No other potential problems that I see. Just performance overhead of the APM agent.

@pgomulka
Copy link
Author

pgomulka commented Jan 24, 2022

We considered ignoring the x-opaque-id header for all requests with the x-elastic-product-origin header, e.g. requests coming from all Elastic stack components. However, this could make it harder to trace the origin of deprecation logs from stack components which are correctly using the x-opaque-id header.

For me this option (or the like) is back on the table. If we already hide all stack component deprecation for the users, then making it harder to trace the deprecation from stack components is moot point. I would vote for this option... if x-elastic-product-origin use that for caching else use x-opaque-id.

I like that. This is a variation of option 1 - where x-opaque-id would not be used for kibana's requests. Given that all requests with elastic-product-origin are hidden in upgrade assistant, I don't mind using elastic-product-origin for all values when throttling deprecation logs.
I will update elastic/elasticsearch#82855 accordingly.

At the same time I will merge #82849 emit trace.id into audit logs and update misleading information in audit log doc about no semantic meaning of x-opaque-id (#82868)

@cjcenizal
Copy link
Contributor

CC @elastic/platform-deployment-management

@pgomulka
Copy link
Author

I have merged [7.17] Do no use x-opaque-id for deduplicating elastic originating requests backport (#82855) #83023 into 7.17 should we consider this issue done?

Or maybe we should keep it open until kibana stop sending UUID on x-opaque-id?

I also merged Emit trace.id into audit logs #82849
into 7.17 and updated the audit log documentation.

@pgayvallet
Copy link
Contributor

should we consider this issue done? Or maybe we should keep it open until kibana stop sending UUID on x-opaque-id?

We do still need to address this properly on our side on 8.1 or 8.2. But there is a lot of noise on this issue. I will open a new one summarizing all this, then close this one.

In the meantime, @pgomulka, do you know if an issue was opened to track adding trace-id support on the log output/format on the elasticsearch side? We'll still need that to properly address the issue.

@pgomulka
Copy link
Author

In the meantime, @pgomulka, do you know if an issue was opened to track adding trace-id support on the log output/format on the elasticsearch side? We'll still need that to properly address the issue.

There was no issue created. It was added a while back to ES 7.15 We only missed adding this to audit logs.

@pgayvallet
Copy link
Contributor

We only missed adding this to audit logs.

Yea, sorry, specifically for the audit logs was what I meant

@pgomulka
Copy link
Author

also I did not create an issue. I considered this more like a follow up elastic/elasticsearch#82849

@pgayvallet
Copy link
Contributor

Oh, that's already done and merged, perfect then.

@pgayvallet
Copy link
Contributor

Closing in favor of #123737

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
blocker bug Fixes for quality problems that affect the customer experience Feature:elasticsearch impact:high Addressing this issue will have a high level of impact on the quality/strength of our product. loe:small Small Level of Effort Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc v7.17.0 v8.0.0
Projects
None yet
Development

No branches or pull requests