Investigate moving sourcemapping to an enrich processor #3606

axw · 2020-04-02T02:55:46Z

I would like us to investigate moving sourcemapping logic out of apm-server, and into an ingest node pipeline. This would enable us to fix #2724, and would likely also speed things up by doing everything in Elasticsearch, where the sourcemaps are stored.

In order to do this, I think we could use the Enrich processor to enrich ingested documents with the sourcemap, followed by a script processor which adjusts/enriches stacktrace fields, and finally removes the sourcemap field.

To use the Enrich processor, we would need to store a field which concatenates the properties we use to match them into one field: service name, service version, and file/URL path.

axw · 2020-12-03T08:59:42Z

There are some ramifications to this change that I had not previously considered, as there are several things that depend on the stacktrace:

error grouping key calculation (see also Investigate switching error grouping key calculation to use something faster apm-data#146), including replacement (or simply removal) of the apm-server.rum.exclude_from_grouping config
identification of RUM library frames
error culprit identification

axw · 2020-12-03T10:19:14Z

Regarding apm-server.rum.exclude_from_grouping and apm-server.rum.library_frames: Painless regex support is enabled by default from 7.10 onwards in a "limited" mode: elastic/elasticsearch#63029. I'm not yet sure if this will be good enough for our needs, but I think so. I tend to think that as long as we allow the pipeline to be user-modifiable (just in the ingest pipeline editor?), then it should be reasonable to implement in ingest node.

Error culprit identification should be straightforward.

axw · 2020-12-08T13:22:05Z

Regarding apm-server.rum.exclude_from_grouping and apm-server.rum.library_frames: Painless regex support is enabled by default from 7.10 onwards in a "limited" mode: elastic/elasticsearch#63029. I'm not yet sure if this will be good enough for our needs, but I think so. I tend to think that as long as we allow the pipeline to be user-modifiable (just in the ingest pipeline editor?), then it should be reasonable to implement in ingest node.

In Painless, regex support is limited to constant patterns. At the top of https://www.elastic.co/guide/en/elasticsearch/painless/7.10/painless-regexes.html:

Regular expression constants are directly supported. To ensure fast performance, this is the only mechanism for creating patterns. Regular expressions are always constants and compiled efficiently a single time.

So, the only way we could match filenames against a configurable regular expression would be by recreating the pipeline each time the config changes. Alternatively, we could switch the config over to using wildcard patterns like we use in the agents, such as in sanitize_field_names. Then we could inject the config into events, pick that up in the pipeline, and apply them with wildcard-matching logic written in Painless.

I'm leaning towards the latter at the moment. It'll mean a more complicated pipeline script, but it will provide more consistency in configuration across APM.

axw · 2020-12-09T05:50:15Z

Apart from error grouping key calculation (elastic/apm-data#146), I've got everything (I think?) working in master...axw:sourcemap-enrich. It's a little bit messy, but should demonstrate how things would work.

There's a substantial amount of Painless. This includes: sourcemapping, identifying library frames (using wildcard matching, see #3606 (comment)), and identifying the error culprit.

axw · 2021-03-10T07:33:07Z

I've created a new branch rebased on master, moving all the ingest node stuff into the pipeline we install: master...axw:sourcemap-enrich-take2

In that branch the pipeline is defined in ingest/pipeline/definition.yml, and definition.json is generated from that. The reason for using YAML is to make it enable comments and multiline strings, both of which are important for these painless scripts.

The pipeline uses the fingerprint processor to compute error.grouping_key. The good news is that it works, but the bad news is that it can't work in the same way as the existing grouping key calculation without changes to the fingerprint processor:

there's always a delimiter byte (0) written to the digest, which of course changes the hash
the hash is base64-encoded, whereas we have been hex-encoding

The other issue that we'll have is that since the fingerprint processor was only added in 7.12, we can't just introduce it to the pipeline as that would break compatibility with older versions of Elasticsearch. So if we are doing this, I think we can only add it in the integration package.

Seeing as the hashes will change, we may as well:

move to Murmur3 (requires extending fingerprint though): Investigate switching error grouping key calculation to use something faster apm-data#146
also change library_frames and exclude_from_grouping to wildcard matches as breaking changes in the integraton package

axw · 2021-03-16T06:42:05Z

I'm running some performance tests. Sending 1000x RUM errors and comparing methods (server vs. ingest).

In the server approach I have made some minor improvements to stop recording an error when a matching sourcemap can't be found (RUM: sourcemap.error on and sourcemap.updated are set overzealously #4958), meaning the performance will be slightly better than what is currently released.
In the ingest approach I have disabled source mapping in the server, and rely on ingest entirely.
In either case I've effectively disabled rate limiting, and increased the bulk max size to 2000.
Whichever way I run it, I can reliably index 2000+ errors/s; there's no externally discernible difference at this load. Increasing the load by 10x also led to similar relative ingestion rates.

I ran the test three times each and checked node stats each time, looking at the time spent in the "apm" ingest pipeline. Over each three runs, the pipeline averages ~0.1ms per event for in-server source mapping, and ~1ms per event for ingest source mapping. I also instrumented the time spent in the server in applying sourcemaps, and it works out to ~0.1ms per error each with 6 stack frames = 60000 frames.

So the ingest approach is considerably slower with worst case 10x slowdown for ingestion.

Now we need to answer:

Is it worth continuing with the ingest approach?
What are our alternatives?

For (1): I'm tending towards a no. The performance loss may not be apparent with a single APM Server given its current ingestion rate (in)capability, but with a cluster APM Servers handling heavy RUM traffic we could end up bottlenecked on ingest node. On top of all that, moving the process to ingest node carries some risk, and requires breaking changes.

For (2): knowing what we know now about Fleet hooks, we could do something like as follows

Introduce config into APM Server for specifying source maps, along the lines of:

apm-server:
  rum:
    source_maps:
      - service.name: opbeans-rum
        service.version: 1.2.3
        bundle.filepath: /test/e2e/general-usecase/bundle.js.map
        sourcemap.url: http://somewhere.com/bundle.js.map

2. Add a sourcemap upload endpoint to Kibana which store source maps in an index similar to today, but with no Enrichment index. On upload, and on Fleet policy creation/update, Kibana will inject apm-server.rum.source_maps entries with the URL pointing back at Kibana.

~~3. Add a sourcemap download endpoint to Kibana which fetches source maps from the index.~~

~~I'm essentially suggesting something like Endpoint's artifacts: https://github.com/elastic/kibana/tree/master/x-pack/plugins/security_solution/server/endpoint/routes/artifacts.~~

Assuming we want to make the sourcemap download API require authorisation, we could do this by poking the Fleet API access key into APM Server, similar to Endpoint: https://github.com/elastic/beats/blob/b03a935282bec78b3160e74e66dfe29e2df8ee69/x-pack/elastic-agent/spec/endpoint.yml#L51-L53

Add a sourcemap upload endpoint to Kibana which stores source maps using the Fleet Artifacts API. Source maps will be stored in .fleet-artifacts. On upload and on Fleet policy creation/update, Kibana will inject a reference to the artifact into APM policies.

simitt · 2021-03-16T07:19:56Z

This performance difference comes a bit unexpected; great that you measured it. The alternative with injecting a reference to the artifact sounds like a good approach.

axw · 2021-03-16T07:34:34Z

One thing to note: if we don't go with ingest node, then we don't get to fix #2724. We shouldn't kill performance in pursuit of that goal though.

axw · 2021-03-18T04:32:00Z

https://github.com/axw/kibana/tree/apm-sourcemap-routes is a hackish POC which adds source map upload, list, and delete routes to Kibana, storing sourcemaps as fleet artifacts. On upload/delete, references to the artifacts are injected into APM policies.

axw · 2021-03-22T06:10:36Z

@vigneshshanmugam is going to put together some thoughts on how we can improve our source maps experience overall, so I'll wait until that's available before we make a final call and open implementation issues. At this stage it looks likely that we will move ahead with the artifacts approach in favour of ingest node.

axw · 2021-03-25T06:39:31Z

I'm going to open implementation issues now covering the creation of a new Kibana endpoint with minimal differences compared to the existing APM Server source map upload endpoint, to simplify migration. We can always introduce another, simpler, endpoint later on.

axw · 2021-03-25T10:24:46Z

Closing this in favour of #5002 and elastic/kibana#95393

@vigneshshanmugam when you have time to write up your proposal, please share we me and we can create new issues

graphaelli added the [zube]: Backlog label Apr 15, 2020

axw mentioned this issue Aug 31, 2020

Source mapping should not log warnings when service.version is unspecified #4119

Closed

axw added [zube]: Ready and removed [zube]: Backlog labels Oct 7, 2020

axw added this to the 7.11 milestone Oct 7, 2020

simitt self-assigned this Oct 21, 2020

zube bot added [zube]: In Progress and removed [zube]: Ready labels Nov 10, 2020

axw mentioned this issue Nov 16, 2020

Investigate switching error grouping key calculation to use something faster elastic/apm-data#146

Open

axw mentioned this issue Dec 1, 2020

Upload sourcemaps synchronously #2862

Closed

axw unassigned simitt Dec 2, 2020

axw added [zube]: Ready and removed [zube]: In Progress labels Dec 2, 2020

axw mentioned this issue Dec 3, 2020

systemtest: migrate one sourcemap test to Go #4488

Merged

2 tasks

axw added [zube]: In Progress and removed [zube]: Ready labels Dec 8, 2020

axw self-assigned this Dec 8, 2020

jalvz mentioned this issue Dec 8, 2020

Integrate with Elastic Agent #4004

Closed

15 tasks

axw added [zube]: In Review and removed [zube]: In Progress labels Dec 16, 2020

jalvz mentioned this issue Dec 16, 2020

[Fleet] APM Server managed by Elastic Agent with Fleet - 7.12 #4558

Closed

15 tasks

jalvz mentioned this issue Jan 6, 2021

[APM] Add experimental support for data streams elastic/kibana#87501

Closed

jalvz mentioned this issue Jan 21, 2021

[meta] APM Server managed by Elastic Agent with Fleet (GA) #4636

Closed

16 tasks

axw modified the milestones: 7.11, 7.12 Feb 4, 2021

axw removed the [zube]: In Review label Feb 5, 2021

axw added the [zube]: Blocked label Feb 5, 2021

sorenlouv mentioned this issue Feb 16, 2021

[APM] Upload sourcemaps via API in Kibana elastic/kibana#91458

Closed

axw removed this from the 7.12 milestone Mar 3, 2021

axw added the v7.13.0 label Mar 3, 2021

axw closed this as completed Mar 25, 2021

zube bot added [zube]: Done and removed [zube]: Blocked labels Mar 25, 2021

axw removed the [zube]: Done label Mar 31, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Investigate moving sourcemapping to an enrich processor #3606

Investigate moving sourcemapping to an enrich processor #3606

axw commented Apr 2, 2020

axw commented Dec 3, 2020

axw commented Dec 3, 2020

axw commented Dec 8, 2020

axw commented Dec 9, 2020

axw commented Mar 10, 2021

axw commented Mar 16, 2021 •

edited

Loading

simitt commented Mar 16, 2021

axw commented Mar 16, 2021

axw commented Mar 18, 2021

axw commented Mar 22, 2021

axw commented Mar 25, 2021

axw commented Mar 25, 2021

Investigate moving sourcemapping to an enrich processor #3606

Investigate moving sourcemapping to an enrich processor #3606

Comments

axw commented Apr 2, 2020

axw commented Dec 3, 2020

axw commented Dec 3, 2020

axw commented Dec 8, 2020

axw commented Dec 9, 2020

axw commented Mar 10, 2021

axw commented Mar 16, 2021 • edited Loading

simitt commented Mar 16, 2021

axw commented Mar 16, 2021

axw commented Mar 18, 2021

axw commented Mar 22, 2021

axw commented Mar 25, 2021

axw commented Mar 25, 2021

axw commented Mar 16, 2021 •

edited

Loading