
Heap usage increased #3520

Closed
axw opened this issue Mar 23, 2020 · 5 comments

@axw
Member

axw commented Mar 23, 2020

There has been a non-negligible increase in heap allocations since March 11:

[graph: heap allocations over time, showing the increase starting March 11]

This coincides with #3418.

axw self-assigned this Mar 23, 2020
axw added the v7.7.0 label Mar 23, 2020
@axw
Member Author

axw commented Mar 23, 2020

So, it turns out that the issue started happening earlier than the modules commit.

From the last good run on March 10, note the apm-server version reported: "apm-server version 8.0.0 built on 6 March [2800966]"

05:04:15  hey-apm            | 2020/03/10 05:04:14 run.go:36: 10.000224548s elapsed since event generation completed
05:04:15  hey-apm            | transactions sent ............. 0
05:04:15  hey-apm            | transactions dropped .......... 0
05:04:15  hey-apm            | errors sent ................... 5109
05:04:15  hey-apm            | errors dropped ................ 21338
05:04:15  hey-apm            |  - success % .................. 19.32
05:04:15  hey-apm            | total events sent ............. 5109
05:04:15  hey-apm            |  - per second ................. 119.34
05:04:15  hey-apm            |  - accepted ................... 5109
05:04:15  hey-apm            |    - per second ............... 119.34
05:04:15  hey-apm            |    - success % ................ 19.32
05:04:15  hey-apm            | total requests ................ 1
05:04:15  hey-apm            | failed ........................ 0
05:04:15  hey-apm            | apm-server version 8.0.0 built on 6 March [2800966]
05:04:15  hey-apm            | heap ................ 85.3Mb
05:04:15  hey-apm            | total allocated ..... 370.5Mb
05:04:15  hey-apm            | heap allocated ...... 74.8Mb
05:04:15  hey-apm            | mallocs ............. 4501343
05:04:15  hey-apm            | num GC .............. 13
05:04:15  hey-apm            | cooling down 60.0 seconds... 

From the job run on March 11: "apm-server version 8.0.0 built on 10 March [64c4140]"

05:05:02  hey-apm            | 2020/03/11 05:05:01 run.go:36: 10.000199787s elapsed since event generation completed
05:05:02  hey-apm            | transactions sent ............. 0
05:05:02  hey-apm            | transactions dropped .......... 0
05:05:02  hey-apm            | errors sent ................... 5070
05:05:02  hey-apm            | errors dropped ................ 21424
05:05:02  hey-apm            |  - success % .................. 19.14
05:05:02  hey-apm            | total events sent ............. 5070
05:05:02  hey-apm            |  - per second ................. 118.35
05:05:02  hey-apm            |  - accepted ................... 5070
05:05:02  hey-apm            |    - per second ............... 118.35
05:05:02  hey-apm            |    - success % ................ 19.14
05:05:02  hey-apm            | total requests ................ 1
05:05:02  hey-apm            | failed ........................ 0
05:05:02  hey-apm            | apm-server version 8.0.0 built on 10 March [64c4140]
05:05:02  hey-apm            | heap ................ 98.7Mb
05:05:02  hey-apm            | total allocated ..... 447.1Mb
05:05:02  hey-apm            | heap allocated ...... 89.0Mb
05:05:02  hey-apm            | mallocs ............. 5090610
05:05:02  hey-apm            | num GC .............. 13
05:05:02  hey-apm            | cooling down 60.0 seconds... 
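
Between those two runs, heap allocated grew from 74.8Mb to 89.0Mb (roughly +19%) and mallocs from ~4.50M to ~5.09M (roughly +13%), at essentially the same event rate. For reference, these report lines correspond to Go's runtime.MemStats counters; I haven't double-checked exactly which fields hey-apm reads, but something along these lines prints the equivalent figures for a local process:

package main

import (
    "fmt"
    "runtime"
)

func main() {
    var ms runtime.MemStats
    runtime.ReadMemStats(&ms)

    // Rough counterparts of the hey-apm report lines; which exact
    // MemStats fields hey-apm uses is an assumption here.
    fmt.Printf("heap ................ %.1fMb\n", float64(ms.HeapSys)/1024/1024)
    fmt.Printf("total allocated ..... %.1fMb\n", float64(ms.TotalAlloc)/1024/1024)
    fmt.Printf("heap allocated ...... %.1fMb\n", float64(ms.HeapAlloc)/1024/1024)
    fmt.Printf("mallocs ............. %d\n", ms.Mallocs)
    fmt.Printf("num GC .............. %d\n", ms.NumGC)
}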

Commit 64c4140 immediately precedes the modules commit. It seems the hey-apm job runs against whatever the latest snapshot image happens to be. I think we should perhaps build and publish our own image nightly, specifically for hey-apm? (CC @elastic/observablt-robots)

The only thing in there that looks suspicious is d5a0c46, but that only applies to spans AFAIK. I don't see why error allocations would be impacted by that.
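
To make that concern concrete: if that commit means every span event carries its own copy of the full metadata rather than a shared reference, each span costs extra map/struct allocations, but only on the span path. A purely illustrative sketch (the types and fields below are made up for the example, not the real apm-server model):

package main

// Illustrative only: hypothetical, simplified types.
type Metadata struct {
    Labels map[string]string
}

type Span struct {
    Name     string
    Metadata Metadata
}

// decorateSpans gives every span its own copy of the shared metadata.
// The per-span map allocations are the kind of cost that would show up
// for span-heavy workloads, but not for error-only ones.
func decorateSpans(spans []Span, shared Metadata) {
    for i := range spans {
        labels := make(map[string]string, len(shared.Labels))
        for k, v := range shared.Labels {
            labels[k] = v
        }
        spans[i].Metadata = Metadata{Labels: labels}
    }
}

func main() {
    spans := make([]Span, 1000)
    decorateSpans(spans, Metadata{Labels: map[string]string{"env": "test"}})
}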

@kuisathaverat
Contributor

Sure, we want to have a Docker image of every valid commit stored in our Docker registry; it's part of our incremental deployments effort.

@axw
Member Author

axw commented Mar 24, 2020

So far I've been unable to reproduce the difference locally. Just to recap, the change is apparently somewhere in:

2800966...64c4140

Looking through the changes, the only thing that could possibly make sense to me is d5a0c46 (adding all metadata fields to spans). As mentioned above, workloads that do not involve spans shouldn't be affected.

My current hypothesis is that the load from hey-apm benchmark jobs is queuing up, and spilling over into subsequent jobs. Currently, hey-apm has a fixed 60s cooldown between jobs, but does not restart apm-server, or explicitly wait for it to quiesce.

I'm going to take a look at modifying hey-apm to wait until the server's queue is empty before proceeding.
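
Roughly what I have in mind is polling a server stats endpoint between runs until the number of in-flight events drops to zero (or a timeout hits). Sketch only; the endpoint URL and the JSON field names below are assumptions rather than the actual apm-server/libbeat API:

package main

import (
    "encoding/json"
    "fmt"
    "net/http"
    "time"
)

// waitForQuiesce polls statsURL until the reported number of pending events
// is zero, or the timeout expires. The URL and the "active" field are
// placeholders; the real stats shape may differ.
func waitForQuiesce(statsURL string, timeout time.Duration) error {
    deadline := time.Now().Add(timeout)
    for time.Now().Before(deadline) {
        resp, err := http.Get(statsURL)
        if err != nil {
            return err
        }
        var stats struct {
            Libbeat struct {
                Pipeline struct {
                    Events struct {
                        Active int `json:"active"`
                    } `json:"events"`
                } `json:"pipeline"`
            } `json:"libbeat"`
        }
        err = json.NewDecoder(resp.Body).Decode(&stats)
        resp.Body.Close()
        if err != nil {
            return err
        }
        if stats.Libbeat.Pipeline.Events.Active == 0 {
            return nil // queue drained; safe to start the next run
        }
        time.Sleep(5 * time.Second)
    }
    return fmt.Errorf("server did not quiesce within %s", timeout)
}

func main() {
    // Hypothetical stats endpoint; would require the server's HTTP
    // monitoring endpoint to be enabled.
    if err := waitForQuiesce("http://localhost:5066/stats", 2*time.Minute); err != nil {
        fmt.Println("warning:", err)
    }
}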

@jalvz
Contributor

jalvz commented Mar 24, 2020

the load from hey-apm benchmark jobs is queuing up, and spilling over into subsequent jobs

That probably happens, yes. I wouldn't know why it happens more after 11 March, though.

In any case, I would generally take CI hey-apm results with a grain of salt (and yeah, it's hard to reproduce locally).

I can try to dig a bit into that spans commit.

@axw
Member Author

axw commented Apr 8, 2020

I never did get to the bottom of the increase, but with a series of refactoring and optimisation PRs, we're now at or above (in some cases significantly above) the previous performance:

[graph: hey-apm benchmark metrics over time, back at or above previous levels]

(Ignore the dip in ingestion rate towards the end: that comes from enabling continuous profiling, which interfered with the benchmarking.)

We've still got some room for improvement, but I think we can close this for now and continue improvements as a matter of course. I've opened an issue about better control of the load-testing environment, which should also enable us to turn on continuous profiling: elastic/hey-apm#167
