
Heap usage increased #3520

Closed
axw opened this issue Mar 23, 2020 · 5 comments

@axw
Member

axw commented Mar 23, 2020

There has been a non-negligible increase in heap allocations since March 11:

[graph: heap allocations over time, showing the increase starting March 11]

This coincides with #3418.

axw self-assigned this Mar 23, 2020
axw added the v7.7.0 label Mar 23, 2020
@axw
Member Author

axw commented Mar 23, 2020

So, it turns out that the issue started happening earlier than the modules commit.

From the last good run on March 10, note the apm-server version reported: "apm-server version 8.0.0 built on 6 March [2800966]"

05:04:15  hey-apm            | 2020/03/10 05:04:14 run.go:36: 10.000224548s elapsed since event generation completed
05:04:15  hey-apm            | transactions sent ............. 0
05:04:15  hey-apm            | transactions dropped .......... 0
05:04:15  hey-apm            | errors sent ................... 5109
05:04:15  hey-apm            | errors dropped ................ 21338
05:04:15  hey-apm            |  - success % .................. 19.32
05:04:15  hey-apm            | total events sent ............. 5109
05:04:15  hey-apm            |  - per second ................. 119.34
05:04:15  hey-apm            |  - accepted ................... 5109
05:04:15  hey-apm            |    - per second ............... 119.34
05:04:15  hey-apm            |    - success % ................ 19.32
05:04:15  hey-apm            | total requests ................ 1
05:04:15  hey-apm            | failed ........................ 0
05:04:15  hey-apm            | apm-server version 8.0.0 built on 6 March [2800966]
05:04:15  hey-apm            | heap ................ 85.3Mb
05:04:15  hey-apm            | total allocated ..... 370.5Mb
05:04:15  hey-apm            | heap allocated ...... 74.8Mb
05:04:15  hey-apm            | mallocs ............. 4501343
05:04:15  hey-apm            | num GC .............. 13
05:04:15  hey-apm            | cooling down 60.0 seconds... 

From the job run on March 11: "apm-server version 8.0.0 built on 10 March [64c4140]"

05:05:02  hey-apm            | 2020/03/11 05:05:01 run.go:36: 10.000199787s elapsed since event generation completed
05:05:02  hey-apm            | transactions sent ............. 0
05:05:02  hey-apm            | transactions dropped .......... 0
05:05:02  hey-apm            | errors sent ................... 5070
05:05:02  hey-apm            | errors dropped ................ 21424
05:05:02  hey-apm            |  - success % .................. 19.14
05:05:02  hey-apm            | total events sent ............. 5070
05:05:02  hey-apm            |  - per second ................. 118.35
05:05:02  hey-apm            |  - accepted ................... 5070
05:05:02  hey-apm            |    - per second ............... 118.35
05:05:02  hey-apm            |    - success % ................ 19.14
05:05:02  hey-apm            | total requests ................ 1
05:05:02  hey-apm            | failed ........................ 0
05:05:02  hey-apm            | apm-server version 8.0.0 built on 10 March [64c4140]
05:05:02  hey-apm            | heap ................ 98.7Mb
05:05:02  hey-apm            | total allocated ..... 447.1Mb
05:05:02  hey-apm            | heap allocated ...... 89.0Mb
05:05:02  hey-apm            | mallocs ............. 5090610
05:05:02  hey-apm            | num GC .............. 13
05:05:02  hey-apm            | cooling down 60.0 seconds... 
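
Between those two runs, heap allocated grew from 74.8Mb to 89.0Mb (roughly +19%) and mallocs from ~4.50M to ~5.09M (roughly +13%), at essentially the same event rate. For reference, these report lines correspond to Go's runtime.MemStats counters; I haven't double-checked exactly which fields hey-apm reads, but something along these lines prints the equivalent figures for a local process:

package main

import (
    "fmt"
    "runtime"
)

func main() {
    var ms runtime.MemStats
    runtime.ReadMemStats(&ms)

    // Rough counterparts of the hey-apm report lines; which exact
    // MemStats fields hey-apm uses is an assumption here.
    fmt.Printf("heap ................ %.1fMb\n", float64(ms.HeapSys)/1024/1024)
    fmt.Printf("total allocated ..... %.1fMb\n", float64(ms.TotalAlloc)/1024/1024)
    fmt.Printf("heap allocated ...... %.1fMb\n", float64(ms.HeapAlloc)/1024/1024)
    fmt.Printf("mallocs ............. %d\n", ms.Mallocs)
    fmt.Printf("num GC .............. %d\n", ms.NumGC)
}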

Commit 64c4140 immediately precedes the modules commit. It seems the hey-apm job runs against whatever the latest snapshot image happens to be. I think we should perhaps build and publish our own image nightly, specifically for hey-apm? (CC @elastic/observablt-robots)

The only thing in there that looks suspicious is d5a0c46, but that only applies to spans AFAIK. I don't see why error allocations would be impacted by that.
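
To make that concern concrete: if that commit means every span event carries its own copy of the full metadata rather than a shared reference, each span costs extra map/struct allocations, but only on the span path. A purely illustrative sketch (the types and fields below are made up for the example, not the real apm-server model):

package main

// Illustrative only: hypothetical, simplified types.
type Metadata struct {
    Labels map[string]string
}

type Span struct {
    Name     string
    Metadata Metadata
}

// decorateSpans gives every span its own copy of the shared metadata.
// The per-span map allocations are the kind of cost that would show up
// for span-heavy workloads, but not for error-only ones.
func decorateSpans(spans []Span, shared Metadata) {
    for i := range spans {
        labels := make(map[string]string, len(shared.Labels))
        for k, v := range shared.Labels {
            labels[k] = v
        }
        spans[i].Metadata = Metadata{Labels: labels}
    }
}

func main() {
    spans := make([]Span, 1000)
    decorateSpans(spans, Metadata{Labels: map[string]string{"env": "test"}})
}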

@kuisathaverat
Contributor

Sure, we want to have a Docker image of every valid commit stored in our Docker registry; it's part of our incremental deployments effort.

@axw
Member Author

axw commented Mar 24, 2020

So far I've been unable to reproduce the difference locally. Just to recap, the change is apparently somewhere in:

2800966...64c4140

Looking through the changes, the only thing that could possibly make sense to me is d5a0c46 (adding all metadata fields to spans). As mentioned above, workloads that do not involve spans shouldn't be affected.

My current hypothesis is that the load from hey-apm benchmark jobs is queuing up, and spilling over into subsequent jobs. Currently, hey-apm has a fixed 60s cooldown between jobs, but does not restart apm-server, or explicitly wait for it to quiesce.

I'm going to take a look at modifying hey-apm to wait until the server's queue is empty before proceeding.
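
Roughly what I have in mind is polling a server stats endpoint between runs until the number of in-flight events drops to zero (or a timeout hits). Sketch only; the endpoint URL and the JSON field names below are assumptions rather than the actual apm-server/libbeat API:

package main

import (
    "encoding/json"
    "fmt"
    "net/http"
    "time"
)

// waitForQuiesce polls statsURL until the reported number of pending events
// is zero, or the timeout expires. The URL and the "active" field are
// placeholders; the real stats shape may differ.
func waitForQuiesce(statsURL string, timeout time.Duration) error {
    deadline := time.Now().Add(timeout)
    for time.Now().Before(deadline) {
        resp, err := http.Get(statsURL)
        if err != nil {
            return err
        }
        var stats struct {
            Libbeat struct {
                Pipeline struct {
                    Events struct {
                        Active int `json:"active"`
                    } `json:"events"`
                } `json:"pipeline"`
            } `json:"libbeat"`
        }
        err = json.NewDecoder(resp.Body).Decode(&stats)
        resp.Body.Close()
        if err != nil {
            return err
        }
        if stats.Libbeat.Pipeline.Events.Active == 0 {
            return nil // queue drained; safe to start the next run
        }
        time.Sleep(5 * time.Second)
    }
    return fmt.Errorf("server did not quiesce within %s", timeout)
}

func main() {
    // Hypothetical stats endpoint; would require the server's HTTP
    // monitoring endpoint to be enabled.
    if err := waitForQuiesce("http://localhost:5066/stats", 2*time.Minute); err != nil {
        fmt.Println("warning:", err)
    }
}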

@jalvz
Contributor

jalvz commented Mar 24, 2020

the load from hey-apm benchmark jobs is queuing up, and spilling over into subsequent jobs

That probably happens, yes. I wouldn't know why it happens more after 11 March, though.

In any case, I would generally take CI hey-apm results with a grain of salt (and yeah, it's hard to reproduce locally).

I can try to dig a bit into that spans commit.

@axw
Member Author

axw commented Apr 8, 2020

I never did get to the bottom of the increase, but with a series of refactoring and optimisation PRs, we're now at or above (in some cases significantly above) the previous performance:

[graph: hey-apm benchmark metrics over time, back at or above previous levels]

(Ignore the dip in ingestion rate towards the end: that comes from enabling continuous profiling, which interfered with the benchmarking.)

We've still got some room for improvement, but I think we can close this for now and continue improvements as a matter of course. I've opened an issue about better control of the load-testing environment, which should also enable us to turn on continuous profiling: elastic/hey-apm#167
