
[BUG] Memory regression with 3002 #58791

Closed · ixs opened this issue Oct 22, 2020 · 17 comments
Labels
Bug (broken, incorrect, or confusing behavior), severity-high (2nd top severity, seen by most users, causes major problems)

Comments

ixs (Contributor) commented Oct 22, 2020

Description
After upgrading from 3001 to 3002, a massively increased runtime for highstates has been observed.

A 3001 minion highstate:

[root@bb2 ~]# time salt-call state.highstate
...
Summary for local
--------------
Succeeded: 259 (changed=4)
Failed:      0
--------------
Total states run:     259
Total run time:    90.437 s

real	2m22.719s
user	0m23.485s
sys	0m3.956s
[root@bb2 ~]#

The same host, once the minion and the master have been upgraded to 3002:

[root@bb2 ~]# time salt-call state.highstate
...
Summary for local
--------------
Succeeded: 255 (changed=4)
Failed:      1
--------------
Total states run:     256
Total run time:   279.516 s

real	9m33.315s
user	0m22.670s
sys	0m3.968s
[root@bb2 ~]#

The failure was a service error:

----------
          ID: service_iptables
    Function: service.running
        Name: iptables
      Result: False
     Comment: Running scope as unit run-15222.scope.
              Job for iptables.service failed because the control process exited with error code. See "systemctl status iptables.service" and "journalctl -xe" for details.
     Started: 23:01:31.466069
    Duration: 97.186 ms
     Changes:

Possibly unrelated...

Setup

Salt Version:
           Salt: 3001.1

Dependency Versions:
           cffi: 1.13.0
       cherrypy: 5.6.0
       dateutil: Not Installed
      docker-py: Not Installed
          gitdb: Not Installed
      gitpython: Not Installed
         Jinja2: 2.11.1
        libgit2: Not Installed
       M2Crypto: 0.35.2
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.6.2
   mysql-python: 1.3.12
      pycparser: 2.19
       pycrypto: 2.6.1
   pycryptodome: Not Installed
         pygit2: Not Installed
         Python: 3.6.8 (default, Apr  2 2020, 13:34:55)
   python-gnupg: Not Installed
         PyYAML: 3.13
          PyZMQ: 17.0.0
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.5.3
            ZMQ: 4.1.4

System Versions:
           dist: centos 7 Core
         locale: UTF-8
        machine: x86_64
        release: 3.10.0-1127.19.1.el7.x86_64
         system: Linux
        version: CentOS Linux 7 Core

Master and minion are using the same stack; when running on 3002 the libraries are unchanged.

Downgrading to 3001 resolves the issue.

ixs added the Bug label Oct 22, 2020
ixs (Contributor, Author) commented Oct 22, 2020

@cachedout as requested on Slack...

garethgreenaway (Contributor) commented
@ixs Thanks for the report. Can you provide an example of the state where you're seeing these slow times?

ixs (Contributor, Author) commented Oct 23, 2020

@garethgreenaway Not really. It's a highstate run with a bunch of basic states plus third-party formulas and a lot of pillars.

I'm trying to recreate the problem inside a venv right now so that I can come up with a test for git bisect...

ixs (Contributor, Author) commented Oct 25, 2020

Deleted the last two comments because I was following a wrong path there, looking at runtime.
What needs to be looked at instead is memory consumption.

I wrote a quick shell script that basically does the following (a rough sketch is below):

python setup.py sdist
Build an RPM from it using the regular Salt SPEC file.
Install that RPM on my master.
Run salt-master and salt-minion under /usr/bin/time -v, then run salt-call state.highstate on the master itself.

Then compare the memory consumption of both. 3001 was always under 110MB and 3002 was always above.
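A minimal sketch of the measure step for one build, assuming the sdist/RPM build and install have already happened; the commit argument, the archive/ layout, and the 30-second settle time are illustrative, and the minion was measured the same way:

#!/bin/bash
# Start the master in the foreground under GNU time, drive a highstate,
# stop the master, and pull the peak RSS out of the time report.
commit="$1"
logdir="archive/$commit"
mkdir -p "$logdir"

/usr/bin/time -v -o "$logdir/master.log" salt-master -l quiet &
time_pid=$!
sleep 30                                     # give the master time to start up

salt-call state.highstate > "$logdir/highstate.log" 2>&1

master_pid=$(pgrep -P "$time_pid")           # salt-master is the direct child of GNU time
kill -TERM "$master_pid"                     # GNU time writes its report once the child exits
wait "$time_pid"

grep 'Maximum resident set size' "$logdir/master.log"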

Git bisect says 5c15def is the first bad commit.
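Roughly how the bisect was driven; check_memory.sh is a hypothetical wrapper around the build-and-measure steps above, and the 110 MB cut-off between good and bad is only an assumption based on the numbers in this issue:

# Mark the bad and good boundaries (the v3002 and v3001 release tags), then let git drive the wrapper.
git bisect start v3002 v3001
git bisect run ./check_memory.sh    # exit 0 if peak RSS < ~110 MB, 1 otherwise, 125 to skip a broken build
git bisect reset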

Comparing the previous commit and that one shows a striking difference in memory consumption:

tail -n 24 {5c15def711d1460703b951655dca9c3f0e0e4c10,8b7e9b34a5f5f5f3961a3263a3d363b5c0a0b4f2}/master.log
==> 5c15def711d1460703b951655dca9c3f0e0e4c10/master.log <==
The salt master is shutdown. Master received a SIGTERM. Exited.
        Command being timed: "salt-master -l quiet"
        User time (seconds): 218.03
        System time (seconds): 77.62
        Percent of CPU this job got: 78%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 6:15.65
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 208924
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 25
        Minor (reclaiming a frame) page faults: 3584324
        Voluntary context switches: 49267
        Involuntary context switches: 24971
        Swaps: 0
        File system inputs: 19568
        File system outputs: 51680
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
==> 8b7e9b34a5f5f5f3961a3263a3d363b5c0a0b4f2/master.log <==
The salt master is shutdown. Master received a SIGTERM. Exited.
        Command being timed: "salt-master -l quiet"
        User time (seconds): 127.37
        System time (seconds): 61.89
        Percent of CPU this job got: 74%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 4:14.55
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 89396
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 2431712
        Voluntary context switches: 33952
        Involuntary context switches: 15422
        Swaps: 0
        File system inputs: 384
        File system outputs: 29256
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

ixs (Contributor, Author) commented Oct 25, 2020

It might be worthwhile to check that #58785 and this issue are not about the same underlying problem. Going from 90 MB to 200 MB of memory consumption within a 10-minute runtime is pretty extreme and could be seen as a memory leak...

ixs (Contributor, Author) commented Oct 25, 2020

Running a second git bisect round and keeping the log files from my salt runs shows a pretty clear increase in memory consumption starting with commit 5c15def.

[root@herbert salt]# for id in $(git rev-list --all --topo-order | grep -o -E "$(cd ../archive/; ls | tr '\n' '|')"); do echo "$id: "$(grep 'Maximum resident set size' ../archive/$id/master.log); done
dec7a4a94381dd306a47135cac4047cfc84a7dfb:  Maximum resident set size (kbytes): 146868
a8a7eec0113b16cd7c92328df935fdb3839aa80d:  Maximum resident set size (kbytes): 139476
23906b6941ec5c28560a85805c468b0e6d3e8122:  Maximum resident set size (kbytes): 119296
26b5714f57eaae5d4895d3d03c233c350f241c90:  Maximum resident set size (kbytes): 131096
5c15def711d1460703b951655dca9c3f0e0e4c10:  Maximum resident set size (kbytes): 162236
8b7e9b34a5f5f5f3961a3263a3d363b5c0a0b4f2:  Maximum resident set size (kbytes): 90160
983d20e3bd4cee61ca12a699e43afc2bc0add451:  Maximum resident set size (kbytes): 96568
417b3ac43cd09920d80592541e9ca3f4b259568b:  Maximum resident set size (kbytes): 89936
348e930dff02bfa43789873b8aaad5966e0dde2d:  Maximum resident set size (kbytes): 90676
885e8763ec6183435962525f0ccd308bef620f83:  Maximum resident set size (kbytes): 93400
a836b6d8d09d6633e055a3fa15c77bccb12ca432:  Maximum resident set size (kbytes): 88560
1decb95c69af87651800d1ddae9f689df0425cf3:  Maximum resident set size (kbytes): 86380
[root@herbert salt]#

ixs changed the title from "[BUG] Massive performance regression with 3002" to "[BUG] Memory regression with 3002" Oct 25, 2020
ixs (Contributor, Author) commented Oct 25, 2020

For reference: the test scripts, the RPMs built during git bisect, and the logs containing runtime info from GNU time are at https://github.com/ixs/salt-58791

ixs (Contributor, Author) commented Oct 25, 2020

Reverting the following changes brings memory consumption back down to ~90 MB (a rough revert sequence is sketched after the list).
GNU time says "Maximum resident set size (kbytes): 91688", which is fine.

  • 5c15def (Test loader namespaceing)
  • 26b5714 (Honor loaded_base_name)
  • a6921d6 (Clean up sys.modules when the loader is GC'ed)
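A rough way to reproduce those reverts locally; the branch name is arbitrary, the revert order may need adjusting, and any conflicts have to be resolved by hand:

# Revert the three loader commits on top of the 3002 release, then rebuild as before.
git checkout -b loader-reverts v3002
git revert --no-edit a6921d6 26b5714 5c15def
python setup.py sdist               # feed the resulting tarball into the RPM build step above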

I'd recommend rolling a 3002.1 release ASAP.

RPMs, a Python sdist tarball, and logs are at https://github.com/ixs/salt-58791/tree/master/archive/fba062ce8e96d9fe657fdc7cc8ce808cc7a4aacd

ixs (Contributor, Author) commented Oct 25, 2020

@OrangeDog @whytewolf @tuxthepenguin84 @Arderos @jbouter Have a look at https://github.com/ixs/salt-58791/tree/master/archive/fba062ce8e96d9fe657fdc7cc8ce808cc7a4aacd and feel free to try out the CentOS 7 RPMs there or the Salt tarball.
That build has regular memory consumption for me; I bet it also fixes your issues.

garethgreenaway (Contributor) commented
@ixs Thanks for the report. It's likely that the cause is the same as in #58785 and the fix in #58811 will solve this issue. We're planning for the 3002.1 release early next week.

garethgreenaway added the v3002.2 label and removed the needs-triage label Oct 27, 2020
garethgreenaway added this to the 3002.1 milestone Oct 27, 2020
garethgreenaway added the severity-high label Oct 27, 2020
ixs (Contributor, Author) commented Oct 27, 2020

@garethgreenaway I agree that this is likely the same issue as #58785.
But I've got bad news: I've been running a local build with the fix from #58811 on my machine, and as far as I can see it is not fixing the problem:

[root@herbert salt]# dmesg -T | grep "Out of memory"
[Mon Oct 26 16:43:35 2020] Out of memory: Kill process 527 (/usr/bin/python) score 143 or sacrifice child
[Mon Oct 26 17:19:27 2020] Out of memory: Kill process 548 (/usr/bin/python) score 167 or sacrifice child
[Mon Oct 26 17:19:27 2020] Out of memory: Kill process 537 (/usr/bin/python) score 166 or sacrifice child
[Mon Oct 26 18:26:12 2020] Out of memory: Kill process 544 (/usr/bin/python) score 218 or sacrifice child
[Mon Oct 26 19:18:49 2020] Out of memory: Kill process 534 (/usr/bin/python) score 211 or sacrifice child
[Mon Oct 26 20:01:21 2020] Out of memory: Kill process 21877 (/usr/bin/python) score 189 or sacrifice child
[Mon Oct 26 20:39:52 2020] Out of memory: Kill process 19237 (/usr/bin/python) score 222 or sacrifice child
[Mon Oct 26 21:37:05 2020] Out of memory: Kill process 21879 (/usr/bin/python) score 188 or sacrifice child
[Mon Oct 26 22:13:58 2020] Out of memory: Kill process 1618 (/usr/bin/python) score 224 or sacrifice child

As you can see, I was running the code with the fix from #58811 from 16:30 to about 22:30... 🤨

The problem is not just the maintenance process. In my case the processes being killed are all MWorkers... The maintenance process is fine.
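One way to see which of the master's processes is actually growing; the process names and polling interval are only an example:

# Poll the per-process RSS of everything salt-master related, largest first.
watch -n 60 "ps -eo pid,rss,etime,args --sort=-rss | grep '[s]alt-master' | head -n 15"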

sagetherage linked a pull request that will close this issue Oct 27, 2020
Timbus (Contributor) commented Oct 28, 2020

Isn't this just a string-matching issue? The loaded_base_name was changed, but the code was/is still testing whether module names startswith(loaded_base_name) in a few places, so the match is broken? That's just how it looks from the bisect, at least.
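A quick way to locate the prefix checks being referred to; the salt/loader.py path matches the 3002-era layout and may have moved since:

# Find where module names are matched against loaded_base_name in the loader.
grep -n "loaded_base_name" salt/loader.py
grep -n "startswith" salt/loader.py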

dwoz (Contributor) commented Nov 3, 2020

@ixs Please see #58862

ixs (Contributor, Author) commented Nov 3, 2020

@dwoz Thanks for the heads-up. Let me pull that change into my local tree and test...

ixs (Contributor, Author) commented Nov 3, 2020

@dwoz My test suite is super slow and times out with your code...
I am pretty sure this is unrelated to your change, but I have not yet found out why this is happening.

sagetherage removed this from the 3002.2 milestone Nov 16, 2020
sagetherage added this to the Aluminium milestone Nov 16, 2020
sagetherage added the Aluminium label and removed the v3002.2 label Nov 16, 2020
sagetherage removed the Aluminium and phase-execute labels Mar 22, 2021
sagetherage modified the milestones: Aluminium, Approved Mar 22, 2021
dwoz (Contributor) commented Jun 4, 2021

@ixs Can you confirm this has been resolved with 3002.2?

sagetherage added the Phosphorus label Jun 18, 2021
sagetherage modified the milestones: Approved, Phosphorus Jun 18, 2021
Ch3LL removed the Phosphorus label Mar 30, 2022
garethgreenaway (Contributor) commented
Closing this out with the assumption that the fix went into 3002.2. If the issue persists please let us know.
