
[BUG] Memory regression with 3002 #58791

Closed · ixs opened this issue Oct 22, 2020 · 17 comments
Labels
Bug (broken, incorrect, or confusing behavior), severity-high (2nd top severity, seen by most users, causes major problems)

Comments

ixs (Contributor) commented Oct 22, 2020

Description
After upgrading from 3001 to 3002, a massively increased runtime for highstates has been observed.

A 3001 minion highstate:

[root@bb2 ~]# time salt-call state.highstate
...
Summary for local
--------------
Succeeded: 259 (changed=4)
Failed:      0
--------------
Total states run:     259
Total run time:    90.437 s

real	2m22.719s
user	0m23.485s
sys	0m3.956s
[root@bb2 ~]#

The same host, once the minion and the master have been upgraded to 3002:

[root@bb2 ~]# time salt-call state.highstate
...
Summary for local
--------------
Succeeded: 255 (changed=4)
Failed:      1
--------------
Total states run:     256
Total run time:   279.516 s

real	9m33.315s
user	0m22.670s
sys	0m3.968s
[root@bb2 ~]#

The failure was a service error:

----------
          ID: service_iptables
    Function: service.running
        Name: iptables
      Result: False
     Comment: Running scope as unit run-15222.scope.
              Job for iptables.service failed because the control process exited with error code. See "systemctl status iptables.service" and "journalctl -xe" for details.
     Started: 23:01:31.466069
    Duration: 97.186 ms
     Changes:

Possibly unrelated...

Setup

Salt Version:
           Salt: 3001.1

Dependency Versions:
           cffi: 1.13.0
       cherrypy: 5.6.0
       dateutil: Not Installed
      docker-py: Not Installed
          gitdb: Not Installed
      gitpython: Not Installed
         Jinja2: 2.11.1
        libgit2: Not Installed
       M2Crypto: 0.35.2
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.6.2
   mysql-python: 1.3.12
      pycparser: 2.19
       pycrypto: 2.6.1
   pycryptodome: Not Installed
         pygit2: Not Installed
         Python: 3.6.8 (default, Apr  2 2020, 13:34:55)
   python-gnupg: Not Installed
         PyYAML: 3.13
          PyZMQ: 17.0.0
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.5.3
            ZMQ: 4.1.4

System Versions:
           dist: centos 7 Core
         locale: UTF-8
        machine: x86_64
        release: 3.10.0-1127.19.1.el7.x86_64
         system: Linux
        version: CentOS Linux 7 Core

Master and minion are using the same stack; when running on 3002 the libraries are unchanged.

Downgrading to 3001 resolves the issue.

ixs added the Bug label Oct 22, 2020
ixs (Contributor, Author) commented Oct 22, 2020

@cachedout as requested on Slack...

garethgreenaway (Contributor) commented
@ixs Thanks for the report. Can you provide an example of the state where you're seeing these slow times?

ixs (Contributor, Author) commented Oct 23, 2020

@garethgreenaway Not really. It's a highstate run with a bunch of basic states plus third-party formulas and a lot of pillars.

I'm trying to recreate the problem inside a venv right now so that I can come up with a test for git bisect...

ixs (Contributor, Author) commented Oct 25, 2020

Deleted the last two comments because I was following a wrong path there, looking at runtime.
What needs to be looked at instead is memory consumption.

I wrote a quick shell script that basically does the following (a rough sketch is below):

python setup.py sdist
Build an RPM from it using the regular Salt SPEC file.
Install that RPM on my master.
Run salt-master and salt-minion under /usr/bin/time -v, then run salt-call state.highstate on the master itself.

Then compare the memory consumption of both. 3001 was always under 110MB and 3002 was always above.
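A minimal sketch of the measure step for one build, assuming the sdist/RPM build and install have already happened; the commit argument, the archive/ layout, and the 30-second settle time are illustrative, and the minion was measured the same way:

#!/bin/bash
# Start the master in the foreground under GNU time, drive a highstate,
# stop the master, and pull the peak RSS out of the time report.
commit="$1"
logdir="archive/$commit"
mkdir -p "$logdir"

/usr/bin/time -v -o "$logdir/master.log" salt-master -l quiet &
time_pid=$!
sleep 30                                     # give the master time to start up

salt-call state.highstate > "$logdir/highstate.log" 2>&1

master_pid=$(pgrep -P "$time_pid")           # salt-master is the direct child of GNU time
kill -TERM "$master_pid"                     # GNU time writes its report once the child exits
wait "$time_pid"

grep 'Maximum resident set size' "$logdir/master.log"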

Git bisect says 5c15def is the first bad commit.
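Roughly how the bisect was driven; check_memory.sh is a hypothetical wrapper around the build-and-measure steps above, and the 110 MB cut-off between good and bad is only an assumption based on the numbers in this issue:

# Mark the bad and good boundaries (the v3002 and v3001 release tags), then let git drive the wrapper.
git bisect start v3002 v3001
git bisect run ./check_memory.sh    # exit 0 if peak RSS < ~110 MB, 1 otherwise, 125 to skip a broken build
git bisect reset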

Comparing the previous commit and that one shows a striking difference in memory consumption:

tail -n 24 {5c15def711d1460703b951655dca9c3f0e0e4c10,8b7e9b34a5f5f5f3961a3263a3d363b5c0a0b4f2}/master.log
==> 5c15def711d1460703b951655dca9c3f0e0e4c10/master.log <==
The salt master is shutdown. Master received a SIGTERM. Exited.
        Command being timed: "salt-master -l quiet"
        User time (seconds): 218.03
        System time (seconds): 77.62
        Percent of CPU this job got: 78%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 6:15.65
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 208924
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 25
        Minor (reclaiming a frame) page faults: 3584324
        Voluntary context switches: 49267
        Involuntary context switches: 24971
        Swaps: 0
        File system inputs: 19568
        File system outputs: 51680
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
==> 8b7e9b34a5f5f5f3961a3263a3d363b5c0a0b4f2/master.log <==
The salt master is shutdown. Master received a SIGTERM. Exited.
        Command being timed: "salt-master -l quiet"
        User time (seconds): 127.37
        System time (seconds): 61.89
        Percent of CPU this job got: 74%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 4:14.55
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 89396
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 2431712
        Voluntary context switches: 33952
        Involuntary context switches: 15422
        Swaps: 0
        File system inputs: 384
        File system outputs: 29256
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

ixs (Contributor, Author) commented Oct 25, 2020

It might be worthwhile to check that #58785 and this issue are not about the same underlying problem. Going from 90 MB to 200 MB of memory consumption within a 10-minute runtime is pretty extreme and could be seen as a memory leak...

ixs (Contributor, Author) commented Oct 25, 2020

Running a second git bisect round and keeping the log files from my salt runs shows a pretty clear increase in memory consumption starting with commit 5c15def.

[root@herbert salt]# for id in $(git rev-list --all --topo-order | grep -o -E "$(cd ../archive/; ls | tr '\n' '|')"); do echo "$id: "$(grep 'Maximum resident set size' ../archive/$id/master.log); done
dec7a4a94381dd306a47135cac4047cfc84a7dfb:  Maximum resident set size (kbytes): 146868
a8a7eec0113b16cd7c92328df935fdb3839aa80d:  Maximum resident set size (kbytes): 139476
23906b6941ec5c28560a85805c468b0e6d3e8122:  Maximum resident set size (kbytes): 119296
26b5714f57eaae5d4895d3d03c233c350f241c90:  Maximum resident set size (kbytes): 131096
5c15def711d1460703b951655dca9c3f0e0e4c10:  Maximum resident set size (kbytes): 162236
8b7e9b34a5f5f5f3961a3263a3d363b5c0a0b4f2:  Maximum resident set size (kbytes): 90160
983d20e3bd4cee61ca12a699e43afc2bc0add451:  Maximum resident set size (kbytes): 96568
417b3ac43cd09920d80592541e9ca3f4b259568b:  Maximum resident set size (kbytes): 89936
348e930dff02bfa43789873b8aaad5966e0dde2d:  Maximum resident set size (kbytes): 90676
885e8763ec6183435962525f0ccd308bef620f83:  Maximum resident set size (kbytes): 93400
a836b6d8d09d6633e055a3fa15c77bccb12ca432:  Maximum resident set size (kbytes): 88560
1decb95c69af87651800d1ddae9f689df0425cf3:  Maximum resident set size (kbytes): 86380
[root@herbert salt]#

ixs changed the title from "[BUG] Massive performance regression with 3002" to "[BUG] Memory regression with 3002" Oct 25, 2020
ixs (Contributor, Author) commented Oct 25, 2020

For reference: the test scripts, the RPMs built during git bisect, and the logs containing runtime info from GNU time are at https://github.com/ixs/salt-58791

ixs (Contributor, Author) commented Oct 25, 2020

Reverting the following changes brings memory consumption back down to ~90 MB (a rough revert sequence is sketched after the list).
GNU time says "Maximum resident set size (kbytes): 91688", which is fine.

  • 5c15def (Test loader namespaceing)
  • 26b5714 (Honor loaded_base_name)
  • a6921d6 (Clean up sys.modules when the loader is GC'ed)
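A rough way to reproduce those reverts locally; the branch name is arbitrary, the revert order may need adjusting, and any conflicts have to be resolved by hand:

# Revert the three loader commits on top of the 3002 release, then rebuild as before.
git checkout -b loader-reverts v3002
git revert --no-edit a6921d6 26b5714 5c15def
python setup.py sdist               # feed the resulting tarball into the RPM build step above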

I'd recommend rolling a 3002.1 release ASAP.

RPMs, a Python sdist tarball, and logs are at https://github.com/ixs/salt-58791/tree/master/archive/fba062ce8e96d9fe657fdc7cc8ce808cc7a4aacd

ixs (Contributor, Author) commented Oct 25, 2020

@OrangeDog @whytewolf @tuxthepenguin84 @Arderos @jbouter Have a look at https://github.com/ixs/salt-58791/tree/master/archive/fba062ce8e96d9fe657fdc7cc8ce808cc7a4aacd and feel free to try out the CentOS 7 RPMs there or the Salt tarball.
That build has regular memory consumption for me; I bet it also fixes your issues.

garethgreenaway (Contributor) commented
@ixs Thanks for the report. It's likely that the cause is the same as in #58785 and the fix in #58811 will solve this issue. We're planning for the 3002.1 release early next week.

garethgreenaway added the v3002.2 label and removed the needs-triage label Oct 27, 2020
garethgreenaway added this to the 3002.1 milestone Oct 27, 2020
garethgreenaway added the severity-high label Oct 27, 2020
ixs (Contributor, Author) commented Oct 27, 2020

@garethgreenaway I agree that this is likely the same issue as #58785.
But I've got bad news: I've been running a local build with the fix from #58811 on my machine, and as far as I can see it is not fixing the problem:

[root@herbert salt]# dmesg -T | grep "Out of memory"
[Mon Oct 26 16:43:35 2020] Out of memory: Kill process 527 (/usr/bin/python) score 143 or sacrifice child
[Mon Oct 26 17:19:27 2020] Out of memory: Kill process 548 (/usr/bin/python) score 167 or sacrifice child
[Mon Oct 26 17:19:27 2020] Out of memory: Kill process 537 (/usr/bin/python) score 166 or sacrifice child
[Mon Oct 26 18:26:12 2020] Out of memory: Kill process 544 (/usr/bin/python) score 218 or sacrifice child
[Mon Oct 26 19:18:49 2020] Out of memory: Kill process 534 (/usr/bin/python) score 211 or sacrifice child
[Mon Oct 26 20:01:21 2020] Out of memory: Kill process 21877 (/usr/bin/python) score 189 or sacrifice child
[Mon Oct 26 20:39:52 2020] Out of memory: Kill process 19237 (/usr/bin/python) score 222 or sacrifice child
[Mon Oct 26 21:37:05 2020] Out of memory: Kill process 21879 (/usr/bin/python) score 188 or sacrifice child
[Mon Oct 26 22:13:58 2020] Out of memory: Kill process 1618 (/usr/bin/python) score 224 or sacrifice child

As you can see, I was running the code with the fix from #58811 from 16:30 to about 22:30... 🤨

The problem is not just the maintenance process. In my case the processes being killed are all MWorkers... The maintenance process is fine.
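One way to see which of the master's processes is actually growing; the process names and polling interval are only an example:

# Poll the per-process RSS of everything salt-master related, largest first.
watch -n 60 "ps -eo pid,rss,etime,args --sort=-rss | grep '[s]alt-master' | head -n 15"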

sagetherage linked a pull request that will close this issue Oct 27, 2020
Timbus (Contributor) commented Oct 28, 2020

Isn't this just a string-matching issue? The loaded_base_name was changed, but the code was/is still testing whether module names startswith(loaded_base_name) in a few places, so the match is broken? That's just how it looks from the bisect, at least.
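A quick way to locate the prefix checks being referred to; the salt/loader.py path matches the 3002-era layout and may have moved since:

# Find where module names are matched against loaded_base_name in the loader.
grep -n "loaded_base_name" salt/loader.py
grep -n "startswith" salt/loader.py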

dwoz (Contributor) commented Nov 3, 2020

@ixs Please see #58862

ixs (Contributor, Author) commented Nov 3, 2020

@dwoz Thanks for the heads-up. Let me pull that change into my local tree and test...

ixs (Contributor, Author) commented Nov 3, 2020

@dwoz My test suite is super slow and times out with your code...
I am pretty sure this is unrelated to your change, but I have not yet found out why this is happening.

sagetherage removed this from the 3002.2 milestone Nov 16, 2020
sagetherage added this to the Aluminium milestone Nov 16, 2020
sagetherage added the Aluminium label and removed the v3002.2 label Nov 16, 2020
sagetherage removed the Aluminium and phase-execute labels Mar 22, 2021
sagetherage modified the milestones: Aluminium, Approved Mar 22, 2021
dwoz (Contributor) commented Jun 4, 2021

@ixs Can you confirm this has been resolved with 3002.2?

sagetherage added the Phosphorus label Jun 18, 2021
sagetherage modified the milestones: Approved, Phosphorus Jun 18, 2021
Ch3LL removed the Phosphorus label Mar 30, 2022
garethgreenaway (Contributor) commented
Closing this out with the assumption that the fix went into 3002.2. If the issue persists please let us know.
