Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CT: Error: Failed to launch the browser process! #150

Closed
zepumph opened this issue Jun 24, 2022 · 28 comments
Closed

CT: Error: Failed to launch the browser process! #150

zepumph opened this issue Jun 24, 2022 · 28 comments
Assignees

Comments

@zepumph
Copy link
Member

zepumph commented Jun 24, 2022

I think this has something to do with bayes and the number of puppeteer instances on it, but I'm not really sure. I'll have to take a look.

This looks to be exactly the same as puppeteer/puppeteer#6757 (except docker). I'll see if there is something I need to do better about closing it down.


arithmetic : build
Build failed with status code 1:
Running "report-media" task

Running "clean" task

Running "build" task
>> TypeScript compilation complete: 24665ms
Building runnable repository (arithmetic, brands: phet, phet-io)
Building brand: phet
>> Webpack build complete: 6747ms
>> Production minification complete: 35456ms (2241851 bytes)
>> Debug minification complete: 0ms (20575992 bytes)
Building brand: phet-io
>> Webpack build complete: 5473ms
>> Production minification complete: 30259ms (2275428 bytes)
>> Debug minification complete: 29481ms (2586396 bytes)
No migration rules found at ../chipper/dist/js/phet-io-sim-specific/repos/arithmetic/js/arithmetic-migration-rules.js, no rules to be bundled with phet-io.js.
No client guides found at ../phet-io-sim-specific/repos/arithmetic/client-guide/, no guides being built.
>> TypeScript compilation complete: 6355ms
Fatal error: Perennial task failed:
Error: Failed to launch the browser process!
[0624/084852.594901:FATAL:zygote_host_impl_linux.cc(174)] Check failed: process.IsValid(). Failed to launch zygote process
#0 0x558cadb82469 base::debug::CollectStackTrace()
#1 0x558cadae8f33 base::debug::StackTrace::StackTrace()
#2 0x558cadafbd50 logging::LogMessage::~LogMessage()
#3 0x558cadafc90e logging::LogMessage::~LogMessage()
#4 0x558cabbbcb08 content::ZygoteHostImpl::LaunchZygote()
#5 0x558cad696a30 content::(anonymous namespace)::LaunchZygoteHelper()
#6 0x558cab21b6cb content::ZygoteCommunication::Init()
#7 0x558cab21bcd4 content::CreateGenericZygote()
#8 0x558cad695e95 content::ContentMainRunnerImpl::Initialize()
#9 0x558cad693eb9 content::RunContentProcess()
#10 0x558cad69400e content::ContentMain()
#11 0x558cad6eefca headless::(anonymous namespace)::RunContentMain()
#12 0x558cad6eecd5 headless::HeadlessShellMain()
#13 0x558caa2e0668 ChromeMain
#14 0x7f55b3e79555 __libc_start_main
#15 0x558caa2e04aa _start

Received signal 6
#0 0x558cadb82469 base::debug::CollectStackTrace()
#1 0x558cadae8f33 base::debug::StackTrace::StackTrace()
#2 0x558cadb81f71 base::debug::(anonymous namespace)::StackDumpSignalHandler()
#3 0x7f55b85f6630 (/usr/lib64/libpthread-2.17.so+0xf62f)
#4 0x7f55b3e8d387 __GI_raise
#5 0x7f55b3e8ea78 __GI_abort
#6 0x558cadb811f5 base::debug::BreakDebuggerAsyncSafe()
#7 0x558cadafc221 logging::LogMessage::~LogMessage()
#8 0x558cadafc90e logging::LogMessage::~LogMessage()
#9 0x558cabbbcb08 content::ZygoteHostImpl::LaunchZygote()
#10 0x558cad696a30 content::(anonymous namespace)::LaunchZygoteHelper()
#11 0x558cab21b6cb content::ZygoteCommunication::Init()
#12 0x558cab21bcd4 content::CreateGenericZygote()
#13 0x558cad695e95 content::ContentMainRunnerImpl::Initialize()
#14 0x558cad693eb9 content::RunContentProcess()
#15 0x558cad69400e content::ContentMain()
#16 0x558cad6eefca headless::(anonymous namespace)::RunContentMain()
#17 0x558cad6eecd5 headless::HeadlessShellMain()
#18 0x558caa2e0668 ChromeMain
#19 0x7f55b3e79555 __libc_start_main
#20 0x558caa2e04aa _start
r8: 0000000000000000 r9: 0000000000000300 r10: 0000000000000008 r11: 0000000000000206
r12: 000039d800300140 r13: 000039d800300158 r14: 000039d800300150 r15: 000039d800221400
di: 0000000000004fa6 si: 0000000000004fa6 bp: 00007ffc4a797e90 bx: 00007ffc4a7986d0
dx: 0000000000000006 ax: 0000000000000000 cx: ffffffffffffffff sp: 00007ffc4a797d58
ip: 00007f55b3e8d387 efl: 0000000000000206 cgf: aaaa000000000033 erf: 0000000000000000
trp: 0000000000000000 msk: 0000000000000000 cr2: 0000000000000000
[end of stack trace]

TROUBLESHOOTING: https://github.com/puppeteer/puppeteer/blob/main/docs/troubleshooting.md

at onClose (/data/share/phet/continuous-testing/ct-snapshots/1656077734615/chipper/node_modules/puppeteer/lib/cjs/puppeteer/node/BrowserRunner.js:241:20)
at ChildProcess.<anonymous> (/data/share/phet/continuous-testing/ct-snapshots/1656077734615/chipper/node_modules/puppeteer/lib/cjs/puppeteer/node/BrowserRunner.js:232:79)
at ChildProcess.emit (events.js:327:22)
at Process.ChildProcess._handle.onexit (internal/child_process.js:277:12)
Full Error details:
Error: Failed to launch the browser process!
[0624/084852.594901:FATAL:zygote_host_impl_linux.cc(174)] Check failed: process.IsValid(). Failed to launch zygote process
#0 0x558cadb82469 base::debug::CollectStackTrace()
#1 0x558cadae8f33 base::debug::StackTrace::StackTrace()
#2 0x558cadafbd50 logging::LogMessage::~LogMessage()
#3 0x558cadafc90e logging::LogMessage::~LogMessage()
#4 0x558cabbbcb08 content::ZygoteHostImpl::LaunchZygote()
#5 0x558cad696a30 content::(anonymous namespace)::LaunchZygoteHelper()
#6 0x558cab21b6cb content::ZygoteCommunication::Init()
#7 0x558cab21bcd4 content::CreateGenericZygote()
#8 0x558cad695e95 content::ContentMainRunnerImpl::Initialize()
#9 0x558cad693eb9 content::RunContentProcess()
#10 0x558cad69400e content::ContentMain()
#11 0x558cad6eefca headless::(anonymous namespace)::RunContentMain()
#12 0x558cad6eecd5 headless::HeadlessShellMain()
#13 0x558caa2e0668 ChromeMain
#14 0x7f55b3e79555 __libc_start_main
#15 0x558caa2e04aa _start

Received signal 6
#0 0x558cadb82469 base::debug::CollectStackTrace()
#1 0x558cadae8f33 base::debug::StackTrace::StackTrace()
#2 0x558cadb81f71 base::debug::(anonymous namespace)::StackDumpSignalHandler()
#3 0x7f55b85f6630 (/usr/lib64/libpthread-2.17.so+0xf62f)
#4 0x7f55b3e8d387 __GI_raise
#5 0x7f55b3e8ea78 __GI_abort
#6 0x558cadb811f5 base::debug::BreakDebuggerAsyncSafe()
#7 0x558cadafc221 logging::LogMessage::~LogMessage()
#8 0x558cadafc90e logging::LogMessage::~LogMessage()
#9 0x558cabbbcb08 content::ZygoteHostImpl::LaunchZygote()
#10 0x558cad696a30 content::(anonymous namespace)::LaunchZygoteHelper()
#11 0x558cab21b6cb content::ZygoteCommunication::Init()
#12 0x558cab21bcd4 content::CreateGenericZygote()
#13 0x558cad695e95 content::ContentMainRunnerImpl::Initialize()
#14 0x558cad693eb9 content::RunContentProcess()
#15 0x558cad69400e content::ContentMain()
#16 0x558cad6eefca headless::(anonymous namespace)::RunContentMain()
#17 0x558cad6eecd5 headless::HeadlessShellMain()
#18 0x558caa2e0668 ChromeMain
#19 0x7f55b3e79555 __libc_start_main
#20 0x558caa2e04aa _start
r8: 0000000000000000 r9: 0000000000000300 r10: 0000000000000008 r11: 0000000000000206
r12: 000039d800300140 r13: 000039d800300158 r14: 000039d800300150 r15: 000039d800221400
di: 0000000000004fa6 si: 0000000000004fa6 bp: 00007ffc4a797e90 bx: 00007ffc4a7986d0
dx: 0000000000000006 ax: 0000000000000000 cx: ffffffffffffffff sp: 00007ffc4a797d58
ip: 00007f55b3e8d387 efl: 0000000000000206 cgf: aaaa000000000033 erf: 0000000000000000
trp: 0000000000000000 msk: 0000000000000000 cr2: 0000000000000000
[end of stack trace]

TROUBLESHOOTING: https://github.com/puppeteer/puppeteer/blob/main/docs/troubleshooting.md
�
Snapshot from 6/24/2022, 7:35:34 AM
@zepumph
Copy link
Member Author

zepumph commented Jun 24, 2022

Likely puppeteer/puppeteer#1981 is a byproduct of this error.

@zepumph
Copy link
Member Author

zepumph commented Jun 24, 2022

I was also able to reproduce this just by going onto bayes and building arithmetic --brands=phet-io in the continuous-client repos. It worked when I stopped all pm2 processes. Something with continuous-client is not working well over a long period of time.

@zepumph
Copy link
Member Author

zepumph commented Jun 27, 2022

I think @mattpen and I got to the bottom of this. The new puppeteer client seems to be orphaning chrome instances when it stops or crashes, and so this morning, when all pm2 processes were stopped, we found 575 chrome instances still running. I will work on better cleanup/error handling for the puppeteer client.

Also noting that once orphaned chrome instances were killed off, I was able to successfully build arithmetic in the same while while all pm2 processes were running.

@KatieWoe
Copy link

KatieWoe commented Jun 29, 2022

A number of repos are showing similar errors this morning (6/29/22)
Example:

arithmetic : build
Build failed with status code 1:
Running "report-media" task

Running "clean" task

Running "build" task
>> TypeScript compilation complete: 21831ms
Building runnable repository (arithmetic, brands: phet, phet-io)
Building brand: phet
>> Webpack build complete: 6363ms
>> Production minification complete: 33441ms (2241784 bytes)
>> Debug minification complete: 1ms (20575041 bytes)
Building brand: phet-io
>> Webpack build complete: 4835ms
>> Production minification complete: 33991ms (2275351 bytes)
>> Debug minification complete: 39640ms (2586496 bytes)
No migration rules found at ../chipper/dist/js/phet-io-sim-specific/repos/arithmetic/js/arithmetic-migration-rules.js, no rules to be bundled with phet-io.js.
No client guides found at ../phet-io-sim-specific/repos/arithmetic/client-guide/, no guides being built.
>> TypeScript compilation complete: 5083ms
Fatal error: Perennial task failed:
Error: Failed to launch the browser process!
[0629/065748.409687:FATAL:zygote_host_impl_linux.cc(174)] Check failed: process.IsValid(). Failed to launch zygote process
#0 0x5589efb53469 base::debug::CollectStackTrace()
#1 0x5589efab9f33 base::debug::StackTrace::StackTrace()
#2 0x5589efaccd50 logging::LogMessage::~LogMessage()
#3 0x5589efacd90e logging::LogMessage::~LogMessage()
#4 0x5589edb8db08 content::ZygoteHostImpl::LaunchZygote()
#5 0x5589ef667a30 content::(anonymous namespace)::LaunchZygoteHelper()
#6 0x5589ed1ec6cb content::ZygoteCommunication::Init()
#7 0x5589ed1eccd4 content::CreateGenericZygote()
#8 0x5589ef666e95 content::ContentMainRunnerImpl::Initialize()
#9 0x5589ef664eb9 content::RunContentProcess()
#10 0x5589ef66500e content::ContentMain()
#11 0x5589ef6bffca headless::(anonymous namespace)::RunContentMain()
#12 0x5589ef6bfcd5 headless::HeadlessShellMain()
#13 0x5589ec2b1668 ChromeMain
#14 0x7fbd63230555 __libc_start_main
#15 0x5589ec2b14aa _start

Received signal 6
#0 0x5589efb53469 base::debug::CollectStackTrace()
#1 0x5589efab9f33 base::debug::StackTrace::StackTrace()
#2 0x5589efb52f71 base::debug::(anonymous namespace)::StackDumpSignalHandler()
#3 0x7fbd679ad630 (/usr/lib64/libpthread-2.17.so+0xf62f)
#4 0x7fbd63244387 __GI_raise
#5 0x7fbd63245a78 __GI_abort
#6 0x5589efb521f5 base::debug::BreakDebuggerAsyncSafe()
#7 0x5589efacd221 logging::LogMessage::~LogMessage()
#8 0x5589efacd90e logging::LogMessage::~LogMessage()
#9 0x5589edb8db08 content::ZygoteHostImpl::LaunchZygote()
#10 0x5589ef667a30 content::(anonymous namespace)::LaunchZygoteHelper()
#11 0x5589ed1ec6cb content::ZygoteCommunication::Init()
#12 0x5589ed1eccd4 content::CreateGenericZygote()
#13 0x5589ef666e95 content::ContentMainRunnerImpl::Initialize()
#14 0x5589ef664eb9 content::RunContentProcess()
#15 0x5589ef66500e content::ContentMain()
#16 0x5589ef6bffca headless::(anonymous namespace)::RunContentMain()
#17 0x5589ef6bfcd5 headless::HeadlessShellMain()
#18 0x5589ec2b1668 ChromeMain
#19 0x7fbd63230555 __libc_start_main
#20 0x5589ec2b14aa _start
r8: 0000000000000000 r9: 0000000000000300 r10: 0000000000000008 r11: 0000000000000206
r12: 0000187600300140 r13: 0000187600300158 r14: 0000187600300150 r15: 0000187600221400
di: 000000000000975f si: 000000000000975f bp: 00007ffc57156180 bx: 00007ffc571569c0
dx: 0000000000000006 ax: 0000000000000000 cx: ffffffffffffffff sp: 00007ffc57156048
ip: 00007fbd63244387 efl: 0000000000000206 cgf: aaaa000000000033 erf: 0000000000000000
trp: 0000000000000000 msk: 0000000000000000 cr2: 0000000000000000
[end of stack trace]

TROUBLESHOOTING: https://github.com/puppeteer/puppeteer/blob/main/docs/troubleshooting.md

at onClose (/data/share/phet/continuous-testing/ct-snapshots/1656506053578/chipper/node_modules/puppeteer/lib/cjs/puppeteer/node/BrowserRunner.js:241:20)
at ChildProcess.<anonymous> (/data/share/phet/continuous-testing/ct-snapshots/1656506053578/chipper/node_modules/puppeteer/lib/cjs/puppeteer/node/BrowserRunner.js:232:79)
at ChildProcess.emit (events.js:327:22)
at Process.ChildProcess._handle.onexit (internal/child_process.js:277:12)
Full Error details:
Error: Failed to launch the browser process!
[0629/065748.409687:FATAL:zygote_host_impl_linux.cc(174)] Check failed: process.IsValid(). Failed to launch zygote process
#0 0x5589efb53469 base::debug::CollectStackTrace()
#1 0x5589efab9f33 base::debug::StackTrace::StackTrace()
#2 0x5589efaccd50 logging::LogMessage::~LogMessage()
#3 0x5589efacd90e logging::LogMessage::~LogMessage()
#4 0x5589edb8db08 content::ZygoteHostImpl::LaunchZygote()
#5 0x5589ef667a30 content::(anonymous namespace)::LaunchZygoteHelper()
#6 0x5589ed1ec6cb content::ZygoteCommunication::Init()
#7 0x5589ed1eccd4 content::CreateGenericZygote()
#8 0x5589ef666e95 content::ContentMainRunnerImpl::Initialize()
#9 0x5589ef664eb9 content::RunContentProcess()
#10 0x5589ef66500e content::ContentMain()
#11 0x5589ef6bffca headless::(anonymous namespace)::RunContentMain()
#12 0x5589ef6bfcd5 headless::HeadlessShellMain()
#13 0x5589ec2b1668 ChromeMain
#14 0x7fbd63230555 __libc_start_main
#15 0x5589ec2b14aa _start

Received signal 6
#0 0x5589efb53469 base::debug::CollectStackTrace()
#1 0x5589efab9f33 base::debug::StackTrace::StackTrace()
#2 0x5589efb52f71 base::debug::(anonymous namespace)::StackDumpSignalHandler()
#3 0x7fbd679ad630 (/usr/lib64/libpthread-2.17.so+0xf62f)
#4 0x7fbd63244387 __GI_raise
#5 0x7fbd63245a78 __GI_abort
#6 0x5589efb521f5 base::debug::BreakDebuggerAsyncSafe()
#7 0x5589efacd221 logging::LogMessage::~LogMessage()
#8 0x5589efacd90e logging::LogMessage::~LogMessage()
#9 0x5589edb8db08 content::ZygoteHostImpl::LaunchZygote()
#10 0x5589ef667a30 content::(anonymous namespace)::LaunchZygoteHelper()
#11 0x5589ed1ec6cb content::ZygoteCommunication::Init()
#12 0x5589ed1eccd4 content::CreateGenericZygote()
#13 0x5589ef666e95 content::ContentMainRunnerImpl::Initialize()
#14 0x5589ef664eb9 content::RunContentProcess()
#15 0x5589ef66500e content::ContentMain()
#16 0x5589ef6bffca headless::(anonymous namespace)::RunContentMain()
#17 0x5589ef6bfcd5 headless::HeadlessShellMain()
#18 0x5589ec2b1668 ChromeMain
#19 0x7fbd63230555 __libc_start_main
#20 0x5589ec2b14aa _start
r8: 0000000000000000 r9: 0000000000000300 r10: 0000000000000008 r11: 0000000000000206
r12: 0000187600300140 r13: 0000187600300158 r14: 0000187600300150 r15: 0000187600221400
di: 000000000000975f si: 000000000000975f bp: 00007ffc57156180 bx: 00007ffc571569c0
dx: 0000000000000006 ax: 0000000000000000 cx: ffffffffffffffff sp: 00007ffc57156048
ip: 00007fbd63244387 efl: 0000000000000206 cgf: aaaa000000000033 erf: 0000000000000000
trp: 0000000000000000 msk: 0000000000000000 cr2: 0000000000000000
[end of stack trace]

TROUBLESHOOTING: https://github.com/puppeteer/puppeteer/blob/main/docs/troubleshooting.md
�
Snapshot from 6/29/2022, 6:34:13 AM

@zepumph
Copy link
Member Author

zepumph commented Jun 30, 2022

This morning there were 5200 instances of chrome running. I should look into this! I was hoping to have time this afternoon

zepumph added a commit that referenced this issue Jun 30, 2022
zepumph added a commit that referenced this issue Jun 30, 2022
@zepumph
Copy link
Member Author

zepumph commented Jun 30, 2022

https://stackoverflow.com/questions/62220867/puppeteer-chromium-instances-remain-active-in-the-background-after-browser-disc has something I'd like to try, instead of using zygote (an internal process forker), lets have things forked via node. Worth a shot!

@zepumph
Copy link
Member Author

zepumph commented Jun 30, 2022

This didn't seem to change anything, After 2 hours, there are double the number of chrome instances running. I'll have to keep looking.

@samreid
Copy link
Member

samreid commented Jul 6, 2022

I noticed that we are calling withServer again and again in a while (true) loop in QuickServer. Each time, this allocates a new node server:

    const server = http.createServer( ( req, res ) => {

We call server.close on it, but after reading https://stackoverflow.com/questions/14626636/how-do-i-shutdown-a-node-js-https-server-immediately it's unclear whether that's enough to clear that memory. @zepumph could there be a memory leak in these server instances?

@zepumph
Copy link
Member Author

zepumph commented Jul 6, 2022

I just logged onto bayes and saw that there were 2000 instances of chrome running.

Then I ran pm2 stop continuous-client and checked and there were 500 chrome instances still running.

I then ran pm2 stop continuous-quick-server and checked and there were 0 chrome instances still running. Quite likely however we are running Puppeteer in continuous client is similarly problematic to the quick server. Thanks for the intel @samreid! It may have to do with withServer, but I think the most likely culprit is puppeteerLoad itself.

@samreid
Copy link
Member

samreid commented Jul 7, 2022

Can you reproduce this locally, or only on bayes? Do we think it is related to pm2 or linux? Do you think I could reproduce the problem on mac with or without pm2? I'm investigating playwright in a separate issue but would like a way to test how it impacts this problem (even if I have to test on bayes directly).

@samreid
Copy link
Member

samreid commented Jul 7, 2022

I logged in to bayes as phet-admin and saw pm2 list like so:

[phet-admin@bayes ~]$ pm2 list
┌─────┬────────────────────────────┬─────────────┬─────────┬─────────┬──────────┬────────┬──────┬───────────┬──────────┬──────────┬──────────┬──────────┐
│ id  │ name                       │ namespace   │ version │ mode    │ pid      │ uptime │ ↺    │ status    │ cpu      │ mem      │ user     │ watching │
├─────┼────────────────────────────┼─────────────┼─────────┼─────────┼──────────┼────────┼──────┼───────────┼──────────┼──────────┼──────────┼──────────┤
│ 4   │ continuous-client          │ default     │ N/A     │ fork    │ 20052    │ 11h    │ 1    │ online    │ 100%     │ 387.6mb  │ phe… │ disabled │
│ 3   │ continuous-quick-server    │ default     │ N/A     │ fork    │ 20049    │ 11h    │ 0    │ online    │ 0%       │ 102.7mb  │ phe… │ disabled │
│ 1   │ continuous-server          │ default     │ N/A     │ fork    │ 20043    │ 11h    │ 10   │ online    │ 0%       │ 346.3mb  │ phe… │ disabled │
│ 0   │ phettest-server            │ default     │ 0.0.0   │ fork    │ 20035    │ 11h    │ 2    │ online    │ 0%       │ 43.8mb   │ phe… │ disabled │
│ 2   │ yotta-server               │ default     │ 0.0.0   │ fork    │ 20068    │ 11h    │ 2    │ online    │ 0%       │ 37.1mb   │ phe… │ disabled │
└─────┴────────────────────────────┴─────────────┴─────────┴─────────┴──────────┴────────┴──────┴───────────┴──────────┴──────────┴──────────┴──────────┘

that pgrep -c chrome reported 1579. Also, ps aux --no-heading showed many as defunct. Also there were many listings from perennial:

...
phet-ad+ 40211  0.5  0.0      0     0 ?        Zs   18:21   0:48 [chrome] <defunct>
phet-ad+ 40408  0.3  0.0      0     0 ?        Zs   15:19   1:00 [chrome] <defunct>
phet-ad+ 40414  0.0  0.0 17469332 57344 ?      Ssl  16:48   0:03 /data/share/phet/continuous-quick-server/perennial/node_mo
phet-ad+ 40421  0.0  0.0 17299480 35760 ?      S    16:48   0:00 /data/share/phet/continuous-quick-server/perennial/node_mo
phet-ad+ 40423  0.0  0.0 17299464 35764 ?      S    16:48   0:00 /data/share/phet/continuous-quick-server/perennial/node_mo
phet-ad+ 40425  0.0  0.0 17299496 10972 ?      S    16:48   0:00 /data/share/phet/continuous-quick-server/perennial/node_mo
phet-ad+ 40442  0.0  0.0 17492024 40044 ?      Sl   16:48   0:04 /data/share/phet/continuous-quick-server/perennial/node_mo
phet-ad+ 40445  0.0  0.0 25919264 35332 ?      Sl   16:48   0:00 /data/share/phet/continuous-quick-server/perennial/node_mo
phet-ad+ 40555  0.0  0.0 17332372 37372 ?      Sl   16:48   0:00 /data/share/phet/continuous-quick-server/perennial/node_mo
phet-ad+ 40727  0.0  0.0 17465132 57832 ?      Ssl  16:48   0:04 /data/share/phet/continuous-quick-server/perennial/node_mo
phet-ad+ 40730  0.0  0.0 17299480 35760 ?      S    16:48   0:00 /data/share/phet/continuous-quick-server/perennial/node_mo
phet-ad+ 40732  0.0  0.0 17299464 35764 ?      S    16:48   0:00 /data/share/phet/continuous-quick-server/perennial/node_mo
phet-ad+ 40734  0.0  0.0 17299500 10976 ?      S    16:48   0:00 /data/share/phet/continuous-quick-server/perennial/node_mo
phet-ad+ 40748  0.0  0.0 17491828 31116 ?      Sl   16:48   0:00 /data/share/phet/continuous-quick-server/perennial/node_mo
phet-ad+ 40755  0.0  0.0 25927528 35576 ?      Sl   16:48   0:00 /data/share/phet/continuous-quick-server/perennial/node_mo
phet-ad+ 40909  0.6  0.0      0     0 ?        Zs   18:21   0:58 [chrome] <defunct>
phet-ad+ 40927  0.3  0.0      0     0 ?        Zs   18:21   0:30 [chrome] <defunct>
phet-ad+ 40942  0.6  0.0      0     0 ?        Zs   18:21   0:53 [chrome] <defunct>

I also saw:

[phet-admin@bayes ~]$ pgrep -c node
44

After stopping continuous-server, I saw:

[phet-admin@bayes ~]$ pm2 stop continuous-server
[PM2] Applying action stopProcessId on app [continuous-server](ids: [ 1 ])
[phet-admin@bayes ~]$ pgrep -c chrome
1582
[phet-admin@bayes ~]$ pgrep -c node
35

These numbers don't make sense, since the number of chromes went from 1579 to 1582.

Then I stopped quick-server:

[phet-admin@bayes ~]$ pm2 stop continuous-quick-server
[PM2] Applying action stopProcessId on app [continuous-quick-server](ids: [ 3 ])
...
[phet-admin@bayes ~]$ pgrep -c chrome
784
[phet-admin@bayes ~]$ pgrep -c node
33

So that reduced chromes from 1582 to 784 and nodes from 35 to 33. Still disconcerting that we still have 784 chromes running.

After restarting those pm2 services, and running pgrep -c chrome intermittently, I saw the number of chrome instances climb from 796 on startup to 811 after a few minutes.

@samreid
Copy link
Member

samreid commented Jul 7, 2022

I also saw puppeteer/puppeteer#1825 which described defunct zombie processes like we see here.

I also saw https://askubuntu.com/questions/201303/what-is-a-defunct-process-and-why-doesnt-it-get-killed which describes defunct processes.

I used that process to find the PPID of the bad parent process, and I ran kill -9 20052 and saw that pgrep -c chrome dropped to 94. One other process accounted for the other defunct chromes. I killed that parent process too.

I checked that pm2 is still operational, but after 5 minutes, there are already 50 defunct chrome processes.

@samreid
Copy link
Member

samreid commented Jul 7, 2022

During my review of quick server, I noticed two potential problems:

  • The browser was being created over and over again twice per iteration in the while loop. It seems like we could benefit by creating the browser once and asking it to generate new pages.
  • It looks like there is a bug in puppeteerLaunch because it is killing the browser if the browser was passed in, not if puppeteerLaunch owns the browser. I would like to try correcting these and pushing to test on bayes.

By the way, bayes has climbed to 204 chrome processes (as reported by pgrep -c chrome) while I worked on this. UPDATE: 220

@samreid
Copy link
Member

samreid commented Jul 7, 2022

I stopped continuous-server and continuous-quick-server, pulled aqua and perennial in each, and restarted. The number of pgrep -c chrome chrome instances held steady at 208 for a while, but just jumped to 241. But only a few (if any) look like defunct zombies.

UPDATE: Holding steady at 241 for quite a while.

@samreid
Copy link
Member

samreid commented Jul 7, 2022

The perennial commit seems like a bugfix, but warrants a code review since it is used in many places, and one or more usage sites could have been relying on the bug somehow? I marked blocks-publication and ready for review for that part.

Also, we are still around 300 chrome instances, and not many look like zombies. So there is probably still a problem of some sort.

@samreid
Copy link
Member

samreid commented Jul 7, 2022

This morning (10 hours later than preceding comment), I observed:

[phet-admin@bayes ~]$ pgrep -c chrome
927

With around 600 marked as defunct according to ps -ef | grep defunct.

@zepumph
Copy link
Member Author

zepumph commented Jul 7, 2022

Great fix! I was using this command to see how many chrome instances there were ps aux | grep chrome | wc -l.

We have to apply this to the continuous-client process too. I just did this. I bet that will solve the rest of the defunct guys.

@zepumph
Copy link
Member Author

zepumph commented Jul 7, 2022

After pulling/restarting continuous-client, I see no defunct chrome instances. There were 307 upon restart, and are now 313. I believe this is solved, but we can check in tomorrow. Thanks so much @samreid for finding that bug!

@zepumph
Copy link
Member Author

zepumph commented Jul 13, 2022

We are still hovering around 330 chrome instances. Closing

@zepumph zepumph closed this as completed Jul 13, 2022
@zepumph zepumph reopened this Nov 1, 2022
zepumph added a commit to phetsims/perennial that referenced this issue Nov 1, 2022
zepumph added a commit that referenced this issue Nov 1, 2022
@zepumph
Copy link
Member Author

zepumph commented Nov 1, 2022

In general things were working well, but when there were errors outside the expected places for the webpage, puppeteerLoad still wasn't cleaning up. Thus, above I added even more cleanup.

When looking at the error logs, I see the page.goto has caused a fair number of errors over the past months. I wonder if that contributed to the errors. I'll check into to see if that helps things soon.

zepumph added a commit to phetsims/perennial that referenced this issue Nov 1, 2022
zepumph added a commit to phetsims/perennial that referenced this issue Nov 1, 2022
@zepumph
Copy link
Member Author

zepumph commented Nov 2, 2022

This happened again tonight, and my best guess is that it was because 1de7fdf was launching too many chrome instances. I restarted things and pushed a fix over in #152 and ps aux | grep chrome | wc -l is showing steadily with 88 chrome instances.

@zepumph
Copy link
Member Author

zepumph commented Nov 7, 2022

After the weekend we are at 112 chrome instances. Perhaps there is still a slow problem. After stopping all CT processes, there are still 25 zombies. I will keep tabs on these.

@zepumph
Copy link
Member Author

zepumph commented Nov 28, 2022

Now there are 184 chrome instances. That is still growing. I need to continue to work here.

zepumph added a commit that referenced this issue Nov 29, 2022
@zepumph
Copy link
Member Author

zepumph commented Nov 29, 2022

@mattpen and I spoke about this yesterday. We feel like it is important to improve the logging in our puppeteer processes, and then from there figure out where the error case is coming from. I'll take the lead!

@zepumph
Copy link
Member Author

zepumph commented Dec 7, 2022

I think that phetsims/perennial#291 could be helpful here, or at least something to note.

@zepumph zepumph removed their assignment Mar 3, 2023
@zepumph zepumph self-assigned this Apr 27, 2023
@zepumph
Copy link
Member Author

zepumph commented Apr 27, 2023

We are running into this over in https://github.com/phetsims/special-ops/issues/234

@zepumph
Copy link
Member Author

zepumph commented Apr 28, 2023

Most likely our sparky case will be solved by #172.

@zepumph
Copy link
Member Author

zepumph commented May 1, 2023

Ok. After some discussion with @mattpen, The above commit should handle the orphaned chrome instances that are much more rare from the build process. I'm ready to close this issue, but I'll reopen if we need to track down any more orphaned chrome instances.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants