Timeout error getting DevTools URL during browser launch #559

Closed
imiric opened this issue Sep 30, 2022 · 7 comments · Fixed by #555
Labels
bug Something isn't working

Comments

@imiric
Contributor

imiric commented Sep 30, 2022

This happens very rarely on current main (db80f94), even on Cloud test runs.

In some cases while running the test script from #510 we see the following event logged:

launching browser: getting DevTools URL: timed out after 30s
	at reflect.methodValueCall (native)
	at file:///tmp/PvwCGT/script.js:28:34(6)
	at native
 executor=constant-vus scenario=default		test_run_id: 141246

The effect is that the iteration fails, which shows up as a pause in the VU's execution while it waits for the default 30s timeout (see related issue #502). There's no reason this particular timeout should be that long, so we can shorten it, but we should also fix the root cause, which seems to be a race condition between the process starting and us attaching the stdout listener to get the DevTools URL. We should also look into a more robust way of getting the URL that doesn't involve parsing stdout.

@imiric added the bug label Sep 30, 2022
@inancgumus
Member

inancgumus commented Sep 30, 2022

Related: #491?

@imiric
Contributor Author

imiric commented Oct 3, 2022

On second thought, I think this issue is a duplicate of #491. The "context deadline exceeded" error there is likely the same as this timeout error, just logged before the improvements we made after v0.4.0.

So I'll close this issue and we can track it in #491, since it also happens on Linux, though apparently much more rarely than on WSL2. If it's easily reproducible on WSL2 then it would help us resolve it more quickly.

@imiric closed this as not planned Oct 3, 2022
@imiric
Contributor Author

imiric commented Oct 4, 2022

I'm reopening this issue, as the root cause is different from #491. While the errors are the same, #491 is caused by incorrectly handling the browser exiting with a non-0 exit code, either in general or due to something specific to Snap. In either case, it's a different root cause than this issue.

While #491 is always reproducible, this issue happens very rarely on some iterations, and we've only seen it in the Cloud. Since we ruled out the chances of a race condition in #563, the only explanation is that some environments do see a >30s delay to actually start the browser process. So it's likely that we can't do anything about it in the code, and may need to address this on the infra side. Let's leave this issue open until we decide.

@inancgumus
Member

inancgumus commented Oct 5, 2022

So it's likely that we can't do anything about it in the code, and may need to address this on the infra side. Let's leave this issue open until we decide.

💡❓

In this case, one (previously discussed) idea is to use Device Farms on AWS, so that we have a farm of pre-launched browsers that we connect to at each test start, avoiding the long time it takes to launch a browser. To do that, we'll also need to work on #17.

If using Device Farms turns out to be unfruitful, we might also want to evaluate/discuss pre-launched browsers without using Device Farms. For example, we can pre-launch the browsers and then connect to them for each test start. Instead of launching a new one for every test, we can connect to one of the available instances.

@imiric
Contributor Author

imiric commented Oct 5, 2022

@inancgumus That architecture might address this issue, but it's still a long ways off, and will not be part of the initial public beta release.

Your second point is about instance reuse, which is a sensitive topic, as we must not allow data from previous test runs to be accessible to subsequent ones that use the same instance. Since we've had issues properly cleaning the data directory (#403, #484), instance reuse is currently disabled in the Cloud beta (i.e. new instances are created for each test run, and are terminated when the run ends). If we're going to reuse browsers between test runs, then it's even more critical that we handle this properly.

What we might want to explore is launching the browser before we start k6, and then connecting to it when the test starts.

There's a problem with this, though: BrowserType.connect() is a JS API, and depends on scripts actually using it. What do we do if the script calls launch() on every iteration, as they do now? Try to force the connection to an existing browser process anyway? Similarly with Browser.close(), do we disregard it and only disconnect? Or do we simply block users from using launch() and close() in the Cloud? 😕

So there are many open questions and issues we need to resolve before we can consider #17 as a solution to this.

What I meant by addressing it on the infra side is just using a different EC2 instance type. Instead of a general purpose one like m5, pick an IO-optimized one that avoids (or at least minimizes the chances of) waiting 30s for the browser process to launch. From my tests, CPU and RAM usage were fine, so it's likely this is storage IO related. I still have to run tests to confirm this, but in any case, it's worth talking to DevOps, as there might be some Linux optimizations we can consider as well.

@inancgumus
Member

inancgumus commented Oct 5, 2022

Thanks, @imiric; these are all good points. It seems like there are a lot of things we need to consider, research, and evaluate.

@imiric
Contributor Author

imiric commented Oct 11, 2022

After #555, we no longer see this issue in the Cloud 🎉 So I'm closing this.

That said, it's very unintuitive that #555 would be related to this issue, as during browser launch the event system hasn't been initialized yet, and no event handlers take part in this process... 😕 But maybe it impacted the launch indirectly? 🤷‍♂️ In any case, we can reopen this if it pops up again.

@imiric closed this as completed Oct 11, 2022
@imiric added this to the v0.6.0 milestone Oct 11, 2022
@imiric linked a pull request Oct 11, 2022 that will close this issue
@inancgumus closed this as not planned Oct 26, 2022
@inancgumus removed this from the v0.6.0 milestone Oct 26, 2022