Unify and standardize behavior and exit codes when k6 is stopped #2804

Open
na-- opened this issue Dec 6, 2022 · 9 comments

na-- commented Dec 6, 2022

k6 has multiple different ways to stop a test:

  1. Single Ctrl+C for semi-graceful shutdown
  2. Second Ctrl+C for immediate abort
  3. The test.abort() JS API from k6/execution
  4. JS script exception
  5. Via the REST API, e.g. with curl -i -X PATCH -d '{"data":{"type":"status","id":"default","attributes":{"stopped":true}}}' 'http://localhost:6565/v1/status'
  6. A threshold with abortOnFail: true
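
Ways 3, 4, and 6 can all be triggered from the script itself. A minimal sketch of what that looks like (the URL and the threshold value are just placeholders):

  import http from 'k6/http';
  import exec from 'k6/execution';

  export const options = {
    thresholds: {
      // 6. a threshold with abortOnFail: true stops the whole test run when it fails
      http_req_failed: [{ threshold: 'rate<0.01', abortOnFail: true }],
    },
  };

  export default function () {
    const res = http.get('https://test.k6.io'); // placeholder URL
    if (res.status >= 500) {
      // 3. the test.abort() JS API from k6/execution
      exec.test.abort('server is returning 5xx responses');
    }
    // 4. any uncaught exception; note that during normal execution it only
    // interrupts the current iteration, not the whole test (see below)
    if (res.status === 0) {
      throw new Error('request did not complete');
    }
  }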

And, for reference, we currently have these exit codes predefined:

CloudTestRunFailed ExitCode = 97 // This used to be 99 before k6 v0.33.0
CloudFailedToGetProgress ExitCode = 98
ThresholdsHaveFailed ExitCode = 99
SetupTimeout ExitCode = 100
TeardownTimeout ExitCode = 101
GenericTimeout ExitCode = 102 // TODO: remove?
GenericEngine ExitCode = 103
InvalidConfig ExitCode = 104
ExternalAbort ExitCode = 105
CannotStartRESTAPI ExitCode = 106
ScriptException ExitCode = 107
ScriptAborted ExitCode = 108

The problem is that stopping the test in different ways and at different times behaves very inconsistently and unpredictably, even when you account for the unavoidable peculiarities of k6... 😭 Here is what I've determined:

  • While k6 is initializing for the first time (to get the exported options):
    • ❎ 1. a single Ctrl+C will not be graceful at all; it will behave like 2. and will also cause the process to abort with an exit code of 1, which is apparently the default Go interrupt signal handling behavior when the signal has not yet been handled by k6
    • ❎ 2. will not even be reached (see above)
    • ✔️ 3. test.abort() will surprisingly work as expected and cause k6 to exit with ScriptAborted
    • ✔️ 4. script exceptions work as expected, k6 exits with ScriptException
    • 🔲 5. doesn't apply, since the REST API has not yet been started at this point (and probably can't be started so early), so it can't be used to stop the test
    • 🔲 6. also doesn't apply, the test hasn't been started
  • While k6 is initializing one of the actual test VUs:
    • ❎ 1. is now handled by the k6 signal handler, but again it won't be graceful and the process will exit with code GenericEngine, the catch-all exit code for when k6 doesn't "know" what caused the failure... 😞
    • ❎ 2. will probably never be reached, since 1. is not graceful, though that might be fixed by Pass and respect a Context when initializing VUs #2800
    • ✅ 3. test.abort() kind of works, k6 exits with ScriptAborted and after Pass and respect a Context when initializing VUs #2800 it should be graceful 🤞
    • ✔️ 4. script exceptions work as expected, k6 exits with ScriptException
    • ❎ 5. The REST API replies with 200 when you try to stop the test with it, and it sets the status to stopped; however, the VU initialization won't actually stop, even after Pass and respect a Context when initializing VUs #2800 😞 After VU initialization finishes, setup() is executed (🤦), then no iterations are executed, then teardown() is executed and, finally, k6 exits with a 0 exit code
    • ✅ 6. this kind of works, assuming the user has set correct thresholds that will fail if no iterations were executed... the thresholds will be evaluated, but if all of them were http_req_duration: ['p(99)<100'], they will happily pass with no iterations 😞
  • During setup() execution:
    • ❎ same as VU init ⬆️, 1. is now handled by the k6 signal handler, but again it won't be graceful and the process will exit with code GenericEngine, the catch-all exit code for when k6 doesn't "know" what caused the failure... 😞
    • ❎ same as VU init ⬆️, 2. will not be reached since 1. is not graceful
    • ✔️ 3. test.abort() works as expected and exits with ScriptAborted
    • ✔️ 4. script exceptions work as expected, k6 exits with ScriptException and handleSummary() is even executed (after Wait for metrics handling to finish even when there is an error #2798)
    • ✅ 5. REST API stop seems to work almost as expected, though it exits with a GenericEngine code and needs more tests
    • ✅ 6. Thresholds evaluation appears to not be running during setup(), which is probably the correct behavior, though it needs to be evaluated and tested 🤔
  • During normal test run execution
    • ❓ 1. Ctrl+C gracefully stops the execution, though k6 exits with a 0 code, which is probably not what we want? 🤔
    • ✔️ 2. Second Ctrl+C works as expected and exits with ExternalAbort
    • ✔️ 3. test.abort() works as expected and exits with ScriptAborted
    • ✅ 4. quite intentionally, script exceptions interrupt only the current iteration, but not the whole test; users need to set thresholds or try/catch and test.abort() if they want an exception to abort the whole test (see the sketch after this list), though there are certainly some UX improvements we can make by default (e.g. Emit an errors metric and have a default error rate script abort threshold #877)
    • ❔ 5. The REST API stop works as expected, it gracefully stops the execution, though again k6 will exit with a 0 code; there is a better argument to be made here that this is the correct behavior, compared to Ctrl+C, though again I think it needs to have its own dedicated non-zero exit code, for consistency 🤔
    • ✔️ 6. Threshold aborting works as expected and exits with code ThresholdsHaveFailed
  • During teardown() execution:
    • ❓ 1. Single Ctrl+C doesn't actually stop the teardown() execution which, IIRC, was an intentional decision... and it's probably the correct decision - if the user interrupted the execution mid-test (i.e. section ⬆️), and setup() had already executed, we probably want to make sure we run teardown() too 🤔 however, the exit code probably shouldn't be 0
    • ✔️ 2. And a second Ctrl+C works as expected and exits with ExternalAbort, so it's not a big deal that the first Ctrl+C waits for teardown to finish.
    • ✔️ 3. test.abort() works as expected and exits with ScriptAborted
    • ✔️ 4. script exceptions work as expected, k6 exits with ScriptException and handleSummary() is even executed (after Wait for metrics handling to finish even when there is an error #2798)
    • ❓ 5. The REST API's stop doesn't stop the teardown() execution, which makes some sense for similar reasons to Ctrl+C not stopping it; however, the exit code should probably also not be 0
    • ✅ 6. Thresholds are running and appear (from logs) to fail, and while that doesn't abort the teardown() execution (arguably the correct behavior), the exit code will be ThresholdsHaveFailed in the end, so it seems fine to me
  • During handleSummary() execution
    • ❎ 1. The first Ctrl+C has no effect at all
    • ✔️ 2. Second Ctrl+C works as expected and exits with ExternalAbort
    • ❎ 3. test.abort() aborts the test run and the error is logged, but the exit code is 0... k6 doesn't fall back to the default built-in end-of-test summary, which might be a good idea, but needs evaluation to be sure
    • ❓ 4. Same as test.abort(), a script exception aborts the function and is logged, and k6 even falls back and runs the default built-in end-of-test summary - all of that is completely fine, but the exit code should probably not be 0 🤔
    • ❔ 5. The REST API stop has absolutely no effect, which might arguably be the correct behavior 🤔
    • 🔲 6. doesn't apply, metrics and threshold processing have finished before handleSummary() is called
  • After everything has been executed, but k6 is waiting because of the --linger option
    • ✔️ 1. Ctrl+C aborts the test, and because it is what --linger is waiting for, the exit code is (and should be) 0
    • ✔️ 2. Second Ctrl+C works as expected too - it shouldn't be required, but it will exit immediately with ExternalAbort if there is some sort of a bug in k6's test run finishing logic
    • 🔲 3. test.abort() can't be used at this point, no JS code is running
    • 🔲 4. same for JS exceptions, no JS code is running
    • ❔ 5. The REST API's stop also doesn't do anything, which is probably what we want 🤔 we want to have access to the REST API, to be able to query metrics, but we don't want the method that stops the test to also stop --linger... we should add a separate endpoint (e.g. as a part of Deprecate the REST API/v1 #995) if we want to be able to clear the lingering state
    • 🔲 6. doesn't apply, threshold crunching has long been stopped at this point
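
A minimal sketch of the try/catch + test.abort() pattern mentioned in 4. above (the URL and the error condition are placeholders):

  import http from 'k6/http';
  import exec from 'k6/execution';

  export default function () {
    try {
      const res = http.get('https://test.k6.io'); // placeholder URL
      if (res.status !== 200) {
        throw new Error(`unexpected status: ${res.status}`);
      }
    } catch (err) {
      // without this, the exception would only fail the current iteration
      exec.test.abort(err.message);
    }
  }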

This is connected to #2790 and #1889, but goes way beyond them... All of these behaviors should be standardized before test suites (#1342) can be implemented, for example.

In general, it makes sense for all external and internal test aborts to make k6 exit with a non-zero exit code. We could maybe have different exit codes, depending on the cause, so users can filter out expected ones. Maybe we can even make some of these configurable (#870, #680), but it makes sense to me that the default behavior should be a non-zero exit code when the test was prematurely stopped in any way.

The only way a k6 process should exit with a 0 exit code is if the test finished normally and no thresholds failed.

na-- added the enhancement, ux, refactor, evaluation needed, and breaking change labels on Dec 6, 2022

na-- commented Dec 7, 2022

There is actually a 7th way to stop a test - timeouts 😅

Specifically, both setup() and teardown() have timeout values (60s by default and configurable by the setupTimeout and teardownTimeout options respectively). handleSummary() also has a fixed 2-minute timeout that isn't configurable at the moment.

These timeouts have their own exit and run_status codes and everything, so they are not just a normal script error or something like that.
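
For reference, a sketch of how the two configurable timeouts can be set (the values are arbitrary):

  export const options = {
    setupTimeout: '30s',    // default: 60s
    teardownTimeout: '30s', // default: 60s
  };

  export function setup() { /* must finish within setupTimeout */ }
  export function teardown(data) { /* must finish within teardownTimeout */ }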


na-- commented Jan 19, 2023

It's also worth considering and testing that there is probably a difference between script exceptions (e.g. throw new Error('foo');) and script errors (e.g. wrong JS syntax or things like using await in a non-async function). For example:

  • ❎ script errors cause k6 to exit with a -1 exit code when k6 is initializing for the first time, instead of 107 (ScriptException)
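
For illustration, roughly what the two cases look like (the second one is shown as a comment, since it doesn't even parse):

  // Script exception: valid JS that throws while the module is evaluated during
  // the first init; k6 exits with 107 (ScriptException).
  throw new Error('foo');

  // Script error: invalid JS that fails to parse at all, e.g. await in a
  // non-async function; here k6 currently exits with -1 instead:
  //
  //   export default function () {
  //     await Promise.resolve(1);
  //   }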


na-- commented Jan 23, 2023

@imiric raised a very good suggestion in #2810 (comment) - now that #2810 will add an errext.AbortReason type, which will track the internal k6 test run error, we can stop manually assigning exit codes to errors deep in the codebase 🎉 We should be able to have an AbortReason -> ExitCode mapping the same way that PR adds an AbortReason -> cloudapi.RunStatus mapping!


na-- commented Jan 30, 2023

#2885 was a reminder that k6 run --paused is kind of its own weird in-between state, somewhere after initializing the VUs and before setup() execution... 😞 So many cases... 😭


na-- commented Feb 2, 2023

As #2885 and #2893 have proven, --linger is quite a big complication in the k6 run logic...

In general, if a script error or test.abort() occurs during the VU init phase, --linger should not apply and k6 should exit immediately with the appropriate exit code for whatever aborted the init. But if test.abort() is called during the test run itself, or the test was stopped in some other way besides Ctrl+C (e.g. REST API, thresholds with abortOnFail), --linger means that k6 should not exit immediately after stopping the test.


na-- commented Feb 2, 2023

Another way a k6 test (the 8th? 😱) can be stopped is when a specific output tells it to 😩 See

k6/output/types.go

Lines 69 to 76 in b85d09d

// WithTestRunStop is an output that can stop the Engine mid-test, interrupting
// the whole test run execution if some internal condition occurs, completely
// independently from the thresholds. It requires a callback function which
// expects an error and triggers the Engine to stop.
type WithTestRunStop interface {
    Output
    SetTestRunStopCallback(func(error))
}


na-- commented Feb 3, 2023

Connected to #2804 (comment), the exit code of a k6 run --out cloud test with K6_CLOUD_STOP_ON_ERROR=true that is stopped from the k6 cloud app is currently wrong: it's -1 (i.e. 255) when it should probably be 105 (ExternalAbort).


urugator commented Jan 6, 2024

Would it be possible to pass a custom exit code to test.abort() and also include the message and the code in the summary?

We have tests that are not interested in metrics, but in specific functional failures that occur only under load. Typically we need to know what happened and provide some contextual data, e.g.:

  // assuming: import execution from 'k6/execution'; res is the response of the request under test
  if (res.status === 400 && res.json('error_code') === 666) {
    const context = JSON.stringify({
      reqBody,
      resStatus: res.status,
      resBody: res.body,
    });
    execution.test.abort(`error_code must not be 666: ${context}`);
  }

I would probably like to have something like:

execution.test.abort({
  exitCode: 1666,
  message: `error_code must not be 666`,
  context: whatever, // I assume this has to be serializable
});
/*
The summary would then contain:
type SummaryData = {
  result: {
    outcome: 'passed' | 'aborted' | ...
    exitCode: number,
    context: unknown,
  }
  // ...other props ...
}
Btw summary data type is not provided by types/k6
*/

We also have some post-test scripts that depend on what went wrong, so it's quite useful to reflect this in the exit code, without the need to inspect summary/output.

There are workarounds like counter + threshold[count<1] + check with dynamic name (contextual data serialized in the check name) or looking up the specific check in the output (contextual data in tags), but it's quite a hassle for such a simple thing.
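
For context, a sketch of the counter + threshold + dynamic check name workaround mentioned above (the metric name, URL, and error condition are just examples):

  import http from 'k6/http';
  import { check } from 'k6';
  import { Counter } from 'k6/metrics';

  const functionalFailures = new Counter('functional_failures');

  export const options = {
    thresholds: {
      // abort (and fail) the test run if any functional failure was recorded
      functional_failures: [{ threshold: 'count<1', abortOnFail: true }],
    },
  };

  export default function () {
    const res = http.get('https://test.k6.io'); // placeholder URL
    if (res.status === 400 && res.json('error_code') === 666) {
      functionalFailures.add(1);
      // contextual data serialized into the check name, as described above
      check(res, { [`error_code must not be 666: ${res.body}`]: () => false });
    }
  }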

I noticed discussions about "angry checks" (checks interrupting iteration), maybe there should be a "furious check" (check aborting a test) :D

Please excuse me if this is not a good place to discuss this.

joanlopez (Contributor) commented:

After #3876 and #3923, Cloud runs with thresholds crossed are consistently being reported as ThresholdsHaveFailed, which is what happens with local executions under the same conditions.

Before, those were being reported as CloudTestRunFailed, which wasn't accurate enough, as k6 users didn't know why the test run failed in the cloud (even though the actual reason was known).

Both will be shipped in v0.54 🚀
