Failed TaskRuns should still report results #3439

Closed
ekupershlak opened this issue Oct 22, 2020 · 19 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature.
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@ekupershlak

Expected Behavior

Results produced by failed steps should still be included in the task results.

Actual Behavior

Results are present in the "message" field of the step, but not in the task results.

Steps to Reproduce the Problem

  1. Apply the following to a cluster with Tekton installed:
apiVersion: tekton.dev/v1beta1
kind: ClusterTask
metadata:
  name: result-task
spec:
  results:
  - name: sampleResult
    description: Result
  steps:
  - name: write-result
    image: gcr.io/google-containers/busybox:1.27
    script: |
      #!/bin/sh
      /bin/echo 'Result expected' > $(results.sampleResult.path)
      exit 1

---
apiVersion: tekton.dev/v1beta1
kind: TaskRun
metadata:
  name: result-task-run
spec:
  taskRef:
    kind: ClusterTask
    name: result-task
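
  2. Observe that the TaskRun fails and the result is absent from the task results, even though the step's termination message still contains it. A minimal check (the jsonpath expressions assume the v1beta1 TaskRun status layout):

# Task-level results come back empty for the failed run:
kubectl get taskrun result-task-run -o jsonpath='{.status.taskResults}'

# ...but the written value survives in the step's termination message:
kubectl get taskrun result-task-run -o jsonpath='{.status.steps[0].terminated.message}'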

Additional Info

  • Kubernetes version:
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.3", GitCommit:"1e11e4a2108024935ecfcb2912226cedeafd99df", GitTreeState:"clean", BuildDate:"2020-10-14T12:50:19Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"16+", GitVersion:"v1.16.13-gke.401", GitCommit:"eb94c181eea5290e9da1238db02cfef263542f5f", GitTreeState:"clean", BuildDate:"2020-09-09T00:57:35Z", GoVersion:"go1.13.9b4", Compiler:"gc", Platform:"linux/amd64"}
  • Tekton Pipeline version:
    v0.17.1 (I first ran into this in v0.16.3)
@ekupershlak ekupershlak added the kind/bug Categorizes issue or PR as related to a bug. label Oct 22, 2020
@vdemeester
Member

@ghost

ghost commented Oct 23, 2020

Results produced by failed steps should still be included in the task results.

Could you go into a bit more detail as to why they should be included? What's your use case? A failed Task may have failed because the data it generated was invalid.

@ekupershlak
Author

I'm using Tekton for execution, though the user-facing abstraction has no notion of Tekton concepts. When execution fails, I need to provide some details about this failure.

I would like to store details about the failure in a result and then interrogate it when retrieving the PipelineRun.

The only alternative I can think of is to access the logs directly, but this is undesirable. Having the user do it breaks encapsulation, and doing it programmatically requires access to the underlying Pods.

@pritidesai
Member

Yup, I agree; it would be great to provide a way to communicate failure logs, which are otherwise limited to the pod the task is running in.

We have been having a lot of discussion around task results in finally and task results from failed tasks; please feel free to leave a comment and voice your opinion:

tektoncd/community#240
https://docs.google.com/document/d/1tV1LgPOINnmlDV-oSNdLB39IlLcQRGaYAxYZjVwVWcs/edit#heading=h.1huoxojutir3

@ghost

ghost commented Oct 28, 2020

I dunno. I feel like we've jumped the gun a bit. I definitely see the value in returning some information about the failure of a Task, but I'm not convinced that a Task Result is the only way we could tackle this. Here's an alternative: we add a new field to a TaskRun's status (e.g. taskRun.status.failureReason) that gets populated from a /tekton/failure-reason file that a Step can write. Then a PipelineRun is updated with that TaskRun status, including the failure reason. The benefit would be that the mechanism is consistent across every TaskRun: each Task doesn't have to invent its own "failure result" that also has to be wired in to every Pipeline Result.
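
To sketch the shape of that contract (entirely hypothetical: neither the /tekton/failure-reason file nor the failureReason field exists today, and run-build.sh is just a stand-in for real work):

steps:
- name: build
  image: gcr.io/google-containers/busybox:1.27
  script: |
    #!/bin/sh
    if ! ./run-build.sh; then
      # Hypothetical well-known file the entrypoint would pick up:
      echo 'build failed: dependency fetch timed out' > /tekton/failure-reason
      exit 1
    fi

# The controller would then copy it into the run's status, e.g.:
# status:
#   failureReason: 'build failed: dependency fetch timed out'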

Task Results already have a defined behaviour when Tasks fail - they don't populate in TaskRun Status. If we modify that contract I kinda feel like we're making a backwards-incompatible change. So I'm going to re-classify this as a feature request rather than a bug since Task Results aren't behaving incorrectly here - their behaviour matches the expectations we set for them when they were designed. Happy to discuss further but I think this needs to be considered a bit carefully before moving ahead on it.

/kind feature
/remove-kind bug

@tekton-robot tekton-robot added kind/feature Categorizes issue or PR as related to a new feature. and removed kind/bug Categorizes issue or PR as related to a bug. labels Oct 28, 2020
@ghost ghost deleted a comment from tekton-robot Oct 28, 2020
@ekupershlak
Author

Having a consistent way to populate a failure reason would be great.

@pritidesai
Member

@ekupershlak we have stdout and stderr for steps being implemented, see TEP and PoC. Would this work for your use case?

See issue #2925

@ekupershlak
Author

@pritidesai if I understand correctly, that proposal simplifies capturing stdout/stderr of a step. This is useful, but doesn't provide a way to obtain this data.

I tried to use a result because it's accessible from a TaskRun without any additional management (i.e. creating buckets or other similar stores).

@vincent-pli
Member

vincent-pli commented Dec 1, 2020

@ekupershlak
Your case is exactly the same as mine.
I need to deliver something like an error source, an error message, and some customized data to the outside world when a PipelineRun fails. I think TaskRunResult is a good candidate: just watch the PipelineRun, parse the TaskRunResult, and then you know the root cause and can decide what to do next.

@pritidesai
I think consuming results in finally can help in my case: I could launch my custom task in finally (since finally does not support task chains) with the result as a param, with no need to retrieve the PipelineRun object, but it seems the PR is still not merged.

So, is there any reason we cannot let a failed task report its results? I don't know the previous discussion about that; any clarification? @sbwsg

@ghost

ghost commented Dec 1, 2020

We're essentially describing Results that are only optionally populated (i.e. they only populate when there's an error). Elsewhere we are proposing that a Task which doesn't emit a declared result should be an automatic failure: #3497. We've proposed this because users should be able to depend on Task Results so that other Tasks in a Pipeline can use them reliably. A missing Result should be considered a bug in a successful Task.

So, let's look at an example Task with a failure result:

spec:
  results:
  - name: errorMessage
    description: If this task fails, it will return an error message in this result.

When this Task succeeds it must return a value for errorMessage if we implement #3497. What's the correct value of errorMessage when the Task succeeds? Empty string? "No errors" message? An exit code "0"? And then, thinking about this further, could a UI or other tool make use of that failure information and display it in some way to help users? How does the UI/tool discover the correct Result to read the error information from?

So I guess I'm still not that convinced that we need to muddy the contract around Results to support this. I'd personally prefer to see Results only returned on success and a different field returned on failure. See my earlier comment #3439 (comment) for a suggestion of a different possible way to tackle the same problem.

@bobcatfish
Collaborator

Great discussion!!! Thanks @ekupershlak and @vincent-pli for going into detail about your use cases, this really helps.

I'm thinking 2 things:

  1. Maybe we could nail down the problem statement a bit, e.g. in the form of a TEP? (https://github.com/tektoncd/community/tree/master/teps) The problem seems to be about understanding why a Task failed, and we've generated a couple of alternatives in this issue already, e.g. leveraging Task results and also letting a Task emit a reason to provide more context. We've been trying to use an approach with TEPs where we nail down the problem statement before getting too deep into the solution.
  2. I actually want to double check that using logs isn't the best solution here? (I think a very clear problem statement could help explain why using logs doesn't accomplish what you need as well) @ekupershlak you said:

Having the user do it breaks encapsulation, and doing it programmatically requires access to the underlying Pods.

Is it possible that the user facing abstraction you are creating could access these logs? If the main problem is actually that it's difficult for you to fetch the logs, maybe we should tackle that directly.

My thinking is that even if we DO add a failure reason or allow Tasks to emit results even when they fail, you're still going to need to write ALL of your Tasks in such a way that they are aware of their own failure and can introspect. That's going to be difficult to do since you'd need to handle the failure of ANY part of your Task.

@ekupershlak
Author

Is it possible that the user facing abstraction you are creating could access these logs? If the main problem is actually that it's difficult for you to fetch the logs, maybe we should tackle that directly.

@bobcatfish This sums up the bulk of the issue: I'm building an operator that, as part of reconciliation, manages Tekton tasks. I could definitely add logic to read logs, but I was trying to avoid the work (and needing to add the corresponding permissions).

I do think that there is room for error details to be distinct from logs. Tools can have noisy outputs (e.g. prompts for surveys), and STDERR is generally more valuable. Additionally, error detection in a script (e.g. checking if an output file is empty) could use echo, but writing directly to an error log might be preferable to produce actionable output.
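
For instance, a sketch of that kind of in-script error detection (some-noisy-tool and the errorDetails result are illustrative, not part of any real Task):

#!/bin/sh
some-noisy-tool > out/report.json 2> err.log
if [ ! -s out/report.json ]; then
  # Write an actionable message rather than relying on log scraping:
  echo 'tool produced an empty report; see err.log' > $(results.errorDetails.path)
  exit 1
fi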

@pritidesai
Member

@ekupershlak from the pipelineRun perspective, how are you capturing the task result (in your case, failures)? In a pipeline result?

The reason I am asking is that we have a feature being designed to access the pipelineRun metadata as a JSON payload from any other task. In your use case, you could access this in a finally task. Does this work for you?

I agree with @sbwsg, results are not the right fit for exposing failures.

@ekupershlak
Author

@pritidesai I configured the tasks to emit results, but didn't specify any pipeline results. To get the value, I'm parsing pipelineRun.Status.TaskRuns[].Status.Steps[].Terminated.Message.
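
For illustration, the same extraction can be done from the command line (a sketch; it assumes jq is installed and that each termination message is the JSON array of key/value entries that the controller writes):

kubectl get pipelinerun <pipelinerun-name> -o json | jq -r '
  .status.taskRuns[]?.status.steps[]?
  | .terminated.message? // empty
  | fromjson? // empty
  | .[] | "\(.key)=\(.value)"'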

Regarding a finally task, is the idea that it would access error details, and use them to produce a result? If so, I think that would work.

For what it's worth, I don't feel strongly about how error details are surfaced as long as there's a way.

@vincent-pli
Member

@bobcatfish thanks for your summary, and thanks @sbwsg for the clarification.

One quick question about #3497: I guess I like the current behavior better, where a task declares a Result but doesn't emit it. The Pipeline will fail if another task in that Pipeline depends on the Result; if there is no dependency, nothing happens.

I mean a task's Result could be used by other tasks in the same pipeline, and it could also be used by outside applications, e.g. a custom task controller or a monitoring/audit application.

From the task's perspective, I guess a Result is the most straightforward way to deliver customized information to the outside world (correct me if I'm missing something), no matter whether it succeeded or failed.

Actually, I have worked around the issue by parsing pipelineRun.Status.TaskRuns[].Status.Steps[].Terminated.Message; the result is always written there :)

@tekton-robot
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 2, 2021
@tekton-robot
Collaborator

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 1, 2021
@tekton-robot
Collaborator

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen with a justification.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/close

Send feedback to tektoncd/plumbing.

@tekton-robot
Collaborator

@tekton-robot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen with a justification.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/close

Send feedback to tektoncd/plumbing.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
