Failed TaskRuns should still report results #3439

Closed
ekupershlak opened this issue Oct 22, 2020 · 19 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature.
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@ekupershlak

Expected Behavior

Results produced by failed steps should still be included in the task results.

Actual Behavior

Results are present in the "message" field of the step, but not in the task results.

Steps to Reproduce the Problem

  1. Apply the following to a cluster with Tekton installed:
apiVersion: tekton.dev/v1beta1
kind: ClusterTask
metadata:
  name: result-task
spec:
  results:
  - name: sampleResult
    description: Result
  steps:
  - name: write-result
    image: gcr.io/google-containers/busybox:1.27
    script: |
      #!/bin/sh
      /bin/echo 'Result expected' > $(results.sampleResult.path)
      exit 1

---
apiVersion: tekton.dev/v1beta1
kind: TaskRun
metadata:
  name: result-task-run
spec:
  taskRef:
    kind: ClusterTask
    name: result-task
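
  2. Observe that the TaskRun fails and the result is absent from the task results, even though the step's termination message still contains it. A minimal check (the jsonpath expressions assume the v1beta1 TaskRun status layout):

# Task-level results come back empty for the failed run:
kubectl get taskrun result-task-run -o jsonpath='{.status.taskResults}'

# ...but the written value survives in the step's termination message:
kubectl get taskrun result-task-run -o jsonpath='{.status.steps[0].terminated.message}'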

Additional Info

  • Kubernetes version:
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.3", GitCommit:"1e11e4a2108024935ecfcb2912226cedeafd99df", GitTreeState:"clean", BuildDate:"2020-10-14T12:50:19Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"16+", GitVersion:"v1.16.13-gke.401", GitCommit:"eb94c181eea5290e9da1238db02cfef263542f5f", GitTreeState:"clean", BuildDate:"2020-09-09T00:57:35Z", GoVersion:"go1.13.9b4", Compiler:"gc", Platform:"linux/amd64"}
  • Tekton Pipeline version:
    v0.17.1 (I first ran into this in v0.16.3)
@ekupershlak ekupershlak added the kind/bug Categorizes issue or PR as related to a bug. label Oct 22, 2020
@vdemeester
Member

@ghost

ghost commented Oct 23, 2020

Results produced by failed steps should still be included in the task results.

Could you go into a bit more detail as to why they should be included? What's your use case? A failed Task may have failed because the data it generated was invalid.

@ekupershlak
Author

I'm using Tekton for execution, though the user-facing abstraction has no notion of Tekton concepts. When execution fails, I need to provide some details about this failure.

I would like to store details about the failure in a result and then interrogate it when retrieving the PipelineRun.

The only alternative I can think of is to access the logs directly, but this is undesirable. Having the user do it breaks encapsulation, and doing it programmatically requires access to the underlying Pods.

@pritidesai
Member

Yup, I agree; it would be great to provide a way to communicate failure logs, which are otherwise limited to the pod the task is running in.

We have been having a lot of discussion around task results in finally and task results from failed tasks; please feel free to leave a comment and voice your opinion:

tektoncd/community#240
https://docs.google.com/document/d/1tV1LgPOINnmlDV-oSNdLB39IlLcQRGaYAxYZjVwVWcs/edit#heading=h.1huoxojutir3

@ghost

ghost commented Oct 28, 2020

I dunno. I feel like we've jumped the gun a bit. I definitely see the value in returning some information about the failure of a Task, but I'm not convinced that a Task Result is the only way we could tackle this. Here's an alternative: we add a new field to a TaskRun's status (e.g. taskRun.status.failureReason) that gets populated from a /tekton/failure-reason file that a Step can write. Then a PipelineRun is updated with that TaskRun status, including the failure reason. The benefit would be that the mechanism is consistent across every TaskRun: each Task doesn't have to invent its own "failure result" that also has to be wired in to every Pipeline Result.
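
To sketch the shape of that contract (entirely hypothetical: neither the /tekton/failure-reason file nor the failureReason field exists today, and run-build.sh is just a stand-in for real work):

steps:
- name: build
  image: gcr.io/google-containers/busybox:1.27
  script: |
    #!/bin/sh
    if ! ./run-build.sh; then
      # Hypothetical well-known file the entrypoint would pick up:
      echo 'build failed: dependency fetch timed out' > /tekton/failure-reason
      exit 1
    fi

# The controller would then copy it into the run's status, e.g.:
# status:
#   failureReason: 'build failed: dependency fetch timed out'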

Task Results already have a defined behaviour when Tasks fail - they don't populate in TaskRun Status. If we modify that contract I kinda feel like we're making a backwards-incompatible change. So I'm going to re-classify this as a feature request rather than a bug since Task Results aren't behaving incorrectly here - their behaviour matches the expectations we set for them when they were designed. Happy to discuss further but I think this needs to be considered a bit carefully before moving ahead on it.

/kind feature
/remove-kind bug

@tekton-robot tekton-robot added kind/feature Categorizes issue or PR as related to a new feature. and removed kind/bug Categorizes issue or PR as related to a bug. labels Oct 28, 2020
@ghost ghost deleted a comment from tekton-robot Oct 28, 2020
@ekupershlak
Author

Having a consistent way to populate a failure reason would be great.

@pritidesai
Member

@ekupershlak we have stdout and stderr for steps being implemented, see TEP and PoC. Would this work for your use case?

See issue #2925

@ekupershlak
Author

@pritidesai if I understand correctly, that proposal simplifies capturing stdout/stderr of a step. This is useful, but doesn't provide a way to obtain this data.

I tried to use a result because it's accessible from a TaskRun without any additional management (i.e. creating buckets or other similar stores).

@vincent-pli
Member

vincent-pli commented Dec 1, 2020

@ekupershlak
Your case is exactly the same as mine.
I need to deliver something like an error source, an error message, and some customized data to the outside world when a PipelineRun fails. I think TaskRunResult is a good candidate: just watch the PipelineRun, parse the TaskRunResult, and then you know the root cause and can decide what to do next.

@pritidesai
I think consuming results in finally can help in my case: I could launch my custom task in finally (since finally does not support task chains) with the result as a param, with no need to retrieve the PipelineRun object, but it seems the PR is still not merged.

So, is there any reason we cannot let a failed task report its results? I don't know the previous discussion about that; any clarification? @sbwsg

@ghost

ghost commented Dec 1, 2020

We're essentially describing Results that are only optionally populated (i.e. they only populate when there's an error). Elsewhere we are proposing that a Task which doesn't emit a declared result should be an automatic failure: #3497. We've proposed this because users should be able to depend on Task Results so that other Tasks in a Pipeline can use them reliably. A missing Result should be considered a bug in a successful Task.

So, let's look at an example Task with a failure result:

spec:
  results:
  - name: errorMessage
    description: If this task fails, it will return an error message in this result.

When this Task succeeds it must return a value for errorMessage if we implement #3497. What's the correct value of errorMessage when the Task succeeds? Empty string? "No errors" message? An exit code "0"? And then, thinking about this further, could a UI or other tool make use of that failure information and display it in some way to help users? How does the UI/tool discover the correct Result to read the error information from?

So I guess I'm still not that convinced that we need to muddy the contract around Results to support this. I'd personally prefer to see Results only returned on success and a different field returned on failure. See my earlier comment #3439 (comment) for a suggestion of a different possible way to tackle the same problem.

@bobcatfish
Collaborator

Great discussion!!! Thanks @ekupershlak and @vincent-pli for going into detail about your use cases, this really helps.

I'm thinking 2 things:

  1. Maybe we could nail down the problem statement a bit, e.g. in the form of a TEP? (https://github.com/tektoncd/community/tree/master/teps) The problem seems to be about understanding why a Task failed, and we've generated a couple of alternatives in this issue already, e.g. leveraging Task results and also letting a Task emit a reason to provide more context. We've been trying to use an approach with TEPs where we nail down the problem statement before getting too deep into the solution.
  2. I actually want to double check that using logs isn't the best solution here? (I think a very clear problem statement could help explain why using logs doesn't accomplish what you need as well) @ekupershlak you said:

Having the user do it breaks encapsulation, and doing it programmatically requires access to the underlying Pods.

Is it possible that the user facing abstraction you are creating could access these logs? If the main problem is actually that it's difficult for you to fetch the logs, maybe we should tackle that directly.

My thinking is that even if we DO add a failure reason or allow Tasks to emit results even when they fail, you're still going to need to write ALL of your Tasks in such a way that they are aware of their own failure and can introspect. That's going to be difficult to do since you'd need to handle the failure of ANY part of your Task.

@ekupershlak
Author

Is it possible that the user facing abstraction you are creating could access these logs? If the main problem is actually that it's difficult for you to fetch the logs, maybe we should tackle that directly.

@bobcatfish This sums up the bulk of the issue: I'm building an operator that, as part of reconciliation, manages Tekton tasks. I could definitely add logic to read logs, but I was trying to avoid the work (and needing to add the corresponding permissions).

I do think that there is room for error details to be distinct from logs. Tools can have noisy outputs (e.g. prompts for surveys), and STDERR is generally more valuable. Additionally, error detection in a script (e.g. checking if an output file is empty) could use echo, but writing directly to an error log might be preferable to produce actionable output.
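
For instance, a sketch of that kind of in-script error detection (some-noisy-tool and the errorDetails result are illustrative, not part of any real Task):

#!/bin/sh
some-noisy-tool > out/report.json 2> err.log
if [ ! -s out/report.json ]; then
  # Write an actionable message rather than relying on log scraping:
  echo 'tool produced an empty report; see err.log' > $(results.errorDetails.path)
  exit 1
fi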

@pritidesai
Member

@ekupershlak from the pipelineRun perspective, how are you capturing the task result (in your case, failures)? In a pipeline result?

The reason I am asking is that we have a feature being designed to access the pipelineRun metadata as a JSON payload from any other task. In your use case, you could access this in a finally task. Does this work for you?

I agree with @sbwsg, results are not the right fit for exposing failures.

@ekupershlak
Author

@pritidesai I configured the tasks to emit results, but didn't specify any pipeline results. To get the value, I'm parsing pipelineRun.Status.TaskRuns[].Status.Steps[].Terminated.Message.
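
For illustration, the same extraction can be done from the command line (a sketch; it assumes jq is installed and that each termination message is the JSON array of key/value entries that the controller writes):

kubectl get pipelinerun <pipelinerun-name> -o json | jq -r '
  .status.taskRuns[]?.status.steps[]?
  | .terminated.message? // empty
  | fromjson? // empty
  | .[] | "\(.key)=\(.value)"'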

Regarding a finally task, is the idea that it would access error details, and use them to produce a result? If so, I think that would work.

For what it's worth, I don't feel strongly about how error details are surfaced as long as there's a way.

@vincent-pli
Member

@bobcatfish thanks for your summary, and thanks @sbwsg for the clarification.

One quick question about #3497: I guess I like the current behavior better, where a task declares a Result but doesn't emit it. The Pipeline will fail if another task in that Pipeline depends on the Result; if there is no dependency, nothing happens.

I mean a task's Result could be used by other tasks in the same pipeline, and it could also be used by outside applications, e.g. a custom task controller or a monitoring/audit application.

From the task's perspective, I guess a Result is the most straightforward way to deliver customized information to the outside world (correct me if I'm missing something), no matter whether it succeeded or failed.

Actually, I have worked around the issue by parsing pipelineRun.Status.TaskRuns[].Status.Steps[].Terminated.Message; the result is always written there :)

@tekton-robot
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 2, 2021
@tekton-robot
Collaborator

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 1, 2021
@tekton-robot
Collaborator

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen with a justification.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/close

Send feedback to tektoncd/plumbing.

@tekton-robot
Collaborator

@tekton-robot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen with a justification.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/close

Send feedback to tektoncd/plumbing.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
