Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create more detailed error responses when submitting batch jobs #2940

Merged
merged 64 commits into from
Dec 12, 2023

Conversation

richscott
Copy link
Member

NOTE: this PR is a continuation of the work/PR that our summer 2023 MLH fellow/intern Raaj Patel had created. Unfortunately, his internship ended and he has returned to university before that PR (#2623) was fully finished, approved, and merged. The only change to this code since he departed I have made is to fix a conflict in go.mod and go.sum.

Following is his description in the original PR:

What type of PR is this?

This PR adds functionality to JobSubmitResponse when an error has occured. Previously the only thing returned was the error. The changes made in this PR allow the code to do 2 things:

  • Build a collection of errors, if mulitple jobs error for the same return, these changes will return them at the same time to the user. Rather than them having to reattempt submissions over and over again with 1 fix each time.
  • Return more detailed errors to the user by utilizing the JobSubmitResponse Struct

Which issue(s) this PR fixes:

Fixed #1748

Original Description from issue 1748 in JIRA, opened by @d80tb7 :

When we reject a job (for example because it can't be scheduled on any node) the whole grpc call fails.
There is an error message that is returned, but users have said that this doesn't give enough infomation
about what was valid. Instead they ask:

is it possible for these errors to be returned on the Error field on the JobResponseItem that we're given?

@richscott
Copy link
Member Author

Screenshots from Raaj Patel, showing the logging output before and after this change.

Submitted a bad sleep yaml job containing two jobs, both have invalid podSpecs as the requested and limit resources both contain "memory: 0"

before-log-changes
after-log-changes

@richscott richscott requested review from Mo-Fatah and Sharpz7 August 31, 2023 19:43
ClifHouck and others added 22 commits August 31, 2023 14:02
* Update docker stuff for latest airflow 2.7.0

* Use AirflowException instead of AirflowFailException to allow for retries

Signed-off-by: Rich Scott <[email protected]>
* update

* update pulsar client

* Fix bug causing server spinning

* Abstract out the retry until success logic for testing (armadaproject#2901)

* Respond to review

---------

Co-authored-by: Chris Martin <[email protected]>
Co-authored-by: Daniel Rastelli <[email protected]>
Signed-off-by: Rich Scott <[email protected]>
* allow logger to report caller

* allow logger to report caller

* lint

---------

Co-authored-by: Chris Martin <[email protected]>
Signed-off-by: Rich Scott <[email protected]>
…roject#2743)

Snyk has created this PR to upgrade @typescript-eslint/parser from 5.52.0 to 5.61.0.

See this package in npm:

See this project in Snyk:
https://app.snyk.io/org/dave-gantenbein/project/5064983e-fa14-4803-8fc2-cfd6f1fa81b6?utm_source=github&utm_medium=referral&page=upgrade-pr

Co-authored-by: snyk-bot <[email protected]>
Co-authored-by: Adam McArthur <[email protected]>
Co-authored-by: Mohamed Abdelfatah <[email protected]>
Signed-off-by: Rich Scott <[email protected]>
Snyk has created this PR to upgrade @types/react from 16.14.32 to 16.14.43.

See this package in npm:

See this project in Snyk:
https://app.snyk.io/org/dave-gantenbein/project/5064983e-fa14-4803-8fc2-cfd6f1fa81b6?utm_source=github&utm_medium=referral&page=upgrade-pr

Co-authored-by: snyk-bot <[email protected]>
Co-authored-by: Adam McArthur <[email protected]>
Co-authored-by: Mohamed Abdelfatah <[email protected]>
Signed-off-by: Rich Scott <[email protected]>
…aproject#2316)

Bumps [github.com/go-openapi/jsonreference](https://github.com/go-openapi/jsonreference) from 0.20.0 to 0.20.2.
- [Release notes](https://github.com/go-openapi/jsonreference/releases)
- [Commits](go-openapi/jsonreference@v0.20.0...v0.20.2)

---
updated-dependencies:
- dependency-name: github.com/go-openapi/jsonreference
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Adam McArthur <[email protected]>
Co-authored-by: Mohamed Abdelfatah <[email protected]>
Signed-off-by: Rich Scott <[email protected]>
This will ensure the job leased first, gets send to the cluster first

Currently we just order by postgres default sorting - which often picks the most recently leased - causing the first lease jobs to get stuck
 - This only occurs when scheduling is faster than leasing

Signed-off-by: Rich Scott <[email protected]>
…ject#2302)

Bumps [webpack](https://github.com/webpack/webpack) from 5.75.0 to 5.77.0.
- [Release notes](https://github.com/webpack/webpack/releases)
- [Commits](webpack/webpack@v5.75.0...v5.77.0)

---
updated-dependencies:
- dependency-name: webpack
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Adam McArthur <[email protected]>
Co-authored-by: Mohamed Abdelfatah <[email protected]>
Signed-off-by: Rich Scott <[email protected]>
…ject#2806)

Bumps [word-wrap](https://github.com/jonschlinkert/word-wrap) from 1.2.3 to 1.2.5.
- [Release notes](https://github.com/jonschlinkert/word-wrap/releases)
- [Commits](jonschlinkert/word-wrap@1.2.3...1.2.5)

---
updated-dependencies:
- dependency-name: word-wrap
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Adam McArthur <[email protected]>
Co-authored-by: Mohamed Abdelfatah <[email protected]>
Signed-off-by: Rich Scott <[email protected]>
Co-authored-by: Adam McArthur <[email protected]>
Signed-off-by: Rich Scott <[email protected]>
…rmadaproject#2744)

Snyk has created this PR to upgrade @typescript-eslint/eslint-plugin from 5.52.0 to 5.61.0.

See this package in npm:

See this project in Snyk:
https://app.snyk.io/org/dave-gantenbein/project/5064983e-fa14-4803-8fc2-cfd6f1fa81b6?utm_source=github&utm_medium=referral&page=upgrade-pr

Co-authored-by: snyk-bot <[email protected]>
Co-authored-by: Adam McArthur <[email protected]>
Co-authored-by: Mohamed Abdelfatah <[email protected]>
Signed-off-by: Rich Scott <[email protected]>
Snyk has created this PR to upgrade react-router-dom from 6.9.0 to 6.14.1.

See this package in npm:

See this project in Snyk:
https://app.snyk.io/org/dave-gantenbein/project/5064983e-fa14-4803-8fc2-cfd6f1fa81b6?utm_source=github&utm_medium=referral&page=upgrade-pr

Co-authored-by: snyk-bot <[email protected]>
Co-authored-by: Adam McArthur <[email protected]>
Co-authored-by: Mohamed Abdelfatah <[email protected]>
Signed-off-by: Rich Scott <[email protected]>
…t#2661)

Bumps [semver](https://github.com/npm/node-semver) from 6.3.0 to 6.3.1.
- [Release notes](https://github.com/npm/node-semver/releases)
- [Changelog](https://github.com/npm/node-semver/blob/v6.3.1/CHANGELOG.md)
- [Commits](npm/node-semver@v6.3.0...v6.3.1)

---
updated-dependencies:
- dependency-name: semver
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Adam McArthur <[email protected]>
Co-authored-by: Mohamed Abdelfatah <[email protected]>
Signed-off-by: Rich Scott <[email protected]>
* Helm chart update: executor

At the moment the helm chart for the executor doesn't include priorityClass even though one is created in the chart. This means that the executor deployment is unable to set the priorityClass.

Signed-off-by: Rich Scott <[email protected]>
* Bump github.com/go-openapi/strfmt from 0.21.3 to 0.21.7

Bumps [github.com/go-openapi/strfmt](https://github.com/go-openapi/strfmt) from 0.21.3 to 0.21.7.
- [Release notes](https://github.com/go-openapi/strfmt/releases)
- [Commits](go-openapi/strfmt@v0.21.3...v0.21.7)

---
updated-dependencies:
- dependency-name: github.com/go-openapi/strfmt
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>

* Bump github.com/go-openapi/runtime from 0.24.2 to 0.26.0

Bumps [github.com/go-openapi/runtime](https://github.com/go-openapi/runtime) from 0.24.2 to 0.26.0.
- [Release notes](https://github.com/go-openapi/runtime/releases)
- [Commits](go-openapi/runtime@v0.24.2...v0.26.0)

---
updated-dependencies:
- dependency-name: github.com/go-openapi/runtime
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>

* Bump github.com/goreleaser/nfpm/v2 from 2.25.1 to 2.29.0

Bumps [github.com/goreleaser/nfpm/v2](https://github.com/goreleaser/nfpm) from 2.25.1 to 2.29.0.
- [Release notes](https://github.com/goreleaser/nfpm/releases)
- [Changelog](https://github.com/goreleaser/nfpm/blob/main/.goreleaser.yml)
- [Commits](goreleaser/nfpm@v2.25.1...v2.29.0)

---
updated-dependencies:
- dependency-name: github.com/goreleaser/nfpm/v2
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <[email protected]>

* Bump github.com/go-playground/validator/v10 from 10.11.1 to 10.14.1

Bumps [github.com/go-playground/validator/v10](https://github.com/go-playground/validator) from 10.11.1 to 10.14.1.
- [Release notes](https://github.com/go-playground/validator/releases)
- [Commits](go-playground/validator@v10.11.1...v10.14.1)

---
updated-dependencies:
- dependency-name: github.com/go-playground/validator/v10
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>

* Bump Grpc.Net.Client in /client/DotNet/ArmadaProject.Io.Client

Bumps [Grpc.Net.Client](https://github.com/grpc/grpc-dotnet) from 2.47.0 to 2.52.0.
- [Release notes](https://github.com/grpc/grpc-dotnet/releases)
- [Changelog](https://github.com/grpc/grpc-dotnet/blob/master/doc/release_process.md)
- [Commits](grpc/grpc-dotnet@v2.47.0...v2.52.0)

---
updated-dependencies:
- dependency-name: Grpc.Net.Client
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>

* fix: upgrade @mui/material from 5.10.17 to 5.13.6

Snyk has created this PR to upgrade @mui/material from 5.10.17 to 5.13.6.

See this package in npm:

See this project in Snyk:
https://app.snyk.io/org/dave-gantenbein/project/5064983e-fa14-4803-8fc2-cfd6f1fa81b6?utm_source=github&utm_medium=referral&page=upgrade-pr

* fix: upgrade prettier from 2.7.1 to 2.8.8

Snyk has created this PR to upgrade prettier from 2.7.1 to 2.8.8.

See this package in npm:

See this project in Snyk:
https://app.snyk.io/org/dave-gantenbein/project/5064983e-fa14-4803-8fc2-cfd6f1fa81b6?utm_source=github&utm_medium=referral&page=upgrade-pr

* fix: upgrade @mui/icons-material from 5.10.16 to 5.14.3

Snyk has created this PR to upgrade @mui/icons-material from 5.10.16 to 5.14.3.

See this package in npm:

See this project in Snyk:
https://app.snyk.io/org/dave-gantenbein/project/5064983e-fa14-4803-8fc2-cfd6f1fa81b6?utm_source=github&utm_medium=referral&page=upgrade-pr

* fix: upgrade eslint-plugin-import from 2.26.0 to 2.28.0

Snyk has created this PR to upgrade eslint-plugin-import from 2.26.0 to 2.28.0.

See this package in npm:

See this project in Snyk:
https://app.snyk.io/org/dave-gantenbein/project/5064983e-fa14-4803-8fc2-cfd6f1fa81b6?utm_source=github&utm_medium=referral&page=upgrade-pr

* fix: upgrade eslint-config-prettier from 8.5.0 to 8.10.0

Snyk has created this PR to upgrade eslint-config-prettier from 8.5.0 to 8.10.0.

See this package in npm:

See this project in Snyk:
https://app.snyk.io/org/dave-gantenbein/project/5064983e-fa14-4803-8fc2-cfd6f1fa81b6?utm_source=github&utm_medium=referral&page=upgrade-pr

* Trying to update klog

* go mod fix

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: snyk-bot <[email protected]>
Co-authored-by: Mohamed Abdelfatah <[email protected]>
Signed-off-by: Rich Scott <[email protected]>
* Add error message of final job run to JobFailedMessage

When we hit the maximum retry limit, the JobFailedMessage just says something along the lines of
"Job has been retried too many times, giving up"

Now we include the final run error in that message - to make it easier to work out the cause of retries

* Fix bug causing GetJobSetEvents to get stuck

GetJobSetEvents only increments its fromId variable on sending new messages

However now all redis events produce api events that will be sent downstream

The issue here is if we get 500 redis events in a row that don't produce api events, then the fromId never gets updated
 - Meaning the watching gets stuck here

To fix this, ReadEvents now returns a lastMessageId. So if there are no messages to process, the fromId should be updated using the lastMessageId

* Formatting

Signed-off-by: Rich Scott <[email protected]>
…madaproject#2931)

Bumps [@adobe/css-tools](https://github.com/adobe/css-tools) from 4.0.1 to 4.3.1.
- [Changelog](https://github.com/adobe/css-tools/blob/main/History.md)
- [Commits](https://github.com/adobe/css-tools/commits)

---
updated-dependencies:
- dependency-name: "@adobe/css-tools"
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Signed-off-by: Rich Scott <[email protected]>
for i, item := range request.JobRequestItems {
jobId := getUlid()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a change in behaviour in that previously we used to return nil for the id when we rejected jobs but now we return a jobId. I think this is good in that we now have an id that the user can give us to search logs etc, but does have a couple of downsides:

  • Users may get confused if they see jobIds that don't appear anywhere else in the system (e.g. lookout)
  • It's a mildly breaking interface change- e.g. if clients previously had code that did if jobResponse.JobId == nil... this will now behave differently.

I think I'm mildly in favour of this change, but it might be good to get a second opinion- @severinson ?

@@ -253,13 +268,30 @@ func (server *SubmitServer) DeleteQueue(ctx context.Context, request *api.QueueD
func (server *SubmitServer) SubmitJobs(ctx context.Context, req *api.JobSubmitRequest) (*api.JobSubmitResponse, error) {
principal := authorization.GetPrincipal(ctx)

jobs, e := server.createJobs(req, principal.GetName(), principal.GetGroupNames())
jobs, responseItems, e := server.createJobs(req, principal.GetName(), principal.GetGroupNames())
if e != nil {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any limit to the size of the response we can generate here? E.g. if the user submits a batch of 100 jobs and they all fail with some massive error are we sure we're not going to generate a response that exceeds the max grpc response size? Likewise (although admittedly loess likely) could we end up generaling such a massive message that we would OOM.

If it were me I'd try and limit the size of the error message by saying:

  • how may jobs failed validation (out of how may submitted)
  • detailed error messages for the first n jobs that failed. (maybe n=5 or something)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@d80tb7 I'll replace that potentially-huge error with a concise/distilled one, using that logic as you suggest.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@d80tb7 An attempt to implement the logic described above is now in commit 894c265 - let me know what you think.

@Sharpz7 Sharpz7 added the PR merge conflicts This PR has merge conflicts label Sep 20, 2023
Signed-off-by: Rich Scott <[email protected]>
Generate and return more detailed submission and/or validation errors.
If there are numerous jobs with errors, just give the number of failed
jobs (and the total number originally submitted), and truncate the list
of failed jobs errors to just the first 5 (this is defined in a single
constant variable, if neededed to change later).

Signed-off-by: Rich Scott <[email protected]>
@richscott richscott added component/scheduling Armada Server, Scheduler and Scheduler Injester and removed PR merge conflicts This PR has merge conflicts labels Oct 26, 2023
@richscott richscott requested a review from d80tb7 October 26, 2023 04:08
@richscott richscott merged commit 356a5f0 into armadaproject:master Dec 12, 2023
14 checks passed
@richscott richscott deleted the raaj-patel/create_job_error branch December 12, 2023 16:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component/scheduling Armada Server, Scheduler and Scheduler Injester
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Better Error Message When Rejecting Jobs