Basic working implementation of executor component #19
Merged
This makes it much clearer what is being asserted in each test. This library properly handles showing the difference between expected and actual values, saving us time and preventing manual mistakes.
This can be used to submit jobs to Kubernetes, and it adds any "extra" data to the job that this module will use.
Currently that involves adding labels:
- JobId, JobSetId, Queue, ReadyForCleanup
This allows us to store the job data we'll need to refer to on the pod, without having to hold each job in memory for the entirety of its execution.
Maybe we'll change the approach in the future, but this allows us to be more resilient to executor (module) failures, as Kubernetes holds all the state rather than this application.
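A minimal sketch of the labelling idea described above, not the code from this PR. It uses a plain `map[string]string` in place of real Kubernetes pod metadata, and the `JobInfo` type, the exact label key strings, and both function names are assumptions made for illustration. It shows why keeping the job data on the pod means the application does not have to hold it in memory: the data can be rebuilt from the labels alone.

```go
package main

import "fmt"

// Label keys are hypothetical: the description names the labels,
// but not the exact key strings used on the pod.
const (
	labelJobID           = "JobId"
	labelJobSetID        = "JobSetId"
	labelQueue           = "Queue"
	labelReadyForCleanup = "ReadyForCleanup"
)

// JobInfo is an illustrative stand-in for the job data the executor needs later.
type JobInfo struct {
	JobID    string
	JobSetID string
	Queue    string
}

// WithJobLabels returns a copy of existing with the job labels added,
// leaving the caller's map untouched.
func WithJobLabels(existing map[string]string, job JobInfo) map[string]string {
	labels := make(map[string]string, len(existing)+4)
	for k, v := range existing {
		labels[k] = v
	}
	labels[labelJobID] = job.JobID
	labels[labelJobSetID] = job.JobSetID
	labels[labelQueue] = job.Queue
	labels[labelReadyForCleanup] = "false" // assumed initial value
	return labels
}

// JobFromLabels rebuilds the job data from pod labels alone.
// The boolean is false if any of the three job labels is missing.
func JobFromLabels(labels map[string]string) (JobInfo, bool) {
	id, okID := labels[labelJobID]
	set, okSet := labels[labelJobSetID]
	queue, okQueue := labels[labelQueue]
	if !okID || !okSet || !okQueue {
		return JobInfo{}, false
	}
	return JobInfo{JobID: id, JobSetID: set, Queue: queue}, true
}

func main() {
	job := JobInfo{JobID: "job-1", JobSetID: "set-1", Queue: "test-queue"}
	labels := WithJobLabels(map[string]string{"app": "demo"}, job)

	// After a restart, the labels are enough to recover the job data.
	recovered, ok := JobFromLabels(labels)
	fmt.Println(recovered, ok)
}
```

With real pods, the same map would be the pod's label set, so a restarted executor could list its pods and recover every job it was tracking.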
These tasks are the main responsibility of the application.
The reason they are separate is:
- Each can be simpler, as each does just one thing
- They don't really interact with each other, so it is easy to have them run individually
- They can be run with different frequencies, so we don't have to perform all the operations as frequently as the most frequent task (i.e. we can keep the cluster well allocated by checking it every 5 seconds, without having to try to delete all jobs from the cluster every 5 seconds; that can instead happen every 30 seconds or so)
In the future, for optimisation, some of these background tasks may be merged to avoid looking up or calculating similar things in separate threads.
However, for now this keeps it easier to understand and reason about, and I don't want to add complexity without knowing it is needed.
jankaspar approved these changes on Jul 15, 2019
GROpenSourceRO pushed a commit that referenced this pull request on Apr 22, 2022
* Pulsar submit API and adapter prototypes, scheduler spec, updates to build Armada internally (#1) * Moved in events code and added scheduler spec * Updated scheduler * Adapter from log messages to Armada * Comments * Deleted unused file * Copied in missing function principalHasQueuePermissions * Replaced atMostOnce with fragile * Scheduler updates * Scheduler updates * Scheduler updates * Scheduler updates * Store k8s services and ingresses in the api.Job object * Use correct time type * Executor uses bundled k8s services and ingresses if present * Removed unused code * Guard populateServicesIngresses against nil values * Added groups to event sequence and namespace, labels, and annotations to EventSequence * Updated Pulsar SubmitJobs to include groups and namespace, annotations, and labels * Comment * Pass through namespace, labels, and annotations * Added a list of concerns * Refactored log submit authorization * Dockerfile for building .proto internally * make proto * Added armadaerrors.ErrNoPermission * Replaced timestamp with time type * Removed go_package option not needed by gogo * make proto * Use armadaerrors.ErrNoPermission instead of server.ErrNoPermission to break import loop * Replaced assert.Nil -> assert.NoError and assert.NotNil -> assert.Error * Removed "", which caused tests to fail, from auth exec test script * Improved exec authenticator error messages, fixed bug where locks were copied * Fail test immediately on error * Fail test immediately on error to avoid panics * Fail test immediately on error to avoid panics * Fail test immediately on error, improved error messages * Create slices using make (seems to have fixed a test failure) * Import ordering * Added corporate proxy and compilation of events.proto * Replace assert.NotEmpty -> assert.NoError * Fixed erroneous error message * commented out ca-certificates install * added google.golang.org/api * replaced assert.Nil -> assert.NoError * Added gr-tests-e2e make target (#2) * Moved in 
events code and added scheduler spec * Updated scheduler * Adapter from log messages to Armada * Comments * Deleted unused file * Copied in missing function principalHasQueuePermissions * Replaced atMostOnce with fragile * Scheduler updates * Scheduler updates * Scheduler updates * Scheduler updates * Store k8s services and ingresses in the api.Job object * Use correct time type * Executor uses bundled k8s services and ingresses if present * Removed unused code * Guard populateServicesIngresses against nil values * Added groups to event sequence and namespace, labels, and annotations to EventSequence * Updated Pulsar SubmitJobs to include groups and namespace, annotations, and labels * Comment * Pass through namespace, labels, and annotations * Added a list of concerns * Refactored log submit authorization * Dockerfile for building .proto internally * make proto * Added armadaerrors.ErrNoPermission * Replaced timestamp with time type * Removed go_package option not needed by gogo * make proto * Use armadaerrors.ErrNoPermission instead of server.ErrNoPermission to break import loop * Replaced assert.Nil -> assert.NoError and assert.NotNil -> assert.Error * Removed "", which caused tests to fail, from auth exec test script * Improved exec authenticator error messages, fixed bug where locks were copied * Fail test immediately on error * Fail test immediately on error to avoid panics * Fail test immediately on error to avoid panics * Fail test immediately on error, improved error messages * Create slices using make (seems to have fixed a test failure) * Import ordering * Added corporate proxy and compilation of events.proto * Replace assert.NotEmpty -> assert.NoError * Fixed erroneous error message * commented out ca-certificates install * added google.golang.org/api * replaced assert.Nil -> assert.NoError * added gr-tests-e2e target * make e2e tests work inside gr (#3) * make e2e tests work inside gr * rever change for normal e2e * and again * enable docker build * 
Enable e2e tests running in WSL (#4) * Enable e2e tests running in WSL * Submit to pulsar, fall back to existing API for queue admin * Added pulsar-client-go and go-multierror * Spin up Pulsar in e2e tests, load config for Pulsar * Start Pulsar submit API and log processor in Armada * Removed debug messages, comments * Add flag to explicitly enable Pulsar * Added periodic logging to the submit from Pulsar service * Import ordering * Kubernetes object metadata improvements, improved logging (#5) * Improved logging and error handling * Import ordering * Comments, logging * Include any additional podspecs in Pulsar submit jobs message * go mod tidy * Use a separate ObjectMeta for each k8s object in Pulsar' * Merge namespace/annotations/labels at the Pulsar submit API * Support submitting jobs with multiple podspecs * Annotate each incoming gRPC request with a request id * Annotate Pulsar messages with gRPC request id * Annotate per-message logger with gRPC request id attached to the Pulsar message * Import ordering * Preserve ordering within sequences (#6) * Publish job transitions to Pulsar (#7) * Added JobRunFailed reasons * Added logic to covert legacy events to Pulsar events * Publish events to Pulsar in addition to Redis * Added e2e tests that connrect directly to Pulsar * Updated Pulsar message spec (#8) * Comments * comments * Added function to return a request id or missing if none is found * Updated events spec * Updated Pulsar e2e tests * Removed commented-out code * Updated state transition message adapter to reflect changes to the proto * Generate JobSucceeded on JobRunSucceeded, logging * Provide Pulsar producer for SubmitFromLog service * Added utility function to insert error information and stack trace to a logrus.Entry * Removed deprecated code * Import ordering * Removed commented-out code * Removed commented-out code * Added isSequencef that takes a message to be logged on error * Removed commented-out code * Comments, removed debug logging * 
Removed temporary swagger.merged file (#9) * Removed temporary swagger.merged file * Removed temporary swagger.merged file * add pulsar tls config * add pulsar tls config * remove stray files * Separate services for updating Redis/Nats and Pulsar from Pulsar messages (#10) * Pulsar message utilities * Added service for writing to Pulsar based on Pulsar messages * Refactoring, use separate PulsarFromPulsar service * Import ordering * Refactoring * Return an error on invalid pulsar message id comparison * Improved error message * Renamed Pulsar events topic to be more descriptive * Removed commented-out code * add advanced pulsar config * review comments * review comments * more review comments * more review comments * more review comments * Pulsar events spec improvements (#13) * Use uint32 instead of double for priority * Todo comment * Hash queue + job_set_name instead of job_set_name * Added efficient UUID message type * Added conversion between google UUID and proto message UUID * Added converters between proto UUIDs and ULIDs * Import ordering * Use optimised uuid message * Added converters between strings and proto uuids * Function to generate a plain ULID, comments * Use optimised proto UUIDs * Comments * More fine-grained settings for job guarantees * Replace 4294967295 by math.MaxUint32 * Break priority parsing into a separate function * Securely hash queue and jobSetName together * Refactoring * Added lifetime to SubmitJob message * fix chart * move defaults * remove pulsar enabled * test fixes on wsl and windows * End-to-end test improvements and fixed to Pulsar ingress/serviced code (#15) * Pass through GOPROXY/GOPRIVATE from the host for make proto * Removed commented-out code * Open armadactl by relative path, use valid priority * Refactoring, cleanup * Added test submitting several jobs, more rigorous event comparison * Removed test submitting only a single job * Pulsar e2e test cleanup * Added code for getting jobIds from events * Remove GR-specific 
GOPROXY/GOPRIVATE * Remove references to GR from proto build * Todo, whitespace * Test improvements * Use same alpine image as for tests, set limits equal to requests (as required by Armada) * Removed todos * Disallow combining PodSpec and PodSpecs, dissallow PodSpecs * Correctly create services and ingresses in log submit API * Set name of objects to create from the ObjectMeta included with the SubmitJob message * Comments * Added todo * Comments * Todos * Test jobs with services/ingresses * Comments * Use PodSpec instead of PodSpecs * Fail immediately on failure to connect to db * Convert PodSpecs with 1 entry to PodSpec * Avoid panics, check for PodSpec instead of PodSpecs[0] * Import ordering * Pulsar events refactoring (#17) * Remove accelerator logging (#894) This log line gets called for every pod using an accelerator on the cluster, every 5 seconds (configured by queueUsageDataRefreshInterval) This causes massive spam for little to no benefit * Moved events package into pkg * Update reference to events.proto * Updated events package import * Added Pulsar properties to distinguish between control and utilisation messages * Handle legacy job utilisation messages, set message key * Removed queue_job_set_hash * Comments * Added ObjectMeta to main object, comments * Added executor_id to ObjectMeta * Renamed code to exit_code in ApplicationError message * Comments * Comments Co-authored-by: JamesMurkin <[email protected]> * Address comments on PR ARMADA/990 GRPub/armada (#18) * Refer to corporate proxies in general terms * Comments * Removed events.pb.go to simplify PR * Restore swagger files to simplify PR * Comments * Removed proposed scheduler code * Only report jobs done once their state has been reported (#899) Normally the state gets reported instantly so this is already true 99% of the time. 
However if reporting the state goes wrong, we shouldn't report the job as done - Otherwise the server will tell the executor to kill the pod when it tries to maintain the lease In all other places we make sure the JobEvent has been reported first before reporting done, so we should do that here too This will only really impact edge cases and most of the time this will already be true * Enable dotnet and npm build internally (#19) * Added make target for dotnet that works internally * Added dotnet tests to tests-e2e target * Handle end-of-line symbols in a cross-platform manner * Moved dotnet build to separate make target * Get protoc via Maven * Load Maven URL from environment variables * Single Dockerfile for building proto * Cleaned up tests make target * Run unit tests in docker containers * Removed unused armada-test docker network * Run e2e tests and dotnet target in containers * Bump go version to 1.16 for consitency * Optionally run all go commands in containers * Mount GOPROXY and GOPRIVATE into go containers * Comments * Run npm in docker containers * Get go version string correctly from containers * Added missing .SubmitServer * Generate legacy job submitted events when submitting to Pulsar * Have cancel and reprioritise endpoints generate set messages * Always run builds in containers * Moved armadactl tests into e2e * Moved pulsar e2e tests into separate directory * Renamed directory * Updated e2e tests target to reflect new directories * Updated tests-e2e-no-setup target * Commented in tests * Removed npm environment variables values * Commented in tests * Removed commented-out code * ARMADA-1028 Events proto updates (#20) * Moved terminal flag into individual errors * Generate JobErrors instead of JobRunErrors on JobFailed API message * Updated to reflect moving terminal flag in errors * Added JobErrors to JobIdFromEvent * Added config files for local use to .gitignore * Added ReprioritisedJob, CancelledJob, and JobDuplicateDetected * Handle all api 
messages * Populate ObjectMeta info for errors * Include container name with container errors * Create EventId type (#21) * Return a concrete type from new to enable comparison * Added event id type * Create utility for sniffing Pulsar events (#22) * Added program to print events * Write eventsprinter as a cobra app * Take pulsar.Message instead of pulsar.ConsumerMessage * Filter out non-control messages * Fix bug associated with creating nil events * Include CancelledJob in list of messages indicating job failure * Print job ids * Improved testing (#23) * Added missing events to JobIdFromEvent * write test output to disk, convert test output to junit format * Added go-junit-report as dependency * Spin up postgres for e2e tests * Write html test report if possible * Removed problematic -e flag * Test submitting a job with errors, test cancelling jobs * Ignore test_reports * Improve Pulsar consumer retry logic (#24) * Do not return an error on messages requiring no action * Ack messages after processing * Fail test immediately on error * Propagate errors correctly * Add code to detect (possibly nested) network errors * Return immediately on nil in IsNetworkError * Improved error handling and retry logic * Added missing parentheses * Consider context.DeadlineExceeded a network error * Improved failure and retry logic * Removed seek from Pulsar setup * Removed multierror from GetActiveJobIds * Added tests for non-network errors * Only ack on successfully processing a sequence * Set Pulsar message key * keyshared sub (#27) * Merge In changes from public github (#28) * Remove accelerator logging (#894) This log line gets called for every pod using an accelerator on the cluster, every 5 seconds (configured by queueUsageDataRefreshInterval) This causes massive spam for little to no benefit * Only report jobs done once their state has been reported (#899) Normally the state gets reported instantly so this is already true 99% of the time. 
However if reporting the state goes wrong, we shouldn't report the job as done - Otherwise the server will tell the executor to kill the pod when it tries to maintain the lease In all other places we make sure the JobEvent has been reported first before reporting done, so we should do that here too This will only really impact edge cases and most of the time this will already be true Co-authored-by: JamesMurkin <[email protected]> * fix infinite loop (#29) * lowe case returned jobIds (#30) * ARMADA-995: Ingester from pulsar-> lookout database (#26) * initial impl of lookout ingester * added pod/container error * fixes after testing * goimports * doc * doc * fixes * remove unneeded code * add docker file- move tests * Fix event sequence number update (#31) * Enable CI builds (#25) * Added Jenkinsfile * Fix syntax errors * Changed var to def * Removed need to clone armada-ci * Removed armada-ci dependency * Use writeFile instead of echo * Removed -it * Get version correctly * Added debug printouts * Printouts * Set PWD to one valid on the host * Set PWD in sh * Set PWD consistently * Make proto * Exit on proto no matching * Added goimports as dependency * Import ordering * Check error code * Check error * Added code checks make target * Go mod tidy * Aded code checks stage * Updated line endings * Escape $ * Added download make target * Set GOPATH such that it is available on the host * Removed unnecessary mkdir * Added build tag to avoid compiling tools into binary * Automatically install tools * Run goimports for pkg/events * Whitespace * Fixed line endings * Added templify to list of tools * Comment out proto stage during development * Renamed junit_report to junit-report for consistency * Run go-junit-report in docker * Require kind, run kubectl in docker * Do not check for kind, to allow downloading it using make download * Add kind to list of tools * Fixed import for go-junit-report * Added tests-teardown target, avoid returning error code from 
tests-e2e-teardown * Run teardown targets before tests * Removed unnecessary cleanup stage * Set shorter OperationTimeout, bundle stack trace with errors * Commented out tests failing due to Docker setup * Added gox to list of tools * Run gox in docker * Comments * Whitespace * Whitespace * Whitespace * Commented out tests during development * Commented in all stages * Enabled junit plugin * Removed junit2html * Updated paths * Updated PWD * Removed -set-exit-code from go-junit-report * For Pulsar, only expose IPv4 to fix issue with Pulsar client library * Added go-swagger and grpc-gateway to tools * Move proto post-processing from proto.sh into makefile for performance * Fix go-imports * Comment in Pulsar tests * Update Jenkinsfile * Fix ineffassign * Fix typo NoError -> Error * Only build if tests succeed * Delete Jenkinsfile * Import ordering * Timeout retrying processing an event sequence (#32) * Add dependent targets to tests-e2e-no-setup * Only ack messages if sequence is empty or if at least 1 event was processed * Timeout processing a sequence after 5 minutes * Added rebuild-server target for testing, removed tests-e2e-no-setup dependencies * Comments * Always sleep on making no progress * Rename events module to armadaevents (#33) * Fix test * Rename events -> armadaevents * Rename events -> armadaevents * Import ordering * Lookout Ingester fixes following testing (#35) * fixes following testing * fixes following testing * Use PGX for Database Conections to Lookout (#915) * use pgx * removed log lines * fix import order * fix test * fix test * Add retries to Lookout Ingester (#36) * fixes following testing * fixes following testing * added database error handling * import order * code review comments * code review comments * cause doesn't work * import order * Set default tolerations (#34) * Fix test * Rename events -> armadaevents * Rename events -> armadaevents * Import ordering * Add default tolerations for e2e tests * Test for default tolerations * 
Improved job vonersion code * Add job conversion tests * Use job conversion code in eventutil * Test expected tolerations * More verbose printing * Lookout ingester produces compressed proto (#37) * add compressed proto * wip * go imports * fixed config- added todo * Remove changed generated files * Comments * Remove changed generated files * Remove changed generated files Co-authored-by: Chris Martin <[email protected]> Co-authored-by: JamesMurkin <[email protected]> Co-authored-by: Chris Martin <[email protected]>
severinson added a commit that referenced this pull request on Apr 26, 2022
* Pulsar submit API and adapter prototypes, scheduler spec, updates to build Armada internally (#1) * Moved in events code and added scheduler spec * Updated scheduler * Adapter from log messages to Armada * Comments * Deleted unused file * Copied in missing function principalHasQueuePermissions * Replaced atMostOnce with fragile * Scheduler updates * Scheduler updates * Scheduler updates * Scheduler updates * Store k8s services and ingresses in the api.Job object * Use correct time type * Executor uses bundled k8s services and ingresses if present * Removed unused code * Guard populateServicesIngresses against nil values * Added groups to event sequence and namespace, labels, and annotations to EventSequence * Updated Pulsar SubmitJobs to include groups and namespace, annotations, and labels * Comment * Pass through namespace, labels, and annotations * Added a list of concerns * Refactored log submit authorization * Dockerfile for building .proto internally * make proto * Added armadaerrors.ErrNoPermission * Replaced timestamp with time type * Removed go_package option not needed by gogo * make proto * Use armadaerrors.ErrNoPermission instead of server.ErrNoPermission to break import loop * Replaced assert.Nil -> assert.NoError and assert.NotNil -> assert.Error * Removed "", which caused tests to fail, from auth exec test script * Improved exec authenticator error messages, fixed bug where locks were copied * Fail test immediately on error * Fail test immediately on error to avoid panics * Fail test immediately on error to avoid panics * Fail test immediately on error, improved error messages * Create slices using make (seems to have fixed a test failure) * Import ordering * Added corporate proxy and compilation of events.proto * Replace assert.NotEmpty -> assert.NoError * Fixed erroneous error message * commented out ca-certificates install * added google.golang.org/api * replaced assert.Nil -> assert.NoError * Added gr-tests-e2e make target (#2) * Moved in 
events code and added scheduler spec * Updated scheduler * Adapter from log messages to Armada * Comments * Deleted unused file * Copied in missing function principalHasQueuePermissions * Replaced atMostOnce with fragile * Scheduler updates * Scheduler updates * Scheduler updates * Scheduler updates * Store k8s services and ingresses in the api.Job object * Use correct time type * Executor uses bundled k8s services and ingresses if present * Removed unused code * Guard populateServicesIngresses against nil values * Added groups to event sequence and namespace, labels, and annotations to EventSequence * Updated Pulsar SubmitJobs to include groups and namespace, annotations, and labels * Comment * Pass through namespace, labels, and annotations * Added a list of concerns * Refactored log submit authorization * Dockerfile for building .proto internally * make proto * Added armadaerrors.ErrNoPermission * Replaced timestamp with time type * Removed go_package option not needed by gogo * make proto * Use armadaerrors.ErrNoPermission instead of server.ErrNoPermission to break import loop * Replaced assert.Nil -> assert.NoError and assert.NotNil -> assert.Error * Removed "", which caused tests to fail, from auth exec test script * Improved exec authenticator error messages, fixed bug where locks were copied * Fail test immediately on error * Fail test immediately on error to avoid panics * Fail test immediately on error to avoid panics * Fail test immediately on error, improved error messages * Create slices using make (seems to have fixed a test failure) * Import ordering * Added corporate proxy and compilation of events.proto * Replace assert.NotEmpty -> assert.NoError * Fixed erroneous error message * commented out ca-certificates install * added google.golang.org/api * replaced assert.Nil -> assert.NoError * added gr-tests-e2e target * make e2e tests work inside gr (#3) * make e2e tests work inside gr * rever change for normal e2e * and again * enable docker build * 
Enable e2e tests running in WSL (#4) * Enable e2e tests running in WSL * Submit to pulsar, fall back to existing API for queue admin * Added pulsar-client-go and go-multierror * Spin up Pulsar in e2e tests, load config for Pulsar * Start Pulsar submit API and log processor in Armada * Removed debug messages, comments * Add flag to explicitly enable Pulsar * Added periodic logging to the submit from Pulsar service * Import ordering * Kubernetes object metadata improvements, improved logging (#5) * Improved logging and error handling * Import ordering * Comments, logging * Include any additional podspecs in Pulsar submit jobs message * go mod tidy * Use a separate ObjectMeta for each k8s object in Pulsar' * Merge namespace/annotations/labels at the Pulsar submit API * Support submitting jobs with multiple podspecs * Annotate each incoming gRPC request with a request id * Annotate Pulsar messages with gRPC request id * Annotate per-message logger with gRPC request id attached to the Pulsar message * Import ordering * Preserve ordering within sequences (#6) * Publish job transitions to Pulsar (#7) * Added JobRunFailed reasons * Added logic to covert legacy events to Pulsar events * Publish events to Pulsar in addition to Redis * Added e2e tests that connrect directly to Pulsar * Updated Pulsar message spec (#8) * Comments * comments * Added function to return a request id or missing if none is found * Updated events spec * Updated Pulsar e2e tests * Removed commented-out code * Updated state transition message adapter to reflect changes to the proto * Generate JobSucceeded on JobRunSucceeded, logging * Provide Pulsar producer for SubmitFromLog service * Added utility function to insert error information and stack trace to a logrus.Entry * Removed deprecated code * Import ordering * Removed commented-out code * Removed commented-out code * Added isSequencef that takes a message to be logged on error * Removed commented-out code * Comments, removed debug logging * 
Removed temporary swagger.merged file (#9) * Removed temporary swagger.merged file * Removed temporary swagger.merged file * add pulsar tls config * add pulsar tls config * remove stray files * Separate services for updating Redis/Nats and Pulsar from Pulsar messages (#10) * Pulsar message utilities * Added service for writing to Pulsar based on Pulsar messages * Refactoring, use separate PulsarFromPulsar service * Import ordering * Refactoring * Return an error on invalid pulsar message id comparison * Improved error message * Renamed Pulsar events topic to be more descriptive * Removed commented-out code * add advanced pulsar config * review comments * review comments * more review comments * more review comments * more review comments * Pulsar events spec improvements (#13) * Use uint32 instead of double for priority * Todo comment * Hash queue + job_set_name instead of job_set_name * Added efficient UUID message type * Added conversion between google UUID and proto message UUID * Added converters between proto UUIDs and ULIDs * Import ordering * Use optimised uuid message * Added converters between strings and proto uuids * Function to generate a plain ULID, comments * Use optimised proto UUIDs * Comments * More fine-grained settings for job guarantees * Replace 4294967295 by math.MaxUint32 * Break priority parsing into a separate function * Securely hash queue and jobSetName together * Refactoring * Added lifetime to SubmitJob message * fix chart * move defaults * remove pulsar enabled * test fixes on wsl and windows * End-to-end test improvements and fixed to Pulsar ingress/serviced code (#15) * Pass through GOPROXY/GOPRIVATE from the host for make proto * Removed commented-out code * Open armadactl by relative path, use valid priority * Refactoring, cleanup * Added test submitting several jobs, more rigorous event comparison * Removed test submitting only a single job * Pulsar e2e test cleanup * Added code for getting jobIds from events * Remove GR-specific 
GOPROXY/GOPRIVATE * Remove references to GR from proto build * Todo, whitespace * Test improvements * Use same alpine image as for tests, set limits equal to requests (as required by Armada) * Removed todos * Disallow combining PodSpec and PodSpecs, dissallow PodSpecs * Correctly create services and ingresses in log submit API * Set name of objects to create from the ObjectMeta included with the SubmitJob message * Comments * Added todo * Comments * Todos * Test jobs with services/ingresses * Comments * Use PodSpec instead of PodSpecs * Fail immediately on failure to connect to db * Convert PodSpecs with 1 entry to PodSpec * Avoid panics, check for PodSpec instead of PodSpecs[0] * Import ordering * Pulsar events refactoring (#17) * Remove accelerator logging (#894) This log line gets called for every pod using an accelerator on the cluster, every 5 seconds (configured by queueUsageDataRefreshInterval) This causes massive spam for little to no benefit * Moved events package into pkg * Update reference to events.proto * Updated events package import * Added Pulsar properties to distinguish between control and utilisation messages * Handle legacy job utilisation messages, set message key * Removed queue_job_set_hash * Comments * Added ObjectMeta to main object, comments * Added executor_id to ObjectMeta * Renamed code to exit_code in ApplicationError message * Comments * Comments Co-authored-by: JamesMurkin <[email protected]> * Address comments on PR ARMADA/990 GRPub/armada (#18) * Refer to corporate proxies in general terms * Comments * Removed events.pb.go to simplify PR * Restore swagger files to simplify PR * Comments * Removed proposed scheduler code * Sync changes made internally 220322-220419 to GRPub (#15) * Pulsar submit API and adapter prototypes, scheduler spec, updates to build Armada internally (#1) * Moved in events code and added scheduler spec * Updated scheduler * Adapter from log messages to Armada * Comments * Deleted unused file * Copied in 
missing function principalHasQueuePermissions * Replaced atMostOnce with fragile * Scheduler updates * Scheduler updates * Scheduler updates * Scheduler updates * Store k8s services and ingresses in the api.Job object * Use correct time type * Executor uses bundled k8s services and ingresses if present * Removed unused code * Guard populateServicesIngresses against nil values * Added groups to event sequence and namespace, labels, and annotations to EventSequence * Updated Pulsar SubmitJobs to include groups and namespace, annotations, and labels * Comment * Pass through namespace, labels, and annotations * Added a list of concerns * Refactored log submit authorization * Dockerfile for building .proto internally * make proto * Added armadaerrors.ErrNoPermission * Replaced timestamp with time type * Removed go_package option not needed by gogo * make proto * Use armadaerrors.ErrNoPermission instead of server.ErrNoPermission to break import loop * Replaced assert.Nil -> assert.NoError and assert.NotNil -> assert.Error * Removed "", which caused tests to fail, from auth exec test script * Improved exec authenticator error messages, fixed bug where locks were copied * Fail test immediately on error * Fail test immediately on error to avoid panics * Fail test immediately on error to avoid panics * Fail test immediately on error, improved error messages * Create slices using make (seems to have fixed a test failure) * Import ordering * Added corporate proxy and compilation of events.proto * Replace assert.NotEmpty -> assert.NoError * Fixed erroneous error message * commented out ca-certificates install * added google.golang.org/api * replaced assert.Nil -> assert.NoError * Added gr-tests-e2e make target (#2) * Moved in events code and added scheduler spec * Updated scheduler * Adapter from log messages to Armada * Comments * Deleted unused file * Copied in missing function principalHasQueuePermissions * Replaced atMostOnce with fragile * Scheduler updates * Scheduler 
updates * Scheduler updates * Scheduler updates * Store k8s services and ingresses in the api.Job object * Use correct time type * Executor uses bundled k8s services and ingresses if present * Removed unused code * Guard populateServicesIngresses against nil values * Added groups to event sequence and namespace, labels, and annotations to EventSequence * Updated Pulsar SubmitJobs to include groups and namespace, annotations, and labels * Comment * Pass through namespace, labels, and annotations * Added a list of concerns * Refactored log submit authorization * Dockerfile for building .proto internally * make proto * Added armadaerrors.ErrNoPermission * Replaced timestamp with time type * Removed go_package option not needed by gogo * make proto * Use armadaerrors.ErrNoPermission instead of server.ErrNoPermission to break import loop * Replaced assert.Nil -> assert.NoError and assert.NotNil -> assert.Error * Removed "", which caused tests to fail, from auth exec test script * Improved exec authenticator error messages, fixed bug where locks were copied * Fail test immediately on error * Fail test immediately on error to avoid panics * Fail test immediately on error to avoid panics * Fail test immediately on error, improved error messages * Create slices using make (seems to have fixed a test failure) * Import ordering * Added corporate proxy and compilation of events.proto * Replace assert.NotEmpty -> assert.NoError * Fixed erroneous error message * commented out ca-certificates install * added google.golang.org/api * replaced assert.Nil -> assert.NoError * added gr-tests-e2e target * make e2e tests work inside gr (#3) * make e2e tests work inside gr * rever change for normal e2e * and again * enable docker build * Enable e2e tests running in WSL (#4) * Enable e2e tests running in WSL * Submit to pulsar, fall back to existing API for queue admin * Added pulsar-client-go and go-multierror * Spin up Pulsar in e2e tests, load config for Pulsar * Start Pulsar submit API 
and log processor in Armada * Removed debug messages, comments * Add flag to explicitly enable Pulsar * Added periodic logging to the submit from Pulsar service * Import ordering * Kubernetes object metadata improvements, improved logging (#5) * Improved logging and error handling * Import ordering * Comments, logging * Include any additional podspecs in Pulsar submit jobs message * go mod tidy * Use a separate ObjectMeta for each k8s object in Pulsar' * Merge namespace/annotations/labels at the Pulsar submit API * Support submitting jobs with multiple podspecs * Annotate each incoming gRPC request with a request id * Annotate Pulsar messages with gRPC request id * Annotate per-message logger with gRPC request id attached to the Pulsar message * Import ordering * Preserve ordering within sequences (#6) * Publish job transitions to Pulsar (#7) * Added JobRunFailed reasons * Added logic to covert legacy events to Pulsar events * Publish events to Pulsar in addition to Redis * Added e2e tests that connrect directly to Pulsar * Updated Pulsar message spec (#8) * Comments * comments * Added function to return a request id or missing if none is found * Updated events spec * Updated Pulsar e2e tests * Removed commented-out code * Updated state transition message adapter to reflect changes to the proto * Generate JobSucceeded on JobRunSucceeded, logging * Provide Pulsar producer for SubmitFromLog service * Added utility function to insert error information and stack trace to a logrus.Entry * Removed deprecated code * Import ordering * Removed commented-out code * Removed commented-out code * Added isSequencef that takes a message to be logged on error * Removed commented-out code * Comments, removed debug logging * Removed temporary swagger.merged file (#9) * Removed temporary swagger.merged file * Removed temporary swagger.merged file * add pulsar tls config * add pulsar tls config * remove stray files * Separate services for updating Redis/Nats and Pulsar from Pulsar 
messages (#10) * Pulsar message utilities * Added service for writing to Pulsar based on Pulsar messages * Refactoring, use separate PulsarFromPulsar service * Import ordering * Refactoring * Return an error on invalid pulsar message id comparison * Improved error message * Renamed Pulsar events topic to be more descriptive * Removed commented-out code * add advanced pulsar config * review comments * review comments * more review comments * more review comments * more review comments * Pulsar events spec improvements (#13) * Use uint32 instead of double for priority * Todo comment * Hash queue + job_set_name instead of job_set_name * Added efficient UUID message type * Added conversion between google UUID and proto message UUID * Added converters between proto UUIDs and ULIDs * Import ordering * Use optimised uuid message * Added converters between strings and proto uuids * Function to generate a plain ULID, comments * Use optimised proto UUIDs * Comments * More fine-grained settings for job guarantees * Replace 4294967295 by math.MaxUint32 * Break priority parsing into a separate function * Securely hash queue and jobSetName together * Refactoring * Added lifetime to SubmitJob message * fix chart * move defaults * remove pulsar enabled * test fixes on wsl and windows * End-to-end test improvements and fixed to Pulsar ingress/serviced code (#15) * Pass through GOPROXY/GOPRIVATE from the host for make proto * Removed commented-out code * Open armadactl by relative path, use valid priority * Refactoring, cleanup * Added test submitting several jobs, more rigorous event comparison * Removed test submitting only a single job * Pulsar e2e test cleanup * Added code for getting jobIds from events * Remove GR-specific GOPROXY/GOPRIVATE * Remove references to GR from proto build * Todo, whitespace * Test improvements * Use same alpine image as for tests, set limits equal to requests (as required by Armada) * Removed todos * Disallow combining PodSpec and PodSpecs, dissallow 
PodSpecs * Correctly create services and ingresses in log submit API * Set name of objects to create from the ObjectMeta included with the SubmitJob message * Comments * Added todo * Comments * Todos * Test jobs with services/ingresses * Comments * Use PodSpec instead of PodSpecs * Fail immediately on failure to connect to db * Convert PodSpecs with 1 entry to PodSpec * Avoid panics, check for PodSpec instead of PodSpecs[0] * Import ordering * Pulsar events refactoring (#17) * Remove accelerator logging (#894) This log line gets called for every pod using an accelerator on the cluster, every 5 seconds (configured by queueUsageDataRefreshInterval) This causes massive spam for little to no benefit * Moved events package into pkg * Update reference to events.proto * Updated events package import * Added Pulsar properties to distinguish between control and utilisation messages * Handle legacy job utilisation messages, set message key * Removed queue_job_set_hash * Comments * Added ObjectMeta to main object, comments * Added executor_id to ObjectMeta * Renamed code to exit_code in ApplicationError message * Comments * Comments Co-authored-by: JamesMurkin <[email protected]> * Address comments on PR ARMADA/990 GRPub/armada (#18) * Refer to corporate proxies in general terms * Comments * Removed events.pb.go to simplify PR * Restore swagger files to simplify PR * Comments * Removed proposed scheduler code * Only report jobs done once their state has been reported (#899) Normally the state gets reported instantly so this is already true 99% of the time. 
However if reporting the state goes wrong, we shouldn't report the job as done - Otherwise the server will tell the executor to kill the pod when it tries to maintain the lease In all other places we make sure the JobEvent has been reported first before reporting done, so we should do that here too This will only really impact edge cases and most of the time this will already be true * Enable dotnet and npm build internally (#19) * Added make target for dotnet that works internally * Added dotnet tests to tests-e2e target * Handle end-of-line symbols in a cross-platform manner * Moved dotnet build to separate make target * Get protoc via Maven * Load Maven URL from environment variables * Single Dockerfile for building proto * Cleaned up tests make target * Run unit tests in docker containers * Removed unused armada-test docker network * Run e2e tests and dotnet target in containers * Bump go version to 1.16 for consitency * Optionally run all go commands in containers * Mount GOPROXY and GOPRIVATE into go containers * Comments * Run npm in docker containers * Get go version string correctly from containers * Added missing .SubmitServer * Generate legacy job submitted events when submitting to Pulsar * Have cancel and reprioritise endpoints generate set messages * Always run builds in containers * Moved armadactl tests into e2e * Moved pulsar e2e tests into separate directory * Renamed directory * Updated e2e tests target to reflect new directories * Updated tests-e2e-no-setup target * Commented in tests * Removed npm environment variables values * Commented in tests * Removed commented-out code * ARMADA-1028 Events proto updates (#20) * Moved terminal flag into individual errors * Generate JobErrors instead of JobRunErrors on JobFailed API message * Updated to reflect moving terminal flag in errors * Added JobErrors to JobIdFromEvent * Added config files for local use to .gitignore * Added ReprioritisedJob, CancelledJob, and JobDuplicateDetected * Handle all api 
messages * Populate ObjectMeta info for errors * Include container name with container errors * Create EventId type (#21) * Return a concrete type from new to enable comparison * Added event id type * Create utility for sniffing Pulsar events (#22) * Added program to print events * Write eventsprinter as a cobra app * Take pulsar.Message instead of pulsar.ConsumerMessage * Filter out non-control messages * Fix bug associated with creating nil events * Include CancelledJob in list of messages indicating job failure * Print job ids * Improved testing (#23) * Added missing events to JobIdFromEvent * write test output to disk, convert test output to junit format * Added go-junit-report as dependency * Spin up postgres for e2e tests * Write html test report if possible * Removed problematic -e flag * Test submitting a job with errors, test cancelling jobs * Ignore test_reports * Improve Pulsar consumer retry logic (#24) * Do not return an error on messages requiring no action * Ack messages after processing * Fail test immediately on error * Propagate errors correctly * Add code to detect (possibly nested) network errors * Return immediately on nil in IsNetworkError * Improved error handling and retry logic * Added missing parentheses * Consider context.DeadlineExceeded a network error * Improved failure and retry logic * Removed seek from Pulsar setup * Removed multierror from GetActiveJobIds * Added tests for non-network errors * Only ack on successfully processing a sequence * Set Pulsar message key * keyshared sub (#27) * Merge In changes from public github (#28) * Remove accelerator logging (#894) This log line gets called for every pod using an accelerator on the cluster, every 5 seconds (configured by queueUsageDataRefreshInterval) This causes massive spam for little to no benefit * Only report jobs done once their state has been reported (#899) Normally the state gets reported instantly so this is already true 99% of the time. 
However if reporting the state goes wrong, we shouldn't report the job as done - Otherwise the server will tell the executor to kill the pod when it tries to maintain the lease In all other places we make sure the JobEvent has been reported first before reporting done, so we should do that here too This will only really impact edge cases and most of the time this will already be true Co-authored-by: JamesMurkin <[email protected]> * fix infinite loop (#29) * lowe case returned jobIds (#30) * ARMADA-995: Ingester from pulsar-> lookout database (#26) * initial impl of lookout ingester * added pod/container error * fixes after testing * goimports * doc * doc * fixes * remove unneeded code * add docker file- move tests * Fix event sequence number update (#31) * Enable CI builds (#25) * Added Jenkinsfile * Fix syntax errors * Changed var to def * Removed need to clone armada-ci * Removed armada-ci dependency * Use writeFile instead of echo * Removed -it * Get version correctly * Added debug printouts * Printouts * Set PWD to one valid on the host * Set PWD in sh * Set PWD consistently * Make proto * Exit on proto no matching * Added goimports as dependency * Import ordering * Check error code * Check error * Added code checks make target * Go mod tidy * Aded code checks stage * Updated line endings * Escape $ * Added download make target * Set GOPATH such that it is available on the host * Removed unnecessary mkdir * Added build tag to avoid compiling tools into binary * Automatically install tools * Run goimports for pkg/events * Whitespace * Fixed line endings * Added templify to list of tools * Comment out proto stage during development * Renamed junit_report to junit-report for consistency * Run go-junit-report in docker * Require kind, run kubectl in docker * Do not check for kind, to allow downloading it using make download * Add kind to list of tools * Fixed import for go-junit-report * Added tests-teardown target, avoid returning error code from 
tests-e2e-teardown * Run teardown targets before tests * Removed unnecessary cleanup stage * Set shorter OperationTimeout, bundle stack trace with errors * Commented out tests failing due to Docker setup * Added gox to list of tools * Run gox in docker * Comments * Whitespace * Whitespace * Whitespace * Commented out tests during development * Commented in all stages * Enabled junit plugin * Removed junit2html * Updated paths * Updated PWD * Removed -set-exit-code from go-junit-report * For Pulsar, only expose IPv4 to fix issue with Pulsar client library * Added go-swagger and grpc-gateway to tools * Move proto post-processing from proto.sh into makefile for performance * Fix go-imports * Comment in Pulsar tests * Update Jenkinsfile * Fix ineffassign * Fix typo NoError -> Error * Only build if tests succeed * Delete Jenkinsfile * Import ordering * Timeout retrying processing an event sequence (#32) * Add dependent targets to tests-e2e-no-setup * Only ack messages if sequence is empty or if at least 1 event was processed * Timeout processing a sequence after 5 minutes * Added rebuild-server target for testing, removed tests-e2e-no-setup dependencies * Comments * Always sleep on making no progress * Rename events module to armadaevents (#33) * Fix test * Rename events -> armadaevents * Rename events -> armadaevents * Import ordering * Lookout Ingester fixes following testing (#35) * fixes following testing * fixes following testing * Use PGX for Database Conections to Lookout (#915) * use pgx * removed log lines * fix import order * fix test * fix test * Add retries to Lookout Ingester (#36) * fixes following testing * fixes following testing * added database error handling * import order * code review comments * code review comments * cause doesn't work * import order * Set default tolerations (#34) * Fix test * Rename events -> armadaevents * Rename events -> armadaevents * Import ordering * Add default tolerations for e2e tests * Test for default tolerations * 
Improved job vonersion code * Add job conversion tests * Use job conversion code in eventutil * Test expected tolerations * More verbose printing * Lookout ingester produces compressed proto (#37) * add compressed proto * wip * go imports * fixed config- added todo * Remove changed generated files * Comments * Remove changed generated files * Remove changed generated files Co-authored-by: Chris Martin <[email protected]> Co-authored-by: JamesMurkin <[email protected]> Co-authored-by: Chris Martin <[email protected]> * goimports * removed duplicated tests * fix duplicate gopath * fix duplicate gopath * Minor fixes * Regenerate dotnet * Update circleci to reflect makefile changes * Separate e2e-test job * Renamed e2e test job for consistency * Run e2e tests * Increase test VM size * Download tools for build job * Enable integration tests in circleci * Store junit test report * Whitespace * Circleci cleanup * Update job name * Typo * Improved dependency caching * Remove unnecessary make download calls * Print GOPATH * Print GOPATH * Change go cache directories * Only download with GO_TEST_CMD * Always check dependencies * Go mod tidy after make download * Use large resource class * Refactored * Use xlarge resource class * Removed unused dependency * Specify container versions * Remove unused jobs * Enable DLC * Print logs * Fix order of printing logs * Bump e2e test instance resource class * Revert e2e test instance resource class Co-authored-by: Albin Severinson <[email protected]> Co-authored-by: Chris Martin <[email protected]> Co-authored-by: JamesMurkin <[email protected]> Co-authored-by: Chris Martin <[email protected]> Co-authored-by: Albin Severinson <[email protected]>
svc-gh-ghzonetrans-p pushed a commit that referenced this pull request on Oct 23, 2023
* Update simulator * Replace Output with C * Typo * Restore pkg proto * Restore files * Fixing simulator changes (#6) * Fixing simulator changes * Changed to less than or equal Co-authored-by: Mustafa Ilyas <[email protected]> * Simulator Changes (#9) * Add config and dependency injection to scheduler metrics (#2892) * Replace metrics singleton with an injection pattern. * fix * add configuration structures to metrics * add configuration * rename elements * Maker Pulsar ReceiverQueueSize Configurable (#2895) * wip * wip * set receiverQueueSize to 100 * remove old PulsarReceiverQueueSize * revert * subscriptionin api --------- Co-authored-by: Chris Martin <[email protected]> * Add poll_interval (#2805) * Add poll_interval * Add poll_interval * Added poll_interval * update by running tox-e docs --------- Co-authored-by: Kevin Hannon <[email protected]> Co-authored-by: Adam McArthur <[email protected]> * Seperate python script for armada v1 and v2 system diagrams (#2758) * Seperate python script for armada v1 system diagram * removed generate.py so it can be replaced with two seperate files for Armada V1 and Armada V2 * Python script to generate Armada V2 system diagram * generate_v1.py Update #1 * generate_v1.py Update Number:2 * generate.py runs generate_v1.py as well as generate_v2.py and it is consistent with our instructions as 'docs/design/diagrams/relationships' * generate_v1.py Update No:3 * Armada V1 and Armada V2 diagrams * updated relationships_diagram.md to include armada v1 and v2 diagrams --------- Co-authored-by: Adam McArthur <[email protected]> * Add config to use autoupdater on tagged branches (#2905) * #2904 add autoupdate config * #2904 add label config and other options * docs: create README.md for plugins directory (#2897) * Create README.md for plugins directory * Update README.md * Update plugins/README.md Co-authored-by: Kevin Hannon <[email protected]> * Update README.md --------- Co-authored-by: Kevin Hannon <[email protected]> 
Co-authored-by: Adam McArthur <[email protected]> * Enables airflow operator level retry. (#2894) * Update docker stuff for latest airflow 2.7.0 * Use AirflowException instead of AirflowFailException to allow for retries * Remove codecov workflows (#2902) * Upgrade Pulsar Client to v0.11 (#2896) * update * update pulsar client * Fix bug causing server spinning * Abstract out the retry until success logic for testing (#2901) * Respond to review --------- Co-authored-by: Chris Martin <[email protected]> Co-authored-by: Daniel Rastelli <[email protected]> * Sync quickstart/index.md with gh-pages/quickstart.md (#2891) * Log Call Site (#2909) * allow logger to report caller * allow logger to report caller * lint --------- Co-authored-by: Chris Martin <[email protected]> * Add cleaner test output for mage with os/exec.Command (#2907) * feat: Update Semver from version 6.3.0 to 6.3.1 (#2686) Co-authored-by: Adam McArthur <[email protected]> * fix: upgrade @typescript-eslint/parser from 5.52.0 to 5.61.0 (#2743) Snyk has created this PR to upgrade @typescript-eslint/parser from 5.52.0 to 5.61.0. See this package in npm: See this project in Snyk: https://app.snyk.io/org/dave-gantenbein/project/5064983e-fa14-4803-8fc2-cfd6f1fa81b6?utm_source=github&utm_medium=referral&page=upgrade-pr Co-authored-by: snyk-bot <[email protected]> Co-authored-by: Adam McArthur <[email protected]> Co-authored-by: Mohamed Abdelfatah <[email protected]> * fix: upgrade @types/react from 16.14.32 to 16.14.43 (#2747) Snyk has created this PR to upgrade @types/react from 16.14.32 to 16.14.43. 
See this package in npm: See this project in Snyk: https://app.snyk.io/org/dave-gantenbein/project/5064983e-fa14-4803-8fc2-cfd6f1fa81b6?utm_source=github&utm_medium=referral&page=upgrade-pr Co-authored-by: snyk-bot <[email protected]> Co-authored-by: Adam McArthur <[email protected]> Co-authored-by: Mohamed Abdelfatah <[email protected]> * Bump github.com/go-openapi/jsonreference from 0.20.0 to 0.20.2 (#2316) Bumps [github.com/go-openapi/jsonreference](https://github.com/go-openapi/jsonreference) from 0.20.0 to 0.20.2. - [Release notes](https://github.com/go-openapi/jsonreference/releases) - [Commits](go-openapi/jsonreference@v0.20.0...v0.20.2) --- updated-dependencies: - dependency-name: github.com/go-openapi/jsonreference dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Adam McArthur <[email protected]> Co-authored-by: Mohamed Abdelfatah <[email protected]> * Order leased jobs by serial (#2912) This will ensure the job leased first, gets send to the cluster first Currently we just order by postgres default sorting - which often picks the most recently leased - causing the first lease jobs to get stuck - This only occurs when scheduling is faster than leasing * Bump webpack from 5.75.0 to 5.77.0 in /internal/lookout/ui (#2302) Bumps [webpack](https://github.com/webpack/webpack) from 5.75.0 to 5.77.0. - [Release notes](https://github.com/webpack/webpack/releases) - [Commits](webpack/webpack@v5.75.0...v5.77.0) --- updated-dependencies: - dependency-name: webpack dependency-type: indirect ... 
Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Adam McArthur <[email protected]> Co-authored-by: Mohamed Abdelfatah <[email protected]> * Bump word-wrap from 1.2.3 to 1.2.5 in /internal/lookout/ui (#2806) Bumps [word-wrap](https://github.com/jonschlinkert/word-wrap) from 1.2.3 to 1.2.5. - [Release notes](https://github.com/jonschlinkert/word-wrap/releases) - [Commits](jonschlinkert/word-wrap@1.2.3...1.2.5) --- updated-dependencies: - dependency-name: word-wrap dependency-type: indirect ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Adam McArthur <[email protected]> Co-authored-by: Mohamed Abdelfatah <[email protected]> * resolve flaky (#2914) Co-authored-by: Adam McArthur <[email protected]> * fix: upgrade @typescript-eslint/eslint-plugin from 5.52.0 to 5.61.0 (#2744) Snyk has created this PR to upgrade @typescript-eslint/eslint-plugin from 5.52.0 to 5.61.0. See this package in npm: See this project in Snyk: https://app.snyk.io/org/dave-gantenbein/project/5064983e-fa14-4803-8fc2-cfd6f1fa81b6?utm_source=github&utm_medium=referral&page=upgrade-pr Co-authored-by: snyk-bot <[email protected]> Co-authored-by: Adam McArthur <[email protected]> Co-authored-by: Mohamed Abdelfatah <[email protected]> * fix: upgrade react-router-dom from 6.9.0 to 6.14.1 (#2746) Snyk has created this PR to upgrade react-router-dom from 6.9.0 to 6.14.1. 
See this package in npm: See this project in Snyk: https://app.snyk.io/org/dave-gantenbein/project/5064983e-fa14-4803-8fc2-cfd6f1fa81b6?utm_source=github&utm_medium=referral&page=upgrade-pr Co-authored-by: snyk-bot <[email protected]> Co-authored-by: Adam McArthur <[email protected]> Co-authored-by: Mohamed Abdelfatah <[email protected]> * Bump semver from 6.3.0 to 6.3.1 in /internal/lookout/ui (#2661) Bumps [semver](https://github.com/npm/node-semver) from 6.3.0 to 6.3.1. - [Release notes](https://github.com/npm/node-semver/releases) - [Changelog](https://github.com/npm/node-semver/blob/v6.3.1/CHANGELOG.md) - [Commits](npm/node-semver@v6.3.0...v6.3.1) --- updated-dependencies: - dependency-name: semver dependency-type: indirect ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Adam McArthur <[email protected]> Co-authored-by: Mohamed Abdelfatah <[email protected]> * Run CodeQL once daily on a schedule (#2918) * Helm chart update: executor (#2917) * Helm chart update: executor At the moment the helm chart for the executor doesn't include priorityClass even though one is created in the chart. This means that the executor deployment is unable to set the priorityClass. * Patch/dependencies (#2923) * Bump github.com/go-openapi/strfmt from 0.21.3 to 0.21.7 Bumps [github.com/go-openapi/strfmt](https://github.com/go-openapi/strfmt) from 0.21.3 to 0.21.7. - [Release notes](https://github.com/go-openapi/strfmt/releases) - [Commits](go-openapi/strfmt@v0.21.3...v0.21.7) --- updated-dependencies: - dependency-name: github.com/go-openapi/strfmt dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <[email protected]> * Bump github.com/go-openapi/runtime from 0.24.2 to 0.26.0 Bumps [github.com/go-openapi/runtime](https://github.com/go-openapi/runtime) from 0.24.2 to 0.26.0. 
- [Release notes](https://github.com/go-openapi/runtime/releases) - [Commits](go-openapi/runtime@v0.24.2...v0.26.0) --- updated-dependencies: - dependency-name: github.com/go-openapi/runtime dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <[email protected]> * Bump github.com/goreleaser/nfpm/v2 from 2.25.1 to 2.29.0 Bumps [github.com/goreleaser/nfpm/v2](https://github.com/goreleaser/nfpm) from 2.25.1 to 2.29.0. - [Release notes](https://github.com/goreleaser/nfpm/releases) - [Changelog](https://github.com/goreleaser/nfpm/blob/main/.goreleaser.yml) - [Commits](goreleaser/nfpm@v2.25.1...v2.29.0) --- updated-dependencies: - dependency-name: github.com/goreleaser/nfpm/v2 dependency-type: indirect ... Signed-off-by: dependabot[bot] <[email protected]> * Bump github.com/go-playground/validator/v10 from 10.11.1 to 10.14.1 Bumps [github.com/go-playground/validator/v10](https://github.com/go-playground/validator) from 10.11.1 to 10.14.1. - [Release notes](https://github.com/go-playground/validator/releases) - [Commits](go-playground/validator@v10.11.1...v10.14.1) --- updated-dependencies: - dependency-name: github.com/go-playground/validator/v10 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <[email protected]> * Bump Grpc.Net.Client in /client/DotNet/ArmadaProject.Io.Client Bumps [Grpc.Net.Client](https://github.com/grpc/grpc-dotnet) from 2.47.0 to 2.52.0. - [Release notes](https://github.com/grpc/grpc-dotnet/releases) - [Changelog](https://github.com/grpc/grpc-dotnet/blob/master/doc/release_process.md) - [Commits](grpc/grpc-dotnet@v2.47.0...v2.52.0) --- updated-dependencies: - dependency-name: Grpc.Net.Client dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> * fix: upgrade @mui/material from 5.10.17 to 5.13.6 Snyk has created this PR to upgrade @mui/material from 5.10.17 to 5.13.6. 
See this package in npm: See this project in Snyk: https://app.snyk.io/org/dave-gantenbein/project/5064983e-fa14-4803-8fc2-cfd6f1fa81b6?utm_source=github&utm_medium=referral&page=upgrade-pr * fix: upgrade prettier from 2.7.1 to 2.8.8 Snyk has created this PR to upgrade prettier from 2.7.1 to 2.8.8. See this package in npm: See this project in Snyk: https://app.snyk.io/org/dave-gantenbein/project/5064983e-fa14-4803-8fc2-cfd6f1fa81b6?utm_source=github&utm_medium=referral&page=upgrade-pr * fix: upgrade @mui/icons-material from 5.10.16 to 5.14.3 Snyk has created this PR to upgrade @mui/icons-material from 5.10.16 to 5.14.3. See this package in npm: See this project in Snyk: https://app.snyk.io/org/dave-gantenbein/project/5064983e-fa14-4803-8fc2-cfd6f1fa81b6?utm_source=github&utm_medium=referral&page=upgrade-pr * fix: upgrade eslint-plugin-import from 2.26.0 to 2.28.0 Snyk has created this PR to upgrade eslint-plugin-import from 2.26.0 to 2.28.0. See this package in npm: See this project in Snyk: https://app.snyk.io/org/dave-gantenbein/project/5064983e-fa14-4803-8fc2-cfd6f1fa81b6?utm_source=github&utm_medium=referral&page=upgrade-pr * fix: upgrade eslint-config-prettier from 8.5.0 to 8.10.0 Snyk has created this PR to upgrade eslint-config-prettier from 8.5.0 to 8.10.0. 
See this package in npm: See this project in Snyk: https://app.snyk.io/org/dave-gantenbein/project/5064983e-fa14-4803-8fc2-cfd6f1fa81b6?utm_source=github&utm_medium=referral&page=upgrade-pr * Trying to update klog * go mod fix --------- Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: snyk-bot <[email protected]> Co-authored-by: Mohamed Abdelfatah <[email protected]> * Fix bug causing GetJobSetEvents to get stuck (#2903) * Add error message of final job run to JobFailedMessage When we hit the maximum retry limit, the JobFailedMessage just says something along the lines of "Job has been retried too many times, giving up" Now we include the final run error in that message - to make it easier to work out the cause of retries * Fix bug causing GetJobSetEvents to get stuck GetJobSetEvents only increments its fromId variable on sending new messages However now all redis events produce api events that will be sent downstream The issue here is if we get 500 redis events in a row that don't produce api events, then the fromId never gets updated - Meaning the watching gets stuck here To fix this, ReadEvents now returns a lastMessageId. So if there are no messages to process, the fromId should be updated using the lastMessageId * Formatting * Bump @adobe/css-tools from 4.0.1 to 4.3.1 in /internal/lookout/ui (#2931) Bumps [@adobe/css-tools](https://github.com/adobe/css-tools) from 4.0.1 to 4.3.1. - [Changelog](https://github.com/adobe/css-tools/blob/main/History.md) - [Commits](https://github.com/adobe/css-tools/commits) --- updated-dependencies: - dependency-name: "@adobe/css-tools" dependency-type: indirect ... 
Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Improved etcd protection (#2925) * Initial commit * Delete unused code * Export metrics collection delay metrics * Add mutex to InMemoryJobRepository * Add tests * Lint * Update internal/executor/configuration/types.go * Lint --------- Co-authored-by: JamesMurkin <[email protected]> * Stop executor requesting more jobs when it still has leased jobs (#2932) * Stop executor requesting more jobs when it still has leased jobs Currently we "queue" jobs to be submitted on the executor - which sit the leased state until they are submitted to kubernetes However this causes 2 issues with our current setup: - It prevents back-pressure from working well on the scheduler side. As it sees all these "Leased" jobs as active, so just keep scheduling more - In the case we are slowing submission due to etcd going over its limit. We "queue" lots of jobs, and as soon as etcd goes under its limit we hit it with potentially thousands of jobs This flow needs further work and thought - however for now this is the minimal fix to prevent bad behaviour Signed-off-by: JamesMurkin <[email protected]> * WIP Signed-off-by: JamesMurkin <[email protected]> * Fix scheduler side tests Signed-off-by: JamesMurkin <[email protected]> * Implement number of requested jobs on executor side Signed-off-by: JamesMurkin <[email protected]> * Remove unused config Signed-off-by: JamesMurkin <[email protected]> * Fixing panic on startup when etcd health monitor not registered Signed-off-by: JamesMurkin <[email protected]> * Enhance logging Signed-off-by: JamesMurkin <[email protected]> * Set more sensible default for maxLeasedJobs Signed-off-by: JamesMurkin <[email protected]> --------- Signed-off-by: JamesMurkin <[email protected]> * Fix race in etcd protections (#2937) * Initial commit * Fix MultiHealthMonitor race * Fix etcd health metric naming conflict (#2939) * Fix metric 
naming conflict * Fix metric names * Fix metrix prefix * Fix label * Bump golang.org/x/sync from 0.1.0 to 0.3.0 (#2946) Bumps [golang.org/x/sync](https://github.com/golang/sync) from 0.1.0 to 0.3.0. - [Commits](golang/sync@v0.1.0...v0.3.0) --- updated-dependencies: - dependency-name: golang.org/x/sync dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Add more scheduler metrics (#2906) * Add jobs considered and refactor to counters * Add fair share metrics * Add reset for gauge metrics * format * cycle imports * modify cycle return struct * verbose logging --------- Co-authored-by: Albin Severinson <[email protected]> * Update config.yaml (#2953) * Remove gang job cardinality submit check. Add placeholder for min gang size * Add msumner91 and mustafai to magic list of trusted people (#2956) * Add msumner91 to magic list of trusted people * Update .mergify.yml * Airflow: always set credentials from args in channel ctor (#2952) In the GrpcChannelArguments constructor, always set the credentials_callback_args member from what is given. Add a test to verify serialization round-tripping is complete, and a __eq__ implementation for GrpcChannelArguments. 
Signed-off-by: Rich Scott <[email protected]> * Removed Makefile from repo (#2915) Co-authored-by: Mohamed Abdelfatah <[email protected]> * Add per-queue scheduling rate-limiting (#2938) * Initial commit * Add rate limiters * go mod tidy * Updates * Add tests * Update default config * Update default scheduler config * Whitespace * Cleanup * Docstring improvements * Remove limiter nil checks * Add Cardinality() function on gctx * Fix test * Fix test * Add note about signed commits to Contributor documentation (#2960) * Add note about signed commits to Contributor documentation Signed-off-by: Aviral Singh <[email protected]> * Add note about signed commits to Contributor documentation --------- Signed-off-by: Aviral Singh <[email protected]> * ArmadaContext that includes a logger (#2934) * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * compilation! * rename package * more compilation * rename to Context * embed * compilation * compilation * fix test * remove old ctxloggers * revert design doc * revert developer doc * formatting * wip * tests * don't gen * don't gen * merged master --------- Co-authored-by: Chris Martin <[email protected]> Co-authored-by: Albin Severinson <[email protected]> * Bump armada airflow operator to version 0.5.4 (#2961) * Bump armada airflow operator to version 0.5.4 Signed-off-by: Rich Scott <[email protected]> * Regenerate Airflow Operator Markdown doc. Signed-off-by: Rich Scott <[email protected]> * Fix regenerated Airflow doc error. Signed-off-by: Rich Scott <[email protected]> * Pin versions of all modules, especially around docs generation. 
Signed-off-by: Rich Scott <[email protected]> * Regenerate Airflow docs using Python 3.10 Signed-off-by: Rich Scott <[email protected]> --------- Signed-off-by: Rich Scott <[email protected]> * Simulator Changes Made a number of changes to the simulator and simulator tests, most notably: - Fixed implementation of minSubmitTime setting for workload specifications - Added tests for SchedulingConfigsFromPattern, ClusterSpecsFromPattern, WorkloadFromPattern - Added sample workloads, clusters and scheduling configs - Added tests which simulate per-pool and per-executorGroup scheduling - Implemented further metrics for use in simulator tests, such as a cluster's aggregate resources, number of preemptions and schedules for a given test run - Added optimisation to speed up simulator, whereby the scheduler skips the current schedule event if no eventSequences have been received since the previous schedule. * Simplified TestClusterSpecsFromPattern and TestWorkloadFromPattern tests * Removed unused test * Fixed malformed yaml * Improved metrics for simulations. Improved simulator tests with errorgroups. 
* Removed all simulator test data except basic data necessary for testing * Implementing CLI Signed-off-by: dependabot[bot] <[email protected]> Signed-off-by: JamesMurkin <[email protected]> Signed-off-by: Rich Scott <[email protected]> Signed-off-by: Aviral Singh <[email protected]> Co-authored-by: Daniel Rastelli <[email protected]> Co-authored-by: Chris Martin <[email protected]> Co-authored-by: Chris Martin <[email protected]> Co-authored-by: Sarthak Negi <[email protected]> Co-authored-by: Kevin Hannon <[email protected]> Co-authored-by: Adam McArthur <[email protected]> Co-authored-by: Pradeep Kurapati <[email protected]> Co-authored-by: Dave Gantenbein <[email protected]> Co-authored-by: Shivang Shandilya <[email protected]> Co-authored-by: Kevin Hannon <[email protected]> Co-authored-by: Clif Houck <[email protected]> Co-authored-by: Mohamed Abdelfatah <[email protected]> Co-authored-by: Kanu Mike Chibundu <[email protected]> Co-authored-by: snyk-bot <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: JamesMurkin <[email protected]> Co-authored-by: owenthomas17 <[email protected]> Co-authored-by: Albin Severinson <[email protected]> Co-authored-by: Mark Sumner <[email protected]> Co-authored-by: Rich Scott <[email protected]> Co-authored-by: MeenuyD <[email protected]> Co-authored-by: Aviral Singh <[email protected]> Co-authored-by: Mustafa Ilyas <[email protected]> * Adding verbose flag to simulator CLI, changing logging context in simulator * Improved simulator CLI output, removed redundant features, implemented parallel simulations by addressing mutability of structures inputted into the simulator * Removed unknown logging library * Changing threadSafeLogger Info call to Print. 
Adding separation back between simulation results * Implemented stochastic runtime for jobs using a shifted exponential distribution (#13) * Implemented stochastic runtime for jobs using a shifted exponential distribution * Implemented min submit time from dependency completion (#14) Co-authored-by: Mustafa Ilyas <[email protected]> * Fixed tests * Fixed implementation of shifted exponential distribution * Using FP unrounded parameters to sample from distribution * Modified stochastic runtime definition * Adding logging to simulator Co-authored-by: Mustafa Ilyas <[email protected]> Signed-off-by: dependabot[bot] <[email protected]> Signed-off-by: JamesMurkin <[email protected]> Signed-off-by: Rich Scott <[email protected]> Signed-off-by: Aviral Singh <[email protected]> Co-authored-by: Albin Severinson <[email protected]> Co-authored-by: Mustafa Ilyas <[email protected]> Co-authored-by: Mustafa Ilyas <[email protected]> Co-authored-by: Daniel Rastelli <[email protected]> Co-authored-by: Chris Martin <[email protected]> Co-authored-by: Chris Martin <[email protected]> Co-authored-by: Sarthak Negi <[email protected]> Co-authored-by: Kevin Hannon <[email protected]> Co-authored-by: Adam McArthur <[email protected]> Co-authored-by: Pradeep Kurapati <[email protected]> Co-authored-by: Dave Gantenbein <[email protected]> Co-authored-by: Shivang Shandilya <[email protected]> Co-authored-by: Kevin Hannon <[email protected]> Co-authored-by: Clif Houck <[email protected]> Co-authored-by: Mohamed Abdelfatah <[email protected]> Co-authored-by: Kanu Mike Chibundu <[email protected]> Co-authored-by: snyk-bot <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: JamesMurkin <[email protected]> Co-authored-by: owenthomas17 <[email protected]> Co-authored-by: Albin Severinson <[email protected]> Co-authored-by: Mark Sumner <[email protected]> Co-authored-by: Rich Scott <[email protected]> Co-authored-by: MeenuyD 
<[email protected]> Co-authored-by: Aviral Singh <[email protected]>
severinson added a commit that referenced this pull request on Oct 27, 2023
* Sync out testsuite changes (#19) * Update simulator * Replace Output with C * Typo * Restore pkg proto * Restore files * Fixing simulator changes (#6) * Fixing simulator changes * Changed to less than or equal Co-authored-by: Mustafa Ilyas <[email protected]> * Simulator Changes (#9) * Add config and dependency injection to scheduler metrics (#2892) * Replace metrics singleton with an injection pattern. * fix * add configuration structures to metrics * add configuration * rename elements * Maker Pulsar ReceiverQueueSize Configurable (#2895) * wip * wip * set receiverQueueSize to 100 * remove old PulsarReceiverQueueSize * revert * subscriptionin api --------- Co-authored-by: Chris Martin <[email protected]> * Add poll_interval (#2805) * Add poll_interval * Add poll_interval * Added poll_interval * update by running tox-e docs --------- Co-authored-by: Kevin Hannon <[email protected]> Co-authored-by: Adam McArthur <[email protected]> * Seperate python script for armada v1 and v2 system diagrams (#2758) * Seperate python script for armada v1 system diagram * removed generate.py so it can be replaced with two seperate files for Armada V1 and Armada V2 * Python script to generate Armada V2 system diagram * generate_v1.py Update #1 * generate_v1.py Update Number:2 * generate.py runs generate_v1.py as well as generate_v2.py and it is consistent with our instructions as 'docs/design/diagrams/relationships' * generate_v1.py Update No:3 * Armada V1 and Armada V2 diagrams * updated relationships_diagram.md to include armada v1 and v2 diagrams --------- Co-authored-by: Adam McArthur <[email protected]> * Add config to use autoupdater on tagged branches (#2905) * #2904 add autoupdate config * #2904 add label config and other options * docs: create README.md for plugins directory (#2897) * Create README.md for plugins directory * Update README.md * Update plugins/README.md Co-authored-by: Kevin Hannon <[email protected]> * Update README.md --------- Co-authored-by: Kevin 
Hannon <[email protected]> Co-authored-by: Adam McArthur <[email protected]> * Enables airflow operator level retry. (#2894) * Update docker stuff for latest airflow 2.7.0 * Use AirflowException instead of AirflowFailException to allow for retries * Remove codecov workflows (#2902) * Upgrade Pulsar Client to v0.11 (#2896) * update * update pulsar client * Fix bug causing server spinning * Abstract out the retry until success logic for testing (#2901) * Respond to review --------- Co-authored-by: Chris Martin <[email protected]> Co-authored-by: Daniel Rastelli <[email protected]> * Sync quickstart/index.md with gh-pages/quickstart.md (#2891) * Log Call Site (#2909) * allow logger to report caller * allow logger to report caller * lint --------- Co-authored-by: Chris Martin <[email protected]> * Add cleaner test output for mage with os/exec.Command (#2907) * feat: Update Semver from version 6.3.0 to 6.3.1 (#2686) Co-authored-by: Adam McArthur <[email protected]> * fix: upgrade @typescript-eslint/parser from 5.52.0 to 5.61.0 (#2743) Snyk has created this PR to upgrade @typescript-eslint/parser from 5.52.0 to 5.61.0. See this package in npm: See this project in Snyk: https://app.snyk.io/org/dave-gantenbein/project/5064983e-fa14-4803-8fc2-cfd6f1fa81b6?utm_source=github&utm_medium=referral&page=upgrade-pr Co-authored-by: snyk-bot <[email protected]> Co-authored-by: Adam McArthur <[email protected]> Co-authored-by: Mohamed Abdelfatah <[email protected]> * fix: upgrade @types/react from 16.14.32 to 16.14.43 (#2747) Snyk has created this PR to upgrade @types/react from 16.14.32 to 16.14.43. 
See this package in npm: See this project in Snyk: https://app.snyk.io/org/dave-gantenbein/project/5064983e-fa14-4803-8fc2-cfd6f1fa81b6?utm_source=github&utm_medium=referral&page=upgrade-pr Co-authored-by: snyk-bot <[email protected]> Co-authored-by: Adam McArthur <[email protected]> Co-authored-by: Mohamed Abdelfatah <[email protected]> * Bump github.com/go-openapi/jsonreference from 0.20.0 to 0.20.2 (#2316) Bumps [github.com/go-openapi/jsonreference](https://github.com/go-openapi/jsonreference) from 0.20.0 to 0.20.2. - [Release notes](https://github.com/go-openapi/jsonreference/releases) - [Commits](go-openapi/jsonreference@v0.20.0...v0.20.2) --- updated-dependencies: - dependency-name: github.com/go-openapi/jsonreference dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Adam McArthur <[email protected]> Co-authored-by: Mohamed Abdelfatah <[email protected]> * Order leased jobs by serial (#2912) This will ensure the job leased first, gets send to the cluster first Currently we just order by postgres default sorting - which often picks the most recently leased - causing the first lease jobs to get stuck - This only occurs when scheduling is faster than leasing * Bump webpack from 5.75.0 to 5.77.0 in /internal/lookout/ui (#2302) Bumps [webpack](https://github.com/webpack/webpack) from 5.75.0 to 5.77.0. - [Release notes](https://github.com/webpack/webpack/releases) - [Commits](webpack/webpack@v5.75.0...v5.77.0) --- updated-dependencies: - dependency-name: webpack dependency-type: indirect ... 
Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Adam McArthur <[email protected]> Co-authored-by: Mohamed Abdelfatah <[email protected]> * Bump word-wrap from 1.2.3 to 1.2.5 in /internal/lookout/ui (#2806) Bumps [word-wrap](https://github.com/jonschlinkert/word-wrap) from 1.2.3 to 1.2.5. - [Release notes](https://github.com/jonschlinkert/word-wrap/releases) - [Commits](jonschlinkert/word-wrap@1.2.3...1.2.5) --- updated-dependencies: - dependency-name: word-wrap dependency-type: indirect ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Adam McArthur <[email protected]> Co-authored-by: Mohamed Abdelfatah <[email protected]> * resolve flaky (#2914) Co-authored-by: Adam McArthur <[email protected]> * fix: upgrade @typescript-eslint/eslint-plugin from 5.52.0 to 5.61.0 (#2744) Snyk has created this PR to upgrade @typescript-eslint/eslint-plugin from 5.52.0 to 5.61.0. See this package in npm: See this project in Snyk: https://app.snyk.io/org/dave-gantenbein/project/5064983e-fa14-4803-8fc2-cfd6f1fa81b6?utm_source=github&utm_medium=referral&page=upgrade-pr Co-authored-by: snyk-bot <[email protected]> Co-authored-by: Adam McArthur <[email protected]> Co-authored-by: Mohamed Abdelfatah <[email protected]> * fix: upgrade react-router-dom from 6.9.0 to 6.14.1 (#2746) Snyk has created this PR to upgrade react-router-dom from 6.9.0 to 6.14.1. 
See this package in npm: See this project in Snyk: https://app.snyk.io/org/dave-gantenbein/project/5064983e-fa14-4803-8fc2-cfd6f1fa81b6?utm_source=github&utm_medium=referral&page=upgrade-pr Co-authored-by: snyk-bot <[email protected]> Co-authored-by: Adam McArthur <[email protected]> Co-authored-by: Mohamed Abdelfatah <[email protected]> * Bump semver from 6.3.0 to 6.3.1 in /internal/lookout/ui (#2661) Bumps [semver](https://github.com/npm/node-semver) from 6.3.0 to 6.3.1. - [Release notes](https://github.com/npm/node-semver/releases) - [Changelog](https://github.com/npm/node-semver/blob/v6.3.1/CHANGELOG.md) - [Commits](npm/node-semver@v6.3.0...v6.3.1) --- updated-dependencies: - dependency-name: semver dependency-type: indirect ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Adam McArthur <[email protected]> Co-authored-by: Mohamed Abdelfatah <[email protected]> * Run CodeQL once daily on a schedule (#2918) * Helm chart update: executor (#2917) * Helm chart update: executor At the moment the helm chart for the executor doesn't include priorityClass even though one is created in the chart. This means that the executor deployment is unable to set the priorityClass. * Patch/dependencies (#2923) * Bump github.com/go-openapi/strfmt from 0.21.3 to 0.21.7 Bumps [github.com/go-openapi/strfmt](https://github.com/go-openapi/strfmt) from 0.21.3 to 0.21.7. - [Release notes](https://github.com/go-openapi/strfmt/releases) - [Commits](go-openapi/strfmt@v0.21.3...v0.21.7) --- updated-dependencies: - dependency-name: github.com/go-openapi/strfmt dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <[email protected]> * Bump github.com/go-openapi/runtime from 0.24.2 to 0.26.0 Bumps [github.com/go-openapi/runtime](https://github.com/go-openapi/runtime) from 0.24.2 to 0.26.0. 
- [Release notes](https://github.com/go-openapi/runtime/releases) - [Commits](go-openapi/runtime@v0.24.2...v0.26.0) --- updated-dependencies: - dependency-name: github.com/go-openapi/runtime dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <[email protected]> * Bump github.com/goreleaser/nfpm/v2 from 2.25.1 to 2.29.0 Bumps [github.com/goreleaser/nfpm/v2](https://github.com/goreleaser/nfpm) from 2.25.1 to 2.29.0. - [Release notes](https://github.com/goreleaser/nfpm/releases) - [Changelog](https://github.com/goreleaser/nfpm/blob/main/.goreleaser.yml) - [Commits](goreleaser/nfpm@v2.25.1...v2.29.0) --- updated-dependencies: - dependency-name: github.com/goreleaser/nfpm/v2 dependency-type: indirect ... Signed-off-by: dependabot[bot] <[email protected]> * Bump github.com/go-playground/validator/v10 from 10.11.1 to 10.14.1 Bumps [github.com/go-playground/validator/v10](https://github.com/go-playground/validator) from 10.11.1 to 10.14.1. - [Release notes](https://github.com/go-playground/validator/releases) - [Commits](go-playground/validator@v10.11.1...v10.14.1) --- updated-dependencies: - dependency-name: github.com/go-playground/validator/v10 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <[email protected]> * Bump Grpc.Net.Client in /client/DotNet/ArmadaProject.Io.Client Bumps [Grpc.Net.Client](https://github.com/grpc/grpc-dotnet) from 2.47.0 to 2.52.0. - [Release notes](https://github.com/grpc/grpc-dotnet/releases) - [Changelog](https://github.com/grpc/grpc-dotnet/blob/master/doc/release_process.md) - [Commits](grpc/grpc-dotnet@v2.47.0...v2.52.0) --- updated-dependencies: - dependency-name: Grpc.Net.Client dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> * fix: upgrade @mui/material from 5.10.17 to 5.13.6 Snyk has created this PR to upgrade @mui/material from 5.10.17 to 5.13.6. 
See this package in npm: See this project in Snyk: https://app.snyk.io/org/dave-gantenbein/project/5064983e-fa14-4803-8fc2-cfd6f1fa81b6?utm_source=github&utm_medium=referral&page=upgrade-pr * fix: upgrade prettier from 2.7.1 to 2.8.8 Snyk has created this PR to upgrade prettier from 2.7.1 to 2.8.8. See this package in npm: See this project in Snyk: https://app.snyk.io/org/dave-gantenbein/project/5064983e-fa14-4803-8fc2-cfd6f1fa81b6?utm_source=github&utm_medium=referral&page=upgrade-pr * fix: upgrade @mui/icons-material from 5.10.16 to 5.14.3 Snyk has created this PR to upgrade @mui/icons-material from 5.10.16 to 5.14.3. See this package in npm: See this project in Snyk: https://app.snyk.io/org/dave-gantenbein/project/5064983e-fa14-4803-8fc2-cfd6f1fa81b6?utm_source=github&utm_medium=referral&page=upgrade-pr * fix: upgrade eslint-plugin-import from 2.26.0 to 2.28.0 Snyk has created this PR to upgrade eslint-plugin-import from 2.26.0 to 2.28.0. See this package in npm: See this project in Snyk: https://app.snyk.io/org/dave-gantenbein/project/5064983e-fa14-4803-8fc2-cfd6f1fa81b6?utm_source=github&utm_medium=referral&page=upgrade-pr * fix: upgrade eslint-config-prettier from 8.5.0 to 8.10.0 Snyk has created this PR to upgrade eslint-config-prettier from 8.5.0 to 8.10.0. 
See this package in npm: See this project in Snyk: https://app.snyk.io/org/dave-gantenbein/project/5064983e-fa14-4803-8fc2-cfd6f1fa81b6?utm_source=github&utm_medium=referral&page=upgrade-pr * Trying to update klog * go mod fix --------- Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: snyk-bot <[email protected]> Co-authored-by: Mohamed Abdelfatah <[email protected]> * Fix bug causing GetJobSetEvents to get stuck (#2903) * Add error message of final job run to JobFailedMessage When we hit the maximum retry limit, the JobFailedMessage just says something along the lines of "Job has been retried too many times, giving up" Now we include the final run error in that message - to make it easier to work out the cause of retries * Fix bug causing GetJobSetEvents to get stuck GetJobSetEvents only increments its fromId variable on sending new messages However now all redis events produce api events that will be sent downstream The issue here is if we get 500 redis events in a row that don't produce api events, then the fromId never gets updated - Meaning the watching gets stuck here To fix this, ReadEvents now returns a lastMessageId. So if there are no messages to process, the fromId should be updated using the lastMessageId * Formatting * Bump @adobe/css-tools from 4.0.1 to 4.3.1 in /internal/lookout/ui (#2931) Bumps [@adobe/css-tools](https://github.com/adobe/css-tools) from 4.0.1 to 4.3.1. - [Changelog](https://github.com/adobe/css-tools/blob/main/History.md) - [Commits](https://github.com/adobe/css-tools/commits) --- updated-dependencies: - dependency-name: "@adobe/css-tools" dependency-type: indirect ... 
Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Improved etcd protection (#2925) * Initial commit * Delete unused code * Export metrics collection delay metrics * Add mutex to InMemoryJobRepository * Add tests * Lint * Update internal/executor/configuration/types.go * Lint --------- Co-authored-by: JamesMurkin <[email protected]> * Stop executor requesting more jobs when it still has leased jobs (#2932) * Stop executor requesting more jobs when it still has leased jobs Currently we "queue" jobs to be submitted on the executor - which sit the leased state until they are submitted to kubernetes However this causes 2 issues with our current setup: - It prevents back-pressure from working well on the scheduler side. As it sees all these "Leased" jobs as active, so just keep scheduling more - In the case we are slowing submission due to etcd going over its limit. We "queue" lots of jobs, and as soon as etcd goes under its limit we hit it with potentially thousands of jobs This flow needs further work and thought - however for now this is the minimal fix to prevent bad behaviour Signed-off-by: JamesMurkin <[email protected]> * WIP Signed-off-by: JamesMurkin <[email protected]> * Fix scheduler side tests Signed-off-by: JamesMurkin <[email protected]> * Implement number of requested jobs on executor side Signed-off-by: JamesMurkin <[email protected]> * Remove unused config Signed-off-by: JamesMurkin <[email protected]> * Fixing panic on startup when etcd health monitor not registered Signed-off-by: JamesMurkin <[email protected]> * Enhance logging Signed-off-by: JamesMurkin <[email protected]> * Set more sensible default for maxLeasedJobs Signed-off-by: JamesMurkin <[email protected]> --------- Signed-off-by: JamesMurkin <[email protected]> * Fix race in etcd protections (#2937) * Initial commit * Fix MultiHealthMonitor race * Fix etcd health metric naming conflict (#2939) * Fix metric 
naming conflict * Fix metric names * Fix metrix prefix * Fix label * Bump golang.org/x/sync from 0.1.0 to 0.3.0 (#2946) Bumps [golang.org/x/sync](https://github.com/golang/sync) from 0.1.0 to 0.3.0. - [Commits](golang/sync@v0.1.0...v0.3.0) --- updated-dependencies: - dependency-name: golang.org/x/sync dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Add more scheduler metrics (#2906) * Add jobs considered and refactor to counters * Add fair share metrics * Add reset for gauge metrics * format * cycle imports * modify cycle return struct * verbose logging --------- Co-authored-by: Albin Severinson <[email protected]> * Update config.yaml (#2953) * Remove gang job cardinality submit check. Add placeholder for min gang size * Add msumner91 and mustafai to magic list of trusted people (#2956) * Add msumner91 to magic list of trusted people * Update .mergify.yml * Airflow: always set credentials from args in channel ctor (#2952) In the GrpcChannelArguments constructor, always set the credentials_callback_args member from what is given. Add a test to verify serialization round-tripping is complete, and a __eq__ implementation for GrpcChannelArguments. 
Signed-off-by: Rich Scott <[email protected]> * Removed Makefile from repo (#2915) Co-authored-by: Mohamed Abdelfatah <[email protected]> * Add per-queue scheduling rate-limiting (#2938) * Initial commit * Add rate limiters * go mod tidy * Updates * Add tests * Update default config * Update default scheduler config * Whitespace * Cleanup * Docstring improvements * Remove limiter nil checks * Add Cardinality() function on gctx * Fix test * Fix test * Add note about signed commits to Contributor documentation (#2960) * Add note about signed commits to Contributor documentation Signed-off-by: Aviral Singh <[email protected]> * Add note about signed commits to Contributor documentation --------- Signed-off-by: Aviral Singh <[email protected]> * ArmadaContext that includes a logger (#2934) * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * compilation! * rename package * more compilation * rename to Context * embed * compilation * compilation * fix test * remove old ctxloggers * revert design doc * revert developer doc * formatting * wip * tests * don't gen * don't gen * merged master --------- Co-authored-by: Chris Martin <[email protected]> Co-authored-by: Albin Severinson <[email protected]> * Bump armada airflow operator to version 0.5.4 (#2961) * Bump armada airflow operator to version 0.5.4 Signed-off-by: Rich Scott <[email protected]> * Regenerate Airflow Operator Markdown doc. Signed-off-by: Rich Scott <[email protected]> * Fix regenerated Airflow doc error. Signed-off-by: Rich Scott <[email protected]> * Pin versions of all modules, especially around docs generation. 
Signed-off-by: Rich Scott <[email protected]> * Regenerate Airflow docs using Python 3.10 Signed-off-by: Rich Scott <[email protected]> --------- Signed-off-by: Rich Scott <[email protected]> * Simulator Changes Made a number of changes to the simulator and simulator tests, most notably: - Fixed implementation of minSubmitTime setting for workload specifications - Added tests for SchedulingConfigsFromPattern, ClusterSpecsFromPattern, WorkloadFromPattern - Added sample workloads, clusters and scheduling configs - Added tests which simulate per-pool and per-executorGroup scheduling - Implemented further metrics for use in simulator tests, such as a cluster's aggregate resources, number of preemptions and schedules for a given test run - Added optimisation to speed up simulator, whereby the scheduler skips the current schedule event if no eventSequences have been received since the previous schedule. * Simplified TestClusterSpecsFromPattern and TestWorkloadFromPattern tests * Removed unused test * Fixed malformed yaml * Improved metrics for simulations. Improved simulator tests with errorgroups. 
* Removed all simulator test data except basic data necessary for testing * Implementing CLI Signed-off-by: dependabot[bot] <[email protected]> Signed-off-by: JamesMurkin <[email protected]> Signed-off-by: Rich Scott <[email protected]> Signed-off-by: Aviral Singh <[email protected]> Co-authored-by: Daniel Rastelli <[email protected]> Co-authored-by: Chris Martin <[email protected]> Co-authored-by: Chris Martin <[email protected]> Co-authored-by: Sarthak Negi <[email protected]> Co-authored-by: Kevin Hannon <[email protected]> Co-authored-by: Adam McArthur <[email protected]> Co-authored-by: Pradeep Kurapati <[email protected]> Co-authored-by: Dave Gantenbein <[email protected]> Co-authored-by: Shivang Shandilya <[email protected]> Co-authored-by: Kevin Hannon <[email protected]> Co-authored-by: Clif Houck <[email protected]> Co-authored-by: Mohamed Abdelfatah <[email protected]> Co-authored-by: Kanu Mike Chibundu <[email protected]> Co-authored-by: snyk-bot <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: JamesMurkin <[email protected]> Co-authored-by: owenthomas17 <[email protected]> Co-authored-by: Albin Severinson <[email protected]> Co-authored-by: Mark Sumner <[email protected]> Co-authored-by: Rich Scott <[email protected]> Co-authored-by: MeenuyD <[email protected]> Co-authored-by: Aviral Singh <[email protected]> Co-authored-by: Mustafa Ilyas <[email protected]> * Adding verbose flag to simulator CLI, changing logging context in simulator * Improved simulator CLI output, removed redundant features, implemented parallel simulations by addressing mutability of structures inputted into the simulator * Removed unknown logging library * Changing threadSafeLogger Info call to Print. 
Adding separation back between simulation results * Implemented stochastic runtime for jobs using a shifted exponential distribution (#13) * Implemented stochastic runtime for jobs using a shifted exponential distribution * Implemented min submit time from dependency completion (#14) Co-authored-by: Mustafa Ilyas <[email protected]> * Fixed tests * Fixed implementation of shifted exponential distribution * Using FP unrounded parameters to sample from distribution * Modified stochastic runtime definition * Adding logging to simulator Co-authored-by: Mustafa Ilyas <[email protected]> Signed-off-by: dependabot[bot] <[email protected]> Signed-off-by: JamesMurkin <[email protected]> Signed-off-by: Rich Scott <[email protected]> Signed-off-by: Aviral Singh <[email protected]> Co-authored-by: Albin Severinson <[email protected]> Co-authored-by: Mustafa Ilyas <[email protected]> Co-authored-by: Mustafa Ilyas <[email protected]> Co-authored-by: Daniel Rastelli <[email protected]> Co-authored-by: Chris Martin <[email protected]> Co-authored-by: Chris Martin <[email protected]> Co-authored-by: Sarthak Negi <[email protected]> Co-authored-by: Kevin Hannon <[email protected]> Co-authored-by: Adam McArthur <[email protected]> Co-authored-by: Pradeep Kurapati <[email protected]> Co-authored-by: Dave Gantenbein <[email protected]> Co-authored-by: Shivang Shandilya <[email protected]> Co-authored-by: Kevin Hannon <[email protected]> Co-authored-by: Clif Houck <[email protected]> Co-authored-by: Mohamed Abdelfatah <[email protected]> Co-authored-by: Kanu Mike Chibundu <[email protected]> Co-authored-by: snyk-bot <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: JamesMurkin <[email protected]> Co-authored-by: owenthomas17 <[email protected]> Co-authored-by: Albin Severinson <[email protected]> Co-authored-by: Mark Sumner <[email protected]> Co-authored-by: Rich Scott <[email protected]> Co-authored-by: MeenuyD 
<[email protected]> Co-authored-by: Aviral Singh <[email protected]> * Add missing brace * Lint * Lint * Lint * Cleanup * Testsuite improvements * Lint * Tidying --------- Signed-off-by: dependabot[bot] <[email protected]> Signed-off-by: JamesMurkin <[email protected]> Signed-off-by: Rich Scott <[email protected]> Signed-off-by: Aviral Singh <[email protected]> Co-authored-by: Albin Severinson <[email protected]> Co-authored-by: Albin Severinson <[email protected]> Co-authored-by: Mustafa Ilyas <[email protected]> Co-authored-by: Mustafa Ilyas <[email protected]> Co-authored-by: Daniel Rastelli <[email protected]> Co-authored-by: Chris Martin <[email protected]> Co-authored-by: Chris Martin <[email protected]> Co-authored-by: Sarthak Negi <[email protected]> Co-authored-by: Kevin Hannon <[email protected]> Co-authored-by: Adam McArthur <[email protected]> Co-authored-by: Pradeep Kurapati <[email protected]> Co-authored-by: Dave Gantenbein <[email protected]> Co-authored-by: Shivang Shandilya <[email protected]> Co-authored-by: Kevin Hannon <[email protected]> Co-authored-by: Clif Houck <[email protected]> Co-authored-by: Mohamed Abdelfatah <[email protected]> Co-authored-by: Kanu Mike Chibundu <[email protected]> Co-authored-by: snyk-bot <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: JamesMurkin <[email protected]> Co-authored-by: owenthomas17 <[email protected]> Co-authored-by: Mark Sumner <[email protected]> Co-authored-by: Rich Scott <[email protected]> Co-authored-by: MeenuyD <[email protected]> Co-authored-by: Aviral Singh <[email protected]>