This repository has been archived by the owner on Oct 9, 2023. It is now read-only.

Scale out with propeller manager and workflow sharding #351

Merged
hamersaw merged 66 commits into master from feature/sharding-scale-out on Dec 3, 2021


hamersaw (Contributor) commented Oct 21, 2021

TL;DR

Adding a FlytePropeller Manager component, which is responsible for configuring a collection of FlytePropeller instances and ensuring their liveness. FlyteWorkflow CRDs are sharded over these instances so that each workflow is deterministically assigned to exactly one instance. This enables horizontal scaling of FlytePropeller.
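
To make the deterministic assignment concrete, here is a minimal Go sketch of range-based sharding, assuming workflows are hashed onto a fixed shard-key space that is split into contiguous ranges, one per replica. The commit history mentions an exported ComputeKeyRange function; the keyspace size of 32, the FNV-1a hash and the lowercase helper names below are assumptions for illustration, not FlytePropeller's actual code.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// keyspaceSize is a hypothetical size for the shard-key space; a real
// deployment would take it from the manager's shard configuration.
const keyspaceSize = 32

// shardKey deterministically maps a workflow name onto [0, keyspaceSize).
// FNV-1a is an illustrative choice, not necessarily the hash FlytePropeller uses.
func shardKey(workflowName string) int {
	h := fnv.New32a()
	h.Write([]byte(workflowName))
	return int(h.Sum32() % keyspaceSize)
}

// computeKeyRange returns the half-open range [start, end) of shard keys
// owned by podIndex when keyspace keys are split across podCount replicas.
// The first keyspace%podCount replicas each receive one extra key.
func computeKeyRange(keyspace, podCount, podIndex int) (start, end int) {
	base := keyspace / podCount
	extra := keyspace % podCount
	if podIndex < extra {
		start = podIndex * (base + 1)
		return start, start + base + 1
	}
	start = extra*(base+1) + (podIndex-extra)*base
	return start, start + base
}

// ownerOf returns the replica index whose key range contains key, or -1.
func ownerOf(key, keyspace, podCount int) int {
	for i := 0; i < podCount; i++ {
		if s, e := computeKeyRange(keyspace, podCount, i); key >= s && key < e {
			return i
		}
	}
	return -1
}

func main() {
	for i := 0; i < 3; i++ {
		s, e := computeKeyRange(keyspaceSize, 3, i)
		fmt.Printf("replica %d owns shard keys [%d, %d)\n", i, s, e)
	}
	k := shardKey("my-workflow-execution")
	fmt.Printf("shard key %d -> replica %d\n", k, ownerOf(k, keyspaceSize, 3))
}
```

Because the key is a pure function of the workflow name and the ranges partition the keyspace, each workflow lands on exactly one replica; changing the replica count only moves the range boundaries.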

Type

  • Bug Fix
  • Feature
  • Plugin

Are all requirements met?

  • Code completed
  • Smoke tested
  • Unit tests added
  • Code documentation added
  • Any pending items have an associated Issue

Complete description

Functionality has been implemented in accordance with the accepted RFC.

Tracking Issue

flyteorg/flyte#125

Follow-up issue

NA

codecov bot commented Oct 21, 2021

Codecov Report

Merging #351 (f794181) into master (990dc8e) will decrease coverage by 0.09%.
The diff coverage is 39.76%.

Review thread on cmd/controller/cmd/root.go (outdated, resolved):
}

if len(cfg.ExcludeShardKey) > 0 {
labelSelectorRequirement := v1.LabelSelectorRequirement{"shardKey", v1.LabelSelectorOpNotIn, cfg.IncludeShardKey}
Contributor

Suggested change:
- labelSelectorRequirement := v1.LabelSelectorRequirement{"shardKey", v1.LabelSelectorOpNotIn, cfg.IncludeShardKey}
+ labelSelectorRequirement := v1.LabelSelectorRequirement{"shardKey", v1.LabelSelectorOpNotIn, cfg.ExcludeShardKey}
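
Why the suggested change matters, sketched in Go: a NotIn requirement has to be built from the exclude list, otherwise the replica filters on the wrong keys. The requirement type below is a simplified, hypothetical stand-in for metav1.LabelSelectorRequirement so the example runs without Kubernetes dependencies; it follows the Kubernetes rule that NotIn also matches objects that lack the label.

```go
package main

import "fmt"

// requirement is a simplified, hypothetical stand-in for
// metav1.LabelSelectorRequirement (Key, Operator, Values).
type requirement struct {
	Key      string
	Operator string // "In" or "NotIn"
	Values   []string
}

// matches mirrors Kubernetes semantics: NotIn also matches when the label is
// absent. Any operator other than "NotIn" is treated as "In" in this sketch.
func (r requirement) matches(labels map[string]string) bool {
	v, ok := labels[r.Key]
	found := false
	if ok {
		for _, want := range r.Values {
			if v == want {
				found = true
				break
			}
		}
	}
	if r.Operator == "NotIn" {
		return !found
	}
	return found
}

// buildRequirements turns include/exclude shard-key lists into selector
// requirements; the exclude list (not the include list) feeds NotIn.
func buildRequirements(include, exclude []string) []requirement {
	var reqs []requirement
	if len(include) > 0 {
		reqs = append(reqs, requirement{Key: "shardKey", Operator: "In", Values: include})
	}
	if len(exclude) > 0 {
		reqs = append(reqs, requirement{Key: "shardKey", Operator: "NotIn", Values: exclude})
	}
	return reqs
}

// selects reports whether every requirement matches (requirements are ANDed).
func selects(reqs []requirement, labels map[string]string) bool {
	for _, r := range reqs {
		if !r.matches(labels) {
			return false
		}
	}
	return true
}

func main() {
	reqs := buildRequirements(nil, []string{"3"})
	fmt.Println(selects(reqs, map[string]string{"shardKey": "1"})) // true
	fmt.Println(selects(reqs, map[string]string{"shardKey": "3"})) // false
}
```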

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how will we use exclude? isnt this harder?

Contributor Author

Exclude is used for the "enableUncoveredReplica" option, mainly with the project and domain shard strategies. It is only exposed on the shard key option for completeness, but it's easy to remove.
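
A minimal Go sketch of the "enableUncoveredReplica" idea described in the reply, assuming a project-based strategy where each configured replica lists its projects: each replica gets an In requirement, and the optional extra replica gets a NotIn requirement over the union of all listed projects, so no project is left without an owner. The "project" label key, the function names and the stand-in requirement type are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// requirement is a simplified, hypothetical stand-in for a label selector requirement.
type requirement struct {
	Key      string
	Operator string // "In" or "NotIn"
	Values   []string
}

// projectShardRequirements returns one requirement per replica. Configured
// replicas select their listed projects with In; if enableUncoveredReplica is
// set, one extra replica selects every project not listed anywhere via NotIn.
func projectShardRequirements(perReplica [][]string, enableUncoveredReplica bool) []requirement {
	reqs := make([]requirement, 0, len(perReplica)+1)
	var covered []string
	for _, projects := range perReplica {
		reqs = append(reqs, requirement{Key: "project", Operator: "In", Values: projects})
		covered = append(covered, projects...)
	}
	if enableUncoveredReplica {
		sort.Strings(covered) // sorts our own copy, not the caller's slices
		reqs = append(reqs, requirement{Key: "project", Operator: "NotIn", Values: covered})
	}
	return reqs
}

// replicaFor returns the first replica whose requirement matches project, or -1.
func replicaFor(reqs []requirement, project string) int {
	for i, r := range reqs {
		found := false
		for _, v := range r.Values {
			if v == project {
				found = true
				break
			}
		}
		// In matches when found; NotIn matches when not found.
		if (r.Operator == "In") == found {
			return i
		}
	}
	return -1
}

func main() {
	reqs := projectShardRequirements([][]string{{"alpha", "beta"}, {"gamma"}}, true)
	for _, p := range []string{"beta", "gamma", "delta"} {
		fmt.Printf("%s -> replica %d\n", p, replicaFor(reqs, p))
	}
}
```

Without the uncovered replica, a project that is missing from every list (such as "delta" here) would never be picked up by any FlytePropeller instance.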

Additional review threads (outdated, resolved):
  • cmd/manager/cmd/root.go (3 threads)
  • manager/config/config.go
  • manager/manager.go (4 threads)
  • manager/shard_strategy.go
kumare3 (Contributor) commented Nov 30, 2021

@hamersaw I was talking with @EngHabu and he mentioned one important thing. You should create all the pods with an owner reference. The owner reference should be the same as the flytepropeller deployment owner reference that is populated on the manager; that is, the manager should pass its own owner reference down to the propeller pods.

That way, deleting the deployment automatically deletes all the pods the manager created.

hamersaw requested a review from kumare3 on November 30, 2021 23:25
hamersaw (Contributor Author) commented Dec 1, 2021

In reply to kumare3's owner-reference suggestion above:

This is done. The managed pods now receive the owner references (if any) of the pod the manager runs in.

This differs from the previous approach, a shutdown hook that deleted all managed pods whenever the manager exited. That meant a manager failure deleted them just as a clean shutdown did. With the new approach, failures in the manager pod do not affect the managed pods.
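
A Go sketch of the owner-reference propagation described above, using simplified, hypothetical stand-ins for the Kubernetes OwnerReference, ObjectMeta and Pod types rather than the real client types. A Deployment's pods are typically owned by a ReplicaSet, so copying the manager pod's owner reference onto the managed pods lets the Kubernetes garbage collector remove them when the Deployment (and thus the ReplicaSet) is deleted, while a crash of the manager pod alone leaves them running.

```go
package main

import "fmt"

// ownerReference, objectMeta and pod are simplified, hypothetical stand-ins
// for metav1.OwnerReference, metav1.ObjectMeta and corev1.Pod.
type ownerReference struct {
	APIVersion string
	Kind       string
	Name       string
	UID        string
}

type objectMeta struct {
	Name            string
	OwnerReferences []ownerReference
}

type pod struct {
	ObjectMeta objectMeta
}

// newManagedPod builds a managed propeller pod that inherits the owner
// references (if any) of the pod the manager runs in. The slice is copied so
// later edits to one pod's metadata do not leak into the other.
func newManagedPod(managerPod pod, name string) pod {
	var refs []ownerReference
	if n := len(managerPod.ObjectMeta.OwnerReferences); n > 0 {
		refs = make([]ownerReference, n)
		copy(refs, managerPod.ObjectMeta.OwnerReferences)
	}
	return pod{ObjectMeta: objectMeta{Name: name, OwnerReferences: refs}}
}

// ownedBy reports whether p lists uid among its owners, i.e. whether p would
// be garbage-collected once that owner is deleted.
func ownedBy(p pod, uid string) bool {
	for _, ref := range p.ObjectMeta.OwnerReferences {
		if ref.UID == uid {
			return true
		}
	}
	return false
}

func main() {
	manager := pod{ObjectMeta: objectMeta{
		Name: "flytepropeller-manager-abc",
		OwnerReferences: []ownerReference{{
			APIVersion: "apps/v1", Kind: "ReplicaSet", Name: "flytepropeller-manager-123", UID: "rs-uid",
		}},
	}}
	managed := newManagedPod(manager, "flytepropeller-0")
	fmt.Println(ownedBy(managed, "rs-uid")) // true
}
```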

EngHabu (Contributor) left a review:

A couple of nits

Review threads:
  • cmd/manager/main.go (outdated, resolved)
  • manager/config/config.go (3 threads, outdated, resolved)
  • manager/manager.go (outdated, resolved)
  • manager/manager.go (resolved)
  • pkg/utils/k8s.go (resolved)
hamersaw requested a review from EngHabu on December 3, 2021 17:41
kumare3 previously approved these changes Dec 3, 2021
EngHabu previously approved these changes Dec 3, 2021
EngHabu (Contributor) left a review:

Feel free to ignore...

Review threads:
  • manager/config/config.go (2 threads, outdated, resolved)
hamersaw dismissed stale reviews from EngHabu and kumare3 via 85e3517 on December 3, 2021 23:06
hamersaw merged commit c485750 into master on Dec 3, 2021
hamersaw deleted the feature/sharding-scale-out branch on December 4, 2021 14:23
EngHabu pushed a commit that referenced this pull request Jan 5, 2022
* added 'manager' command

Signed-off-by: Daniel Rammer <[email protected]>

* using go routine and timer for manager loop

Signed-off-by: Daniel Rammer <[email protected]>

* moved manager loop out of cmd and into pkg directory

Signed-off-by: Daniel Rammer <[email protected]>

* detecting missing replicas

Signed-off-by: Daniel Rammer <[email protected]>

* moved extracting replica from pod name to new function

Signed-off-by: Daniel Rammer <[email protected]>

* creating managed flytepropeller pods

Signed-off-by: Daniel Rammer <[email protected]>

* refactored configuration

Signed-off-by: Daniel Rammer <[email protected]>

* removed regex parsing for replica - checking for existence with fully qualified pod name

Signed-off-by: Daniel Rammer <[email protected]>

* mocked out shard strategy abstraction

Signed-off-by: Daniel Rammer <[email protected]>

* adding arguments to podspec for ConsistentHashingShardStrategy

Signed-off-by: Daniel Rammer <[email protected]>

* updated import naming

Signed-off-by: Daniel Rammer <[email protected]>

* moved manager to a top-level package

Signed-off-by: Daniel Rammer <[email protected]>

* added shard strategy to manager configuration

Signed-off-by: Daniel Rammer <[email protected]>

* setting shard key label selector on managed propeller instances

Signed-off-by: Daniel Rammer <[email protected]>

* fixed random lint issues

Signed-off-by: Daniel Rammer <[email protected]>

* split pod name generate to separate function to ease future auto-scaler implementation

Signed-off-by: Daniel Rammer <[email protected]>

* cleaned up pod label selector

Signed-off-by: Daniel Rammer <[email protected]>

* delete pods on shutdown

Signed-off-by: Daniel Rammer <[email protected]>

* added prometheus metric reporting

Signed-off-by: Daniel Rammer <[email protected]>

* updated manager run loop to use k8s wait.UntilWithContext

Signed-off-by: Daniel Rammer <[email protected]>

* moved getKubeConfig into a shared package

Signed-off-by: Daniel Rammer <[email protected]>

* assigning shard and namespace labels on FlyteWorkflow

Signed-off-by: Daniel Rammer <[email protected]>

* implement NamespaceShardStrategy

Signed-off-by: Daniel Rammer <[email protected]>

* implemented NamespaceShardStrategy

Signed-off-by: Daniel Rammer <[email protected]>

* fixed shard label

Signed-off-by: Daniel Rammer <[email protected]>

* added comments

Signed-off-by: Daniel Rammer <[email protected]>

* checking for existing pods on startup

Signed-off-by: Daniel Rammer <[email protected]>

* handling delete of non-existent pod

Signed-off-by: Daniel Rammer <[email protected]>

* changed ConsistentHashing name to Random - because that's what it really is

Signed-off-by: Daniel Rammer <[email protected]>

* implemented EnableUncoveredReplica configuration option

Signed-off-by: Daniel Rammer <[email protected]>

* added leader election to manager using existing propeller config

Signed-off-by: Daniel Rammer <[email protected]>

* fixed disable leader election in managed propeller pods

Signed-off-by: Daniel Rammer <[email protected]>

* removed listPods function

Signed-off-by: Daniel Rammer <[email protected]>

* added leader election to mitigate concurrent modification issues

Signed-off-by: Daniel Rammer <[email protected]>

* enabled pprof to profile resource metrics

Signed-off-by: Daniel Rammer <[email protected]>

* added 'manager' target to Makefile to start manager in development mode (similar to existing server)

Signed-off-by: Daniel Rammer <[email protected]>

* added shard strategy test for computing key ranges

Signed-off-by: Daniel Rammer <[email protected]>

* fixed key range computation

Signed-off-by: Daniel Rammer <[email protected]>

* implemented project and domain shard types

Signed-off-by: Daniel Rammer <[email protected]>

* returning error on out of range podIndex during UpdatePodSpec call on shard strategy

Signed-off-by: Daniel Rammer <[email protected]>

* fixed random lint issues

Signed-off-by: Daniel Rammer <[email protected]>

* added manager tests

Signed-off-by: Daniel Rammer <[email protected]>

* fixed lint issues

Signed-off-by: Daniel Rammer <[email protected]>

* added doc comments on exported types and functions

Signed-off-by: Daniel Rammer <[email protected]>

* exporting ComputeKeyRange function and renamed addLabelSelector to addLabelSelectorIfExists to better reflect its functionality

Signed-off-by: Daniel Rammer <[email protected]>

* adding pod template resource version and shard config hash annotations to fuel automatic pod management on updates

Signed-off-by: Daniel Rammer <[email protected]>

* removed pod deletion on manager shutdown

Signed-off-by: Daniel Rammer <[email protected]>

* cleaned up unit tests and lint

Signed-off-by: Daniel Rammer <[email protected]>

* updated getContainer function to retrieve flytepropeller container from pod spec using container name instead of command

Signed-off-by: Daniel Rammer <[email protected]>

* removed addLabelSelectorIfExists function call

Signed-off-by: Daniel Rammer <[email protected]>

* changed bytes.Buffer from a var to declaring with new

Signed-off-by: Daniel Rammer <[email protected]>

* created a new shardstrategy package

Signed-off-by: Daniel Rammer <[email protected]>

* generating mocks for ShardStrategy to decouple manager package tests from shardstrategy package tests

Signed-off-by: Daniel Rammer <[email protected]>

* fixed lint issues

Signed-off-by: Daniel Rammer <[email protected]>

* changed shard configuration definitions and added support for wildcard id in EnvironmentShardStrategy

Signed-off-by: Daniel Rammer <[email protected]>

* updated documentation

Signed-off-by: Daniel Rammer <[email protected]>

* fixed lint issues

Signed-off-by: Daniel Rammer <[email protected]>

* setting managed pod owner references

Signed-off-by: Daniel Rammer <[email protected]>

* updated documentation

Signed-off-by: Daniel Rammer <[email protected]>

* fixed a few nits

Signed-off-by: Daniel Rammer <[email protected]>

* delete pods with failed state

Signed-off-by: Daniel Rammer <[email protected]>

* changed ShardType type to int instead of string

Signed-off-by: Daniel Rammer <[email protected]>

* removed default values in manager config

Signed-off-by: Daniel Rammer <[email protected]>

* updated config_flags with pflags generation

Signed-off-by: Daniel Rammer <[email protected]>
Signed-off-by: Haytham Abuelfutuh <[email protected]>
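
Several commits above (deterministic pod names, config-hash annotations to drive pod management on updates, deleting pods in a failed state) describe a reconcile decision per replica slot. A minimal Go sketch, assuming a hypothetical annotation key, a SHA-256 fingerprint of the serialized shard config, and a delete-then-recreate policy; the pod template resource version annotation mentioned in the commit is left out for brevity, and none of the names are FlytePropeller's.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// configHashAnnotation is a hypothetical annotation key used for illustration.
const configHashAnnotation = "example.com/shard-config-hash"

// managedPod is a simplified, hypothetical view of an existing propeller pod.
type managedPod struct {
	Phase       string // e.g. "Running", "Failed"
	Annotations map[string]string
}

// podName derives a deterministic name per replica so existence can be checked by name.
func podName(prefix string, index int) string {
	return fmt.Sprintf("%s-%d", prefix, index)
}

// configHash fingerprints the serialized shard configuration.
func configHash(serializedConfig []byte) string {
	sum := sha256.Sum256(serializedConfig)
	return hex.EncodeToString(sum[:])
}

// action decides what one reconcile pass would do for a replica slot.
func action(existing *managedPod, wantHash string) string {
	switch {
	case existing == nil:
		return "create"
	case existing.Phase == "Failed":
		return "delete" // recreated on a later pass
	case existing.Annotations[configHashAnnotation] != wantHash:
		return "delete" // stale configuration; recreated on a later pass
	default:
		return "keep"
	}
}

func main() {
	want := configHash([]byte(`{"type":"hash","podCount":3}`))
	pods := map[string]*managedPod{
		podName("flytepropeller", 0): {Phase: "Running", Annotations: map[string]string{configHashAnnotation: want}},
		podName("flytepropeller", 1): {Phase: "Failed", Annotations: map[string]string{configHashAnnotation: want}},
	}
	for i := 0; i < 3; i++ {
		name := podName("flytepropeller", i)
		fmt.Println(name, action(pods[name], want))
	}
	// flytepropeller-0 keep
	// flytepropeller-1 delete
	// flytepropeller-2 create
}
```

Deleting rather than patching keeps the loop simple: once a deleted pod is gone, a later pass finds the slot empty and creates a fresh pod from the current template and configuration.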
EngHabu added a commit that referenced this pull request Jan 7, 2022
* Scale out with propeller manager and workflow sharding (#351)

* Create codeql-analysis.yml

Signed-off-by: Haytham Abuelfutuh <[email protected]>

* Handle code quality issue

Signed-off-by: Haytham Abuelfutuh <[email protected]>

* check boundaries

Signed-off-by: Haytham Abuelfutuh <[email protected]>

* 0 is ok

Signed-off-by: Haytham Abuelfutuh <[email protected]>

* Use ParseUint instead

Signed-off-by: Haytham Abuelfutuh <[email protected]>

* bump for DCO

Signed-off-by: Haytham Abuelfutuh <[email protected]>

Co-authored-by: Dan Rammer <[email protected]>
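
The follow-up commits ("check boundaries", "0 is ok", "Use ParseUint instead") do not say which value was being parsed. A common reason for this kind of fix is CodeQL's check for incorrect conversions between integer types. A sketch of the general pattern, assuming a hypothetical shard index string that must lie in [0, limit):

```go
package main

import (
	"fmt"
	"strconv"
)

// parseShardIndex parses a decimal index and checks it against an upper bound.
// bitSize 32 makes ParseUint reject values that do not fit in a uint32, so the
// conversion below cannot silently truncate; negative input is rejected as
// invalid syntax. Zero is a valid index.
func parseShardIndex(s string, limit uint32) (uint32, error) {
	v, err := strconv.ParseUint(s, 10, 32)
	if err != nil {
		return 0, err
	}
	if uint32(v) >= limit {
		return 0, fmt.Errorf("index %d out of range [0, %d)", v, limit)
	}
	return uint32(v), nil
}

func main() {
	for _, s := range []string{"0", "31", "32", "-1", "99999999999"} {
		v, err := parseShardIndex(s, 32)
		fmt.Println(s, v, err)
	}
}
```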
eapolinario pushed a commit to eapolinario/flytepropeller that referenced this pull request Aug 9, 2023
eapolinario pushed a commit to eapolinario/flytepropeller that referenced this pull request Aug 9, 2023
* Scale out with propeller manager and workflow sharding (flyteorg#351)

3 participants