
Implement scaling latency metrics through revisions #983

Merged: 62 commits into main on Jul 22, 2024

Conversation

@Omrigan (Contributor) commented Jun 20, 2024

The newly introduced Revision is (basically) an integer value that can be associated with different parts of the system: the Monitor, the Plugin, NeonVM, and the scaling algorithm itself. There are two types of such association:

  • TargetRevision corresponds to the desired state of a particular part.
  • CurrentRevision corresponds to the state that has already been achieved.

As the system makes progress, the TargetRevision is propagated to the CurrentRevision field, while tracking how long that propagation took.

When the same revision value is passed through multiple parts, we can measure the end-to-end latency of multi-component operations.

Fixes #594.
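
For illustration, here is a minimal Go sketch of the propagation idea described above. The names Revision and revsource.Propagate come from the description and the review snippet further down; the exact fields, signatures, and the ObserveLatency callback type are assumptions and may differ from the actual implementation.

    package revsource

    import "time"

    // Revision is (basically) an integer value, stamped with the time it was
    // created so that propagation latency can be measured.
    // (Field names here are hypothetical.)
    type Revision struct {
        Value     int64
        UpdatedAt time.Time
    }

    // ObserveLatency reports how long a revision took to propagate, e.g. into a
    // histogram metric. (Hypothetical callback type.)
    type ObserveLatency func(d time.Duration)

    // Propagate copies the target revision into the current revision and, if the
    // current revision actually advances, reports how long the propagation took.
    func Propagate(now time.Time, target Revision, current *Revision, observe ObserveLatency) {
        if current.Value >= target.Value {
            return // already at (or past) the target; nothing new to measure
        }
        if observe != nil {
            observe(now.Sub(target.UpdatedAt))
        }
        *current = Revision{Value: target.Value, UpdatedAt: now}
    }

With this shape, a component advances its CurrentRevision by calling Propagate with the TargetRevision it has just finished acting on.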

@Omrigan (Contributor, Author) commented Jun 20, 2024

@sharnoff can you take a look, if you have a moment? This is very much WIP; I haven't updated the tests or tested it yet.

Looking at the logic around the agent's state -> does this clock-propagation logic look sound to you? Can I make it simpler and more compact somehow?

@sharnoff (Member) left a comment:

Had a look - I think I'm struggling to get the big picture of what the intended flow within the autoscaler-agent state machine is (mostly because the existing code is quite complicated, and the new stuff doesn't yet have comments).

Broadly I think it should be ok, but do you have a quick (1-2 paragraph) explanation of how the clocks are supposed to flow?

(IIUC, currently the "desired logical time" stored in the plugin/monitor/neonvm state is basically the logical time of the most recent scaling that the component has completed successfully? If so, AFAICT there are still some subtle edge cases, but they shouldn't require major changes to accommodate.)

Would be good to discuss on Monday 😅


All that aside, one thing I noticed: in a lot of places there are variables like desiredClock or desiredLogicalTime -- IMO, this reads with "desired" as an adjective modifying the clock/logical time, which I guess is not what's intended. I wonder if it'd be better to refer to these more like timestamps, e.g. "tsOfDesired" or "desiredAtTime" etc. (or even just "desiredAt"?)

@Omrigan Omrigan marked this pull request as ready for review July 8, 2024 13:13
@Omrigan Omrigan requested a review from sharnoff July 8, 2024 15:30
@sharnoff (Member) left a comment:

Some thoughts. Not the most thorough review -- rough expectations: next round will include more nits, then one more as final thoughts.

Review threads (outdated, resolved) on: pkg/agent/core/state.go, neonvm/apis/neonvm/v1/virtualmachine_types.go, pkg/agent/core/logiclock/logiclock.go, .golangci.yml, pkg/agent/core/state_test.go, pkg/agent/executor/exec_monitor.go, pkg/agent/runner.go
Omrigan added 14 commits July 9, 2024 15:05
Otherwise, the following fails:

~> go list -m all
go: github.com/optiopay/[email protected]: invalid version: unknown revision 000000000000

Signed-off-by: Oleg Vasilev <[email protected]>
@Omrigan Omrigan changed the base branch from main to oleg/devex July 9, 2024 11:19
@Omrigan Omrigan requested a review from sharnoff July 9, 2024 13:43
Base automatically changed from oleg/devex to main July 10, 2024 10:33
@sharnoff sharnoff mentioned this pull request Jul 19, 2024
@sharnoff (Member) left a comment:

basically final review, a few questions left

Review threads on pkg/agent/core/state.go (resolved)
Comment on lines +1199 to +1203
revsource.Propagate(now,
targetRevision,
&h.s.Monitor.CurrentRevision,
h.s.Config.ObservabilityCallbacks.MonitorLatency,
)
@sharnoff (Member):

Similar question here as with the scheduler - what happens when downscale is denied?

@Omrigan (Contributor, Author):

The propagation doesn't happen -> we don't measure the latency.

@sharnoff (Member):

Here we still measure the latency for the vm-monitor even though it was denied, right? Is that intentional? (if so: what are the expected semantics for component latency?)

@Omrigan (Contributor, Author):

Yes, this is intentional, because a denial is also a success. I don't expect denied vs. allowed requests to yield different latency distributions.

what are the expected semantics for component latency?

Well, the distribution of latency for successful requests 🙃

What confuses you here? Perhaps I am missing something.

@sharnoff (Member) commented Jul 22, 2024:

Well, the distribution of latency for successful requests

My current understanding is either:

  • Component latency should only look at individual request latency
  • Component latency should only be related to end-to-end scaling

If it's the first one, then presumably we shouldn't be using revsource for this (we'd just want a simple histogram metric looking at the time difference since we started the request, right?).

If it's the second one, then we should treat denial as failure, because that doesn't get us closer to scaling.

Does that make sense?

@Omrigan (Contributor, Author):

It is "Component latency should only look at individual request latency".

The implementation will remain as-is for now, and later can be simplified.

I should check that the metric name clearly expresses these semantics.

@Omrigan (Contributor, Author):

I think "autoscaling_agent_plugin_latency_seconds" fits fine.

Counting retries can be "autoscaling_agent_plugin_phase_seconds" or something.
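
As a sketch of what such a metric could look like: a hypothetical declaration using the Prometheus Go client (github.com/prometheus/client_golang). The agent's actual metric definition, buckets, and wiring may differ, and the PluginLatency/observePluginLatency names are assumed here.

    package metrics

    import (
        "time"

        "github.com/prometheus/client_golang/prometheus"
    )

    // pluginLatency is a hypothetical histogram for the metric name proposed above.
    var pluginLatency = prometheus.NewHistogram(prometheus.HistogramOpts{
        Name:    "autoscaling_agent_plugin_latency_seconds",
        Help:    "Time for a target revision to be acknowledged by the scheduler plugin",
        Buckets: prometheus.DefBuckets,
    })

    func init() {
        prometheus.MustRegister(pluginLatency)
    }

    // observePluginLatency would be the callback handed to revsource.Propagate
    // (e.g. via a field like ObservabilityCallbacks.PluginLatency; name assumed).
    func observePluginLatency(d time.Duration) {
        pluginLatency.Observe(d.Seconds())
    }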

Review thread on pkg/agent/runner.go (outdated, resolved)
@Omrigan Omrigan requested a review from sharnoff July 21, 2024 12:25
Review thread on pkg/agent/core/state.go (outdated, resolved)
@Omrigan Omrigan requested a review from sharnoff July 22, 2024 16:37
sharnoff added a commit that referenced this pull request Jul 22, 2024
Noticed while reviewing a new test in #983 that triggers this warning.
@Omrigan Omrigan enabled auto-merge (squash) July 22, 2024 20:47
@Omrigan Omrigan merged commit 4395a93 into main Jul 22, 2024
15 checks passed
@Omrigan Omrigan deleted the oleg/latency-metrics branch July 22, 2024 21:11
Closes: Epic: Scaling latency metrics (#594)