Improve Azure Event Hub scaling #5125

Merged: 3 commits, Jan 4, 2024

Contributor

@troydn troydn commented Oct 24, 2023

This PR improves the Event Hub unprocessed event count calculation so that it no longer produces negative or implausibly large values.

There are two issues in the current implementation:

  1. Stale partition runtime information results in very large values:
    The scaling calculation compares the SequenceNumber from the partition runtime information with the SequenceNumber in the checkpoint store.
    The partition runtime information can be stale relative to the checkpoint, for example when the partition runtime information reports 10 and the checkpoint reports 15. Because the sequence number appears to have gone backwards, the calculation treats it as a wrap-around of the circular buffer: (9223372036854775807 - 15) + 10 = 9223372036854775802. The result should be 0.

  2. When the sum of unprocessed events across all partitions exceeds the Int64 max value, the result is negative:
    With very large per-partition counts (mostly caused by the first issue), the total overflows Int64 and wraps around to a negative value.
    Using the first example: Partition 1 (9223372036854775802) + Partition 2 (100) = -9223372036854775714
    In this case the result should be clamped to the Int64 max value, which yields the maximum for lagRelatedToPartitionCount: (partitionCount * threshold).

To solve the first issue, I introduced a new parameter, stalePartitionInfoThreshold.
It sets how close the calculated distance must be to the Int64 max value before the partition information is treated as stale, rather than as a sign that the Event Hub has gone almost all the way around the circular buffer.

Using the first example to explain the new implementation:

  • Partition information sequence number: 10
  • Checkpoint store sequence number: 15
  • stalePartitionInfoThreshold: 10000
    • The partition info is not considered stale as long as the unprocessed event count is at most 9223372036854765807 (Int64 max value - threshold).
  • Distance between the checkpoint and the partition info: 9223372036854775802
  • 9223372036854775802 > 9223372036854765807, so the partition info is stale. The unprocessed event count for the partition is 0 because all data has been processed.

Visualization: [image: stalePartitionInfoThreshold]

Fixes #4250

@troydn troydn requested a review from a team as a code owner October 24, 2023 19:35
@github-actions

Thank you for your contribution! 🙏 We will review your PR as soon as possible.

@troydn troydn force-pushed the fix-azure-eh-scaling branch from 499ab66 to e17f322 Compare October 24, 2023 20:09
@JorTurFer
Member

@tomkerkhove, could you request a review from someone with Event Hub expertise?

@zroubalik
Member

@tomkerkhove FYI

@JorTurFer
Member

I think this is ready to merge. Could you resolve the merge conflicts, please, @troydn? 🙏

@troydn troydn force-pushed the fix-azure-eh-scaling branch from e17f322 to ff0d4ee Compare January 3, 2024 22:27
@JorTurFer
Member

JorTurFer commented Jan 4, 2024

/run-e2e azure
Update: You can check the progress here

@JorTurFer JorTurFer enabled auto-merge (squash) January 4, 2024 01:37
@JorTurFer JorTurFer merged commit 9be8ee6 into kedacore:main Jan 4, 2024
19 checks passed
toniiiik pushed a commit to toniiiik/keda that referenced this pull request Jan 15, 2024
* Update Azure EventHub scaling

Signed-off-by: Troy <[email protected]>

* Update changelog

Signed-off-by: Troy <[email protected]>

* Remove unused context

Signed-off-by: Troy <[email protected]>

---------

Signed-off-by: Troy <[email protected]>
Signed-off-by: anton.lysina <[email protected]>
Successfully merging this pull request may close these issues.

EventHub plugin reports strange values periodically
4 participants