
Add multiple beat.Clients #37657

Closed

wants to merge 14 commits into from

Conversation

kcreddy
Contributor

@kcreddy kcreddy commented Jan 17, 2024

Proposed commit message

Improve GCP Pub/Sub input performance by creating a pool of multiple beat.Clients for each pubsub input instance. Each client creates its own subscription to the topic. The input being blocked when max_outstanding_messages < flush.min_events is fixed by raising the default max_outstanding_messages to at least the default value of flush.min_events. The default num_goroutines is increased to 2 to improve ingestion performance, while keeping it low enough to avoid high CPU usage.
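The default-raising logic described above can be sketched as follows. This is a minimal illustration, not the PR's actual code; the type and function names (`subscriptionConfig`, `applyDefaults`) and the hard-coded flush.min_events value are assumptions for the example — the real fields live in x-pack/filebeat/input/gcppubsub/config.go.

```go
package main

import "fmt"

// defaultFlushMinEvents mirrors the queue's default flush.min_events (1600).
// Hypothetical constant for illustration; the real value comes from the
// publisher's queue configuration.
const defaultFlushMinEvents = 1600

type subscriptionConfig struct {
	NumGoroutines          int
	MaxOutstandingMessages int
}

// applyDefaults raises max_outstanding_messages to at least
// flush.min_events so the input is not blocked waiting for a queue flush,
// and defaults to 2 goroutines for better throughput without high CPU use.
func applyDefaults(c *subscriptionConfig) {
	if c.NumGoroutines <= 0 {
		c.NumGoroutines = 2
	}
	if c.MaxOutstandingMessages < defaultFlushMinEvents {
		c.MaxOutstandingMessages = defaultFlushMinEvents
	}
}

func main() {
	c := subscriptionConfig{MaxOutstandingMessages: 1000}
	applyDefaults(&c)
	fmt.Println(c.NumGoroutines, c.MaxOutstandingMessages) // 2 1600
}
```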

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

Related issues

Logs

Performance measurement

The tests are executed by creating a GCP PubSub subscription. The output section is configured with file output.
filebeat-8.13.1 is the latest beats version used as a baseline, and filebeat-8.14.0 is the new code containing multiple beat.Clients.

Acronyms used:

g   - number of goroutines
m   - max outstanding messages
r   - run time of the test
qft - queue flush time
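The run labels in the tables below combine these acronyms, e.g. g2m2000r30s means 2 goroutines, max_outstanding_messages of 2000, and a 30-second run. A small, purely illustrative parser (the labels are hand-written in this PR; nothing in the code uses them):

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// runRE matches labels like "g2m2000r30s": g = goroutines,
// m = max outstanding messages, r = run time (s = seconds, m = minutes).
var runRE = regexp.MustCompile(`^g(\d+)m(\d+)r(\d+)([ms])`)

type run struct {
	Goroutines     int
	MaxOutstanding int
	RunTime        string
}

func parseRun(label string) (run, error) {
	m := runRE.FindStringSubmatch(label)
	if m == nil {
		return run{}, fmt.Errorf("unrecognised run label: %q", label)
	}
	g, _ := strconv.Atoi(m[1])
	mo, _ := strconv.Atoi(m[2])
	return run{Goroutines: g, MaxOutstanding: mo, RunTime: m[3] + m[4]}, nil
}

func main() {
	r, _ := parseRun("g2m2000r30s")
	fmt.Printf("%d goroutines, %d outstanding, %s\n",
		r.Goroutines, r.MaxOutstanding, r.RunTime)
}
```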

Two types of measurements were taken.

1. Based on generated output events (approximation)

This measurement comes from number of output messages inside the files generated from file output.

| Run | filebeat-8.13.1 | filebeat-8.14.0 |
|------------------|------|------|
| g1m1000r1m | 6k | |
| g1m1000r2m | 12k | |
| g2m1000r1m | 6k | |
| g2m1000r2m | 13k | |
| g1m1000r1m-qft30 | 2k | |
| g1m1600r1m | 180k | |
| g1m2000r30s | 90k | 40k |
| g1m3000r30s | 90k | |
| g2m2000r30s | 172k | 312k |
| g2m2000r2m | | 750k |
| g3m1000r2m | | 13k |
| g3m2000r30s | 202k | 499k |
| g4m2000r30s | 257k | 580k |
| g5m2000r30s | | 647k |

2. Based on values inside "Total Metrics" log message.

This metrics log message appears when Filebeat stops. These counts are more accurate than the file-output counts, and they match the GCP Pub/Sub Acked Messages metric.

| Run | filebeat-8.13.1 | filebeat-8.14.0 |
|-------------|------|------|
| g1m2000r30s | 39k | 39k |
| g2m2000r30s | 148k | 285k |
| g3m2000r30s | 205k | 501k |
| g4m2000r30s | 258k | 585k |
| g5m2000r30s | | 651k |

Results:

  1. With the current version 8.13.1, comparing g1m1000r1m and g1m1000r1m-qft30 shows that when max_outstanding_messages is set to 1000, i.e., less than the default flush.min_events, there is a clear indication of the input being blocked. This is similar to the AWS SQS issue.
    • Increasing max_outstanding_messages to at least flush.min_events is suggested to overcome this. Doing so, the performance improvement is evident in g1m1600r1m. Subsequently increasing max_outstanding_messages to 2000 in g1m2000r30s and 3000 in g1m3000r30s didn't improve ingestion performance, as flush.min_events was still set to 1600.
  2. There is a huge difference in filebeat-8.13.1 g1m2000r30s between the number of messages in the file output (90k) and the Total Metrics log message (39k). Although Total Metrics appears to be the accurate figure, this difference is unaccounted for.
  3. Ingestion grows roughly linearly with the number of goroutines in both the 8.13.1 and 8.14.0 versions, but the overall ingestion rates are much higher in 8.14.0 than in the corresponding 8.13.1 runs.
    • This suggests that having multiple beat.Clients is worth pursuing due to the increased ingestion performance.

Mutex profiles:

8.13.1.g3m2000r30s.pprof.filebeat.contentions.delay.001.pb.gz
8.14.0.g3m2000r30s.pprof.filebeat.contentions.delay.001.pb.gz

@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Jan 17, 2024
Contributor

mergify bot commented Jan 17, 2024

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @kcreddy? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v8.\d.0 is the label to automatically backport to the 8.\d branch. \d is the digit

@elasticmachine
Collaborator

💚 Build Succeeded


Build stats

  • Start Time: 2024-01-17T10:34:11.450+0000

  • Duration: 136 min 1 sec

Test stats 🧪

Test Results
Failed 0
Passed 3243
Skipped 176
Total 3419

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments


To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@elasticmachine
Collaborator

❕ Build Aborted

There is a new build on-going so the previous on-going builds have been aborted.


Build stats

  • Start Time: 2024-01-19T15:17:40.686+0000

  • Duration: 6 min 44 sec

🤖 GitHub comments


To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@elasticmachine
Collaborator

elasticmachine commented Jan 19, 2024

💚 Build Succeeded


Build stats

  • Duration: 137 min 43 sec

❕ Flaky test report

No test was executed to be analysed.

🤖 GitHub comments


To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Jan 22, 2024
Contributor

mergify bot commented Feb 5, 2024

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @kcreddy? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v8.\d.0 is the label to automatically backport to the 8.\d branch. \d is the digit

@narph narph added Team:Security-Service Integrations Security Service Integrations Team and removed Team:Security-External Integrations labels Feb 8, 2024
@elasticmachine (Collaborator)

💚 Build Succeeded

cc @kcreddy

@kcreddy kcreddy added the Filebeat Filebeat label Apr 10, 2024
@kcreddy kcreddy marked this pull request as ready for review April 10, 2024 12:25
@kcreddy kcreddy requested a review from a team as a code owner April 10, 2024 12:25
@kcreddy kcreddy requested a review from a team as a code owner April 10, 2024 12:25
@elasticmachine
Collaborator

Pinging @elastic/security-service-integrations (Team:Security-Service Integrations)

@kcreddy kcreddy requested a review from andrewkroh April 10, 2024 12:25
@cmacknz cmacknz requested a review from faec April 10, 2024 20:33
x-pack/filebeat/input/gcppubsub/config.go Outdated Show resolved Hide resolved
x-pack/filebeat/input/gcppubsub/input.go Outdated Show resolved Hide resolved
x-pack/filebeat/input/gcppubsub/input.go Outdated Show resolved Hide resolved
x-pack/filebeat/input/gcppubsub/input.go Outdated Show resolved Hide resolved
x-pack/filebeat/input/gcppubsub/input.go Outdated Show resolved Hide resolved
x-pack/filebeat/input/gcppubsub/input.go Outdated Show resolved Hide resolved
x-pack/filebeat/input/gcppubsub/pubsub_test.go Outdated Show resolved Hide resolved
@kcreddy kcreddy requested a review from efd6 April 11, 2024 16:07
Comment on lines 80 to 82
// The input gets blocked until flush.min_events or flush.timeout is reached.
// Hence max_outstanding_message has to be atleast flush.min_events to avoid this blockage.
c.Subscription.MaxOutstandingMessages = 1600
Contributor

It's unfortunate that there is not a globally visible variable/constant that could be used for this, which would ensure that this is both explained in code and robust to source mutation.

Is there a way for the input to know flush.min_events and adjust this to be max(flush.min_events, max_outstanding_messages)?

Contributor Author

Since the queue config is defined inside publisher, I don't see any quick way to get this check without a considerable refactor of Input.

Contributor

Indeed. It was a wistful desire rather than an expectation of change.

// It is not increased too high to cause high CPU usage.
c.Subscription.NumGoroutines = 2
// The input gets blocked until flush.min_events or flush.timeout is reached.
// Hence max_outstanding_message has to be atleast flush.min_events to avoid this blockage.
Contributor

Suggested change
// Hence max_outstanding_message has to be atleast flush.min_events to avoid this blockage.
// Hence max_outstanding_message has to be at least flush.min_events to avoid this blockage.

Contributor Author

updated as suggested

Contributor

mergify bot commented Apr 15, 2024

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b gcppubsub-improv upstream/gcppubsub-improv
git merge upstream/main
git push upstream gcppubsub-improv

Member

@andrewkroh andrewkroh left a comment

The code changes in this PR create N pub/sub readers based on the subscription.num_goroutines value. This overloads the meaning of this config option: it would mean that subscription.num_goroutines * subscription.num_goroutines (i.e., squared) goroutines are created.

I wonder if the same gains could be achieved by increasing the receive settings in a manner that is the same as running N readers. The pub/sub client library appears to be built to handle this parallelism for us.

For example, settings equivalent to what you were using with these code changes would be

# Multiply everything by the number of pub/sub clients you were previously creating:
sub.ReceiveSettings.NumGoroutines = in.Subscription.NumGoroutines * in.Subscription.NumGoroutines
sub.ReceiveSettings.MaxOutstandingMessages = in.Subscription.MaxOutstandingMessages * in.Subscription.NumGoroutines 

Basically, the question is whether the same gains can be achieved just by config changes.
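The equivalence argument above can be sketched as plain arithmetic. This is an illustration of the reviewer's point, not code from the PR; the struct and function names (`receiveSettings`, `scaleForClients`) are hypothetical stand-ins for the pub/sub client's ReceiveSettings:

```go
package main

import "fmt"

// receiveSettings is a stand-in for the pub/sub client's per-subscription
// receive settings that govern its internal parallelism.
type receiveSettings struct {
	NumGoroutines          int
	MaxOutstandingMessages int
}

// scaleForClients computes settings for a single reader that are equivalent
// to running numClients separate readers, each with the base settings.
func scaleForClients(base receiveSettings, numClients int) receiveSettings {
	return receiveSettings{
		NumGoroutines:          base.NumGoroutines * numClients,
		MaxOutstandingMessages: base.MaxOutstandingMessages * numClients,
	}
}

func main() {
	// g2m2000 with 2 pub/sub clients is equivalent to one client with
	// 4 goroutines and 4000 outstanding messages.
	fmt.Println(scaleForClients(receiveSettings{2, 2000}, 2)) // {4 4000}
}
```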


filebeat-8.13.1 is the latest beats version used as a baseline, and filebeat-8.14.0 is the new code containing multiple beat.Clients.

IIUC this introduces more changes into the test than is necessary. For a controlled experiment I would have expected the baseline to have been performed on some commit (e.g. v8.13.1), and then the only changes introduced for the experiment are the ones under test. It sounds like the experiment was performed with all changes between v8.13.1..v8.14.0 plus what's in this PR. (for future reference)


Increasing max_outstanding_messages to atleast flush.min_events is suggested to overcome this.

Very good. This is an important improvement to the defaults. I think it would be valuable to mention this in the pub/sub input docs.


Add multiple beat.Clients

I don't see the multiple beat.Clients in the code. It still only has one AFAICT. I compared the mutex profiles you gave me.

Before (8.13.1.g2m2000r30s.pprof.filebeat.contentions.delay.001.pb)

2024-04-15_pubsub_before

After (8.14.0.g2m2000r30s.pprof.filebeat.contentions.delay.001.pb)

2024-04-15_pubsub_after

In both before and after there appears to be some lock contention (and delay) caused by the Publish lock. Like roughly 5s delay in the 30s profile.

}()
if err != nil {
in.log.Errorw("failed to create pub/sub client: ", "error", err)
cancel()
Member

It looks like this should be calling return after cancel()?

var workerWg sync.WaitGroup

for ctx.Err() == nil {
workers, err := in.workerSem.AcquireContext(numGoRoutines, ctx)
Member

I'm not sure we should proceed with the approach of adding multiple pub/sub readers (pending resolution to previous comments), but if we do, then:

I think this can be simplified to not use a semaphore. A plain for loop (e.g. for i := 0; i < numberOfReaders; i++ { go runSinglePubSubClient() }) seems to achieve the same result, given that if any one "worker" fails they all stop due to cancel().

Contributor

mergify bot commented Apr 16, 2024

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b gcppubsub-improv upstream/gcppubsub-improv
git merge upstream/main
git push upstream gcppubsub-improv

@kcreddy
Contributor Author

kcreddy commented Apr 17, 2024

Closing the PR as per the comments: #37657 (review).

This PR aims at creating multiple pubsub clients, not multiple beat pipeline clients. Although performance benefits were observed, they come only from the additional goroutines created underneath. This was verified by increasing the existing config options alone; additional pubsub clients are not required to get this performance benefit.

The input blockage issue is resolved by increasing the default value of max_outstanding_messages in #38985.

@kcreddy kcreddy closed this Apr 17, 2024
@kcreddy kcreddy mentioned this pull request Apr 17, 2024
2 tasks
Labels
enhancement Filebeat Filebeat Team:Security-Service Integrations Security Service Integrations Team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

GCP PubSub Input Performance
6 participants