
Add multiple beat.Clients #37657

Closed

wants to merge 14 commits into from

Conversation

kcreddy
Contributor

@kcreddy kcreddy commented Jan 17, 2024

Proposed commit message

Improve GCP Pub/Sub input performance by creating a pool of multiple beat.Clients for each pubsub input instance. Each client creates its own subscription to the topic. The input being blocked when max_outstanding_messages < flush.min_events is fixed by raising the default max_outstanding_messages to at least the default value of flush.min_events. The default num_goroutines is increased to 2 to improve ingestion performance, while keeping it low enough to avoid high CPU usage.
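The default-raising logic described above can be sketched as follows. This is a minimal illustration, not the PR's actual code; the type and function names (`subscriptionConfig`, `applyDefaults`) and the hard-coded flush.min_events value are assumptions for the example — the real fields live in x-pack/filebeat/input/gcppubsub/config.go.

```go
package main

import "fmt"

// defaultFlushMinEvents mirrors the queue's default flush.min_events (1600).
// Hypothetical constant for illustration; the real value comes from the
// publisher's queue configuration.
const defaultFlushMinEvents = 1600

type subscriptionConfig struct {
	NumGoroutines          int
	MaxOutstandingMessages int
}

// applyDefaults raises max_outstanding_messages to at least
// flush.min_events so the input is not blocked waiting for a queue flush,
// and defaults to 2 goroutines for better throughput without high CPU use.
func applyDefaults(c *subscriptionConfig) {
	if c.NumGoroutines <= 0 {
		c.NumGoroutines = 2
	}
	if c.MaxOutstandingMessages < defaultFlushMinEvents {
		c.MaxOutstandingMessages = defaultFlushMinEvents
	}
}

func main() {
	c := subscriptionConfig{MaxOutstandingMessages: 1000}
	applyDefaults(&c)
	fmt.Println(c.NumGoroutines, c.MaxOutstandingMessages) // 2 1600
}
```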

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

Related issues

Logs

Performance measurement

The tests are executed by creating a GCP PubSub subscription. The output section is configured with file output.
filebeat-8.13.1 is the latest beats version used as a baseline, and filebeat-8.14.0 is the new code containing multiple beat.Clients.

Acronyms used:

g   - number of goroutines
m   - max outstanding messages
r   - run time of the test
qft - queue flush time
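The run labels in the tables below combine these acronyms, e.g. g2m2000r30s means 2 goroutines, max_outstanding_messages of 2000, and a 30-second run. A small, purely illustrative parser (the labels are hand-written in this PR; nothing in the code uses them):

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// runRE matches labels like "g2m2000r30s": g = goroutines,
// m = max outstanding messages, r = run time (s = seconds, m = minutes).
var runRE = regexp.MustCompile(`^g(\d+)m(\d+)r(\d+)([ms])`)

type run struct {
	Goroutines     int
	MaxOutstanding int
	RunTime        string
}

func parseRun(label string) (run, error) {
	m := runRE.FindStringSubmatch(label)
	if m == nil {
		return run{}, fmt.Errorf("unrecognised run label: %q", label)
	}
	g, _ := strconv.Atoi(m[1])
	mo, _ := strconv.Atoi(m[2])
	return run{Goroutines: g, MaxOutstanding: mo, RunTime: m[3] + m[4]}, nil
}

func main() {
	r, _ := parseRun("g2m2000r30s")
	fmt.Printf("%d goroutines, %d outstanding, %s\n",
		r.Goroutines, r.MaxOutstanding, r.RunTime)
}
```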

Two types of measurements were taken.

1. Based on generated output events (approximation)

This measurement comes from number of output messages inside the files generated from file output.

| Run | filebeat-8.13.1 | filebeat-8.14.0 |
|------------------|------|------|
| g1m1000r1m | 6k | |
| g1m1000r2m | 12k | |
| g2m1000r1m | 6k | |
| g2m1000r2m | 13k | |
| g1m1000r1m-qft30 | 2k | |
| g1m1600r1m | 180k | |
| g1m2000r30s | 90k | 40k |
| g1m3000r30s | 90k | |
| g2m2000r30s | 172k | 312k |
| g2m2000r2m | | 750k |
| g3m1000r2m | | 13k |
| g3m2000r30s | 202k | 499k |
| g4m2000r30s | 257k | 580k |
| g5m2000r30s | | 647k |

2. Based on values inside "Total Metrics" log message.

This metrics log message appears when Filebeat stops. These counts are more accurate than the file-output counts, and they match the GCP Pub/Sub Acked Messages metric.

| Run | filebeat-8.13.1 | filebeat-8.14.0 |
|-------------|------|------|
| g1m2000r30s | 39k | 39k |
| g2m2000r30s | 148k | 285k |
| g3m2000r30s | 205k | 501k |
| g4m2000r30s | 258k | 585k |
| g5m2000r30s | | 651k |

Results:

  1. With the current version 8.13.1, comparing g1m1000r1m and g1m1000r1m-qft30 shows that when max_outstanding_messages is set to 1000, i.e., less than the default flush.min_events, there is a clear indication of the input being blocked. This is similar to the AWS SQS issue.
    • Increasing max_outstanding_messages to at least flush.min_events is suggested to overcome this. Doing so, the performance improvement is evident in g1m1600r1m. Subsequently increasing max_outstanding_messages to 2000 in g1m2000r30s and 3000 in g1m3000r30s didn't improve ingestion performance, as flush.min_events was still set to 1600.
  2. There is a huge difference in filebeat-8.13.1 g1m2000r30s between the number of messages in the file output (90k) and the Total Metrics log message (39k). Although Total Metrics appears to be the accurate figure, this difference is unaccounted for.
  3. Ingestion grows roughly linearly with the number of goroutines in both the 8.13.1 and 8.14.0 versions, but the overall ingestion rates are much higher in 8.14.0 than in the corresponding 8.13.1 runs.
    • This suggests that having multiple beat.Clients is worth pursuing due to the increased ingestion performance.

Mutex profiles:

8.13.1.g3m2000r30s.pprof.filebeat.contentions.delay.001.pb.gz
8.14.0.g3m2000r30s.pprof.filebeat.contentions.delay.001.pb.gz

@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Jan 17, 2024
Contributor

mergify bot commented Jan 17, 2024

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @kcreddy? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v8.\d.0 is the label to automatically backport to the 8.\d branch. \d is the digit

@elasticmachine
Collaborator

💚 Build Succeeded


Build stats

  • Start Time: 2024-01-17T10:34:11.450+0000

  • Duration: 136 min 1 sec

Test stats 🧪

Test Results
Failed 0
Passed 3243
Skipped 176
Total 3419

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments


To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@elasticmachine
Collaborator

❕ Build Aborted

There is a new build on-going so the previous on-going builds have been aborted.


Build stats

  • Start Time: 2024-01-19T15:17:40.686+0000

  • Duration: 6 min 44 sec

🤖 GitHub comments


To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@elasticmachine
Collaborator

elasticmachine commented Jan 19, 2024

💚 Build Succeeded


Build stats

  • Duration: 137 min 43 sec

❕ Flaky test report

No test was executed to be analysed.

🤖 GitHub comments


To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Jan 22, 2024
Contributor

mergify bot commented Feb 5, 2024

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @kcreddy? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v8.\d.0 is the label to automatically backport to the 8.\d branch. \d is the digit

@narph narph added Team:Security-Service Integrations Security Service Integrations Team and removed Team:Security-External Integrations labels Feb 8, 2024
@elasticmachine (Collaborator)

💚 Build Succeeded

cc @kcreddy

@kcreddy kcreddy added the Filebeat Filebeat label Apr 10, 2024
@kcreddy kcreddy marked this pull request as ready for review April 10, 2024 12:25
@kcreddy kcreddy requested a review from a team as a code owner April 10, 2024 12:25
@kcreddy kcreddy requested a review from a team as a code owner April 10, 2024 12:25
@elasticmachine
Collaborator

Pinging @elastic/security-service-integrations (Team:Security-Service Integrations)

@kcreddy kcreddy requested a review from andrewkroh April 10, 2024 12:25
@cmacknz cmacknz requested a review from faec April 10, 2024 20:33
x-pack/filebeat/input/gcppubsub/config.go Outdated Show resolved Hide resolved
x-pack/filebeat/input/gcppubsub/input.go Outdated Show resolved Hide resolved
x-pack/filebeat/input/gcppubsub/input.go Outdated Show resolved Hide resolved
x-pack/filebeat/input/gcppubsub/input.go Outdated Show resolved Hide resolved
x-pack/filebeat/input/gcppubsub/input.go Outdated Show resolved Hide resolved
x-pack/filebeat/input/gcppubsub/input.go Outdated Show resolved Hide resolved
x-pack/filebeat/input/gcppubsub/pubsub_test.go Outdated Show resolved Hide resolved
@kcreddy kcreddy requested a review from efd6 April 11, 2024 16:07
Comment on lines 80 to 82
// The input gets blocked until flush.min_events or flush.timeout is reached.
// Hence max_outstanding_message has to be atleast flush.min_events to avoid this blockage.
c.Subscription.MaxOutstandingMessages = 1600
Contributor

It's unfortunate that there is not a globally visible variable/constant that could be used for this, which would ensure that this is both explained in code and robust to source mutation.

Is there a way for the input to know flush.min_events and adjust this to be max(flush.min_events, max_outstanding_messages)?

Contributor Author

Since the queue config is defined inside publisher, I don't see any quick way to get this check without a considerable refactor of Input.

Contributor

Indeed. It was a wistful desire rather than an expectation of change.

// It is not increased too high to cause high CPU usage.
c.Subscription.NumGoroutines = 2
// The input gets blocked until flush.min_events or flush.timeout is reached.
// Hence max_outstanding_message has to be atleast flush.min_events to avoid this blockage.
Contributor

Suggested change
// Hence max_outstanding_message has to be atleast flush.min_events to avoid this blockage.
// Hence max_outstanding_message has to be at least flush.min_events to avoid this blockage.

Contributor Author

updated as suggested

Contributor

mergify bot commented Apr 15, 2024

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b gcppubsub-improv upstream/gcppubsub-improv
git merge upstream/main
git push upstream gcppubsub-improv

Member

@andrewkroh andrewkroh left a comment

The code changes in this PR create N pub/sub readers based on the subscription.num_goroutines value. This overloads the meaning of this config option: it would mean that subscription.num_goroutines * subscription.num_goroutines (i.e., squared) goroutines are created.

I wonder if the same gains could be achieved by increasing the receive settings in a manner that is the same as running N readers. The pub/sub client library appears to be built to handle this parallelism for us.

For example, settings equivalent to what you were using with these code changes would be

# Multiply everything by the number of pub/sub clients you were previously creating:
sub.ReceiveSettings.NumGoroutines = in.Subscription.NumGoroutines * in.Subscription.NumGoroutines
sub.ReceiveSettings.MaxOutstandingMessages = in.Subscription.MaxOutstandingMessages * in.Subscription.NumGoroutines 

Basically, the question is whether the same gains can be achieved just by config changes.
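The equivalence argument above can be sketched as plain arithmetic. This is an illustration of the reviewer's point, not code from the PR; the struct and function names (`receiveSettings`, `scaleForClients`) are hypothetical stand-ins for the pub/sub client's ReceiveSettings:

```go
package main

import "fmt"

// receiveSettings is a stand-in for the pub/sub client's per-subscription
// receive settings that govern its internal parallelism.
type receiveSettings struct {
	NumGoroutines          int
	MaxOutstandingMessages int
}

// scaleForClients computes settings for a single reader that are equivalent
// to running numClients separate readers, each with the base settings.
func scaleForClients(base receiveSettings, numClients int) receiveSettings {
	return receiveSettings{
		NumGoroutines:          base.NumGoroutines * numClients,
		MaxOutstandingMessages: base.MaxOutstandingMessages * numClients,
	}
}

func main() {
	// g2m2000 with 2 pub/sub clients is equivalent to one client with
	// 4 goroutines and 4000 outstanding messages.
	fmt.Println(scaleForClients(receiveSettings{2, 2000}, 2)) // {4 4000}
}
```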


filebeat-8.13.1 is the latest beats version used as a baseline, and filebeat-8.14.0 is the new code containing multiple beat.Clients.

IIUC this introduces more changes into the test than is necessary. For a controlled experiment I would have expected the baseline to have been performed on some commit (e.g. v8.13.1), and then the only changes introduced for the experiment are the ones under test. It sounds like the experiment was performed with all changes between v8.13.1..v8.14.0 plus what's in this PR. (for future reference)


Increasing max_outstanding_messages to atleast flush.min_events is suggested to overcome this.

Very good. This is an important improvement to the defaults. I think it would be valuable to mention this in the pub/sub input docs.


Add multiple beat.Clients

I don't see the multiple beat.Clients in the code. It still only has one AFAICT. I compared the mutex profiles you gave me.

Before (8.13.1.g2m2000r30s.pprof.filebeat.contentions.delay.001.pb)

2024-04-15_pubsub_before

After (8.14.0.g2m2000r30s.pprof.filebeat.contentions.delay.001.pb)

2024-04-15_pubsub_after

In both before and after there appears to be some lock contention (and delay) caused by the Publish lock. Like roughly 5s delay in the 30s profile.

}()
if err != nil {
in.log.Errorw("failed to create pub/sub client: ", "error", err)
cancel()
Member

It looks like this should be calling return after cancel()?

var workerWg sync.WaitGroup

for ctx.Err() == nil {
workers, err := in.workerSem.AcquireContext(numGoRoutines, ctx)
Member

I'm not sure we should proceed with the approach of adding multiple pub/sub readers (pending resolution to previous comments), but if we do, then:

I think this can be simplified to not use a semaphore. A plain for loop (e.g. for i := 0; i < numberOfReaders; i++ { go runSinglePubSubClient() }) seems to achieve the same result, given that if any one "worker" fails they all stop due to cancel().

Contributor

mergify bot commented Apr 16, 2024

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b gcppubsub-improv upstream/gcppubsub-improv
git merge upstream/main
git push upstream gcppubsub-improv

@kcreddy
Contributor Author

kcreddy commented Apr 17, 2024

Closing the PR as per the comments: #37657 (review).

This PR aims at creating multiple pubsub clients, not multiple beat pipeline clients. Although performance benefits were observed, they come only from the additional goroutines created underneath. This was verified by increasing the existing config options alone; additional pubsub clients are not required to get this performance benefit.

The input blockage issue is resolved by increasing the default value of max_outstanding_messages in #38985.

@kcreddy kcreddy closed this Apr 17, 2024
@kcreddy kcreddy mentioned this pull request Apr 17, 2024
2 tasks
Labels
enhancement Filebeat Filebeat Team:Security-Service Integrations Security Service Integrations Team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

GCP PubSub Input Performance
6 participants