[x-pack/filebeat/netflow] implement netflow with multiple workers #40122

Merged

@pkoutsovasilis pkoutsovasilis commented Jul 5, 2024

Proposed commit message

This PR adds support for scaling the Netflow input across multiple workers. Because template definitions/options and data records are now processed in parallel, a data record can reach the decoder before its template has been processed. To handle this eventual consistency, the Netflow v9 and IPFIX decoders use a short-term LRU (Least Recently Used) cache. The cache temporarily stores events whose templates have not yet been processed, so that these events can be decoded and sent out once the corresponding template is available. Please also read the performance results. TL;DR: for outputs with just 1 worker there are no performance gains when scaling. Gains can be seen for outputs with 4 and 8 workers, but performance plateaus for higher numbers of workers.
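
To make the caching idea concrete, here is a minimal sketch of a bounded LRU that parks raw data records until their template shows up. It is an illustration only, not the code in this PR: the names `pendingCache`, `Add`, and `Take`, the choice of the template ID as the key, and the capacity-based eviction are all assumptions made for the example.

```go
package main

import (
	"container/list"
	"fmt"
	"sync"
)

// pendingEntry groups the raw records waiting for one template ID.
type pendingEntry struct {
	templateID uint16
	records    [][]byte
}

// pendingCache is a hypothetical bounded LRU of data records that arrived
// before their template. It is safe for use by several workers.
type pendingCache struct {
	mu       sync.Mutex
	capacity int
	order    *list.List               // front = most recently used
	byID     map[uint16]*list.Element // template ID -> element in order
}

func newPendingCache(capacity int) *pendingCache {
	return &pendingCache{
		capacity: capacity,
		order:    list.New(),
		byID:     make(map[uint16]*list.Element),
	}
}

// Add parks a raw record under its template ID and evicts the least recently
// used template ID once more than capacity IDs are tracked.
func (c *pendingCache) Add(templateID uint16, record []byte) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if el, ok := c.byID[templateID]; ok {
		entry := el.Value.(*pendingEntry)
		entry.records = append(entry.records, record)
		c.order.MoveToFront(el)
		return
	}
	entry := &pendingEntry{templateID: templateID, records: [][]byte{record}}
	c.byID[templateID] = c.order.PushFront(entry)
	if c.order.Len() > c.capacity {
		oldest := c.order.Back()
		evicted := c.order.Remove(oldest).(*pendingEntry)
		delete(c.byID, evicted.templateID)
	}
}

// Take removes and returns the records parked for templateID, or nil.
// A decoder would call it right after storing a newly received template.
func (c *pendingCache) Take(templateID uint16) [][]byte {
	c.mu.Lock()
	defer c.mu.Unlock()
	el, ok := c.byID[templateID]
	if !ok {
		return nil
	}
	delete(c.byID, templateID)
	return c.order.Remove(el).(*pendingEntry).records
}

func main() {
	cache := newPendingCache(2)
	cache.Add(256, []byte("rec-a"))
	cache.Add(256, []byte("rec-b"))
	cache.Add(257, []byte("rec-c"))
	cache.Add(258, []byte("rec-d")) // evicts ID 256, the least recently used
	fmt.Println(len(cache.Take(256)), len(cache.Take(257)), len(cache.Take(258)))
}
```

Bounding the cache means records whose template never arrives are eventually discarded instead of accumulating in memory.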

Data Flow:

flowchart TD
    classDef default line-height:1.5,text-align:center;
    U([UDP Receiver])
    Q[(Buffered Channel)]
    N["Netflow Decoder<br>(same instance across workers)"]
    IQ[(Internal Memory Queue)]
    CP["Pipeline Client<br>(dedicated client per worker)"]
    O(Output)

    U -->|push| C{Channel Full?}
    C --> Q
    Q -->|read| N
    subgraph input[M input workers]
        N --> CP
    end
    style input stroke:#f66,stroke-width:2px,color:#fff,stroke-dasharray: 5 5
    CP --> IQ
    C --> D([Drop])
    IQ --> O
    subgraph N output workers
        O
    end
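
The front half of the flow above (push into a buffered channel, drop when it is full, several workers reading from it) can be sketched roughly as follows. The function names `enqueue` and `runWorkers` are made up for this example, and the real input hands decoded flows to a per-worker pipeline client rather than to a plain callback.

```go
package main

import (
	"fmt"
	"sync"
)

// enqueue attempts a non-blocking send. It reports false when the buffer is
// full, in which case the caller drops the packet.
func enqueue(queue chan<- []byte, pkt []byte) bool {
	select {
	case queue <- pkt:
		return true
	default:
		return false
	}
}

// runWorkers starts n goroutines that drain queue and pass every packet to
// handle. The returned WaitGroup is done once queue is closed and drained.
func runWorkers(n int, queue <-chan []byte, handle func(worker int, pkt []byte)) *sync.WaitGroup {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for pkt := range queue {
				handle(id, pkt)
			}
		}(i)
	}
	return &wg
}

func main() {
	queue := make(chan []byte, 4)

	var mu sync.Mutex
	processed := 0
	wg := runWorkers(3, queue, func(_ int, _ []byte) {
		mu.Lock()
		processed++
		mu.Unlock()
	})

	dropped := 0
	for i := 0; i < 100; i++ {
		if !enqueue(queue, []byte{byte(i)}) {
			dropped++
		}
	}
	close(queue)
	wg.Wait()

	// The split between processed and dropped depends on scheduling.
	fmt.Println("processed:", processed, "dropped:", dropped)
}
```

The non-blocking send keeps the receiver from stalling on slow workers, at the cost of losing packets once the buffer fills up.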

Performance results:

The performance of this PR was evaluated using a local Elasticsearch cluster started with mage docker:composeUp (to minimize network latency), with the OS UDP receive buffer limits increased to guarantee no packet drops at this level:

sudo sysctl -w net.core.rmem_max=26214400
sudo sysctl -w net.core.rmem_default=26214400
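
Raising `net.core.rmem_max` only lifts the ceiling; an application can additionally request a specific receive buffer size on its own socket. The sketch below shows that step with the standard library's `(*net.UDPConn).SetReadBuffer`. The helper name `listenUDP` is hypothetical, and this is not a description of how the Netflow input configures its socket.

```go
package main

import (
	"fmt"
	"net"
)

// listenUDP opens a UDP socket on addr and asks the kernel for a receive
// buffer of readBuffer bytes. On Linux the effective size is capped by
// net.core.rmem_max.
func listenUDP(addr string, readBuffer int) (*net.UDPConn, error) {
	udpAddr, err := net.ResolveUDPAddr("udp", addr)
	if err != nil {
		return nil, fmt.Errorf("resolve %q: %w", addr, err)
	}
	conn, err := net.ListenUDP("udp", udpAddr)
	if err != nil {
		return nil, fmt.Errorf("listen on %q: %w", addr, err)
	}
	if err := conn.SetReadBuffer(readBuffer); err != nil {
		conn.Close()
		return nil, fmt.Errorf("set read buffer: %w", err)
	}
	return conn, nil
}

func main() {
	// Port 0 lets the OS pick a free port; 26214400 matches the sysctl value.
	conn, err := listenUDP("127.0.0.1:0", 26214400)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	defer conn.Close()
	fmt.Println("listening on", conn.LocalAddr())
}
```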

The image below, titled "Netflow Performance [15000 packets/sec, 100 Input workers, scaling Output workers]," shows performance with a fixed number of 100 input workers while varying the number of output workers. The x-axis represents time over 16 seconds, and the y-axis shows the total flows published as reported by Elasticsearch. The scenarios include a mocked pipeline for maximum performance and several real pipeline configurations with 100 netflow workers paired with 1, 4, 8, 16, and 32 output workers. The image also includes results for the netflow implementation prior to this PR under similar conditions. The mocked pipeline, as expected, achieves the highest performance and serves as an upper bound. With the real pipeline and 1 output worker, the existing implementation and the scaling one introduced by this PR perform the same. Performance improves noticeably as the number of output workers increases up to 8. Beyond 8 output workers (16 and 32), there are no apparent additional gains.

[Image: Netflow Performance, 15000 packets/sec, 100 Input workers, scaling Output workers]

The next image, titled "Netflow Performance [15000 packets/sec, 32 Output workers, scaling Input workers]," shows the performance of the Netflow input with varying numbers of input workers while keeping the number of output workers constant at 32. I conducted this experiment to validate that the plateau observed in the previous image for 8 or more output workers is not caused by the input rate being maxed out. Once again, the x-axis represents time over 16 seconds, and the y-axis shows the total flows published as reported by Elasticsearch. The graph includes a mocked pipeline, which serves as the benchmark for maximum performance with zero publish overhead, and three real pipeline configurations with 100, 200, and 300 netflow workers respectively. Scaling input workers does not provide any noticeable performance gains, which suggests that factors other than the number of netflow workers are limiting performance.

[Image: Netflow Performance, 15000 packets/sec, 32 Output workers, scaling Input workers]

The take-home point is that for outputs with just 1 worker there are no performance gains when scaling. The gains of scaling can be seen for outputs with 4 and 8 workers, but performance plateaus beyond that. More details about the reasons for this plateau can be found here

Testing:

As shown in the pictures below, the effectiveness of the LRU cache introduced for the Netflow v9 and IPFIX decoders is tested with a new pcap file, 'ipfix_cisco.reversed.pcap', which holds the same packets as 'ipfix_cisco.pcap' but in reverse order: data records come first and the template records last.

[Images: ipfix_cisco pcap, ipfix_cisco reversed pcap]
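
The property this test exercises can be illustrated with a toy decoder: replaying the same packets in reverse order should still yield every flow. Everything below is a simplified stand-in and not the PR's test or decoder. The `packet` and `decoder` types are invented for the example, and an unbounded map takes the place of the short-term LRU cache.

```go
package main

import "fmt"

// packet is a toy stand-in for a template record or a data record.
type packet struct {
	templateID uint16
	isTemplate bool
	payload    string
}

// decoder remembers known templates and parks data records that arrive early.
type decoder struct {
	templates map[uint16]bool
	pending   map[uint16][]string
}

func newDecoder() *decoder {
	return &decoder{
		templates: make(map[uint16]bool),
		pending:   make(map[uint16][]string),
	}
}

// Decode returns the events that can be emitted after seeing p.
func (d *decoder) Decode(p packet) []string {
	if p.isTemplate {
		d.templates[p.templateID] = true
		ready := d.pending[p.templateID]
		delete(d.pending, p.templateID)
		return ready
	}
	if !d.templates[p.templateID] {
		d.pending[p.templateID] = append(d.pending[p.templateID], p.payload)
		return nil
	}
	return []string{p.payload}
}

// decodeAll feeds packets to a fresh decoder and collects all emitted events.
func decodeAll(packets []packet) []string {
	d := newDecoder()
	var events []string
	for _, p := range packets {
		events = append(events, d.Decode(p)...)
	}
	return events
}

// reversed returns a copy of packets in reverse order.
func reversed(packets []packet) []packet {
	out := make([]packet, len(packets))
	for i, p := range packets {
		out[len(packets)-1-i] = p
	}
	return out
}

func main() {
	capture := []packet{
		{templateID: 256, isTemplate: true},
		{templateID: 256, payload: "flow-1"},
		{templateID: 256, payload: "flow-2"},
	}
	fmt.Println(len(decodeAll(capture)), len(decodeAll(reversed(capture))))
}
```

Both orderings produce two events here; only the emission order differs, since the parked records are released when the template is finally seen.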

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Disruptive User Impact

N/A

Author's Checklist

N/A

How to test this PR locally

cd x-pack/filebeat
mage goUnitTest
mage goIntegTest

Related issues

Use cases

N/A

Screenshots

N/A

Logs

N/A

@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Jul 5, 2024

mergify bot commented Jul 5, 2024

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @pkoutsovasilis? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v8./d.0 is the label to automatically backport to the 8./d branch. /d is the digit

@pkoutsovasilis pkoutsovasilis changed the title feat: implement netflow with multiple workers [do not merge] implement netflow with multiple workers Jul 5, 2024
@pkoutsovasilis pkoutsovasilis added enhancement Team:Security-Deployment and Devices Deployment and Devices Team in Security Solution labels Jul 5, 2024
@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Jul 5, 2024
@pkoutsovasilis pkoutsovasilis force-pushed the pkoutsovasilis/scale_netflow branch from 2121465 to 78c69b9 Compare July 5, 2024 15:48
@pkoutsovasilis pkoutsovasilis changed the title [do not merge] implement netflow with multiple workers [x-pack/filebeat/netflow] implement netflow with multiple workers Jul 8, 2024
@pkoutsovasilis pkoutsovasilis marked this pull request as ready for review July 8, 2024 09:12
@pkoutsovasilis pkoutsovasilis requested a review from a team as a code owner July 8, 2024 09:12
@elasticmachine

Pinging @elastic/sec-deployment-and-devices (Team:Security-Deployment and Devices)

@pkoutsovasilis pkoutsovasilis added the Filebeat Filebeat label Jul 8, 2024
@pkoutsovasilis

run docs-build

@pkoutsovasilis

pkoutsovasilis commented Jul 8, 2024

Commit c0dbb72 fixes an issue where the slice backing the LRU could keep growing; this was brought to my attention offline by @aleksmaus; ty 🙏


@andrewkroh andrewkroh left a comment

Great write up and charts

x-pack/filebeat/input/netflow/config.go (outdated, resolved)
x-pack/filebeat/input/netflow/decoder/config/config.go (outdated, resolved)
x-pack/filebeat/input/netflow/config.go (outdated, resolved)
x-pack/filebeat/input/netflow/decoder/v9/lru.go (outdated, resolved)
x-pack/filebeat/input/netflow/decoder/v9/lru.go (outdated, resolved)
@pkoutsovasilis

run docs-build

@pkoutsovasilis

@andrewkroh @aleksmaus any more feedback on this PR? 🙂 From my experimental analysis I can see the benefits of scaling where we can sustain 5000 packets/sec with 4 workers under certain hardware and network characteristics.

x-pack/filebeat/input/netflow/decoder/v9/lru.go (outdated, resolved)
x-pack/filebeat/input/netflow/decoder/v9/lru.go (outdated, resolved)
x-pack/filebeat/input/netflow/decoder/v9/v9.go (outdated, resolved)
x-pack/filebeat/input/netflow/decoder/v9/lru.go (outdated, resolved)
x-pack/filebeat/input/netflow/decoder/v9/v9.go (outdated, resolved)

@aleksmaus aleksmaus left a comment

Unblocking. Would appreciate if @andrewkroh takes a look as well

@pkoutsovasilis

pkoutsovasilis commented Jul 25, 2024

Unblocking. Would appreciate if @andrewkroh takes a look as well

ty @aleksmaus! I would also appreciate it if @andrewkroh takes a look as well 🙂


@andrewkroh andrewkroh left a comment

Nothing new from me. Great work!

x-pack/filebeat/input/netflow/input.go (resolved)
x-pack/filebeat/docs/inputs/input-netflow.asciidoc (outdated, resolved)
@pkoutsovasilis pkoutsovasilis merged commit 6c400f1 into elastic:main Jul 26, 2024
19 checks passed
@pkoutsovasilis pkoutsovasilis deleted the pkoutsovasilis/scale_netflow branch July 26, 2024 13:25
@andrewkroh

The fleet netflow package will need updated to expose workers configuration.

@pkoutsovasilis

The fleet netflow package will need updated to expose workers configuration.

yy I had that in mind to do that but thx for the reminder @andrewkroh
