Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

enhancement(socket sink): support unix datagram mode #21762

Conversation

jpovixwm
Copy link
Contributor

@jpovixwm jpovixwm commented Nov 11, 2024

Summary

Adds a unix_mode option to the socket sink.
The implementation is somewhat based on how the statsd sink handles this, but not to the full extent, as that felt like too big of a refactor for me to deal with.

Change Type

  • Bug fix
  • New feature
  • Non-functional (chore, refactoring, docs)
  • Performance

Is this a breaking change?

  • Yes
  • No

How did you test this PR?

make test on x86_64 Ubuntu 22.04 running within WSL, and then

cargo test --lib --no-default-features --features=sinks-socket --target=x86_64-pc-windows-gnu

to ensure I didn't forget any of the #[cfg(unix)] attributes.
Other than this, I also performed some basic tests with a trivial UDS listener implemented in Python.

sources:
  dummy_logs:
    type: "demo_logs"
    format: "syslog"
    count: 3

sinks:
  emit_console:
    inputs:
      - "dummy_logs"
    type: "console"
    encoding:
      codec: "text"
  emit_syslog:
    inputs:
      - "dummy_logs"
    type: "socket"
    mode: "unix"
    # address: "127.0.0.1:5006"
    unix_mode: "Datagram"
    path: "/tmp/test_socket"
    encoding:
      codec: "text"
    framing:
      method: "bytes"

Listener:
nc -lkuU /tmp/test_socket (from package netcat-openbsd)

Does this PR include user facing changes?

  • Yes. Please add a changelog fragment based on our guidelines.
  • No. A maintainer will apply the "no-changelog" label to this PR.

Checklist

  • Please read our Vector contributor resources.
  • If this PR introduces changes Vector dependencies (modifies Cargo.lock), please
    run dd-rust-license-tool write to regenerate the license inventory and commit the changes (if any). More details here.

References

@jpovixwm jpovixwm requested review from a team as code owners November 11, 2024 18:18
@github-actions github-actions bot added domain: sinks Anything related to the Vector's sinks domain: external docs Anything related to Vector's external, public documentation labels Nov 11, 2024
Copy link
Member

@pront pront left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/ci-run-component-features

Copy link
Member

@pront pront left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @jpovixwm, I did a quick pass. Left some suggestions.

src/sinks/util/unix.rs Outdated Show resolved Hide resolved
src/sinks/util/unix.rs Outdated Show resolved Hide resolved
src/sinks/util/unix.rs Outdated Show resolved Hide resolved
src/sinks/util/unix.rs Outdated Show resolved Hide resolved
Copy link
Member

@pront pront left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jpovixwm!

@jpovixwm
Copy link
Contributor Author

@pront I'm not sure how to best resolve the build errors.
The easy part is to un-gate UnixSocketSendError in src/internal_events/unix.rs, similar to how UnixSendIncompleteError is already not gated, as the impact of this should be minimal.

But then we can either:

  1. Trivially relax the gate
    #[cfg(all(any(feature = "sinks-socket", feature = "sinks-statsd"), unix))]
    pub mod unix;
    so that it becomes just #[cfg(unix)], which I guess will make the binary a bit bigger than necessary for builds without either of those sinks enabled.
  2. Move the UnixEither enum back where it came from, and make it public, so that it can be used in src/sinks/util/unix.rs
  3. Or "Find & replace" all occurrences of #[cfg(unix)] (12 hits) with #[cfg(all(unix, any(feature = "sinks-socket", feature = "sinks-statsd")))] in src/sinks/util/service/net/mod.rs, which is kind of ugly, but does appear to work.

@pront
Copy link
Member

pront commented Nov 14, 2024

@pront I'm not sure how to best resolve the build errors. The easy part is to un-gate UnixSocketSendError in src/internal_events/unix.rs, similar to how UnixSendIncompleteError is already not gated, as the impact of this should be minimal.

But then we can either:

  1. Trivially relax the gate
    #[cfg(all(any(feature = "sinks-socket", feature = "sinks-statsd"), unix))]
    pub mod unix;

    so that it becomes just #[cfg(unix)], which I guess will make the binary a bit bigger than necessary for builds without either of those sinks enabled.
  2. Move the UnixEither enum back where it came from, and make it public, so that it can be used in src/sinks/util/unix.rs
  3. Or "Find & replace" all occurrences of #[cfg(unix)] (12 hits) with #[cfg(all(unix, any(feature = "sinks-socket", feature = "sinks-statsd")))] in src/sinks/util/service/net/mod.rs, which is kind of ugly, but does appear to work.

@jpovixwm I am OK with (1)

@jpovixwm
Copy link
Contributor Author

Verified locally with

cargo test --lib --no-default-features --features=sinks-amqp
cargo test
cargo test --lib --no-default-features --features=sinks-socket --target=x86_64-pc-windows-gnu

Hope it works this time🤞

Copy link
Member

@pront pront left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/ci-run-component-features

@pront
Copy link
Member

pront commented Nov 15, 2024

/ci-run-component-features

Turns out that the above workflow doesn't run on fork branches.

I manually created: https://github.com/vectordotdev/vector/actions/runs/11860591369
If it passes, we can go ahead and merge this PR.

@pront pront enabled auto-merge November 15, 2024 21:48
@pront pront added this pull request to the merge queue Nov 15, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Nov 15, 2024
@jpovixwm
Copy link
Contributor Author

I'm unable to reproduce the unit test failure on my machine, which makes it more difficult for me to fix it.
Here's a potential modification that still passes locally for me:

let receiver = std::os::unix::net::UnixDatagram::bind(out_path.clone()).unwrap();

let handle = tokio::task::spawn_blocking(move || {
    let mut output_lines = Vec::<String>::with_capacity(num_lines);

    for _ in 0..num_lines {
        let mut buf = [0; 100];
        let (size, _) = receiver
            .recv_from(&mut buf)
            .expect("Did not receive message");
        let line = String::from_utf8_lossy(&buf[..size]).to_string();
        output_lines.push(line);
    }

    output_lines
});

Tokio's UnixDatagram is replaced with the stdlib one, so that we can do a blocking receive (as spawn_blocking doesn't allow async functions), in the hopes that the receiver will be able to keep up with the sender if it gets to run in its own thread.
Should I commit this?

@pront
Copy link
Member

pront commented Nov 18, 2024

Let me check this out locally and run the failing check.

@pront
Copy link
Member

pront commented Nov 18, 2024

Edited: Hmm, basic_unix_sink hangs on jpovixwm:feature/5269-support-unix-datagram-mode-in-socket-sink on my macOS machine. I don't have time to dive into right now, but this PR shouldn't affect existing behavior at all.

@jpovixwm
Copy link
Contributor Author

@pront can you tell me the invocation that fails for you locally?

@pront
Copy link
Member

pront commented Nov 18, 2024

@pront can you tell me the invocation that fails for you locally?

You can do:

# from the repo root
cd src/sinks/util 
cargo test

I did:

# ide
cargo test --color=always --package vector --lib sinks::util::unix::tests::basic_unix_datagram_sink --no-fail-fast --config env.RUSTC_BOOTSTRAP=\"1\" -- --format=json --exact -Z unstable-options --show-output

@jpovixwm
Copy link
Contributor Author

You can do:

# from the repo root
cd src/sinks/util 
cargo test
user@hostname:~/vector/src/sinks/util$ git status && git rev-parse HEAD
On branch feature/5269-support-unix-datagram-mode-in-socket-sink
Your branch is up to date with 'remotes/jpovixwm/feature/5269-support-unix-datagram-mode-in-socket-sink'.

nothing to commit, working tree clean
543d33f393a294342fda20beca8040020840d4ca
user@hostname:~/vector/src/sinks/util$ cargo test
    Finished `test` profile [unoptimized + debuginfo] target(s) in 2.21s
     Running unittests src/lib.rs (/home/user/vector/target/debug/deps/vector-8dfd964da1bd40f7)

running 1638 tests
test api::schema::components::sink::tests::sort_component_id_asc ... ok
[...]
test transforms::log_to_metric::tests::parse_failure ... ok

test result: ok. 1635 passed; 0 failed; 3 ignored; 0 measured; 0 filtered out; finished in 19.14s

     Running tests/e2e/mod.rs (/home/user/vector/target/debug/deps/e2e-6fdf1afbfae3c336)

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

     Running tests/integration/lib.rs (/home/user/vector/target/debug/deps/integration-fc9efbfa10dceb61)

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

   Doc-tests vector

running 2 tests
test src/expiring_hash_map.rs - expiring_hash_map::ExpiringHashMap<K,V>::next_expired (line 141) ... ignored
test src/expiring_hash_map.rs - expiring_hash_map::ExpiringHashMap<K,V>::next_expired (line 158) ... ok

test result: ok. 1 passed; 0 failed; 1 ignored; 0 measured; 0 filtered out; finished in 42.10s

I did:

# ide
cargo test --color=always --package vector --lib sinks::util::unix::tests::basic_unix_datagram_sink --no-fail-fast --config env.RUSTC_BOOTSTRAP=\"1\" -- --format=json --exact -Z unstable-options --show-output
user@hostname:~/vector/src/sinks/util$ cargo test --color=always --package vector --lib sinks::util::unix::tests::basic_unix_datagram_sink --no-fail-fast --config env.RUSTC_BOOTSTRAP=\"1\" -- --format=json --exact -Z unstable-options --show-output
    Finished `test` profile [unoptimized + debuginfo] target(s) in 0.42s
     Running unittests src/lib.rs (/home/user/vector/target/debug/deps/vector-8dfd964da1bd40f7)
{ "type": "suite", "event": "started", "test_count": 1 }
{ "type": "test", "event": "started", "name": "sinks::util::unix::tests::basic_unix_datagram_sink" }
{ "type": "test", "name": "sinks::util::unix::tests::basic_unix_datagram_sink", "event": "ok" }
{ "type": "suite", "event": "ok", "passed": 1, "failed": 0, "ignored": 0, "measured": 0, "filtered_out": 1637, "exec_time": 0.056496991 }

"Works on my machine". I'm afraid there's not much I can do on my end. I guess I could try it with a proper Linux VM if I can find the time.
FYI:

user@hostname:~/vector$ uname -r
5.15.153.1-microsoft-standard-WSL2
user@hostname:~/vector$ lsb_release -d
Description:    Ubuntu 22.04.5 LTS

So maybe my setup is a bit exotic.

@pront
Copy link
Member

pront commented Nov 19, 2024

Hope this helps:

cargo nextest run --retries 10000 --test-threads 12 -p vector
.
.
.
     Summary [ 134.802s] 1621 tests run: 1621 passed (1 flaky), 3 skipped
FLAKY 2/10001 [   0.348s] vector sinks::util::unix::tests::basic_unix_datagram_sink

Which means the test is less flakey, but still flakey.
I think it's worth pursuing the channel synchronization approach.

@jpovixwm
Copy link
Contributor Author

I'd also say that the buffer should be larger than 100 bytes (even just 101 bytes should be alright) so that in the future it can catch a potential regression where some additional garbage data is sent to the socket after the expected payload. For example, if you replace the bytes framer with a newline delimited one, there will be a newline character after the first 100 expected bytes.

@jpovixwm
Copy link
Contributor Author

It's not flaky for me, so I'm essentially fixing it blind, but still:

user@hostname:~/vector$ git status && git rev-parse HEAD
On branch feature/5269-support-unix-datagram-mode-in-socket-sink
Your branch is ahead of 'remotes/jpovixwm/feature/5269-support-unix-datagram-mode-in-socket-sink' by 1 commit.
  (use "git push" to publish your local commits)

nothing to commit, working tree clean
09dd31b6da377cffbaaabf407c8f036e3881ee5a
user@hostname:~/vector$ rust-flaker vector basic_unix_datagram_sink
    Finished `test` profile [unoptimized + debuginfo] target(s) in 0.39s
Found test binary: target/debug/deps/vector-390aa0364275b128
Running test once...

running 1 test
test sinks::util::unix::tests::basic_unix_datagram_sink ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 1637 filtered out; finished in 0.09s

Running test repeatedly...
1 runs so far, 0 failures (100.00% pass rate). 153.486541ms avg, 153.486541ms max, 153.486541ms min
[...]
10201 runs so far, 0 failures (100.00% pass rate). 145.68302ms avg, 227.653805ms max, 114.913275ms min

FYI: rust-flaker

@pront
Copy link
Member

pront commented Nov 19, 2024

TRY 1 TRMNTG [>120.000s] vector sinks::util::unix::tests::basic_unix_datagram_sink
   TRY 1 TMT [ 120.014s] vector sinks::util::unix::tests::basic_unix_datagram_sink

--- TRY 1 STDOUT:        vector sinks::util::unix::tests::basic_unix_datagram_sink ---

running 1 test
2024-11-19T21:20:40.988802Z ERROR vector::internal_events::unix: Unix socket send error. error=No buffer space available (os error 55) path="/var/folders/b4/11qq92411398933v4925cq380000gn/T/.tmp1sxztf/unix_datagram_test" error_type="writer_failed" stage="sending" internal_log_rate_limit=true
2024-11-19T21:20:40.988903Z ERROR vector_common::internal_event::component_events_dropped: Events dropped intentional=false count=1 reason="Unix socket send error." internal_log_rate_limit=true
test sinks::util::unix::tests::basic_unix_datagram_sink has been running for over 60 seconds

   RETRY 2/4 [         ] vector sinks::util::unix::tests::basic_unix_datagram_sink
  TRY 2 PASS [   0.076s] vector sinks::util::unix::tests::basic_unix_datagram_sink
------------
     Summary [ 135.078s] 1621 tests run: 1621 passed (1 flaky), 3 skipped
   FLAKY 2/4 [   0.076s] vector sinks::util::unix::tests::basic_unix_datagram_sink

Sorry, this is still flakey and not ready to be merged. Which is a shame, since this is a great PR.

There is an interesting trace if you are motivated to investigate:

2024-11-19T21:20:40.988802Z ERROR vector::internal_events::unix: Unix socket send error. error=No buffer space available (os error 55) path="/var/folders/b4/11qq92411398933v4925cq380000gn/T/.tmp1sxztf/unix_datagram_test" error_type="writer_failed" stage="sending" internal_log_rate_limit=true

Otherwise, we can leave this open for me or someone else who has access to a macOS machine to pick it up in the future since it's impossible to iterate on it without one.

@jpovixwm
Copy link
Contributor Author

It's what I suspected earlier: joshtriplett/io-mux#7
The send semantics are just different between Linux and other non-Linux unices. Linux blocks when the queue is full, but BSD and macOS return an error.
Considering vector's reliability, I think the proper thing to do would be to change the sink implementation itself (rather than tweaking the unit test) so that it can handle this case internally and buffer the events instead of outright dropping them. But I have no idea how to make these send calls block on non-Linux unices (and judging by the referenced issue, I'm not alone).
So maybe the sink could catch this particular error and infinitely retry sending the event with exponential backoff. Each successful send would then reset the backoff timeout.

@pront
Copy link
Member

pront commented Nov 22, 2024

So maybe the sink could catch this particular error and infinitely retry sending the event with exponential backoff. Each successful send would then reset the backoff timeout.

Are you willing to dig into this?

An alternative is to document this big difference in semantics with macOS and exclude it with #[cfg(not(target_os = "macos"))].

@jpovixwm
Copy link
Contributor Author

jpovixwm commented Nov 23, 2024

Are you willing to dig into this?

Yeah. I looked into it today and got this locally:

let sent = match socket {
    DatagramSocket::Udp(udp) => udp.send(buf).await?,
    #[cfg(unix)]
    DatagramSocket::Unix(uds, _) => {
        let mut backoff = ExponentialBackoff::from_millis(2)
            .factor(5)
            .max_delay(Duration::from_secs(60));
        loop {
            match uds.send(buf).await {
                Ok(sent) => break sent,
                Err(error) if error.raw_os_error() == Some(libc::ENOBUFS) => {
                    let delay = backoff.next().unwrap();
                    warn!(
                        message = "Throttling due to ENOBUFS.",
                        delay_secs = delay.as_secs_f64(),
                        internal_log_rate_limit = true
                    );
                    sleep(delay).await
                }
                Err(error) => return Err(error),
            }
        }
    }
};

I'm testing my changes in FreeBSD 14 via Vagrant by building Vector like this:

cross test --release --target x86_64-unknown-freebsd --no-default-features --features sinks-socket --lib --no-run

(This won't build out of the box due to the dependency on heim-disk, so I had to remove a bunch of code from src/config/validation.rs which checks whether there's enough storage space on the host machine).

Without this explicit handling of ENOBUFS, I was able to reproduce the flakiness we saw earlier on macOS. But with the loop in place, the unit test reliably succeeds.
I also tried increasing the size of individual messages to see if this could potentially get stuck in the loop forever, but in the case of too big messages I'm getting EMSGSIZE which does break out of the loop.

I have yet to check whether this issue also affects UDP. If it does, I think this retrying logic should be applied to both UDP and Unix Datagram.

BTW, do you think the number of ENOBUFS' encountered should be exposed as a metric from the sink, or is it enough to just log a warning? (the current message is provisional, I'll replace it with something clearer)

@pront
Copy link
Member

pront commented Nov 25, 2024

Are you willing to dig into this?

Yeah. I looked into it today ...

First of all, thanks!

Secondly, with these changes, we now have a lot of new scope in this PR. What do you think about targeting supporting this new unix_mode only on unix + not macOS for this PR and then following up in another PR with the changes you outlined?

Without this explicit handling of ENOBUFS, I was able to reproduce the flakiness we saw earlier on macOS. But with the loop in place, the unit test reliably succeeds. I also tried increasing the size of individual messages to see if this could potentially get stuck in the loop forever, but in the case of too big messages I'm getting EMSGSIZE which does break out of the loop.

👍

BTW, do you think the number of ENOBUFS' encountered should be exposed as a metric from the sink, or is it enough to just log a warning? (the current message is provisional, I'll replace it with something clearer)

Yes, good idea. I can see this as something a user would like to monitor.

However, I am happy to just wait for you finalize everything in PR and then review again. Up to you.

@jpovixwm
Copy link
Contributor Author

What do you think about targeting supporting this new unix_mode only on unix + not macOS for this PR and then following up in another PR with the changes you outlined?

I'm OK with that. Though, to be fair, the follow-up PR may never come (from me, anyway) because I don't really have anything to gain from this feature being available on macOS.

I've just pushed b648cfe containing minimal changes to disable this feature on macOS. Regarding the unit tests, I only gated basic_unix_datagram_sink because it seems to be the only one that's flaky on macOS, so the remaining tests shouldn't do any harm.

Lastly, on UDP: I didn't manage to get a single ENOBUFS on FreeBSD when I tried that. Instead, the packets were silently dropped, resulting in equally flaky tests but with no obvious way forward. But for UDP that's probably acceptable, because it should be fairly obvious to anyone that UDP doesn't make any guarantees.

@pront
Copy link
Member

pront commented Nov 27, 2024

I'm OK with that. Though, to be fair, the follow-up PR may never come (from me, anyway) because I don't really have anything to gain from this feature being available on macOS.

That's absolutely fine 👍

I am going to run the Component Features - Linux job to ensure the feature gates are OK. Otherwise this looks good to go.

@pront
Copy link
Member

pront commented Nov 27, 2024

@pront pront added this pull request to the merge queue Nov 27, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Nov 27, 2024
@jszwedko jszwedko added this pull request to the merge queue Dec 18, 2024
Merged via the queue into vectordotdev:master with commit 7c6d0c9 Dec 19, 2024
41 checks passed
@jszwedko
Copy link
Member

jszwedko commented Jan 2, 2025

I missed this before, but I think we should have matched the socket source here and, instead of having an additional unix_mode field, instead added additional enum values to the existing mode option.

@pront pront self-assigned this Jan 2, 2025
@pront
Copy link
Member

pront commented Jan 2, 2025

I missed this before, but I think we should have matched the socket source here and, instead of having an additional unix_mode field, instead added additional enums to the mode option.

Ah thank you @jszwedko for catching this!

@jpovixwm would you be open to do this change please? Our resources are quite limited at the moment and I am not sure when we can get to fixing this ourselves.

@pront
Copy link
Member

pront commented Jan 2, 2025

Also, FYI the next release is scheduled for Monday Jan 13th.

github-merge-queue bot pushed a commit that referenced this pull request Jan 3, 2025
Revert "enhancement(socket sink): support unix datagram mode (#21762)"

This reverts commit 7c6d0c9.

Co-authored-by: Jesse Szwedko <[email protected]>
titaneric pushed a commit to titaneric/vector that referenced this pull request Jan 15, 2025
)

* enhancement(socket sink): support unix datagram mode

* 5269_support_unix_datagram_mode_in_socket_sink.enhancement.md: fix linter error

* sinks/util/{udp,unix}.rs: abstract out common logic into sinks/util/datagram.rs

* sinks/util/service/net/unix: use sinks/util/unix/UnixEither and move impls there

* remove problematic feature gates for 'sinks-socket' and 'sinks-statsd'

* use std type and spawn blocking

* basic_unix_datagram_sink: attempt to reduce flakiness

* socket sink: ignore unix_mode on macOS

---------

Co-authored-by: Pavlos Rontidis <[email protected]>
titaneric pushed a commit to titaneric/vector that referenced this pull request Jan 15, 2025
…ev#22113)

Revert "enhancement(socket sink): support unix datagram mode (vectordotdev#21762)"

This reverts commit 7c6d0c9.

Co-authored-by: Jesse Szwedko <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain: external docs Anything related to Vector's external, public documentation domain: sinks Anything related to the Vector's sinks
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add support for UNIX datagram mode to socket sink
4 participants