
Memory leak with kafka source, http sink and 429 code responses #17276

Closed

Ilmarii opened this issue May 2, 2023 · 24 comments · Fixed by #17380
Labels
type: bug A code related bug.

Comments

@Ilmarii
Contributor

Ilmarii commented May 2, 2023

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

We have a configuration with a kafka source and an http sink, with acknowledgements enabled. It works fine while the http sink receives only successful responses, but once the sink starts receiving 429 responses, memory grows until Vector is killed by the OOM killer.

I also tried turning on allocation tracing, and the graphs suggest there is a memory leak in the kafka source.

Some graphs:

[Graph: memory usage from the Vector node]

[Graph: internal allocation tracing, all components]

[Graph: internal allocation tracing, only the kafka source]

[Graph: rate of http sink requests by response code]

Configuration

data_dir: /tmp/vector-data-dir
api:
  enabled: true
  address: "127.0.0.1:8686"
  playground: true
log_schema:
  host_key: host
  message_key: message
  source_type_key: source_type
  timestamp_key: timestamp
sources:
  main_input_kafka_src:
    type: kafka
    bootstrap_servers: "bootstrap.brokers.kafka:443"
    group_id: vector_test_local
    auto_offset_reset: earliest
    topics:
      - test-topic
    librdkafka_options:
      fetch.message.max.bytes: "10485760" # 10 MB
      fetch.max.bytes: "104857600" # 100 MB
sinks:
  test_sink:
    type: "http"
    inputs:
      - main_input_kafka_src
    uri: "http://localhost:8000"
    method: "post"
    acknowledgements:
      enabled: true
    buffer:
      type: "memory"
      max_events: 100000
      when_full: "block"
    batch:
      max_bytes: 20971520
      timeout_secs: 10
    encoding:
      codec: "json"
    request:
      concurrency: 250

Version

0.29.1

Debug Output

2023-05-02T05:33:20.934810Z  WARN sink{component_kind="sink" component_id=test_sink component_type=http component_name=test_sink}:request{request_id=817}: vector::sinks::util::retries: Retrying after response. reason=too many requests internal_log_rate_limit=true
2023-05-02T05:33:21.419138Z  WARN sink{component_kind="sink" component_id=test_sink component_type=http component_name=test_sink}:request{request_id=636}: vector::sinks::util::retries: Internal log [Retrying after response.] is being rate limited.
2023-05-02T05:33:31.622858Z  WARN sink{component_kind="sink" component_id=test_sink component_type=http component_name=test_sink}:request{request_id=284}: vector::sinks::util::retries: Internal log [Retrying after response.] has been rate limited 33 times.
2023-05-02T05:33:31.622918Z  WARN sink{component_kind="sink" component_id=test_sink component_type=http component_name=test_sink}:request{request_id=284}: vector::sinks::util::retries: Retrying after response. reason=too many requests internal_log_rate_limit=true

Example Data

No response

Additional Context

For an MRE, I created an HTTP server for the http sink with simple logic: while there are fewer than 20 concurrent requests it sleeps for 40 seconds and then returns success; otherwise it returns 429 immediately (a sketch of such a server is below).
The content of the kafka topic is not important; there just needs to be enough data.
Vector is running in Kubernetes.
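
A minimal sketch of such a server, for illustration only (the real MRE server may differ; the port 8000 and POST-only handling are assumptions based on the sink config above):

import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

MAX_CONCURRENT = 20      # above this, reject immediately with 429
SLOW_RESPONSE_SECS = 40  # "successful" responses are delayed this long

lock = threading.Lock()
in_flight = 0

class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        global in_flight
        # Drain the request body so the connection can be reused.
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)

        with lock:
            accepted = in_flight < MAX_CONCURRENT
            if accepted:
                in_flight += 1

        if not accepted:
            # Too many concurrent requests: return 429 immediately.
            self.send_response(429)
            self.end_headers()
            return

        try:
            time.sleep(SLOW_RESPONSE_SECS)  # simulate a slow downstream
            self.send_response(200)
            self.end_headers()
        finally:
            with lock:
                in_flight -= 1

if __name__ == "__main__":
    ThreadingHTTPServer(("0.0.0.0", 8000), Handler).serve_forever()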

References

No response

@Ilmarii Ilmarii added the type: bug A code related bug. label May 2, 2023
@spencergilbert
Contributor

👋 Thanks for the detailed report @Ilmarii!

I'm working on reproducing this locally. I'm curious whether you still see this behavior if acknowledgements are disabled, or if you've tried switching from a memory buffer to a disk buffer.

@spencergilbert
Contributor

spencergilbert commented May 9, 2023

Noting I wasn't able to reproduce this locally when only returning 429s from the target HTTP server.

EDIT: with some 202s sprinkled in, I've gotten the allocation tracking to show ~2.5GB of memory usage for the kafka source in ~15 minutes.

@spencergilbert
Contributor

@Ilmarii it looks to me, based on my local testing, that if a disk buffer is used, or acknowledgements are disabled, this memory growth isn't observed. I'll continue to track down the cause, but in the interim you could use either of those to work around the issue in your deployment (a config sketch follows).
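
For reference, a rough sketch of those two workarounds applied to the sink from the original report; pick one, and note the disk buffer sizing here is an assumption, tune it for your environment:

sinks:
  test_sink:
    type: "http"
    inputs:
      - main_input_kafka_src
    uri: "http://localhost:8000"
    # Workaround A: disable end-to-end acknowledgements
    acknowledgements:
      enabled: false
    # Workaround B: keep acknowledgements but switch to a disk buffer
    buffer:
      type: "disk"
      max_size: 536870912 # bytes (~512 MB); sizing is an assumption
      when_full: "block"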

@spencergilbert
Contributor

spencergilbert commented May 12, 2023

Fairly certain I've improved this (if not fully solved it) - I'm confirming locally, and should have a PR to close it later today if testing goes well.

@Ilmarii would you be able to run a nightly version to see if my changes help enough in your environment?

@Ilmarii
Contributor Author

Ilmarii commented May 15, 2023

@spencergilbert Thanks for your fix! I will try it and report back with the results.

@Ilmarii
Contributor Author

Ilmarii commented May 15, 2023

As for the disk buffer, we switched away from it, since with the disk buffer the CPU was strongly underutilized (only 1-2 cores out of 8 were used). Now we use a memory buffer + acknowledgements for some protection against data loss.

@Ilmarii
Contributor Author

Ilmarii commented May 15, 2023

@spencergilbert I built and tested vector from spencer/improve-selects and the memory leak still exists.

@spencergilbert
Contributor

Thanks! Curious: was there any improvement at all on your end?

@Ilmarii
Contributor Author

Ilmarii commented May 15, 2023

I found no improvement; the leak size is still the same. But I tested a273ff7, without your latest two commits.

@Ilmarii
Contributor Author

Ilmarii commented May 15, 2023

Now I'm testing with acknowledgements disabled; maybe this will be helpful.

@Ilmarii
Contributor Author

Ilmarii commented May 15, 2023

Well, with acknowledgements disabled there is no memory leak :-)

@spencergilbert
Contributor

Another thought we had is that there's an issue with how acknowledgements are handled by the old-style sink(s), and rewriting the http sink to the new driver-style sinks could resolve this as well.

@jszwedko
Member

@Ilmarii any chance you could try some variations of your configuration to try to isolate the issue? We could try swapping out the http sink for a blackhole sink, or the kafka source for the demo_logs source (sketches below). If one of those combinations still experiences the leak, that would help isolate the component with the bug.
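
For concreteness, rough sketches of the two variations (component names reused from the original config; the demo_logs options shown are assumptions, adjust as desired):

# Variation A: keep the kafka source, swap the http sink for blackhole
sinks:
  test_sink:
    type: "blackhole"
    inputs:
      - main_input_kafka_src
    acknowledgements:
      enabled: true

# Variation B: keep the http sink, swap the kafka source for demo_logs
sources:
  main_input_kafka_src:
    type: "demo_logs"
    format: "json"
    interval: 0.0 # emit events as fast as possible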

@Ilmarii
Contributor Author

Ilmarii commented May 22, 2023

@jszwedko Yes, I will try it.

@pront
Member

pront commented Aug 7, 2023

Hi @Ilmarii, did you have a chance to try any of the combinations suggested above?

@m0untains

I believe we may also be hitting this issue with the aws_kinesis_streams sink. We have a kubernetes_logs source, and noticed a strong correlation between Vector OOMing very quickly and being throttled by AWS.

Once the throttling subsides, so does the OOMing. I can also remove the kinesis sink (but keep the k8s logs source), and the problem disappears (which points to the issue being on the sink side).

@jszwedko
Member

> I believe we may also be hitting this issue with the aws_kinesis_streams sink. We have a kubernetes_logs source, and noticed a strong correlation between Vector OOMing very quickly and being throttled by AWS.
>
> Once the throttling subsides, so does the OOMing. I can also remove the kinesis sink (but keep the k8s logs source), and the problem disappears (which points to the issue being on the sink side).

That would be consistent with the sink building up a backlog of requests to process due to the throttling 🤔

@m0untains

I could also add: the explosion in memory happens quite quickly. In steady state, when things are good, per-pod memory usage is a couple of hundred Mi. We have the pod limit set to 3Gi.

Monitoring kubectl top pods suggests that, when throttled, memory usage often exceeds 3Gi within a single scraping window (I'm not sure exactly what the window is for top pods, but I assume < 60 seconds).

@m0untains

Observing this on a worker node: once the journal entries for kubernetes_logs are accessed, Vector dies in a matter of seconds.

It happens so quickly that it's hard to debug. Right before the OOM, I did notice roughly 50 additional sockets under /proc/<vector_pid>/fd/. Presumably these are to AWS; however, I am using the default request.concurrency of none, which is supposed to be 1.

@jszwedko
Member

> It happens so quickly that it's hard to debug. Right before the OOM, I did notice roughly 50 additional sockets under /proc/<vector_pid>/fd/. Presumably these are to AWS; however, I am using the default request.concurrency of none, which is supposed to be 1.

Interesting, yeah, if you are using a concurrency of 1 then my theory about back-pressure building up in the sink seems less plausible. One thing to do might be to validate the assumption here that the concurrency is being correctly applied. You could use debug logs to see if requests are being issued concurrently.

Could you share your configuration, as well?

@m0untains

Will enable debug logs and see if we can ascertain the concurrency level.

I also think the rate-limiting is a red herring; after further testing, this behavior is also observed when we are not rate-limited by AWS.

Also, apologies for hijacking this thread. I've opened a separate ticket for this issue (I should have done this initially), and uploaded the configuration as well: #18397

@jszwedko
Member

I suspect request.concurrency is incorrectly documented: the default is not actually none but adaptive. You could try setting request.concurrency = 1 to see if it changes the behavior for you (sketch below). We are looking into validating the default and updating the docs if it is incorrect.
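
Something like this, where the sink name and type are placeholders for your actual config:

sinks:
  my_kinesis_sink:
    type: "aws_kinesis_streams"
    # ...existing options...
    request:
      concurrency: 1 # pin explicitly instead of relying on the documented default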

@neuronull
Contributor

Reading up on this issue, it sounds like we are still waiting to hear back from @Ilmarii on trying out the adjustments to the config in order to help isolate the issue.

@dsmith3197
Contributor

Hi @Ilmarii,

I believe this should be resolved in the latest version (v0.34.0). We fixed a memory leak in the http sink (#18637) that would be triggered when the downstream service returns 429 and also refactored the Kafka source to better handle acknowledgements (#17497). Let us know if you still experience issues after upgrading.
