
Memory leak with kafka source, http sink and 429 code responses #17276

Closed

Ilmarii opened this issue May 2, 2023 · 24 comments · Fixed by #17380
Labels
type: bug A code related bug.

Comments

@Ilmarii
Contributor

Ilmarii commented May 2, 2023

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

We have a configuration with a kafka source and an http sink, with acknowledgements enabled. It works fine while the http sink receives only successful responses, but once the sink starts receiving 429 responses, memory grows until Vector is killed by the OOM killer.

I also tried turning on allocation tracing, and the graphs suggest there is a memory leak in the kafka source.

Some graphs:

[Graph: memory usage from the Vector node]

[Graph: internal allocation tracing, all components]

[Graph: internal allocation tracing, only the kafka source]

[Graph: rate of http sink requests by response code]

Configuration

data_dir: /tmp/vector-data-dir
api:
  enabled: true
  address: "127.0.0.1:8686"
  playground: true
log_schema:
  host_key: host
  message_key: message
  source_type_key: source_type
  timestamp_key: timestamp
sources:
  main_input_kafka_src:
    type: kafka
    bootstrap_servers: "bootstrap.brokers.kafka:443"
    group_id: vector_test_local
    auto_offset_reset: earliest
    topics:
      - test-topic
    librdkafka_options:
      fetch.message.max.bytes: "10485760" # 10 MB
      fetch.max.bytes: "104857600" # 100 MB
sinks:
  test_sink:
    type: "http"
    inputs:
      - main_input_kafka_src
    uri: "http://localhost:8000"
    method: "post"
    acknowledgements:
      enabled: true
    buffer:
      type: "memory"
      max_events: 100000
      when_full: "block"
    batch:
      max_bytes: 20971520
      timeout_secs: 10
    encoding:
      codec: "json"
    request:
      concurrency: 250

Version

0.29.1

Debug Output

2023-05-02T05:33:20.934810Z  WARN sink{component_kind="sink" component_id=test_sink component_type=http component_name=test_sink}:request{request_id=817}: vector::sinks::util::retries: Retrying after response. reason=too many requests internal_log_rate_limit=true
2023-05-02T05:33:21.419138Z  WARN sink{component_kind="sink" component_id=test_sink component_type=http component_name=test_sink}:request{request_id=636}: vector::sinks::util::retries: Internal log [Retrying after response.] is being rate limited.
2023-05-02T05:33:31.622858Z  WARN sink{component_kind="sink" component_id=test_sink component_type=http component_name=test_sink}:request{request_id=284}: vector::sinks::util::retries: Internal log [Retrying after response.] has been rate limited 33 times.
2023-05-02T05:33:31.622918Z  WARN sink{component_kind="sink" component_id=test_sink component_type=http component_name=test_sink}:request{request_id=284}: vector::sinks::util::retries: Retrying after response. reason=too many requests internal_log_rate_limit=true

Example Data

No response

Additional Context

For an MRE, I created an HTTP server for the http sink with simple logic: while there are fewer than 20 concurrent requests it sleeps for 40 seconds and then returns success; otherwise it returns 429 immediately (a sketch of such a server is below).
The content of the kafka topic is not important; there just needs to be enough data.
Vector is running in Kubernetes.
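
A minimal sketch of such a server, for illustration only (the real MRE server may differ; the port 8000 and POST-only handling are assumptions based on the sink config above):

import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

MAX_CONCURRENT = 20      # above this, reject immediately with 429
SLOW_RESPONSE_SECS = 40  # "successful" responses are delayed this long

lock = threading.Lock()
in_flight = 0

class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        global in_flight
        # Drain the request body so the connection can be reused.
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)

        with lock:
            accepted = in_flight < MAX_CONCURRENT
            if accepted:
                in_flight += 1

        if not accepted:
            # Too many concurrent requests: return 429 immediately.
            self.send_response(429)
            self.end_headers()
            return

        try:
            time.sleep(SLOW_RESPONSE_SECS)  # simulate a slow downstream
            self.send_response(200)
            self.end_headers()
        finally:
            with lock:
                in_flight -= 1

if __name__ == "__main__":
    ThreadingHTTPServer(("0.0.0.0", 8000), Handler).serve_forever()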

References

No response

@Ilmarii Ilmarii added the type: bug A code related bug. label May 2, 2023
@spencergilbert
Contributor

👋 Thanks for the detailed report @Ilmarii!

I'm working on reproducing this locally. I'm curious whether you still see this behavior if acknowledgements are disabled, or if you've tried switching from a memory buffer to a disk buffer.

@spencergilbert
Contributor

spencergilbert commented May 9, 2023

Noting I wasn't able to reproduce this locally when only returning 429s from the target HTTP server.

EDIT: with some 202s sprinkled in, I've gotten the allocation tracking to show ~2.5GB of memory usage for the kafka source in ~15 minutes.

@spencergilbert
Contributor

@Ilmarii it looks to me, based on my local testing, that if a disk buffer is used, or acknowledgements are disabled, this memory growth isn't observed. I'll continue to track down the cause, but in the interim you could use either of those to work around the issue in your deployment (a config sketch follows).
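
For reference, a rough sketch of those two workarounds applied to the sink from the original report; pick one, and note the disk buffer sizing here is an assumption, tune it for your environment:

sinks:
  test_sink:
    type: "http"
    inputs:
      - main_input_kafka_src
    uri: "http://localhost:8000"
    # Workaround A: disable end-to-end acknowledgements
    acknowledgements:
      enabled: false
    # Workaround B: keep acknowledgements but switch to a disk buffer
    buffer:
      type: "disk"
      max_size: 536870912 # bytes (~512 MB); sizing is an assumption
      when_full: "block"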

@spencergilbert
Contributor

spencergilbert commented May 12, 2023

Fairly certain I've improved this (if not fully solved it) - I'm confirming locally, and should have a PR to close it later today if testing goes well.

@Ilmarii would you be able to run a nightly version to see if my changes help enough in your environment?

@Ilmarii
Contributor Author

Ilmarii commented May 15, 2023

@spencergilbert Thanks for your fix! I will try it and report back with the results.

@Ilmarii
Contributor Author

Ilmarii commented May 15, 2023

As for the disk buffer, we switched away from it, since with the disk buffer the CPU was strongly underutilized (only 1-2 cores out of 8 were used). Now we use a memory buffer + acknowledgements for some protection against data loss.

@Ilmarii
Contributor Author

Ilmarii commented May 15, 2023

@spencergilbert I built and tested vector from spencer/improve-selects and the memory leak still exists.

@spencergilbert
Contributor

Thanks! Curious: was there any improvement at all on your end?

@Ilmarii
Contributor Author

Ilmarii commented May 15, 2023

I found no improvement; the leak size is still the same. But I tested a273ff7, without your latest two commits.

@Ilmarii
Contributor Author

Ilmarii commented May 15, 2023

Now I'm testing with acknowledgements disabled; maybe this will be helpful.

@Ilmarii
Contributor Author

Ilmarii commented May 15, 2023

Well, with acknowledgements disabled there is no memory leak :-)

@spencergilbert
Contributor

Another thought we had is that there's an issue with how acknowledgements are handled by the old-style sink(s), and rewriting the http sink to the new driver-style sinks could resolve this as well.

@jszwedko
Member

@Ilmarii any chance you could try some variations of your configuration to try to isolate the issue? We could try swapping out the http sink for a blackhole sink, or the kafka source for the demo_logs source (sketches below). If one of those combinations still experiences the leak, that would help isolate the component with the bug.
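
For concreteness, rough sketches of the two variations (component names reused from the original config; the demo_logs options shown are assumptions, adjust as desired):

# Variation A: keep the kafka source, swap the http sink for blackhole
sinks:
  test_sink:
    type: "blackhole"
    inputs:
      - main_input_kafka_src
    acknowledgements:
      enabled: true

# Variation B: keep the http sink, swap the kafka source for demo_logs
sources:
  main_input_kafka_src:
    type: "demo_logs"
    format: "json"
    interval: 0.0 # emit events as fast as possible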

@Ilmarii
Contributor Author

Ilmarii commented May 22, 2023

@jszwedko Yes, I will try it.

@pront
Member

pront commented Aug 7, 2023

Hi @Ilmarii, did you have a chance to try any of the combinations suggested above?

@m0untains

I believe we may also be hitting this issue with the aws_kinesis_streams sink. We have a kubernetes_logs source, and noticed a strong correlation between Vector OOMing very quickly and being throttled by AWS.

Once the throttling subsides, so does the OOMing. I can also remove the kinesis sink (but keep the k8s logs source), and the problem disappears (which points to the issue being on the sink side).

@jszwedko
Member

> I believe we may also be hitting this issue with the aws_kinesis_streams sink. We have a kubernetes_logs source, and noticed a strong correlation between Vector OOMing very quickly and being throttled by AWS.
>
> Once the throttling subsides, so does the OOMing. I can also remove the kinesis sink (but keep the k8s logs source), and the problem disappears (which points to the issue being on the sink side).

That would be consistent with the sink building up a backlog of requests to process due to the throttling 🤔

@m0untains

I could also add: the explosion in memory happens quite quickly. In steady state, when things are good, per-pod memory usage is a couple of hundred Mi. We have the pod limit set to 3Gi.

Monitoring kubectl top pods suggests that, when throttled, memory usage often exceeds 3Gi within a single scraping window (I'm not sure exactly what the window is for top pods, but I assume < 60 seconds).

@m0untains

Observing this on a worker node: once the journal entries for kubernetes_logs are accessed, Vector dies in a matter of seconds.

It happens so quickly that it's hard to debug. Right before the OOM, I did notice roughly 50 additional sockets under /proc/<vector_pid>/fd/. Presumably these are to AWS; however, I am using the default request.concurrency of none, which is supposed to be 1.

@jszwedko
Member

> It happens so quickly that it's hard to debug. Right before the OOM, I did notice roughly 50 additional sockets under /proc/<vector_pid>/fd/. Presumably these are to AWS; however, I am using the default request.concurrency of none, which is supposed to be 1.

Interesting, yeah, if you are using a concurrency of 1 then my theory about back-pressure building up in the sink seems less plausible. One thing to do might be to validate the assumption here that the concurrency is being correctly applied. You could use debug logs to see if requests are being issued concurrently.

Could you share your configuration, as well?

@m0untains

Will enable debug logs and see if we can ascertain the concurrency level.

I also think the rate-limiting is a red herring; after further testing, this behavior is also observed when we are not rate-limited by AWS.

Also, apologies for hijacking this thread. I've opened a separate ticket for this issue (I should have done this initially), and uploaded the configuration as well: #18397

@jszwedko
Member

I suspect request.concurrency is incorrectly documented: the default is not actually none but adaptive. You could try setting request.concurrency = 1 to see if it changes the behavior for you (sketch below). We are looking into validating the default and updating the docs if it is incorrect.
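
Something like this, where the sink name and type are placeholders for your actual config:

sinks:
  my_kinesis_sink:
    type: "aws_kinesis_streams"
    # ...existing options...
    request:
      concurrency: 1 # pin explicitly instead of relying on the documented default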

@neuronull
Contributor

Reading up on this issue, it sounds like we are still waiting to hear back from @Ilmarii on trying out the adjustments to the config in order to help isolate the issue.

@dsmith3197
Contributor

Hi @Ilmarii,

I believe this should be resolved in the latest version (v0.34.0). We fixed a memory leak in the http sink (#18637) that would be triggered when the downstream service returns 429 and also refactored the Kafka source to better handle acknowledgements (#17497). Let us know if you still experience issues after upgrading.
