Meta node can only handle ~600 Kafka sources. #18949

Open
ka-weihe opened this issue Oct 16, 2024 · 1 comment
Labels
type/bug Something isn't working

@ka-weihe
Contributor

ka-weihe commented Oct 16, 2024

Describe the bug

I attempted to create 700 Kafka Plain Avro sources. However, after successfully creating approximately 600 sources, the meta-node becomes unresponsive, and I encounter the following error:

ERROR librdkafka: librdkafka: THREAD [thrd:main]: Unable to create broker thread

Additionally, when I shell into the meta-node and try to execute any commands, I get:

bash: fork: retry: Resource temporarily unavailable

To investigate, I ran the command (ps -eLf | wc -l) to monitor the number of OS threads while creating the sources. The thread count reached 32,769 after I had created 629 sources, at which point I could no longer execute commands—likely due to the OS exhausting its available resources for creating threads.
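To attribute the threads to a specific process rather than a system-wide total, the per-process counts can be read from procfs. This is a minimal sketch, assuming a Linux host where `/proc` is mounted; the helper names are illustrative and not part of RisingWave. Note also that `ps -eLf` prints a header line, so `wc -l` reports one more than the actual number of threads.

```python
import os


def parse_status_field(status_text, field):
    """Return the value of `field` (e.g. 'Threads') from /proc/<pid>/status text."""
    prefix = field + ":"
    for line in status_text.splitlines():
        if line.startswith(prefix):
            return line[len(prefix):].strip()
    return None


def threads_per_process(proc_root="/proc"):
    """Return a list of (thread_count, pid, name) tuples, largest first."""
    rows = []
    if not os.path.isdir(proc_root):
        return rows
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories correspond to processes
        try:
            with open(os.path.join(proc_root, entry, "status")) as f:
                text = f.read()
        except OSError:
            continue  # the process exited or is not readable
        threads = parse_status_field(text, "Threads")
        if threads is not None:
            name = parse_status_field(text, "Name")
            rows.append((int(threads), int(entry), name))
    return sorted(rows, reverse=True)


if __name__ == "__main__":
    # Print the five processes that own the most threads.
    for threads, pid, name in threads_per_process()[:5]:
        print(f"{threads:>7} threads  pid={pid}  {name}")
```

Running this inside the meta-node container while sources are being created should show whether the growth is concentrated in the meta process.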

This behavior seems like a bug. I would expect the number of threads to be significantly lower than the number of sources. Even one OS thread per source, while inefficient, would still fit our use case, but we are seeing more than 50 OS threads per source. It’s also worth noting that none of the sources were being actively used at any time.
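The per-source figure follows directly from the two numbers reported above. The sketch below works through that arithmetic and projects how many sources fit under a thread budget. The budget of 32,768 is an assumption: it is a common Linux default for `kernel.pid_max`, and it matches the observed line count minus the `ps` header, but the actual limit on this node is not stated in the issue.

```python
def threads_per_source(total_threads, sources):
    """Average number of OS threads observed per created source."""
    return total_threads / sources


def max_sources(thread_budget, per_source_threads, baseline_threads=0):
    """How many sources fit before the thread budget is exhausted."""
    return (thread_budget - baseline_threads) // per_source_threads


# Numbers reported in this issue: 32,769 lines from `ps -eLf | wc -l`
# after 629 sources had been created.
observed = threads_per_source(32769, 629)

# ASSUMPTION: a budget of 32,768 threads and roughly 52 threads per source.
projected_limit = max_sources(32768, 52)

if __name__ == "__main__":
    print(f"threads per source: {observed:.1f}")
    print(f"projected source limit: {projected_limit}")
```

Under these assumptions the projected ceiling lands at about 630 sources, which is consistent with the node stalling after 629.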

Error message/log

ERROR librdkafka: librdkafka: THREAD [thrd:main]: Unable to create broker thread

To Reproduce

  • Set up a Kafka cluster with 50 brokers in K8s
  • Create 1000 Avro topics with 16 partitions each
  • Deploy a simple RisingWave cluster using the RisingWave operator, with a Postgres meta store and a MinIO state store (in K8s)
  • Connect and create 1 source for each topic.
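The last step can be scripted. The following is a sketch only: the topic names, broker address, and schema registry URL are hypothetical placeholders, and the `CREATE SOURCE ... FORMAT PLAIN ENCODE AVRO` statement reflects my understanding of RisingWave's Kafka source syntax rather than the exact statements used in this report.

```python
def create_source_sql(topic, bootstrap_servers, schema_registry_url):
    """Build one CREATE SOURCE statement for a Kafka topic with plain Avro encoding."""
    # Source names cannot contain '-' or '.', so derive a safe identifier.
    source_name = "src_" + topic.replace("-", "_").replace(".", "_")
    return (
        f"CREATE SOURCE IF NOT EXISTS {source_name}\n"
        "WITH (\n"
        "    connector = 'kafka',\n"
        f"    topic = '{topic}',\n"
        f"    properties.bootstrap.server = '{bootstrap_servers}'\n"
        ") FORMAT PLAIN ENCODE AVRO (\n"
        f"    schema.registry = '{schema_registry_url}'\n"
        ");"
    )


def generate_statements(n_topics, bootstrap_servers, schema_registry_url):
    """Generate one statement per topic; topic names here are hypothetical."""
    topics = [f"avro-topic-{i:04d}" for i in range(n_topics)]
    return [
        create_source_sql(t, bootstrap_servers, schema_registry_url) for t in topics
    ]


if __name__ == "__main__":
    # HYPOTHETICAL endpoints; replace with the real cluster addresses.
    statements = generate_statements(
        1000, "kafka-0.kafka:9092", "http://schema-registry:8081"
    )
    print("\n\n".join(statements[:2]))
```

The output can be piped into `psql` against the RisingWave frontend while watching the thread count on the meta node.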

Expected behavior

I expect all sources to be created while the meta node stays responsive and does not consume excessive resources.

How did you deploy RisingWave?

More or less like this:
https://github.com/risingwavelabs/risingwave-operator/blob/main/docs/manifests/risingwave/risingwave-postgresql-s3.yaml

but with higher resource limits.

The version of RisingWave

PostgreSQL 13.14.0-RisingWave-2.0.1 (0d15632)

Additional context

No response

@ka-weihe ka-weihe added the type/bug Something isn't working label Oct 16, 2024
@github-actions github-actions bot added this to the release-2.2 milestone Oct 16, 2024
@ka-weihe ka-weihe changed the title from "Meta node can only handle ~600 sources." to "Meta node can only handle ~600 Kafka sources." Oct 16, 2024
@xxchan
Member

xxchan commented Oct 18, 2024

According to confluentinc/librdkafka#1600, it seems each Kafka consumer creates roughly num_brokers threads, and threads cannot be shared across client instances.

In Meta, each source currently has its own KafkaSplitEnumerator, which owns a Kafka consumer, so this looks like exactly your problem (1 source ≈ 50 threads). In theory, we could improve Meta by sharing one consumer among sources that connect to the same brokers.
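The sharing idea can be illustrated with a small reference-counted registry keyed by broker set. This is a language-agnostic sketch written in Python, not RisingWave's implementation (which is in Rust); `SharedClientRegistry` and the factory callable are hypothetical, and a real version would also have to key on security and other client settings, not only on broker addresses.

```python
import threading


class SharedClientRegistry:
    """Hand out one client per distinct broker set, with reference counting."""

    def __init__(self, factory):
        self._factory = factory  # callable that builds a client from a server string
        self._clients = {}       # broker-set key -> [client, refcount]
        self._lock = threading.Lock()

    @staticmethod
    def _key(bootstrap_servers):
        # Order and whitespace should not matter when comparing broker lists.
        return frozenset(
            s.strip() for s in bootstrap_servers.split(",") if s.strip()
        )

    def acquire(self, bootstrap_servers):
        """Return the shared client for these brokers, creating it on first use."""
        key = self._key(bootstrap_servers)
        with self._lock:
            entry = self._clients.get(key)
            if entry is None:
                entry = [self._factory(bootstrap_servers), 0]
                self._clients[key] = entry
            entry[1] += 1
            return entry[0]

    def release(self, bootstrap_servers):
        """Drop one reference; forget the client when nobody uses it anymore."""
        key = self._key(bootstrap_servers)
        with self._lock:
            entry = self._clients.get(key)
            if entry is None:
                return
            entry[1] -= 1
            if entry[1] <= 0:
                del self._clients[key]

    def client_count(self):
        with self._lock:
            return len(self._clients)
```

With such a registry, 700 sources pointing at the same 50-broker cluster would hold one client, so the broker-thread cost would be paid once instead of once per source.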

However, I'm concerned that even after fixing Meta, Compute nodes still won't be able to handle it, since they will have even more Kafka consumers (multiplied by parallelism).
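A rough upper-bound model makes the concern concrete. It assumes roughly one thread per broker per consumer (as suggested by the linked librdkafka issue), one consumer per source on Meta, and one consumer per source per unit of parallelism on Compute; the parallelism value of 4 is a hypothetical example, not a figure from this report.

```python
def estimated_kafka_threads(sources, brokers, consumers_per_source=1):
    """Upper-bound estimate: each consumer holds about one thread per broker."""
    return sources * consumers_per_source * brokers


# Meta: one split enumerator (one consumer) per source.
meta_threads = estimated_kafka_threads(sources=700, brokers=50)

# Compute: ASSUMED parallelism of 4, i.e. four consumers per source.
compute_threads = estimated_kafka_threads(
    sources=700, brokers=50, consumers_per_source=4
)

if __name__ == "__main__":
    print(f"meta:    ~{meta_threads} threads")
    print(f"compute: ~{compute_threads} threads across the compute nodes")
```

The model ignores the fixed per-client threads and any threads that are created lazily, so it is only meant to show how the count scales with brokers and parallelism.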

I feel that 50 brokers might be too many. Is it possible for you to divide the cluster into multiple smaller clusters with fewer brokers? 🤔
