Bridge crashes with OOM error when the Kafka topic contains large amounts of data with /records api #379

Open
HarshithBolar opened this issue Jan 28, 2020 · 17 comments


@HarshithBolar

My consumer config -

{
	"name": "test",
	"auto.offset.reset": "earliest",
	"format": "json",
	"enable.auto.commit": false,
	"fetch.min.bytes": 512,
	"consumer.request.timeout.ms": 3000
}
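
For reference, the consumer is created and subscribed through the bridge HTTP API roughly like this (host/port, consumer group, and topic names here are placeholders):

# create the consumer with the config above
curl -X POST http://localhost:8080/consumers/test-group \
  -H 'Content-Type: application/vnd.kafka.v2+json' \
  -d '{"name": "test", "auto.offset.reset": "earliest", "format": "json", "enable.auto.commit": false, "fetch.min.bytes": 512, "consumer.request.timeout.ms": 3000}'

# subscribe it to the topic
curl -X POST http://localhost:8080/consumers/test-group/instances/test/subscription \
  -H 'Content-Type: application/vnd.kafka.v2+json' \
  -d '{"topics": ["test-topic"]}'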

When I publish large quantities of data (several GB) and hit the /records api to fetch messages, the bridge first logs a Thread blocked error and finally crashes with an out of memory (Java heap space) error.
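
The fetch itself is just the bridge's poll endpoint, roughly (same placeholders as above):

curl -H 'Accept: application/vnd.kafka.json.v2+json' \
  http://localhost:8080/consumers/test-group/instances/test/records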

Looks like the bridge is trying to fetch too many messages from Kafka in a single request. Is there any way to control the max number of messages (or max size) that can be fetched per request?

Error logs - https://paste.ubuntu.com/p/9y6RS9yxBZ/

@sknot-rh
Member

Hi,
for consumers, you can use all the settings from https://kafka.apache.org/documentation/#consumerconfigs
So I believe setting fetch.max.bytes should help there. @ppatierno can you confirm?
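i.e. something like this added to the consumer creation body (just a sketch):

{
    "name": "test",
    "format": "json",
    "fetch.max.bytes": 1000000
}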

@HarshithBolar
Author

HarshithBolar commented Jan 28, 2020

I get this error when I add the property "fetch.max.bytes": 1000000 while creating a consumer.

{
    "error_code": 400,
    "message": "Validation error on: body - $.fetch.max.bytes: is not defined in the schema and the schema does not allow additional properties"
}

I also tried adding kafka.fetch.max.bytes=1000000 to application.properties, but the bridge crashed with the same error again.

In the logs, I see that this property is getting picked up -

...
kafka_bridge_run.sh[18960]: exclude.internal.topics = true
kafka_bridge_run.sh[18960]: fetch.max.bytes = 1000000
kafka_bridge_run.sh[18960]: fetch.max.wait.ms = 500
...

So, the issue might be related to something else.

@HarshithBolar
Author

So I sent a few messages to Kafka after setting fetch.max.bytes to 1 MB in application.properties, and when I did a fetch using the /records api, I received a 2.5 MB response for a single request. It looks like that property isn't having any effect.

@ppatierno
Member

Yes, the specific consumer configuration supports just a limited set of properties, so, as you already did, the way to go is to set the property at bridge level in application.properties.
I will double check it. What's your use case, out of curiosity?

@sknot-rh
Member

My memory was wrong... in the bridge you can set only these options: https://strimzi.io/docs/bridge/latest/#_consumer

@ppatierno
Member

@HarshithBolar can you provide a reproducer so that we can check?

@HarshithBolar
Author

@ppatierno The use case is to deliver messages from Kafka to an external client. Because we cannot expose our internal Kafka architecture, we're using the Strimzi bridge to expose an endpoint through which they can pull messages. The messages will stream at approximately 1000-1500 transactions/second and each message will be around 300-500 KB. The functional tests worked very well, but we're currently facing this issue in performance testing.

The issue I see here is that the bridge is fetching more than what is specified in the fetch.max.bytes property. I have set it to 1 MB, but I'm getting responses that are over 5 MB. Is this expected?

I'm using version 0.14.0. This can be reproduced by setting kafka.fetch.max.bytes=1000000 in application.properties, sending around 5 MB of data to a Kafka topic and trying to fetch it with /records. The Kafka version is 2.0.
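
A rough reproduction sketch (broker address, topic, file, group, and consumer names are placeholders):

# bridge application.properties
kafka.fetch.max.bytes=1000000

# publish ~5 MB of JSON lines to the topic, e.g. with the console producer
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test-topic < five-mb-of-json.txt

# fetch through the bridge and check the response size
curl -s -H 'Accept: application/vnd.kafka.json.v2+json' \
  http://localhost:8080/consumers/test-group/instances/test/records | wc -c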

@ppatierno
Member

Sorry to ask for more insights.
I guess you are using the bridge for sending as well, not only for consuming, right?
Is the producer sending this 5 MB as binary data, i.e. base64 encoded in the JSON?

@HarshithBolar
Author

HarshithBolar commented Jan 28, 2020

No, the bridge is being used only for consuming. I'm using a different technology (Flink) for publishing messages to Kafka.

The producer is sending around 15-20 JSON messages that add up to 5 MB. The bridge consumes all of these messages in a single GET request. The producer is sending them as plain JSON strings, not base64 encoded.

@ppatierno
Member

ppatierno commented Jan 28, 2020

@HarshithBolar I see you wrote kafka.fetch.max.bytes while it should be kafka.consumer.fetch.max.bytes, can you try it please? There is a consumer. prefix to use. This should be the reason why the limit is not applied to the consumer.
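
So the bridge-level override in application.properties would look like this (same 1 MB value as above):

kafka.consumer.fetch.max.bytes=1000000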

@HarshithBolar
Author

I changed the line to kafka.consumer.fetch.max.bytes=1000001. While creating the consumer, I see the property being picked up -

Jan 28 09:33:31 kafka_bridge_run.sh[28467]: fetch.max.bytes = 1000001

But it still has no effect: I ran the same test and the bridge still returned 5 MB in a single request.

@ppatierno
Member

Thinking about it some more, the consumer on the bridge is working as expected, given how Kafka works.
As stated by the official Kafka documentation about fetch.max.bytes:

Records are fetched in batches by the consumer, and if the first record batch in the first non-empty partition of the fetch is larger than this value, the record batch will still be returned to ensure that the consumer can make progress

So it seems that the record batch is 5 MB, and the consumer returns it anyway, because that is how it works.
fetch.max.bytes is useful when each batch to receive is smaller than fetch.max.bytes: it limits how many batches the consumer can get in one fetch (but it will always get at least the first one). For example, with fetch.max.bytes=1000000 and record batches of ~300 KB, a single fetch would return only a few batches; a single 5 MB batch, however, is still returned whole.

Can I ask for the producer configuration as well (at the Kafka level)?
Also, because 5 MB alone should not be the cause of the OOM exception, did it happen with just 5 MB or with more data? How many partitions does the topic have? Are you running just one bridge?
I am asking this to have more context, because, for example, with a topic with 100 partitions and 100 consumers connecting to the bridge, you would have 100 consumer instances on the bridge getting these big batches ... all of that could cause an OOM.

@HarshithBolar
Author

The issue happens when I publish around 10 GB of data to the topic and try a fetch from Strimzi. The topic has 30 partitions. There is just one bridge running at the moment. For testing, I have created just one consumer on the bridge.

The producer has the default configuration, so batch.size should be the default 16384 bytes.

If there is any specific producer property you're looking for or any more information from Kafka, I can get it for you.

After sending 10 GB of data and making a request to the /records api, I immediately see the following thread blocked errors. Do they mean anything?

Jan 28 15:54:54 kafka_bridge_run.sh[4242]: Jan 28, 2020 3:54:54 PM io.vertx.core.impl.BlockedThreadChecker
Jan 28 15:54:54 kafka_bridge_run.sh[4242]: WARNING: Thread Thread[vert.x-eventloop-thread-2,5,main]=Thread[vert.x-eventloop-thread-2,5,main] has been blocked for 3055 ms, time limit is 2000 ms
Jan 28 15:55:27 kafka_bridge_run.sh[4242]: io.vertx.core.VertxException: Thread blocked

@HarshithBolar
Author

HarshithBolar commented Jan 29, 2020

I added a new property kafka.consumer.max.poll.records=1 to application.properties. This seems to have fixed it: the bridge now fetches just one message for every request to the /records api, so there are no OOM issues anymore.
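
For anyone hitting the same issue, the application.properties lines that work around it for me (values from this thread) are:

kafka.consumer.fetch.max.bytes=1000001
kafka.consumer.max.poll.records=1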

@ppatierno
Member

@HarshithBolar so can I close this one?

@HarshithBolar
Author

@ppatierno Yes, I'll close this.

@HarshithBolar
Author

Hi @ppatierno, reopening this issue as setting max.poll.records to 1 was just a workaround. The issue still exists when I try to fetch a large number of messages from Kafka. The property kafka.consumer.fetch.max.bytes, which is set to 10 MB, makes no difference. The bridge first throws an io.vertx.core.VertxException: Thread blocked error and finally fails with:

SEVERE: Unhandled exception
java.lang.OutOfMemoryError: Java heap space

@HarshithBolar reopened this May 28, 2020