-
-
Notifications
You must be signed in to change notification settings - Fork 532
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Concurrent produce() sequence number fix #1050
Concurrent produce() sequence number fix #1050
Conversation
Read/update sequence numbers immediately before produce request
More info on dealing with failures with inflight request > 1. Although I don't think we need this full implementation, as I don't think we need to guarantee ordering for concurrent requests (or we at least state that we don't.) I think this PR as-is is an improvement to the current behaviour. But I think there are questions around concurrent invocations of produce() and inflight requests in the idempotent mode, especially dealing with errors. |
This is correct. We state that you need to set The KIP you linked to has some ideas about improving this by reassigning sequence numbers if there's an error - but I'd say that's a future improvement. The case we're trying to solve here is where people are essentially doing: await Promise.all([
producer.send(),
producer.send(),
producer.send()
]) While the order between these three is of course arbitrary, assuming that there are no errors, all three requests should succeed. Currently, this is very likely to fail as the sequence number for all three requests are likely to be the same. This PR will change that so that the sequence numbers will always be incremented as soon as they are assigned. In fact, a future refactoring could change the interface to the EoS manager so that getting a sequence number automatically increments it, instead of relying on updating it separately. What this PR doesn't solve are the failure cases - but as it also doesn't make them worse, I'm inclined to accept it anyway. Some notes about the failure cases below: These notes are not really related to this PR, but I'm putting them here to increase our shared understanding. From KIP-98:
There was a comment in #598 that indicated that the sequence number should increase by one per request. This statement from KIP-98 can be read that way, but that's not what they mean. What they mean is just that there's a sequence number per topic partition, not that it's incremented by one per produce request. Our implementation is correct in that sense, that it increases with the number of records produced. Nothing to do here, just clearing something up. It goes on:
This makes me think that the current design of the transactional producer is a bit wonky. What I guess should happen is that if we receive and There is also the gnarlier situation where we don't get an OUT_OF_ORDER_SEQUENCE_NUMBER, but the request fails anyway, for example due to a connection error where we run out of retries. In such a case, we can't know what the next sequence number is. In this case, it seems the only safe thing to do is to handle it in the same way as if we got an OUT_OF_ORDER_SEQUENCE_NUMBER error and reject any further use of the transactional producer, and require a clean slate with a new producer id etc. |
Test caess for idempotent producer
@Nevon I have added code that reverts the sequence numbers in case of a failure. Now at least it will recover in the normal case of sequential produce() calls (for a topic-partition.) For errors when there are concurrent produce() calls ongoing (for a topic-partition) the idempotent producer will die with a UNKNOWN_PRODUCER_ID or OUT_OF_ORDER_SEQUENCE_NUMBER error on any error. This could be made to work if the sequence number stuff was moved inside the mutex lock that limits maxInFlightRequests. It would be pretty simple to add another lock (per broker, or maybe even per partition?) in sendMessages.js Test cases added that capture the current behaviour. I just noticed kafkajs/src/producer/sendMessages.js Line 127 in 9cf3dfb
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just one minor note about whether or not we can remove the sleeps from the tests, but overall it looks good to me. Thanks for your patience!
) | ||
}) | ||
|
||
it('concurrent produce() calls > all messages are written to the partition once', async () => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't worry about changing this, but just for future reference, you can just wrap these test cases in a describe block to group the tests together, instead of repeating the prefix in each test case:
describe('concurrent produce() calls', () => {
test('all messages are written to the partition at once')
})
* @template T | ||
*/ | ||
|
||
function allSettled(promises) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice. We should actually bump the requirement on Node to 12 at this point, but that's outside the scope of this PR.
Hi, I have a question. I've Producer and Consumer written in Java. for (int i=0; i<numberOfRecords; i++) { But even for one Producer I'm getting: What is strange for me, that the first 2050 messages are sent in less than a second. Can you advise? |
Read and update sequence numbers immediately before the produce request.
Fixes #1005 Maybe #598
Putting this up for review and input but I'm still unsure as to the failure modes. I'm wondering whether we should decrement the sequence numbers if the broker.produce() call fails.