KAFKA-16448: Fix raw record not being cached in store #16684
Conversation
@loicgreffier Thanks for the investigation and the PR! It seems that the issue is when a record is stored and then forwarded independently of the current input record. That might happen in different places. The ones that come to my mind are:
For the buffers, I see that the raw record is not serialized to the changelog topic or the store. That means that the raw record will always be null after a failover for the records in the in-memory buffer. With the RocksDB-based buffer the records will never have a raw record attached. When records without the raw record are evicted from those buffers, they will cause the NullPointerException. @mjsax could you double-check my understanding?
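A minimal sketch of the failover scenario described above, using placeholder classes rather than the actual Kafka Streams buffer code: the changelog entry carries only the key, value, and similar metadata, so any record rebuilt during restoration comes back without a raw source record attached.

```java
// Hypothetical illustration (placeholder classes, not the real buffer code) of
// the failover scenario: the raw source record is never serialized to the
// changelog, so it cannot be recovered during restoration.
final class BufferRestoreExample {

    static final class BufferedRecord {
        final byte[] key;
        final byte[] value;
        final byte[] rawKey;   // null after restore
        final byte[] rawValue; // null after restore

        BufferedRecord(final byte[] key, final byte[] value,
                       final byte[] rawKey, final byte[] rawValue) {
            this.key = key;
            this.value = value;
            this.rawKey = rawKey;
            this.rawValue = rawValue;
        }
    }

    // Rebuilds a buffered record from a changelog entry; the raw source record
    // was never written to the changelog, so it stays null here.
    static BufferedRecord restore(final byte[] changelogKey, final byte[] changelogValue) {
        return new BufferedRecord(changelogKey, changelogValue, null, null);
    }
}
```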
So maybe the fix brought by this PR should be:
Does it make sense?
@loicgreffier
I am just catching up more on the impl. The caching mechanism and other similar things seem to be a general issue for passing in the original raw source data (also applies to

Even if we ignore all DSL features (eg

Thus, taking a step back, I am wondering why we do not just pass in the current key/value (or full

While we want to use this new handler to build a DLQ, it's not the only way it can be used, and thus we should not blindly optimize for the DLQ case, but try to make it useful for other cases as much as we can, too? (And we revisit the question of what serialized data we can pass into a DLQ handler in the DLQ KIP, and try to decouple the ProcessingExceptionHandler a little bit more from the DLQ KIP?)

IIRC, we did have some discussion about this issue on the mailing list, but considered it a DSL issue that we might want to address in a follow-up KIP. But maybe this assessment was wrong, and it would be better to address it right away (at least partially)? In the end, won't it be easier for the handler to determine what to do if we pass in the current input record of the called
Actually, we do pass the current record into the processing exception handler. The issue here is that the error handler context also contains the raw source record, which seems not to be straightforward to get.
Given that we pass the first record (i.e. the record that throws the error) into the processing exception handler, the raw source record (i.e. the record read from the source topic partition of the current sub-topology of the task) in the error handler context would show for which source record the error occurred. I think this is valuable information, although not for each and every use case. However, I am wondering whether the topic, partition, and offset would not be enough to give context. In the end, that is the information needed to identify the raw source record. The good thing is that we have all of this information in the current caches and buffers, as far as I understand.
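A minimal sketch of a handler that relies only on topic, partition, and offset to identify the raw source record, assuming the ProcessingExceptionHandler and ErrorHandlerContext interfaces proposed in KIP-1033; the class name is hypothetical and the exact method names may differ slightly in the released API.

```java
import java.util.Map;

import org.apache.kafka.streams.errors.ErrorHandlerContext;
import org.apache.kafka.streams.errors.ProcessingExceptionHandler;
import org.apache.kafka.streams.processor.api.Record;

public class TopicPartitionOffsetLoggingHandler implements ProcessingExceptionHandler {

    @Override
    public ProcessingHandlerResponse handle(final ErrorHandlerContext context,
                                            final Record<?, ?> record,
                                            final Exception exception) {
        // topic/partition/offset identify the raw source record even when the
        // raw key/value are not available in the context.
        System.err.printf("Processing failed for source record %s-%d@%d: %s%n",
            context.topic(), context.partition(), context.offset(), exception);
        return ProcessingHandlerResponse.CONTINUE;
    }

    @Override
    public void configure(final Map<String, ?> configs) {
        // nothing to configure in this sketch
    }
}
```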
Do we have to consider that the first point of this comment #16684 (comment) is enough? Checking in ProcessorNode#process if rawRecord != null before accessing the raw key and the raw value. It is the safest approach for me and will avoid crashing with an NPE. The drawback is that the processing exception handler can end up with neither sourceRawKey nor sourceRawValue while the values are actually available upstream.
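A minimal sketch of the null guard proposed above, with placeholder names rather than the actual ProcessorNode code: the raw key and value are only exposed when a raw record is actually attached.

```java
// Hypothetical illustration of the null-guard approach; RawRecord and the
// accessor names are placeholders, not the actual Kafka Streams classes.
final class RawRecordGuardExample {

    static final class RawRecord {
        final byte[] key;
        final byte[] value;

        RawRecord(final byte[] key, final byte[] value) {
            this.key = key;
            this.value = value;
        }
    }

    static byte[] sourceRawKey(final RawRecord rawRecord) {
        // Guard against records stored/forwarded without the raw source record
        // (e.g. restored from a changelog or evicted from a buffer).
        return rawRecord != null ? rawRecord.key : null;
    }

    static byte[] sourceRawValue(final RawRecord rawRecord) {
        return rawRecord != null ? rawRecord.value : null;
    }
}
```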
Ah sorry. Mixed up the handler and the context... My bad.
Yes, that is what I meant here:
I think there are too many corner cases right now to address all of them properly and quickly (and for some, we might never be able to fix them), and we might just want to take the path of least resistance. So just doing the null check.
Well, if we find a good way later to set both correctly, it would just be an improvement we can do anytime. The API contract just says they may or may not be there (if we use
Talking to @cadonna, he would prefer to just omit both sourceRawKey and sourceRawValue. If we need the information for DLQ, we can add both back via KIP-1034.
@mjsax Got it, should we update KIP-1033 to remove any link with sourceRawKey and sourceRawValue? We'll update the existing PRs, and open a new PR to remove them. This PR could be closed afterward.
Yes, we should update the KIP accordingly.
Seems you did this on the other PRs already, which I just merged. So the issue should be fixed and we can close this PR?
@cadonna @mjsax
After #16093 has been merged, there is a scenario where processing exception handling ends with a NullPointerException. This happened with the following topology:

The raw record that has been added to ProcessorRecordContext is lost when records are cached in the store. This PR fixes it. Looking forward to providing unit tests.
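A minimal sketch of the idea behind the fix, using placeholder classes rather than the actual caching-store code: when the record context is captured into the cache for later forwarding, the raw source record has to be copied along, otherwise it is null once the cached entry is flushed downstream.

```java
// Hypothetical illustration (placeholder classes, not the real caching-store
// code) of why the raw record goes missing and how the fix is meant to work.
final class CachedContextExample {

    static final class RecordContext {
        final String topic;
        final int partition;
        final long offset;
        final byte[] rawKey;   // raw source record data
        final byte[] rawValue;

        RecordContext(final String topic, final int partition, final long offset,
                      final byte[] rawKey, final byte[] rawValue) {
            this.topic = topic;
            this.partition = partition;
            this.offset = offset;
            this.rawKey = rawKey;
            this.rawValue = rawValue;
        }
    }

    // Before the fix: the cached copy drops the raw record, so downstream
    // processing (and the exception handler) sees null raw key/value.
    static RecordContext cacheWithoutRawRecord(final RecordContext current) {
        return new RecordContext(current.topic, current.partition, current.offset, null, null);
    }

    // After the fix: the raw record travels with the cached entry.
    static RecordContext cacheWithRawRecord(final RecordContext current) {
        return new RecordContext(current.topic, current.partition, current.offset,
                                 current.rawKey, current.rawValue);
    }
}
```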