Be more rigorous about ensuring the right statements get executed. #2435

rodesai · 2019-02-12T19:13:44Z

There is ongoing work (#2329) to decouple statement validation when executing the command topic from external validation. However, even our own internal validation can have different outcomes across different runs of the KSQL server. Consider the following scenario:

Recently there was a validation bug in KafkaStreams () where streams would accept windows larger than the default retention time. Such a window should not be accepted because it is not recoverable across node failures. Once the bug is fixed, KafkaStreams will reject such windows.

KSQL may have a command topic containing a statement with such a window that was accepted by the cluster. After an upgrade, the statement will be rejected, at which point KSQL's internal state may be corrupted. Though it would be reasonable to mark the query as failed, the final set of streams and tables should be the same.

We should be even more rigorous about this to ensure that the same statements are executed every time the command topic is replayed. I propose we do the following:
- When writing commands to the command topic, include a validation signature. This can be the command topic offset for which validation was done (we'd need some way to synchronize getting the sandbox with the executor thread in this case), or a checksum of all our state. I think the former would be more extensible.
- When executing statements, if the current validation signature is not the same as the one in the command, reject the statement. If the signature is the same, then apply KSQL's internal state changes even if the statement itself fails. If it fails we should have some way of communicating to the user that the query could not be started.

big-andy-coates · 2019-02-18T14:34:38Z

The server side pattern can be:

request received
create sandbox (which knows which offset its built from)
For each statement
- validate
- post to command topic. (tracking offset).
- if (offset statement was produced at != expected)
  - wait for executor to reach max tracked offset.
  - create a new sandbox, remove any distributed statements and then start looping around statements again.

i.e. the server side would know when it got gazumped when writing to the command topic because the offset it produced a message at would be higher than expected. So the server side can automatically handle resyncing state, re-validating and posting to the command topic again.

The executor side, as you say, will reject any statement that isn't marked as having been validated against the right version of state.

agavra · 2019-02-20T21:45:31Z

This issue may be more prevalent than @rodesai's original description. If we have multiple servers, we validate a command with respect to our local meta-store. Imagine the following commands:

1: CREATE STREAM foo WITH (kafka_topic='bar');
2: CREATE STREAM foo WITH (kafka_topic='baz');

At time t=0, server A gets command 1 and server B gets command 0. Both are able to validate the command and enqueue the CS statement. The command topic will now have conflicting statements. This proposal fixes that as well.

rodesai · 2019-02-22T18:54:57Z

In that case we'll actually do the right thing today - whichever statement gets queued second will fail every time, and we'll end up with the right internal state. But if (hypothetically) tomorrow we changed the semantics to implicitly drop the first stream and create the second (which would be a terrible idea that we should never do. just using it as an example here), having the validation described in this issue would guarantee we still get the same execution of the command topic.

agavra · 2019-03-19T16:23:42Z

Things get a little bit more complicated with multi-line statements when we use the offset-validation approach. We send each statement as a separate command, so it is possible that even though c1 and c2 are validated together, a command c3 is enqueued between them. This would make c2 fail. Not sure if this is desired behavior or not, but I will stick to this for the first pass.

…onfluentinc#2435)

agavra · 2019-12-03T23:49:35Z

Has this been fixed with #3660 @stevenpyzhang?

stevenpyzhang · 2019-12-04T00:27:07Z

This is fixed now. The scenario you mentioned here can still happen though since each command goes through the transaction protocol individually so users submitting multi-line statements could get partially successful results
#2435 (comment)

agavra · 2019-12-04T00:28:33Z

I think that's okay - we may want to create a new issue to track that, but I'm closing this one out for now. Thanks!

agavra mentioned this issue Feb 13, 2019

Process commands so that WITH clause is always resolved before executing it #2436

Merged

2 tasks

agavra added a commit to agavra/ksql that referenced this issue Mar 19, 2019

add checksum to enqueued commands (Fixes confluentinc#2435)

9cc3155

agavra added a commit to agavra/ksql that referenced this issue Mar 19, 2019

add checksum to enqueued commands (Fixes confluentinc#2435)

df07692

agavra added a commit to agavra/ksql that referenced this issue Mar 19, 2019

add checksum to enqueued commands (Fixes confluentinc#2435)

1de1e3e

agavra added a commit to agavra/ksql that referenced this issue Mar 19, 2019

add checksum to enqueued commands (Fixes confluentinc#2435)

709ff9c

agavra added a commit to agavra/ksql that referenced this issue Mar 19, 2019

add checksum to enqueued commands (Fixes confluentinc#2435)

825c9e0

agavra added a commit to agavra/ksql that referenced this issue Apr 12, 2019

add checksum to enqueued commands to prevent validation races (Fixes c…

0191bbf

…onfluentinc#2435)

agavra added a commit to agavra/ksql that referenced this issue Apr 13, 2019

add checksum to enqueued commands to prevent validation races (Fixes c…

bda7dab

…onfluentinc#2435)

agavra mentioned this issue Apr 16, 2019

add checksum to enqueued commands to prevent validation races (Fixes #2435) #2582

Closed

2 tasks

vcrfxia mentioned this issue Aug 25, 2019

Increment persistent query ID during recovery even if query fails validation #3269

Closed

rodesai mentioned this issue Aug 29, 2019

fix: increment query_id even on failed queries to prevent id offset during restore #3278

Closed

2 tasks

stevenpyzhang mentioned this issue Aug 30, 2019

Mismatched topic partition/replica values after server restart #3288

Closed

stevenpyzhang mentioned this issue Sep 11, 2019

feat: add command topic offset to commands and perform validation based on offset when executing commands #3330

Closed

2 tasks

stevenpyzhang mentioned this issue Oct 24, 2019

feat: Transactional Produces to Command Topic #3660

Merged

2 tasks

vcrfxia mentioned this issue Nov 1, 2019

Extend command messages to include cookie referencing request and use this to map to completion futures #1860

Closed

agavra closed this as completed Dec 4, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Be more rigorous about ensuring the right statements get executed. #2435

Be more rigorous about ensuring the right statements get executed. #2435

rodesai commented Feb 12, 2019

big-andy-coates commented Feb 18, 2019 •

edited

Loading

agavra commented Feb 20, 2019

rodesai commented Feb 22, 2019

agavra commented Mar 19, 2019

agavra commented Dec 3, 2019

stevenpyzhang commented Dec 4, 2019

agavra commented Dec 4, 2019

Be more rigorous about ensuring the right statements get executed. #2435

Be more rigorous about ensuring the right statements get executed. #2435

Comments

rodesai commented Feb 12, 2019

big-andy-coates commented Feb 18, 2019 • edited Loading

agavra commented Feb 20, 2019

rodesai commented Feb 22, 2019

agavra commented Mar 19, 2019

agavra commented Dec 3, 2019

stevenpyzhang commented Dec 4, 2019

agavra commented Dec 4, 2019

big-andy-coates commented Feb 18, 2019 •

edited

Loading