You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Currently the Kafka connector does not provide query filter push-down based on offset, partition or timestamp filter.
It would be nice to have the following filter options pushed-down to a Kafka seek operation:
_partition_offset between start_offset and end_offset
_partition_offset between start_timestamp and end_timestamp (where the timestamps are translated into offsets by using the Kafka offsetsForTimes method
_partition_id equals to one or more partition ids
Additionally the timestamp header field of a Kafka record should be exposed as an internal field (i.e. _timestamp).
The semantic of _timestamp is not the same as the 2nd filter option shown above. So I don't think it's correct to use such a new internal field to implement the 2nd option, such as WHERE _timestamp BETWEEN start_timestamp AND end_timestamp. It would be better to use another internal field, like _timestamp_offset for that.
The reason for that is, that the timestamp header can be set by the Kafka producer client to the real "event time", which can also be a timestamp in the past, if it is a disconnected client. So when the message is written to Kafka, it gets an offset at "ingestion time", which is not the same as the "event time" in that case. A push-down using a seek can only be done on the "ingestion time" (by translating the time to an offset) and never on the "event time". But of course it would still be good to also allow a where clause on the "event time", i.e. _timestamp.
The text was updated successfully, but these errors were encountered:
Currently the Kafka connector does not provide query filter push-down based on offset, partition or timestamp filter.
It would be nice to have the following filter options pushed-down to a Kafka seek operation:
_partition_offset
between start_offset and end_offset_partition_offset
between start_timestamp and end_timestamp (where the timestamps are translated into offsets by using the Kafka offsetsForTimes method_partition_id
equals to one or more partition idsAdditionally the timestamp header field of a Kafka record should be exposed as an internal field (i.e. _timestamp).
The semantic of
_timestamp
is not the same as the 2nd filter option shown above. So I don't think it's correct to use such a new internal field to implement the 2nd option, such asWHERE _timestamp BETWEEN start_timestamp AND end_timestamp
. It would be better to use another internal field, like_timestamp_offset
for that.The reason for that is, that the timestamp header can be set by the Kafka producer client to the real "event time", which can also be a timestamp in the past, if it is a disconnected client. So when the message is written to Kafka, it gets an offset at "ingestion time", which is not the same as the "event time" in that case. A push-down using a seek can only be done on the "ingestion time" (by translating the time to an offset) and never on the "event time". But of course it would still be good to also allow a where clause on the "event time", i.e.
_timestamp
.The text was updated successfully, but these errors were encountered: