-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
changefeedccl: support attributes in pubsub sink #116089
Conversation
8ae06a3
to
4cc7e66
Compare
4cc7e66
to
a582967
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 17 of 18 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jayshrivastava)
pkg/ccl/changefeedccl/sink.go
line 88 at r1 (raw file):
// error may be returned if a previously enqueued message has failed. EmitRow(ctx context.Context, topic TopicDescriptor, key, value []byte, updated, mvcc hlc.Timestamp, alloc kvevent.Alloc, tableName string) error
I'm curious: can't we use or extend TopicDescriptor already passed to this function to carry tableName?
Like: doesn't it have StatementTimeName?
pkg/ccl/changefeedccl/sink_pubsub_v2.go
line 251 at r1 (raw file):
psb.messages = append(psb.messages, &pb.PubsubMessage{ Data: content, Attributes: map[string]string{"TABLE_NAME": psb.tableName}})
Do we unconditionally include these attributes?
pkg/ccl/changefeedccl/sink_pubsub_v2.go
line 251 at r1 (raw file):
psb.messages = append(psb.messages, &pb.PubsubMessage{ Data: content, Attributes: map[string]string{"TABLE_NAME": psb.tableName}})
Looking at this code -- makes me sad -- we allocate &pb.PubsubMessage structures all the time.
So sad... Especially since the pb.PubsubMessage has Reset method... But that's for another cleanup CL...
We would probably need to have "attributes cache" to avoid allocating these maps. Okay to leave a TODO.
a582967
to
8269b5c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy)
pkg/ccl/changefeedccl/sink.go
line 88 at r1 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
I'm curious: can't we use or extend TopicDescriptor already passed to this function to carry tableName?
Like: doesn't it have StatementTimeName?
I thought about it (actually I think I coded it using this method and abandoned it). The topic desc gets instantiated from the job record and we don't have the table name in the job record. So, we either add the table name or we have to look it up. It's much easier to look up the table name from the row. Also, I like having a separate parameter for row related metadata and keep the topic data in its own param. In a follow up PR, we will change it from just tableName string to something generic like struct{}.
pkg/ccl/changefeedccl/sink_pubsub_v2.go
line 251 at r1 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
Do we unconditionally include these attributes?
Yeah we shouldn't. I added a check and test case for nil attributes.
pkg/ccl/changefeedccl/sink_pubsub_v2.go
line 251 at r1 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
Looking at this code -- makes me sad -- we allocate &pb.PubsubMessage structures all the time.
So sad... Especially since the pb.PubsubMessage has Reset method... But that's for another cleanup CL...We would probably need to have "attributes cache" to avoid allocating these maps. Okay to leave a TODO.
I updated the code so we instantiate it once per batch buffer and save it. This way, each new message doesn't alloc a new map.
8269b5c
to
66c50a6
Compare
66c50a6
to
4c0ecb2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jayshrivastava)
pkg/ccl/changefeedccl/sink.go
line 88 at r1 (raw file):
Previously, jayshrivastava (Jayant) wrote…
I thought about it (actually I think I coded it using this method and abandoned it). The topic desc gets instantiated from the job record and we don't have the table name in the job record. So, we either add the table name or we have to look it up. It's much easier to look up the table name from the row. Also, I like having a separate parameter for row related metadata and keep the topic data in its own param. In a follow up PR, we will change it from just tableName string to something generic like struct{}.
I agree that it's easier to get the table from the row... What I don't understand is why can't we have topic descriptor always carry table name.
Consider the place where we obtain table descriptor (encodeAndEmit):
topic, err := c.topicForEvent(updatedRow.Metadata)
if err != nil {
return err
}
the updatedRow.Metadata
already has TableName set. Why not have topicForEvent just use metadata.TableName to return appropriate table descriptor?
I understand that currently, in this version, you only conditionally set tableName to non-empty
string if tableNameAttributedEnabled:
tableName := ""
if tableNameAttributeEnabled {
tableName = updatedRow.TableName
}
But... why? Why not just have topic descriptor carry table name?
If you are worried about future changes and perhaps having to extend things -- you might want to consider
introducing an additional interface.... something like:
type Attributer interface {
TableName() string
AnotherAttribute() int
...
}
Have topic descriptor lookup optionally wrap TopicDescriptor with "Attributer" interface. At the place
where you need it -- e.g. pubsub sink:
if attributer, ok := topic.(Attributer); ok {
// We should include additional message attributes
}
pkg/ccl/changefeedccl/sink_pubsub_v2.go
line 252 at r2 (raw file):
} // No need to allocate
is this meant to be a todo?
d26f38e
to
dedd23d
Compare
dedd23d
to
e239eee
Compare
Verified using the local emulator that it works
|
e239eee
to
3a04bc5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 6 files at r2, 5 of 17 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jayshrivastava)
pkg/ccl/changefeedccl/batching_sink.go
line 190 at r3 (raw file):
payload.mvcc = mvcc payload.alloc = alloc payload.attributes.tableName = topic.GetTableName()
since we store topicDescriptor with payload.... why do we need to have payload.attributes?
pkg/ccl/changefeedccl/sink_pubsub_v2.go
line 47 at r3 (raw file):
} var tableNameAttributeEnabled = envutil.EnvOrDefaultBool("COCKROACH_ENABLE_TABLE_NAME_PUSBUB_ATTRIBUTE", false)
I almost question this env var... I know we discussed it before, and agreed to use this as an opt in mechanism...
But wouldn't having a url parameter (e.g. with_table_name_attribute
or whatnot) be a form of opt in that does not require entire cluster restart to pick up?
pkg/ccl/changefeedccl/sink_pubsub_v2.go
line 151 at r3 (raw file):
} sinkClient.mu.topicCache = make(map[string]struct{}) sinkClient.attributesMu.cache = cache.NewUnorderedCache(cache.Config{
I don't know if we need to have an lru cache...
in the cases when you have many topics published to the same sink -- it's likely to be
small(ish) number. And even if you rename ALL topics (and so, presumably, your LRU cache
might kick in) -- again, I just don't see how it matters if we store e.g. 10 elements in a map, or 20...
Basically, consider using regular map.
Also, consider taking this cache and just putting it inside regular mu (which already has topicCache).
pkg/ccl/changefeedccl/sink_pubsub_v2.go
line 270 at r3 (raw file):
if tableNameAttributeEnabled { psb.sc.attributesMu.Lock() defer psb.sc.attributesMu.Unlock()
it's a bit sad we need to acquire this lock...
But ... do we need to acquire the lock?
Does it make sense to have attributes per buffer? Is it that much extra?
If you think it has to be on sink level, then maybe try sync.Map or some such
to avoid grabbing a lock.
3a04bc5
to
1e864ab
Compare
9becd8e
to
5ab7954
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy)
pkg/ccl/changefeedccl/batching_sink.go
line 190 at r3 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
since we store topicDescriptor with payload.... why do we need to have payload.attributes?
Done.
pkg/ccl/changefeedccl/sink_pubsub_v2.go
line 47 at r3 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
I almost question this env var... I know we discussed it before, and agreed to use this as an opt in mechanism...
But wouldn't having a url parameter (e.g.with_table_name_attribute
or whatnot) be a form of opt in that does not require entire cluster restart to pick up?
Done.
pkg/ccl/changefeedccl/sink_pubsub_v2.go
line 151 at r3 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
I don't know if we need to have an lru cache...
in the cases when you have many topics published to the same sink -- it's likely to be
small(ish) number. And even if you rename ALL topics (and so, presumably, your LRU cache
might kick in) -- again, I just don't see how it matters if we store e.g. 10 elements in a map, or 20...
Basically, consider using regular map.Also, consider taking this cache and just putting it inside regular mu (which already has topicCache).
Done. Now it's just a map per-buffer.
pkg/ccl/changefeedccl/sink_pubsub_v2.go
line 270 at r3 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
it's a bit sad we need to acquire this lock...
But ... do we need to acquire the lock?Does it make sense to have attributes per buffer? Is it that much extra?
If you think it has to be on sink level, then maybe try sync.Map or some such
to avoid grabbing a lock.
Done.
pkg/ccl/changefeedccl/sink.go
line 88 at r1 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
I agree that it's easier to get the table from the row... What I don't understand is why can't we have topic descriptor always carry table name.
Consider the place where we obtain table descriptor (encodeAndEmit):topic, err := c.topicForEvent(updatedRow.Metadata) if err != nil { return err }
the
updatedRow.Metadata
already has TableName set. Why not have topicForEvent just use metadata.TableName to return appropriate table descriptor?I understand that currently, in this version, you only conditionally set tableName to non-empty
string if tableNameAttributedEnabled:tableName := "" if tableNameAttributeEnabled { tableName = updatedRow.TableName }
But... why? Why not just have topic descriptor carry table name?
If you are worried about future changes and perhaps having to extend things -- you might want to consider
introducing an additional interface.... something like:type Attributer interface { TableName() string AnotherAttribute() int ... }
Have topic descriptor lookup optionally wrap TopicDescriptor with "Attributer" interface. At the place
where you need it -- e.g. pubsub sink:if attributer, ok := topic.(Attributer); ok { // We should include additional message attributes }
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 2 files at r7, 3 of 3 files at r8.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jayshrivastava)
pkg/ccl/changefeedccl/sink_pubsub_v2.go
line 74 at r8 (raw file):
// The number of distinct attribute maps to cache per-sink client (which is // the same as per-changefeed). var attributesCacheSize = 128
is this used?
pkg/ccl/changefeedccl/sink_pubsub_v2.go
line 253 at r8 (raw file):
msg := &pb.PubsubMessage{Data: content} if psb.sc.withTableNameAttribute { attrMap, ok := psb.attributesCache[attributes]
this doesn't work as a cache since you never initialize attributesCache (make(map[]....), so it's nil and will always return !ok;
pkg/ccl/changefeedccl/sink_pubsub_v2.go
line 255 at r8 (raw file):
attrMap, ok := psb.attributesCache[attributes] if !ok { attrMap = map[string]string{"TABLE_NAME": attributes.tableName}
you need to set psb.attributesCache[attributes] = attrMap
5ab7954
to
1f3ffe1
Compare
This change adds support for including the table name along with each row/batch sent by the v2 pubsub sink (enabled by default). The table name is passed inside pubsub attributes. Attributes are stored in a `map[string]string` and passed emitted alongside each with each message/batch. To enable this feature, the uri parameter `with_table_name_attribute=true` must be added to the sink uri. The key for the table name in the attribute map will be `TABLE_NAME`. Because this change needs to be backported, it is as small as possible to minimize risk. This feature can be expanded upon later to be more generic (ex. use changefeed options instead of env var, use cdc queries to specify custom attributes, use a generic metadata struct instead of tablename string, pass metadata to different sinks and not just pubsub etc). Because this feature will be expanded in the future, the release note is intentionally left blank. Release note: None Closes: cockroachdb#115426
1f3ffe1
to
dde2fbf
Compare
bors r+ |
Build succeeded: |
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool. error creating merge commit from dde2fbf to blathers/backport-release-23.2-116089: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool. Backport to branch 23.2.x failed. See errors above. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
This change adds support for including the table name along with each row/batch
sent by the v2 pubsub sink (enabled by default). The table name is passed
inside pubsub attributes. Attributes are stored in a
map[string]string
and passedemitted alongside each with each message/batch.
To enable this feature, the uri parameter
with_table_name_attribute=true
must be addedto the sink uri.
The key for the table name in the attribute map will be
TABLE_NAME
. Becausethis change needs to be backported, it is as small as possible to minimize risk.
This feature can be expanded upon later to be more generic (ex. use changefeed
options instead of env var, use cdc queries to specify custom attributes,
use a generic metadata struct instead of tablename string, pass metadata
to different sinks and not just pubsub etc). Because this feature will be expanded
in the future, the release note is intentionally left blank.
Release note: None
Closes: #115426