sql: rationalize the handling of comments #95431
Conversation
@postamar could you tell me what you think about this before I spend more time finalizing it?
This is very nice.
schema: `
CREATE VIEW pg_catalog.pg_shdescription AS SELECT objoid, classoid, description
FROM "".crdb_internal.kv_catalog_comments
WHERE classoid = ` + strconv.Itoa(catconstants.PgCatalogDatabaseTableID) + `:::oid`,
if you want to make this even more efficient you could add a virtual partial index
`INDEX(classoid) WHERE classoid = ` + strconv.Itoa(catconstants.PgCatalogDatabaseTableID) + `:::oid`
and in that one call `GetAllDatabases` instead of `GetAll`. It'll do a namespace join, so an extra round-trip, but it might read a heck of a lot less data and produce a heck of a lot fewer rows to filter in some cases.
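To illustrate the payoff, here is a self-contained Go sketch of the difference between a full scan plus filter and a partial index that only ever generates matching rows. The types, constants, and function names here are illustrative stand-ins, not the real vtable or `catconstants` APIs:

```go
package main

import "fmt"

// Illustrative stand-ins for catalog table OIDs (not the real values).
const (
	pgClassTableID    = 1259
	pgDatabaseTableID = 1262
)

type commentRow struct {
	classOID    int
	objOID      int
	description string
}

// scanAll models a full scan of kv_catalog_comments followed by a filter:
// every row is generated, then most are thrown away.
func scanAll(rows []commentRow, classOID int) []commentRow {
	var out []commentRow
	for _, r := range rows {
		if r.classOID == classOID {
			out = append(out, r)
		}
	}
	return out
}

// partialIndexScan models a virtual partial index on classoid: only rows
// satisfying the index predicate are ever generated, so nothing needs to be
// filtered after the fact.
func partialIndexScan(byClass map[int][]commentRow, classOID int) []commentRow {
	return byClass[classOID]
}

func main() {
	rows := []commentRow{
		{pgDatabaseTableID, 1, "db comment"},
		{pgClassTableID, 2, "table comment"},
	}
	byClass := map[int][]commentRow{}
	for _, r := range rows {
		byClass[r.classOID] = append(byClass[r.classOID], r)
	}
	fmt.Println(len(scanAll(rows, pgDatabaseTableID)))          // 1
	fmt.Println(len(partialIndexScan(byClass, pgDatabaseTableID))) // 1
}
```

Both paths return the same rows; the difference is how many rows are materialized before the predicate applies.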
Prior to this patch, we had a weird translation going in `kv_catalog_comments`, which was causing the tree of objects to be traversed 6 times (one per comment type), and then the rows to be mapped one by one again in `pg_description` / `pg_shdescription`.

This made no sense - it should be possible to push a predicate on `pg_description` down to a virtual index on `kv_catalog_comments`, and this requires the column types to align and no computation to happen in-between.

This commit simplifies all this by having `kv_catalog_comments` responsible for preparing the data in the format expected by the `pg_` tables.

Release note: None
Force-pushed from 1c78d85 to fec1227.
Ok I am making progress, but I need some help.
My calls for help are in inline comments below. Before answering them, I'd also welcome a skim review of my changes to `catkv`.
I appreciate that the work here is a bit overkill. But this PR is more an attempt to educate myself about the catalog APIs than to make a true perf optimization. (although I'd be happy if I can also achieve the latter.) So I'd be grateful for the education.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @postamar)
pkg/sql/crdb_internal.go
line 5180 at r2 (raw file):
// FIXME: when dbContext is not nil, we want a catalog with
// _only_ that db desc and its comment, if any. How to build
// it?
I need help here. I want to handle the following cases well:
"".crdb_internal.kv_catalog_comments where pgclassoid = <pg_database>
  -> all databases, only database comments

dbprefix.crdb_internal.kv_catalog_comments where pgclassoid = <pg_database>
  -> just this 1 database, only this db's comment
pkg/sql/crdb_internal.go
line 5195 at r2 (raw file):
var err error
if dbContext != nil {
	all, err = p.Descriptors().GetAllInDatabase(ctx, p.txn, dbContext)
Here I need all comments inside this DB to be included, but not all comments across all dbs.
pkg/sql/pg_catalog.go
line 1639 at r1 (raw file):
Previously, ajwerner wrote…
if you want to make this even more efficient you could add a virtual partial index

`INDEX(classoid) WHERE classoid = ` + strconv.Itoa(catconstants.PgCatalogDatabaseTableID) + `:::oid`

and in that one call `GetAllDatabases` instead of `GetAll`. It'll do a namespace join, so an extra round-trip, but it might read a heck of a lot less data and produce a heck of a lot fewer rows to filter in some cases.
Done.
pkg/sql/catalog/descs/collection.go
line 743 at r2 (raw file):
// FIXME: here probably I want to call LoadComments() with just
// DatabaseCommentType. But I only want to do so when the caller is
// interested in comments. Not all are.
I need help here: I want comments of type `DatabaseCommentType` to be loaded here, but only if the caller cares. Should I add a flag to `GetAllDatabases()` for that?
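One way to express "only if the caller cares" without a bare boolean in every call site is a functional option. A minimal self-contained sketch of that shape, with hypothetical names (this is not the real `descs.Collection` API):

```go
package main

import "fmt"

type getAllOptions struct {
	withComments bool
}

// Option configures a hypothetical GetAllDatabases call.
type Option func(*getAllOptions)

// WithComments asks for database comments to be loaded alongside descriptors.
func WithComments() Option { return func(o *getAllOptions) { o.withComments = true } }

// catalogStub stands in for an nstree.Catalog-like result.
type catalogStub struct {
	databases []string
	comments  map[string]string
}

// getAllDatabases models a collection method that pays the comment-loading
// cost only when asked to.
func getAllDatabases(opts ...Option) catalogStub {
	var o getAllOptions
	for _, opt := range opts {
		opt(&o)
	}
	c := catalogStub{databases: []string{"defaultdb", "mydb"}}
	if o.withComments {
		// In the real code this would be the extra LoadComments /
		// comment-key scan; here it is just stubbed data.
		c.comments = map[string]string{"mydb": "my database"}
	}
	return c
}

func main() {
	fmt.Println(getAllDatabases().comments == nil)               // true
	fmt.Println(getAllDatabases(WithComments()).comments["mydb"]) // my database
}
```

The option keeps the common path cheap while letting comment-aware callers opt in explicitly.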
pkg/sql/catalog/descs/collection.go
line 854 at r2 (raw file):
// Also ensure the db desc itself is included, which ensures we can
// fetch its comment if any below.
// FIXME: this does not seem to work. Why?
I also need help here.
I'm not 100% sure I want/need this.
What I'm trying to cover here is the case of `mydb.crdb_internal.kv_catalog_comments`, in which case I believe I want a row for the database comment on `mydb` if there is one.

If I do need this, then I need to modify this function to be more like `GetAllDescriptorsForDatabase` (which is deprecated, but does include the dbDesc in the resulting catalog).

But I could also decide that I don't want this, i.e. I redefine `kv_catalog_comments` to contain only non-database comments. In that case I need to re-define `pg_shdescription` to use another vtable than `kv_catalog_comments`.

What do you think is best?
pkg/sql/catalog/descs/collection.go
line 873 at r2 (raw file):
// FIXME: This does not seem to be the right place to call this --
// it should load comments into `stored` _before_ the call
// to aggregateLayers(), so that shadowed comments don't get overwritten.
I also need help here. In this function it is too late to call `LoadComments`; I need this to happen before `aggregateLayers`. But `stored` as returned by `ScanNamespaceForDatabaseSchemasAndObjects` is an `nstree.Catalog`, not a `MutableCatalog`. It seems inefficient to re-build a mutable catalog just to add the comments.

What would be your guidance? Add some "include comments" flag to `ScanNamespaceForDatabaseSchemasAndObjects`? A new method on the `CatalogReader`?
Regarding the catalog API's shortcomings: I think what we really want is for `aggregateAllLayers` to behave correctly, correctness being defined thus: presently it behaves correctly when `stored` is the product of a `ScanAll` or a `GetByIDs`, but it does not for any other `CatalogReader` method.

If you don't want to do this (quite understandable!) I'll be happy to; please file a stub of an issue and assign it to me.
populate: func(ctx context.Context, _ tree.Datum, p *planner, dbContext catalog.DatabaseDescriptor, addRow func(...tree.Datum) error) (matched bool, err error) {
	// FIXME: when dbContext is not nil, we want a catalog with
	// _only_ that db desc and its comment, if any. How to build
	// it?
Good question. The catalog API currently doesn't have any way to look up all the things for one ID and return an `nstree.Catalog`, but there's definitely room for that, and I'd welcome that addition. There are several alternatives, in terms of interface design:

1. add methods to the getters in `getters.go`: this makes it possible to choose between leased or unleased descriptors, etc.
2. add a method directly to the collection in `collection.go` which leverages `aggregateAllLayers`: this would ensure that the results are consistent with those in `GetAll` or `GetAllInDatabase` etc.
3. do both of the above.

I think for virtual tables we definitely need (2) because we want to be extra-sure that the results don't vary depending on whether a virtual index was used or not. There's no very wrong choice here because deeper down it's all using the same machinery.

I haven't tried whether this works but I wouldn't be surprised if it did (not sure about edge cases though):
func (tc *Collection) GetAllForIDs(
	ctx context.Context, txn *kv.Txn, ids ...descpb.ID,
) (nstree.Catalog, error) {
	const isDescriptorRequired = false
	stored, err := tc.cr.GetByIDs(ctx, txn, ids, isDescriptorRequired, catalog.Any)
	if err != nil {
		return nstree.Catalog{}, err
	}
	ret, err := tc.aggregateAllLayers(ctx, txn, stored)
	if err != nil {
		return nstree.Catalog{}, err
	}
	return ret.FilterByIDs(ids), nil
}
		return nstree.Catalog{}, err
	}
	return mc.Catalog, nil
}
Serious question: what's wrong with using `GetByIDs` instead? AFAIK it shouldn't matter much that we're also reading `system.descriptor`, we're almost certainly going to be doing that anyway at some point in the transaction. I'm keen to keep this `CatalogReader` interface tight.
// FIXME: here probably I want to call LoadComments() with just
// DatabaseCommentType. But I only want to do so when the caller is
// interested in comments. Not all are.
Assuming that it's fine to replace `LoadComments` in favour of `GetByIDs`, and assuming that we would in fact rather not scan the comments table so much (do we really care? those scans are going to be empty most of the time, aren't they?), what we could do is make the `GetByIDs` implementation smarter based on the `expectedType` argument: if we're expecting a database descriptor then there's no point in scanning for non-database comments, and so forth.
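The narrowing described above can be sketched in isolation as a mapping from the expected descriptor type to the comment types worth scanning. Everything here is an illustrative model, not the real `catalogkeys`/`catkv` types or values:

```go
package main

import "fmt"

// Illustrative comment types, mirroring the shape (not the values) of the
// real CommentType enum.
type commentType int

const (
	databaseComment commentType = iota
	schemaComment
	tableComment
	columnComment
	indexComment
	constraintComment
)

// Illustrative expected-descriptor-type argument, standing in for the real
// expectedType parameter of GetByIDs.
type descriptorType int

const (
	anyDescriptor descriptorType = iota
	databaseDescriptor
	schemaDescriptor
	tableDescriptor
)

// commentTypesFor models a GetByIDs implementation that narrows the comment
// scan based on the expected descriptor type: if we expect a database, only
// database comments can exist for those IDs, so the other scans are skipped.
func commentTypesFor(t descriptorType) []commentType {
	switch t {
	case databaseDescriptor:
		return []commentType{databaseComment}
	case schemaDescriptor:
		return []commentType{schemaComment}
	case tableDescriptor:
		return []commentType{tableComment, columnComment, indexComment, constraintComment}
	default:
		// No expectation: scan everything.
		return []commentType{
			databaseComment, schemaComment, tableComment,
			columnComment, indexComment, constraintComment,
		}
	}
}

func main() {
	fmt.Println(len(commentTypesFor(databaseDescriptor))) // 1
	fmt.Println(len(commentTypesFor(anyDescriptor)))      // 6
}
```

The same table-driven narrowing would apply to zone configs, per the follow-up comment below.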
We probably need to do something similar for zone configs as well.
// Also ensure the db desc itself is included, which ensures we can
// fetch its comment if any below.
// FIXME: this does not seem to work. Why?
ret.UpsertDescriptor(db)
the call to `FilterByIDs` in the final `return` removes the database descriptor from the result set. If you want to include the database descriptor in the `GetAllInDatabase` result set and thereby change its contract, then you should also include its zone config, etc., which IIRC will require an extra point lookup. For consistency's sake, you'd then have to do the same in `GetAllObjectsInSchema`. I'd considered it when I wrote these methods but eventually ruled against it for some reason; still, I don't have a fundamental objection to including the root of the hierarchy, not at all.
// Also include all the comments for this object.
// FIXME: This does not seem to be the right place to call this --
// it should load comments into `stored` _before_ the call
// to aggregateLayers(), so that shadowed comments don't get overwritten.
And zone configs. I think your instinct is correct but I'd go even further: `aggregateAllLayers` needs to do for comments and zone configs what it already does for descriptors and namespace records. It already does read comments and the zone configs, in fact: it calls `getDescriptorsByID`, which in turn calls `GetByIDs` for descriptors which weren't found in the uncommitted layer and others. The results of `GetByIDs` are cached, so there'd be no harm in calling it another time towards the end of `aggregateAllLayers` once the `descIDs` set is built, and using the results to populate `ret` correctly.
104763: sql: rationalize the handling of comments and improve performance r=rafiss a=rafiss

rebase of #95431
fixes #102851

Prior to this patch, we had a weird translation going in `kv_catalog_comments`, which was causing the tree of objects to be traversed 6 times (one per comment type), and then the rows to be mapped one by one again in `pg_description` / `pg_shdescription`. This made no sense - it should be possible to push a predicate on `pg_description` down to a virtual index on `kv_catalog_comments`, and this requires the column types to align and no computation to happen in-between. This commit simplifies all this by having `kv_catalog_comments` responsible for preparing the data in the format expected by the `pg_` tables.

The number of roundtrips needed for an ORM query that uses pg_description dropped to 4, from 134.

```
goos: linux
goarch: amd64
cpu: Intel(R) Xeon(R) CPU @ 2.80GHz
                                              │ /tmp/tmp.0yUMg6OauX/bench.6abe1851d83 │ /tmp/tmp.0yUMg6OauX/bench.89a494eb3010b945ae5a2be7a057e50802b231c0 │
                                              │    sec/op     │    sec/op    vs base
ORMQueries/introspection_description_join-24    10687.1m ± 1%   144.1m ± 8%  -98.65% (p=0.000 n=10)

                                              │  roundtrips   │  roundtrips  vs base
ORMQueries/introspection_description_join-24     134.000 ± 0%    4.000 ± 0%  -97.01% (p=0.000 n=10)

                                              │     B/op      │     B/op     vs base
ORMQueries/introspection_description_join-24    3846.69Mi ± 0%  38.98Mi ± 6%  -98.99% (p=0.000 n=10)

                                              │   allocs/op   │  allocs/op   vs base
ORMQueries/introspection_description_join-24    45566.9k ± 0%   433.3k ± 6%  -99.05% (p=0.000 n=10)
```

Release note: None

105069: sql: disable strict gc ttl enforcement in schema changer test r=rharding6373 a=rharding6373

Before this change, the TestRaceWithBackfill test would add a zone config to force garbage collection to happen frequently; however, this could cause a race between batch timestamps and garbage collection thresholds. This change adds a call to `DisableGCTTLStrictEnforcement`, which is supposed to be called before updating the zone config with `AddImmediateGCZoneConfig` for testing purposes.

Epic: None
Fixes: #104655
Release note: None

105156: roachtest: fix tpchvec in some cases r=yuzefovich a=yuzefovich

We recently introduced a minor bug in tpchvec in c5c7ead where we could use an uninitialized helper to parse the test output, which can lead to a nil pointer when the slowness threshold is exceeded; this is now fixed.

Epic: None
Release note: None

Co-authored-by: Rafi Shamim <[email protected]>
Co-authored-by: rharding6373 <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
Found while working on #88061 .
Prior to this patch, we had a weird translation going in `kv_catalog_comments`, which was causing the tree of objects to be traversed 6 times (one per comment type), and then the rows to be mapped one by one again in `pg_description` / `pg_shdescription`.

This made no sense - it should be possible to push a predicate on `pg_description` down to a virtual index on `kv_catalog_comments`, and this requires the column types to align and no computation to happen in-between.

This commit simplifies all this by having `kv_catalog_comments` responsible for preparing the data in the format expected by the `pg_` tables.

Release note: None
Epic: None