
sql: rationalize the handling of comments #95431

Closed
wants to merge 1 commit

Conversation

knz
Contributor

@knz knz commented Jan 18, 2023

Found while working on #88061 .

Prior to this patch, we had a weird translation going on in `kv_catalog_comments`, which caused the tree of objects to be traversed 6 times (once per comment type), and then the rows to be mapped one by one again in `pg_description` / `pg_shdescription`.

This made no sense: it should be possible to push a predicate on `pg_description` down to a virtual index on `kv_catalog_comments`, and that requires the column types to align and no computation to happen in-between.

This commit simplifies all this by making `kv_catalog_comments` responsible for preparing the data in the format expected by the `pg_` tables.

Release note: None
Epic: None

@knz knz requested a review from postamar January 18, 2023 14:22

@knz
Contributor Author

knz commented Jan 18, 2023

@postamar could you tell me what you think about this before I spend more time finalizing it?

Contributor

@ajwerner ajwerner left a comment


This is very nice.

schema: `
CREATE VIEW pg_catalog.pg_shdescription AS SELECT objoid, classoid, description
FROM "".crdb_internal.kv_catalog_comments
WHERE classoid = ` + strconv.Itoa(catconstants.PgCatalogDatabaseTableID) + `:::oid`,

if you want to make this even more efficient you could add a virtual partial index

`INDEX(classoid) WHERE classoid = `  + strconv.Itoa(catconstants.PgCatalogDatabaseTableID) + `:::oid`

and in that one call GetAllDatabases instead of GetAll. It'll do a namespace join, so an extra round-trip, but it might read a heck of a lot less data and produce a heck of a lot fewer rows to filter in some cases.

@knz knz force-pushed the 20230118-comments branch from 1c78d85 to fec1227 Compare January 19, 2023 11:45
Contributor Author

@knz knz left a comment


OK, I am making progress, but I need some help.

My calls for help are in the inline comments below. Before answering them, I'd also welcome a skim review of my changes to catkv.

I appreciate that the work here is a bit overkill. But this PR is more an attempt to educate myself about the catalog APIs than to make a true perf optimization (although I'd be happy if I can also achieve the latter). So I'd be grateful for the education.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @postamar)


pkg/sql/crdb_internal.go line 5180 at r2 (raw file):

				// FIXME: when dbContext is not nil, we want a catalog with
				// _only_ that db desc and its comment, if any. How to build
				// it?

I need help here. I want to handle the following cases well:

  • "".crdb_internal.kv_catalog_comments where pgclassoid = <pg_database> -> all databases, only database comments
  • dbprefix.crdb_internal.kv_catalog_comments where pgclassoid = <pg_database> -> just this 1 database, only this db's comment

pkg/sql/crdb_internal.go line 5195 at r2 (raw file):

		var err error
		if dbContext != nil {
			all, err = p.Descriptors().GetAllInDatabase(ctx, p.txn, dbContext)

Here I need all comments inside this DB to be included, but not all comments across all dbs.


pkg/sql/pg_catalog.go line 1639 at r1 (raw file):

Previously, ajwerner wrote…

if you want to make this even more efficient you could add a virtual partial index

`INDEX(classoid) WHERE classoid = `  + strconv.Itoa(catconstants.PgCatalogDatabaseTableID) + `:::oid`

and in that one call GetAllDatabases instead of GetAll. It'll do a namespace join, so an extra round-trip, but it might read a heck of a lot less data and produce a heck of a lot fewer rows to filter in some cases.

Done.


pkg/sql/catalog/descs/collection.go line 743 at r2 (raw file):

	// FIXME: here probably I want to call LoadComments() with just
	// DatabaseCommentType. But I only want to do so when the caller is
	// interested in comments. Not all are.

I need help here: I want comments of type DatabaseCommentType to be loaded here, but only if the caller cares. Should I add a flag to GetAllDatabases() for that?


pkg/sql/catalog/descs/collection.go line 854 at r2 (raw file):

	// Also ensure the db desc itself is included, which ensures we can
	// fetch its comment if any below.
	// FIXME: this does not seem to work. Why?

I also need help here.
I'm not 100% sure I want/need this.

What I'm trying to cover here is the case of
mydb.crdb_internal.kv_catalog_comments

in which case I believe I want a row for the database comment on mydb if there is one.

If I do need this, then I need to modify this function to be more like GetAllDescriptorsForDatabase (which is deprecated, but does include the dbDesc in the resulting catalog).

But I could also decide that I don't want this, i.e. I redefine kv_catalog_comments to contain only non-database comments. In that case I need to re-define pg_shdescription to use another vtable than kv_catalog_comments.
What do you think is best?


pkg/sql/catalog/descs/collection.go line 873 at r2 (raw file):

		// FIXME: This does not seem to be the right place to call this --
		// it should load comments into `stored` _before_ the call
		// to aggregateLayers(), so that shadowed comments don't get overwritten.

I also need help here. In this function it is too late to call LoadComments; I need this to happen before aggregateLayers.
But `stored` as returned by ScanNamespaceForDatabaseSchemasAndObjects is an nstree.Catalog, not a MutableCatalog. It seems inefficient to re-build a mutable catalog just to add the comments.
What would be your guidance? Add some "include comments" flag to ScanNamespaceForDatabaseSchemasAndObjects? A new method on the CatalogReader?

Contributor

@postamar postamar left a comment


Regarding the catalog API's shortcomings: I think what we really want is for `aggregateAllLayers` to behave correctly, correctness being defined thus: presently it behaves correctly when `stored` is the product of a ScanAll or a GetByIDs, but not for any other CatalogReader method.

If you don't want to do this (quite understandable!) I'll be happy to, please file a stub of an issue and assign it to me.

populate: func(ctx context.Context, _ tree.Datum, p *planner, dbContext catalog.DatabaseDescriptor, addRow func(...tree.Datum) error) (matched bool, err error) {
// FIXME: when dbContext is not nil, we want a catalog with
// _only_ that db desc and its comment, if any. How to build
// it?

Good question. The catalog API currently doesn't have any way to lookup all the things for one ID and return an nstree.Catalog, but there's definitely room for that, and I'd welcome that addition. There are several alternatives, in terms of interface design:

  1. add methods to the getters in getters.go: this makes it possible to choose between leased or unleased descriptors, etc.
  2. add a method directly to the collection in collection.go which leverages aggregateAllLayers: this would ensure that the results are consistent with those in GetAll or GetAllInDatabase etc.
  3. do both of the above.

I think for virtual tables we definitely need (2) because we want to be extra-sure that the results don't vary depending on whether a virtual index was used or not. There's no very wrong choice here because deeper down it's all using the same machinery.

I haven't tried whether this works but I wouldn't be surprised if it did (not sure about edge cases though):

func (tc *Collection) GetAllForIDs(ctx context.Context, txn *kv.Txn, ids ...descpb.ID) (nstree.Catalog, error) {
	// Read the stored layer for just these IDs; the descriptors need
	// not all exist.
	const isDescriptorRequired = false
	stored, err := tc.cr.GetByIDs(ctx, txn, ids, isDescriptorRequired, catalog.Any)
	if err != nil {
		return nstree.Catalog{}, err
	}
	// Overlay the uncommitted and leased layers, as GetAll does.
	ret, err := tc.aggregateAllLayers(ctx, txn, stored)
	if err != nil {
		return nstree.Catalog{}, err
	}
	// Trim the aggregated result back down to the requested IDs.
	return ret.FilterByIDs(ids), nil
}

return nstree.Catalog{}, err
}
return mc.Catalog, nil
}

Serious question: what's wrong with using GetByIDs instead? AFAIK it shouldn't matter much that we're also reading system.descriptor; we're almost certainly going to be doing that anyway at some point in the transaction. I'm keen to keep this CatalogReader interface tight.


// FIXME: here probably I want to call LoadComments() with just
// DatabaseCommentType. But I only want to do so when the caller is
// interested in comments. Not all are.

Assuming that it's fine to replace LoadComments in favour of GetByIDs, and assuming that we would in fact rather not scan the comments table so much (do we really care? those scans are going to be empty most of the time, aren't they?), what we could do is make the GetByIDs implementation smarter based on the expectedType argument: if we're expecting a database descriptor then there's no point in scanning for non-database comments, and so forth.


We probably need to do something similar for zone configs as well.

// Also ensure the db desc itself is included, which ensures we can
// fetch its comment if any below.
// FIXME: this does not seem to work. Why?
ret.UpsertDescriptor(db)

the call to FilterByIDs in the final return removes the database descriptor from the result set. If you want to include the database descriptor in the GetAllInDatabase result set and thereby change its contract, then you should also include its zone config, etc., which IIRC will require an extra point lookup. For consistency's sake, you'd then have to do the same in GetAllObjectsInSchema. I considered this when I wrote these methods and eventually ruled against it for some reason, but I don't have a fundamental objection to including the root of the hierarchy, not at all.

// Also include all the comments for this object.
// FIXME: This does not seem to be the right place to call this --
// it should load comments into `stored` _before_ the call
// to aggregateLayers(), so that shadowed comments don't get overwritten.

And zone configs. I think your instinct is correct, but I'd go even further: aggregateAllLayers needs to do for comments and zone configs what it already does for descriptors and namespace records. In fact, it already reads the comments and zone configs: it calls getDescriptorsByID, which in turn calls GetByIDs for descriptors which weren't found in the uncommitted layer and others. The results of GetByIDs are cached, so there'd be no harm in calling it another time towards the end of aggregateAllLayers once the descIDs set is built, and using the results to populate ret correctly.

@rafiss
Collaborator

rafiss commented Jun 13, 2023

replaced by #104763

@knz thanks for this idea! It was a critical (99%) latency improvement for a slow query reported in #102851

@rafiss rafiss closed this Jun 13, 2023
craig bot pushed a commit that referenced this pull request Jun 20, 2023
104763: sql: rationalize the handling of comments and improve performance r=rafiss a=rafiss

rebase of #95431
fixes #102851


The number of roundtrips needed for an ORM query that uses
pg_description dropped to 4, from 134.

```
goos: linux
goarch: amd64
cpu: Intel(R) Xeon(R) CPU @ 2.80GHz
                                               old (6abe1851d83)   new (89a494eb301)   vs base
ORMQueries/introspection_description_join-24
  sec/op                                        10687.1m ± 1%        144.1m ± 8%       -98.65% (p=0.000 n=10)
  roundtrips                                     134.000 ± 0%         4.000 ± 0%       -97.01% (p=0.000 n=10)
  B/op                                         3846.69Mi ± 0%       38.98Mi ± 6%       -98.99% (p=0.000 n=10)
  allocs/op                                     45566.9k ± 0%        433.3k ± 6%       -99.05% (p=0.000 n=10)
```

Release note: None

105069: sql: disable strict gc ttl enforcement in schema changer test r=rharding6373 a=rharding6373

Before this change, the TestRaceWithBackfill test would add a zone config to force garbage collection to happen frequently; however, this could cause a race between batch timestamps and garbage collection thresholds. This change adds a call to `DisableGCTTLStrictEnforcement`, which is supposed to be called before updating the zone config with `AddImmediateGCZoneConfig` for testing purposes.

Epic: None
Fixes: #104655

Release note: None

105156: roachtest: fix tpchvec in some cases r=yuzefovich a=yuzefovich

We recently introduced a minor bug in tpchvec in c5c7ead, where an uninitialized helper could be used to parse the test output, leading to a nil pointer when the slowness threshold is exceeded. This is now fixed.

Epic: None

Release note: None

Co-authored-by: Rafi Shamim <[email protected]>
Co-authored-by: rharding6373 <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>