Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: fix JSON fetch value operator evaluation #55316

Merged
merged 2 commits into from
Oct 9, 2020

Conversation

mgartner
Copy link
Collaborator

@mgartner mgartner commented Oct 8, 2020

sql: fix JSON fetch value operator evaluation

This commit fixes incorrect evaluation of the JSON fetch value operator,
->. There were two bugs causing incorrect evaluation.

The first issue was a result of optimizer normalization rules that
produced unequivalent scalar expressions when converting -> to @>.
The rules would convert expressions like j -> '"a"' = 1 to
j @> '{"a": 1}'. This is invalid because -> results in NULL when
the LHS does not contain the RHS key, but the resulting @> expression
is always either true or false, never NULL. These normalization
rules have been removed.

These two rules existed to provide inverted index-acceleration for
queries with ->, because the optimizer can only index-accerlate @>
operators during exploration. As a result of their removal, queries with
-> operators are no longer index-accelerated. This will be remedied in
a future commit.

The second issue was a bug in the vectorized overload of the ->
operator. Previously, when the operator evaluated to NULL with two
non-NULL inputs, the resulting NULL would not be tracked by the
Nulls struct.

Fixes #49143

Release note (bug fix): Previously, the JSON fetch value operator, ->,
would evaluate incorrectly in some cases. This has been fixed.

opt: index-accelerate equalities with JSON fetch expressions

This commit allows inverted indexes to be scanned to satisfy query
filters in the form: j->'a' = '1'. The optimizer had previously
supported this by normalizing these expressions into expressions with
JSON containment operators, @>, but it lost this support when these
normalization rules were found to produce inequivalent expressions.

Note that query filters of several forms, which were previously
accelerated via the incorrect normalization rules, are not accelerated
as part of this commit, such as:

j->'a' @> '{"x": "y"}
j->'a'->'b' = '"c"'

Release note: None

@mgartner mgartner requested a review from a team as a code owner October 8, 2020 02:31
@cockroach-teamcity
Copy link
Member

This change is Reviewable

│ table d@primary · ·
│ key columns a · ·
└── scan · · (a) ·
· estimated row count 110 (missing stats) · ·
· table d@foo_inv · ·
· spans /"a"/"b"-/"a"/"b"/PrefixEnd · ·

# TODO(mgartner): Add support for building inverted index constraints for chained JSON
# fetch operators.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created a ticket to track here: #55317

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also created a ticket to track the j->'a' @> '{"x": "y"} case: #55318

@@ -273,33 +273,37 @@ filter · · (a, b) ·
· spans FULL SCAN · ·


# TODO(mgartner): It should not be required to force the index scan. It is
# required until the statistics builder treats b->'a' = '"b"' similarly to the
# containment operator, @>.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created a ticket to track here: #55319

@mgartner
Copy link
Collaborator Author

mgartner commented Oct 8, 2020

@yuzefovich please take a look at the vectorized engine changes in the first commit, and let me know if this is the right approach.

Copy link
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

opt stuff :lgtm:

Reviewed 12 of 12 files at r1, 5 of 5 files at r2.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @RaduBerinde and @yuzefovich)

Copy link
Member

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The approach looks good to me, thanks. I just have a couple of nits.

:lgtm:

Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @mgartner and @RaduBerinde)


pkg/sql/colexec/proj_const_ops_tmpl.go, line 101 at r1 (raw file):

	}
	projCol := projVec._RET_TYP()
	_outNulls := projVec.Nulls()

nit: this deserves a comment to spell out the "contract" (similar to what we have about _overloadHelper above).


pkg/sql/colexec/proj_const_ops_tmpl.go, line 140 at r1 (raw file):

		}
	}
	// {{if _HAS_NULLS}}

nit: initially I thought that there might be a bug that we're not updating nulls vector when the argument vector didn't have nulls (i.e. _HAS_NULLS == false), but in that case _outNulls has already been updated by _ASSIGN function if need be. I think it's worth adding a comment to mention this observation.


pkg/sql/colexec/proj_const_ops_tmpl.go, line 141 at r1 (raw file):

	}
	// {{if _HAS_NULLS}}
	projVec.SetNulls(_outNulls.Or(colNulls))

I think here we want to be more efficient in non-datum cases - coldata.Nulls.Copy() is faster than coldata.Nulls.Or(). I'd do something like this:

// {{if eq .VecMethod "Datum"}}
projVec.SetNulls(_outNulls.Or(colNulls))
// {{else}}
colNullsCopy := colNulls.Copy()
projVec.SetNulls(&colNullsCopy)
// {{end}}

This will also make _outNulls variable unused in many code paths, so you'll need to go around the unused warning.

Update: I tried it out, and the microbenchmarks don't show much difference, so the current approach looks good. (If you're curious, the diff is here.)


pkg/sql/colexec/proj_non_const_ops_tmpl.go, line 112 at r1 (raw file):

	col1 := vec1._L_TYP()
	col2 := vec2._R_TYP()
	_outNulls := projVec.Nulls()

nit: ditto for comments

Copy link
Member

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! :lgtm:

Reviewable status: :shipit: complete! 3 of 0 LGTMs obtained (waiting on @mgartner)


pkg/sql/opt/idxconstraint/index_constraints.go, line 996 at r2 (raw file):

	}

	rightConst := rhs.(*memo.ConstExpr)

, ok missing. Also, I think we have an extractconstval somewhere since other operators can be constant (e.g. True, False, Null)

@mgartner mgartner force-pushed the containment-norm-fix branch from a5c02fe to 31a5d89 Compare October 8, 2020 20:36
Copy link
Collaborator Author

@mgartner mgartner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 3 stale) (waiting on @RaduBerinde, @rytaft, and @yuzefovich)


pkg/sql/colexec/proj_const_ops_tmpl.go, line 101 at r1 (raw file):

Previously, yuzefovich wrote…

nit: this deserves a comment to spell out the "contract" (similar to what we have about _overloadHelper above).

Done.


pkg/sql/colexec/proj_const_ops_tmpl.go, line 140 at r1 (raw file):

Previously, yuzefovich wrote…

nit: initially I thought that there might be a bug that we're not updating nulls vector when the argument vector didn't have nulls (i.e. _HAS_NULLS == false), but in that case _outNulls has already been updated by _ASSIGN function if need be. I think it's worth adding a comment to mention this observation.

Good idea, done.


pkg/sql/colexec/proj_const_ops_tmpl.go, line 141 at r1 (raw file):

Previously, yuzefovich wrote…

I think here we want to be more efficient in non-datum cases - coldata.Nulls.Copy() is faster than coldata.Nulls.Or(). I'd do something like this:

// {{if eq .VecMethod "Datum"}}
projVec.SetNulls(_outNulls.Or(colNulls))
// {{else}}
colNullsCopy := colNulls.Copy()
projVec.SetNulls(&colNullsCopy)
// {{end}}

This will also make _outNulls variable unused in many code paths, so you'll need to go around the unused warning.

Update: I tried it out, and the microbenchmarks don't show much difference, so the current approach looks good. (If you're curious, the diff is here.)

I tried the same approach and also ran into _outNulls being unused, and gave up at that point.

But I did look at the implementation of Or, and if one of the Nulls has no nulls, then it performs a copy. So it makes sense that your microbenchmarks show little difference.


pkg/sql/colexec/proj_non_const_ops_tmpl.go, line 112 at r1 (raw file):

Previously, yuzefovich wrote…

nit: ditto for comments

Done.


pkg/sql/opt/idxconstraint/index_constraints.go, line 996 at r2 (raw file):

Previously, RaduBerinde wrote…

, ok missing. Also, I think we have an extractconstval somewhere since other operators can be constant (e.g. True, False, Null)

Great catch on the missing , ok! Fixed.

We only care about the case when the RHS is a constant DJSON, so neglecting True, False, and Null is not an issue. But I've updated to use memo.CanExtractConstDatum and memo.ExtractConstDatum anyway since it seems more robust.

Copy link
Member

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! You need to run make execgen on the first commit to regenerate the colexec code with comments.

Reviewed 3 of 7 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 3 stale) (waiting on @RaduBerinde, @rytaft, and @yuzefovich)

@mgartner mgartner force-pushed the containment-norm-fix branch from 31a5d89 to 104e5e2 Compare October 9, 2020 15:11
Copy link
Collaborator Author

@mgartner mgartner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to run make execgen on the first commit to regenerate the colexec code with comments.

Oops! Done.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 3 stale) (waiting on @RaduBerinde, @rytaft, and @yuzefovich)

This commit fixes incorrect evaluation of the JSON fetch value operator,
`->`. There were two bugs causing incorrect evaluation.

The first issue was a result of optimizer normalization rules that
produced unequivalent scalar expressions when converting `->` to `@>`.
The rules would convert expressions like `j -> '"a"' = 1` to
`j @> '{"a": 1}'`. This is invalid because `->` results in `NULL` when
the LHS does not contain the RHS key, but the resulting `@>` expression
is always either `true` or `false`, never `NULL`. These normalization
rules have been removed.

These two rules existed to provide inverted index-acceleration for
queries with `->`, because the optimizer can only index-accerlate `@>`
operators during exploration. As a result of their removal, queries with
`->` operators are no longer index-accelerated. This will be remedied in
a future commit.

The second issue was a bug in the vectorized overload of the `->`
operator. Previously, when the operator evaluated to `NULL` with two
non-`NULL` inputs, the resulting `NULL` would not be tracked by the
`Nulls` struct.

Fixes cockroachdb#49143

Release note (bug fix): Previously, the JSON fetch value operator, `->`,
would evaluate incorrectly in some cases. This has been fixed.
This commit allows inverted indexes to be scanned to satisfy query
filters in the form: `j->'a' = '1'`. The optimizer had previously
supported this by normalizing these expressions into expressions with
JSON containment operators, `@>`, but it lost this support when these
normalization rules were found to produce inequivalent expressions.

Note that query filters of several forms, which were previously
accelerated via the incorrect normalization rules, are not accelerated
as part of this commit, such as:

    j->'a' @> '{"x": "y"}
    j->'a'->'b' = '"c"'

Release note: None
@mgartner mgartner force-pushed the containment-norm-fix branch from 104e5e2 to 30f8072 Compare October 9, 2020 16:52
@mgartner
Copy link
Collaborator Author

mgartner commented Oct 9, 2020

TFTRs!

bors r+

@craig
Copy link
Contributor

craig bot commented Oct 9, 2020

Build succeeded:

@craig craig bot merged commit 9bfd067 into cockroachdb:master Oct 9, 2020
@mgartner mgartner deleted the containment-norm-fix branch October 9, 2020 19:02
@rafiss
Copy link
Collaborator

rafiss commented Oct 10, 2020

@mgartner are you planning to backport this? (just wondering since it affects JSON compatibility with Django 3.1)

@RaduBerinde
Copy link
Member

I think we should backport for 20.2.1.

@mgartner
Copy link
Collaborator Author

Created a backport PR: #55447

mgartner added a commit to mgartner/cockroach that referenced this pull request Jan 27, 2021
Prior to cockroachdb#55316, the optimizer generated inverted index scans on indexed
JSON columns when queries had filters with chained fetch value
operators, for example `j->'a'->'b' = '1'`. The logic that made this
possible was found to create query plans not equivalent to the query, so
it was removed. This commit restores the ability to index accelerate
chained -> operators.

Fixes cockroachdb#55317

Release note (performance improvement): A bug fix included in 20.2.1 for
for the JSON fetch value operator, `->`, resulted in chained `->`
operators in query filters not being index accelerated, e.g.,
`j->'a'->'b' = '1'`. Chained `->` are now index accelerated.
mgartner added a commit to mgartner/cockroach that referenced this pull request Jan 28, 2021
Prior to cockroachdb#55316, the optimizer generated inverted index scans on indexed
JSON columns when queries had filters with chained fetch value
operators, for example `j->'a'->'b' = '1'`. The logic that made this
possible was found to create query plans not equivalent to the query, so
it was removed. This commit restores the ability to index accelerate
chained -> operators.

Fixes cockroachdb#55317

Release note (performance improvement): A bug fix included in 20.2.1 for
for the JSON fetch value operator, `->`, resulted in chained `->`
operators in query filters not being index accelerated, e.g.,
`j->'a'->'b' = '1'`. Chained `->` are now index accelerated.
mgartner added a commit to mgartner/cockroach that referenced this pull request Jan 29, 2021
Prior to cockroachdb#55316, the optimizer generated inverted index scans on indexed
JSON columns when queries had filters with chained fetch value
operators, for example `j->'a'->'b' = '1'`. The logic that made this
possible was found to create query plans not equivalent to the query, so
it was removed. This commit restores the ability to index accelerate
chained -> operators.

Fixes cockroachdb#55317

Release note (performance improvement): A bug fix included in 20.2.1 for
for the JSON fetch value operator, `->`, resulted in chained `->`
operators in query filters not being index accelerated, e.g.,
`j->'a'->'b' = '1'`. Chained `->` are now index accelerated.
mgartner added a commit to mgartner/cockroach that referenced this pull request Jan 29, 2021
Prior to cockroachdb#55316, the optimizer generated inverted index scans on indexed
JSON columns when queries had filters with chained fetch value
operators, for example `j->'a'->'b' = '1'`. The logic that made this
possible was found to create query plans not equivalent to the query, so
it was removed. This commit restores the ability to index accelerate
chained -> operators.

Fixes cockroachdb#55317

Release note (performance improvement): A bug fix included in 20.2.1 for
for the JSON fetch value operator, `->`, resulted in chained `->`
operators in query filters not being index accelerated, e.g.,
`j->'a'->'b' = '1'`. Chained `->` are now index accelerated.
mgartner added a commit to mgartner/cockroach that referenced this pull request Jan 29, 2021
Prior to cockroachdb#55316, the optimizer generated inverted index scans on indexed
JSON columns when queries had filters with chained fetch value
operators, for example `j->'a'->'b' = '1'`. The logic that made this
possible was found to create query plans not equivalent to the query, so
it was removed. This commit restores the ability to index accelerate
chained -> operators.

Fixes cockroachdb#55317

Release note (performance improvement): A bug fix included in 20.2.1 for
for the JSON fetch value operator, `->`, resulted in chained `->`
operators in query filters not being index accelerated, e.g.,
`j->'a'->'b' = '1'`. Chained `->` are now index accelerated.
craig bot pushed a commit that referenced this pull request Jan 29, 2021
59494: opt: index accelerate chained fetch value operators r=rytaft a=mgartner

#### opt: add test for JSON fetch val inverse

Release note: None

#### opt: index accelerate chained fetch value operators

Prior to #55316, the optimizer generated inverted index scans on indexed
JSON columns when queries had filters with chained fetch value
operators, for example `j->'a'->'b' = '1'`. The logic that made this
possible was found to create query plans not equivalent to the query, so
it was removed. This commit restores the ability to index accelerate
chained -> operators.

Fixes #55317

Release note (performance improvement): A bug fix included in 20.2.1 for
for the JSON fetch value operator, `->`, resulted in chained `->`
operators in query filters not being index accelerated, e.g.,
`j->'a'->'b' = '1'`. Chained `->` are now index accelerated.


Co-authored-by: Marcus Gartner <[email protected]>
mgartner added a commit to mgartner/cockroach that referenced this pull request Feb 3, 2022
This commit removes an invalid normalization from the NormalizeVisitor.
It was previously discovered that transforming expressions in the form
`j->'a' = '1'` to `j @> '{"a": 1}'` is invalid (see cockroachdb#49143). This
transformation rule was removed from the optimizer in cockroachdb#55316. But the
same transformation was not removed from the NormalizeVisitor. This
visitor is only used to normalize scalar expressions in table
descriptors (`DEFAULT` expressions, computed column expressions, and
partial index predicates) during a backfill.

Fixes cockroachdb#75097

Release note (bug fix): A bug has been fixed that caused incorrect
values to be written to computed columns when their expressions were of
the form `j->x = y`, where `j` is a `JSON` column and `x` and `y` are
constants. This bug also caused corruption of partial indexes with
`WHERE` clauses containing expressions of the same form. This bug was
present since version 2.0.0.
craig bot pushed a commit that referenced this pull request Feb 3, 2022
75908: scripts: per-branch bump-pebble.sh script r=jbowens a=nicktrav

Currently, the master branch, in addition to each release branch relies
on the same `bump-pebbble.sh` from the master branch. There are subtle
differences between master and the release branches (i.e. build system)
that ends up breaking the script as the changes are introduced on the
master branch but not backported to the release branches.

One solution to this problem is to continue to maintain one script on
the master branch, but include switching logic for each release branch
to account for the differences.

An alternative approach is to have a script per release branch. Rather
than having switching logic, the script has the appropriate logic for
that branch. When a new release branch is cut, the script inherits the
most up-to-date logic from master, and all that needs to change is the
name of the branch and the corresponding Pebble branch.

Pin the `bump-pebble.sh` script to the master branch. The script will
error out if it is run from a different branch.

Release note: None

75914: tree: remove invalid normalization r=mgartner a=mgartner

This commit removes an invalid normalization from the NormalizeVisitor.
It was previously discovered that transforming expressions in the form
`j->'a' = '1'` to `j @> '{"a": 1}'` is invalid (see #49143). This
transformation rule was removed from the optimizer in #55316. But the
same transformation was not removed from the NormalizeVisitor. This
visitor is only used to normalize scalar expressions in table
descriptors (`DEFAULT` expressions, computed column expressions, and
partial index predicates) during a backfill.

Fixes #75907

Release note (bug fix): A bug has been fixed that caused incorrect
values to be written to computed columns when their expressions were of
the form `j->x = y`, where `j` is a `JSON` column and `x` and `y` are
constants. This bug also caused corruption of partial indexes with
`WHERE` clauses containing expressions of the same form. This bug was
present since version 2.0.0.

75963: batcheval: use same stats timestamp for `AddSSTable` assertions r=rhu713 a=erikgrinaker

Resolves #75643.
Resolves #75642.

Release note: None

Co-authored-by: Nick Travers <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>
Co-authored-by: Erik Grinaker <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

sql: excluding json field with -> operator incorrectly returns values where the key doesn't exist
6 participants