Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix mango json index selection #816

Merged
merged 1 commit into from
Sep 20, 2017
Merged

Fix mango json index selection #816

merged 1 commit into from
Sep 20, 2017

Conversation

willholley
Copy link
Member

Overview

JSON index selection in Mango previously deemed an index to be usable if a range scan on the first component of its compound key could be generated from the query selector.

For instance, if I had an index:

[A, B]

is_usable would return true for the selector:

{"A": "foo"}

This is incorrect because JSON indexes only index documents that contain all the fields in the index; null values are ok, but the field must be present. That means that for the above selector, the index would implicitly include only documents where B exists, missing documents where {"A":5} matched but field B was not present.

This commit changes is_usable so that it only returns true if all the keys in the index are required to exist by the selector. This means that in the worst case e.g. none of the predicates can be used to generate a range query, we should end up performing a full index scan, but this is still more efficient
than a full database scan.

We leave the generation of the optimal range for a given index as a separate exercise - currently this happens after index selection.

Potentially we'd want to score indexes during index selection based on their ability to restrict the result set, etc.

Testing recommendations

Run the mango test suite. Manually run some mango queries with/without JSON indexes and check that the results are as expected.

Checklist

  • Code is written and works correctly;
  • Changes are covered by tests;
  • Documentation reflects the changes;

Copy link
Member

@davisp davisp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the very least I think the naming is misleading on this. The fields in the index must exist in the selector, not the other way around which is how it reads.

Also I'm hesitant to modify indexable like this for this check rather than writing the similar yet subtly different check. This definitely isn't an obvious rule but it seems odd to overload the meaning of "indexable" with "covers index field". For instance, the $exists:false call out seems out of place.

Also the maybe return value for unknown operators I'm also not 100% on. It seems like having a new covered index thing which explicitly lists all operators that can operate on the index would be a better approach.

@willholley
Copy link
Member Author

@davisp thanks for the review! I'll have a think about the naming and overloaded function issues.

To your last point, my thought there is that all predicates except, by definition, "$exists":false, "operate" on the index because selection of the index restricts the result set to documents where the indexed fields exist.

@willholley
Copy link
Member Author

@davisp I went with your second suggestion and implemented a has_required_fields function

Copy link
Member

@davisp davisp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really good work on this. Requested changes here are purely stylistic, once those are addressed then +1 to merging this.

% restricted by the selector such that they are required to exist
% and the selector is not a text search (so requires a text index)
RequiredFields = columns(Idx),
mango_selector:has_required_fields(Selector, RequiredFields) andalso not is_text_search(Selector).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line too long. Break before andalso and indent it twice.

@@ -566,3 +567,111 @@ match({[{Field, Cond}]}, Value, Cmp) ->

match({[_, _ | _] = _Props} = Sel, _Value, _Cmp) ->
erlang:error({unnormalized_selector, Sel}).



Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One too many blank lines.

% until we match then all or run out of selector to
% match against.

% empty selector
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Capitalize empty

has_required_fields(Selector, RequiredFields) when not is_list(Selector) ->
has_required_fields([Selector], RequiredFields);

% "see" through $and operator. We ignore other
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first sentence isn't a sentence. "We can" as a prefix would be good enough.

% "see" through $and operator. We ignore other
% combination operators because they can't be used to restrict
% an index.
has_required_fields([{[{<<"$and">>, Args}]}], RequiredFields) when is_list(Args) ->
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line too long. Break before when and indent it twice.

end;

% no more required fields
has_required_fields(_, []) ->
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its general practice to place your recursion terminating conditions as the first function clauses.

has_required_fields(_, []) ->
true;

% no more selector
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Capitalize No and move to before the recursive call.

%%%%%%%% module tests below %%%%%%%%

-ifdef(TEST).
-include_lib("eunit/include/eunit.hrl").
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good test cases!

@@ -14,7 +14,6 @@
import user_docs
import unittest


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't include white space only changes in PRs that aren't stylistic changes.

resp_explain = self.db.find(selector, explain=True)
self.assertEqual(resp_explain["index"]["type"], "special")


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Too many blank lines between class methods here and in a few places below.

JSON index selection in Mango previously deemed an
index to be usable if a range scan on the first component
of its compound key could be generated from the query selector.

For instance, if I had an index:

[A, B]

is_usable would return true for the selector:

{"A": "foo"}

This is incorrect because JSON indexes only index documents that contain all
the fields in the index; null values are ok, but the field must be present.
That means that for the above selector, the index would implicitly include
only documents where B exists, missing documents where {"A":5} matched but
field B was not present.

This commit changes is_usable so that it only returns true if all the keys
in the index are required to exist by the selector. This means that in the
worst case e.g. none of the predicates can be used to generate a range query,
we should end up performing a full index scan, but this is still more
efficient than a full database scan.

We leave the generation of the optimal range for a given index as a separate
exercise - currently this happens after index selection.

Potentially we'd want to score indexes during index selection based on their
ability to restrict the result set, etc.
@willholley willholley merged commit dae81be into apache:master Sep 20, 2017
willholley added a commit that referenced this pull request Oct 12, 2017
It seems safe to assume that if a user specifies that
results should be sorted by a field, that field needs to exist
(but could be null) in the results returned.

In #816 we can only use an index if all its columns are
required to exist in the selector, so this improves
compatibility with the old behaviour by allowing sort
fields to be included in the index coverage check for
JSON indexes.
willholley added a commit that referenced this pull request Oct 17, 2017
* Split out text index selection tests
* Skip operator tests that do not apply to text indexes
* Only run array length tests against text indexes
* Fix index crud tests when text indexes available
* Use environment variable to switch on text tests
* Fix incorrect text sort assertion in test
* Always use -test in fixture filename
* Fix index selection test compatibility with #816.
* Improve test README
wohali pushed a commit that referenced this pull request Oct 19, 2017
JSON index selection in Mango previously deemed an
index to be usable if a range scan on the first component
of its compound key could be generated from the query selector.

For instance, if I had an index:

[A, B]

is_usable would return true for the selector:

{"A": "foo"}

This is incorrect because JSON indexes only index documents that contain all
the fields in the index; null values are ok, but the field must be present.
That means that for the above selector, the index would implicitly include
only documents where B exists, missing documents where {"A":5} matched but
field B was not present.

This commit changes is_usable so that it only returns true if all the keys
in the index are required to exist by the selector. This means that in the
worst case e.g. none of the predicates can be used to generate a range query,
we should end up performing a full index scan, but this is still more
efficient than a full database scan.

We leave the generation of the optimal range for a given index as a separate
exercise - currently this happens after index selection.

Potentially we'd want to score indexes during index selection based on their
ability to restrict the result set, etc.
wohali pushed a commit that referenced this pull request Oct 19, 2017
It seems safe to assume that if a user specifies that
results should be sorted by a field, that field needs to exist
(but could be null) in the results returned.

In #816 we can only use an index if all its columns are
required to exist in the selector, so this improves
compatibility with the old behaviour by allowing sort
fields to be included in the index coverage check for
JSON indexes.
wohali pushed a commit that referenced this pull request Oct 19, 2017
* Split out text index selection tests
* Skip operator tests that do not apply to text indexes
* Only run array length tests against text indexes
* Fix index crud tests when text indexes available
* Use environment variable to switch on text tests
* Fix incorrect text sort assertion in test
* Always use -test in fixture filename
* Fix index selection test compatibility with #816.
* Improve test README
iilyak added a commit to cloudant/couchdb that referenced this pull request Nov 6, 2017
This reverts commit dae81be.

Conflicts:
	src/mango/test/05-index-selection-test.py
iilyak added a commit to cloudant/couchdb that referenced this pull request Nov 13, 2017
This reverts commit dae81be.

Conflicts:
	src/mango/test/05-index-selection-test.py
nono added a commit to nono/couchdb that referenced this pull request Nov 20, 2017
Since apache#816, mango JSON index on compound fields can be selected only if
the selector make sure that all the fields listed in the index are
always present. This commit add a special case where all clauses of an
`$or` can ensure that a field is present.

For instance, if I had an index:

[A, B]

is_usable would now return true for the selector:

{
  "A": "foo",
  "$or": {
    "B": "bar",
    "B": "baz"
  }
}
willholley added a commit that referenced this pull request Nov 30, 2017
If a user specifies a value for use_index that is not
valid for the selector - i.e. it does not meet the coverage
requirements of the selector or sort fields - attempt
to fall back to a valid index (or database scan) rather
than returning a 400 error.

When a fallback occurs, populate the "warning" field
in the response (as we already do when a full database
scan takes place) with details of the fallback.

This change is partially as mitigation for #816, which may
lead to some previously valid indexes being deemed invalid,
and also to make use_index less brittle in general. If
an index that is used explicitly by active queries is removed,
Couch will now generate warnings and there may be a performance
impact, but the client will still get correct results.
willholley added a commit that referenced this pull request Dec 11, 2017
Since #816, mango JSON index on compound fields can be 
selected only if the selector make sure that all the fields 
listed in the index are always present. 

This adds a special case where all clauses of an `$or` can
ensure that a field is present.

For instance, if I had an index:

[A, B]

is_usable would now return true for the selector:

{
  "A": "foo",
  "$or": {
    "B": "bar",
    "B": "baz"
  }
}

but false for:

{
  "A": "foo",
  "$or": {
    "B": "bar",
    "C": "bar"
  }
}
@willholley willholley deleted the mango_json_index_selection branch December 20, 2017 15:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants