-
Notifications
You must be signed in to change notification settings - Fork 340
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Exists query #3710
Exists query #3710
Conversation
6cc9612
to
04cc9d3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's an interesting approach. Considering that we are dealing with a small number of fields the probability of collisions is really tiny here. So, that should probably work. The failure will be confusing as hell though :)
The implementation looks good to me, but I was a bit confused by naming. Having "field_exists": true,
on the mapping level next to field_mappings
through me off for a bit, even though I knew what this PR is about. Based on the code it also looks like we are trying to mimic the behavior of elasticsearch which checks for existence of the field value not existence of a field. But unlike elasticsearch we are adding a value catalog field instead of checking the existing index. So it is not free and therefore we need this switch.
I think this parameter is somewhat similar to the store_source
parameter. So to be consistent we can replace field_exists
with index_value_presence
, store_value_catalog
, add_value_presence_field
or something like this.
LGTM assuming field_exists
is renamed into something that starts with verb and has a word value
in it. It would be also good to add some documentation.
output: &mut FnvHashSet<u64>, | ||
) { | ||
match json_value { | ||
JsonValue::Null => {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like it's not about field presence, but rather about value presence, which is probably the right thing to do, we just need to adjust naming a bit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes the name is really garbage. It was my queen's duck. https://bwiggs.com/notebook/queens-duck/
More seriously I am unhappy about all names.
index_value_presence
might be my favorite in your list.
Francois also offered index_field_presence
.
I am not sure the semantic battle between field vs value confusion is that important here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I personally dealt with several users trying to use exist
query to differentiate between documents where field is not present and where field is present but contains null
and complaining that this query doesn't work even though the docs explicitly said "documents where a specific field has a value". So, I would go a bit father and also rename the query as well into has_value
or not_null
or something like this.
@@ -40,10 +44,11 @@ use crate::query_builder::build_query; | |||
use crate::routing_expression::RoutingExpr; | |||
use crate::{ | |||
Cardinality, DocMapper, DocParsingError, Mode, QueryParserError, TokenizerEntry, WarmupInfo, | |||
DYNAMIC_FIELD_NAME, SOURCE_FIELD_NAME, | |||
DYNAMIC_FIELD_NAME, FIELD_PRESENCE_FIELD_NAME, SOURCE_FIELD_NAME, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
VALUE_PRESENCE_FIELD_NAME
?
@@ -378,6 +391,58 @@ fn extract_single_obj( | |||
} | |||
} | |||
|
|||
#[inline] | |||
fn populate_field_present_val( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pupulate_value_present_field_val
?
4bd81f1
to
8a59e05
Compare
Field exists does not consider types, only field names. Field capability will have to be handled differently unfortunately. This works by introducing an internal (but normal) "u64" field that stores postings list for field existence. For performance/RAM reasons, the fields full path is not stored as a string but instead we compute a u64-fnv hash using the path from root to leaf. If the hash perfects ideally, even with the anniversary attach, collisions are very unlikely. When dealing with complex JSON with the raw tokenizer this feature can double the number of tokens we deal with, and has an impact on performance. For this reason, it is not added as an option in the DocMapper. Like Elasticsearch, we only store field existence of indexed fields. Also in order to handle refinement like expand_dots, we work over the built tantivy Document and reuse the existing resolution logic. On 1.4GB of gharchive (which is close to a worst case scenaio), see the following performance/index size change: With field_exists enabled - Indexing Throughput: 41 MB/s - Index size: 701M With field_exists disabled - Indexing Throughput: 46 MB/s - Index size: 698M
8a59e05
to
e4d2c62
Compare
Field exists does not consider types, only field names.
Field capability will have to be handled differently unfortunately.
This works by introducing an internal (but normal) "u64" field
that stores postings list for field existence.
For performance/RAM reasons, the fields full path is not stored
as a string but instead we compute a u64-fnv hash using the
path from root to leaf.
If the hash perfects ideally, even with the anniversary attach, collisions
are very unlikely.
When dealing with complex JSON with the raw tokenizer this feature can
double the number of tokens we deal with, and has an impact on
performance.
For this reason, it is not added as an option in the DocMapper.
Like Elasticsearch, we only store field existence of indexed fields.
Also in order to handle refinement like expand_dots,
we work over the built tantivy Document and reuse the existing
resolution logic.
On 1.4GB of gharchive (which is close to a worst case scenaio),
see the following performance/index size change:
With field_exists enabled
With field_exists disabled