Validate tsdb's routing_path #79384
Merged
nik9000
merged 7 commits into
elastic:master
from
nik9000:index_routing_from_source_validate_dims
Oct 19, 2021
elasticmachine added the Team:Analytics label Oct 18, 2021
Pinging @elastic/es-analytics-geo (Team:Analytics)
`routing_path`'s job is to put the same time series on the same node. It does that by extracting a String directly from the xcontent, hashing it, and feeding the hash into the shard selection algorithm. I'm happy with it! But it won't work properly if `routing_path` matches non-dimension fields or if it matches non-keyword dimensions. So this change prevents us from mapping any field that "lines up" with `routing_path` unless it is a keyword dimension.

Let's talk about why `routing_path` can't do its job if it matches any non-dimension fields. Well, imagine `routing_path` is `[dim, foo]` and only `dim` is a dimension. It'll *still* hash `foo`'s values into the routing key. So, say you get documents like:

```
{"dim": "a", "foo": "1"}
{"dim": "a", "foo": "1"}
{"dim": "a", "foo": "2"}
```

The third document could be routed to a different shard than the first two! Which would be a disaster because it'd cut the time series identified by `"dim":"a"` into pieces!

Now let's talk about when `routing_path` matches a non-keyword dimension. Imagine `routing_path` is `[kwd, int]` and we send these documents:

```
{"kwd": "a", "int": "1"}
{"kwd": "a", "int": "01"}
```

Both of these documents belong to the time series identified by `"kwd":"a","int":1`, but the `routing_path` code just reads strings, so it'll route them into separate shards. Also bad! Also forbidden by this change.
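To make the two failure modes concrete, here is a minimal sketch in plain Java. It is not the Elasticsearch implementation: `RoutingPathSketch`, `routingKey`, `shardFor`, and `validate` are hypothetical names, `String.hashCode` stands in for the real routing hash, documents are modelled as flat string maps, and wildcard matching in `routing_path` is ignored. It only shows that a key built from raw source strings differs whenever a matched field's text differs, and roughly what a "keyword dimensions only" check could look like.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; none of these names exist in Elasticsearch.
final class RoutingPathSketch {

    /** Joins the raw string values of the routing_path fields, exactly as they appear in the source. */
    static String routingKey(Map<String, String> source, List<String> routingPath) {
        StringBuilder key = new StringBuilder();
        for (String field : routingPath) {
            String value = source.get(field);
            if (value != null) {
                key.append(field).append('=').append(value).append('|');
            }
        }
        return key.toString();
    }

    /** Stand-in shard selection: hash the key and reduce it modulo the shard count. */
    static int shardFor(Map<String, String> source, List<String> routingPath, int numberOfShards) {
        return Math.floorMod(routingKey(source, routingPath).hashCode(), numberOfShards);
    }

    /** Assumed shape of the validation: every routing_path field must be a keyword dimension. */
    static void validate(List<String> routingPath, Map<String, String> fieldTypes, Set<String> dimensions) {
        for (String field : routingPath) {
            if (!dimensions.contains(field)) {
                throw new IllegalArgumentException("[" + field + "] is not a dimension");
            }
            if (!"keyword".equals(fieldTypes.get(field))) {
                throw new IllegalArgumentException("[" + field + "] is not a keyword field");
            }
        }
    }

    public static void main(String[] args) {
        // Case 1: foo is not a dimension, but its value still ends up in the key.
        List<String> dimFoo = List.of("dim", "foo");
        System.out.println(routingKey(Map.of("dim", "a", "foo", "1"), dimFoo));
        System.out.println(routingKey(Map.of("dim", "a", "foo", "2"), dimFoo));

        // Case 2: "1" and "01" are the same integer but different strings.
        List<String> kwdInt = List.of("kwd", "int");
        System.out.println(routingKey(Map.of("kwd", "a", "int", "1"), kwdInt));
        System.out.println(routingKey(Map.of("kwd", "a", "int", "01"), kwdInt));

        // Different keys may or may not land on different shards for a given shard count.
        System.out.println(shardFor(Map.of("kwd", "a", "int", "1"), kwdInt, 5));
        System.out.println(shardFor(Map.of("kwd", "a", "int", "01"), kwdInt, 5));
    }
}
```

In both cases the keys differ even though the documents belong to one time series, so nothing guarantees they share a shard. Rejecting such mappings up front, as `validate` does in this sketch, rules the situation out instead of trying to normalize values at routing time.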
imotov
approved these changes
Oct 18, 2021
LGTM in general, added one suggestion.
server/src/main/java/org/elasticsearch/index/mapper/Mapper.java
This moves the routing calculation out of a difficult-to-reason-about `switch` statement and into a real OO method implementation. It's a tiny, tiny change, but it makes me feel much better about it.
This reverts commit 9a040d4.
weizijun
added a commit
to weizijun/elasticsearch
that referenced
this pull request
Oct 19, 2021
* upstream/master:
  Validate tsdb's routing_path (elastic#79384)
  Adjust the BWC version for the return200ForClusterHealthTimeout field (elastic#79436)
  API for adding and removing indices from a data stream (elastic#79279)
  Exposing the ability to log deprecated settings at non-critical level (elastic#79107)
  Convert operator privilege license object to LicensedFeature (elastic#79407)
  Mute SnapshotBasedIndexRecoveryIT testSeqNoBasedRecoveryIsUsedAfterPrimaryFailOver (elastic#79456)
  Create cache files with CREATE_NEW & SPARSE options (elastic#79371)
  Revert "[ML] Use a new annotations index for future annotations (elastic#79151)"
  [ML] Use a new annotations index for future annotations (elastic#79151)
  [ML] Removing legacy code from ML/transform auditor (elastic#79434)
  Fix rate agg with custom `_doc_count` (elastic#79346)
  Optimize SLM Policy Queries (elastic#79341)
  Fix execution of exists query within nested queries on field with doc_values disabled (elastic#78841)
  Stricter UpdateSettingsRequest parsing on the REST layer (elastic#79227)
  Do not release snapshot file download permit during recovery retries (elastic#79409)
  Preserve request headers in a mixed version cluster (elastic#79412)
  Adjust versions after elastic#79044 backport to 7.x (elastic#79424)
  Mute BulkByScrollUsesAllScrollDocumentsAfterConflictsIntegTests (elastic#79429)
  Fail on SSPL licensed x-pack sources (elastic#79348)
# Conflicts:
#	server/src/test/java/org/elasticsearch/index/TimeSeriesModeTests.java
Labels
>non-issue
:StorageEngine/TSDB
You know, for Metrics
Team:Analytics
Meta label for analytical engine team (ESQL/Aggs/Geo)
v8.0.0-beta1