Add support for 'size' in EQL Sample queries #87846
Conversation
Pinging @elastic/es-ql (Team:QL)
Left some comments
@@ -15,21 +15,28 @@
public class Sample extends AbstractJoin {

public Sample(Source source, List<KeyedFilter> queries) {
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you need to override `hashCode` and `equals` to include `limit` as well.
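For illustration, a minimal sketch of what such an override could look like, assuming the node keeps `queries` and the new `limit` as fields and that `java.util.Objects` is imported (field names are illustrative, not necessarily the exact ones in the PR):

```java
// Hedged sketch: only shows the shape of the change, not the full Sample class.
@Override
public int hashCode() {
    // Include the new limit field alongside the existing state.
    return Objects.hash(queries, limit);
}

@Override
public boolean equals(Object obj) {
    if (this == obj) {
        return true;
    }
    if (obj == null || getClass() != obj.getClass()) {
        return false;
    }
    Sample other = (Sample) obj;
    return Objects.equals(queries, other.queries) && Objects.equals(limit, other.limit);
}
```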
@@ -24,20 +25,22 @@
public class SampleExec extends PhysicalPlan {

private final List<List<Attribute>> keys;
private final Limit limit;
why not an Integer here?
That's what I did at first, but then I realized that a Limit could have a few advantages:
- it's consistent with the other EQL command implementations (e.g. SequenceMatcher uses a Limit as well)
- it anticipates the need for pagination, which is pretty natural in most scenarios (see also here)
- it doesn't really add any significant complication or overhead

Anyway, it's not strictly necessary for the scope of this PR, so let's discuss it; if there is no agreement, I can easily reduce it to a number.
Limit it is.
@@ -24,20 +25,22 @@
public class SampleExec extends PhysicalPlan {
And, again, the `hashCode` and `equals` methods need to be updated.
@@ -174,8 +178,16 @@ private void finalStep(ActionListener<Payload> listener) {
if (docGroupsCounter == maxCriteria) {
List<SearchHit> match = matchSample(sample, maxCriteria);
if (match != null) {
finalSamples.add(match);
samples.add(new Sample(sampleKeys.get(responseIndex / maxCriteria), match));
if (samplesDiscarded < limit.offset()) {
Why do you use both offset and limit?
As long as we have Limit (see previous comment), IMHO it makes sense to have complete support for it.
The logic is trivial and the added value could be significant: with a small additional effort on syntax, we'll have at least basic support for pagination.
Then why not have Limit as a node in the logical plan tree (vs. a property of the Sample plan), like SQL and EQL do?
Good point, checking
Pagination won't work without state.
Without reading the comments first, I also found this code confusing and had to check whether we provide a non-null offset anywhere. I guess this could be clarified with a comment, but I'm not sure where such a comment should be placed (if we ever add an offset, a comment placed here will almost certainly not be updated/removed).
IMO, since the logic is trivial, it could be added when actually needed. But I guess it could also be kept.
If the functionality is not there, the code should be removed. Let's not add code for functionality that is to be implemented sometime in the future or, worse, not at all. It will only confuse anyone reading the code.
- LimitWithOffset in the logical plan tree
- proper equals()/hashCode()
@@ -102,7 +102,13 @@ public Object visitStatement(StatementContext ctx) {
if (ctx.pipe().size() > 0) {
throw new ParsingException(source(ctx.pipe().get(0)), "Samples do not support pipes yet");
}
return plan;
LimitWithOffset limitPlan = new LimitWithOffset(
Nitpick. Suggested change:
- LimitWithOffset limitPlan = new LimitWithOffset(
+ return new LimitWithOffset(
@@ -102,7 +102,13 @@ public Object visitStatement(StatementContext ctx) {
if (ctx.pipe().size() > 0) {
throw new ParsingException(source(ctx.pipe().get(0)), "Samples do not support pipes yet");
Shouldn't support for pipes now also be possible since limit and offset are both supported?
No, because the limit/size comes from a request parameter, not from the query itself (through the query parser).
But it works with `sequence`, doesn't it? Why would `sample` be different in this regard?
With `sequence` you have `| head 5 | tail 3 | head 1`. How do you do this with `sample`s?
If you can do `sequence [any where true] [any where true] | head 10 | tail 2`, why can't you do something like `sample [any where true] [any where true] | head 10 | tail 2`?
I see HEAD and TAIL as very natural operators for sequences, which have well-defined semantics given by their order (e.g. `TAIL 10`: give me the last ten, i.e. most recent, events).
Samples, on the other hand, do not have a natural order: "last ten samples" is not a clear concept by itself.
In this context, LIMIT/OFFSET is much more similar to the SQL equivalent, where you just need a way to retrieve the results in batches or to show them in a UI in pages.
I agree with @Luegg that we could support pipes here eventually, but we probably need a slightly different syntax, e.g. `| LIMIT 10 OFFSET 20`; or reuse HEAD with `| OFFSET 20 | HEAD 10`, though that seems less natural in this context.
@Luegg, as @luigidellaquila well put it, re-using `HEAD` and `TAIL` doesn't reflect the nature of `sample`s, which do not have a chronological ordering concept like `sequence`s do. The only type of ordering that makes sense is the one on the join keys, but at that point the simple concepts of `HEAD` and `TAIL` are not enough to express the multitude of ordering options available (order by the first join key asc, by the second join key desc, by the third join key asc and so on; or by all keys asc, or all keys desc, but then one would ask why they cannot order by each key individually, etc.). At this point in the evolution of `sample`s, I believe having a circuit breaker and a simple way of limiting the number of results returned (without computing all the samples) is enough. With time, and more input from users, we can evaluate our options for providing something more advanced and more granular in terms of limiting the number of results and, at the same time, sorting the results.
I don't see it as reusing `head` and `tail` for `sample`. It's more like making `sample` work with the other language constructs, as `sample` is the only construct that cannot be fed into pipes. From this perspective, `head` can just fetch the first x results independently of how the order is specified. I agree though that tail might be a bit more work, because it requires reversing the sort order, and that's probably better kept out of this PR.
@Luegg I agree with you, tail is definitely not straightforward at this stage.
Also, for a proper implementation of pipes on samples (and limit/offset in particular), I think we still need to define a few small details that do not really fit in the scope of this PR.
For this specific one, I'd suggest sticking to the original goal, while keeping the door open for further improvements.
@elasticmachine run elasticsearch-ci/part-2
There were two approaches in this PR regarding `size`:
- a Sample having a `size` attribute, which I don't believe is something the sample node must have
- a Limit logical plan with a Sample logical plan child node

In the latter case (the one in the PR at this point), transforming the logical plan into a physical one doesn't follow the usual steps. For example, the `Mapper` is the one that generates physical plans based on the logical ones and, according to what's already in there, a `LimitWithOffset` should map to a `LimitWithOffsetExec`. In your case, the `SampleExec` physical plan includes a `Limit`, but it should in fact belong to a physical plan having a `LimitWithOffsetExec` as parent.
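To make the two shapes concrete, here is a small, self-contained sketch; the record names only mimic the real Elasticsearch plan classes, and their fields and constructors are invented for illustration:

```java
import java.util.List;

// Toy stand-ins for the real plan classes; names and fields are illustrative only.
interface Plan {}
// Shape 1: the sample node itself carries the limit/offset as a property.
record SampleCarryingLimit(List<Plan> queries, int limit, int offset) implements Plan {}
// Shape 2: the limit is a separate parent node wrapping a plain sample node,
// mirroring how LimitWithOffset wraps Sample in the logical plan.
record PlainSample(List<Plan> queries) implements Plan {}
record LimitWithOffsetNode(int limit, int offset, Plan child) implements Plan {}

public class PlanShapes {
    public static void main(String[] args) {
        Plan folded = new SampleCarryingLimit(List.of(), 10, 0);
        Plan wrapped = new LimitWithOffsetNode(10, 0, new PlainSample(List.of()));
        System.out.println(folded);
        System.out.println(wrapped);
    }
}
```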
@@ -250,3 +250,61 @@ expected_event_ids = [18,11]
join_keys = ["doom","win10"]

[[queries]]
name = "size0"
size = 0
Other EQL queries use `size` as well, but none of them require the TOML tests to have `size` in them. I'm wondering if there isn't any other way to test `size`, given that so far there was no need for `size` in the IT TOML tests.
> For example, the Mapper is the one that generates physical plans based on the logical ones and, according to what's in there already a LimitWithOffset should map to a LimitWithOffsetExec. In your case, SampleExec physical plan includes a Limit, but it should in fact belong to a physical plan having as parent a LimitWithOffsetExec.

I'm not sure I got this comment: the `LimitWithOffset`+`Sample` logical plan is transformed to `LimitWithOffsetExec`+`SampleExec`, then the `QueryFolder.FoldLimit` rule applies the limit (from `LimitWithOffsetExec`) to `SampleExec`, which is then used to build the `SampleIterator`. This is exactly what happens for Sequence execution planning.

About "size" in the tests: we have unit tests in Java (mostly for the execution planning, plus some specific tests for the algorithms), but they do not fully execute any query (mocking the full execution is extremely convoluted due to the usage of internal Server components that cannot be instantiated in our tests).
The only real execution tests we have are the TOML tests; this is why I added a size attribute there. It allows us to test the size on a full execution path.
In addition, I think the Sequence tests can also take advantage of it.
lgtm
This is required in case new shards allocator might be more proactive with rebalancing.
Clarify that the limits in the docs are absolute maxima that will avoid things just breaking but won't necessarily give great performance.
If the date histogram interval is large and the 'fixed_interval' parameter is very small we might end up with a large number of buckets in the resulting histogram, in case we also generate empty buckets. As a result of this we might generate too many buckets (max date - min date) / fixed_interval > 65536 (roughly).. Here we set minDocCount to 1 so to avoid generation of empty buckets. In the test the maximum value for 'docCount' is 9000 which means, in the worst case we generate 9000 documents, each belonging to a different bucket. In this case we would have 9000 buckets maximum which is well below the default maximum number of buckets allowed by default.
Follow up to elastic#89517 to ensure we don't mistakenly pass the `xpack.ml.enabled` flag on versions of ES that have no such setting.
We can use try-with-resources here, and remove a null check because IndexShard.get never returns null.
For large snapshot clusters the final cleanup step that removes previous shard metadata might have to delete tens of thousands of blobs. This can take many seconds, for which the repository is needlessly blocked from running subsequent finalizations or deletes. Similar to how the final cleanup step was made async and taken out of the finalization loop for deletes in elastic#86514 this PR moves the final cleanup step for finalization async.
**The issue**
The flaky test simulates the following:
- Create a shrink policy with an invalid target shard count
- Then change the policy to have a valid target shard count
- Expectation: the `check-target-shards-count` will return true and the shrink operation will be successful.

What was happening in some cases in the background:
- Create the shrink policy with an invalid target shard count
- The `check-target-shards-count` gets created and queued to be executed with the invalid target shards count.
- The task doesn't get enough priority to be executed
- We change the policy to have a valid target shards count
- We execute the queued task which still has the outdated target shard count.

**Proof**
We enriched the code with some extra logging to verify that the above scenario is actually correct:
```
## Adding the check target shards to the executingTasks
[2022-08-08T18:02:52,824][INFO ][o.e.x.i.IndexLifecycleRunner] [javaRestTest-0] elastic#78460: Adding task to queue check if I can shrink to numberOfShards = 5
[2022-08-08T18:02:52,825][TRACE][o.e.x.i.h.ILMHistoryStore] [javaRestTest-0] queueing ILM history item for indexing [ilm-history-5]: [{"index":"index-zmmrkzfhht","policy":"policy-bEmKF","@timestamp":1659970972825,"index_age":12608,"success":true,"state":{"phase":"warm","phase_definition":"{\"policy\":\"policy-bEmKF\",\"phase_definition\":{\"min_age\":\"0ms\",\"actions\":{\"shrink\":{\"number_of_shards\":5}}},\"version\":1,\"modified_date_in_millis\":1659970962968}","action_time":"1659970968847","phase_time":"1659970966014","action":"shrink","step":"check-target-shards-count","creation_date":"1659970960217","step_time":"1659970972076"}}]

## We change the policy before the condition is even evaluated.
[2022-08-08T18:02:52,825][INFO ][o.e.x.i.a.TransportPutLifecycleAction] [javaRestTest-0] updating index lifecycle policy [policy-bEmKF]
[2022-08-08T18:02:52,826][DEBUG][o.e.x.c.i.PhaseCacheManagement] [javaRestTest-0] [index-zmmrkzfhht] updated policy [policy-bEmKF] contains the same phase step keys and can be refreshed
[2022-08-08T18:02:52,826][TRACE][o.e.x.c.i.PhaseCacheManagement] [javaRestTest-0] [index-zmmrkzfhht] updating cached phase definition for policy [policy-bEmKF]
[2022-08-08T18:02:52,826][DEBUG][o.e.x.c.i.PhaseCacheManagement] [javaRestTest-0] refreshed policy [policy-bEmKF] phase definition for [1] indices

## We check the condition for the first time but the target shard count is already outdated
[2022-08-08T18:02:53,406][ERROR][o.e.x.c.i.CheckTargetShardsCountStep] [javaRestTest-0] elastic#78460: Policy has different target number of shards in cluster state 2 vs what will be executed 5.
[2022-08-08T18:02:53,441][DEBUG][o.e.x.c.i.CheckTargetShardsCountStep] [javaRestTest-0] lifecycle action of policy [policy-bEmKF] for index [index-zmmrkzfhht] cannot make progress because the target shards count [5] must be a factor of the source index's shards count [4]
```

**Impact**
We do not think that the impact is that big for production clusters because there are many more cluster state updates. However, it might cause some inconvenience to the users who fixed a policy and do not see the effect as soon as they could have.

**The fix**
Our proposed fix is to not provide the target shard count upon task creation but to retrieve it from the cluster state. This way we ensure it will have the newest value.

**Future work**
Currently, for every cluster state, we go through all the indices and check if any step needs to be executed. This doesn't scale well. We would like to try to switch to a more efficient model, potentially with a cluster state observer. Issue will be created soon.

Resolves elastic#78460
As it has been backported to 8.4.1 with elastic#89611
As per elastic#66908, the setting now defaults to `True`, but it's not shown in the doc. Can we please have the doc updated?
* [DOCS] Update add node section * Include information for enrolling nodes
We never create a stored context that has transient headers to clear while at the same time preserves response headers. Not supporting this path allows for splitting up the logic into 3 distinct methods that are much easier to follow logically. Also, this commit removes a few redundant threadlocal.get and set calls.
This commit changes the status code returned when the start trained model deployment api times out from `500` to `408`. In addition, we add validation that the timeout must be positive. Relates elastic#89585
LGTM with a small nitpick.
What happens if both size and limit are specified?
@@ -46,6 +46,7 @@ public abstract class BaseEqlSpecTestCase extends RemoteClusterAwareEqlRestTestC
* For now, every value will be converted to a String.
*/
private final String[] joinKeys;
private final Integer size;
I have only one nitpick, around the usage of Integer (object) instead of a primitive.
The sign doesn't seem to be used (head and tail take care of where the limit is applied), hence why not use a negative value as an indicator that the size is out of band and thus unspecified?
As an Integer, size has 3 states: valid if >= 0, unspecified if null, and invalid if < 0.
What I'm saying is to use only 2, through a primitive (int): valid if >= 0, unspecified if < 0.
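For illustration, a tiny self-contained sketch of the two representations being discussed (names are made up for this example and are not the actual fields in BaseEqlSpecTestCase):

```java
// Illustrative only: contrasts the nullable-Integer approach with a primitive-int sentinel.
class SizeRepresentation {
    // Nullable Integer: three states (valid if >= 0, unspecified if null, invalid if < 0).
    private final Integer sizeAsObject;

    // Primitive int with a negative sentinel: two states (valid if >= 0, unspecified if < 0),
    // possible because the sign is not otherwise used for sample sizes.
    private final int sizeAsPrimitive;

    SizeRepresentation(Integer sizeFromSpec) {
        this.sizeAsObject = sizeFromSpec;
        this.sizeAsPrimitive = sizeFromSpec == null ? -1 : sizeFromSpec;
    }

    boolean sizeSpecified() {
        return sizeAsPrimitive >= 0;
    }

    Integer sizeOrNull() {
        return sizeAsObject;
    }
}
```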
> What happens if both size and limit are specified?

Do you mean something like `... | tail n` with n different from size?
For samples, at this stage, pipes are not supported yet, so it cannot happen.
For sequences, both conditions are applied, so the smaller one prevails; this seems like reasonable behavior, so I'd say we could have the same for samples in the future.
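A minimal sketch of the behavior described for sequences, with illustrative names (this is not the actual Elasticsearch code):

```java
// When both a request-level size and a pipe limit (e.g. "| tail n") are present,
// both conditions are applied, so the smaller of the two effectively bounds the results.
class EffectiveLimit {
    static int effectiveLimit(int requestSize, int pipeLimit) {
        return Math.min(requestSize, pipeLimit);
    }

    public static void main(String[] args) {
        System.out.println(effectiveLimit(10, 3)); // prints 3: the pipe limit prevails
    }
}
```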
When DLS/FLS is enabled on an index for a user, the user will not be able to perform operations that can make data content accessible under a different name. This is intentional. This PR makes it clear in the doc. doc preview link: https://elasticsearch_89606.docs-preview.app.elstc.co/guide/en/elasticsearch/reference/master/security-limitations.html#field-document-limitations
Sorry for the noise here: it seems that after a merge the PR shows ~200 commits ahead of the feature branch, but with a manual check it's clear that all of them are already in the feature branch and that both the feature branch and the PR are up to date. I'll give it a few hours to settle and hopefully re-calculate the diff. If it doesn't fix itself, I'll close this and open a new PR.
EQL queries allow passing a `size` parameter to limit the number of returned results. This PR adds support for this parameter on Sample queries.
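For context, here is a sketch of how the request-level `size` parameter is passed to the EQL search API; the index name and query below are placeholders, and with this change the same parameter also bounds the number of samples returned by a `sample` query:

```
GET /my-index/_eql/search
{
  "query": "process where process.name == \"cmd.exe\"",
  "size": 10
}
```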