ESQL: INLINESTATS #109583

nik9000 · 2024-06-11T13:46:31Z

This implements INLINESTATS. Most of the heavy lifting is done by LOOKUP, with this change mostly adding a new abstraction to logical plans, an interface I'm calling Phased. Implementing this interface allows a logical plan node to cut the query into phases. INLINESTATS implements it by asking for a "first phase" that's the same query, up to INLINESTATS, but with INLINESTATS replaced with STATS. The next phase replaces the INLINESTATS with a hash join on the results of the first phase.

So, this query:

FROM foo
| EVAL bar = a * b
| INLINESTATS m = MAX(bar) BY b
| WHERE m == bar
| LIMIT 1

gets split into

FROM foo
| EVAL bar = a * b
| STATS m = MAX(bar) BY b

followed by

FROM foo
| EVAL bar = a * b
| LOOKUP (results of m = MAX(bar) BY b) ON b
| WHERE m == bar
| LIMIT 1
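
For readers who want to see the split in miniature, here is a rough sketch of the two phases in plain Java. It is not the Elasticsearch implementation: the class `TwoPhaseSketch`, its method names, and the `long[] {a, b}` row layout are all made up for illustration. The first method plays the role of the STATS phase, and the second re-reads the same rows and joins the phase-one result back in by `b`.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the two phases; none of these names exist in Elasticsearch.
final class TwoPhaseSketch {
    // Phase 1: the STATS half. Computes MAX(a * b) grouped by b.
    static Map<Long, Long> firstPhase(List<long[]> rows) {
        Map<Long, Long> maxByB = new HashMap<>();
        for (long[] row : rows) {
            long a = row[0];
            long b = row[1];
            maxByB.merge(b, a * b, Long::max);
        }
        return maxByB;
    }

    // Phase 2: re-read the same rows, join the phase-1 result on b, then filter.
    static List<long[]> secondPhase(List<long[]> rows, Map<Long, Long> maxByB) {
        List<long[]> out = new ArrayList<>();
        for (long[] row : rows) {
            long a = row[0];
            long b = row[1];
            long bar = a * b;          // EVAL bar = a * b
            long m = maxByB.get(b);    // the join: look up m by b
            if (m == bar) {            // WHERE m == bar
                out.add(new long[] { a, b, m });
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<long[]> rows = List.of(new long[] { 1, 2 }, new long[] { 3, 2 }, new long[] { 5, 7 });
        for (long[] r : secondPhase(rows, firstPhase(rows))) {
            System.out.println(r[0] + ", " + r[1] + ", " + r[2]);
        }
    }
}
```

With the three sample rows, the row `(1, 2)` is dropped because its product 2 is below the group maximum 6, and the other two rows survive with `m` attached. Note that the sketch reads `rows` twice, which mirrors the fact that both phases start from the same source.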

Here's an example of the syntax:

$ curl -k -XDELETE -u'elastic:password' 'http://localhost:9200/test'
$ for a in {0..99}; do
    echo -n $a
    rm -f /tmp/bulk
    for b in {0..999}; do
        echo '{"index": {}}' >> /tmp/bulk
        echo '{"a": '$a', "b": '$b'}' >> /tmp/bulk
    done
    curl -s -k -XPOST -u'elastic:password' -HContent-Type:application/json 'http://localhost:9200/test/_bulk?pretty' --data-binary @/tmp/bulk | grep errors
done
$ curl -s -k -XPOST -u'elastic:password' -HContent-Type:application/json 'http://localhost:9200/test/_forcemerge?max_num_segments=1&pretty'
$ curl -s -k -XPOST -u'elastic:password' -HContent-Type:application/json 'http://localhost:9200/test/_refresh?pretty'
$ curl -k -XPOST -u'elastic:password' -HContent-Type:application/json 'http://localhost:9200/_query?pretty' -d'{
    "query": "FROM test | INLINESTATS m=MAX(a * b) BY b | WHERE m == a * b | SORT a DESC, b DESC | LIMIT 1",
    "profile": true
}'
{
  "columns" : [
    {
      "name" : "a",
      "type" : "long"
    },
    {
      "name" : "b",
      "type" : "long"
    },
    {
      "name" : "m",
      "type" : "long"
    }
  ],
  "values" : [
    [
      99,
      999,
      98901
    ]

Closes #107589

x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/plan/logical/InlineStats.java

x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/plugin/ComputeService.java

nik9000 · 2024-06-11T13:58:01Z

This is riddled with NOCOMMITs, as any good prototype deserves to be.

It's also not clear that this is the right way to go about this. I mean, cutting the request into two is the right way to do it, I think. The reworking of the callbacks to make that possible seems like the sane way to do that.

But from a planning side, this is an "interesting" choice. We analyze the INLINESTATS call, but then don't run the optimizer on it at all - that seems bad. OTOH, we do run the optimizer on the two halves of the query. That seems good.

nik9000 · 2024-07-03T20:36:15Z

I've extracted #110445 out of this one so I can get it merged and not have to deal with merge conflicts.

nik9000 · 2024-07-09T19:21:18Z

nik9000 · 2024-07-10T12:10:28Z

There's a fun bug that I'm probably going to leave for a follow-up:

FROM airports
| INLINESTATS min_scalerank=MIN(scalerank) BY type
| MV_EXPAND type
| WHERE scalerank == MV_MIN(scalerank);

This hits the rule execution limit. It started from a typo: I meant to write scalerank == MV_MIN(min_scalerank) but didn't, and the query hit the rule execution limit. I then removed extra commands from the query until I got to a more minimal reproduction.

docs/reference/esql/processing-commands/stats.asciidoc

docs/reference/esql/processing-commands/inlinestats.asciidoc

dnhatn

@dnhatn I'm wondering if this Phased strategy would also work in the case of the rate function.

I encountered a similar issue while implementing metrics aggregations. To address it, I chose to execute mixed aggregations, wrapped in to_partial and from_partial, in a single phase along with other aggregations. I made this decision for two reasons:

Data consistency: As others have noted, we need to ensure data consistency between phases, which can be challenging because we open and close readers of target shards in batches on data nodes to avoid holding excessive resources (e.g., file descriptors).
Execution redundancy: We might need to execute the initial part of each phase multiple times, such as LuceneQuery and FieldExtract.

An alternative approach I explored was implementing a multiplexed pipeline. This method involves broadcasting/scattering pages into multiple sub-plans and then gathering the pages. With InlineStats, we can gather and join pages using a HashJoin (or Eval). I did not spend enough time to make this alternative approach ready. However, we can revisit it if we encounter issues with maintaining data consistency across phases.

However, this PR looks good. Great work. Thank you, Nik!

dnhatn · 2024-07-22T01:49:31Z

x-pack/plugin/esql/qa/testFixtures/src/main/resources/inlinestats.csv-spec

+FROM employees
+| KEEP emp_no, languages
+| INLINESTATS max_lang = MAX(languages) 
+| WHERE max_lang == languages


Nice examples ❤️ . It would be lovely if we could push down the filter in the second phase to Lucene instead of using HashJoin, so that we can avoid scanning the entire dataset. Let's address this later.

Yeah! I'm pretty sure we can push these down. That's on the followup list!

astefan

LGTM

I'm OK with the code as it is now, and with the awesome use cases it unlocks, provided the remaining gaps (some missing use cases and some performance improvements) are definitely addressed later and we're OK with the "experimental" label until then.

nik9000 · 2024-07-22T15:26:24Z

NOCOMMIT: Link the new javadocs for Phased and EsqlSession into the package level javadoc.

nik9000 · 2024-07-23T11:20:48Z

NOCOMMIT: Link the new javadocs for Phased and EsqlSession into the package level javadoc.

Pushed a link.

alex-spies · 2024-07-23T13:30:57Z

x-pack/plugin/esql/qa/testFixtures/src/main/resources/inlinestats.csv-spec

+shadowingLimit0
+required_capability: inlinestats
+
+ROW left = "left", client_ip = "172.21.0.5", env = "env", right = "right"
+| INLINESTATS env=VALUES(right) BY client_ip
+| LIMIT 0
+;


I don't think we need the limit0 tests added here; the limit0 tests are only present in enrich.csv-spec so we can run at least the logical optimizer against tests that otherwise would need enrich_load.

alex-spies · 2024-07-23T13:34:10Z

x-pack/plugin/esql/qa/testFixtures/src/main/resources/inlinestats.csv-spec

+ROW city = "Zürich"
+| INLINESTATS x=VALUES(city), x=VALUES(city)


Let's also add a shadowingSelf test!

| INLINESTATS city = COUNT(city)

Oh. That.... Probably isn't going to work. I guess we do want it to work....

That'd be required to be consistent with enrich, eval, dissect and grok. I think we can add this to the list of follow-ups, as this probably requires work on LOOKUP (or JOIN).

alex-spies · 2024-07-23T13:36:54Z

x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/plan/logical/InlineStats.java

+            Object value = BlockUtils.toJavaObject(p.getBlock(i), 0);
+            values.add(new Alias(source(), s.name(), null, new Literal(source(), value, s.dataType()), aggregates.get(i).id()));
+        }
+        return new Eval(source(), child(), values);


Yeahp, this should work and is conceptually a bit nicer (IMHO) than doing that down in the physical mapping!

alex-spies · 2024-07-23T13:39:13Z

x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/plan/logical/InlineStats.java

+    }
+
+    @Override
+    public LogicalPlan nextPhase(List<Attribute> schema, List<Page> firstPhaseResult) {


I think using the schema from the first phase is reasonable, but throwing an IllegalArgumentException if the schema doesn't line up with what we expected will make our lives much easier, esp. if this should blow up in production.
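
As a rough illustration of the fail-fast check suggested here, the sketch below compares an expected schema against the one the first phase actually produced and throws `IllegalArgumentException` on a mismatch. The class `SchemaCheckSketch`, the method `checkSchema`, and the `"name:type"` strings standing in for attributes are all assumptions made for the example, not code from this PR.

```java
import java.util.List;

// Hypothetical helper; "name:type" strings stand in for real attributes.
final class SchemaCheckSketch {
    // Throws if the first phase did not produce the columns the next phase expects.
    static void checkSchema(List<String> expected, List<String> actual) {
        if (expected.equals(actual) == false) {
            throw new IllegalArgumentException(
                "first phase schema mismatch: expected " + expected + " but got " + actual
            );
        }
    }

    public static void main(String[] args) {
        // Matching schemas pass silently.
        checkSchema(List.of("m:long", "b:long"), List.of("m:long", "b:long"));
        try {
            checkSchema(List.of("m:long", "b:long"), List.of("m:long"));
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Putting both schemas in the message is the point of the exercise: if this ever fires in production, the error alone should say which column went missing or changed type.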

alex-spies · 2024-07-23T13:39:42Z

x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/plan/logical/Phased.java

+ * <p>If there are multiple {@linkplain Phased} nodes in the plan we always
+ * operate on the lowest one first, counting from the data source "upwards".
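
The "lowest one first" rule in this javadoc can be pictured with a small sketch. It assumes a purely linear plan where each node has at most one child and the data source sits at the bottom; the `Node` class and `lowestPhased` method are hypothetical and only model the ordering, not how the real plan classes are walked.

```java
// Hypothetical linear plan; real logical plans are trees with richer node types.
final class LowestPhasedSketch {
    static final class Node {
        final String name;
        final boolean phased;
        final Node child; // null means this node is the data source

        Node(String name, boolean phased, Node child) {
            this.name = name;
            this.phased = phased;
            this.child = child;
        }
    }

    // Walk from the root down to the source. The last phased node seen is the
    // one closest to the source, which is the one to operate on first.
    static Node lowestPhased(Node root) {
        Node found = null;
        for (Node n = root; n != null; n = n.child) {
            if (n.phased) {
                found = n;
            }
        }
        return found;
    }

    public static void main(String[] args) {
        Node from = new Node("FROM", false, null);
        Node first = new Node("INLINESTATS #1", true, from);
        Node where = new Node("WHERE", false, first);
        Node second = new Node("INLINESTATS #2", true, where);
        System.out.println(lowestPhased(second).name);
    }
}
```

Resolving the lowest node first makes sense because anything above it may depend on the columns it adds.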


nik9000 · 2024-07-23T19:02:08Z

I've put this behind a feature flag and removed the release highlight because it's not clear that this will be un-feature-flagged in 8.16.

There are a bunch of extra follow ups. I had a conversation with @costin about moving around how we trigger the Phased nature. The plan now is to merge this as is and rework some things.

I'm going to push a few more tests and see if we can land this. Folks can experiment with it some more while we finish up.

alex-spies · 2024-07-24T08:54:13Z

x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/plan/logical/InlineStats.java

+    }
+
+    @Override
+    public LogicalPlan nextPhase(List<Attribute> schema, List<Page> firstPhaseResult) {


Thanks Nik, this is much better now!

I have one last nit: the check is slightly insufficient because

firstPhase().output().equals(schema) == false

will not look at the name ids (they are not checked in NamedExpression.equals(), nor in NamedExpression's descendants). The plan may become inconsistent if the name ids do not line up, even if the first phase produces the correct names and data types. I think what we need is a little helper that we should call here.

public static boolean equalsAndSemanticEquals(List<Attribute> left, List<Attribute> right) {
    if (left.equals(right) == false) {
        return false;
    }
    for (int i = 0; i < left.size(); i++) {
        if (left.get(i).semanticEquals(right.get(i)) == false) {
            return false;
        }
    }
    return true;
}

We could put that into the Expressions class (not Expression).

Either this, or we ignore name ids from the first phase: in ungroupedNextPhase we already do this, because we obtain the name ids from the aggregates. In groupedNextPhase, however, we put the schema's attributes directly as the attributes of the local relation.
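
To make the name id concern concrete, here is a self-contained sketch with a toy attribute type. It assumes, as described in the comment above, that `equals()` compares only name and type while `semanticEquals()` compares the id. The `Attr` class is hypothetical and much simpler than the real `Attribute`.

```java
import java.util.List;
import java.util.Objects;

final class NameIdSketch {
    // Hypothetical attribute: equals() ignores the id, semanticEquals() compares only the id.
    static final class Attr {
        final String name;
        final String type;
        final long id;

        Attr(String name, String type, long id) {
            this.name = name;
            this.type = type;
            this.id = id;
        }

        @Override
        public boolean equals(Object o) {
            if (o instanceof Attr == false) {
                return false;
            }
            Attr other = (Attr) o;
            return name.equals(other.name) && type.equals(other.type);
        }

        @Override
        public int hashCode() {
            return Objects.hash(name, type);
        }

        boolean semanticEquals(Attr other) {
            return id == other.id;
        }
    }

    // Same shape as the proposed helper: list equality first, then ids position by position.
    static boolean equalsAndSemanticEquals(List<Attr> left, List<Attr> right) {
        if (left.equals(right) == false) {
            return false;
        }
        for (int i = 0; i < left.size(); i++) {
            if (left.get(i).semanticEquals(right.get(i)) == false) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Attr> expected = List.of(new Attr("m", "long", 1), new Attr("b", "long", 2));
        List<Attr> sameNamesOtherIds = List.of(new Attr("m", "long", 7), new Attr("b", "long", 2));
        System.out.println(expected.equals(sameNamesOtherIds));
        System.out.println(equalsAndSemanticEquals(expected, sameNamesOtherIds));
    }
}
```

In this toy model the plain `equals` check accepts the second list even though the id of `m` differs, while the combined helper rejects it. That is the gap the nit is pointing at.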

x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/plugin/EsqlPlugin.java

x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/parser/LogicalPlanBuilder.java

nik9000 · 2024-07-24T21:16:21Z

The release test failures look unrelated. We'll fight them as they come up.

costin · 2024-08-28T01:03:29Z

FTR, I've raised a meta ticket to track my progress at #112266

maggieghamry · 2024-09-19T21:54:51Z

@costin would it be possible to update this blog https://www.elastic.co/search-labs/blog/esql-piped-query-language-goes-ga to note that INLINESTATS is not yet available/will only be available in 8.16.0?

nik9000 added release highlight :Analytics/ES|QL AKA ESQL v8.15.0 labels Jun 11, 2024

nik9000 added 3 commits June 11, 2024 09:47

Explain

14017d6

More nocommit

355905a

More nocommit

a634f90

Spotless

8f04e1b

elasticsearchmachine added v8.16.0 and removed v8.15.0 labels Jul 4, 2024

nik9000 added 9 commits July 5, 2024 10:49

Merge branch 'main' into inlinestats

3a3939b

Works again

5483426

Closer

7695f67

Merge branch 'main' into inlinestats

64f858b

More test

141a63f

Share

d0dc736

Merge branch 'main' into inlinestats

6947e1c

More

cc44421

ungrouped

8466a4e

nik9000 added the ES|QL-ui Impacts ES|QL UI label Jul 9, 2024

nik9000 added 3 commits July 10, 2024 08:23

WIt P

cc20b73

Merge branch 'main' into inlinestats

2f9b8af

Remove

c4f1d87

Remove unused

1408824

dnhatn approved these changes Jul 22, 2024

astefan approved these changes Jul 22, 2024

nik9000 added 7 commits July 22, 2024 12:31

Merge branch 'main' into inlinestats

76998b7

techpreview

40b13df

Merge branch 'main' into inlinestats

48253a4

WIP

674d93d

Update

4489b27

Merge branch 'main' into inlinestats

8074a95

Link

10b1fcd

alex-spies reviewed Jul 23, 2024

nik9000 added 3 commits July 23, 2024 13:28

Check

fdb43d8

Feature flag it

cbb1e60

Merge branch 'main' into inlinestats

8fbe301

nik9000 added test-release Trigger CI checks against release build and removed release highlight labels Jul 23, 2024

alex-spies reviewed Jul 24, 2024

nik9000 added 3 commits July 24, 2024 08:18

WI{

09d226a

Merge branch 'main' into inlinestats

0e454f7

more skips

a6ec9be

nik9000 added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Jul 24, 2024

nik9000 merged commit b5c6c2d into elastic:main Jul 24, 2024
14 of 16 checks passed

nik9000 deleted the inlinestats branch July 24, 2024 21:17

stratoula mentioned this pull request Jul 29, 2024

[ES|QL] Support inline stats in autocomplete and client side validation elastic/kibana#189356
