ESQL: Add way for `Block` to `keepMask` #112160

nik9000 · 2024-08-23T17:52:35Z

This adds a `Block#keepMask(BooleanVector)` method that will make a new block, keeping all of the values where the vector is `true` and nulling all of the values where the vector is `false`.
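
To make the semantics concrete, here is a minimal sketch on plain Java arrays rather than the real `Block` and `BooleanVector` types. `KeepMaskSketch` and its `keepMask` helper are hypothetical names used only for illustration; the actual implementation works through block builders.

```java
import java.util.Arrays;

class KeepMaskSketch {
    /**
     * Toy stand-in for Block#keepMask: keep values[p] where mask[p] is true
     * and put null where it is false. A value that is already null stays null.
     */
    static Integer[] keepMask(Integer[] values, boolean[] mask) {
        if (values.length != mask.length) {
            throw new IllegalArgumentException("mask must have one entry per position");
        }
        Integer[] result = new Integer[values.length];
        for (int p = 0; p < values.length; p++) {
            result[p] = mask[p] ? values[p] : null;
        }
        return result;
    }

    public static void main(String[] args) {
        Integer[] values = { 1, null, 3, 4 };
        boolean[] mask = { true, true, false, true };
        // prints [1, null, null, 4]
        System.out.println(Arrays.toString(keepMask(values, mask)));
    }
}
```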

This will be useful for implementing partial aggregation application like `| STATS MAX(a WHERE b > 1), MIN(j WHERE b > 2) BY bar`, or however the syntax ends up being. We already skip `null` group keys and we can evaluate the `b > 2` bits to a mask pretty easily. It should also be useful in optimizing `CASE(a > 2, foo)`, but only when the RHS of the CASE is `null` and the LHS is a constant or constant-like.
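
As a rough illustration of the idea for something like `MAX(a WHERE b > 1)`: evaluate the condition to a mask, null out the positions that fail it, then run an aggregation that skips nulls. This is a sketch on plain arrays under that assumption, not the aggregator code; `MaskedMaxSketch`, `greaterThan`, and `maxSkippingNulls` are hypothetical names, and the sketch aggregates the masked values themselves rather than grouping by a key.

```java
class MaskedMaxSketch {
    /** Evaluate "b > threshold" into a mask, one entry per position. */
    static boolean[] greaterThan(int[] b, int threshold) {
        boolean[] mask = new boolean[b.length];
        for (int p = 0; p < b.length; p++) {
            mask[p] = b[p] > threshold;
        }
        return mask;
    }

    /** Keep values where the mask is true, null them where it is false. */
    static Integer[] keepMask(Integer[] values, boolean[] mask) {
        Integer[] result = new Integer[values.length];
        for (int p = 0; p < values.length; p++) {
            result[p] = mask[p] ? values[p] : null;
        }
        return result;
    }

    /** MAX that ignores nulls; returns null when every position is null. */
    static Integer maxSkippingNulls(Integer[] values) {
        Integer max = null;
        for (Integer v : values) {
            if (v != null && (max == null || v > max)) {
                max = v;
            }
        }
        return max;
    }

    public static void main(String[] args) {
        Integer[] a = { 5, 9, 2, 7 };
        int[] b = { 0, 1, 3, 2 };
        // Mask is [false, false, true, true], so only 2 and 7 survive.
        Integer[] masked = keepMask(a, greaterThan(b, 1));
        System.out.println(maxSkippingNulls(masked)); // prints 7
    }
}
```

The unmasked maximum of `a` is 9, but that row fails `b > 1`, so the masked maximum is 7.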

This is something that's very optimize-able. I haven't really optimized it in this PR, but it should be possible to speed this up a ton and remove a lot of copying. Here's where the benchmarks start:

(dataTypeAndBlockKind)  Mode  Cnt  Score   Error  Units
             int/array  avgt    7  3.705 ± 0.153  ns/op
            int/vector  avgt    7  3.234 ± 0.078  ns/op

That's about the same speed as reading the block. In a few of these cases I expect we can get them to constant performance rather than per-record performance.

elasticsearchmachine · 2024-08-23T17:53:00Z

Pinging @elastic/es-analytical-engine (Team:Analytics)

costin

+1 for BooleanVector (vs BooleanBlock).
There's plenty of optimization space from compact representation for the filter to convenience methods for skipping nulls (similar to a bitset), or combining this with the internal block BitSet.

ChrisHegarty

LGTM

alex-spies

LGTM and ++ to adding tests for float blocks!

alex-spies · 2024-08-27T13:29:56Z

...gin/esql/compute/src/main/generated-src/org/elasticsearch/compute/data/DoubleArrayBlock.java

+            }
+            return (DoubleBlock) blockFactory().newConstantNullBlock(getPositionCount());
+        }
+        try (DoubleBlock.Builder builder = blockFactory().newDoubleBlockBuilder(getPositionCount())) {


For the array blocks, couldn't we just incref the underlying vector and create a new nullsmask from the existing one?

Yes. This is what I meant by "we can optimize this". I did the simplest thing that'd get the tests to pass and figured we could grab more later.

I was thinking that we could try and replace the masks on the Block subclasses with BooleanVector - or BooleanArrayVector - and then in some cases this'd be just incRef-ing a few things and returning a new combined block.
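
A minimal sketch of the null-mask side of that idea, assuming single-valued positions and a `java.util.BitSet` null mask in which a set bit means "null": the new mask is the existing nulls plus every position the keep mask rejects, and the values themselves are never copied. `NullMaskSketch` and `combinedNullMask` are hypothetical names, not part of the compute engine.

```java
import java.util.BitSet;

class NullMaskSketch {
    /**
     * Build the null mask for the masked block: a position is null if it was
     * already null or if keep[p] is false. existingNulls may be null, meaning
     * the source had no nulls. The input BitSet is not modified.
     */
    static BitSet combinedNullMask(BitSet existingNulls, boolean[] keep) {
        BitSet result = existingNulls == null ? new BitSet(keep.length) : (BitSet) existingNulls.clone();
        for (int p = 0; p < keep.length; p++) {
            if (keep[p] == false) {
                result.set(p);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        BitSet existingNulls = new BitSet();
        existingNulls.set(1); // position 1 is already null
        boolean[] keep = { true, true, false, true };
        // prints {1, 2}: position 1 was null, position 2 is masked out
        System.out.println(combinedNullMask(existingNulls, keep));
    }
}
```

Multivalued fields would also need the first-value indices handled, which this sketch ignores.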

alex-spies · 2024-08-27T14:08:00Z

...lugin/esql/compute/src/main/generated-src/org/elasticsearch/compute/data/IntArrayVector.java

+        try (IntBlock.Builder builder = blockFactory().newIntBlockBuilder(getPositionCount())) {
+            // TODO if X-ArrayBlock used BooleanVector for it's null mask then we could shuffle references here.
+            for (int p = 0; p < getPositionCount(); p++) {


Similarly here; an array block is a vector + nullmask + firstvalueindexes array; we could incref the vector and only provide the nullmask, and the firstvalueindexes should be null as there's no MVs.

Right - this is one that I think is super optimizeable. I went with the slow, plodding implementation.

x-pack/plugin/esql/compute/src/main/java/org/elasticsearch/compute/data/Block.java

alex-spies · 2024-08-27T14:12:52Z

x-pack/plugin/esql/compute/src/main/java/org/elasticsearch/compute/data/ConstantNullBlock.java

@@ -83,6 +83,11 @@ public ConstantNullBlock filter(int... positions) {
        return (ConstantNullBlock) blockFactory().newConstantNullBlock(positions.length);
    }

+    @Override
+    public ConstantNullBlock keepMask(BooleanVector mask) {
+        return (ConstantNullBlock) blockFactory().newConstantNullBlock(getPositionCount());


Shouldn't we just incref?

I think I'll have a look at this one on its own - there are a few cases where I think we can `incRef(); return this;` in this class that we're not doing.
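
For illustration only, the `incRef(); return this;` shape could look like the toy below. It uses a hand-rolled reference counter, not the engine's actual ref-counting or `ConstantNullBlock`; `ConstantNullSketch`, `AllNulls`, and `refCount` are hypothetical. The reasoning is that masking a block of all nulls can only produce all nulls, so no new block is needed.

```java
class ConstantNullSketch {
    /** Toy all-null block with a naive, non-thread-safe reference count. */
    static final class AllNulls {
        private final int positionCount;
        private int refs = 1;

        AllNulls(int positionCount) {
            this.positionCount = positionCount;
        }

        int positionCount() {
            return positionCount;
        }

        int refCount() {
            return refs;
        }

        void incRef() {
            refs++;
        }

        /**
         * Every position is already null, so the mask cannot change anything:
         * hand out another reference to this instance instead of allocating.
         */
        AllNulls keepMask(boolean[] mask) {
            incRef();
            return this;
        }
    }

    public static void main(String[] args) {
        AllNulls block = new AllNulls(3);
        AllNulls masked = block.keepMask(new boolean[] { true, false, true });
        System.out.println(masked == block);  // prints true
        System.out.println(block.refCount()); // prints 2
    }
}
```

The caller then releases the returned reference exactly as it would release a freshly built block.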

alex-spies · 2024-08-27T14:13:53Z

x-pack/plugin/esql/compute/src/main/java/org/elasticsearch/compute/data/ConstantNullVector.java

+        assert false : "null vector";
+        throw new UnsupportedOperationException("null vector");


Why do we throw UOE instead of returning a constant null block?

(This seems like something that could blow up in the future, where due to some optimization we end up having a constant null vector in a place where we apply a mask to the vector.)

Vectors can never contain `null` anyway. This exists to make the compiler happy. The class-level javadoc mentions this - it's mostly so we can return something that implements all of the appropriate interfaces from `ConstantNullBlock`. If there's a way to delete this class I'd be happy to.

nik9000 · 2024-08-27T17:54:31Z

elasticsearch-ci/rest-compatibility Pending

This has passed but the communication seems to have dropped. Merging on my own.

nik9000 added >non-issue :Analytics/ES|QL AKA ESQL v8.16.0 labels Aug 23, 2024

elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Aug 23, 2024

costin approved these changes Aug 26, 2024

ChrisHegarty approved these changes Aug 27, 2024

alex-spies approved these changes Aug 27, 2024

Update x-pack/plugin/esql/compute/src/main/java/org/elasticsearch/com…

5f77cb1

…pute/data/Block.java Co-authored-by: Alexander Spies <[email protected]>

nik9000 added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Aug 27, 2024

nik9000 merged commit c05f7e9 into elastic:main Aug 27, 2024
14 of 15 checks passed

nik9000 deleted the esql_keep_mask branch August 27, 2024 17:55

nik9000 mentioned this pull request Sep 3, 2024

ESQL: Add pre and post filter for grouping operator #111439

Open
