ESQL: Remove parent from FieldAttribute #112881

alex-spies · 2024-09-13T16:01:00Z

In #110793, if an index field parent.field had a parent field named parent, the corresponding FieldAttribute started keeping track of a corresponding parent FieldAttribute as well. This was necessary only to keep track of the parent field's name.

The parent FieldAttribute contains a lot of things that are not needed, especially the parent's EsField object, which contains a full map of any subfields. Unless deduplicated, this leads to very high space complexity when serialized.

To avoid mistakes in the future, let's trim things down to the data we actually need to keep track of: remove FieldAttribute.parent, and instead only keep a String FieldAttribute.parentName.

This also enables further improvements to the size of serialized LogicalPlans: we currently send all available EsFields with the EsIndex object; not sending parent FieldAttributes (and thus all EsFields in a field's hierarchy) enables us to avoid sending unnecessary EsFields altogether.
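
As a rough illustration of the change described above, here is a minimal sketch of the two shapes. The types SketchEsField, AttributeWithParent and AttributeWithParentName are hypothetical, simplified stand-ins and not the actual EsField or FieldAttribute classes, which carry more state.

```java
import java.util.Map;

// Hypothetical, simplified stand-in for EsField: a name plus a map of subfields.
record SketchEsField(String name, Map<String, SketchEsField> subfields) {}

// Before: the attribute references its whole parent attribute, and with it
// the parent's field object and every subfield reachable from it.
record AttributeWithParent(AttributeWithParent parent, String name, SketchEsField field) {
    String parentName() {
        return parent == null ? null : parent.name();
    }
}

// After: only the parent's name is kept; nothing else from the parent is reachable.
record AttributeWithParentName(String parentName, String name, SketchEsField field) {
    static AttributeWithParentName from(AttributeWithParent attribute) {
        return new AttributeWithParentName(attribute.parentName(), attribute.name(), attribute.field());
    }
}
```

In this sketch, converting an attribute for parent.child0 keeps the string parent but drops the reference to the parent's field and its sibling subfields.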

elasticsearchmachine · 2024-09-16T10:00:44Z

Hi @alex-spies, I've created a changelog YAML for you.

elasticsearchmachine · 2024-09-16T10:00:44Z

Pinging @elastic/es-analytical-engine (Team:Analytics)

nik9000

LGTM. I think worth having Craig look too.

Does this change those tests for large numbers of conflicts? I expect the numbers on there should drop.

...gin/esql-core/src/main/java/org/elasticsearch/xpack/esql/core/expression/FieldAttribute.java

astefan

LGTM
But one of the failing tests seems to indicate that we are using a bit more now?

ExchangeSinkExecSerializationTests > testManyTypeConflictsWithParent FAILED
    java.lang.AssertionError: 
    Expected: "3.1mb" (<3271486L> bytes)
         but: "3.1mb" (<3307704L> bytes)
        at __randomizedtesting.SeedInfo.seed([7603FA0D74F3937A:7776D00531444912]:0)
        at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
        at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:6)
        at org.elasticsearch.test.ESTestCase.assertThat(ESTestCase.java:2467)
        at org.elasticsearch.xpack.esql.plan.physical.ExchangeSinkExecSerializationTests.testManyTypeConflicts(ExchangeSinkExecSerializationTests.java:123)
        at org.elasticsearch.xpack.esql.plan.physical.ExchangeSinkExecSerializationTests.testManyTypeConflictsWithParent(ExchangeSinkExecSerializationTests.java:83)

alex-spies · 2024-09-16T12:14:36Z

...test/java/org/elasticsearch/xpack/esql/plan/physical/ExchangeSinkExecSerializationTests.java

@@ -80,7 +80,7 @@ public void testManyTypeConflicts() throws IOException {
     * See {@link #testManyTypeConflicts(boolean, ByteSizeValue)} for more.
     */
    public void testManyTypeConflictsWithParent() throws IOException {
-        testManyTypeConflicts(true, ByteSizeValue.ofBytes(3271486));
+        testManyTypeConflicts(true, ByteSizeValue.ofBytes(3307704L));


Fun! Since FieldAttributes are being deduplicated but strings are not, this means that the serialized size of a plan can go up even though this change otherwise simplifies things.

Example: for fields parentWithALongName.x, parentWithALongName.y, parentWithALongName.z, the parent is parentWithALongName all the time, so before this PR, this gets stuffed into a FieldAttribute and reused during serialization, taking up only 1 optional long's space after the first time it's written.

This could be alleviated by caching (some?) strings in our plans; not sure that's worth the trouble, though.
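
To make the caching idea concrete, here is a minimal sketch of per-stream string caching on the write side. CachedStringWriter and its methods are hypothetical names for illustration; it uses plain java.io streams rather than Elasticsearch's stream classes, and the wire format (a boolean marker followed by either the string or an int id) is an assumption, not the format used by ES|QL plans.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of per-stream string caching; not the Elasticsearch implementation.
final class CachedStringWriter {
    private final Map<String, Integer> idsByString = new HashMap<>();
    private final DataOutputStream out;

    CachedStringWriter(DataOutputStream out) {
        this.out = out;
    }

    // First occurrence: marker + full string. Repeats: marker + integer id.
    void writeCachedString(String value) throws IOException {
        Integer id = idsByString.get(value);
        if (id == null) {
            idsByString.put(value, idsByString.size());
            out.writeBoolean(true);
            out.writeUTF(value);
        } else {
            out.writeBoolean(false);
            out.writeInt(id);
        }
    }

    // Serialized size when every string is written in full.
    static int sizeWithoutCache(List<String> values) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (String value : values) {
                out.writeUTF(value);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    // Serialized size when repeated strings are replaced by ids.
    static int sizeWithCache(List<String> values) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            CachedStringWriter writer = new CachedStringWriter(out);
            for (String value : values) {
                writer.writeCachedString(value);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }
}
```

With three copies of parentWithALongName, the cached variant writes the string once and two small ids, so it comes out smaller; for strings that never repeat, the marker makes it slightly larger. A matching reader would have to rebuild the same id table in order of first appearance.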

Since FieldAttributes are being deduplicated but strings are not

I have a bit of a concern now that you mentioned this bit (thank you for explaining 👍 ): the test itself has a single level parent hierarchy for the fields it's creating. I am wondering if there will (or won't) be an exponential growth in serialized data size if the mapping would be many layers deep (I don't know... 4-5-6), something that would reflect better the ECS mappings.

The concern on the advantage of field deduplication vs. fields only is reasonable.
For the record, I tested string caching and it gives us a 5-20% improvement on these tests. I have a local branch; I can revive it and submit a PR. Maybe it can also help with parent names.

I'm working on adding a test to simulate a precarious situation with lots of nesting.

FWIW, I don't think the cost can be exponential, because the added cost is per fieldname that we serialize; shouldn't matter how nested the field hierarchy is, for every nested.field.maybe.even.nested we send an additional string that's bounded by the field name, e.g. nested.field.maybe.even in this case.
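
The bound can be sketched as follows. ParentNameCost is a hypothetical helper, and it assumes that the parent name is simply everything before the last dot of the full field name, which is a simplification of how mappings are resolved.

```java
import java.util.List;

// Hypothetical helper illustrating why the added cost is linear in the serialized names.
final class ParentNameCost {
    // Assumption: the parent name is the prefix before the last dot, or null for top-level fields.
    static String parentName(String fieldName) {
        int lastDot = fieldName.lastIndexOf('.');
        return lastDot < 0 ? null : fieldName.substring(0, lastDot);
    }

    // Extra characters sent when each field also carries its parent's name.
    static long addedChars(List<String> fieldNames) {
        long total = 0;
        for (String name : fieldNames) {
            String parent = parentName(name);
            if (parent != null) {
                total += parent.length();
            }
        }
        return total;
    }

    // Characters already sent for the field names themselves.
    static long nameChars(List<String> fieldNames) {
        long total = 0;
        for (String name : fieldNames) {
            total += name.length();
        }
        return total;
    }
}
```

Because each parent name is a strict prefix of its field's name, addedChars never exceeds nameChars, so under these assumptions the extra strings can at most double the characters spent on names, however deep the hierarchy is.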

I added two tests. Both use a rather deep and wide mapping: 6 levels deep, each level has 9 children (per node), so it's 9^6 relevant fields.

For FROM idx | LIMIT 10, the plan size on main is ~100MB, with this PR it grows to 130MB.
For FROM idx | LIMIT 10 | KEEP one_field the plan size is ~20MB on both - it's absolutely dominated by the size of the serialized EsIndex, that's written as part of the EsRelation.
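
For a sense of scale, the arithmetic behind these figures can be sketched like this. DeepMappingEstimate is a hypothetical helper; the per-field figure is only a back-of-the-envelope average derived from the approximate sizes quoted above, not a measurement.

```java
// Hypothetical helper for back-of-the-envelope estimates on the deep, wide test mapping.
final class DeepMappingEstimate {
    // Number of leaf fields in a tree where every node has childrenPerNode children, depth levels deep.
    static long leafCount(int childrenPerNode, int depth) {
        long leaves = 1;
        for (int level = 0; level < depth; level++) {
            leaves = Math.multiplyExact(leaves, childrenPerNode);
        }
        return leaves;
    }

    // Average growth per leaf field, given the total growth of the serialized plan in bytes.
    static double bytesPerLeaf(long growthInBytes, long leaves) {
        return (double) growthInBytes / leaves;
    }
}
```

With 9 children per node and 6 levels this gives 531,441 leaf fields. Taking the growth as roughly 30 MB, the average works out to somewhere around 55 to 60 bytes per field, depending on whether MB is read as 10^6 or 2^20 bytes.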

the plan size on main is ~100MB, with this PR it grows to 130MB

That's not insignificant. @nik9000 what's your gut feeling about this?

Oh, and plan deserialization OOMs on FROM idx | LIMIT 10 with and without this PR. @luigidellaquila, should we tackle this as part of #111358?

.../plugin/esql/src/test/java/org/elasticsearch/xpack/esql/index/EsIndexSerializationTests.java

alex-spies · 2024-09-16T15:03:29Z

I believe this is a useful simplification, but given that this has a chance to make plans bigger, I think we should wait with merging this - @luigidellaquila has #112929 which together with this here should make things nice and clean.

costin

I understand the intent of this PR; however, since references are being used and recognized, both the runtime and serialized cost are actually small.

For FROM idx | LIMIT 10, the plan size on main is ~100MB, with this PR it grows to 130MB.

That's a 30% increase in memory.
Unless the situation goes in the other direction, I'm 👎 on this PR since it complicates field hierarchy navigation without any upside.

Side-note:
With the current approach, the storage is going to increase since the string of the parent flattens its hierarchy instead of delegating.
Before this PR, field a.b.c is represented as field c with parent b with parent a. With this PR the parent of c becomes b.a which causes a cache miss on all of its parents.
So a field with depth D will cause cache hits on D-2 ancestors, which gets multiplied by the amount of breadth each level has.
The workaround is to wrap the String (which is immutable) in a reference and not flatten the hierarchy in a string, namely FieldAttribute.

alex-spies · 2024-09-17T09:59:53Z

I understand the intent of this PR; however, since references are being used and recognized, both the runtime and serialized cost are actually small.

For FROM idx | LIMIT 10, the plan size on main is ~100MB, with this PR it grows to 130MB.

That's a 30% increase in memory. Unless the situation goes in the other direction, I'm 👎 on this PR since it complicates field hierarchy navigation without any upside.

I agree. After #112929, this PR should reduce the plan size.

Once that's the case, I think it's important that we go forward with slimming down FieldAttribute.

Currently, if we have an index with fields parent.child0, ..., parent.childN and we run FROM idx* | KEEP parent.child0, the single field attribute for parent.child0 contains the FA for parent, which contains the corresponding EsField and thus aaall the subfields, their types, their type conflicts (in case of InvalidMappedFields) etc. which all don't matter for this FieldAttribute or the query.

Side-note: With the current approach, the storage is going to increase since the string of the parent flattens its hierarchy instead of delegating. Before this PR, field a.b.c is represented as field c with parent b with parent a. With this PR the parent of c becomes b.a which causes a cache miss on all of its parents. So a field with depth D will cause cache hits on D-2 ancestors, which gets multiplied by the amount of breadth each level has. The workaround is to wrap the String (which is immutable) in a reference and not flatten the hierarchy in a string, namely FieldAttribute.

Once we have string caching during de-/serialization, this PR will strictly throw away unneeded data. That's because in the current state, a field a.b.c is not represented as field c with parent b with parent a - it's represented as field a.b.c with parent a.b with parent a. Instead of using the whole parent field attribute, we should use only the name.

(Additionally, I don't think that cache hits on d-2 ancestors matter, because field attributes are only really used for leaf fields. So for a field a.b.c that's used in a field attribute in our logical plans, a and a.b only matter while resolving the index mapping. (Except for exact subfields, but that's just 1 level of parent-child relationships).)

elasticsearchmachine · 2024-10-17T12:41:43Z

💔 Backport failed

Status	Branch	Result
❌	8.x	Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 112881

alex-spies · 2024-10-17T12:53:18Z

💚 All backports created successfully

Status	Branch	Result
✅	8.x

Questions ?

Please refer to the Backport tool documentation

To avoid serializing unnecessary data, remove FieldAttribute.parent, and instead only keep a String FieldAttribute.parentName. (cherry picked from commit caa16b4) # Conflicts: # server/src/main/java/org/elasticsearch/TransportVersions.java

This reverts commit caa16b4.

…lastic#115006) This reverts commit 17ecb66.

…#115007) This reverts commit 17ecb66 and reapplies #112881 once the previous, non-backported transport version bump is dealt with.

…lastic#115006) (elastic#115007) This reverts commit 17ecb66 and reapplies elastic#112881 once the previous, non-backported transport version bump is dealt with.

…#115007) (#115035) This reverts commit 17ecb66 and reapplies #112881 once the previous, non-backported transport version bump is dealt with.

Remove parent from FieldAttribute

0e7a097

elasticsearchmachine added the v9.0.0 label Sep 13, 2024

alex-spies requested a review from craigtaverner September 13, 2024 16:10

alex-spies added the v8.16.0 label Sep 16, 2024

alex-spies added 2 commits September 16, 2024 10:01

Merge remote-tracking branch 'upstream/main' into remove-fieldattribu…

1ba70cd

…te-parent

Update test

47e6ccc

alex-spies marked this pull request as ready for review September 16, 2024 09:59

alex-spies requested a review from a team as a code owner September 16, 2024 09:59

alex-spies requested a review from luigidellaquila September 16, 2024 10:00

elasticsearchmachine added the needs:triage Requires assignment of a team area label label Sep 16, 2024

alex-spies added >enhancement auto-backport-and-merge :Analytics/ES|QL AKA ESQL labels Sep 16, 2024

elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Sep 16, 2024

Update docs/changelog/112881.yaml

1058efd

elasticsearchmachine removed the needs:triage Requires assignment of a team area label label Sep 16, 2024

nik9000 approved these changes Sep 16, 2024

...gin/esql-core/src/main/java/org/elasticsearch/xpack/esql/core/expression/FieldAttribute.java

astefan approved these changes Sep 16, 2024

Update serialization test

43eb12f

alex-spies commented Sep 16, 2024

Test for deeply nested fields

642f0e6

astefan reviewed Sep 16, 2024

.../plugin/esql/src/test/java/org/elasticsearch/xpack/esql/index/EsIndexSerializationTests.java

Add another test: select single field attribute

a2fc167

alex-spies added 2 commits September 16, 2024 18:09

Refactor FieldAttribute's ser/de methods

f62da85

Make test runnable

16fbfb5

costin requested changes Sep 16, 2024

brianseeders added the v8.17.0 label Oct 16, 2024

alex-spies added 6 commits October 17, 2024 07:29

Merge remote-tracking branch 'upstream/main' into remove-fieldattribu…

a678b01

…te-parent

Address feedback

77f1cc1

Own method for reading cached str + version check

35b2c9e

Own method for writing cached str + version check

931d24d

Merge remote-tracking branch 'upstream/main' into remove-fieldattribu…

8dd478c

…te-parent

Fix copy-paste mistake

c526d57

alex-spies merged commit caa16b4 into elastic:main Oct 17, 2024
16 checks passed

alex-spies deleted the remove-fieldattribute-parent branch October 17, 2024 12:40

elasticsearchmachine added the backport pending label Oct 17, 2024

alex-spies mentioned this pull request Oct 17, 2024

[8.x] ESQL: Remove parent from FieldAttribute (#112881) #115005

Closed

alex-spies added a commit to alex-spies/elasticsearch that referenced this pull request Oct 17, 2024

Revert "ESQL: Remove parent from FieldAttribute (elastic#112881)"

cf840f0

This reverts commit caa16b4.

alex-spies mentioned this pull request Oct 17, 2024

Revert "ESQL: Remove parent from FieldAttribute (#112881)" #115006

Merged

alex-spies added a commit that referenced this pull request Oct 17, 2024

Revert "ESQL: Remove parent from FieldAttribute (#112881)" (#115006)

17ecb66

This reverts commit caa16b4.

alex-spies added a commit to alex-spies/elasticsearch that referenced this pull request Oct 17, 2024

Reapply "ESQL: Remove parent from FieldAttribute (elastic#112881)" (e…

651fa2e

…lastic#115006) This reverts commit 17ecb66.

alex-spies mentioned this pull request Oct 17, 2024

Reapply "ESQL: Remove parent from FieldAttribute (#112881)" (#115006) #115007

Merged

alex-spies mentioned this pull request Oct 17, 2024

[8.x] Reapply "ESQL: Remove parent from FieldAttribute (#112881)" (#115006) (#115007) #115035

Merged

georgewallace pushed a commit to georgewallace/elasticsearch that referenced this pull request Oct 25, 2024

Revert "ESQL: Remove parent from FieldAttribute (elastic#112881)" (el…

5657fee

…astic#115006) This reverts commit caa16b4.

jfreden pushed a commit to jfreden/elasticsearch that referenced this pull request Nov 4, 2024

ESQL: Remove parent from FieldAttribute (elastic#112881)

7168c6e

To avoid serializing unnecessary data, remove FieldAttribute.parent, and instead only keep a String FieldAttribute.parentName.

jfreden pushed a commit to jfreden/elasticsearch that referenced this pull request Nov 4, 2024

Revert "ESQL: Remove parent from FieldAttribute (elastic#112881)" (el…

250ae63

…astic#115006) This reverts commit caa16b4.
