Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ESQL: Remove parent from FieldAttribute #112881

Merged
merged 18 commits into from
Oct 17, 2024

Conversation

alex-spies
Copy link
Contributor

@alex-spies alex-spies commented Sep 13, 2024

In #110793, if an index field parent.field had a parent parent, the corresponding FieldAttribute started keeping track of a corresponding parent FieldAttribute as well. This was necessary only to keep track of the parent field's name.

The parent FieldAttribute contains a lot of things that are not needed, especially the parent's EsField object, which contains a full map of any subfields. Unless deduplicated, this leads to very high spacial complexity when serialized.

To avoid mistakes in the future, let's trim things down to the data we actually need to keep track of: remove FieldAttribute.parent, and instead only keep a String FieldAttribute.parentName.

This also enables further improvements to the size of serialized LogicalPlans: we currently send all available EsFields with theEsIndex object; not sending parent FieldAttributes (and thus all EsFields in a field's hierarchy) enables us to avoid sending unnecessary EsFields altogether.

@alex-spies alex-spies marked this pull request as ready for review September 16, 2024 09:59
@alex-spies alex-spies requested a review from a team as a code owner September 16, 2024 09:59
@elasticsearchmachine elasticsearchmachine added the needs:triage Requires assignment of a team area label label Sep 16, 2024
@elasticsearchmachine elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Sep 16, 2024
@elasticsearchmachine
Copy link
Collaborator

Hi @alex-spies, I've created a changelog YAML for you.

@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@elasticsearchmachine elasticsearchmachine removed the needs:triage Requires assignment of a team area label label Sep 16, 2024
Copy link
Member

@nik9000 nik9000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I think worth having Craig look too.

Does this change those tests for large numbers of conflicts? I expect the numbers on there should drop.

Copy link
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
But one of the failing tests seems to indicate that we are using a bit more now?

ExchangeSinkExecSerializationTests > testManyTypeConflictsWithParent FAILED
    java.lang.AssertionError: 
    Expected: "3.1mb" (<3271486L> bytes)
         but: "3.1mb" (<3307704L> bytes)
        at __randomizedtesting.SeedInfo.seed([7603FA0D74F3937A:7776D00531444912]:0)
        at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
        at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:6)
        at org.elasticsearch.test.ESTestCase.assertThat(ESTestCase.java:2467)
        at org.elasticsearch.xpack.esql.plan.physical.ExchangeSinkExecSerializationTests.testManyTypeConflicts(ExchangeSinkExecSerializationTests.java:123)
        at org.elasticsearch.xpack.esql.plan.physical.ExchangeSinkExecSerializationTests.testManyTypeConflictsWithParent(ExchangeSinkExecSerializationTests.java:83)

@@ -80,7 +80,7 @@ public void testManyTypeConflicts() throws IOException {
* See {@link #testManyTypeConflicts(boolean, ByteSizeValue)} for more.
*/
public void testManyTypeConflictsWithParent() throws IOException {
testManyTypeConflicts(true, ByteSizeValue.ofBytes(3271486));
testManyTypeConflicts(true, ByteSizeValue.ofBytes(3307704L));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fun! Since FieldAttributes are being deduplicated but strings are not, this means that the serialized size of a plan can go up even though this change otherwise simplifies things.

Example: for fields parentWithALongName.x, parentWithALongName.y, parentWithALongName.z, the parent is parentWithALongName all the time, so before this PR, this get's stuffed into a FieldAttribute and reused during serialization, taking up only 1 optional long's space after the first time its written.

This could be alleviated by caching (some?) strings in our plans; not sure that's worth the trouble, though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since FieldAttributes are being deduplicated but strings are not

I have a bit of a concern now that you mentioned this bit (thank you for explaining 👍 ): the test itself has a single level parent hierarchy for the fields it's creating. I am wondering if there will (or won't) be an exponential growth in serialized data size if the mapping would be many layers deep (I don't know... 4-5-6), something that would reflect better the ECS mappings.

Copy link
Contributor

@luigidellaquila luigidellaquila Sep 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The concern on the advantage of field deduplication vs. fields only is reasonable.
For the record, I tested string caching and it gives us a 5-20% on these tests. I have a local branch, I can revive it and submit a PR, maybe it can give advantages also for parent names.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm working on adding a test to simulate a precarious situation with lots of nesting.

FWIW, I don't think the cost can be exponential, because the added cost is per fieldname that we serialize; shouldn't matter how nested the field hierarchy is, for every nested.field.maybe.even.nested we send an addtional string that's bounded by the field name, e.g. nested.field.maybe.even in this case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added two tests. Both use a rather deep and wide mapping: 6 levels deep, each level has 9 children (per node), so it's 9^6 relevant fields.

For FROM idx | LIMIT 10, the plan size on main is ~100MB, with this PR it grows to 130MB.
For FROM idx | LIMIT 10 | KEEP one_field the plan size is ~20MB on both - it's absolutely dominated by the size of the serialized EsIndex, that's written as part of the EsRelation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the plan size on main is ~100MB, with this PR it grows to 130MB

That's not insignificant. @nik9000 what's your gut feeling about this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, and plan deserialization OOMs on FROM idx | LIMIT 10 with and without this PR. @luigidellaquila, should we tackle this as part of #111358?

@alex-spies
Copy link
Contributor Author

I believe this is a useful simplification, but given that this has a chance to make plans bigger, I think we should wait with merging this - @luigidellaquila has #112929 which together with this here should make things nice and clean.

Copy link
Member

@costin costin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the intent of this PR however since references are being used and recognized, both the runtime and serialized cost are actually small.

For FROM idx | LIMIT 10, the plan size on main is ~100MB, with this PR it grows to 130MB.

That's a 30% increase in memory.
Unless the situation goes in the other direction, I'm 👎 on this PR since it complicate field hierarchy navigation without any upside.

Side-note:
With the current approach, the storage is going to increase since the string of the parent flattens its hierarchy instead of delegating.
Before this PR, field a.b.c is represented as field c with parent b with parent a. With this PR the parent of c becomes b.a which causes a cache miss on all of its parents.
So field with depth D will cause cache hits on D-2 ancestors which gets multiplied by the amount of breath each level has.
The workaround is to wrap the String (which is immutable) in a reference and not flatten the hierarchy in a string, namely FieldAttribute.

@alex-spies
Copy link
Contributor Author

I understand the intent of this PR however since references are being used and recognized, both the runtime and serialized cost are actually small.

For FROM idx | LIMIT 10, the plan size on main is ~100MB, with this PR it grows to 130MB.

That's a 30% increase in memory. Unless the situation goes in the other direction, I'm 👎 on this PR since it complicate field hierarchy navigation without any upside.

I agree. After #112929, this PR should reduce the plan size.

Once that's the case, I think it's important that we go forward with slimming down FieldAttribute.

Currently, if we have an index with fields
parent.child0, ..., parent.childN and we run FROM idx* | KEEP parent.child0, the single field attribute for parent.child0 contains the FA for parent, which contains the corresponding EsField and thus aaall the subfields, their types, their type conflicts (in case of InvalidMappedFields) etc. which all don't matter for this FieldAttribute or the query.

Side-note: With the current approach, the storage is going to increase since the string of the parent flattens its hierarchy instead of delegating. Before this PR, field a.b.c is represented as field c with parent b with parent a. With this PR the parent of c becomes b.a which causes a cache miss on all of its parents. So field with depth D will cause cache hits on D-2 ancestors which gets multiplied by the amount of breath each level has. The workaround is to wrap the String (which is immutable) in a reference and not flatten the hierarchy in a string, namely FieldAttribute.

Once we have string caching during de-/serialization, this PR will strictly throw away unneeded data. That's because in the current state, a field a.b.c is not represented as field c with parent b with parent a - it's represented as field a.b.c with parent a.b with parent a. Instead of using the whole parent field attribute, we should use only the name.

(Additionally, I don't think that cache hits on d-2 ancestors matter, because field attributes are only really used for leaf fields. So for a field a.b.c that's used in a field attribute in our logical plans, a and a.b only matter while resolving the index mapping. (Except for exact subfields, but that's just 1 level of parent-child relationships).)

@alex-spies alex-spies merged commit caa16b4 into elastic:main Oct 17, 2024
16 checks passed
@alex-spies alex-spies deleted the remove-fieldattribute-parent branch October 17, 2024 12:40
@elasticsearchmachine
Copy link
Collaborator

💔 Backport failed

Status Branch Result
8.x Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 112881

@alex-spies
Copy link
Contributor Author

💚 All backports created successfully

Status Branch Result
8.x

Questions ?

Please refer to the Backport tool documentation

alex-spies added a commit to alex-spies/elasticsearch that referenced this pull request Oct 17, 2024
To avoid serializing unnecessary data, remove FieldAttribute.parent, and instead only keep a String FieldAttribute.parentName.

(cherry picked from commit caa16b4)

# Conflicts:
#	server/src/main/java/org/elasticsearch/TransportVersions.java
alex-spies added a commit to alex-spies/elasticsearch that referenced this pull request Oct 17, 2024
alex-spies added a commit that referenced this pull request Oct 17, 2024
alex-spies added a commit to alex-spies/elasticsearch that referenced this pull request Oct 17, 2024
elasticsearchmachine pushed a commit that referenced this pull request Oct 17, 2024
…#115007)

This reverts commit 17ecb66 and
reapplies #112881 once the
previous, non-backported transport version bump is dealt with.
alex-spies added a commit to alex-spies/elasticsearch that referenced this pull request Oct 17, 2024
…lastic#115006) (elastic#115007)

This reverts commit 17ecb66 and
reapplies elastic#112881 once the
previous, non-backported transport version bump is dealt with.
elasticsearchmachine pushed a commit that referenced this pull request Oct 17, 2024
…#115007) (#115035)

This reverts commit 17ecb66 and
reapplies #112881 once the
previous, non-backported transport version bump is dealt with.
lkts pushed a commit to lkts/elasticsearch that referenced this pull request Oct 18, 2024
…lastic#115006) (elastic#115007)

This reverts commit 17ecb66 and
reapplies elastic#112881 once the
previous, non-backported transport version bump is dealt with.
georgewallace pushed a commit to georgewallace/elasticsearch that referenced this pull request Oct 25, 2024
To avoid serializing unnecessary data, remove FieldAttribute.parent, and instead only keep a String FieldAttribute.parentName.
georgewallace pushed a commit to georgewallace/elasticsearch that referenced this pull request Oct 25, 2024
georgewallace pushed a commit to georgewallace/elasticsearch that referenced this pull request Oct 25, 2024
…lastic#115006) (elastic#115007)

This reverts commit 17ecb66 and
reapplies elastic#112881 once the
previous, non-backported transport version bump is dealt with.
jfreden pushed a commit to jfreden/elasticsearch that referenced this pull request Nov 4, 2024
To avoid serializing unnecessary data, remove FieldAttribute.parent, and instead only keep a String FieldAttribute.parentName.
jfreden pushed a commit to jfreden/elasticsearch that referenced this pull request Nov 4, 2024
jfreden pushed a commit to jfreden/elasticsearch that referenced this pull request Nov 4, 2024
…lastic#115006) (elastic#115007)

This reverts commit 17ecb66 and
reapplies elastic#112881 once the
previous, non-backported transport version bump is dealt with.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/ES|QL AKA ESQL auto-backport Automatically create backport pull requests when merged backport pending >enhancement Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v8.17.0 v9.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants