-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ESQL: Remove parent from FieldAttribute #112881
ESQL: Remove parent from FieldAttribute #112881
Conversation
Hi @alex-spies, I've created a changelog YAML for you. |
Pinging @elastic/es-analytical-engine (Team:Analytics) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. I think worth having Craig look too.
Does this change those tests for large numbers of conflicts? I expect the numbers on there should drop.
...gin/esql-core/src/main/java/org/elasticsearch/xpack/esql/core/expression/FieldAttribute.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
But one of the failing tests seems to indicate that we are using a bit more now?
ExchangeSinkExecSerializationTests > testManyTypeConflictsWithParent FAILED
java.lang.AssertionError:
Expected: "3.1mb" (<3271486L> bytes)
but: "3.1mb" (<3307704L> bytes)
at __randomizedtesting.SeedInfo.seed([7603FA0D74F3937A:7776D00531444912]:0)
at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:6)
at org.elasticsearch.test.ESTestCase.assertThat(ESTestCase.java:2467)
at org.elasticsearch.xpack.esql.plan.physical.ExchangeSinkExecSerializationTests.testManyTypeConflicts(ExchangeSinkExecSerializationTests.java:123)
at org.elasticsearch.xpack.esql.plan.physical.ExchangeSinkExecSerializationTests.testManyTypeConflictsWithParent(ExchangeSinkExecSerializationTests.java:83)
@@ -80,7 +80,7 @@ public void testManyTypeConflicts() throws IOException { | |||
* See {@link #testManyTypeConflicts(boolean, ByteSizeValue)} for more. | |||
*/ | |||
public void testManyTypeConflictsWithParent() throws IOException { | |||
testManyTypeConflicts(true, ByteSizeValue.ofBytes(3271486)); | |||
testManyTypeConflicts(true, ByteSizeValue.ofBytes(3307704L)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fun! Since FieldAttributes are being deduplicated but strings are not, this means that the serialized size of a plan can go up even though this change otherwise simplifies things.
Example: for fields parentWithALongName.x, parentWithALongName.y, parentWithALongName.z
, the parent is parentWithALongName
all the time, so before this PR, this get's stuffed into a FieldAttribute and reused during serialization, taking up only 1 optional long's space after the first time its written.
This could be alleviated by caching (some?) strings in our plans; not sure that's worth the trouble, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since FieldAttributes are being deduplicated but strings are not
I have a bit of a concern now that you mentioned this bit (thank you for explaining 👍 ): the test itself has a single level parent hierarchy for the fields it's creating. I am wondering if there will (or won't) be an exponential growth in serialized data size if the mapping would be many layers deep (I don't know... 4-5-6), something that would reflect better the ECS mappings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The concern on the advantage of field deduplication vs. fields only is reasonable.
For the record, I tested string caching and it gives us a 5-20% on these tests. I have a local branch, I can revive it and submit a PR, maybe it can give advantages also for parent names.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm working on adding a test to simulate a precarious situation with lots of nesting.
FWIW, I don't think the cost can be exponential, because the added cost is per fieldname that we serialize; shouldn't matter how nested the field hierarchy is, for every nested.field.maybe.even.nested
we send an addtional string that's bounded by the field name, e.g. nested.field.maybe.even
in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added two tests. Both use a rather deep and wide mapping: 6 levels deep, each level has 9 children (per node), so it's 9^6 relevant fields.
For FROM idx | LIMIT 10
, the plan size on main is ~100MB, with this PR it grows to 130MB.
For FROM idx | LIMIT 10 | KEEP one_field
the plan size is ~20MB on both - it's absolutely dominated by the size of the serialized EsIndex
, that's written as part of the EsRelation
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the plan size on main is ~100MB, with this PR it grows to 130MB
That's not insignificant. @nik9000 what's your gut feeling about this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, and plan deserialization OOMs on FROM idx | LIMIT 10
with and without this PR. @luigidellaquila, should we tackle this as part of #111358?
.../plugin/esql/src/test/java/org/elasticsearch/xpack/esql/index/EsIndexSerializationTests.java
Outdated
Show resolved
Hide resolved
I believe this is a useful simplification, but given that this has a chance to make plans bigger, I think we should wait with merging this - @luigidellaquila has #112929 which together with this here should make things nice and clean. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand the intent of this PR however since references are being used and recognized, both the runtime and serialized cost are actually small.
For FROM idx | LIMIT 10, the plan size on main is ~100MB, with this PR it grows to 130MB.
That's a 30% increase in memory.
Unless the situation goes in the other direction, I'm 👎 on this PR since it complicate field hierarchy navigation without any upside.
Side-note:
With the current approach, the storage is going to increase since the string of the parent flattens its hierarchy instead of delegating.
Before this PR, field a.b.c
is represented as field c
with parent b
with parent a
. With this PR the parent of c
becomes b.a
which causes a cache miss on all of its parents.
So field with depth D will cause cache hits on D-2 ancestors which gets multiplied by the amount of breath each level has.
The workaround is to wrap the String (which is immutable) in a reference and not flatten the hierarchy in a string, namely FieldAttribute.
I agree. After #112929, this PR should reduce the plan size. Once that's the case, I think it's important that we go forward with slimming down FieldAttribute. Currently, if we have an index with fields
Once we have string caching during de-/serialization, this PR will strictly throw away unneeded data. That's because in the current state, a field (Additionally, I don't think that cache hits on d-2 ancestors matter, because field attributes are only really used for leaf fields. So for a field |
💔 Backport failed
You can use sqren/backport to manually backport by running |
💚 All backports created successfully
Questions ?Please refer to the Backport tool documentation |
To avoid serializing unnecessary data, remove FieldAttribute.parent, and instead only keep a String FieldAttribute.parentName. (cherry picked from commit caa16b4) # Conflicts: # server/src/main/java/org/elasticsearch/TransportVersions.java
This reverts commit caa16b4.
…lastic#115006) This reverts commit 17ecb66.
…lastic#115006) (elastic#115007) This reverts commit 17ecb66 and reapplies elastic#112881 once the previous, non-backported transport version bump is dealt with.
…lastic#115006) (elastic#115007) This reverts commit 17ecb66 and reapplies elastic#112881 once the previous, non-backported transport version bump is dealt with.
To avoid serializing unnecessary data, remove FieldAttribute.parent, and instead only keep a String FieldAttribute.parentName.
…astic#115006) This reverts commit caa16b4.
…lastic#115006) (elastic#115007) This reverts commit 17ecb66 and reapplies elastic#112881 once the previous, non-backported transport version bump is dealt with.
To avoid serializing unnecessary data, remove FieldAttribute.parent, and instead only keep a String FieldAttribute.parentName.
…astic#115006) This reverts commit caa16b4.
…lastic#115006) (elastic#115007) This reverts commit 17ecb66 and reapplies elastic#112881 once the previous, non-backported transport version bump is dealt with.
In #110793, if an index field
parent.field
had a parentparent
, the corresponding FieldAttribute started keeping track of a corresponding parent FieldAttribute as well. This was necessary only to keep track of the parent field's name.The parent FieldAttribute contains a lot of things that are not needed, especially the parent's
EsField
object, which contains a full map of any subfields. Unless deduplicated, this leads to very high spacial complexity when serialized.To avoid mistakes in the future, let's trim things down to the data we actually need to keep track of: remove
FieldAttribute.parent
, and instead only keep a StringFieldAttribute.parentName
.This also enables further improvements to the size of serialized
LogicalPlan
s: we currently send all availableEsField
s with theEsIndex
object; not sending parentFieldAttribute
s (and thus allEsField
s in a field's hierarchy) enables us to avoid sending unnecessaryEsField
s altogether.