Add trendline PPL command #3071

jduo · 2024-10-12T00:05:03Z

Description

Adds the trendline command

Related Issues

Check List

New functionality includes testing.
New functionality has been documented.
New functionality has javadoc added.
New functionality has a user manual doc added.
API changes companion pull request created.
Commits are signed per the DCO using --signoff.
Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

jduo · 2024-10-14T20:44:32Z

I have this almost hooked up. I loaded the students table which has name, gpa, and grad_year fields.
When I issue this PPL query, it seems like it is using the schema from the implied ProjectOperator instead of using the schema from the TRENDLINE command, even though I overrode TrendlineOperator#schema() to just build a schema based on the computations list:
{ "query" : "source=students | TRENDLINE SMA(1, gpa) as foo " }

I get the following JSON result of null arrays:
{ "schema": [ { "name": "grad_year", "type": "long" }, { "name": "name", "type": "string" }, { "name": "gpa", "type": "float" } ], "datarows": [ [ null, null, null ], [ null, null, null ], [ null, null, null ] ], "total": 3, "size": 3 }

However, if I change the PPL to use an alias that happens to have the same name as the original field:
{ "query" : "source=students | TRENDLINE SMA(1, gpa) as gpa " }
I get data back correctly for one of the array elements in each row.

Is it correct that ProjectOperator does not use the schema from its input?

jduo · 2024-10-14T21:05:26Z

I would expect the only field out of this schema to be the one computation in trendline ("foo"), rather than all 3 fields in the real index, but perhaps I'm mistaken here.

YANG-DB · 2024-10-17T19:31:05Z

@vamsi-amazon @penghuo can you please verify ?

jduo · 2024-10-17T23:19:24Z

Possible design for trendline output schema:

If the field in the input is not in the trendline computations, it shows up unaltered.
If the field is used in trendline and the computation alias is the same as the field name, it gets replaced with the trendline computation.
If the field is used in trendline and the computation alias has a different name than the field name, it shows up as a new field in the result.
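
The three rules above can be sketched roughly as follows. This is only an illustration under stated assumptions: `TrendlineSchemaSketch`, `Computation`, and `outputColumns` are hypothetical names, not classes from the plugin, and the sketch simplifies rule 2 by replacing any existing column whose name equals the alias.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of the proposed schema rules; not the plugin's actual code. */
class TrendlineSchemaSketch {

  /** One trendline computation: the source field and the output alias (illustrative type). */
  static final class Computation {
    final String field;
    final String alias;

    Computation(String field, String alias) {
      this.field = field;
      this.alias = alias;
    }
  }

  /**
   * Returns output column names in order, mapped to "input" for pass-through
   * columns and "trendline" for columns produced by a computation.
   */
  static Map<String, String> outputColumns(
      List<String> inputFields, List<Computation> computations) {
    Map<String, String> out = new LinkedHashMap<>();
    // Rule 1: every input field passes through unaltered by default.
    for (String f : inputFields) {
      out.put(f, "input");
    }
    for (Computation c : computations) {
      // Rule 2: an alias equal to an existing column replaces it in place
      // (LinkedHashMap keeps the original position when a key is re-inserted).
      // Rule 3: a new alias is appended as an additional column.
      out.put(c.alias, "trendline");
    }
    return out;
  }

  public static void main(String[] args) {
    List<String> fields = List.of("grad_year", "name", "gpa");
    // Alias differs from the field name: "foo" is appended.
    System.out.println(outputColumns(fields, List.of(new Computation("gpa", "foo"))));
    // Alias equals the field name: "gpa" is replaced in place.
    System.out.println(outputColumns(fields, List.of(new Computation("gpa", "gpa"))));
  }
}
```

Under this reading, `TRENDLINE SMA(1, gpa) as foo` would yield four columns and `... as gpa` would yield three.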

jduo · 2024-10-23T20:26:01Z

Requesting reviews from @LantaoJin @MaxKsyunz
Thanks

YANG-DB · 2024-10-24T16:43:45Z

@jduo did you manage to review the Spark trendline PR?

@jduo yes, I think the proposed output schema design makes sense...
@penghuo @dai-chen ??

jduo · 2024-10-24T16:58:49Z

@YANG-DB, I used the PPL parser code from the Spark PR. The schema semantics seem to be the same AFAIK, but I haven't tried the Spark one out. The same goes for the handling of results when there aren't enough samples (returning NULL). @kt-eliatra?
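
For readers unfamiliar with the "not enough samples" behavior mentioned here, a minimal sketch of a simple moving average that returns null until the window is full might look like the following. `SimpleMovingAverageSketch` and `sma` are hypothetical names, the sketch assumes non-null numeric inputs, and it is not the operator code from this PR.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical sketch of SMA with null output for incomplete windows. */
class SimpleMovingAverageSketch {

  /**
   * Computes a simple moving average over the last n values.
   * Rows seen before the window holds n samples produce null.
   */
  static List<Double> sma(int n, List<Double> values) {
    if (n <= 0) {
      throw new IllegalArgumentException("number of data points must be greater than zero");
    }
    Deque<Double> window = new ArrayDeque<>(n);
    List<Double> result = new ArrayList<>(values.size());
    double sum = 0.0;
    for (double v : values) { // assumes no null elements in the input
      window.addLast(v);
      sum += v;
      if (window.size() > n) {
        // Drop the oldest sample so the window never exceeds n entries.
        sum -= window.removeFirst();
      }
      // Not enough samples yet: emit null instead of a partial average.
      result.add(window.size() < n ? null : sum / n);
    }
    return result;
  }

  public static void main(String[] args) {
    System.out.println(sma(2, List.of(3.0, 3.5, 4.0))); // prints [null, 3.25, 3.75]
  }
}
```

With a window of 1, as in the `SMA(1, gpa)` queries earlier in this thread, every row already has enough samples and the average equals the input value.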

YANG-DB

Thanks LGTM !!
@jduo can u plz check the failed CI tasks ?

jduo · 2024-12-10T19:06:50Z

Looks like a merge error in the grammar (a lost semicolon). I'll post an update shortly.

core/src/main/java/org/opensearch/sql/ast/tree/Trendline.java

acarbonetto · 2024-12-12T17:26:58Z

docs/user/ppl/cmd/trendline.rst

+* sort-field: mandatory when sorting is used. The field used to sort.
+* number-of-datapoints: mandatory. number of datapoints to calculate the moving average (must be greater than zero).
+* field: mandatory. the name of the field the moving average should be calculated for.
+* alias: optional. the name of the resulting column containing the moving average.


We could add something like:

By default, the column name appends "_trendline" to the field name.
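
As a rough illustration of the documented parameters, the default alias and the data point check could be expressed as below. `TrendlineAliasSketch`, `resolveAlias`, and `requirePositive` are hypothetical helpers written for this note, assuming the "_trendline" suffix suggested above; they are not taken from the PR.

```java
/** Hypothetical helpers mirroring the documented trendline parameters. */
class TrendlineAliasSketch {

  /** Uses the explicit alias when given, otherwise field name plus "_trendline". */
  static String resolveAlias(String field, String alias) {
    return (alias == null || alias.isEmpty()) ? field + "_trendline" : alias;
  }

  /** Enforces the documented constraint that number-of-datapoints is greater than zero. */
  static int requirePositive(int numberOfDataPoints) {
    if (numberOfDataPoints <= 0) {
      throw new IllegalArgumentException(
          "number of data points must be greater than zero: " + numberOfDataPoints);
    }
    return numberOfDataPoints;
  }

  public static void main(String[] args) {
    System.out.println(resolveAlias("gpa", null));  // prints gpa_trendline
    System.out.println(resolveAlias("gpa", "foo")); // prints foo
    System.out.println(requirePositive(2));         // prints 2
  }
}
```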

YANG-DB · 2024-12-12T18:19:50Z

@jduo LGTM !
@Yury-Fridlyand @acarbonetto can u plz review ?

acarbonetto · 2024-12-12T18:28:33Z

linkchecker will be fixed here: #3193 (review)

jduo force-pushed the ppl-trendline branch from 48fa561 to 78b8127 Compare October 25, 2024 13:43

jduo mentioned this pull request Oct 27, 2024

[DOC] Add trendline PPL documentation opensearch-project/documentation-website#8621 (Open, 4 tasks)

jduo marked this pull request as ready for review October 27, 2024 16:08

jduo requested review from ps48, kavithacm, derek-ho, joshuali925, dai-chen, YANG-DB, rupal-bq, mengweieric, vamsimanohar, Swiddis, penghuo, seankao-az, MaxKsyunz, Yury-Fridlyand, anirudha, forestmvey, acarbonetto, GumpacG, ykmr1224 and LantaoJin as code owners October 27, 2024 16:08

jduo added 12 commits December 9, 2024 09:31

Fix typo drawing example table

48a93e4

Signed-off-by: James Duong <[email protected]>

Add explain integration test

4e2e2c0

Signed-off-by: James Duong <[email protected]>

Fix trendline explain IT test

54a8569

Signed-off-by: James Duong <[email protected]>

Make the alias optional

aca31bd

Also evaluate the computation type in the parser Signed-off-by: James Duong <[email protected]>

Add validation on number of data points

50d590a

Signed-off-by: James Duong <[email protected]>

Add sort functionality to trendline

07c7efb

Sort by creating a LogicalSort between the input plan and LogicalTrendline Signed-off-by: James Duong <[email protected]>

Make docs more consistent with Spark

b60093e

Also add examples with the sort option and without an alias Signed-off-by: James Duong <[email protected]>

Fix docs typo in example

25ccada

Signed-off-by: James Duong <[email protected]>

Add missed update to AstBuilderTest for sort option

c7b7cb1

Signed-off-by: James Duong <[email protected]>

Add test for checking an invalid number of samples

23917a5

Signed-off-by: James Duong <[email protected]>

Add Trendline to KeywordsCanBeId

b01dded

Signed-off-by: James Duong <[email protected]>

Fix wrong column name in docs

8da8255

Signed-off-by: James Duong <[email protected]>

jduo dismissed YANG-DB’s stale review via 8da8255 December 9, 2024 17:33

jduo force-pushed the ppl-trendline branch from d497dba to 8da8255 Compare December 9, 2024 17:33

YANG-DB previously approved these changes Dec 9, 2024

Fix rebase error in parse

9f1684f

Add back missing semi-colon Signed-off-by: James Duong <[email protected]>

jduo dismissed YANG-DB’s stale review via 9f1684f December 10, 2024 19:10

acarbonetto reviewed Dec 12, 2024

core/src/main/java/org/opensearch/sql/ast/tree/Trendline.java (Outdated)

acarbonetto reviewed Dec 12, 2024

Merge branch 'main' into ppl-trendline

09bd6aa

YANG-DB previously approved these changes Dec 12, 2024

PPL-Trendline: remove unused grammar; clean doc

0ff1653

Signed-off-by: Andrew Carbonetto <[email protected]>

acarbonetto dismissed YANG-DB’s stale review via 0ff1653 December 12, 2024 18:19

YANG-DB approved these changes Dec 12, 2024

acarbonetto approved these changes Dec 12, 2024

acarbonetto merged commit ed0ca8d into opensearch-project:main Dec 12, 2024
13 of 15 checks passed

acarbonetto deleted the ppl-trendline branch December 12, 2024 19:01
