Add trendline PPL command #3071

Merged: 43 commits, Dec 12, 2024

jduo (Contributor) commented Oct 12, 2024

Description

Adds the trendline command

Related Issues

Resolves #3013
#3011

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • New functionality has javadoc added.
  • New functionality has a user manual doc added.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

jduo (Contributor Author) commented Oct 14, 2024

I have this almost hooked up. I loaded the students table which has name, gpa, and grad_year fields.
When I issue this PPL query, it seems like it is using the schema from the implied ProjectOperator instead of using the schema from the TRENDLINE command, even though I overrode TrendlineOperator#schema() to just build a schema based on the computations list:
{ "query" : "source=students | TRENDLINE SMA(1, gpa) as foo " }

I get the following JSON result of null arrays:
{ "schema": [ { "name": "grad_year", "type": "long" }, { "name": "name", "type": "string" }, { "name": "gpa", "type": "float" } ], "datarows": [ [ null, null, null ], [ null, null, null ], [ null, null, null ] ], "total": 3, "size": 3 }

However, if I change the PPL to use an alias that happens to have the same name as the original field:
{ "query" : "source=students | TRENDLINE SMA(1, gpa) as gpa " }
I get data back correctly for one of the array elements in each row.

Is it correct that ProjectOperator does not use the schema from its input?

jduo (Contributor Author) commented Oct 14, 2024

I would expect the only field out of this schema to be the one computation in trendline ("foo"), rather than all 3 fields in the real index, but perhaps I'm mistaken here.
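
One way to picture the symptom described above is a projection that looks up a fixed list of column names in each upstream row. The sketch below is a toy model under that assumption only; `ProjectionByNameSketch` and `project` are hypothetical names and are not the real `ProjectOperator` or `TrendlineOperator` code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Toy model only: NOT the real ProjectOperator. It assumes the project list
// was resolved from the index mapping rather than from the trendline output.
final class ProjectionByNameSketch {

  // Looks up each projected column by name; a name missing from the row yields null.
  static List<Object> project(List<String> projectList, Map<String, ?> row) {
    List<Object> out = new ArrayList<>();
    for (String column : projectList) {
      out.add(row.get(column)); // Map.get returns null when the key is absent
    }
    return out;
  }

  public static void main(String[] args) {
    List<String> projectList = List.of("grad_year", "name", "gpa");

    // Upstream row emits only the alias "foo": no projected name matches.
    System.out.println(project(projectList, Map.of("foo", 3.5))); // [null, null, null]

    // Upstream row emits the alias "gpa": exactly one projected name matches.
    System.out.println(project(projectList, Map.of("gpa", 3.5))); // [null, null, 3.5]
  }
}
```

If the real operators behave like this model, it would explain both the all-null rows for the alias `foo` and the single populated element for the alias `gpa`.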

YANG-DB (Member) commented Oct 17, 2024

@vamsi-amazon @penghuo can you please verify the ProjectOperator schema behavior described above?

jduo (Contributor Author) commented Oct 17, 2024

Possible design for trendline output schema:

  1. If the field in the input is not in the trendline computations, it shows up unaltered.
  2. If the field is used in trendline and the computation alias is the same as the field name, it gets replaced with the trendline computation.
  3. If the field is used in trendline and the computation alias has a different name than the field name, it shows up as a new field in the result.
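
The three rules above can be sketched as a merge of two ordered maps. This is a minimal illustration, not the merged implementation: `TrendlineSchemaSketch`, `outputSchema`, and the string type names (including `double` as the result type of a computation) are assumptions made for the example. It also simplifies rule 2 by matching the alias against any input field name, not only the field used in the computation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the proposed trendline output schema rules.
final class TrendlineSchemaSketch {

  /**
   * @param inputSchema  field name -> type, in input order
   * @param computations alias -> result type of each trendline computation
   */
  static Map<String, String> outputSchema(
      Map<String, String> inputSchema, Map<String, String> computations) {
    // Rule 1: every input field is carried over unaltered to start with.
    Map<String, String> out = new LinkedHashMap<>(inputSchema);
    // Rule 2: an alias equal to an existing field name overwrites that entry
    //         (LinkedHashMap keeps the original position on re-insertion).
    // Rule 3: any other alias is appended as a new field.
    out.putAll(computations);
    return out;
  }

  public static void main(String[] args) {
    Map<String, String> students = new LinkedHashMap<>();
    students.put("grad_year", "long");
    students.put("name", "string");
    students.put("gpa", "float");

    // Alias "foo" is new, so it is appended after the input fields.
    System.out.println(outputSchema(students, Map.of("foo", "double")));
    // Alias "gpa" matches an input field, so that field is replaced in place.
    System.out.println(outputSchema(students, Map.of("gpa", "double")));
  }
}
```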

jduo (Contributor Author) commented Oct 23, 2024

Requesting reviews from @LantaoJin @MaxKsyunz
Thanks

YANG-DB (Member) commented Oct 24, 2024

@jduo did you manage to review the Spark trendline PR?

@jduo yes, I think the proposed output schema makes sense...
@penghuo @dai-chen ?

jduo (Contributor Author) commented Oct 24, 2024

@YANG-DB, I used the PPL parser code from the Spark PR. As far as I know the schema semantics are the same, but I haven't tried the Spark implementation. The same goes for the handling of results when there aren't enough samples (returning NULL). @kt-eliatra, can you confirm?
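
The NULL handling mentioned above can be illustrated with a simple moving average that emits null until the window holds enough samples. This is a sketch under stated assumptions rather than the code in this PR: `SimpleMovingAverageSketch` and `sma` are hypothetical names, and input values are assumed to be non-null.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch: simple moving average over a sliding window.
final class SimpleMovingAverageSketch {

  // Returns one result per input value; null while fewer than
  // numberOfDataPoints samples have been seen. Assumes non-null inputs.
  static List<Double> sma(int numberOfDataPoints, List<Double> values) {
    if (numberOfDataPoints <= 0) {
      throw new IllegalArgumentException("number of data points must be greater than zero");
    }
    Deque<Double> window = new ArrayDeque<>();
    double runningSum = 0.0;
    List<Double> result = new ArrayList<>();
    for (double value : values) {
      window.addLast(value);
      runningSum += value;
      if (window.size() > numberOfDataPoints) {
        runningSum -= window.removeFirst(); // drop the oldest sample
      }
      if (window.size() == numberOfDataPoints) {
        result.add(runningSum / numberOfDataPoints);
      } else {
        result.add(null); // not enough samples yet
      }
    }
    return result;
  }

  public static void main(String[] args) {
    System.out.println(sma(2, List.of(1.0, 2.0, 4.0))); // [null, 1.5, 3.0]
  }
}
```

With a window of 1, as in the `SMA(1, gpa)` queries earlier in this thread, every row has enough samples and the result equals the input value.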

jduo added 12 commits December 9, 2024 09:31
Signed-off-by: James Duong <[email protected]>
Also evaluate the computation type in the parser

Signed-off-by: James Duong <[email protected]>
Sort by creating a LogicalSort between the input plan and LogicalTrendline

Signed-off-by: James Duong <[email protected]>
Also add examples with the sort option and without an alias

Signed-off-by: James Duong <[email protected]>
Signed-off-by: James Duong <[email protected]>
YANG-DB (Member) previously approved these changes Dec 9, 2024

Thanks, LGTM!
@jduo can you please check the failed CI tasks?

jduo (Contributor Author) commented Dec 10, 2024

Looks like a merge error in the grammar (lost a semi-colon). I'll post an update shortly.

Add back missing semi-colon

Signed-off-by: James Duong <[email protected]>
* sort-field: mandatory when sorting is used. The field used to sort.
* number-of-datapoints: mandatory. The number of datapoints used to calculate the moving average (must be greater than zero).
* field: mandatory. The name of the field the moving average should be calculated for.
* alias: optional. The name of the resulting column containing the moving average.
Collaborator

We could add something like:

By default, the column name appends "_trendline" to the field name.
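
The default naming suggested in this review comment could be expressed as a small helper. The sketch below only mirrors the suggested wording; `TrendlineAliasSketch` and `resultColumnName` are hypothetical names, and whether the merged code resolves the name this way is not confirmed by this thread.

```java
// Hypothetical sketch of the default result column name for a trendline computation.
final class TrendlineAliasSketch {

  // Uses the explicit alias when one is given; otherwise appends "_trendline"
  // to the field name, as the review comment suggests documenting.
  static String resultColumnName(String field, String alias) {
    if (alias == null || alias.isEmpty()) {
      return field + "_trendline";
    }
    return alias;
  }

  public static void main(String[] args) {
    System.out.println(resultColumnName("gpa", null));  // gpa_trendline
    System.out.println(resultColumnName("gpa", "foo")); // foo
  }
}
```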

YANG-DB previously approved these changes Dec 12, 2024

YANG-DB (Member) commented Dec 12, 2024

@jduo LGTM!
@Yury-Fridlyand @acarbonetto can you please review?

acarbonetto (Collaborator) commented

linkchecker will be fixed here: #3193 (review)

@acarbonetto acarbonetto merged commit ed0ca8d into opensearch-project:main Dec 12, 2024
13 of 15 checks passed
@acarbonetto acarbonetto deleted the ppl-trendline branch December 12, 2024 19:01
Development

Successfully merging this pull request may close these issues.

[FEATURE]PPL new trendline command
6 participants