Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Strict Type Checking on Import (or strict type usage for scripts) #30211

Closed
OlivierPetri opened this issue Apr 27, 2018 · 7 comments
Closed

Strict Type Checking on Import (or strict type usage for scripts) #30211

OlivierPetri opened this issue Apr 27, 2018 · 7 comments
Labels
:Core/Infra/Scripting Scripting abstractions, Painless, and Mustache >feature

Comments

@OlivierPetri
Copy link

Describe the feature:

Currently ES will check the format of input data against the field type, but it does not stick to the specified type when supplying field content to scripts (see example below).

This leads to confusing results and hard to track bugs in client applications.

Desired feature:
Implement different level of type checking:
STRICT : Do not accept "123" for numeric input, only 123.
BEST_EFFORT: Convert "123" to 123 (desired default behaviour).
LEGACY: Current behaviour.

Alternatively controle (and cast!) document fields when handed to scripts (see below).

Example for current behaviour.

Given the following mapping:

PUT sumtest
{
	"mappings" : {
	"doc" : {
	"properties" : {
		"NRA" : { "type" : "integer" },
		"NRB" : { "type" : "integer" }
	}
	}
	}
}

and the following input

PUT sumtest/doc/1
{
"NRA" : 12,
"NRB" : 34
}

PUT sumtest/doc/2
{
"NRA" : "12",
"NRB" : "34"
}

The script expression
doc['NRA'] + doc['NRB']
will return 46 for item 1 and "1234" for item 2. The second result ist quite surprising for unsuspecting users.

The expression
ctx._source.NRA + ctx._source.NRB
yields the same result.

@talevy
Copy link
Contributor

talevy commented Apr 27, 2018

@OlivierPetri, which version of Elasticsearch are you running? Knowing the script context
you are running against would also be helpful.

In Elasticsearch 6.2.2, I see the expected behavior. Assuming this is for search-response-time scripting, I've re-created your example with what I see. Let me know if this does not align with your use-case.

PUT sumtest
{
	"mappings" : {
  	"doc" : {
  	"properties" : {
  		"NRA" : { "type" : "integer" },
  		"NRB" : { "type" : "integer" }
  	}
  	}
	}
}

PUT sumtest/doc/1
{
"NRA" : 12,
"NRB" : 34
}

PUT sumtest/doc/2
{
"NRA" : "12",
"NRB" : "34"
}


GET sumtest/_search
{
    "query" : {
        "match_all": {}
    },
    "script_fields": {
      "sum": {
        "script": """doc["NRA"].value + doc["NRB"].value"""
      }
    }
}

returns:

...
    "hits": [
      {
        "_index": "sumtest",
        "_type": "doc",
        "_id": "2",
        "_score": 1,
        "fields": {
          "sum": [
            46
          ]
        }
      },
      {
        "_index": "sumtest",
        "_type": "doc",
        "_id": "1",
        "_score": 1,
        "fields": {
          "sum": [
            46
          ]
        }
      }
    ]
...

Two things to note about the above example that differs from the two other expressions that were shared.

  1. the doc object holds the value of the fields that are stored in the doc-values for the index. This means that these values are post-coercion and mapping evaluation. doc['NRA'] references a wrapper object around the actual value, so doc['NRA'].value is necessary to retrieve the long-value to add together.
  2. ctx._source.NRA retrieves the original value in the source that was indexed. This is pre-coercion, meaning that this will stay a string for doc 2 and a long for doc 1. You can find some more explanation about this concept at the bottom of the documentation on this page: https://www.elastic.co/guide/en/elasticsearch/reference/6.2/search-request-script-fields.html

@talevy talevy added the :Core/Infra/Scripting Scripting abstractions, Painless, and Mustache label Apr 27, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-core-infra

@talevy talevy self-assigned this Apr 27, 2018
@talevy
Copy link
Contributor

talevy commented Apr 27, 2018

Regarding your feature request for stricter validation, I forgot to mention that we do have this for numeric mapping types.

Please check out https://www.elastic.co/guide/en/elasticsearch/reference/6.2/coerce.html for examples

@OlivierPetri
Copy link
Author

OlivierPetri commented May 2, 2018

Hi there, and thanks for your kind reply.

I confirm that
doc['NRA'] + doc['NRB']
works and
ctx._source.NRA + ctx._source.NRB
does not.

However the doc notation cannot be used in update or update by query statements as stated here and here. This means, that you have to do it manualy
Integer.parseInt(ctx._source.CNTA) + Integer.parseInt(ctx._source.CNTB)
as suggested here, and I do not feel this is a viable solution in the long run.

I will rerun my tests and come back here, but from what I observed, coercion did not work by default.

I'll be back later today.

@talevy
Copy link
Contributor

talevy commented May 2, 2018

@OlivierPetri ah, I see. Since this is an update-script, it does not have access doc-values. I believe that the real issue here is that errors like this should come up at compilation-time, rather than runtime. I believe the explicit integer parsing is the only option at the moment

@OlivierPetri
Copy link
Author

Allright and hello again. So from what I found, "coerce = false" just enforces what I named "STRICT" in my proposal, so an integer field has to be inserted as a json integer, and "coerce = true" does what I dubbed "LEGACY": It accepts a quoted number as an integer, but it will not cast it to an integer (which is what I'd understand as "coercion"), but retain the string value.

So with "coerce=true" and the input I gave above, I'll end up with:

            {
                "_index": "sumtest",
                "_type": "doc",
                "_id": "1",
                "_score": 1,
                "_source": {
                    "NRA": 12,
                    "NRB": 34,
                }
            },
            {
                "_index": "sumtest",
                "_type": "doc",
                "_id": "2",
                "_score": 1,
                "_source": {
                    "NRA": "12",
                    "NRB": "34",
                }
            }

With this,
ctx._source.NRA + ctx._source.NRB
will work for item #1 but not for #2. Or rather, #1 will yield 46 and #2 will yield "1234" in an update query.

And sorry for describing this behaviour for
doc['NRA'] + doc['NRB']
witch is wrong, as doc[] seemingly does coerce on the fly. But since it is not available in update queries, it's not a general solution to use the doc notation. I'm afraid I caused some confustion mentioning doc at all, and I appologize for posting on Friday afternoon...

Hey, thank you all for the excellent work and your lightening fast responsiveness!

@talevy
Copy link
Contributor

talevy commented May 2, 2018

you're welcome @OlivierPetri, thanks for bearing with the lack of compile-time checking around this! This will be fixed over time, as was mentioned in #30210 (comment).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Core/Infra/Scripting Scripting abstractions, Painless, and Mustache >feature
Projects
None yet
Development

No branches or pull requests

4 participants