-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove the coerce option on numeric fields. #25861
Comments
We had a brief discussion about it in the search meeting and there was consensus that enabling truncation by default can be trappy. We discussed the following alternatives:
We should also try to gather feedback about use-cases for truncating, if any, and what users should migrate to instead (the convert processor?). |
Discussed in Fix it Friday and we agreed that we should keep the |
hi I'd like to look into this as my first PR. |
@nablyy You should look at |
@nablyy If you're not working on this, can I take this? |
@liketic If you want, take this. |
Hello everyone, |
@jpountz Hi, I am trying to understand the issue and I think I need some clarification. When you say make i.e Do you mean, the change should be so that, "5.0" --> Integer [coerce, in this context would actually coerce the value to 5], but "5.1" --> Integer [coerce, in this context means nothing, because "5.1" can't be coerced to integer without manipulation, and so should result in error] |
I think this is a frankly terrible idea. It is perfectly understandable what happens when you convert a float to an int. It truncated the float in every sensible language. Why conflate this wth questions about rounding? In what language does int(float) result in a rounded number? I find this feature eminantly useful because it allows me to define int(float) in my indexing config. Please leave it in. This is a documentation issue at worst. |
@akotlar Agreed casting may be useful. The thing I dislike is that it happens implicitly. |
@jpountz Sorry, maybe I misunderstood. Does it not only happen when coerce: true ? If so, isn't that explicit? I would agree if it did something like round implicitly: when coercing from float to int I expect truncation. |
@akotlar Could you give more details about the kind of fields that you index and want to truncate? |
Absolutely. I work with genetic data, indexing all mutations, and a lot of annotated information about those mutations, found in say 1000 people. One of those annotations comes from dbSNP. As it arrives from UCSC Genome Browser, there is a field, alleleNs, which is an integer count ostensibly, because it is the averaging of number of chromosomes (most people have 2 copies of each chromosome) by number of people. Unfortunately, on rare occasion someone will have 3 or 1 copies of a chromosome. So the average ends up being a float. This happens very rarely, and I mostly don’t care about being off by < 1 once in a great while. In general, genetic data can be really inconsistent, and so say 1/10000 entries will be badly formatted. These I want to cast to the majority type. The way I wrote the program that handles all of this stuff (and which uses ES), I have attempted to place as much of my api as possible into YAML configs, so that one program could accommodate many permutations of this messy data. This maps really well to ES, since the search layer can be largely configured from a single Json/YAML. I don’t want to write programming logic or invest functions per annotation feature, since there are on the order of a thousand possible features, and millions of possibly desirable combinations of these (and the feature space grows monthly more or less). That is a good use case for config files. So I realize that I could use an ingestion pipeline, but for simple transformations like this, I vastly prefer to just use an already existing mechanism (existant in the mapping configs ES supports) that can handle this in a perfectly understandable way. |
OK, I'm adding back the On my end I would still prefer that we do this change. I understand there could be disagreement. And I also think there might be work to do on the integration of index APIs with pipelines. For instance the mapping option has the benefit that you can never skip it, while it's easy to just forget to pass the name of the ingest pipeline in a call to the index APIs. |
Sounds good |
We discussed this in FixitFriday and agreed to still remove support for
|
My position hasn't changed, it is clear that this is a needless complication. Thanks for the ingestion pipeline tip. |
I am wondering if we still want to coerce values during queries. Currently we truncate decimal part on range queries on integer data types. I assume this is useful if we ran a query across different indexes where the same field name can have different field types. |
@mayya-sharipova Indeed we want to keep doing the right thing when querying integer field types with ranges on floating-point numbers. (We're not always truncating, we sometimes round up, sometimes down, so that querying integer fields with floating-point numbers always gives the same answer as if the field had been indexed as a floating-point number too.) |
Float values (like 7.5 or 8.5) get stored in the Integer-mapped confidence field due to coercion [1], but when querying ES (as in a range query), the value is truncated/rounded [2]. E.g., if an indicator's conf is stored as 7.5, but you later query for indicators at 7.5, that indicator won't be returned b/c ES truncates the field value before querying to comply with the Integer mapping type. Fix this problem by remapping as Float. [1]: https://www.elastic.co/guide/en/elasticsearch/reference/5.6/coerce.html [2]: elastic/elasticsearch#25861 (comment) Co-authored-by: wes <[email protected]>
Float values (like 7.5 or 8.5) get stored in the Integer-mapped confidence field due to coercion [1], but when querying ES (as in a range query), the value is truncated/rounded [2]. E.g., if an indicator's conf is stored as 7.5, but you later query for indicators at 7.5, that indicator won't be returned b/c ES truncates the field value before querying to comply with the Integer mapping type. [1]: https://www.elastic.co/guide/en/elasticsearch/reference/5.6/coerce.html [2]: elastic/elasticsearch#25861 (comment) Co-authored-by: wes <[email protected]>
I think this is a good initiative. However I am concerned about the breaking change when a field is mapped as integer type but I think such a change should be done with at least 2 major release cycles: 1 to flip the default, so its strict by default and 2 to remove it completely. I am not fully sure if accepting |
We've discussed this issue today in the Search meeting, and decided for now to postpone implementation of this change. One reason is that we still don't have an appealing way to index numbers with floating point as integers. Ingest pipeline can be complex for some use cases. |
Is it at least possible to have an opt-in setting that allows floating point numbers (doubles, decimals, etc), to not be truncated as integers (long)? That way there's no breaking change, while those of us who are being bit by floats / doubles being truncated have a way ff. Thanks |
We have re-discussed this with the team. We are reluctant to change the behaviour of With this, I am leaning towards closing this issue. Please feel free to comment below if you think otherwise. |
I don't like the fact that the coerce option combines a useful feature: the ability to convert strings to a number (which we are supposed to do in most APIs anyway) with a dangerous feature: the truncation (it does not even round) of floating-point numbers when indexed as a byte/short/integer/long.
I think we should remove this option, accept a string in all cases and require users to use an ingest processor if they want to ignore the decimal part of their floating-point numbers. Note that eg. "1.0" would still be accepted in an integer field since its decimal part is equal to zero.
The text was updated successfully, but these errors were encountered: