Support for BigInteger and BigDecimal #17006
One thing to note here is that our points support is for fixed-width types. In other words, the BigIntegerPoint in Lucene is a little misleading: it does not in fact support "Immutable arbitrary-precision integers". Instead it's a signed 128-bit integer type, more like a On the other hand, if someone wanted to add support for a 128-bit floating point type, it's of course possible, but I have my doubts whether BigDecimal is even the right Java API for that (BigDecimal is a very different thing from a quad-precision floating point type). I already see some confusion (e.g. "lossless storage") referenced in the issue, so I think it's important to disambiguate a little. Maybe names like BigInteger/BigDecimal should be avoided with these, but that's part of why the thing is in sandbox; we can change that (e.g. to LongLongPoint). |
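To make the fixed-width point concrete, here is a minimal sketch in plain Java (not Lucene code; the class and method names are hypothetical) of what "signed 128-bit" means for a java.math.BigInteger: the value is only representable if it needs at most 127 magnitude bits plus a sign bit.

```java
import java.math.BigInteger;

// Hypothetical helper, for illustration only; not part of Lucene or Elasticsearch.
class Fixed128Sketch {
    // Bounds of a signed 128-bit two's-complement integer.
    static final BigInteger MIN_128 = BigInteger.ONE.shiftLeft(127).negate();
    static final BigInteger MAX_128 = BigInteger.ONE.shiftLeft(127).subtract(BigInteger.ONE);

    // bitLength() excludes the sign bit, so 127 bits is the most that fits.
    static boolean fitsInSigned128(BigInteger value) {
        return value.bitLength() <= 127;
    }

    public static void main(String[] args) {
        System.out.println(fitsInSigned128(MAX_128));                     // true
        System.out.println(fitsInSigned128(MAX_128.add(BigInteger.ONE))); // false
        System.out.println(fitsInSigned128(MIN_128));                     // true
    }
}
```

Anything outside these bounds would have to be rejected or truncated by a fixed-width encoding, which is why "arbitrary precision" is a misleading description.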
Thanks for the heads up @rmuir - I was indeed unaware of that |
I'd like to collect more information about use-cases before we start implementing this type. For instance I think the natural decision would be to use |
I agree: we did some digging the other day. One cause of confusion is many databases have a Also we have the challenge of how such numbers would behave in e.g. scripting and other places. Personally, I've only used BigInteger for cryptography-like things. You can see from its API that it's really geared toward that. So maybe it's not something we should expose? |
Sorry for my newb questions, but why wouldn't this work? Aren't stats aggregations done with floats possibly inaccurate due to floating-point arithmetic? |
They can be inaccurate indeed. The point I was making above is that Lucene provides two ways to encode doc values. On the one hand, we have So knowing about the use-cases will help figure out which format to use. But then if we want to leverage all 128 bits of the values, we will have to duplicate implementations for everything that needs to add or multiply values, such as stats/sum/avg aggregations. This would be a significant maintenance burden, so we would certainly not want to go that route without first making sure that there are valid/common use-cases for it. |
This feature would be useful for the Digital Forensics and Incident Response (DFIR) community. There are lots of data structures we look at that have uint64 types. When we index these, if the field is considered a long and the value is out of range, information can be lost. |
I see a 64-bit unsigned integer type (versus the 64-bit signed type we have) as a separate feature, actually. This can be implemented more efficiently with Lucene (and made easier with Java 8). Yeah, figuring out how to make a 64-bit unsigned type work efficiently in, say, the scripting API might be a challenge as it stands today. Perhaps it truly must be a Number backed by BigInteger to work the best today, which would be slower. But in general, typical things such as ranges and aggregations would be as fast as the 64-bit signed type we have today, and perhaps a newer scripting API (with more type information) could make scripting faster too down the road, so it is much more compelling than larger integers (e.g. 128-bit), which will always be slower. As for use cases where BigInteger is truly needed, that situation is less clear to me. I would like for us to consider the two cases (64-bit unsigned vs larger integers) as separate. |
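As an illustration of the "made easier with Java 8" remark, the sketch below uses the unsigned helpers that java.lang.Long gained in Java 8 to treat a plain long as a uint64. The toSortableLong helper is a hypothetical name showing one common trick (flipping the sign bit so that signed ordering matches unsigned ordering); it is an assumption for illustration, not a description of how Lucene or Elasticsearch encode values.

```java
// Illustration only; class and helper names are hypothetical.
class UnsignedLongSketch {
    // Flipping the sign bit maps unsigned order onto signed order,
    // so an ordinary signed comparison sorts the values as unsigned.
    static long toSortableLong(long unsignedBits) {
        return unsignedBits ^ Long.MIN_VALUE;
    }

    public static void main(String[] args) {
        // 2^64 - 1 does not fit in a signed long, but its bit pattern does.
        long max = Long.parseUnsignedLong("18446744073709551615");
        System.out.println(max);                               // -1 when read as signed
        System.out.println(Long.toUnsignedString(max));        // 18446744073709551615
        System.out.println(Long.compareUnsigned(max, 1L) > 0); // true
        System.out.println(toSortableLong(max) > toSortableLong(1L)); // true
    }
}
```

Range queries and sorting only need this kind of order-preserving mapping, which is why they can stay as fast as the signed case; arithmetic in scripts is where a wider type would be needed.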
@rmuir it's surprising to me that you have to ask for cases where |
This is not correct; the spec says:
You are correct that numerics in the JSON spec are arbitrary precision, but nothing in the spec suggests that implementations must support this and, in fact, implementations do not have to support this. The spec further says:
|
@jasontedor I was referring to ECMA-404 but regardless, my point is that the elastic documentation specifically says that You also cut your quoting of the spec short, as the entire paragraph is:
This is exactly what I'm referring to, as "the software that created it" (i.e. a client) has no reason to suspect, based on the documentation, that either of these values would lose precision. |
We are asking for use-cases because depending on the expectations, the feature could be implemented in very different ways. For instance a MySQL If the use-case requires more than 64 bits (eg. 128), then things are more complicated. We could probably support efficient sorting, but aggregations would be tricky. If arbitrary precision is needed, then there is not much we can do efficiently, at least at the moment. |
The JSON spec only spells out the representation in JSON which is used for interchange, it is completely agnostic to how such information is represented by software consuming such JSON.
The documentation spells out the numeric datatypes that are supported. |
Here is a good example. Windows uses the USN Journal to record changes made to the file system. These records are extremely important "logs" for people in the DFIR community. Version 2 records use a 64-bit unsigned integer to store reference numbers. Version 3 records use a 128-bit ordinal number for reference numbers.
I would say that this is important for the DFIR community.
I would say this is equally important. There are many other logs that record these references, so by maintaining their native types we can correlate logs to determine certain types of activity. |
Should we go ahead and create a new issue for 64-bit unsigned type as a feature? |
I'm also in the digital forensics world and see merit in providing a 64-bit unsigned type. If it were 128-bit with a speed impact, it wouldn't affect the way in which I process data. My use is less real-time and more one-time bulk processing. The biggest factor to me would be what makes the most sense from the developer side with respect to Java and OS integration. |
Spring Data JPA supports BigInteger and BigDecimal, so any code that also tries to use Elasticsearch with these types will fail:
I think a hack (that may end up being almost as efficient) is to convert my BigInteger to a string for use with Elasticsearch:
So these data types should be added in my opinion. |
We also need something like this. We are unable to store C#'s Decimal.MaxValue currently. |
On use cases, I see DFIR and USN mentioned. Would either of these use cases use aggregations, or just search and sorting? If you see aggregations necessary, can you state which ones and what the use case for that is? Apologies if I am oversimplifying, but it seems like:
If search and sort are enough, and no aggregations are needed, I wonder if there is even a need for a 128-bit numeric type. Could strings be enough for these use cases, even if they may have speed differences from a (theoretical) 128-bit type? |
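A minimal sketch of the string approach, under stated assumptions: values are non-negative and bounded by an unsigned 128-bit maximum (39 decimal digits), and the class and method names are hypothetical. Plain decimal strings do not sort numerically ("9" sorts after "10"), so the sketch left-pads with zeros to a fixed width, which makes lexicographic order agree with numeric order.

```java
import java.math.BigInteger;

// Illustration only; not an Elasticsearch API.
class SortableStringSketch {
    // 2^128 - 1 has 39 decimal digits.
    static final int WIDTH = 39;

    static String toSortableString(BigInteger value) {
        if (value.signum() < 0) {
            throw new IllegalArgumentException("negative values are not handled in this sketch");
        }
        String digits = value.toString();
        if (digits.length() > WIDTH) {
            throw new IllegalArgumentException("value has more than " + WIDTH + " digits");
        }
        StringBuilder padded = new StringBuilder(WIDTH);
        for (int i = digits.length(); i < WIDTH; i++) {
            padded.append('0'); // left-pad so every encoded value has the same length
        }
        return padded.append(digits).toString();
    }

    public static void main(String[] args) {
        String nine = toSortableString(BigInteger.valueOf(9));
        String ten = toSortableString(BigInteger.TEN);
        System.out.println("9".compareTo("10") > 0); // true: unpadded strings sort wrongly
        System.out.println(nine.compareTo(ten) < 0); // true: padded strings sort numerically
    }
}
```

Indexed this way, exact-match and sort behave as expected on a string field, while numeric aggregations remain unavailable.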
Some use-cases described on this issue do not need biginteger/bigdecimal:
In general it looks like there is more interest in big integers than decimals. In particular, some use-cases look like they could benefit from unsigned 128-bit integers because they need ordering, which There seems to be less traction for big decimals. @jeffknupp Can you clarify what operations you would like to run on these big decimal fields (exact queries? range queries? sorting? aggregations?). |
cc @elastic/es-search-aggs |
For Ethereum contracts, integers default to 256 bit so this is an issue. Lucene doesn't support that large of an integer, so it seems out of the question, but 128 bits would cover a far larger set of values for aggregation, analysis, querying, etc. |
@tyre What kind of aggregations and querying would you perform on such a field? |
@jpountz off the top of my head: sum, average, moving average, percentile, percentile rank, filter |
I don't think we will ever support these aggregations on large integers. Numeric aggregations use doubles internally, so there are two options. Either we support big integers and still use doubles internally, in which case having big integers is pointless, as they could just be indexed as doubles instead. Or we try to make aggregations support wider data types, but that would make them slower, which is also something we want to avoid. So I don't see this happening. The only aggregation that we could support on big integers would be the range aggregation. |
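The precision cost of doing the arithmetic in doubles can be seen in a few lines of plain Java. This is a sketch with hypothetical names, not Elasticsearch code: a double has a 53-bit significand, so integers above 2^53 are no longer all representable.

```java
import java.math.BigDecimal;

// Illustration only; not how any aggregation is implemented.
class DoublePrecisionSketch {
    // new BigDecimal(double) is exact, so this compares the true double value
    // with the original long without a saturating cast getting in the way.
    static boolean isExactAsDouble(long value) {
        return new BigDecimal((double) value).compareTo(BigDecimal.valueOf(value)) == 0;
    }

    public static void main(String[] args) {
        long limit = 1L << 53;
        System.out.println(isExactAsDouble(limit));          // true
        System.out.println(isExactAsDouble(limit + 1));      // false
        System.out.println(isExactAsDouble(Long.MAX_VALUE)); // false
        double sum = (double) limit;
        System.out.println(sum + 1.0 == sum);                // true: the increment is lost
    }
}
```

A sum or average over values of this size therefore carries rounding error no matter how the field itself is stored.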
Some data: Beats are interested in supporting uint64, which they typically need for OS-level counters, and they would be fine with the accuracy loss due to the fact that these numbers would be converted to doubles for aggregations. |
Is there any update on this? Will BigInteger and BigDecimal still not be officially supported? |
@insukcho No, no support for BigInteger and BigDecimal. Note that the naming may be a bit confusing due to the fact that what some datastores call bigint maps to our longs. For instance both MySQL's |
We discussed this issue in FixitFriday and agreed to implement 64-bit unsigned integers. I opened #32434. Thanks all for the feedback. |
@jpountz Thanks for taking the time to keep this on the radar. Initially, my interest in this issue was not to have a custom/new datatype per se, but to have support for BigDecimal/BigInteger (the Java objects) in the Elasticsearch API (TransportClient using BulkProcessor, to be specific). I had to implement a generic number normalization to bring everything to its pure, non-scientific-notation representation to be able to send data properly to Elasticsearch, because when I tried to simply proxy my ETL input to the Elasticsearch client, I'd get an error because BigDecimal/BigInteger don't have a mapped type in the translating API. To be honest, I first hit that issue on a 2.4.x cluster/API, I'm about to finish migrating to 6.3.x, and I have not tried removing the numeric normalization to see if the limitation still exists (please feel free to point me to any obscure point in the changelogs or a commit that would make me happy). Although I'm sure 64-bit uint will solve most issues for people who wanted a new datatype for really long numbers, this issue of mine doesn't get addressed by proxy with that implementation. Are there plans to support in any way the translation of BigDecimal/BigInteger values from the Java client's perspective (even if it means an error/warning when the value would incur precision loss)? |
I would expect this issue to be specific to the transport client, which we want to replace with a new REST client. We call it the high-level REST client, as opposed to the low-level REST client, which doesn't try to understand requests and responses and only works with maps of maps. With a REST client, big integers wouldn't be transported any differently from shorts, ints and longs, so I would expect things to work as long as the values that your big integers store are in the acceptable range of the mapping type, e.g. -2^63 to 2^63-1 for |
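On the client side, checking whether a BigInteger falls in the -2^63 to 2^63-1 range before sending it can be done with the standard library alone. The sketch below uses hypothetical class and method names and is not part of any Elasticsearch client.

```java
import java.math.BigInteger;

// Illustration only; a pre-flight range check an application could run itself.
class LongRangeSketch {
    // A signed 64-bit value has at most 63 magnitude bits.
    static boolean fitsInLong(BigInteger value) {
        return value.bitLength() <= 63;
    }

    // longValueExact() (Java 8+) throws ArithmeticException instead of truncating.
    static long toLongChecked(BigInteger value) {
        return value.longValueExact();
    }

    public static void main(String[] args) {
        BigInteger tooBig = BigInteger.ONE.shiftLeft(63); // 2^63, one past Long.MAX_VALUE
        System.out.println(fitsInLong(BigInteger.valueOf(Long.MAX_VALUE))); // true
        System.out.println(fitsInLong(tooBig));                             // false
    }
}
```

Failing early like this gives the error/warning behaviour asked about above without relying on the server to reject the document.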
Fair enough. I forgot to take into account the client progression when re-evaluating the issue. Thanks @jpountz ! |
I'm risking being out of scope by stretching this, but I think we still have an issue on 6.3.0 with I'm using the LLRC (to talk to the cluster) in conjunction with the server artifact
@jpountz Am I out of scope? Should I not expect the builders on the |
@fredgalvao This is a different bug. I would expect us to fix it when addressing #32395. |
Just pinging this conversation to point to a specific issue which requests that Elasticsearch add support for Ethereum blockchain data types, namely uint256 (maximum value 2^256 - 1). |
Lucene now has sandbox support for BigInteger (LUCENE-7043), and hopefully BigDecimal will follow soon. We should look at what needs to be done to support them in Elasticsearch.
I propose adding big_integer and big_decimal types which have to be specified explicitly - they shouldn't be a type which can be detected by dynamic mapping.
Many languages don't support big int/decimal. JavaScript will convert to floats or throw an exception if a number is out of range. This can be worked around by always rendering these numbers in JSON as strings. We can possibly accept known bigints/bigdecimals as numbers but there are a few places where this could be a problem:
The above could be worked around by telling Jackson to parse floats and ints as BIG* (USE_BIG_DECIMAL_FOR_FLOATS and USE_BIG_INTEGER_FOR_INTS) but this may well generate a lot of garbage for what is an infrequent use case.
Alternatively, we could just say that Big* should always be passed in as strings if they are to maintain their precision.
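A small sketch of why passing the value as a string preserves precision, using only java.math.BigDecimal (no Jackson involved; the class and method names are hypothetical). Parsing the same literal through a double first rounds it to the nearest representable binary value, while parsing the string directly keeps every digit.

```java
import java.math.BigDecimal;

// Illustration only; it mimics the two parsing routes without a JSON library.
class StringPrecisionSketch {
    // Route 1: the literal is read as a double, then widened to BigDecimal.
    static BigDecimal viaDouble(String literal) {
        return new BigDecimal(Double.parseDouble(literal));
    }

    // Route 2: the literal is kept as text and parsed directly.
    static BigDecimal viaString(String literal) {
        return new BigDecimal(literal);
    }

    public static void main(String[] args) {
        String literal = "12345678901234567890.123456789";
        System.out.println(viaString(literal).toPlainString()); // 12345678901234567890.123456789
        System.out.println(viaDouble(literal).compareTo(viaString(literal)) == 0); // false
    }
}
```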