-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support for IPv6 #17007
Comments
That is not true. This is only something that happens internally in a BKD tree. Users do not see it. You do not "get things back" from that API. Also note: if you follow the RFC correctly, then calling https://docs.oracle.com/javase/7/docs/api/java/net/InetAddress.html#getByAddress%28byte[]%29 with a 128-bit mapped IPv4 address returns back a 32-bit Inet4Address: it does this round tripping for you. That is why all Query.toString()'s and so on work intuitively with the lucene support. |
While we don't have this issue with the bkd tree, I think it's something we will have to think about for doc values, eg. for computing the top ip addresses that hit a web server. |
its really not at all. just encode the same way, and use InetAddress to decode for any presentation form. |
Its this part that is super-important: calling https://docs.oracle.com/javase/7/docs/api/java/net/InetAddress.html#getByAddress%28byte[]%29 with a 128-bit mapped IPv4 address returns back a 32-bit Inet4Address: it does this round tripping for you so if you want to display to the user in any way, make an InetAddress from the bytes. it does the right thing. Please don't make ipv4 vs ipv6 types or any other craziness when this all works exactly as the user expects using basic java apis! |
I agree there should not be two types and I'm not concerned about the Java API. The question was more about what we do on the rest layer eg. is it ok for aggs to render ip addresses differently from the way they were formatted at index time. I think it is, but it did not happen with the v4-only field so I think it's also important that we raise the question and document the behaviour. |
I don't think you guys get how the transition mechanism works inside the java InetAddress class. I encourage you to play with it. |
Relax, Rob. The only reason I suggested it was because I was unaware of the information that you have provided. Thanks for the link. Isn't it true though that if a user indexes an IPv4 address in IPv6 format, doc values will return the address in IPv4 format? (ie not what they indexed?). |
If someone supplies an ipv4 address (mapped or not), its an ipv4 address. thats what that reserved space in the ipv4 region is for. If someone supplies an ipv6 address, its an ipv6 address. note that ipv6 addresses can be represented in many ways. i recommend emitting canonical representations. So if someone supplies Its always "what they indexed". its the same address. Its like returning Please, lets not try to overcomplicate this. Its not complicated. |
I wasn't trying to overcomplicate it, I was trying to understand it. Thank you for explaining. |
as far as docvalues, i would just use SortedSetDocValues class for internet addresses. This is critical for ipv6 data because (like the BKD tree) it will provide prefix compression. For network addresses, "range" operations are a common operation (e.g. cidr/prefix operations), so having ordinal indirection is not useless, and can speed things up. |
And for that to work, docvalues needs a little tweak before the compression will happen: we currently always 'bail' on prefix-compression for fixed-length data: https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/codecs/lucene54/Lucene54DocValuesConsumer.java#L423-L425 that is not always the best tradeoff... |
We may even want to turn that logic off completely. basically that logic was originally added to "save space" (= save on addresses) before we had prefix compression at all... Unfortunately, we really need to solve this to avoid "my ip address field quadrupled in size!" |
This makes all numeric fields including `date`, `ip` and `token_count` use points instead of the inverted index as a lookup structure. This is expected to perform worse for exact queries, but faster for range queries. It also requires less storage. Notes about how the change works: - Numeric mappers have been split into a legacy version that is essentially the current mapper, and a new version that uses points, eg. LegacyDateFieldMapper and DateFieldMapper. - Since new and old fields have the same names, the decision about which one to use is made based on the index creation version. - If you try to force using a legacy field on a new index or a field that uses points on an old index, you will get an exception. - IP addresses now support IPv6 via Lucene's InetAddressPoint and store them in SORTED_SET doc values using the same encoding (fixed length of 16 bytes and sortable). - The internal MappedFieldType that is stored by the new mappers does not have any of the points-related properties set. Instead, it keeps setting the index options when parsing the `index` property of mappings and does `if (fieldType.indexOptions() != IndexOptions.NONE) { // add point field }` when parsing documents. Known issues that won't fix: - You can't use numeric fields in significant terms aggregations anymore since this requires document frequencies, which points do not record. - Term queries on numeric fields will now return constant scores instead of giving better scores to the rare values. Known issues that we could work around (in follow-up PRs, this one is too large already): - Range queries on `ip` addresses only work if both the lower and upper bounds are inclusive (exclusive bounds are not exposed in Lucene). We could either decide to implement it, or drop range support entirely and tell users to query subnets using the CIDR notation instead. - Since IP addresses now use a different representation for doc values, aggregations will fail when running a terms aggregation on an ip field on a list of indices that contains both pre-5.0 and 5.0 indices. - The ip range aggregation does not work on the new ip field. We need to either implement range aggs for SORTED_SET doc values or drop support for ip ranges and tell users to use filters instead. elastic#17700 Closes elastic#16751 Closes elastic#17007 Closes elastic#11513
Lucene now has sandbox support for IPv6 (LUCENE-7043).
We should look at what needs to be done to support IPv6 in Elasticsearch. We should add a new
ipv6
field which handles both IPv4 and IPv6.The only problem I can see is that users might expect to get an IPv4 address back (eg
192.168.1.5
) and instead it'd be returned as0:0:0:0:0:ffff:c0a8:105
. Perhaps we should have anipv4
field as well that handles only IPv4 addresses, or add some option to render as IPv4 when possible?The text was updated successfully, but these errors were encountered: