[RFC] GeoIP database auto update to provide the latest IP to geo location mapping in GeoIP processor #5856
Comments
If there are separate ingest nodes and data nodes, what will the search look like, given that the data is not present on the ingest node? Which node will receive the search request? Will there be a separate RFC with details on the index mapping and the query used to get IP-to-location details? |
It will use an internal client to query the index, so routing is decided the same way as for any other index query.
I will share implementation details later in a separate document. |
Hi @heemin32 , if this feature requires documentation, can you please create a documentation issue for it? Thanks! |
Thanks @hdhalter. Created an issue. |
Thanks! Will you be adding it to the unified backlog project? |
I like the idea to use an OpenSearch index a lot more than introducing an additional store or format. Will comment on #5860. |
If we are using an OpenSearch index, maybe it makes sense to store the data as an index snapshot. |
An index snapshot carries a risk of incompatibility between versions. Generating an index snapshot also makes it hard for users to set up their own server for an isolated region. |
After the implementation we ran a performance test and found a severe performance degradation compared to the existing GeoIP processor. The following is a summary of the benchmark result.
The low performance has three causes.
1. We make a search call inside the ingest process for each document. This consumes a search thread, and requests start to be rejected once search threads are not released quickly enough to handle all the incoming ingestion requests. The problem can get worse when there is search traffic: ingest and search requests compete for the limited search threads, and ingestion throughput decreases further.
2. Because we make a search call for every document, there is overhead from request/response serialization and deserialization.
3. IP range search in the Lucene library does not perform as well as the MaxMind library. See the library-level benchmark result below.
Therefore, we will look for another way to overcome this performance degradation. Most likely we will use the MaxMind data format and its library for now, and leave room to support other formats in the future. |
What is the IP distribution in your data set? Would an in-memory size-bound cache suffice to overcome this? |
What is the size of the entire IP-to-geo mapping file? |
Depends on what format we use. For MaxMind binary format, City data is 70MB, Country data is 6MB, ASN data is 8MB as of today. |
For running the mapping, can we assume the maximum memory requirement is ~100MB in the worst case? If so, can we evaluate keeping the data in memory for performance? |
With cache size of 10,000 items
With cache size of 1,000 items
|
I changed the code to make the search call synchronously, so the ingestion thread waits until the search completes. After the change, the error rate (too many requests) dropped to zero. Also, with caching, the throughput is close to that of the legacy GeoIP processor.
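For readers who want a concrete picture of the size-bound cache discussed above, here is a minimal sketch in Python. It is not the code used in the plugin: the class name `BoundedLookupCache`, the `lookup` callable (standing in for the per-document search call), and the LRU eviction policy are all assumptions made for illustration.

```python
from collections import OrderedDict
from typing import Callable, Dict, Optional

GeoData = Optional[Dict[str, str]]


class BoundedLookupCache:
    """Size-bound LRU cache placed in front of an expensive IP-to-geo lookup.

    Hypothetical helper: `lookup` stands in for the search call made per document.
    """

    def __init__(self, lookup: Callable[[str], GeoData], max_items: int) -> None:
        if max_items <= 0:
            raise ValueError("max_items must be positive")
        self._lookup = lookup
        self._max_items = max_items
        self._entries: "OrderedDict[str, GeoData]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, ip: str) -> GeoData:
        if ip in self._entries:
            # Mark the entry as most recently used and skip the expensive call.
            self._entries.move_to_end(ip)
            self.hits += 1
            return self._entries[ip]
        self.misses += 1
        value = self._lookup(ip)
        self._entries[ip] = value
        if len(self._entries) > self._max_items:
            # Evict the least recently used entry to keep memory bounded.
            self._entries.popitem(last=False)
        return value
```

How much such a cache helps depends on the IP distribution of the data set, as asked earlier in the thread: repeated addresses are served from memory, while unique addresses still pay for the underlying lookup.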
|
@heemin32 Should this issue be labelled for 2.11? |
It will be released in 2.10. Closing it. |
The purpose of this RFC (request for comments) is to gather community feedback on a proposal to automatically update the GeoIP database used by the GeoIP processor.
Problem Statement
Users need to add location information such as city name, country name, or coordinates for a given IP address during data ingestion in an OpenSearch cluster. Because IP addresses are assigned to organizations, the mapping between an IP address and its location keeps changing by nature. Therefore, to get better accuracy on the location of a given IP address, the mapping data needs to be updated periodically. However, OpenSearch uses static mapping data that does not get updated.
Current State
OpenSearch has a GeoIP processor with which a user can add location data such as city name, country name, latitude/longitude, and more, based on an IP address in a document. As its mapping data from IP address to location information, OpenSearch uses the GeoLite2 databases provided by MaxMind on 2019/11/19.
OpenSearch gets the GeoLite2 Country, GeoLite2 City, and GeoLite2 ASN database files from a Maven repository and includes them in the build artifact. When a node starts, it prepares the list of available databases by reading the GeoLite2 databases from the local disk. When the GeoIP processor is called for the first time after the node starts, it loads the appropriate database into memory and uses it. Users can put their own database in the $OS_CONFIG/ingest-geoip folder, either to override an existing database or to add a new one. However, users have to restart every node to reload the updated database files from disk.
MaxMind updates the GeoLite2 databases twice weekly, but OpenSearch users cannot benefit from the updates because there is no easy way to update the database automatically without restarting the nodes in a cluster.
Proposal
We want OpenSearch to have a feature in which the mapping data from IP address to location information is updated regularly without manual intervention, so that a user gets better accuracy on the location of an IP address during data ingestion with minimum effort.
Approach
Data flow diagram
API design
#5860
Data format
Option1. MaxMind Format
One option is to use the MaxMind data format and the MaxMind SDK to read the database file, as we do today. A cluster manager node will download the GeoIP database from an external endpoint and store it in an index. It will notify all ingest nodes to download the new database file from the index. Once every ingest node is ready to use the new database file, the manager node will update a flag in an index. Each ingest node checks the flag in the index to decide whether to start using the new database.
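The switchover described above can be sketched as a small in-memory simulation in Python. This is only an illustration of the two-step idea (stage everywhere, then flip a shared flag). The `IngestNode` and `ManagerNode` classes, their method names, and the integer version numbers are hypothetical, and a plain attribute stands in for the flag that the proposal stores in an index.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class IngestNode:
    """Hypothetical ingest node that stages a database before switching to it."""
    name: str
    active_version: Optional[int] = None
    staged_version: Optional[int] = None

    def stage(self, version: int) -> None:
        # Download the new database from the index; keep serving the old one.
        self.staged_version = version

    def poll(self, flag_version: Optional[int]) -> None:
        # Switch only when the shared flag names the version staged on this node.
        if flag_version is not None and flag_version == self.staged_version:
            self.active_version = flag_version


@dataclass
class ManagerNode:
    """Hypothetical cluster manager; `flag_version` stands in for the flag in an index."""
    nodes: List[IngestNode]
    flag_version: Optional[int] = None

    def try_activate(self, version: int) -> bool:
        # Flip the flag only after every ingest node has staged the new database.
        if all(node.staged_version == version for node in self.nodes):
            self.flag_version = version
            return True
        return False


if __name__ == "__main__":
    nodes = [IngestNode("ingest-1"), IngestNode("ingest-2")]
    manager = ManagerNode(nodes)
    nodes[0].stage(2)
    print(manager.try_activate(2))  # one node has not staged yet
    nodes[1].stage(2)
    print(manager.try_activate(2))  # all nodes are ready
```

The point of waiting for every node before flipping the flag is that all ingest nodes start answering from the same database version, rather than mixing old and new data during the rollout.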
Pros
Cons
Possible future improvement
Option2. OpenSearch Index(Preferred)
In this option, we will use an OpenSearch index. An OpenSearch cluster will download a file in CSV format from an external endpoint. After the download completes, it will put the data into an index in the OpenSearch cluster.
It will also create an index alias pointing to the newly created index.
Update: We are not going to use an alias but will query the index directly, because an alias, unlike a system index, can be modified or deleted by a user.
The index will have a single shard with auto_expand_replicas set to 0-all, so that a query against the index can be served within the same node to achieve a fast processing time.
The GeoIP processor will query the index internally to populate location data at ingest time.
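To make the lookup itself concrete, the following Python sketch resolves an IP address against CIDR ranges loaded from CSV text. It does not use an OpenSearch index or Lucene; it only illustrates the range-matching step with a sorted list and binary search. The column names (`network`, `country_name`, `city_name`), the sample rows (documentation address ranges), and the `RangeTable` class are assumptions, and the sketch assumes IPv4-only, non-overlapping networks.

```python
import bisect
import csv
import io
import ipaddress
from typing import Dict, List, Optional, Tuple

# Hypothetical CSV layout; the real file format is defined elsewhere in the RFC.
SAMPLE_CSV = """network,country_name,city_name
192.0.2.0/24,Exampleland,Sample City
198.51.100.0/25,Testonia,Mock Town
"""


class RangeTable:
    """IP-to-location lookup over non-overlapping IPv4 CIDR ranges."""

    def __init__(self, csv_text: str) -> None:
        rows: List[Tuple[int, int, Dict[str, str]]] = []
        for row in csv.DictReader(io.StringIO(csv_text)):
            net = ipaddress.IPv4Network(row.pop("network"))
            # Store each range as (first address, last address, remaining fields).
            rows.append((int(net.network_address), int(net.broadcast_address), dict(row)))
        rows.sort(key=lambda r: r[0])
        self._starts = [r[0] for r in rows]
        self._rows = rows

    def lookup(self, ip: str) -> Optional[Dict[str, str]]:
        value = int(ipaddress.IPv4Address(ip))
        # Find the last range whose start is <= the address, then check its end.
        i = bisect.bisect_right(self._starts, value) - 1
        if i < 0:
            return None
        _, end, fields = self._rows[i]
        return fields if value <= end else None


if __name__ == "__main__":
    table = RangeTable(SAMPLE_CSV)
    print(table.lookup("192.0.2.55"))
    print(table.lookup("198.51.100.200"))
```

In the proposal, this matching would be done by a query against the index rather than by an in-process table, so the sketch says nothing about the performance of either approach.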
CSV file format
Pros
Cons
Possible future improvement
Questions to community
Implementations