
[Feature]GeoIP datasource implementation #6559

Closed
9 of 17 tasks
heemin32 opened this issue Mar 7, 2023 · 4 comments
Labels
enhancement (Enhancement or improvement to existing feature or request), feature (New feature or request)

Comments


heemin32 commented Mar 7, 2023

Description

This document contains implementation details for the GeoIP datasource, as part of #5856

Tasks

Tasks are listed here to track progress on the implementation. One PR can cover multiple tasks if the code change is small.

Create datasource

  • Create API interface
  • Read default value from a cluster configuration property
  • Read manifest file and validate input parameter
  • Store metadata in a system index
  • Schedule update GeoIP db task for new datasource

Update datasource

  • Update metadata in a system index
  • Schedule update GeoIP db task for existing datasource

Read datasource

  • Return metadata

Delete datasource

  • Return error if there is GeoIP processor using this GeoIP datasource
  • Update metadata in a system index
  • Schedule delete GeoIP db task

Update GeoIP database

  • Check if update is required
  • Download the zip file and ingest its data into an index without storing it on disk
  • Delete old index
  • Schedule either next update or delete task

Delete GeoIP database

  • Delete GeoIP datasource index
  • Delete GeoIP datasource metadata.

User scenarios

Create/Update of GeoIP data source

  1. The customer makes a call to the OpenSearch cluster to create a GeoIP data source. It takes endpoint and update interval as parameters. Default values are provided as well, and are configurable via a cluster property.
  2. The data about the GeoIP data source will be stored in a system index named .geoip_datasource
  3. PUT/POST API handler for data source
    1. Read manifest file.
    2. Validate parameter.
      1. Manifest file is reachable.
      2. Manifest file format is correct.
      3. update_interval is less than the valid_for_in_days value in the manifest file.
    3. Store data in a system index
    4. Scheduling update
      1. If the data source name exists
        1. If there is an ongoing update
          1. Do nothing
        2. If there is no ongoing update
          1. Cancel the scheduled update task
          2. Reschedule the update task after update_interval.
      2. If the data source name does not exist
        1. Schedule the update task
    5. Return OK
  4. Update task
    1. It reads a manifest file.
      1. If the md5_hash value is the same as the previous one
        1. Only update metadata of the data source: expire_after, next_update_at, last_skipped_at.
      2. If the md5_hash value is different from the previous one
        1. Download and ingest it into a new system index.
        2. Update metadata of the data source: md5_hash, expire_after, updated_at, next_update_at, last_succeeded_at, last_processing_time.
        3. Delete the old index.
        4. Schedule the next update task.
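The update-task branching above can be sketched as follows. This is a minimal, illustrative Python sketch, not the actual plugin code: the function name, the dict shapes (which mirror the GET response and Manifest.json examples in this document), and the returned action list are all hypothetical.

```python
import time

DAY_MS = 24 * 60 * 60 * 1000

def run_update_task(datasource: dict, manifest: dict) -> list:
    """Decide and record what the update task does, per the flow above.

    Returns the list of actions taken, for illustration only.
    """
    actions = []
    now_ms = int(time.time() * 1000)
    db = datasource["database"]
    stats = datasource["update_stats"]
    if manifest["md5_hash"] == db["md5_hash"]:
        # Hash unchanged: metadata-only refresh (expire_after, last_skipped_at).
        datasource["expire_after"] = manifest["updated_at"] + manifest["valid_for_in_days"] * DAY_MS
        stats["last_skipped_at"] = now_ms
        actions.append("skip")
    else:
        # New database: ingest into a fresh system index, then drop the old one.
        started = time.monotonic()
        actions.append("download_and_ingest")  # placeholder for the real work
        db["md5_hash"] = manifest["md5_hash"]
        db["updated_at"] = manifest["updated_at"]
        stats["last_succeeded_at"] = now_ms
        stats["last_processing_time_in_millis"] = int((time.monotonic() - started) * 1000)
        actions.append("delete_old_index")
    # Either way, schedule the next run after update_interval_in_days.
    datasource["next_update"] = now_ms + datasource["update_interval_in_days"] * DAY_MS
    actions.append("schedule_next_update")
    return actions
```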

Datasource API signature

PUT /_geoip/datasource/my-datasource
{
  "endpoint": "https://geoip.opensearch.org/v1/geolite2-city/manifest.json",
  "update_interval_in_days": 20
}
GET /_geoip/datasource/my-datasource
{
  "endpoint": "https://geoip.opensearch.org/v1/manifest/geolite2-city",
  "update_interval_in_days": 20,
  "state": "AVAILABLE",
  "expire_after": 12343434,
  "next_update": 12341244,
  "database": {
    "provider": "maxmind",
    "md5_hash": "63d0cea9d550e495fde1b81310951bd7",
    "updated_at": 123123213,
    "valid_for_in_days" : 30,
    "fields": ["latitude", "longitude", "country", "city"]
  },
  "indices": [
    ".geoip_datasource.my-datasource.123123213",
    ".geoip_datasource.my-datasource.123123212"
  ],
  "update_stats": {
    "last_succeeded_at": 123123,
    "last_processing_time_in_millis": 912999,
    "last_failed_at": 123123213123,
    "last_skipped_at": 123123213
  }
}

GeoIP database in an index

Index
/.geoip_datasource.my-datasource.1
{
   "_cidr" : "2a12:49c5:4380::/41",
   "_data" : {
       "country_name" : "Georgia",
       "continent_name" : "Asia",
        ...
    }
}
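A GeoIP lookup against documents of this shape amounts to finding the _cidr block that contains an IP. The following is a hedged, in-memory sketch using Python's ipaddress module; the GEOIP_DOCS list and lookup function are illustrative stand-ins (the real processor would query the system index rather than scan a list).

```python
import ipaddress

# Hypothetical in-memory stand-in for a .geoip_datasource.* index: each
# document pairs a _cidr block with its _data fields, as in the example above.
GEOIP_DOCS = [
    {"_cidr": "2a12:49c5:4380::/41",
     "_data": {"country_name": "Georgia", "continent_name": "Asia"}},
]

def lookup(ip: str):
    """Return the _data of the document whose _cidr contains the IP, or None."""
    addr = ipaddress.ip_address(ip)
    for doc in GEOIP_DOCS:
        net = ipaddress.ip_network(doc["_cidr"])
        if addr.version == net.version and addr in net:
            return doc["_data"]
    return None
```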

Manifest.json

{
  "url": "https://d17zozg08cgjfy.cloudfront.net/GeoLite2-ASN-CSV_20221206.zip",
  "db_name": "GeoLite2-ASN.csv",
  "md5_hash": "safasdfaskkkesadfasdf",
  "valid_for_in_days": 30,
  "updated_at": 3134012341236,
  "provider": "maxmind"
}
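The parameter validation described in the create/update flow (manifest format is correct, update_interval is less than valid_for_in_days) could look roughly like this. A minimal sketch only: the function name and the list-of-errors return shape are assumptions, and the required field names are taken from the Manifest.json example above.

```python
def validate_create_request(manifest: dict, update_interval_in_days: int) -> list:
    """Return a list of validation error messages; empty means valid."""
    errors = []
    # Manifest file format is correct: all expected fields are present.
    required = {"url", "db_name", "md5_hash", "valid_for_in_days",
                "updated_at", "provider"}
    missing = required - manifest.keys()
    if missing:
        errors.append("manifest is missing fields: " + ", ".join(sorted(missing)))
    # update_interval must be less than valid_for_in_days, so the database
    # is always refreshed before it expires.
    elif update_interval_in_days >= manifest["valid_for_in_days"]:
        errors.append("update_interval_in_days must be less than valid_for_in_days")
    return errors
```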

Deletion of GeoIP data source

  1. The customer makes a call to the OpenSearch cluster to delete a GeoIP data source.
  2. It checks whether any GeoIP processors are using the GeoIP data source.
    1. If there are, return an error.
    2. If there are none
      1. Mark the datasource as deleted.
      2. If there is an ongoing update
        1. Let the update task trigger the delete task at the end
      3. If there is no ongoing update
        1. Cancel the scheduled update task
        2. Schedule the delete task immediately.
          1. Delete GeoIP data index
          2. Delete GeoIP data source data
DELETE /_geoip/datasource/my-datasource
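The deletion decision tree above can be condensed into a small sketch. Everything here is illustrative: the function name, the string return values, and the state value "DELETING" are assumptions, not the plugin's actual API.

```python
def delete_datasource(datasource: dict, processor_count: int,
                      update_in_progress: bool) -> str:
    """Sketch of the deletion flow; `processor_count` is the number of
    GeoIP processors currently using this datasource."""
    if processor_count > 0:
        # Refuse deletion while any GeoIP processor depends on the data.
        return "error: GeoIP processors are still using this datasource"
    datasource["state"] = "DELETING"  # mark the datasource as deleted
    if update_in_progress:
        # The running update task triggers the delete task when it finishes.
        return "deferred_delete"
    # Cancel the scheduled update task and delete index + metadata now.
    return "immediate_delete"
```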

Cluster manager node failure

All of the work related to the GeoIP datasource is executed on the cluster manager node. The cluster manager node maintains scheduled tasks in memory. When the cluster manager node fails, it fails over to one of the cluster-manager-eligible nodes. The new cluster manager node scans all existing GeoIP datasources and reschedules tasks accordingly. It uses the "next_update" field of each GeoIP datasource to set the correct time to update the GeoIP databases.
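The failover recovery step could be sketched as below. A hypothetical sketch: the function name and the (timestamp, name) schedule shape are assumptions; only the reliance on the persisted "next_update" field comes from the design above.

```python
def reschedule_after_failover(datasources: list, now_ms: int) -> list:
    """Rebuild the in-memory schedule from each datasource's persisted
    "next_update" field. Updates whose time has already passed while no
    cluster manager was running are scheduled to run immediately (now_ms).
    Returns (run_at_ms, name) pairs, earliest first."""
    return sorted((max(ds["next_update"], now_ms), ds["name"])
                  for ds in datasources)
```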

heemin32 added the enhancement and untriaged labels on Mar 7, 2023
minalsha added the feature label and removed the untriaged label on Mar 10, 2023

heemin32 commented Mar 23, 2023

For the GeoIP datasource implementation, the original plan was to have n+1 system indices. For example, we would have one system index to store datasource metadata and one system index for each of the n datasources. During an update, it creates a new system index and deletes the old system index for a datasource.

Other options that reduce index create/delete operations are:

  1. Reuse the system index during update, with a version field to distinguish versions.
  2. Use a single system index to hold every datasource, with a name field to distinguish among them; the name would have a version number as a suffix.
  3. Use a single system index to hold both the metadata and every datasource. This needs two fields: one to distinguish metadata from datasource data, and another to distinguish data of the same type.

The problems I encountered with the above three options are:

  1. Increased latency during ingestion. For metadata, the query time increased by 0.03 ms per doc when I used a single unified index compared to a metadata-only index. In a previous test, the in-node query time was 0.059 ms per doc, so a 0.03 ms increase in query time is not negligible.
  2. Circuit breaker exception ([parent]). I got the exception when I queried data with two or more term-matching conditions, which the above three options require. The cluster was a single node with the default memory setup running in Docker. The chance of hitting the exception was higher with more term-matching conditions.

Therefore, unless there is a hidden risk in the continuous creation/deletion of system indices at a frequency of up to twice a week per datasource, the original approach is better than the others in terms of latency and success rate.

Another option I can think of is post-filtering after getting results from an index, to avoid the circuit breaker exception.

@dblock, @nknize Can I get an opinion on this subject?


dblock commented Mar 24, 2023

  • Is the latency increase only on the background job that updates GeoIP data? What is the total end-user impact?
  • In the version where you have the max number of indices (metadata, one per source), how many do you expect in typical production scenarios? Do you feel that number will become a problem?
  • Are there any concerns around permissions/security/tenancy in the case where you create new indices per data source?


heemin32 commented Mar 24, 2023

  • The latency is not for the background job updating the database. It directly impacts ingest activity that uses the GeoIP processor. Ingestion latency increased from 0.069 ms per doc with an index of 345.9 MB and 4,553,264 records to 0.082 ms per doc with an index of 685.4 MB and 9,106,528 records.
  • In typical production scenarios, there will be at most 2 indices in the normal state and 3 during an update, assuming a user mostly uses the city GeoIP database only. MaxMind currently provides three types of database, so it can be 4 in the normal state and 7 during updates if every update happens at the same time. The index for a datasource keeps being created and deleted during updates (at most twice a week per source).
  • I don't see any concerns regarding permissions/security/tenancy as of now.


heemin32 commented Sep 8, 2023

Implementation is completed.

@heemin32 heemin32 closed this as completed Sep 8, 2023