
[Feature]GeoIP datasource implementation #6559

Closed
9 of 17 tasks
heemin32 opened this issue Mar 7, 2023 · 4 comments
Labels
enhancement (Enhancement or improvement to existing feature or request), feature (New feature or request)

Comments


heemin32 commented Mar 7, 2023

Description

This document contains implementation details for the GeoIP datasource, as part of #5856

Tasks

Tasks are listed here to track progress on the implementation. One PR can cover multiple tasks if the code change is small.

Create datasource

  • Create API interface
  • Read default value from a cluster configuration property
  • Read manifest file and validate input parameter
  • Store metadata in a system index
  • Schedule update GeoIP db task for new datasource

Update datasource

  • Update metadata in a system index
  • Schedule update GeoIP db task for existing datasource

Read datasource

  • Return metadata

Delete datasource

  • Return error if there is GeoIP processor using this GeoIP datasource
  • Update metadata in a system index
  • Schedule delete GeoIP db task

Update GeoIP database

  • Check if update is required
  • Download the zip file and ingest its data into an index without storing it on disk
  • Delete old index
  • Schedule either next update or delete task

Delete GeoIP database

  • Delete GeoIP datasource index
  • Delete GeoIP datasource metadata.

User scenarios

Create/Update of GeoIP data source

  1. The customer makes a call to the OpenSearch cluster to create a GeoIP data source. It takes endpoint and update interval as parameters. Default values are provided as well, and are configurable via a cluster property.
  2. The data about the GeoIP data source will be stored in a system index named .geoip_datasource
  3. PUT/POST API handler for data source
    1. Read manifest file.
    2. Validate parameter.
      1. Manifest file is reachable.
      2. Manifest file format is correct.
      3. update_interval is less than the valid_for_in_days value in the manifest file.
    3. Store data in a system index
    4. Scheduling update
      1. If the data source name exists
        1. If there is an ongoing update
          1. Do nothing
        2. If there is no ongoing update
          1. Cancel the scheduled update task
          2. Reschedule the update task after update_interval.
      2. If the data source name does not exist
        1. Schedule the update task
    5. Return OK
  4. Update task
    1. It reads a manifest file.
      1. If the md5_hash value is the same as the previous one
        1. Only update metadata of the data source: expire_after, next_update_at, last_skipped_at.
      2. If the md5_hash value is different from the previous one
        1. Download and ingest it into a new system index.
        2. Update metadata of the data source: md5_hash, expire_after, updated_at, next_update_at, last_succeeded_at, last_processing_time.
        3. Delete the old index.
        4. Schedule the next update task.
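The update-task branching above can be sketched as follows. This is a minimal, illustrative Python sketch, not the actual plugin code: the function name, the dict shapes (which mirror the GET response and Manifest.json examples in this document), and the returned action list are all hypothetical.

```python
import time

DAY_MS = 24 * 60 * 60 * 1000

def run_update_task(datasource: dict, manifest: dict) -> list:
    """Decide and record what the update task does, per the flow above.

    Returns the list of actions taken, for illustration only.
    """
    actions = []
    now_ms = int(time.time() * 1000)
    db = datasource["database"]
    stats = datasource["update_stats"]
    if manifest["md5_hash"] == db["md5_hash"]:
        # Hash unchanged: metadata-only refresh (expire_after, last_skipped_at).
        datasource["expire_after"] = manifest["updated_at"] + manifest["valid_for_in_days"] * DAY_MS
        stats["last_skipped_at"] = now_ms
        actions.append("skip")
    else:
        # New database: ingest into a fresh system index, then drop the old one.
        started = time.monotonic()
        actions.append("download_and_ingest")  # placeholder for the real work
        db["md5_hash"] = manifest["md5_hash"]
        db["updated_at"] = manifest["updated_at"]
        stats["last_succeeded_at"] = now_ms
        stats["last_processing_time_in_millis"] = int((time.monotonic() - started) * 1000)
        actions.append("delete_old_index")
    # Either way, schedule the next run after update_interval_in_days.
    datasource["next_update"] = now_ms + datasource["update_interval_in_days"] * DAY_MS
    actions.append("schedule_next_update")
    return actions
```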

Datasource API signature

PUT /_geoip/datasource/my-datasource
{
  "endpoint": "https://geoip.opensearch.org/v1/geolite2-city/manifest.json",
  "update_interval_in_days": 20
}
GET /_geoip/datasource/my-datasource
{
  "endpoint": "https://geoip.opensearch.org/v1/manifest/geolite2-city",
  "update_interval_in_days": 20,
  "state": "AVAILABLE",
  "expire_after": 12343434,
  "next_update": 12341244,
  "database": {
    "provider": "maxmind",
    "md5_hash": "63d0cea9d550e495fde1b81310951bd7",
    "updated_at": 123123213,
    "valid_for_in_days" : 30,
    "fields": ["latitude", "longitude", "country", "city"]
  },
  "indices": [
    ".geoip_datasource.my-datasource.123123213",
    ".geoip_datasource.my-datasource.123123212"
  ],
  "update_stats": {
    "last_succeeded_at": 123123,
    "last_processing_time_in_millis": 912999,
    "last_failed_at": 123123213123,
    "last_skipped_at": 123123213
  }
}

GeoIP database in an index

Index
/.geoip_datasource.my-datasource.1
{
   "_cidr" : "2a12:49c5:4380::/41",
   "_data" : {
       "country_name" : "Georgia",
       "continent_name" : "Asia",
        ...
    }
}
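A GeoIP lookup against documents of this shape amounts to finding the _cidr block that contains an IP. The following is a hedged, in-memory sketch using Python's ipaddress module; the GEOIP_DOCS list and lookup function are illustrative stand-ins (the real processor would query the system index rather than scan a list).

```python
import ipaddress

# Hypothetical in-memory stand-in for a .geoip_datasource.* index: each
# document pairs a _cidr block with its _data fields, as in the example above.
GEOIP_DOCS = [
    {"_cidr": "2a12:49c5:4380::/41",
     "_data": {"country_name": "Georgia", "continent_name": "Asia"}},
]

def lookup(ip: str):
    """Return the _data of the document whose _cidr contains the IP, or None."""
    addr = ipaddress.ip_address(ip)
    for doc in GEOIP_DOCS:
        net = ipaddress.ip_network(doc["_cidr"])
        if addr.version == net.version and addr in net:
            return doc["_data"]
    return None
```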

Manifest.json

{
  "url": "https://d17zozg08cgjfy.cloudfront.net/GeoLite2-ASN-CSV_20221206.zip",
  "db_name": "GeoLite2-ASN.csv",
  "md5_hash": "safasdfaskkkesadfasdf",
  "valid_for_in_days": 30,
  "updated_at": 3134012341236,
  "provider": "maxmind"
}
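The parameter validation described in the create/update flow (manifest format is correct, update_interval is less than valid_for_in_days) could look roughly like this. A minimal sketch only: the function name and the list-of-errors return shape are assumptions, and the required field names are taken from the Manifest.json example above.

```python
def validate_create_request(manifest: dict, update_interval_in_days: int) -> list:
    """Return a list of validation error messages; empty means valid."""
    errors = []
    # Manifest file format is correct: all expected fields are present.
    required = {"url", "db_name", "md5_hash", "valid_for_in_days",
                "updated_at", "provider"}
    missing = required - manifest.keys()
    if missing:
        errors.append("manifest is missing fields: " + ", ".join(sorted(missing)))
    # update_interval must be less than valid_for_in_days, so the database
    # is always refreshed before it expires.
    elif update_interval_in_days >= manifest["valid_for_in_days"]:
        errors.append("update_interval_in_days must be less than valid_for_in_days")
    return errors
```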

Deletion of GeoIP data source

  1. The customer makes a call to the OpenSearch cluster to delete a GeoIP data source.
  2. It checks whether any GeoIP processors are using the GeoIP data source.
    1. If there are, return an error.
    2. If there are none
      1. Mark the datasource as deleted.
      2. If there is an ongoing update
        1. Let the update task trigger the delete task at the end
      3. If there is no ongoing update
        1. Cancel the scheduled update task
        2. Schedule the delete task immediately.
          1. Delete GeoIP data index
          2. Delete GeoIP data source data
DELETE /_geoip/datasource/my-datasource
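The deletion decision tree above can be condensed into a small sketch. Everything here is illustrative: the function name, the string return values, and the state value "DELETING" are assumptions, not the plugin's actual API.

```python
def delete_datasource(datasource: dict, processor_count: int,
                      update_in_progress: bool) -> str:
    """Sketch of the deletion flow; `processor_count` is the number of
    GeoIP processors currently using this datasource."""
    if processor_count > 0:
        # Refuse deletion while any GeoIP processor depends on the data.
        return "error: GeoIP processors are still using this datasource"
    datasource["state"] = "DELETING"  # mark the datasource as deleted
    if update_in_progress:
        # The running update task triggers the delete task when it finishes.
        return "deferred_delete"
    # Cancel the scheduled update task and delete index + metadata now.
    return "immediate_delete"
```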

Cluster manager node failure

All of the work related to the GeoIP datasource is executed on the cluster manager node. The cluster manager node maintains scheduled tasks in memory. When the cluster manager node fails, it fails over to one of the cluster-manager-eligible nodes. The new cluster manager node scans all existing GeoIP datasources and reschedules tasks accordingly. It uses the "next_update" field of each GeoIP datasource to set the correct time to update the GeoIP databases.
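The failover recovery step could be sketched as below. A hypothetical sketch: the function name and the (timestamp, name) schedule shape are assumptions; only the reliance on the persisted "next_update" field comes from the design above.

```python
def reschedule_after_failover(datasources: list, now_ms: int) -> list:
    """Rebuild the in-memory schedule from each datasource's persisted
    "next_update" field. Updates whose time has already passed while no
    cluster manager was running are scheduled to run immediately (now_ms).
    Returns (run_at_ms, name) pairs, earliest first."""
    return sorted((max(ds["next_update"], now_ms), ds["name"])
                  for ds in datasources)
```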

heemin32 added the enhancement and untriaged labels on Mar 7, 2023
minalsha added the feature label and removed the untriaged label on Mar 10, 2023

heemin32 commented Mar 23, 2023

For the GeoIP datasource implementation, the original plan was to have n+1 system indices. For example, we would have one system index to store datasource metadata and one system index for each of the n datasources. During an update, it creates a new system index and deletes the old system index for a datasource.

Other options that reduce index create/delete operations are:

  1. Reuse the system index during update, with a version field to distinguish versions.
  2. Use a single system index to hold every datasource, with a name field to distinguish among them; the name would have a version number as a suffix.
  3. Use a single system index to hold both the metadata and every datasource. This needs two fields: one to distinguish metadata from datasource data, and another to distinguish data of the same type.

The problems I encountered with the above three options are:

  1. Increased latency during ingestion. For metadata, the query time increased by 0.03 ms per doc when I used a single unified index compared to a metadata-only index. In a previous test, the in-node query time was 0.059 ms per doc, so a 0.03 ms increase in query time is not negligible.
  2. Circuit breaker exception ([parent]). I got the exception when I queried data with two or more term-matching conditions, which the above three options require. The cluster was a single node with the default memory setup running in Docker. The chance of hitting the exception was higher with more term-matching conditions.

Therefore, unless there is a hidden risk in the continuous creation/deletion of system indices at a frequency of up to twice a week per datasource, the original approach is better than the others in terms of latency and success rate.

Another option I can think of is post-filtering after getting results from an index, to avoid the circuit breaker exception.

@dblock, @nknize Can I get an opinion on this subject?


dblock commented Mar 24, 2023

  • Is the latency increase only on the background job that updates GeoIP data? What is the total end-user impact?
  • In the version where you have the max number of indices (metadata, one per source), how many do you expect in typical production scenarios? Do you feel that number will become a problem?
  • Are there any concerns around permissions/security/tenancy in the case where you create new indices per data source?


heemin32 commented Mar 24, 2023

  • The latency is not for the background job updating the database. It directly impacts ingest activity that uses the GeoIP processor. Ingestion latency increased from 0.069 ms per doc with an index of 345.9 MB and 4,553,264 records to 0.082 ms per doc with an index of 685.4 MB and 9,106,528 records.
  • In typical production scenarios, there will be at most 2 indices in the normal state and 3 during an update, assuming a user mostly uses the city GeoIP database only. MaxMind currently provides three types of database, so it can be 4 in the normal state and 7 during updates if every update happens at the same time. The index for a datasource keeps being created and deleted during updates (at most twice a week per source).
  • I don't see any concerns regarding permissions/security/tenancy as of now.


heemin32 commented Sep 8, 2023

Implementation is completed.

@heemin32 heemin32 closed this as completed Sep 8, 2023