Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Optimization] Adaptive shard selection for writes for auto generated ids. #4984

Open
itiyama opened this issue Oct 31, 2022 · 4 comments
Open
Labels
enhancement Enhancement or improvement to existing feature or request Indexing Indexing, Bulk Indexing and anything related to indexing

Comments

@itiyama
Copy link

itiyama commented Oct 31, 2022

When there are multiple parallel bulk requests from a client with no document id, the coordinator does the following:

  1. Generate a new id per document in bulk.
  2. Create a per shard request out of the entire set of documents based on the hash of the document.
  3. Send out per shard bulk requests to workers

If any of the shards is slow, all the bulk requests need to wait on the coordinator. The client is also blocked till the coordinator return the response. When the request is returned to the customer, the customer reads the bulk response and then retries the remaining requests. Imagine a shard being in INITIALIZING state - all the bulk requests will wait on the coordinator and cause the entire system to be slow.

How about the coordinators always send all document in a bulk request to one shard and then round robin the requests across shards? This will reduce the amount of requests waiting in the coordinator queue as a result of a slow shard and also free up the resources on the client side.

What are the downsides of this approach? Can this result in imbalance of documents across shards? If a shard is really slow, it would be imbalanced even with a uniform splitting approach as it will not be able to complete the work on time and hence would be timed out on coordinator.

This optimization will not work for customer generated ids or for custom routing use-cases.

@itiyama itiyama added enhancement Enhancement or improvement to existing feature or request untriaged labels Oct 31, 2022
@anasalkouz
Copy link
Member

@adnapibar Could you please take a look? since this related to streaming index API

@adnapibar adnapibar added the Indexing Indexing, Bulk Indexing and anything related to indexing label Nov 17, 2022
@itiyama
Copy link
Author

itiyama commented Dec 6, 2022

With segment replication and remote storage where the failovers are slow, this optimization can be combined with an adaptive shard selection based on latencies or shard availability.

@itiyama
Copy link
Author

itiyama commented Apr 27, 2023

@nknize Your ghost writer solution will address this problem too, right?

@itiyama
Copy link
Author

itiyama commented Aug 29, 2023

Automatic routing is one way to solve this, but automatic routing makes updates/gets inefficient. The issue that our solution should solve is that id is tightly coupled with routing. Here are a few options to decouple the two, that handle updates and custom document ids.

  1. Encode shard in the document id for auto generated ids.
  2. Pre generate document and associate them with shards on the fly. This will waste some memory on the coordinating node.

This solution does not prevent us from using custom doc ids or custom routing - it is just that the optimization does not work when custom doc id or routing is used.

Adaptive shard selection for indexing - To implement this, each shard returns the number of documents indexed within each shard for auto generated id, the size per shard to coordinator and also stats on shard request queue, processing time etc. Based on this, requests using auto generated ids can adaptively select the shard the document lands into. Each coordinator will calculate a balance score based on documents with auto ids within each shard and once the score starts breaching a threshold, the adaptive shard selection will attempt to fix the balance.

We will use an opt-in to enable this feature and will be enabled by default for data streams

@itiyama itiyama changed the title [Optimization] Send all sub-bulk requests to a single shard when there are multiple concurrent bulk requests [Optimization] Adaptive shard selection for writes for auto generated ids. Oct 4, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Enhancement or improvement to existing feature or request Indexing Indexing, Bulk Indexing and anything related to indexing
Projects
None yet
Development

No branches or pull requests

4 participants