[RFC] Streaming ingestion (pull based) #16495
Comments
Thanks @RS146BIJAY, @yupeng9. Please take a look at and comment on the existing RFC; I am closing this one as a duplicate, since a large amount of the work needed for streaming ingestion has already been done.
Thanks. Yes, this is different from adding streaming to the HTTP protocol. It aims to pull from streaming systems like Kafka, Kinesis, Pulsar, Redpanda, etc.
Thanks for starting the discussion. It would be interesting to see how we decouple partitions or streams from the shards, especially with the work on online shard split.
Thanks for the RFC @yupeng9. This would help bring proactive back-pressure to OpenSearch ingestion.
Does the source need to be a streaming system? Could it be a database storing records, where OpenSearch ingestion workers ingest new records based on a maintained checkpoint? I understand that it won't be as efficient as, say, Kafka, but it would be more generic. Also, if I compare this to Data Prepper (https://opensearch.org/docs/latest/data-prepper/), the advantage of pull-based ingestion would be that it is built into the server, so it would know the health and capacity of the server better. Is this a correct understanding?
Ultimately, as long as something supports the API described in the Google Doc (or at least whatever the final version of it is), we should be able to ingest. I'm excited to try implementing a Parquet (or maybe JSON file in a blob store) source that would do a one-time import, where each shard fetches a subset of the input as fast as it can.

Something else I would like to support is a combined source that includes both a database and an event stream. One of the Amazon systems that I worked on previously had an upstream system of record implemented in DynamoDB. That system would receive partial updates for records and apply them to DynamoDB using optimistic locking. Once the (versioned) update succeeded, it would be sent on a Kinesis stream for live updates. The search system that I worked on would periodically rebuild the whole index by backfilling from DynamoDB while simultaneously processing live updates from Kinesis (to make sure that updates applied to records already read from DynamoDB would not be missed). Once the backfill was done, it would continue processing updates from Kinesis.

From our perspective on the search side, we were just pulling documents from "somewhere" and didn't care whether they came from DynamoDB or Kinesis. A composable ingest source (that pulls from multiple ingest sources) should be pretty easy.
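To illustrate, here's a rough sketch of what a composable source could look like. The `DocumentSource` interface and the class names are made up for this example and are not from the linked doc:

```java
// Illustration only: a composite source that drains a bounded backfill source
// while also consuming a live stream. The DocumentSource interface is invented
// for this sketch and is not the interface proposed in the RFC.
import java.util.ArrayList;
import java.util.List;

interface DocumentSource {
    /** Returns up to max documents, or an empty list if nothing is available right now. */
    List<String> poll(int max);

    /** True once a bounded source (e.g. a table scan or a Parquet file) is exhausted. */
    boolean exhausted();
}

final class CompositeSource implements DocumentSource {
    private final DocumentSource backfill;   // e.g. a DynamoDB scan or Parquet files in a blob store
    private final DocumentSource liveStream; // e.g. a Kinesis/Kafka consumer

    CompositeSource(DocumentSource backfill, DocumentSource liveStream) {
        this.backfill = backfill;
        this.liveStream = liveStream;
    }

    @Override
    public List<String> poll(int max) {
        List<String> batch = new ArrayList<>();
        if (!backfill.exhausted()) {
            // Drain the bounded backfill first.
            batch.addAll(backfill.poll(max));
        }
        if (batch.size() < max) {
            // Keep consuming live updates alongside, so nothing is missed;
            // versioned documents let the engine discard stale backfill records
            // that a live update has already superseded.
            batch.addAll(liveStream.poll(max - batch.size()));
        }
        return batch;
    }

    @Override
    public boolean exhausted() {
        return false; // the live stream is unbounded
    }
}
```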
Thanks @yupeng9 for the detailed RFC. I liked the alternative approach of using …
Thanks @yupeng9 for this detailed RFC. I really like the idea of using a Kafka-like queue between the producer and OpenSearch. This should take a lot of pressure off clients, so they don't need to build their own complex logic or queues when OpenSearch can't keep up with ingestion. I'm most familiar with Kafka, so I'll use that as an example, but the same ideas apply to other streaming systems too.

A couple of things I'm curious about:

- How flexible can we make the scaling between Kafka partitions and OS shards? Do we really need to stick to a 1:1 mapping, or can we let them scale independently? Both systems already have their own ways to scale, so it'd be cool if we could take advantage of that.
- Can we keep the queue stuff "under the hood" as much as possible from the OS cluster's perspective? It'd be great if we could keep the learning curve low for users. Also, how can we make sure the default setup works well for most use cases without needing a ton of tweaking?

Just thinking about how this might work in real-world setups. Looking forward to seeing how this develops!
Check out the linked Google Doc. There's an interface that essentially gives OpenSearch an "iterator" over the ingest source. Configuring the ingest source is part of the index configuration and is handled by the ingest source plugin (so it's exactly as complicated as whatever the ingest source requires, but is opaque to OpenSearch itself).
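For concreteness, here's a minimal sketch of what such an iterator-style interface might look like. All names and signatures below are hypothetical placeholders, not the actual interface from the Google Doc:

```java
// Hypothetical sketch of an iterator-style ingest source, for illustration only.
// Names and signatures are assumptions, not the interface from the linked doc.
import java.io.Closeable;
import java.io.IOException;
import java.util.List;

/** A pointer into a stream partition (e.g. a Kafka offset or Kinesis sequence number). */
interface StreamPointer extends Comparable<StreamPointer> {
    byte[] serialize();
}

/** One document read from the stream, together with the pointer it was read at. */
final class ReadResult {
    final StreamPointer pointer;
    final byte[] document; // raw source, e.g. JSON bytes

    ReadResult(StreamPointer pointer, byte[] document) {
        this.pointer = pointer;
        this.document = document;
    }
}

/** Per-shard "iterator" over one partition of the ingest source. */
interface IngestionShardConsumer extends Closeable {
    /** Read up to maxBatchSize documents after the given pointer, blocking up to timeoutMillis. */
    List<ReadResult> readNext(StreamPointer after, int maxBatchSize, long timeoutMillis) throws IOException;

    /** Pointers the shard can resume from after restart or recovery. */
    StreamPointer earliestPointer();

    StreamPointer latestPointer();
}
```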
The main challenge is making sure that the producer and shards agree on the strategy used to route documents to stream partitions. The approach I've seen work (though it's probably not the only solution) is a consistent hash-range strategy. Essentially, with N stream partitions, you divide your hash space into N contiguous ranges (and can update that split as N changes). The producer hashes the doc ID (or custom routing value), sees which range that lands in, and writes to the appropriate partition.

On the shard side, if you have M shards, you similarly split the hash space into M ranges (and can update as M changes, like if you do an online shard split). When M and N are not multiples of one another, the partition ranges and shard ranges won't line up perfectly, but shards can read documents from all partitions whose ranges overlap the shard's range. Any documents outside the shard's range are ignored. I know of at least two highly scalable production search systems that use this strategy.

Anyway, we should be able to ship with a 1:1 mapping constraint to start (since that already helps from a scaling perspective). We can solve the N:M mapping case later (probably by providing configuration to communicate the document hashing strategy) and remove that constraint.
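To make the hash-range idea concrete, here's a small illustrative sketch of splitting a 32-bit hash space into contiguous ranges and checking which partitions overlap a shard's range. The helper names are invented for the example:

```java
// Illustrative hash-range routing over an unsigned 32-bit hash space.
// Names are hypothetical; this is not code from the RFC.
import java.util.ArrayList;
import java.util.List;

final class HashRangeRouting {

    /** Inclusive start / exclusive end of a range in the hash space. */
    record Range(long start, long end) {
        boolean contains(long h) { return h >= start && h < end; }
        boolean overlaps(Range other) { return start < other.end && other.start < end; }
    }

    private static final long HASH_SPACE = 1L << 32;

    /** Split the hash space into n contiguous ranges (range i belongs to partition/shard i). */
    static List<Range> split(int n) {
        List<Range> ranges = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            ranges.add(new Range(HASH_SPACE * i / n, HASH_SPACE * (i + 1) / n));
        }
        return ranges;
    }

    /** Producer side: pick the stream partition for a routing key. */
    static int partitionFor(String routingKey, int numPartitions) {
        long h = Integer.toUnsignedLong(routingKey.hashCode()); // stand-in for a real hash like murmur3
        List<Range> ranges = split(numPartitions);
        for (int p = 0; p < numPartitions; p++) {
            if (ranges.get(p).contains(h)) return p;
        }
        throw new AssertionError("hash out of range");
    }

    public static void main(String[] args) {
        int numPartitions = 3, numShards = 2;
        List<Range> partitionRanges = split(numPartitions);
        List<Range> shardRanges = split(numShards);

        // Shard 0 subscribes to every partition whose range overlaps its own range,
        // and drops any document whose hash falls outside the shard's range.
        Range shard0 = shardRanges.get(0);
        for (int p = 0; p < numPartitions; p++) {
            if (shard0.overlaps(partitionRanges.get(p))) {
                System.out.println("shard 0 reads partition " + p);
            }
        }
    }
}
```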
Using stream-based ingestion is very common in lots of real-world setups, including at least three different search systems that operate within Amazon. It's so much easier to scale than a system that pretends to be a database. You can also check out what Slack did with their Astra system for log search: https://www.youtube.com/watch?v=iZt-eL1GUKo
I see this as an opportunity to lean into Data Prepper (or other ingestion tools) as the front door for end users. These tools allow for more sophisticated and flexible features than what is possible in an indexing coordinator inside the cluster itself. The specifics of how that ingestion tool sends data to the cluster are then really "under the hood" from the user's perspective.
I think the other challenge in this category, and probably the harder one (which @yupeng9 and @Bukhtawar mentioned), is how to keep locality with the primary shard(s) in case relocation happens for whatever reason (this would eliminate the network hops that other ingestion tools exhibit).
This is a good question. At Uber, our current search systems in production already use pull-based ingestion, and we use a 1:1 mapping between Kafka partitions and shards. Our learning is that this provides a lot of simplicity, as we can trace the data from a search shard back to the input Kafka partition sharing the same sharding key, which lets us build dashboards for observability and tooling for debuggability. And yes, resharding is a very involved procedure with this setup. However, we realized that resharding is a very infrequent operation, and therefore we want to optimize the system for frequent operations and less so for infrequent ones, as the latter can introduce significant complexity to the system.
Yes, that's the intention. We want to make this feature pluggable, so that a separate index engine is used to ingest from streaming sources without disrupting the existing indexing code flow. So when pull-based ingestion is enabled, we can disable the HTTP-based index API.
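As a rough illustration of what that could look like from the user's side, here's a sketch using the OpenSearch `Settings` builder. The `index.ingestion_source.*` keys are invented for this sketch and are not a committed API; the real setting names would be defined by the ingestion-source plugin:

```java
// Hypothetical index settings for a Kafka-backed pull-based index.
// The "index.ingestion_source.*" keys are placeholders for illustration only.
import org.opensearch.common.settings.Settings;

public class PullBasedIndexSettingsExample {
    public static void main(String[] args) {
        Settings settings = Settings.builder()
            .put("index.number_of_shards", 3)                                 // 1:1 with Kafka partitions initially
            .put("index.ingestion_source.type", "kafka")                      // hypothetical: selects the pull-based engine
            .put("index.ingestion_source.param.topic", "orders")              // hypothetical: source-specific parameters
            .put("index.ingestion_source.param.bootstrap_servers", "localhost:9092")
            .build();
        System.out.println(settings);
    }
}
```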
I gave a talk at Community over Code NA last year, and this deck shows the current search architecture at Uber (slide 17) and how pull-based streaming is used in Uber's production environment. It includes some other interesting features too, such as real-time indexing (compared to NRT in Lucene).
That's the option described in the alternative solution. I feel a true native pull-based ingestion built into the cluster with a separate engine will achieve much better performance, as we have already observed in Uber production. We also use an ingester similar to Logstash to ingest from Kafka into Elasticsearch. In our benchmarks, the native pull-based ingestion not only showed much better ingestion throughput, but also significantly reduced cost by removing the compute resources needed for the ingesters.
Just curious, how do we store the mappings between partitions and shards? Is that part of the cluster metadata? Another question: for the log scenario, users traditionally write documents to an alias, a data stream, or a concrete index with a date suffix, and these targets roll over based on age, shard size, or index size with an ISM policy. Does …
In the initial phase, I don't plan to store these mappings, but to go with the convention of a 1:1 mapping between partitions and shards and have validation enforcing it. In the future, we can relax this constraint. For the second question, do you mean whether the retention policy from the streaming system shall be carried over to OpenSearch? I think these two are decoupled, and a separate TTL within OpenSearch controls the lifecycle of the data, independent of the streaming source.
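A minimal sketch of what that validation might look like at index-creation time (a hypothetical helper, not actual OpenSearch code):

```java
// Hypothetical check enforcing the initial 1:1 partition-to-shard convention.
final class PartitionShardValidation {
    /**
     * Fails index creation when the number of stream partitions does not match
     * the number of primary shards (the initial 1:1 convention).
     */
    static void validateOneToOneMapping(int numStreamPartitions, int numPrimaryShards) {
        if (numStreamPartitions != numPrimaryShards) {
            throw new IllegalArgumentException(
                "pull-based ingestion currently requires a 1:1 mapping: "
                    + numStreamPartitions + " stream partitions vs "
                    + numPrimaryShards + " primary shards");
        }
    }
}
```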
@yupeng9 I'm referring to how the data gets into the event stream (Kafka) in the first place. Users will likely need support for things like filtering, enriching, transforming, etc., and that would be provided by a tool like Data Prepper, which would then write to the event stream. How the server pulls data from that event stream (i.e., either solution you described in the doc) becomes an implementation detail for the user, as the front-end API is the same no matter what. Does that make sense?
I see. I think that's a separate problem, and a nice thing about a streaming system is that it decouples the producers and consumers (i.e. OpenSearch). In the industry, there are various ways to produce to the streaming system. Stream processing systems like Flink, Samza, and Storm can be used for such filtering, enriching, and other kinds of preprocessing. And yes, Data Prepper is also one of them and can be enhanced for more native support.
Agreed! The point I'm making is that the way you keep the queue functionality "under the hood" is by offering an end-to-end solution coupled with a producer. The OpenSearch Project could offer a complete, easy-to-use solution using Data Prepper. Managed vendors could offer solutions that make sense in their ecosystems. Companies like Uber would have the flexibility to integrate OpenSearch into existing systems that have completely different producers.
Yes, that makes a lot of sense. I believe a lot more can be built and extended upon this feature.
Data Prepper today supports OpenSearch API termination and an Apache Kafka-based persistent buffer. Hence, while managed vendors could offer solutions that bake in other streaming offerings like Kinesis, Google Pub/Sub, Redpanda, etc., it makes sense for the OpenSearch Project to offer Data Prepper as a first-class citizen for pull-based indexing. With the OpenSearch API termination in Data Prepper, clients do not even need to be updated: the same HTTP push request can be intercepted by Data Prepper, written to Kafka topics, and then read by the shards through pull-based indexing. The other advantage is that Data Prepper already has native connectors to Kinesis, DynamoDB, MongoDB, and S3, and also supports ingesting data from HTTP clients like Fluent Bit/Fluentd as well as OTel shippers.
Is your feature request related to a problem? Please describe
Today, OpenSearch exposes an HTTP-based API for indexing, in which users invoke the endpoint to push changes. It also has a “_bulk” API to group multiple operations in a single request.
There are some shortcomings with the push-based API, especially for complex applications at scale:
Describe the solution you'd like
In general, a streaming ingestion solution could bring multiple benefits and address the aforementioned challenges:
More details of the solution can be found in this document
Related component
Indexing
Describe alternatives you've considered
As an alternative, the plugin-based approach starts a streaming-ingester process as a sidecar to the OpenSearch server on the same host, which is described in this section.
Additional context
No response