Preserve input actions when yielding in bulk helpers #980
This PR fixes the `streaming_bulk` and `parallel_bulk` helpers discarding the original input actions while exhausting the generator. It is somewhat related to #940 and probably solves that issue as well.

Use case
Let's say you have a lot of items, too many to fit in memory, that you want to index and report on: for example, an XML stream-parsing generator or an unevaluated Django queryset.
To index those items with the current API, you pass that generator as the `actions` argument along with an `expand_action_callback` function that constructs the actual documents.

The problem is that if you want to do something further with the items in the generator after they have been indexed (or have failed), such as reporting, the original input item (an XML element or a Django model in this example) is not yielded back by the `bulk` helpers. The generator ends up exhausted, with the items lost in translation and nothing left to report on.

The `bulk` helpers therefore need to yield each input item along with the corresponding ES action result.

Example
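A minimal, library-free sketch of the behaviour described above, not the PR's actual code. `fake_bulk_send`, `streaming_bulk_sketch`, `Article`, and `to_document` are hypothetical names made up for illustration, and the fake sender simply treats documents without an `id` as failures. It shows each original input item being yielded back next to its result, while only one chunk is held in memory at a time.

```python
from itertools import islice


def fake_bulk_send(docs):
    """Hypothetical stand-in for a bulk request: one (ok, info) pair per document."""
    return [("id" in doc, {"index": {"_id": doc.get("id")}}) for doc in docs]


def streaming_bulk_sketch(items, expand, chunk_size=2):
    """Lazily yield (item, ok, info) so the caller keeps the original input item."""
    iterator = iter(items)
    while True:
        # Pull only one chunk from the generator into memory.
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        results = fake_bulk_send([expand(item) for item in chunk])
        # Pair every input item with the result of its own action.
        for item, (ok, info) in zip(chunk, results):
            yield item, ok, info


class Article:
    """Hypothetical input object, e.g. a model instance or a parsed XML element."""

    def __init__(self, pk, title):
        self.pk = pk
        self.title = title


def to_document(article):
    """Hypothetical expand callback: build the document from the input item."""
    doc = {"title": article.title}
    if article.pk is not None:
        doc["id"] = article.pk
    return doc


articles = (Article(pk, title) for pk, title in [(1, "a"), (None, "b"), (3, "c")])

# Reporting can use the original object, not just the action result.
report = [(article.title, ok) for article, ok, _ in streaming_bulk_sketch(articles, to_document)]
print(report)
```

Because the input item travels with its result, the caller can report on, retry, or update the original object without keeping a second copy of the whole stream.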
Additionally, this PR adds two new low-level bulk helpers, `streaming_chunks` and `parallel_chunks`, which allow the chunking function to be customised.

Note: Current versions of `streaming_bulk` and `parallel_bulk` use `map()` to expand actions, which in Python 2 exhausts the generator at once and loads all of its items into memory. This PR also fixes that as an implementation bonus.
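The difference between eager and lazy expansion can be illustrated with a small sketch. It runs on Python 3, where `map()` is already lazy, so `list(map(...))` is used here to mimic Python 2's list-building `map()`. The `tracked` and `expand` helpers are hypothetical and exist only to record how many items were pulled from the source.

```python
def tracked(source, consumed):
    """Yield items from source, recording each one as it is pulled."""
    for item in source:
        consumed.append(item)
        yield item


def expand(item):
    """Hypothetical expand callback."""
    return {"_source": item}


# Eager: list(map(...)) mimics Python 2's map(), which builds the full list up front.
eager_consumed = []
eager = iter(list(map(expand, tracked(range(5), eager_consumed))))
next(eager)

# Lazy: a generator expression pulls one item from the source at a time.
lazy_consumed = []
lazy = (expand(item) for item in tracked(range(5), lazy_consumed))
next(lazy)

print(len(eager_consumed), len(lazy_consumed))
```

After taking a single result, the eager variant has already drained the whole source, while the lazy variant has pulled only the first item.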