
Introduce ChunkedZipResponse #109820

Merged

Conversation

DaveCTurner
Contributor

Adds a utility for implementing REST APIs which construct a streaming
(i.e. pretty-much-constant-memory) `.zip` file response as a (pausable)
sequence of `ChunkedRestResponseBodyPart` instances, where each entry in
the `.zip` file is itself a (pausable) sequence of
`ChunkedRestResponseBodyPart` instances.

Relates #104851

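The core idea is that the zip format can be written strictly sequentially, so each entry's bytes can be produced, transmitted and forgotten before the next entry starts. As a self-contained illustration of that underlying property (plain JDK APIs only, not the classes added in this PR), consider:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Illustration only: a zip stream is a sequential series of entries, so each entry's chunks can
// be produced, flushed towards the consumer and discarded before the next entry begins, keeping
// memory usage roughly constant regardless of how many entries the archive contains.
public class StreamingZipExample {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream(); // stands in for the HTTP channel
        try (ZipOutputStream zip = new ZipOutputStream(sink)) {
            for (int entry = 0; entry < 3; entry++) {
                zip.putNextEntry(new ZipEntry("entry-" + entry + ".txt"));
                // in the real utility each entry is itself a sequence of ChunkedRestResponseBodyPart
                // instances; here we just emit a few small chunks of bytes directly
                for (int chunk = 0; chunk < 4; chunk++) {
                    zip.write(("entry " + entry + " chunk " + chunk + "\n").getBytes(StandardCharsets.UTF_8));
                    zip.flush(); // this chunk's bytes could be handed to the network layer now
                }
                zip.closeEntry();
            }
        }
        System.out.println("zip bytes produced: " + sink.size());
    }
}

The PR's `ChunkedZipResponse` wraps this kind of sequential zip writing behind the `ChunkedRestResponseBodyPart` abstraction, so that both the overall response and each individual entry can be produced asynchronously and paused as a sequence of parts.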
@DaveCTurner DaveCTurner added >non-issue :Distributed Coordination/Network Http and internode communication implementations v8.15.0 labels Jun 17, 2024
@DaveCTurner DaveCTurner requested a review from ywangd June 17, 2024 16:01
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Meta label for distributed team (obsolete) label Jun 17, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@ywangd
Member

ywangd commented Jun 19, 2024

I had a brief look at the main code changes. They make sense to me, but I will need more time to read closely. For context, do we already return zip responses today, maybe for something like ML models? I assume this change has no user-visible impact? If we don't have such usage today, could you please explain the intended use cases a bit? Thanks!

@DaveCTurner
Contributor Author

Yes sorry, some more context. This was also part of my recent on-week project. Today we don't have any APIs that expose zip-format data, at least partly because until #104851 we didn't really have a way to create such a thing in a streaming fashion, and our usual approach of creating the whole response in-memory first would risk making the node go OOM. I have a number of possible use-cases for this in mind, all somewhat to do with supportability:

  • Diagnostics bundle - today we have to use a separate tool to collect diagnostics bundles, which has the advantage of decoupling the bundle contents definition from the ES version, but also several disadvantages over a built-in API:

    • ES doesn't know that all the APIs it's hitting are related to a diagnostics request so it cannot schedule the work very effectively, and that can mean that the act of collecting diagnostics is sometimes itself harmful to the cluster.
    • Some customers struggle to install the separate tool.
    • It doesn't work especially reliably, e.g. in cases of an unstable cluster or flaky network.
    • It apparently involves an awful lot of extra ceremony to run in k8s.
  • Logs - we don't index all the logs that ES produces (notably omitting TRACE logs and detailed GC logs) so it involves a lot of extra work to get a hold of them if needed. Likewise customers often struggle to share the correct logs in support cases, typically picking a few log files from one node rather than gathering all relevant logs from the whole cluster, but if we had a built-in API to gather a logs bundle then we'd cut down massively on the time we waste in these cases.

  • Snapshot repository debugging - sometimes we get asked questions about space usage in a snapshot repository, or integrity issues, and it'd be awfully useful to have a dump of all the metadata in the repository for further investigation. Today that's essentially impossible to do, but again if we had a built-in API to expose the info then it'd be easy.

These are all to be discussed separately, but this PR introduces a common prerequisite for them.

@DaveCTurner
Contributor Author

Gentle reminder for reviews here if you have time :)

Member

@ywangd ywangd left a comment


Sorry for the extended delay. Honestly, this PR is a bit daunting to review. I just read it again and am still trying to build a full mental model of it. I do plan to come back to it soon, since otherwise I'll forget the details again. For the time being, I have left some comments and questions. Thanks!

Contributor Author

@DaveCTurner DaveCTurner left a comment


Thanks @ywangd, quite an interesting exercise coming back to a PR like this with fresh eyes after a few months. Your questions prompted some renaming/commentary/other cleanup that I hope helps make it easier to understand.

@DaveCTurner DaveCTurner requested a review from ywangd August 4, 2024 09:24
Member

@ywangd ywangd left a comment


I have a few more comments.

Comment on lines +316 to +322
/**
* Transfer {@link #currentEntryReleasable} into the supplied collection (i.e. add it to {@code releasables} and then clear
* {@link #currentEntryReleasable}). Called when the last chunk of the last part of the current entry is serialized, so that we can
* start serializing chunks of the next entry straight away whilst delaying the release of the current entry's resources until the
* transmission of the chunk that is currently under construction.
*/
private void transferCurrentEntryReleasable(ArrayList<Releasable> releasables) {
Member


This transfer of entry releasables is somewhat complex for me to follow. Conceptually, an entry releasable is released when the entry is completed or cancelled, which seems most intuitive. But we accumulate them for efficiency, to prioritize the ongoing byte writing? Could it also be a concern that we are not releasing them in time? I wonder whether it might be simpler to release them once per entry?

Contributor Author


We're delaying the release of the releasable until the bytes are actually sent. That's by design: the entry is still consuming some memory in the network layer until we hand those bytes off to the OS, so we shouldn't consider it as completed earlier.
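As a minimal, self-contained sketch of the delayed-release pattern being described (illustrative only — this is not the PR's code, and Releasable here is a stand-in interface):

import java.util.ArrayList;

// Sketch of the pattern discussed above: the releasable for an entry is handed over to the chunk
// that completes it, and is only closed once that chunk's bytes have actually been transmitted,
// rather than as soon as the next entry starts.
final class DelayedReleaseSketch {
    interface Releasable { void release(); }

    private Releasable currentEntryReleasable;

    DelayedReleaseSketch(Releasable firstEntryReleasable) {
        this.currentEntryReleasable = firstEntryReleasable;
    }

    // Analogue of transferCurrentEntryReleasable: move the current entry's resource into the
    // per-chunk collection and clear the field so the next entry can start straight away.
    void transferCurrentEntryReleasable(ArrayList<Releasable> releasables) {
        if (currentEntryReleasable != null) {
            releasables.add(currentEntryReleasable);
            currentEntryReleasable = null;
        }
    }

    public static void main(String[] args) {
        DelayedReleaseSketch sketch = new DelayedReleaseSketch(() -> System.out.println("entry resources released"));
        ArrayList<Releasable> chunkReleasables = new ArrayList<>();
        sketch.transferCurrentEntryReleasable(chunkReleasables);   // entry finished while building a chunk
        System.out.println("chunk handed to the network layer...");
        chunkReleasables.forEach(Releasable::release);             // released only after transmission
    }
}

The accumulating list matters because several entries may be completed while building a single chunk, and none of their resources are released until that chunk has actually gone out.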

@DaveCTurner DaveCTurner requested a review from ywangd August 5, 2024 08:58
nicktindall
nicktindall previously approved these changes Aug 5, 2024
Contributor

@nicktindall nicktindall left a comment


LGTM (once @ywangd's concerns are addressed also)

@DaveCTurner DaveCTurner dismissed nicktindall’s stale review August 5, 2024 12:58

clearing the approved flag until Yang has taken a look too

Member

@ywangd ywangd left a comment


A few more comments. I don't expect to find anything of significance, but I do plan to take yet another look. Thanks!

Comment on lines 379 to 382
// request aborted, nothing more to send (queue is being cleared by queueRefs#closeInternal)
isPartComplete = true;
isLastPart = true;
return new ReleasableBytesReference(BytesArray.EMPTY, () -> {});
Member


Why don't we throw AlreadyClosedException similar to what enqueueEntry does? The underlying channel should be closed at this point. So no need to be gentle here?

Contributor Author

@DaveCTurner DaveCTurner Aug 6, 2024


Fair point, no real need for an ACE in enqueueEntry either - see 8d73a28.

Member


The change of using Releasable as parameter for newEntryListener looks nice to me. In most cases, I think the caller indeed does not care about getting notified when the resource is released other than it gets released at some point. 👍


private void finishCurrentPart(ArrayList<Releasable> releasables) throws IOException {
    if (bodyPart.isLastPart()) {
        zipOutputStream.closeEntry();
Member


Can we add a comment to say that we don't set isLastPart = true here because we bridge the last part of this entry into the first part of the next entry, so that the caller sees only a continuous sequence of parts instead of having to be aware of entries? The fact that the networking code is unaware of entries, and that entries exist only on the producing side, is something I didn't immediately realize. In hindsight, it would have been helpful to see this first to build the mental model.

Contributor Author

@DaveCTurner DaveCTurner Aug 6, 2024


Sure, see 9d5c060.
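A conceptual sketch of the "bridging" idea discussed in this thread (illustrative only, not the PR's code): the producer thinks in entries made of parts, while the consumer just iterates a flat sequence of parts and never observes entry boundaries.

import java.util.Iterator;
import java.util.List;

// The producer-side view is a list of entries, each made of several parts; the consumer-side view
// is a single flat iterator of parts. Entry boundaries are crossed silently inside hasNext(), so
// the last part of one entry simply bridges into the first part of the next.
final class FlattenedPartsSketch implements Iterator<String> {
    private final Iterator<List<String>> entries;   // producer-side view: entries of parts
    private Iterator<String> currentEntryParts;

    FlattenedPartsSketch(List<List<String>> entryList) {
        this.entries = entryList.iterator();
        this.currentEntryParts = java.util.Collections.emptyIterator();
    }

    @Override
    public boolean hasNext() {
        while (currentEntryParts.hasNext() == false && entries.hasNext()) {
            currentEntryParts = entries.next().iterator(); // cross an entry boundary silently
        }
        return currentEntryParts.hasNext();
    }

    @Override
    public String next() {
        return currentEntryParts.next(); // consumer-side view: just the next part
    }

    public static void main(String[] args) {
        FlattenedPartsSketch parts = new FlattenedPartsSketch(
            List.of(List.of("e0-p0", "e0-p1"), List.of("e1-p0"))
        );
        parts.forEachRemaining(System.out::println); // prints the parts with no visible entry boundaries
    }
}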

Comment on lines +289 to +295
/**
* A cache for an empty list to be used to collect the {@code Releasable} instances to be released when the next chunk has been
* fully transmitted. It's a list because a call to {@link #encodeChunk} may yield a chunk that completes several entries, each of
* which has its own resources to release. We cache this value across chunks because most chunks won't release anything, so we can
* keep the empty list around for later to save on allocations.
*/
private ArrayList<Releasable> nextReleasablesCache = new ArrayList<>();
Member


Based on my understanding of the other comment, it is necessary to accumulate the currentEntryReleasable instances because the do/while loop around writeNextBytes may write more than one entry. Until these written bytes are actually sent later by the networking layer, we must retain the releasables for all of those entries. The releasables are released each time data is sent out, i.e. each time the response is paused or finished at the end. If this sounds correct, can we please somehow incorporate it into the comments, maybe here or somewhere else more suitable?

We cache this value across chunks because most chunks won't release anything, so we can keep the empty list around for later to save on allocations.

I am not sure this sentence helps the understanding. Unless my understanding above is wrong, I think the need for a list instead of a single entry is best explained by the fact that a loop of writeNextBytes can span multiple entries.

Contributor Author


The releasables are released each time data is sent out, i.e. each time the response is paused or finished at the end.

No, that's not right: the releasables are released when the chunk that completes the entries has been sent, but we do not wait all the way until the end (pause or finish) of the part that contains those entries.

Not sure what else to add to the comments to help here. Although the sentence about caching may not be what you're looking for, the preceding sentence explains why it's a list.

Member


Sorry, my comment was imprecise. I meant to say:

The releasables are packaged to be released each time data is sent out, i.e. each time the response is paused or finished at the end.

Essentially it refers to how we pass on the current list of entry releasables each time after the do/while loop.

Not sure what else to add to the comments to help here

I think it could help to add a comment right before the do/while loop to say that writeNextBytes can work through multiple entries and accumulate entry releasables, which are then released in a single batch once the processed bytes are fully sent out. It feels like a good complement to the comment here about "a call ... completes several entries".

@DaveCTurner DaveCTurner requested a review from ywangd August 6, 2024 07:16
private SubscribableListener<ChunkedRestResponseBodyPart> nextAvailableChunksListener;

/**
* A resource to be released when the transmission of the current entry is complete.
Member


Nit

Suggested change:
- * A resource to be released when the transmission of the current entry is complete.
+ * A resource to be released when the transmission of the current entry is complete.
+ * Multiple of them may be released in a single batch if their associated entries are transmitted together.

?

Contributor Author


Expanded in 5aa4c08.


@Override
public void onFailure(Exception e) {
    Releasables.closeExpectNoException(releasable);
Member


Might it be worth a logging message here, since I don't think this should happen normally, and it could lead to an unusable zip file if it does?

Contributor Author


No, this is fine; it's covered by the docs and by ChunkedZipResponseIT#testRandomZipResponse (see the comment about NPEs in handleZipRestRequest). It just means no entry is sent.

Comment on lines 134 to 135
* @param releasable A resource which is released when the entry has been completely processed: either fully sent, or else the request
* was cancelled and the response will not be used any further.
Member


Nit: It's also released when the entry is "skipped" (for lack of a better word). But maybe it's obvious from the code

Suggested change:
- * @param releasable A resource which is released when the entry has been completely processed: either fully sent, or else the request
- *                   was cancelled and the response will not be used any further.
+ * @param releasable A resource which is released when the entry has been skipped or completely processed: either fully sent, or else the request
+ *                   was cancelled and the response will not be used any further.

Contributor Author


Expanded in 5aa4c08.

Member

@ywangd ywangd left a comment


LGTM

Thanks for the iterations and the opportunity to review this work 👍

@DaveCTurner DaveCTurner added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Aug 7, 2024
@DaveCTurner
Contributor Author

Thanks both for the reviews!

@elasticsearchmachine elasticsearchmachine merged commit 4272164 into elastic:main Aug 7, 2024
15 checks passed
@DaveCTurner DaveCTurner deleted the 2024/06/17/ChunkedZipResponse branch August 7, 2024 07:20
weizijun added a commit to weizijun/elasticsearch that referenced this pull request Aug 7, 2024
* upstream/main: (132 commits)
  Fix compile after several merges
  Update docs with new behavior on skip conditions (elastic#111640)
  Skip on any instance of node or version features being present (elastic#111268)
  Skip on any node capability being present (elastic#111585)
  [DOCS] Publishes Anthropic inference service docs. (elastic#111619)
  Introduce `ChunkedZipResponse` (elastic#109820)
  [Gradle] fix esql compile cacheability (elastic#111651)
  Mute org.elasticsearch.datastreams.logsdb.qa.StandardVersusLogsIndexModeChallengeRestIT testTermsQuery elastic#111666
  Mute org.elasticsearch.datastreams.logsdb.qa.StandardVersusLogsIndexModeChallengeRestIT testMatchAllQuery elastic#111664
  Mute org.elasticsearch.xpack.esql.analysis.VerifierTests testMatchCommand elastic#111661
  Mute org.elasticsearch.xpack.esql.optimizer.LocalPhysicalPlanOptimizerTests testMatchCommandWithMultipleMatches {default} elastic#111660
  Mute org.elasticsearch.xpack.esql.optimizer.LocalPhysicalPlanOptimizerTests testMatchCommand {default} elastic#111659
  Mute org.elasticsearch.xpack.esql.optimizer.LocalPhysicalPlanOptimizerTests testMatchCommandWithWhereClause {default} elastic#111658
  LogsDB qa tests - add specific matcher for source (elastic#111568)
  ESQL: Move `randomLiteral` (elastic#111647)
  [ESQL] Clean up UNSUPPORTED type blocks (elastic#111648)
  ESQL: Remove the `NESTED` DataType (elastic#111495)
  ESQL: Move more out of esql-core (elastic#111604)
  Improve MvPSeriesWeightedSum edge case and add more tests (elastic#111552)
  Add link to flood-stage watermark exception message (elastic#111315)
  ...

# Conflicts:
#	server/src/main/java/org/elasticsearch/TransportVersions.java
rjernst pushed a commit to rjernst/elasticsearch that referenced this pull request Aug 7, 2024
Adds a utility for implementing REST APIs which construct a streaming
(i.e. pretty-much-constant-memory) `.zip` file response as a (pausable)
sequence of `ChunkedRestResponseBodyPart` instances, where each entry in
the `.zip` file is itself a (pausable) sequence of
`ChunkedRestResponseBodyPart` instances.

Relates elastic#104851
mhl-b pushed a commit that referenced this pull request Aug 8, 2024
Adds a utility for implementing REST APIs which construct a streaming
(i.e. pretty-much-constant-memory) `.zip` file response as a (pausable)
sequence of `ChunkedRestResponseBodyPart` instances, where each entry in
the `.zip` file is itself a (pausable) sequence of
`ChunkedRestResponseBodyPart` instances.

Relates #104851
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Aug 15, 2024
Similar to `ChunkedZipResponse` (elastic#109820) this utility allows
Elasticsearch to send an `XContent`-based response constructed out of a
sequence of `ChunkedToXContent` fragments, provided in a streaming and
asynchronous fashion.

This will enable elastic#93735 to proceed without needing to create a temporary
index to hold the intermediate results.
DaveCTurner added a commit that referenced this pull request Aug 19, 2024
Similar to `ChunkedZipResponse` (#109820) this utility allows
Elasticsearch to send an `XContent`-based response constructed out of a
sequence of `ChunkedToXContent` fragments, provided in a streaming and
asynchronous fashion.

This will enable #93735 to proceed without needing to create a temporary
index to hold the intermediate results.
cbuescher pushed a commit to cbuescher/elasticsearch that referenced this pull request Sep 4, 2024
Adds a utility for implementing REST APIs which construct a streaming
(i.e. pretty-much-constant-memory) `.zip` file response as a (pausable)
sequence of `ChunkedRestResponseBodyPart` instances, where each entry in
the `.zip` file is itself a (pausable) sequence of
`ChunkedRestResponseBodyPart` instances.

Relates elastic#104851
cbuescher pushed a commit to cbuescher/elasticsearch that referenced this pull request Sep 4, 2024
Similar to `ChunkedZipResponse` (elastic#109820) this utility allows
Elasticsearch to send an `XContent`-based response constructed out of a
sequence of `ChunkedToXContent` fragments, provided in a streaming and
asynchronous fashion.

This will enable elastic#93735 to proceed without needing to create a temporary
index to hold the intermediate results.
davidkyle pushed a commit to davidkyle/elasticsearch that referenced this pull request Sep 5, 2024
Similar to `ChunkedZipResponse` (elastic#109820) this utility allows
Elasticsearch to send an `XContent`-based response constructed out of a
sequence of `ChunkedToXContent` fragments, provided in a streaming and
asynchronous fashion.

This will enable elastic#93735 to proceed without needing to create a temporary
index to hold the intermediate results.