Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Differentiate stats for the same blobstore operation with purposes #99615

Merged
merged 23 commits into from
Oct 2, 2023

Conversation

ywangd
Copy link
Member

@ywangd ywangd commented Sep 18, 2023

Today blobstore stats are collected against each HTTP operation, e.g. Get, List. This is not granular enough because the same HTTP operration can be performed for different purposes, e.g. cluster state, indices or translog. This PR adds a new Purpose enum to provide further breakdown for the same HTTP operation.

Relates: ES-6800

Comment on lines 84 to 85
@Nullable
private final Purpose purpose;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would much prefer this not to be nullable, let's represent the cases that would be null as a proper enum value instead. Indeed at some point in the future I think we'll want to distinguish metadata and data operations too, and this seems like a good mechanism for that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks David. I replaced null with Purpose.Generic in 0a39942

@@ -29,6 +32,14 @@ public interface BlobStore extends Closeable {
*/
void deleteBlobsIgnoringIfNotExists(Iterator<String> blobNames) throws IOException;

default void deleteBlobsIgnoringIfNotExists(Iterator<String> blobNames, @Nullable Purpose purpose) throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also much prefer not to have an implicit default for the purpose parameter like this. Experience shows that we will forget to set it in future work unless the caller is required to make an explicit choice.

That said, temporarily adding an implicit default is a reasonable strategy for avoiding breaking the serverless build, but it must only be temporary. Ideally we'd mark the overloads that don't take the Purpose as @Deprecated(forRemoval = true).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed especially since this interface method is pretty new. I will go through the deprecation and removal process once we are happy with the proposed approach in this PR.

@ywangd ywangd added >enhancement :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs labels Sep 20, 2023
@elasticsearchmachine
Copy link
Collaborator

Hi @ywangd, I've created a changelog YAML for you.

@ywangd ywangd marked this pull request as ready for review September 20, 2023 02:56
@elasticsearchmachine elasticsearchmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Sep 20, 2023
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

Comment on lines 32 to 49
@Deprecated(forRemoval = true)
void deleteBlobsIgnoringIfNotExists(Iterator<String> blobNames) throws IOException;

// TODO: Remove the default implementation and require each blob store to implement this method. Once it's done, remove the
// the above overload version that does not take the Purpose parameter.
/**
* Delete all the provided blobs from the blob store. Each blob could belong to a different {@code BlobContainer}
* @param blobNames the blobs to be deleted
* @param purpose the purpose of the {@code BlobContainer} associated to the blobs to be deleted. It should be set
* to {@code Purpose.GENERIC}, if the blobs are from multiple {@code BlobContainer}s.
*/
default void deleteBlobsIgnoringIfNotExists(Iterator<String> blobNames, Purpose purpose) throws IOException {
if (purpose == Purpose.GENERIC) {
deleteBlobsIgnoringIfNotExists(blobNames);
} else {
throw new UnsupportedOperationException();
}
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As David suggested, I will perform the cleanup of these two methods in a follow-up and avoid breaking builds on either side.

@ywangd
Copy link
Member Author

ywangd commented Sep 20, 2023

@elasticmachine update branch

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think attaching the Purpose to the path to the container will be ok for our current needs, but I worry that it won't be flexible enough for some related future work: we want to distinguish data from metadata operations when taking a snapshot, and I can see value in being able to distinguish searchable-snapshot-related read operations from regular snapshot reads too.

Have we considered making this a per-operation property rather than a property of the BlobContainer? This will involve changing more call sites here of course but no more real complexity and a great deal more flexibility in future.

@henningandersen
Copy link
Contributor

Have we considered making this a per-operation property rather than a property of the BlobContainer? This will involve changing more call sites here of course but no more real complexity and a great deal more flexibility in future.

I agree, I'd like to at least try out that approach too to see how big of a change it is.

@ywangd
Copy link
Member Author

ywangd commented Sep 20, 2023

Thanks for the feedback. I was under the impression that once a BlobContainer is created, it can be used for only a single purpose since it deals with a flat list of files and no nested sub-folders. But apparently this was an invalid assumption. We can store both data and metadata under the same blobPath, e.g. shard level metadata and data files. I will take another attempt to make the change per operation.

@ywangd
Copy link
Member Author

ywangd commented Sep 20, 2023

Assuming "per operation" means "per method of the BlobContainer" class, I think it is achievable without having to change signature of every method. We still attach purpose to BlobContainer so that it is easily referenceable in all methods without having to change method signatures. Add a new BlobContainer.withPurpose(Purpose) method that returns a new BlobContainer with the given purpose. At each call site, we call withPurpose first when necessary and subsequent operation methods can remain unchanged. So it is roughly something like the follows:

class BlobContainer {
    Purpose purpose = Purpose.SNAPSHOT;  // default

    BlobContainer withPurpose(Purpose purpose) {
        return new BlobContainer(..., purpose);
    }

    // operation method
    void writeBlob(...) {
        // the method can reference purpose as instance variable
    }

    ...
}

// Sample call site usage
class TranslogFileUploadTask {
    void doRun() {
        blobContainer.withPurpose(Purpose.TRANSLOG)  // configure the purpose per operation
            .writeBlob(...);  // as is today
    }
}

I think the above can achieve the "per operation" variety with likely less cascading code changes. What do you think?

@DaveCTurner
Copy link
Contributor

Add a new BlobContainer.withPurpose(Purpose) method that returns a new BlobContainer with the given purpose.

I'd rather not create a new BlobContainer for each method as would happen if we did that. And this approach would effectively make the purpose optional, which isn't what we want. It'd be better in the long run to just add a parameter throughout.

Comment on lines +62 to +65
@Deprecated(forRemoval = true)
default InputStream readBlob(String blobName) throws IOException {
return readBlob(OperationPurpose.SNAPSHOT, blobName);
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, methods marked as deprecated in this class will be removed once the necessary changes are made in the other repo. They are here to avoid breaking builds for now.

@ywangd
Copy link
Member Author

ywangd commented Sep 26, 2023

Thanks for the reviews. I have updated the PR accordingly. The most notable update is that I removed the changes to actual stats collection so that this PR focuses on adding the new purpose parameter (as suggested here). This is a good call since I think we are not entirely aligned on the new stats structure. Deferring it gives us more time for discussion. Note that I still kept the refactoring for getMetricsCollector since it has no functional change and will be generally more useful in follow-ups.

@ywangd
Copy link
Member Author

ywangd commented Sep 28, 2023

I believe all reviews have been addressed. I'd appreciate if you could take another look. Thanks!

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a couple more (somewhat superficial) comments.

final RequestMetricCollector multiPartUploadMetricCollector;
final RequestMetricCollector deleteMetricCollector;
final RequestMetricCollector abortPartUploadMetricCollector;
private final StatsCollectors statsCollectors = new StatsCollectors();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we revert these changes to stats collection too? That way this PR is just about adding the purpose parameter to all the APIs, which is noisy but low-risk, and follow-ups which change the behaviour like here will be easier to review and will let us pin down any later problems with git bisect.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure I reverted these changes as well. I believe this PR is now a pure refactor. Hence I adjusted the labels accordingly as well.


package org.elasticsearch.common.blobstore;

public enum BlobPurpose {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for nitpicking about naming here but I don't like the new name BlobPurpose. We're not distinguishing blobs by purpose, it's the operations on them.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK I reverted it back to OperationPurpose.

@ywangd ywangd requested a review from DaveCTurner October 2, 2023 08:17
Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM thanks for the extra iterations Yang

Copy link
Contributor

@henningandersen henningandersen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Left a few minor comments but no need for another round. Thanks for the extra iterations on this.


final BlobPath blobPath = repository.basePath().add(randomAlphaOfLength(10));
final BlobContainer blobContainer = blobStore.blobContainer(blobPath);
final OperationPurpose purpose = randomValueOtherThan(OperationPurpose.SNAPSHOT, () -> randomFrom(OperationPurpose.values()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder why we exclude SNAPSHOT here? I'd expect the test to also succeed for that and furthermore for the validation of SNAPSHOT to be equally valid?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test had two parts where the 1st part uses SNAPSHOT and the 2nd part randomizes from purposes other than SNAPSHOT. Since this PR is now a pure refactor, the test mostly lost its purpose. Now it really just shows the refactor has no impact on output. In this case, you are right that there is no need to exclude SNAPSHOT. I have updated accordingly.

@@ -243,9 +244,14 @@ public void deleteBlobsIgnoringIfNotExists(Iterator<String> blobNames) throws IO
}
}

private void deletePartition(AmazonS3Reference clientReference, List<String> partition, AtomicReference<Exception> aex) {
private void deletePartition(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: all other places have purpose as first arg, I'd like to keep it consistent, so can we move it to be first here too?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching it. It was my intention to keep it consistently but apparently missed this one and the one below. Corrected now.

@@ -264,7 +270,7 @@ private void deletePartition(AmazonS3Reference clientReference, List<String> par
}
}

private static DeleteObjectsRequest bulkDelete(S3BlobStore blobStore, List<String> blobs) {
private static DeleteObjectsRequest bulkDelete(S3BlobStore blobStore, List<String> blobs, OperationPurpose purpose) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: all other places have purpose as first arg, I'd like to keep it consistent, so can we move it to be first here too?

expectThrows(IOException.class, () -> container.readBlob(blobName, content.length + 1, content.length).read());
expectThrows(
IllegalArgumentException.class,
() -> container.readBlob(OperationPurpose.SNAPSHOT, blobName, -1, content.length).read()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to randomize the purpose here (and in many places elsewhere). I think it can be a follow-up though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep we should have follow-ups for actual usages of the purpose parameter. In this PR, all callsite simply uses the SNAPSHOT as the default.

* @param blobName
* The name of the blob whose existence is to be determined.
* @return {@code true} if a blob exists in the {@link BlobContainer} with the given name, and {@code false} otherwise.
* @param purpose The purpose of the operation, useful for stats collection.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there are several potential uses of this, stats collection, storage classes, priority. Perhaps we can move the examples to OperationPurpose javadoc and leave the example out here (and in other methods in this class)?

Suggested change
* @param purpose The purpose of the operation, useful for stats collection.
* @param purpose The purpose of the operation

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. I removed it from individual method and added it (with a bit more explanation) to the Enum class. Btw, the other usages sound pretty exciting!


package org.elasticsearch.common.blobstore;

public enum OperationPurpose {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see us mentioning stats collection in every method javadoc where this is passed. I think that is only one use out of potentially several. Instead, we could supply that information here and leave the method level javadoc without the use case.

@ywangd ywangd added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Oct 2, 2023
@elasticsearchmachine elasticsearchmachine merged commit 5628392 into elastic:main Oct 2, 2023
@ywangd ywangd deleted the s3-more-granular-metrics branch October 2, 2023 10:37
jakelandis pushed a commit to jakelandis/elasticsearch that referenced this pull request Oct 2, 2023
…lastic#99615)

Today blobstore stats are collected against each HTTP operation, e.g.
Get, List. This is not granular enough because the same HTTP operration
can be performed for different purposes, e.g. cluster state, indices or
translog. This PR adds a new Purpose enum to provide further breakdown
for the same HTTP operation. 

Relates: ES-6800
ywangd added a commit to ywangd/elasticsearch that referenced this pull request Oct 3, 2023
All their usages have been replaced by corresponding new versions.

Relates: elastic#99615
ywangd added a commit that referenced this pull request Oct 3, 2023
All their usages have been replaced by corresponding new versions.

Relates: #99615
ywangd added a commit to ywangd/elasticsearch that referenced this pull request Oct 3, 2023
A new no-op OperationPurpose parameter is added in elastic#99615 to all blob
store/container operation method. This PR updates the s3 stats
collection code to actually use this parameter for finer grained stats
collection and reports.

Stats are reported per combination of operation and operation purpose. A
sample output is as the follows:

```
{
  "ListObjects": 2,
  "GetObject": 1,
  "PutObject": 2,
  "PutMultipartObject": 0,
  "AbortMultipartObject": 0,
  "DeleteObjects": 1,
  "GetObject/ClusterState": 1,
  "PutObject/ClusterState": 1,
  "DeleteObjects/Translog": 1,
  "ListObjects/Indices": 1
}
```

The changes are made with BWC in mind, i.e. existing stats reports with
default operation purpose will remain unchanged. For an example, the key
"ListObjects" is equivalent "ListObjects/Snapshot". But we omit the
default purpose in the stats key so that it is backwards compatible.

Relates: elastic#99615
Relates: ES-6800
ywangd added a commit that referenced this pull request Oct 9, 2023
A new no-op OperationPurpose parameter is added in #99615 to all blob
store/container operation method. This PR updates the s3 stats
collection code to actually use this parameter for finer grained stats
collection and reports. This differentiation between purposes are kept 
internally for now. The stats are currently aggregated over operations for 
existing stats reporting. This means responses from both 
GetRepositoriesMetering API and GetBlobStoreStats API will not be changed. 
We will have follow-ups to expose the finer stats separately.

Relates: #99615
Relates: ES-6800
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >refactoring Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. v8.11.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants