
Support automatic resizing for variable-size page blobs #329

Open · wants to merge 3 commits into main

Conversation

sebastianburckhardt (Member):

Our current implementation allocated a default of 512 GB for variable-sized page blobs. This was considered sufficient at the time, since these blobs do not grow forever by design (they are used for checkpoints and for object logs, both of which have a limited lifetime).

However, it turns out that this is not always enough (#312).

In this PR, I revised the mechanism so that:

  1. when a variable-sized page blob runs out of space, it is enlarged by doubling its size (repeatedly, until the requested write fits).
  2. when a read attempts to read past the end of the blob, the read is capped so it stays within the size bounds of the page blob (a minimal sketch of this capping follows below).

I have tested this manually by reducing the starting size to something very small and running a sample workload on it.
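
For readers following along, here is a minimal sketch of the read-capping idea in item 2. It is illustrative only; the method and parameter names are hypothetical, and the PR may track the blob size differently rather than querying properties on every read.

```csharp
using System;
using System.Threading.Tasks;
using Azure.Storage.Blobs.Specialized;

static class ReadCapSketch
{
    // Illustrative sketch: clamp a requested read range so it never extends past the
    // current end of the page blob. Parameter names are hypothetical.
    public static async Task<long> ComputeCappedReadLengthAsync(
        PageBlobClient pageBlob, long readAddress, long requestedLength)
    {
        long blobSize = (await pageBlob.GetPropertiesAsync().ConfigureAwait(false)).Value.ContentLength;
        long readEnd = Math.Min(readAddress + requestedLength, blobSize);
        return Math.Max(0, readEnd - readAddress);
    }
}
```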

@davidmrdavid (Member) left a comment:


Thanks for the quick solution here. I have various nits and questions, as I think this is a non-trivial fix. I think adding a few more comments will help with readability and future maintenance.

Comment on lines 487 to 497
```csharp
catch (Azure.RequestFailedException e) when (e.ErrorCode == "InvalidPageRange")
{
    // this kind of error can indicate that the page blob is too small.
    // first, compute desired size to request
    long currentSize = (await client.GetPropertiesAsync().ConfigureAwait(false)).Value.ContentLength;
    long sizeToRequest = currentSize;
    long sizeToAccommodate = destinationAddress + offset + length + 1;
    while (sizeToAccommodate > sizeToRequest)
    {
        sizeToRequest <<= 1;
    }
```
davidmrdavid (Member):

Can we log the current page blob size anywhere? Or at least log that we're increasing the blob size? I can imagine this is a good signal of a potentially unhealthy app: if we have to continuously increase the size, something may be wrong.

sebastianburckhardt (Member, Author):

The page blob size increase does already get logged as a warning, because of the ForceRetryException that we throw.
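
For context, a minimal sketch of what the enlargement step can look like once the target size is computed. The logger usage and method shape are assumptions, not the PR's exact code; PageBlobClient.ResizeAsync is the standard v12 SDK call.

```csharp
using System.Threading.Tasks;
using Azure.Storage.Blobs.Specialized;
using Microsoft.Extensions.Logging;

static class EnlargeSketch
{
    // Illustrative sketch: resize the page blob to the computed target size and log a
    // warning so that repeated enlargements are visible as a health signal.
    public static async Task EnlargeAndWarnAsync(
        PageBlobClient pageBlob, ILogger logger, long currentSize, long sizeToRequest)
    {
        logger.LogWarning(
            "Enlarging page blob {BlobName} from {CurrentSize} to {NewSize} bytes",
            pageBlob.Name, currentSize, sizeToRequest);

        await pageBlob.ResizeAsync(sizeToRequest).ConfigureAwait(false);

        // (In the PR itself, the warning currently comes from the ForceRetryException path
        // that retries the failed write; the explicit log line above just illustrates the
        // reviewer's suggestion.)
    }
}
```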

Comment on lines 428 to 431
```diff
- var size = this.segmentSize == -1 ? AzureStorageDevice.MAX_PAGEBLOB_SIZE : this.segmentSize;
+ var size = this.segmentSize == -1 ? AzureStorageDevice.STARTING_DEFAULT_PAGEBLOB_SIZE : this.segmentSize;
```
davidmrdavid (Member):

Would it make sense to also have a configurable max blob size? Sometimes it's good to fail fast if state is getting too large and to ask the user to deliberately opt into increasing some config.

sebastianburckhardt (Member, Author):

Setting a max size for page blobs in settings makes sense to me. Will add that.
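
A minimal sketch of what a configurable cap on the doubling could look like (the maxPageBlobSize setting name is hypothetical; the PR may expose and name it differently):

```csharp
using System;
using System.IO;

static class GrowthCapSketch
{
    // Illustrative sketch: double the current size until the target offset fits,
    // but fail fast once a configurable maximum is exceeded.
    // The maxPageBlobSize setting name is hypothetical.
    public static long ComputeSizeToRequest(long currentSize, long sizeToAccommodate, long maxPageBlobSize)
    {
        long sizeToRequest = currentSize;
        while (sizeToAccommodate > sizeToRequest)
        {
            sizeToRequest <<= 1;

            if (sizeToRequest > maxPageBlobSize)
            {
                throw new InvalidOperationException(
                    $"page blob would exceed the configured maximum of {maxPageBlobSize} bytes; " +
                    "increase the limit in the settings to deliberately opt into larger state");
            }
        }
        return sizeToRequest;
    }
}
```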

Comment on lines 619 to 620
```csharp
// this type of error can be caused by the page blob being smaller than where FASTER is reading from.
// to handle this, first determine current page blob size.
```
davidmrdavid (Member):

Not sure I fully understand this comment. What does it mean for the blob to be smaller than where FASTER is reading from? Can we add more detail?

sebastianburckhardt (Member, Author):

FASTER may be reading a range that extends past the end of the blob.

```diff
@@ -500,11 +534,17 @@ async Task ReadFromBlobAsync(UnmanagedMemoryStream stream, BlobUtilsV12.PageBlob
 {
     using (stream)
     {
         // we use this to prevent reading past the end of the page blob
```
davidmrdavid (Member):

I don't immediately understand why enlarging the blobs can suddenly cause reads to go past the end of the page blob. Can you please call out the connection?

sebastianburckhardt (Member, Author):

Because these blobs don't actually have a fixed size from the perspective of FASTER, the FASTER reads may easily read past the end of the blob. This is not new; we just never saw it because 512 GB was always far larger than needed.

davidmrdavid (Member):

I see. I think it would be good to write that in the inline comment being referenced here. This also answers my question here: #329 (comment)

```csharp
{
    throw new InvalidDataException($"wrong amount of data received from page blob, expected={length}, actual={stream.Position}");
    var client = (numAttempts > 1 || requestedLength == MAX_DOWNLOAD_SIZE) ? blob.Default : blob.Aggressive;
```
davidmrdavid (Member):

Can you comment on what this controls? What is blob.Aggressive?

sebastianburckhardt (Member, Author):

They control the length of the timeout. (With the newer storage SDK, the timeout can no longer be passed as a parameter to the operation; it is a constructor parameter for the client, so we need multiple clients to control the timeout length.)

By the way, this was not modified in this PR; the diff just does not show it well.
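
For context, here is a minimal sketch of how two clients with different network timeouts can be constructed in the v12 SDK (the names and timeout values are illustrative, not the actual BlobUtilsV12 code):

```csharp
using System;
using Azure.Core;
using Azure.Storage.Blobs;
using Azure.Storage.Blobs.Specialized;

static class TimeoutClientsSketch
{
    // Illustrative sketch: in the v12 SDK the network timeout is part of the client options,
    // so a "default" and a more "aggressive" (shorter-timeout) client are created up front.
    // The timeout values below are arbitrary examples.
    public static (PageBlobClient Default, PageBlobClient Aggressive) CreateClients(
        Uri blobUri, TokenCredential credential)
    {
        var defaultOptions = new BlobClientOptions();
        defaultOptions.Retry.NetworkTimeout = TimeSpan.FromSeconds(100);

        var aggressiveOptions = new BlobClientOptions();
        aggressiveOptions.Retry.NetworkTimeout = TimeSpan.FromSeconds(15);

        return (
            new PageBlobClient(blobUri, credential, defaultOptions),
            new PageBlobClient(blobUri, credential, aggressiveOptions));
    }
}
```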

Comment on lines 591 to 592
```csharp
// we have observed that we may get 206 responses where the actual length is less than the requested length
// this seems to only happen if blob was enlarged in the past
```
davidmrdavid (Member):

So what are we doing if we get a 206? I actually don't see us checking the response status explicitly; should we?
Nit: let's also add the meaning of 206 in parentheses: "partial response", right?

sebastianburckhardt (Member, Author):
The SDK internally takes care of interpreting the response status when I access the streaming response (response.Value). From my perspective, what I see in this case is that the returned stream has fewer bytes than requested.

By the way, I saw a few 206 responses early on during testing but later no longer saw them. I figured it makes sense to handle them just in case they pop up. It is pretty easy to handle them by adjusting the next portion to start where the 206 ended.
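
A minimal sketch of that continuation pattern (illustrative only, not the PR's actual read loop; it assumes the v12 DownloadStreamingAsync API):

```csharp
using System;
using System.IO;
using System.Threading.Tasks;
using Azure;
using Azure.Storage.Blobs.Models;
using Azure.Storage.Blobs.Specialized;

static class PartialReadSketch
{
    // Illustrative sketch: keep reading until the requested range is complete, restarting
    // each download at the position where the previous (possibly partial) response ended.
    public static async Task ReadRangeAsync(
        PageBlobClient pageBlob, Stream destination, long offset, long length)
    {
        long position = 0;
        while (position < length)
        {
            var options = new BlobDownloadOptions
            {
                Range = new HttpRange(offset + position, length - position),
            };
            var response = await pageBlob.DownloadStreamingAsync(options).ConfigureAwait(false);

            long before = destination.Position;
            await response.Value.Content.CopyToAsync(destination).ConfigureAwait(false);
            long copied = destination.Position - before;

            if (copied == 0)
            {
                throw new InvalidDataException("page blob returned no data for the requested range");
            }

            // if the service returned fewer bytes than requested (e.g. a 206 that ends early),
            // the next iteration starts where this one left off.
            position += copied;
        }
    }
}
```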

Base automatically changed from dev to main March 15, 2024 20:21
sebastianburckhardt added the "P2" (Priority 2) and "bug" (Something isn't working) labels on Apr 12, 2024