
Repeating timers on a no-longer-existing durable function and instance #312

Closed · rmbeaty opened this issue Oct 7, 2023 · 19 comments
Labels: bug (Something isn't working), P1 (Priority 1)

@rmbeaty

rmbeaty commented Oct 7, 2023

Hi,

I am using Netherite in Azure Functions. At one point I had a durable function that repeated on a durable timer. That class has since been deleted from the source, but I am still seeing activity for it in the log info; a subset is below (these entries seem to repeat every few seconds):

2023-10-07T13:39:28Z [Information] Part05.21148384599712 Replayed internal UpdateEvent TimerFired eventId=05T147841722 instanceId=@durableaiimagegenerator@AIImageGenerator_0 pos=(21148384599712,0) latency=(0, 49094, 0)
2023-10-07T13:39:28Z [Information] Part05.21148384601104 Replayed internal UpdateEvent TimerFired eventId=05T147841723 instanceId=@durableaiimagegenerator@AIImageGenerator_0 pos=(21148384601104,0) latency=(0, 49094, 0)
2023-10-07T13:39:28Z [Information] Part05.21148384602496 Replayed internal UpdateEvent TimerFired eventId=05T147841724 instanceId=@durableaiimagegenerator@AIImageGenerator_0 pos=(21148384602496,0) latency=(0, 49094, 0)
2023-10-07T13:39:28Z [Information] Part05.21148384603884 Replayed internal UpdateEvent TimerFired eventId=05T147841725 instanceId=@durableaiimagegenerator@AIImageGenerator_0 pos=(21148384603884,0) latency=(0, 49094, 0)
2023-10-07T13:39:28Z [Information] Part05.21148384605276 Replayed internal UpdateEvent TimerFired eventId=05T147841726 instanceId=@durableaiimagegenerator@AIImageGenerator_0 pos=(21148384605276,0) latency=(0, 49094, 0)
2023-10-07T13:39:28Z [Information] Part05.21148384606668 Replayed internal UpdateEvent TimerFired eventId=05T147841727 instanceId=@durableaiimagegenerator@AIImageGenerator_0 pos=(21148384606668,0) latency=(0, 49094, 0)
2023-10-07T13:39:28Z [Information] Part05.21148384608060 Replayed internal UpdateEvent TimerFired eventId=05T147841728 instanceId=@durableaiimagegenerator@AIImageGenerator_0 pos=(21148384608060,0) latency=(0, 49094, 0)
2023-10-07T13:39:28Z [Information] Part05.21148384609448 Replayed internal UpdateEvent TimerFired eventId=05T147841729 instanceId=@durableaiimagegenerator@AIImageGenerator_0 pos=(21148384609448,0) latency=(0, 49094, 0)
2023-10-07T13:39:28Z [Information] Part05.21148384610840 Replayed internal UpdateEvent TimerFired eventId=05T147841730 instanceId=@durableaiimagegenerator@AIImageGenerator_0 pos=(21148384610840,0) latency=(0, 49094, 0)
2023-10-07T13:39:28Z [Information] Part05.21148384612232 Replayed internal UpdateEvent TimerFired eventId=05T147841731 instanceId=@durableaiimagegenerator@AIImageGenerator_0 pos=(21148384612232,0) latency=(0, 49094, 0)

Any ideas on how I can make these stop? I tried recreating the durable function durableaiimagegenerator and instantiating a new entity with the same ID AIImageGenerator_0, thinking it might let the events process, but no luck.

@sebastianburckhardt
Member

Logs that say "Replayed internal UpdateEvent" are emitted when a partition is recovering. It is normal for those events to be things that happened in the past (the log being replayed is an "event source" that records all partition state updates and is replayed to recover the latest partition state). However, partitions are also supposed to checkpoint relatively often (about once a minute), at which point the log is trimmed.

So, if you keep seeing those old events over and over, it suggests that something is amiss with the partition; for example, it may be stuck in an infinite recovery cycle. If this is running on a Consumption or Premium plan, I can take a look at our internal telemetry if you provide me with the application name.
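
To illustrate the mechanism, here is a purely conceptual C# sketch (the types are invented for illustration; this is not Netherite's actual code): recovery replays every update event recorded after the last checkpoint, and the log is only trimmed once a new checkpoint succeeds, so a partition that never manages to checkpoint keeps replaying the same old events on every restart.

using System.Collections.Generic;
using System.Linq;

// Invented types, for illustration only.
record UpdateEvent(long Position, string Description);
record Checkpoint(long Position, List<string> State);

static class PartitionRecoverySketch
{
    // Replays all events recorded after the last checkpoint; each replayed event
    // corresponds to one "Replayed internal UpdateEvent ..." trace line. The log is
    // only trimmed when a new checkpoint succeeds (roughly once a minute), so if
    // checkpointing keeps failing, recovery replays the same events again and again.
    public static Checkpoint Recover(Checkpoint last, IEnumerable<UpdateEvent> commitLog)
    {
        var state = new List<string>(last.State);
        long position = last.Position;
        foreach (var evt in commitLog.Where(e => e.Position > last.Position))
        {
            state.Add(evt.Description); // apply the update to the in-memory partition state
            position = evt.Position;
        }
        return new Checkpoint(position, state);
    }
}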

@rmbeaty
Author

rmbeaty commented Oct 24, 2023 via email

@sebastianburckhardt
Member

After investigating the telemetry, I can confirm that the problem (as of right now) is that partition 5 is caught in an infinite recovery cycle. Every time, immediately after recovering, it throws an exception:

Azure.RequestFailedException: The page range specified is invalid.
  at Azure.Storage.Blobs.PageBlobRestClient.UploadPagesAsync(Int64 contentLength, Stream body, Byte[] transactionalContentMD5, Byte[] transactionalContentCrc64, Nullable`1 timeout, String range, String leaseId, String encryptionKey, String encryptionKeySha256, Nullable`1 encryptionAlgorithm, String encryptionScope, Nullable`1 ifSequenceNumberLessThanOrEqualTo, Nullable`1 ifSequenceNumberLessThan, Nullable`1 ifSequenceNumberEqualTo, Nullable`1 ifModifiedSince, Nullable`1 ifUnmodifiedSince, String ifMatch, String ifNoneMatch, String ifTags, CancellationToken cancellationToken)
   at Azure.Storage.Blobs.Specialized.PageBlobClient.UploadPagesInternal(Stream content, Int64 offset, UploadTransferValidationOptions transferValidationOverride, PageBlobRequestConditions conditions, IProgress`1 progressHandler, Boolean async, CancellationToken cancellationToken)
   at Azure.Storage.Blobs.Specialized.PageBlobClient.UploadPagesAsync(Stream content, Int64 offset, Byte[] transactionalContentHash, PageBlobRequestConditions conditions, IProgress`1 progressHandler, CancellationToken cancellationToken)
   at DurableTask.Netherite.Faster.AzureStorageDevice.<>c__DisplayClass39_1.<<WritePortionToBlobAsync>b__0>d.MoveNext() in /_/src/DurableTask.Netherite/StorageLayer/Faster/AzureBlobs/AzureStorageDevice.cs:line 454
--- End of stack trace from previous location ---
   at DurableTask.Netherite.Faster.BlobManager.PerformWithRetriesAsync(SemaphoreSlim semaphore, Boolean requireLease, String name, String intent, String data, String target, Int32 expectedLatencyBound, Boolean isCritical, Func`2 operationAsync, Func`1 readETagAsync) in /_/src/DurableTask.Netherite/StorageLayer/Faster/AzureBlobs/StorageOperations.cs:line 68

The cause is unclear, nor can I tell what the "invalid page range" is. This is likely some malfunction inside FASTER; my guess is that it somehow recovered to an invalid state.

To help us troubleshoot this, it would be helpful if you edit the following lines in host.json, to increase the verbosity of the storage access tracing:

{
  "extensions": {
    "durableTask": {
      "storageProvider": {
        "StorageLogLevelLimit": "Trace"
      }
    }
  }
}

Regardless, I will also try to see what happened to partition 5 right before this symptom started appearing.

@rmbeaty
Author

rmbeaty commented Oct 24, 2023 via email

@sebastianburckhardt
Member

Thanks.

Since this seems to be some kind of data corruption problem, starting with a fresh task hub would probably eliminate the issue, at least temporarily; it is possible the problem comes back if the original cause recurs. Based on the telemetry, it would appear that the original occurrence of this problem is rare... the current corruption of partition 5 happened back in September.

@rmbeaty
Author

rmbeaty commented Oct 25, 2023 via email

@rmbeaty
Author

rmbeaty commented Oct 29, 2023

I've updated host.json to log more information with "StorageLogLevelLimit": "Trace". Please have a look. I would like to retain this hub, rather than start a new one, because it has prod data in it.

sebastianburckhardt added the bug (Something isn't working) and P1 (Priority 1) labels and removed the Needs: Triage 🔍 and P2 (Priority 2) labels Oct 30, 2023
sebastianburckhardt self-assigned this Oct 30, 2023
@sebastianburckhardt
Member

Thanks for the traces, they were helpful. I have found the cause.

What happens is that the total amount of data written when taking a checkpoint exceeds the size of the page blob (512 GB).

Page blobs can be resized dynamically without losing their content, so we should be able to recover the data. I will work on a fix.
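
For reference, the underlying Azure SDK call that makes such in-place growth possible is PageBlobClient.ResizeAsync. The sketch below is only an illustration of that call, not the actual fix (which lands inside Netherite's storage layer); the connection string, container, and blob names are placeholders.

using System.Threading.Tasks;
using Azure.Storage.Blobs.Specialized;

class ResizePageBlobSketch
{
    static async Task Main()
    {
        // Placeholders: point this at the page blob that hit its size limit.
        var pageBlob = new PageBlobClient(
            "<storage-connection-string>",
            "<container-name>",
            "<blob-name>");

        // Page blob sizes must be a multiple of 512 bytes; resizing preserves the existing pages.
        long oneTiB = 1024L * 1024 * 1024 * 1024;
        await pageBlob.ResizeAsync(oneTiB);
    }
}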

Note that even after we resolve this, it may be worth taking some measures to keep the total amount of data stored in a task hub in check, since Netherite (unlike Azure Storage) is not equipped to handle "unlimited" amounts of data. At some point, performance starts to suffer.

Here are some things to consider:

  1. It is usually a good idea to purge completed orchestrations after a while (e.g. see the discussion in Blob Storage Continually Growing using Netherite #229); a sketch of a scheduled purge follows this list.
  2. If you have a significant amount of data that you want to store persistently for a long time, consider writing it to external storage that is suited to long-term retention.
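
As an illustration of point 1, here is a minimal sketch of a scheduled purge, assuming the in-process C# programming model for Durable Functions; the function name, schedule, and 14-day retention window are placeholders rather than anything prescribed by Netherite.

using System;
using System.Threading.Tasks;
using DurableTask.Core;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;
using Microsoft.Extensions.Logging;

public static class PurgeOldInstances
{
    // Runs once a day and deletes the history of finished instances older than 14 days,
    // so the task hub does not grow without bound.
    [FunctionName("PurgeOldInstances")]
    public static async Task Run(
        [TimerTrigger("0 0 3 * * *")] TimerInfo timer,
        [DurableClient] IDurableOrchestrationClient client,
        ILogger log)
    {
        PurgeHistoryResult result = await client.PurgeInstanceHistoryAsync(
            createdTimeFrom: DateTime.MinValue,
            createdTimeTo: DateTime.UtcNow.AddDays(-14),
            runtimeStatus: new[]
            {
                OrchestrationStatus.Completed,
                OrchestrationStatus.Failed,
                OrchestrationStatus.Terminated
            });

        log.LogInformation("Purged {Count} instance histories.", result.InstancesDeleted);
    }
}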

@rmbeaty
Author

rmbeaty commented Oct 30, 2023 via email

@sebastianburckhardt
Member

I have implemented a fix (#329) that should unblock your partition once we can get it to you.

> can you tell if there is a certain type of data taking most of the space

There is no quick way to see this type of information right now (maybe something we should consider monitoring).
Are you using entities and/or orchestrations?

> The objects we are storing are probably not more than 255k or so

Yep, that does not explain the 512 GB. Based on the load information (you can also see this information in the partition table), there are only about 4.6k instances stored on this partition, so even if each is 255k, that is still less than 1.2 GB.

> can you tell if there is a certain type of data taking most of the space

There is no logging being stored inside the task hub. Sometimes entities accumulate a lot of space in their metadata. Also, frequent updates tend to generate more data, since multiple versions are being stored.

@rmbeaty
Author

rmbeaty commented Oct 31, 2023 via email

@davidmrdavid
Member

davidmrdavid commented Nov 2, 2023

@rmbeaty:

We just published a Netherite private package with @sebastianburckhardt's fix here: https://durabletaskframework.visualstudio.com/Durable%20Task%20Framework%20CI/_artifacts/feed/durabletask/NuGet/Microsoft.Azure.DurableTask.Netherite/overview/1.4.1-blob-growth-fix-1

The version is 1.4.1-blob-growth-fix-1. To be able to install it, you need to add the following key to your nuget.config:

<add key="durabletask" value="https://durabletaskframework.pkgs.visualstudio.com/734e7913-2fab-4624-a174-bc57fe96f95d/_packaging/durabletask/nuget/v3/index.json" />
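
For context, that entry goes under packageSources; a minimal nuget.config sketch is shown below (the nuget.org source is just the usual default, not something specific to this package):

<?xml version="1.0" encoding="utf-8"?>
<configuration>
  <packageSources>
    <add key="nuget.org" value="https://api.nuget.org/v3/index.json" />
    <add key="durabletask" value="https://durabletaskframework.pkgs.visualstudio.com/734e7913-2fab-4624-a174-bc57fe96f95d/_packaging/durabletask/nuget/v3/index.json" />
  </packageSources>
</configuration>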

Please try it out, let us know once you do so, and whether your application appears to recover after some time. Thanks!

@rmbeaty
Author

rmbeaty commented Nov 2, 2023 via email

@davidmrdavid
Member

@rmbeaty: the source branch can be seen in this PR: #329

I'll investigate those 1.4.x upgrade issues. I seem to recall similar reports.

@davidmrdavid
Member

@rmbeaty: the known upgrade issues to 1.4.0 were thought to be addressed in 1.4.1. Just to confirm: when you say "I saw the same issue with trying to upgrade to 1.4", do you mean that you saw these failures in 1.4.0 but not in 1.4.1, or do you mean that you see these "silent failures" in both versions? Note that I'm asking about the official releases, not the private package I sent you a few hours ago.

@sebastianburckhardt
Member

Just to make sure, I did a quick test and I did not see any failures to open the task hub. So, I would need to know more details about the failure to debug it.

(for the test, I created a task hub with 1.3.5 and then opened it with the version in PR #329)

@rmbeaty
Author

rmbeaty commented Nov 2, 2023 via email

@davidmrdavid
Member

@rmbeaty: Any updates here? We should be able to help if you can provide us any more details about the kind of failure you're seeing.

Since you have an internal support case with us, you can work with your support point of contact to schedule a meeting with us to help you debug, for example. I'll ask the support person on our end to see if we can set that up.

@sebastianburckhardt
Member

Closing for inactivity. Feel free to reopen if needed.
