
Add new logs to recovery path #337

Merged: 5 commits into dev on Feb 13, 2024
Conversation

@davidmrdavid (Member) commented Jan 3, 2024:

We've seen a few cases where partitions fail to transition from the Starting to the Started state, instead going straight to Terminated. In these cases, it appears the partition is getting stuck in the recover step.

To help diagnose future cases, I'm adding a log after fht.RecoverAsync is called, so we can know for certain whether we're returning from that call.

I've also added a Details column to the FasterAzureStorageAccessCompleted event, which contains more verbose information about the total range of data we're trying to access.
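
For orientation, the shape of the first change is roughly this (a sketch only; the logger call shown here is illustrative, not necessarily the exact API in the repo):

// Sketch, not the PR's exact code: log unconditionally once recovery
// returns, so traces can distinguish "stuck inside RecoverAsync" from
// "returned but failed later".
await this.fht.RecoverAsync();
this.TraceHelper.FasterProgress("Returned from FasterKV.RecoverAsync");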

@davidmrdavid changed the title from "Add simple log after fht.RecoverAsync" to "Add new logs to recovery path" on Jan 4, 2024
@davidmrdavid left a comment:

I like this new Position column, but I worry about two things:

(1) The fact that this is in a new column may exacerbate our column corruption issues. Are you against reusing a pre-existing column like Details? Do you see a reason why Position should be its own column if it's mostly "0" for most operations?

(2) Since a sufficiently large Read operation will be split into several smaller reads, I would like to emit a kind of "operation ID" to the storage access logs so that we can group smaller reads as part of a larger operation. A position column sort of gives us this same data (we can assume that a sequence of reads with increasing starting positions comes from the same large read operation), but it isn't as foolproof as an "operation ID" field. Do we have an ID like this (I see a candidate id parameter in ReadFromBlobAsync) and, if so, can we log it? See the sketch below.
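
What I have in mind is roughly this (a sketch; the logging call and the surrounding names are assumptions, not the repo's exact code):

// Sketch only: reuse the 'id' already passed to ReadFromBlobAsync as a
// per-operation ID, and emit it with every chunked sub-read so that the
// sub-reads can later be grouped back into one logical read.
long position = sourceAddress + offset;               // start of this chunk
this.BlobManager?.StorageTracer?.FasterStorageProgress(
    $"id={id} position={position} length={chunkLength}");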

Comment on lines 381 to 382
[Event(266, Level = EventLevel.Verbose, Version = 3)]
public void FasterAzureStorageAccessCompleted(string Account, string TaskHub, int PartitionId, string Intent, long Position, long Size, string Operation, string Target, double Latency, int Attempt, string AppName, string ExtensionVersion)
@davidmrdavid:

Suggested change:
-[Event(266, Level = EventLevel.Verbose, Version = 3)]
-public void FasterAzureStorageAccessCompleted(string Account, string TaskHub, int PartitionId, string Intent, long Position, long Size, string Operation, string Target, double Latency, int Attempt, string AppName, string ExtensionVersion)
+[Event(266, Level = EventLevel.Verbose, Version = 2)]
+public void FasterAzureStorageAccessCompleted(string Account, string TaskHub, int PartitionId, string Intent, long Position, long Size, string Operation, string Target, double Latency, int Attempt, string AppName, string ExtensionVersion)

Should this have been Version=2?

@sebastianburckhardt:

Just in case you already ran this in the cloud, I decided to bump the version again.

@davidmrdavid:

Oh, I didn't run it on Azure yet.

Comment on lines 454 to 462
var position = destinationAddress + offset;
long originalStreamPosition = stream.Position;
await this.BlobManager.PerformWithRetriesAsync(
BlobManager.AsynchronousStorageWriteMaxConcurrency,
true,
"PageBlobClient.UploadPagesAsync",
"WriteToDevice",
$"id={id} length={length} destinationAddress={destinationAddress + offset}",
position,
$"id={id} position={position} length={length}",
@davidmrdavid:

Help me understand: why is position here destinationAddress + offset and not sourceAddress + offset? Perhaps I misunderstand what source and destination correspond to in this case.

@sebastianburckhardt:

The position is meant to be the offset within the page blob in storage.

Since this is a write (from an in-memory stream to the page blob), the page blob position in this case is the destination.
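
Schematically (illustrative, not the repo's code):

// write:  memory (source) -> page blob (destination)
//         blob-side offset of the chunk = destinationAddress + offset
// read:   page blob (source) -> memory (destination)
//         blob-side offset of the chunk = sourceAddress + offset
long position = destinationAddress + offset; // this path is a write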

@davidmrdavid:

If "the page blob position [...] is the destination", then is it a problem that it we currently have var position = destinationAddress + offset instead of just destination?

@sebastianburckhardt:

The write is broken into smaller chunks because there is a max on how many bytes can be written in a single access (see WriteToBlobAsync, which then calls WritePortionToBlobUnsafeAsync multiple times). Each chunk's blob-side position is therefore destinationAddress + offset, not destinationAddress alone; see the sketch below.
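
A simplified sketch of that chunking (constants and signatures here are illustrative, not the repo's exact code):

// Illustrative sketch of WriteToBlobAsync's chunking. MAX_UPLOAD_SIZE
// stands in for the per-call page blob write limit.
const int MAX_UPLOAD_SIZE = 4 * 1024 * 1024;

async Task WriteToBlobAsync(Stream stream, long destinationAddress, long length)
{
    long offset = 0;
    while (offset < length)
    {
        int chunkLength = (int)Math.Min(length - offset, MAX_UPLOAD_SIZE);
        // blob-side position of this chunk: this is why the log records
        // destinationAddress + offset rather than destinationAddress alone
        long position = destinationAddress + offset;
        await this.WritePortionToBlobUnsafeAsync(stream, position, chunkLength);
        offset += chunkLength;
    }
}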

@sebastianburckhardt:

> do you see a reason why Position should be its own column if it's mostly "0" for most operations

Quantitatively speaking, I'm not sure "most" operations will have zero (there are often a lot of page blob accesses).

In ETW writing an integer is a lot more efficient than writing a string (even if the integer is zero).

I am not sure whether this still matters once the data reaches Kusto. I got the impression at times that it is better to keep columns consistently typed: if it's all numbers, Kusto can detect that (e.g., the ElapsedMs column is typed as 'real'). However, it may also be that writing an empty string is better than writing 0. Not sure, really.

I don't have a strong opinion on this.
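
To illustrate the trade-off (a hypothetical EventSource, not the repo's actual events):

using System.Diagnostics.Tracing;

sealed class SketchEventSource : EventSource
{
    // Typed integer column: cheap to serialize in ETW, and arrives in
    // Kusto as a numeric column.
    [Event(1, Level = EventLevel.Verbose)]
    public void AccessCompleted(long Position) => this.WriteEvent(1, Position);

    // String column: more expensive to write, even when the payload is
    // just "0" folded into a Details-style string.
    [Event(2, Level = EventLevel.Verbose)]
    public void AccessCompletedAsString(string Details) => this.WriteEvent(2, Details);
}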

> I would like to emit a kind of "operation ID" to the storage access logs so that we can group smaller reads as part of a larger operation.

That is possible, but I am not sure it is worth it long term (we already have it in the detailed tracing).

If you really want to add this too, then it may not be worth keeping separate columns.

@davidmrdavid:

> I would like to emit a kind of "operation ID" to the storage access logs so that we can group smaller reads as part of a larger operation.

> That is possible, but I am not sure it is worth it long term (we already have it in the detailed tracing).

> If you really want to add this too, then it may not be worth keeping separate columns.

Yeah, I think I would like to add this "operation ID", simply because most customers don't have detailed tracing enabled (for good reason!), which means we may be missing key information when diagnosing an already-mitigated incident. In that case, I think we agree it may be best to merge the columns / add them to Details. Mind if I add this to the PR, or do you want to add it yourself?

@sebastianburckhardt:

I have revised this according to PR feedback.

@davidmrdavid:

LGTM

@sebastianburckhardt sebastianburckhardt merged commit 0abc16d into dev Feb 13, 2024
1 check passed