
Add new logs to recovery path #337

Merged: 5 commits into dev on Feb 13, 2024
Conversation

@davidmrdavid (Member) commented Jan 3, 2024:

We've seen a few cases where partitions fail to transition from the Starting to the Started state, instead going straight to Terminated. In these cases, it appears the partition is getting stuck in the recover step.

To help diagnose future cases, I'm adding a log after fht.RecoverAsync is called, so we can know for certain whether we're returning from that call.

I've also added a Details column to the FasterAzureStorageAccessCompleted event, which contains more verbose information about the total range of data we're trying to access.
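
For orientation, the shape of the first change is roughly this (a sketch only; the logger call shown here is illustrative, not necessarily the exact API in the repo):

// Sketch, not the PR's exact code: log unconditionally once recovery
// returns, so traces can distinguish "stuck inside RecoverAsync" from
// "returned but failed later".
await this.fht.RecoverAsync();
this.TraceHelper.FasterProgress("Returned from FasterKV.RecoverAsync");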

@davidmrdavid changed the title from "Add simple log after fht.RecoverAsync" to "Add new logs to recovery path" on Jan 4, 2024
@davidmrdavid left a comment:

I like this new Position column, but I worry about two things:

(1) The fact that this is in a new column may exacerbate our column corruption issues. Are you against reusing a pre-existing column like Details? Do you see a reason why Position should be its own column if it's mostly "0" for most operations?

(2) Since a sufficiently large Read operation will be split into several smaller reads, I would like to emit a kind of "operation ID" to the storage access logs so that we can group smaller reads as part of a larger operation. A position column sort of gives us this same data (we can assume that a sequence of reads with increasing starting positions comes from the same large read operation), but it isn't as foolproof as an "operation ID" field. Do we have an ID like this (I see a candidate id parameter in ReadFromBlobAsync) and, if so, can we log it? See the sketch below.
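
What I have in mind is roughly this (a sketch; the logging call and the surrounding names are assumptions, not the repo's exact code):

// Sketch only: reuse the 'id' already passed to ReadFromBlobAsync as a
// per-operation ID, and emit it with every chunked sub-read so that the
// sub-reads can later be grouped back into one logical read.
long position = sourceAddress + offset;               // start of this chunk
this.BlobManager?.StorageTracer?.FasterStorageProgress(
    $"id={id} position={position} length={chunkLength}");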

Comment on lines 381 to 382
[Event(266, Level = EventLevel.Verbose, Version = 3)]
public void FasterAzureStorageAccessCompleted(string Account, string TaskHub, int PartitionId, string Intent, long Position, long Size, string Operation, string Target, double Latency, int Attempt, string AppName, string ExtensionVersion)
@davidmrdavid:

Suggested change:
-[Event(266, Level = EventLevel.Verbose, Version = 3)]
-public void FasterAzureStorageAccessCompleted(string Account, string TaskHub, int PartitionId, string Intent, long Position, long Size, string Operation, string Target, double Latency, int Attempt, string AppName, string ExtensionVersion)
+[Event(266, Level = EventLevel.Verbose, Version = 2)]
+public void FasterAzureStorageAccessCompleted(string Account, string TaskHub, int PartitionId, string Intent, long Position, long Size, string Operation, string Target, double Latency, int Attempt, string AppName, string ExtensionVersion)

Should this have been Version=2?

@sebastianburckhardt:

Just in case you already ran this in the cloud, I decided to bump the version again.

@davidmrdavid:

Oh, I didn't run it on Azure yet.

Comment on lines 454 to 462
var position = destinationAddress + offset;
long originalStreamPosition = stream.Position;
await this.BlobManager.PerformWithRetriesAsync(
BlobManager.AsynchronousStorageWriteMaxConcurrency,
true,
"PageBlobClient.UploadPagesAsync",
"WriteToDevice",
$"id={id} length={length} destinationAddress={destinationAddress + offset}",
position,
$"id={id} position={position} length={length}",
@davidmrdavid:

Help me understand: why is position here destinationAddress + offset and not sourceAddress + offset? Perhaps I misunderstand what source and destination correspond to in this case.

@sebastianburckhardt:

The position is meant to be the offset within the page blob in storage.

Since this is a write (from an in-memory stream to the page blob), the page blob position in this case is the destination.
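
Schematically (illustrative, not the repo's code):

// write:  memory (source) -> page blob (destination)
//         blob-side offset of the chunk = destinationAddress + offset
// read:   page blob (source) -> memory (destination)
//         blob-side offset of the chunk = sourceAddress + offset
long position = destinationAddress + offset; // this path is a write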

@davidmrdavid:

If "the page blob position [...] is the destination", then is it a problem that it we currently have var position = destinationAddress + offset instead of just destination?

@sebastianburckhardt:

The write is broken into smaller chunks because there is a max on how many bytes can be written in a single access (see WriteToBlobAsync, which then calls WritePortionToBlobUnsafeAsync multiple times). Each chunk's blob-side position is therefore destinationAddress + offset, not destinationAddress alone; see the sketch below.
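
A simplified sketch of that chunking (constants and signatures here are illustrative, not the repo's exact code):

// Illustrative sketch of WriteToBlobAsync's chunking. MAX_UPLOAD_SIZE
// stands in for the per-call page blob write limit.
const int MAX_UPLOAD_SIZE = 4 * 1024 * 1024;

async Task WriteToBlobAsync(Stream stream, long destinationAddress, long length)
{
    long offset = 0;
    while (offset < length)
    {
        int chunkLength = (int)Math.Min(length - offset, MAX_UPLOAD_SIZE);
        // blob-side position of this chunk: this is why the log records
        // destinationAddress + offset rather than destinationAddress alone
        long position = destinationAddress + offset;
        await this.WritePortionToBlobUnsafeAsync(stream, position, chunkLength);
        offset += chunkLength;
    }
}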

@sebastianburckhardt:

> do you see a reason why Position should be its own column if it's mostly "0" for most operations

Quantitatively speaking, I'm not sure "most" operations will have zero (there are often a lot of page blob accesses).

In ETW writing an integer is a lot more efficient than writing a string (even if the integer is zero).

I am not sure whether this still matters once the data reaches Kusto. I got the impression at times that it is better to keep columns consistently typed: if it's all numbers, Kusto can detect that (e.g., the ElapsedMs column is typed as 'real'). However, it may also be that writing an empty string is better than writing 0. Not sure, really.

I don't have a strong opinion on this.
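
To illustrate the trade-off (a hypothetical EventSource, not the repo's actual events):

using System.Diagnostics.Tracing;

sealed class SketchEventSource : EventSource
{
    // Typed integer column: cheap to serialize in ETW, and arrives in
    // Kusto as a numeric column.
    [Event(1, Level = EventLevel.Verbose)]
    public void AccessCompleted(long Position) => this.WriteEvent(1, Position);

    // String column: more expensive to write, even when the payload is
    // just "0" folded into a Details-style string.
    [Event(2, Level = EventLevel.Verbose)]
    public void AccessCompletedAsString(string Details) => this.WriteEvent(2, Details);
}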

> I would like to emit a kind of "operation ID" to the storage access logs so that we can group smaller reads as part of a larger operation.

That is possible, but I am not sure it is worth it long term (we already have it in the detailed tracing).

If you really want to add this too, then it may not be worth keeping separate columns.

@davidmrdavid:

> I would like to emit a kind of "operation ID" to the storage access logs so that we can group smaller reads as part of a larger operation.

> That is possible, but I am not sure it is worth it long term (we already have it in the detailed tracing).

> If you really want to add this too, then it may not be worth keeping separate columns.

Yeah, I think I would like to add this "operation ID", simply because most customers don't have detailed tracing enabled (for good reason!), which means we may be missing key information when diagnosing an already-mitigated incident. In that case, I think we agree it may be best to merge the columns / add them to Details. Mind if I add this to the PR, or do you want to add it yourself?

@sebastianburckhardt:

I have revised this according to PR feedback.

@davidmrdavid:

LGTM

@sebastianburckhardt sebastianburckhardt merged commit 0abc16d into dev Feb 13, 2024
1 check passed