-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add new logs to recovery path #337
Conversation
fht.RecoverAsync
… FasterStorageAccessCompleted
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this new Position
column, but I worry about two things:
(1) The fact that this is in a new column may exacerbate our column corruption issues. Are you against re-using a pre-existing column like Details
? / do you see a reason why Position should be its own column if it's mostly "0" for most operations?
(2) Since a sufficiently large Read operation will be split into several smaller reads, I would like to emit a kind of "operation ID" to the storage access logs so that we can group smaller reads as part of a larger operation. A position column sort of gives us this same data (we can assume that a sequence of reads with increasing starting positions come from the same large read operation) but it isn't as fool proof as an "operation ID" field. Do we have an ID like this (I see a candidate id
parameter in ReadFromBlobAsync
) and can we log it if so?
[Event(266, Level = EventLevel.Verbose, Version = 3)] | ||
public void FasterAzureStorageAccessCompleted(string Account, string TaskHub, int PartitionId, string Intent, long Position, long Size, string Operation, string Target, double Latency, int Attempt, string AppName, string ExtensionVersion) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Event(266, Level = EventLevel.Verbose, Version = 3)] | |
public void FasterAzureStorageAccessCompleted(string Account, string TaskHub, int PartitionId, string Intent, long Position, long Size, string Operation, string Target, double Latency, int Attempt, string AppName, string ExtensionVersion) | |
[Event(266, Level = EventLevel.Verbose, Version = 2)] | |
public void FasterAzureStorageAccessCompleted(string Account, string TaskHub, int PartitionId, string Intent, long Position, long Size, string Operation, string Target, double Latency, int Attempt, string AppName, string ExtensionVersion) |
Should this have been Version=2?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just in case you already ran this in the cloud I decided to bump the version again
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I didn't run it on Azure yet
var position = destinationAddress + offset; | ||
long originalStreamPosition = stream.Position; | ||
await this.BlobManager.PerformWithRetriesAsync( | ||
BlobManager.AsynchronousStorageWriteMaxConcurrency, | ||
true, | ||
"PageBlobClient.UploadPagesAsync", | ||
"WriteToDevice", | ||
$"id={id} length={length} destinationAddress={destinationAddress + offset}", | ||
position, | ||
$"id={id} position={position} length={length}", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Help me understand: why is position here destinationAddress +offset
and not sourceAddress + offset
? Perhaps I misunderstand what source
and destination
correspond to in this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the position is meant to be the offset within the page blob in storage.
Since this is a write (from an in-memory stream to the page blob), the page blob position in this case is the destination.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If "the page blob position [...] is the destination", then is it a problem that it we currently have var position = destinationAddress + offset
instead of just destination
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The write is broken into smaller chunks because there is a max on how many bytes can be written in a single access (see WriteToBlobAsync
which then calls WritePortionToBlobUnsafeAsync
multiple times)
quantitatively speaking, not sure if "most" operations will have zero (there are often a lot of page blob accesses). In ETW writing an integer is a lot more efficient than writing a string (even if the integer is zero). I am not sure if this still matters once the data reaches Kusto. I got the impression at times that it is better to keep columns typed consistently. If its all numbers Kusto can detect that (e.g. column ElapsedMs is typed as 'real'). However, it may also be that writing an empty string is better than writing 0. Not sure really. I don't have a strong opinion on this.
That is possible but I am not sure it is worth it long term (we already have it in the detailed tracing). If you really want to add this also then it may not be worth it to keep separate columns |
Yeah I think I would like to add this "operation ID" simply because most customers don't have detailed tracing enabled (for good reason!) which means we may be missing key information when diagnosing an already mitigated incident. In that case, I think we agree it may be best to merge the columns / add them to |
I have revised this according to PR feedback. |
LGTM |
We've seen a few cases where partitions fail to transition from the
Starting
to theStarted
state, instead going straight toTerminated
. In these cases, it appears the partition is getting stuck in the recover step.To help with diagnosing future cases, I'm adding a log after
fht.RecoverAsync
is called so we can know for certain if we're returning from that call.I've also added a
Details
column to theFasterAzureStorageAccessCompleted
which contains more verbose information on the total range of data we're trying to access.