-
Notifications
You must be signed in to change notification settings - Fork 298
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow user to specify MaxHistoryEvents and constraint history loading to include the latest generation only #1119
base: main
Are you sure you want to change the base?
Conversation
How do we expect the user to choose a good An alternative idea to consider: measure time required to read the history, and let the user configure the maximum time. |
@AnatoliB - I also considered measuring time instead, but I think that's a much more unstable statistic - the time to read from Azure Storage can vary greatly depending on network availability, IO resource constraints, throttling from Azure Storage, etc. So I think that, as a signal, it can vary greatly between measurements (one load of some history might be slow, but it might be fast again a bit later). So I think measuring the history size might provide more predictable behavior for users, and for ourselves. It also correspond more closely to the terminology we use in our best practices doc - to keep history sizes small, not "fast". As for guidance - I think that'll be a matter of documentation. For instance - the IntelliSense could suggest a default (in my experience - anything in the low thousands is fine, ideally below 1k, but apps are generally healthy enough up to ~10k events) I'd be happy to add that here, and in the DF-level PR (which will need to re-export this feature) |
return tableEntitiesResponseInfo; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will remove this extra whitespace before merging. Did not notice it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems the downside of doing the extra sentinel load only occurs when we don't know the execution ID, correct? If that's the case, it seems like it will only happen in a smaller number of conditions, like when the new events don't contain execution IDs, like if the new events are only event raised events, etc. If that's true, then I'm not really worried about the extra load. This new logic wouldn't impact the most common scenarios.
Some code specific comments:
|
||
// Fail the orchestrator by creating Terminate event | ||
ExecutionTerminatedEvent terminateEvent = new ExecutionTerminatedEvent(-1, $"Orchestrator exceeded maximum history size: {this.settings.MaxHistoryEvents}"); | ||
history.Events.Add(terminateEvent); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be better to add this logic to CompleteTaskOrchestrationWorkItemAsync
rather than doing it here. That way you can terminate the orchestration without needing to load and execute the excessive history, which is required in this case. There would be some logging benefits as well since the termination happens on the same logical thread as the orchestration execution, rather than on a background thread, which is the case for history loading.
@@ -226,22 +225,36 @@ public override async Task<OrchestrationHistory> GetHistoryEventsAsync(string in | |||
|
|||
return new OrchestrationHistory(historyEvents, checkpointCompletionTime, eTagValue, trackingStoreContext); | |||
} | |||
|
|||
#nullable enable | |||
async Task<TableEntitiesResponseInfo<DynamicTableEntity>> GetHistoryEntitiesResponseInfoAsync(string instanceId, string expectedExecutionId, IList<string> projectionColumns, CancellationToken cancellationToken = default(CancellationToken)) | |||
{ | |||
var sanitizedInstanceId = KeySanitation.EscapePartitionKey(instanceId); | |||
string filterCondition = TableQuery.GenerateFilterCondition(PartitionKeyProperty, QueryComparisons.Equal, sanitizedInstanceId); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like we're always overwriting this variable value with this change. To make the intent more clear, I suggest we rename this variable to to something like instanceIdFilterCondition
and then declare filterCondition
below (on line 256) as a new variable.
@davidmrdavid I'm sure we can recommend a number, but my concern is about being confident that using this number is better than having no limit at all. By suggesting any specific number, we can be easily off by orders of magnitude. If the number is significantly higher than what a specific configuration can manage (considering all the factors: I/O performance, memory, etc.), this limit will not matter. If the number is significantly lower, it will unnecessarily hurt a potentially healthy app. So, why would the customer want to set any specific number we suggest? What value are they getting from it? Where would their peace of mind come from? As a customer, I would want to set this limit only if it is connected to a metric that is truly meaningful to me: e.g. time, memory consumption, etc. Based on my use case, I can often claim that, for example, spending more than 5 minutes on loading history feels way too long. We can address timing fluctuations by measuring averages, medians, or percentiles instead of acting on any single measurement. I'm not strictly against introducing a limit in the proposed form, but I do worry that it's easy to measure but difficult to meaningfully apply. At the very least, we need to provide enough guidance for the user to understand when 1k is better than 10k or 100k, and not just suggest a seemingly arbitrary number. |
I see your point and agree that history size much less tangible to customers than measuring time. Let me make a few comments on your points to help provide more context on how I view this.
Yes, that's true. However, for me to measure averages/median/etc I need to be able to save multiple history timing samples per instanceID. Today, we don't have the storage for that - we only have the history and instances table for long-term storage. In theory, I suppose we could augment either table to start storing this sort of data, but that would come at an extra customer cost (more columns in some table) as being a more intrusive change. In general, if we have a good place to store these measurements (so we can calculate averages/medians and other aggregate statistics), I'd be more open to making this configuration work in terms of time.
Yeah, I see your point. One possible reason why 1k is particularly good, is that Azure Storage can only load 1k rows at a table from tables. Therefore, any history larger than 1k rows requires multiple reads to load into memory. So perhaps that's a principled reason to suggest 1k rows as having some special significance? I still believe the ultimately users need to handle this on a case-by-case basis, but Azure Storage is already giving special significance to 1k rows so perhaps we can rely on that precedent.
One way to help them decide what number to choose is to write a detector that uses telemetry to determine what history sizes take too long to load, and have the detector dynamically suggest a number depending on the size of those histories and how long they took to load. We already have a "slow histories" detector, so we could easily augment that to provide personalized guidance. |
I think we could avoid the extra query for execution ID if we create a special ExecuteQueryAsunc. Azure storage has pagination, but we are buffering it all into a list. With a special implementation, we could start the query, grab the first execution ID that appears, then stop loading more pages when the execution ID changes. |
This PR makes two small (diff-wise) but significant changes to the Azure Storage provider.
New MaxHistoryEvents configuration
First - it introduces a new
MaxHistoryEvents
configuration. It allows the user to specify the maximum number of events they want to allow an orchestration to have before it is automatically terminated. Many incidents, and sometimes outages, result from orchestrators having too large of a history, creating application instability. This configuration should give users peace of mind so that, even if they accidentally generate a large history, the framework will protect itself and terminate offending orchestrations. This is an opt-in feature.ContinueAsNew bug fix
It also fixes a longstanding bug that was only recently discovered - it appears that, in many cases, an orchestrator that was "continued as new" will still load fragments of it's old history (those that were not already overridden by the new history). This is because we don't always know the latest executionID of an orchestrator (which allows us to distinguish between generations), meaning that we loaded more data than necessary. The end result was this: if an orchestrator instanceID ever generated a history of 10k records, it did not matter that later generations only generated 1k records, we would always load 10k (the 1k of the current generation, and 9k stale records that haven't been overridden yet).
The fix for this bug is simple: we need to get the latest instanceID by first reading only sentinel row of the History table, which reports the latest executionID. Then, using that executionID as an extra filter, we read the history as normal. This extra table read will be a new cost to our users, but it is necessary for correctness. I'm hoping the cost is low enough to be 'ok'.
I did try to make this work without the extra table read, by a paginated query and read the history table "in parts" and stopped as soon as the executionID of the results changed, which tells us that now we're reading a stale history. However, if we did this, we would stop before reading the sentinel row (the sentinel row is the last row of the table), which DTFx needs for it's
CheckpointCompletedTimestamp
which it uses to determine out of order messages. In the end, I found it simpler and safer to make that extra request, but this is up for debate.