This repository has been archived by the owner on Jul 30, 2024. It is now read-only.
/ NuGet.Jobs Public archive

Status aggregator job #494

Merged
merged 53 commits into from
Jul 30, 2018

scottbommarito

NuGet/NuGetGallery#6107

This job aggregates incidents and exports them in a blob that can be consumed by other services.

It maintains the state of every incident, groups incidents that affect the same component at the same time, and reports each group as an event.

Incidents are parsed through their titles. We discard incidents that we can't parse, and maintain a cursor that describes which incidents have been processed.

The job will create up to two messages for each event. If all incidents attached to an event are mitigated within a short period of time, the job will never create any messages for an event. Otherwise, the job will create a message for the start of an event and the end of an event.
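
The message rule described above can be sketched roughly as follows. This is only an illustration: EventMessagePolicy and both method names are hypothetical, and the only detail taken from the diff is the comparison of the cursor against the event start time plus a delay.

```csharp
using System;

// Hypothetical helper, not part of the job's code.
public static class EventMessagePolicy
{
    // A start message is due only once the event has stayed open longer
    // than the configured delay, so short-lived events never get a message.
    public static bool ShouldCreateStartMessage(
        DateTime cursor, DateTime eventStart, TimeSpan startMessageDelay)
        => cursor > eventStart + startMessageDelay;

    // Assumption: an end message is only posted when a start message exists
    // and the event has actually ended.
    public static bool ShouldCreateEndMessage(bool startMessageExists, DateTime? eventEnd)
        => startMessageExists && eventEnd.HasValue;
}
```

Under these assumptions an event gets either zero messages or a start and end pair, which matches the "up to two messages" behavior in the description.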

See https://nuget-dev-0-status.azurewebsites.net/ for an example of how the data exported by this job will be used.

private const int _defaultEventEndDelayMinutes = 10;
private const int _defaultEventVisibilityPeriod = 10;

private static void AddConfiguration(IServiceCollection serviceCollection, IDictionary<string, string> jobArgsDictionary)

Contributor

AddConfiguration [](start = 28, length = 16)

Any chance we could use JSON files like in the validation jobs? There are quite a lot of arguments here.

Author

I'm not super familiar with how the JSON configuration works, but I can check it out.

public static void Main(string[] args)
{
    var job = new Job();
    JobRunner.Run(job, args).Wait();

Contributor

Wait [](start = 37, length = 4)

Wait() should be GetAwaiter().GetResult()

Author

We should do this for our other jobs as well. Filed NuGet/NuGetGallery#6224
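
The reviewer does not give a reason, but one commonly cited difference is how failures surface: Wait() wraps the exception in an AggregateException, while GetAwaiter().GetResult() rethrows the original exception. A minimal sketch, with BlockingDemo and FailAsync as hypothetical names:

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical demo class, not part of the job's code.
public static class BlockingDemo
{
    private static async Task FailAsync()
    {
        await Task.Delay(1).ConfigureAwait(false);
        throw new InvalidOperationException("job failed");
    }

    // Blocks with Wait() and reports the type of the exception that surfaces.
    public static string ExceptionTypeFromWait()
    {
        try { FailAsync().Wait(); return "none"; }
        catch (Exception e) { return e.GetType().Name; }
    }

    // Blocks with GetAwaiter().GetResult() and reports the exception type.
    public static string ExceptionTypeFromGetResult()
    {
        try { FailAsync().GetAwaiter().GetResult(); return "none"; }
        catch (Exception e) { return e.GetType().Name; }
    }
}
```

The first method reports AggregateException and the second reports InvalidOperationException, so logs and catch blocks see the real failure with the second form.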

? ""
: $" and CreateDate gt datetime'{cursor.ToString("o")}'";

return $"$filter=OwningTeamId eq '{_incidentApiTeamId}'{cursorPart}";

Contributor

{cursorPart} [](start = 67, length = 12)

This might read a little better if you flip it around: if cursor != DateTime.MinValue, append the CreateDate bit to the query string. Feel free to ignore :)

Author

        private string GetRecentIncidentsQuery(DateTime cursor)
        {
            var query = $"$filter=OwningTeamId eq '{_incidentApiTeamId}'";

            if (cursor != DateTime.MinValue)
            {
                query += $" and CreateDate gt datetime'{cursor.ToString("o")}'";
            }

            return query;
        }

Looks much better, good catch!

{
parsedIncident = null;
var title = incident.Title;
var match = Regex.Match(title, _regExPattern);

Contributor

Match [](start = 34, length = 5)

I forget all the requirements around Regex, but, I know there's a timeout one. I would double check what's necessary to run regex.

Author

Looked at other uses of Regex in our code, it appears the requirement is that we add a timeout. Will add a timeout of 5 seconds.
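
A timeout like the one mentioned above can be passed straight to the static Regex.Match overload. The sketch below is an assumption about how this might look, not the merged code; TitleMatcher and TryMatch are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the job's code.
public static class TitleMatcher
{
    // The 5-second value comes from the reply above.
    private static readonly TimeSpan MatchTimeout = TimeSpan.FromSeconds(5);

    public static bool TryMatch(string title, string pattern, out Match match)
    {
        try
        {
            // This overload throws RegexMatchTimeoutException if matching
            // takes longer than the supplied timeout.
            match = Regex.Match(title, pattern, RegexOptions.None, MatchTimeout);
            return match.Success;
        }
        catch (RegexMatchTimeoutException)
        {
            // Treat a timed-out match like an unparseable title.
            match = Match.Empty;
            return false;
        }
    }
}
```

Treating a timeout as a failed parse fits the job's existing behavior of discarding incidents it cannot parse, though whether to log or rethrow instead is a design choice this sketch does not settle.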

{
public static class ServiceProviderExtensions
{
public static T GetService<T>(this IServiceProvider provider)

Contributor

GetService [](start = 24, length = 10)

This already exists: GetRequiredService<T>

Author

Removed and replaced the uses of this new extension

public override Task Run()
{
    return _serviceProvider
        .GetService<StatusAggregator>()

Contributor

GetService [](start = 17, length = 10)

You can use GetRequiredService<StatusAggregator> here.

ITableWrapper table,
ILogger<Cursor> logger)
{
_table = table;

Contributor

@loic-sharma, Jul 27, 2018

table [](start = 21, length = 5)

I would add null checks to all constructors. From my experience, it makes catching null reference exceptions easier :)
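
A minimal sketch of the suggested constructor null checks. ITableStub, ILoggerStub and CursorSketch are hypothetical stand-ins for the real ITableWrapper, ILogger<Cursor> and Cursor types so that the example is self-contained.

```csharp
using System;

// Hypothetical stand-ins for the real dependency types.
public interface ITableStub { }
public interface ILoggerStub { }

public class CursorSketch
{
    private readonly ITableStub _table;
    private readonly ILoggerStub _logger;

    public CursorSketch(ITableStub table, ILoggerStub logger)
    {
        // Fail at construction time, naming the missing argument, instead of
        // failing later with a NullReferenceException at the first use.
        _table = table ?? throw new ArgumentNullException(nameof(table));
        _logger = logger ?? throw new ArgumentNullException(nameof(logger));
    }
}
```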

using StatusAggregator.Table;


namespace StatusAggregator

Contributor

StatusAggregator [](start = 10, length = 16)

There are two empty lines above this one.

Author

good catch!

{
value = DateTime.MinValue;
_logger.LogInformation("Could not fetch cursor.");
}

Contributor

It's not super clear from this log that the cursor will be set to the beginning of time. Maybe log the value here? Feel free to ignore

Author

I'll add that to the log

var shouldDeactivate = !incidentsLinkedToEventQuery
    .Where(i => i.IsActive || i.MitigationTime > cursor - _eventEndDelay)
    .ToList()
    .Any();

Contributor

This ToList() forces all incidents to be inspected. This LINQ query will do less work if you remove the ToList() and do just Any().

Author

The IQueryable returned by _table.CreateQuery doesn't support Any, so I have to do ToList.

Contributor

Ah gotcha :(


using (_logger.Scope("Creating message for start of event."))
{
if (cursor > eventEntity.StartTime + _eventStartMessageDelay)
{

Contributor

I think this would be easier to read if you invert the "error" flow:

if (!condition1)
{
  _logger.LogInformation("Reason 1 for not doing anything");
  return;
}

if (!condition2)
{
  _logger.LogInformation("Reason 2 for not doing anything");
  return;
}

Author

good catch!


var componentNames = path.Split(Constants.ComponentPathDivider);
var name = string.Join(" ", componentNames.Skip(1).Reverse());
var boldedPart = $"<b>{name} {boldedPartInnerString} {_componentStatusNames[eventEntity.AffectedComponentStatus].ToLowerInvariant()}.</b>";

Contributor

_componentStatusNames[eventEntity.AffectedComponentStatus].ToLowerInvariant() [](start = 66, length = 77)

Why not eventEntity.AffectedComponentStatus.ToString().ToLowerInvariant()?

The current code will break if we ever modify ComponentStatus to have values that aren't sequential (ex: Up = 0, Down = 1, Degraded = 3).
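
The failure mode described above can be reproduced with a small sketch. SketchStatus is a hypothetical enum using the non-sequential values from the comment; the real ComponentStatus may differ.

```csharp
using System;

// Hypothetical enum with a gap, as in the example values above.
public enum SketchStatus { Up = 0, Down = 1, Degraded = 3 }

public static class StatusNames
{
    // Enum.GetNames returns one entry per member, so this array has length 3.
    private static readonly string[] Names = Enum.GetNames(typeof(SketchStatus));

    // Indexing by numeric value assumes the values are 0..N-1 with no gaps.
    public static string ByIndex(SketchStatus status)
        => Names[(int)status].ToLowerInvariant();

    // ToString does not depend on the numeric values at all.
    public static string ByToString(SketchStatus status)
        => status.ToString().ToLowerInvariant();
}
```

With these values, ByIndex(SketchStatus.Degraded) throws IndexOutOfRangeException because index 3 is past the end of a three-element array, while ByToString returns "degraded".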

}
}

private bool TryGetContentsForEventStartForEvent(EventEntity eventEntity, out string contents)

Contributor

@loic-sharma, Jul 27, 2018

TryGetContentsForEventStartForEvent [](start = 21, length = 35)

This code is hard to understand. Can we simplify it?

Some ideas:

  1. Get rid of variables like _youMayEncounterIssues and just place the text directly into the message. It's fine to have duplication.
  2. Instead of passing parameters like boldedPartInnerString and suffix, pass a message template. Something like <b>{ComponentName} is {ComponentStatus}</b>. {EventMessage}. and <b>{ComponentName} is no longer {ComponentStatus}</b>. {EventMessage}. Thank you for your patience.

Author

Done.

But this makes the package publishing degraded message less informative, unfortunately.
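
One way the message-template idea might look is sketched below. It uses positional string.Format placeholders in place of the named ones in the suggestion, and MessageTemplates and Render are hypothetical names rather than what was merged.

```csharp
using System;

// Hypothetical helper, not part of the job's code.
public static class MessageTemplates
{
    // {0} = component name, {1} = component status, {2} = event message.
    public const string EventStart = "<b>{0} is {1}</b>. {2}";
    public const string EventEnd =
        "<b>{0} is no longer {1}</b>. {2} Thank you for your patience.";

    public static string Render(
        string template, string componentName, string componentStatus, string eventMessage)
        => string.Format(template, componentName, componentStatus, eventMessage);
}
```

Each template is a complete sentence, so a reader can see the final message shape without tracing helper variables. The cost, as the reply notes, is that one generic template cannot carry component-specific detail.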

Contributor

@cristinamanum left a comment

Please review the comments and decide what makes sense to update.

@@ -126,6 +127,13 @@ public static bool TryGetBoolArgument(IDictionary<string, string> jobArgsDiction
return null;
}

public static T TryGetEnumArgument<T>(IDictionary<string, string> jobArgsDictionary, string argName, T defaultValue)

Contributor

Is this method used anywhere?

Author

I added it initially, must have removed it. Great catch!


namespace NuGet.Jobs.Extensions
{
public static class LoggerExtensions

Contributor

Please add comments. Why is the current logging not enough? Why should this method be used over other methods?

Author

BeginScope on its own doesn't log a message when entering and leaving scope. This method does.

Will add comments.
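
The behavior described in this reply, logging on entering and leaving a scope, can be illustrated without the real ILogger. The sketch below uses a plain Action<string> as the log sink; LoggedScope is a hypothetical name and this is not the extension method from the PR.

```csharp
using System;

// Hypothetical illustration of a scope that logs on entry and exit.
public sealed class LoggedScope : IDisposable
{
    private readonly Action<string> _log;
    private readonly string _message;

    public LoggedScope(Action<string> log, string message)
    {
        _log = log ?? throw new ArgumentNullException(nameof(log));
        _message = message;
        _log("Entering scope: " + _message);
    }

    // Runs at the end of the using block, including when an exception escapes.
    public void Dispose() => _log("Leaving scope: " + _message);
}
```

Used as `using (new LoggedScope(Console.WriteLine, "Fetching cursor.")) { ... }`, this brackets the work with an entry line and an exit line.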


namespace StatusAggregator
{
public static class ComponentFactory

Contributor

Please add comments. What is a Component?

Author

See NuGet/ServerCommon#170 for more context. I agree, however, that this class should have more comments.

using (_logger.Scope("Fetching cursor."))
{
var cursor = await _table.Retrieve<CursorEntity>(
CursorEntity.DefaultPartitionKey, CursorEntity.DefaultRowKey);

Contributor

Is it ok to get a null cursor back? Should it throw instead of setting to minvalue?

Author

If there's no cursor, it likely means the table is empty and this is the first time the job has been run. So it should start at the beginning of time.

Author

I'll add a comment to make this behavior clearer.
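
A sketch of the fallback behavior discussed in this thread, including the earlier suggestion to log the value the cursor falls back to. CursorFallback and Resolve are hypothetical names, and a nullable DateTime stands in for the table lookup result.

```csharp
using System;

// Hypothetical helper, not part of the job's code.
public static class CursorFallback
{
    public static DateTime Resolve(DateTime? storedCursor, Action<string> log)
    {
        if (storedCursor.HasValue)
        {
            return storedCursor.Value;
        }

        // No stored cursor most likely means this is the first run of the job,
        // so start from the beginning of time and say so in the log.
        var fallback = DateTime.MinValue;
        log($"Could not fetch cursor; starting from {fallback:o}.");
        return fallback;
    }
}
```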

using (_logger.Scope("Updating cursor to {Cursor}.", value))
{
var cursorEntity = new CursorEntity(value);
return _table.InsertOrReplaceAsync(cursorEntity);

Contributor

Is this table thread safe? Does it need to be thread safe?

Contributor

Please add comments to this either way.

Author

It's Azure table storage, which is thread safe.


var parsedIncidents = incidents
    .SelectMany(i => _aggregateIncidentParser.ParseIncident(i))
    .ToList();

Contributor

Why not include orderby here and not do the ToList?

Author

good idea
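
The suggestion might look like the sketch below, which chains OrderBy onto the SelectMany and leaves the sequence lazy. ParsedSketch and IncidentOrdering are hypothetical names, and the sort key (a start time) is an assumption, since the review does not say which property the real code orders by.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a parsed incident.
public sealed class ParsedSketch
{
    public ParsedSketch(string id, DateTime startTime)
    {
        Id = id;
        StartTime = startTime;
    }

    public string Id { get; }
    public DateTime StartTime { get; }
}

public static class IncidentOrdering
{
    // Flattens the parse results and orders them without an intermediate list;
    // nothing is parsed or sorted until the result is enumerated.
    public static IEnumerable<ParsedSketch> ParseAndOrder(
        IEnumerable<string> incidentIds,
        Func<string, IEnumerable<ParsedSketch>> parse)
        => incidentIds.SelectMany(parse).OrderBy(p => p.StartTime);
}
```

Because the result is lazy, enumerating it twice would run the parser twice, so a caller that needs several passes may still want to materialize it once.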

}
}

private string GetRecentIncidentsQuery(DateTime cursor)

Contributor

This is a private method. Can it be tested?

Author

The idea is that this can be tested as a part of tests for FetchNewIncidents because all this method does is create an argument for an _incidentApiClient call that can be mocked.


private static string[] _componentStatusNames = Enum.GetNames(typeof(ComponentStatus));
private bool TryGetContentsForEventHelper(
EventEntity eventEntity,

Contributor

Is this method testable? Would it be better to make it public for easier testing?

Author

I don't think there would be any benefit unless we abstracted this out to a different class. We still need to test that the other methods that call this method do what this method does.

string regExPattern,
ILogger<IncidentParser> logger)
{
_regExPattern = regExPattern;

Contributor

null checks

Author

done

{
public class ValidationDurationIncidentParser : EnvironmentPrefixIncidentParser
{
private const string SubtitleRegEx = "Too many packages are stuck in the \"Validating\" state!";

Contributor

What happens if we change the incident title? Do we have any way to monitor when that happens?

Contributor

For any change there, we will need to rebuild and redeploy. Would it make sense to put it in a config?

Author

  1. We shouldn't be changing our incident titles. They should be descriptive enough on their own. If they aren't descriptive enough, we should change them to be more descriptive so that there isn't any desire to change them. Additionally, the job only needs to parse the title on ingestion, so if the name is changed after the job reads the incident (which should be within the first minute or two after it's created) it's fine.
  2. Jobs deployments are generally pretty quick, so if the incident titles are changed (which should be infrequently), I don't think it would be a big deal to redeploy it.

public bool TryParseIncident(Incident incident, out ParsedIncident parsedIncident)
{
using (_logger.Scope("Parsing incident with parser {IncidentParserType} using {RegExPattern}",
GetType(), _regExPattern))

Contributor

I think this log line would be a better place to log the incident's title than line 71. Something like: Attempting to use parser {IncidentParserType} using {RegExPattern} to parse title {IncidentTitle}.

Author

agreed. done

@scottbommarito scottbommarito merged commit f18edac into dev Jul 30, 2018
@scottbommarito scottbommarito deleted the sb-icm branch July 30, 2018 19:20
joelverhagen added a commit that referenced this pull request Sep 27, 2019
joelverhagen pushed a commit that referenced this pull request Oct 26, 2020
joelverhagen added a commit that referenced this pull request Oct 26, 2020

3 participants