Skip to content
This repository has been archived by the owner on Aug 3, 2024. It is now read-only.
/ ServerCommon Public archive

Add the Killswitch client #169

Closed
wants to merge 1 commit into from
Closed

Add the Killswitch client #169

wants to merge 1 commit into from

Conversation

loic-sharma
Copy link
Contributor

@loic-sharma loic-sharma commented Jul 6, 2018

Killswitches allow you to turn off services' features remotely. This will be used to turn off the revalidation job if it causes pressure on the ingestion pipeline. The API is simple:

if (!_killswitchClient.IsActive("MyService.MyFeature"))
{
  _logger.LogWarning("My feature is disabled, don't do anything!");
  return;
}

Like the ITelemetryClient, services are expected to wrap the IKillswitchClient with higher-level abstractions. For example:

if (!_killswitchService.ShouldEnqueueRevalidation())
{
  _logger.LogWarning("Revalidations have been disabled!");
  return;
}

Killswitch Endpoint

Killswitches work by polling an endpoint that returns a JSON list of activated killswitches:

[
  "Revalidation.Enqueue",
  "Gallery.SymbolUploads"
  "Orchestrator.RepositorySigning"
]

This endpoint will be a JSON file hosted on Azure Blob Storage. An admin panel will be created on the Gallery to modify this file.

Design Decisions

Client uses HTTP endpoint instead of Azure Blob Storage

The fact that we use blob storage on the backend is an implementation detail. With blob storage, we'd need to add storage credentials to all services, which would increase the cost to on-board. As for privacy concerns, we could make the killswitch blob private but generate a URI with read-only access for services to consume.

Lastly, an HTTP implementation can support a push model through long-polling. As far as I know, this cannot be done using Azure Blob Storage.

Background Refresh

In this implementation, IsActive does not refresh the killswitch cache. Instead, a background task refreshes the cache periodically (every 5 minutes by default). This lets consumers of the killswitch client call IsActive as frequently as they'd like, including in hot paths.

Compared to Gallery's ContentObjectService

The HttpKillswitchClient is very similar to the ContentObjectService:

  • Both refresh their state in the background
  • Both store their state as JSON on an endpoint
  • Both allow fast lookups of the currently cached values

The differences are:

  • The ContentObjectService gives you high-level objects, allowing you to build complex feature flag logic. The HttpKillswitchClient can only determine whether a features should be on or off
  • To add a new feature, the ContentObjectService requires code changes in both the Gallery as well as the service that owns the feature
  • The ContentObjectService uses APIs specific to ASP.NET for background refreshes, and cannot be used by jobs (see this)
  • The ContentObjectService fails silently if the latest settings cannot be fetched/parsed
  • The ContentObjectService depends on Azure Blob Storage

Part of https://github.com/NuGet/Engineering/issues/1440

@loic-sharma loic-sharma self-assigned this Jul 6, 2018
@loic-sharma loic-sharma changed the title Add the Killswitch library [WIP] Add the Killswitch library Jul 6, 2018

public void Start(CancellationToken cancellationToken)
{
Task.Run(async () =>
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: Make this async, call RefreshAsync once before returning to ensure cache is primed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

_logger.LogInformation("Refreshed killswitches");
}

public bool IsActive(string name) => _active.Contains(name);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: Throw if not started.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

public Task<KillswitchTestServer> GetTestServerAsync() => _testServer.Value;

[Fact]
public async Task ReturnsTrueForActiveKillswitches()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: Test malformed responses, that StartAsync refreshes in background once, that IsActive throws if not started

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

return;
}

Console.WriteLine($"Unexpected error code: {ex.ErrorCode}. Exception: {ex.ToString()}");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ex.ToString() -> ex

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

<Reference Include="System.Data" />
<Reference Include="System.Net.Http" />
<Reference Include="System.Xml" />
<Reference Include="xunit.abstractions, Version=2.0.0.0, Culture=neutral, PublicKeyToken=8d05b1bb7a6fdb6c, processorArchitecture=MSIL">
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Get rid of packages.config references.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

try
{
await RefreshAsync();
await Task.Delay(_config.RefreshInterval);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the intended RefreshInterval? How often will IsActive be called?
If RefreshInterval is small and IsActive is not frequent this approach is inefficient. A different approach is to refresh when IsActive is called.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am assuming IsActive will be called very frequently, specifically for revalidation job.. this could get out of hand. A decent RefreshInterval should work fine. It could be tweaked as per the calling job via config. Or may be add another config to determine when to refresh? Either via interval or with IsActive call?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Each job can have its own refresh cadence. I'm thinking 5 minutes for the revalidate job.

The idea is to make IsActive as cheap as possible so that it can be called anywhere including critical and hot paths. Moving the network request into IsActive would limit where we can call IsActive.

_active = new HashSet<string>();
}

public void Start(CancellationToken cancellationToken)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not thread safe. Start() can be called multiple times.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why isn't this thread-safe? As far as I can tell, it is. Reference assignment is atomic, and that's the only thing that the background thread interacts with.

Calling Start multiple times would cause the cache to be refreshed at a faster cadence. I can add a check to Start to ensure it's called just once, but I personally prefer simpler code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling Start multiple times would cause the cache to be refreshed at a faster cadence. I can add a check to Start to ensure it's called just once, but I personally prefer simpler code.

This is the reason. The code won't work as expected if Start() is called multiple times.

Copy link
Contributor Author

@loic-sharma loic-sharma Jul 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a check to ensure Start is called just once.

await RefreshAsync();
await Task.Delay(_config.RefreshInterval);
}
catch (Exception e)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are swallowing the exception and hiding issues. The correct behavior will be to add retries to protect against network issues, but throw on deserialization issues and other unrecoverable exceptions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This runs on a background thread, throwing would cause the background thread to end.

That being said, I agree that we should have some "hard fail" behavior when we repeatedly fail to reload the killswitches. I'll make IsActive throw if the cache's staleness reaches some threshold.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made IsActive throw if the cache reaches the configured staleness.

cancellationToken);
}

public async Task RefreshAsync()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why public if not part of the interface?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be useful for downstream consumers. Its not on the interface as not all killswitch services that support push model don't need a concept of refresh.

It also enables easy testing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I talked to @shishirx34, I will add RefreshAsync to the interface.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

private HashSet<string> _active;

public HttpKillswitchService(
HttpClient httpClient,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

using HttpClient you assume the config blob is public, but it's an internal service config and should be private. Use IStorage interface.

Copy link
Contributor Author

@loic-sharma loic-sharma Jul 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I considered this but decided against it. With blob storage, we'd need to add storage credentials to all jobs, which makes it harder to on-board a new service to killswitches. I don't think there's a need for killswitches to be private and there's precedent for public blobs: search auxiliary data, cursors, etc.. Really, the fact that we will use blob storage on the backend is an implementation detail.

Lastly, an HTTP implementation can support for a push model through long-polling. This requires a backend that supports long polling.

What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can have a SAS URL to a non-public blob with expiration 100 years in the future, can't we?

Copy link
Contributor Author

@loic-sharma loic-sharma Jul 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@agr Yup, that could work.

/// <summary>
/// A killswitch service that queries an HTTP service periodically for the list of activate killswitches.
/// </summary>
public class HttpKillswitchService : IKillswitchService
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There will be two roles in the killswitch world: reader and writer. If you intend to separate them into two interfaces you should suffix this interface with Reader. You could also combine the two into a single KillSwitchService, however, since there's going to be only one writer (the gallery) I don't think it's a good idea.

Copy link
Contributor Author

@loic-sharma loic-sharma Jul 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. What do you think of HttpKillswitchClient so that it matches our naming in logging and service bus?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

Copy link
Contributor

@skofman1 skofman1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🕐

@skofman1
Copy link
Contributor

skofman1 commented Jul 6, 2018

I like the idea of having a system wide killswitch service, however this creates feature duplication in the gallery with the "feature flag" service. Both turn abilities on/off. We should discuss this on a team level.

namespace NuGet.Services.KillSwitch
{
/// <summary>
/// A killswitch service that queries an HTTP service periodically for the list of activate killswitches.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: active killswitches

try
{
await RefreshAsync();
await Task.Delay(_config.RefreshInterval);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am assuming IsActive will be called very frequently, specifically for revalidation job.. this could get out of hand. A decent RefreshInterval should work fine. It could be tweaked as per the calling job via config. Or may be add another config to determine when to refresh? Either via interval or with IsActive call?

@loic-sharma
Copy link
Contributor Author

@skofman1 Sure. Let's talk about it after the next standup?

@loic-sharma loic-sharma changed the title [WIP] Add the Killswitch library Add the Killswitch client Jul 6, 2018
@loic-sharma
Copy link
Contributor Author

We would like to implement this using storage APIs. Once services use the same APIs, we can look into a universal killswitch system

@loic-sharma loic-sharma closed this Jul 9, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants