This repository has been archived by the owner on Jul 30, 2024. It is now read-only.
/ NuGet.Jobs Public archive

Commit

[GH Idx] Save blob to azurestorage (#785)
* [GH Index] Initial commit

* [GH Index] Fixed build

* Added License headers

* Changed Nuspec Id

* Changed Nuspec script include

* Added empty job

* [GH Idx] Added Octokit and LibGit2Sharp dependencies

* [GH Idx] Add initial GHSearcher

* [GH Idx] Add GitRepoSearcher

* [GH Idx] Add dependency injection

* [GH Idx] Add null check

* [GH Idx] Add tests

* [GH Idx] Extracted constants

* [GH Idx] Fixed tests

* [GH Idx] Add Filters class

* Update src/NuGet.Jobs.GitHubIndexer/GitRepoSearchers/GitHubSearcher.cs

Co-Authored-By: Loïc Sharma <[email protected]>

* Update src/NuGet.Jobs.GitHubIndexer/GitRepoSearchers/GitHubSearcher.cs

Co-Authored-By: Loïc Sharma <[email protected]>

* [GH Idx] Removed duplicate class RepositoryInformation

* [GH Idx] Refactored the code a bit

* [GH Idx] Fix possible deadlock

* [GH Idx] Add config section in the appsettings.json

* [GH Idx] GitHubSearcher is not recursive anymore!

* [GH Idx] Removed redundant comparer

* [GH Idx] Fix upperStarBound wrongly set on request

* [GH Idx] Fixed sleep time

* [GH Idx] Fix typo

* [GH Idx] Made fields private

* [GH Idx] Changed UA

* [GH Idx] Made the configuration not static

* [GH Idx] Add ApiInfo doc in the tests

* [GH Idx] Refactor GH Search API requester

* [GH Idx] Removed redundant import in csproj

* [GH Idx] Add documentation to the configuration

* [GH Idx] Move the IGitHubClient to the GitHubSearchWrapper

* [GH Idx] Remove redundant variable

* [GH Idx] Trim tests Assembly info

* [GH Idx] Add checks to ensure the required info is in the GitHub response

* [GH Idx] Moved public method before private methods

* [GH Idx] Extract retry time in a static variable

* [GH Idx] Add typecheck and fix tests

* [GH Idx] Remove redundant using

* [GH Idx] Nit space formatting

* [GH Idx] Change UserAgent to use assembly name and version

* [GH Idx] Remove extra line

* [GH Idx] Fix nit picks

* [GH Idx] Fix merge

* [GH Idx] First iteration of the filtering

* [GH Idx] Simplified Job class

* [GH Idx] WIP

* [GH Idx] Process repo is now run in parallel

* [GH Idx] Removed debug code

* [GH Idx] WIP 2

* [GH Idx] Modify Filters doc

* [GH Idx] Refactor WritableRepositoryInformation

* [GH Idx] WIP 3

* [GH Idx] Add WritableRepoInfo doc

* [GH Idx] Made the MaxDegreeOfParallelism configurable

* [GH Idx] WIP before tests

The pipeline is working; now I need to refactor it to make it testable by adding a bunch of interfaces to decouple it from LibGit2Sharp

* [GH Idx] Refactor to decouple from LibGit2Sharp

* [GH Idx] Using immutable collections

* [GH Idx] Add tests

* [GH Idx] Clean old code

* [GH Idx] Remove unused imports

* [GH Idx] Add logging

* [GH Idx] Cleanup

* [GH Idx] Bumping up the NuGetGallery.Core dependency version

* [GH Idx] Reverting changes to web.config

* [GH Idx] Add docs

* Cleaned up dependencies and bumped up NuGetGallery.Core version

* [GH Idx] Fix PascalCase method name

* [GH Idx] Fix space

* [GH Idx] Remove redundant comment

* [GH Idx] Now using proper logger creation

* [GH Idx] Add new line for constructor

* [GH Idx] Remove redundant filter config file type

* [GH Idx] Add RegEx timeout

* [GH Idx] Remove empty line

* [GH Idx] Add named params and remove redundant code for FetchedRepo

* [GH Idx] Add basePathLength to optimize Select

* [GH Idx] Move the static constructor

* [GH Idx] Cache hit now logged as an information

* [GH Idx] Use Path.Combine instead of string concatenation

* [GH Idx] Remove redundant comment

* [GH Idx] Extract GitFileInfo class

* [GH Idx] Remove redundant imports

* [GH Idx] Replace "as" cast

* [GH Idx] Simplify LINQ statement

* [GH Idx] Simplify config file parsing

* [GH Idx] Simplify Thread construction

* [GH Idx] Move cache files to their own directory

* [GH Idx] Remove transitive exception throws in documentation

* [GH Idx] Wrap long line in Filters

* [GH Idx] Make dependencies case-insensitive

* [GH Idx] Use Path.Combine and remove extra line

* [GH Idx] Add named param

* Update src/NuGet.Jobs.GitHubIndexer/CheckedOutFile.cs

Co-Authored-By: Loïc Sharma <[email protected]>

* [GH Idx] Move config in same section

* [GH Idx] Remove redundant documentation

* [GH Idx] Check for unhandled config file type

* [GH Idx] Rename function

* [GH Idx] Using Path.Combine in RepoUtils

* [GH Idx] Move isValidPackageId to RepoUtils

* [GH Idx] Optimal LINQ usage in ReposIndexer

* [GH Idx] Move TODO

* [GH Idx] Log Trace and Debug --> Information

* [GH Idx] Expanded msBuild and PkgConfig enums

* [GH Idx] Remove special regex case

* [GH Idx] Using stringComparer instead of ToLower() then comparing

* [GH Idx] Use repo.FullName instead of manually creating it

* [GH Idx] Filters early return

* [GH Idx] Log warning for long paths

* [GH Idx] Run --> RunAsync

* [GH Idx] Remove ServicePointManager init setup

* [GH Idx] workdir --> work

* [GH Idx] Remove as cast

* [GH Idx] LogTrace --> LogInformation for disk cache

* [GH Idx] Save final blob to Azure Storage

* [GH Idx] Forgot a LogTrace there...

* Update src/NuGet.Jobs.GitHubIndexer/ConfigFileParser.cs

Co-Authored-By: Loïc Sharma <[email protected]>

* [GH Idx] Update FetchedRepo comment

* [GH Idx] Update EndsWith --> Equals

* [GH Idx] Fix wrong documentation

* [GH Idx] Revert Config properties for Azure BlobStorage

* [GH Idx] Simplify LINQ statement

* [GH Idx] Format RepoUtils line to make it more readable

* [GH Idx] Got rid of few Singletons

* [GH Idx] Scope logging

* [GH Idx] "No Description." --> ""

* [GH Idx] Fix config

* Update src/NuGet.Jobs.GitHubIndexer/FetchedRepo.cs

Co-Authored-By: Loïc Sharma <[email protected]>

* Update src/NuGet.Jobs.GitHubIndexer/ReposIndexer.cs

Co-Authored-By: Loïc Sharma <[email protected]>

* [GH Idx] Inverted if statement in TryGetCachedVersion

* [GH Idx] Fix timing to use UTC

* [GH Idx] Fix timing to use UTC

* [GH Idx] Move assignment

* [GH Idx] Extract container name to constant

* [GH Idx] Move serializer

* [GH Idx] Function rename

* [GH Idx] Add tests to make sure blob is serialized correctly
mogah authored Jul 29, 2019
1 parent 6c2da57 commit 3070907
Showing 4 changed files with 104 additions and 9 deletions.
10 changes: 10 additions & 0 deletions src/NuGet.Jobs.GitHubIndexer/GitHubIndexerConfiguration.cs
@@ -24,5 +24,15 @@ public class GitHubIndexerConfiguration
/// The number of concurrent threads running to index Git repositories
/// </summary>
public int MaxDegreeOfParallelism { get; set; } = 32;

/// <summary>
/// The connection string to be used for a <see cref="NuGetGallery.CloudBlobClientWrapper"/> instance.
/// </summary>
public string StorageConnectionString { get; set; }

/// <summary>
/// Whether read-access geo-redundant storage (RA-GRS) is enabled for the Azure Storage account.
/// </summary>
public bool StorageReadAccessGeoRedundant { get; set; }
}
}
6 changes: 6 additions & 0 deletions src/NuGet.Jobs.GitHubIndexer/Job.cs
@@ -6,6 +6,8 @@
using Autofac;
using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Options;
using NuGetGallery;
using Octokit;

namespace NuGet.Jobs.GitHubIndexer
@@ -33,6 +35,10 @@ protected override void ConfigureJobServices(IServiceCollection services, IConfi
services.AddTransient<IRepositoriesCache, DiskRepositoriesCache>();
services.AddTransient<IConfigFileParser, ConfigFileParser>();
services.AddTransient<IRepoFetcher, RepoFetcher>();
services.AddTransient<ICloudBlobClient>(provider => {
var config = provider.GetRequiredService<IOptionsSnapshot<GitHubIndexerConfiguration>>();
return new CloudBlobClientWrapper(config.Value.StorageConnectionString, config.Value.StorageReadAccessGeoRedundant);
});

services.Configure<GitHubIndexerConfiguration>(configurationRoot.GetSection(GitHubIndexerConfigurationSectionName));
}
23 changes: 20 additions & 3 deletions src/NuGet.Jobs.GitHubIndexer/ReposIndexer.cs
@@ -18,24 +18,28 @@ namespace NuGet.Jobs.GitHubIndexer
public class ReposIndexer
{
private const string WorkingDirectory = "work";
private const string BlobStorageContainerName = "content";
private const string GitHubUsageFileName = "GitHubUsage.v1.json";

- private static readonly string GitHubUsageFilePath = Path.Combine(WorkingDirectory, "GitHubUsage.v1.json");
public static readonly string RepositoriesDirectory = Path.Combine(WorkingDirectory, "repos");
public static readonly string CacheDirectory = Path.Combine(WorkingDirectory, "cache");
private static readonly JsonSerializer Serializer = new JsonSerializer();

private readonly IGitRepoSearcher _searcher;
private readonly ILogger<ReposIndexer> _logger;
private readonly int _maxDegreeOfParallelism;
private readonly IRepositoriesCache _repoCache;
private readonly IRepoFetcher _repoFetcher;
private readonly IConfigFileParser _configFileParser;
private readonly ICloudBlobClient _cloudClient;

public ReposIndexer(
IGitRepoSearcher searcher,
ILogger<ReposIndexer> logger,
IRepositoriesCache repoCache,
IConfigFileParser configFileParser,
IRepoFetcher repoFetcher,
ICloudBlobClient cloudClient,
IOptionsSnapshot<GitHubIndexerConfiguration> configuration)
{
_searcher = searcher ?? throw new ArgumentNullException(nameof(searcher));
@@ -50,6 +54,7 @@ public ReposIndexer(
}

_maxDegreeOfParallelism = configuration.Value.MaxDegreeOfParallelism;
_cloudClient = cloudClient ?? throw new ArgumentNullException(nameof(cloudClient));
}

public async Task RunAsync()
@@ -80,14 +85,26 @@ await ProcessInParallel(inputBag, repo =>
.ThenBy(x => x.Id)
.ToList();

- // TODO: Replace with upload to Azure Blob Storage (https://github.com/NuGet/NuGetGallery/issues/7211)
- File.WriteAllText(GitHubUsageFilePath, JsonConvert.SerializeObject(finalList));
+ await WriteFinalBlobAsync(finalList);

// Delete the repos and cache directory
Directory.Delete(RepositoriesDirectory, recursive: true);
Directory.Delete(CacheDirectory, recursive: true);
}

private async Task WriteFinalBlobAsync(List<RepositoryInformation> finalList)
{
var blobReference = _cloudClient.GetContainerReference(BlobStorageContainerName).GetBlobReference(GitHubUsageFileName);

using (var stream = await blobReference.OpenWriteAsync(accessCondition: null))
using (var streamWriter = new StreamWriter(stream))
using (var jsonTextWriter = new JsonTextWriter(streamWriter))
{
blobReference.Properties.ContentType = "application/json";
Serializer.Serialize(jsonTextWriter, finalList);
}
}

private RepositoryInformation ProcessSingleRepo(WritableRepositoryInformation repo)
{
if (_repoCache.TryGetCachedVersion(repo, out var cachedVersion))
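A minimal sketch of the streaming pattern that `WriteFinalBlobAsync` uses above, assuming the Newtonsoft.Json package is referenced. The JSON is written straight into the destination stream through a `StreamWriter` and `JsonTextWriter` rather than being built as a string first. A `MemoryStream` stands in for the blob write stream, and `BlobWriteSketch` and `WriteJson` are hypothetical names that are not part of this commit.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using Newtonsoft.Json;

// Hypothetical helper for illustration only; not part of the commit.
public static class BlobWriteSketch
{
    private static readonly JsonSerializer Serializer = new JsonSerializer();

    // Serializes `items` as JSON directly into `destination`,
    // without materializing the whole document as a string.
    public static void WriteJson<T>(Stream destination, IEnumerable<T> items)
    {
        // UTF-8 without a BOM; leaveOpen lets the caller keep using the stream.
        using (var streamWriter = new StreamWriter(destination, new UTF8Encoding(false), 1024, leaveOpen: true))
        using (var jsonTextWriter = new JsonTextWriter(streamWriter))
        {
            Serializer.Serialize(jsonTextWriter, items);
        }
    }
}
```

Calling `BlobWriteSketch.WriteJson(stream, new[] { "a", "b" })` on a `MemoryStream` leaves the UTF-8 bytes of `["a","b"]` in the stream, since Newtonsoft.Json writes without indentation by default.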
74 changes: 68 additions & 6 deletions tests/NuGet.Jobs.GitHubIndexer.Tests/ReposIndexerFacts.cs
@@ -3,11 +3,16 @@

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using Microsoft.Extensions.Logging;
using Microsoft.Extensions.Options;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;
using Moq;
using Newtonsoft.Json;
using NuGetGallery;
using Xunit;

@@ -18,6 +23,7 @@ public class ReposIndexerFacts
private static ReposIndexer CreateIndexer(
WritableRepositoryInformation searchResult,
IReadOnlyList<GitFileInfo> repoFiles,
Action<string> onDisposeHandler,
Func<ICheckedOutFile, IReadOnlyList<string>> configFileParser = null)
{
var mockConfig = new Mock<IOptionsSnapshot<GitHubIndexerConfiguration>>();
@@ -57,12 +63,36 @@ private static ReposIndexer CreateIndexer(
.Setup(x => x.FetchRepo(It.IsAny<WritableRepositoryInformation>()))
.Returns(mockFetchedRepo.Object);

var mockBlobClient = new Mock<ICloudBlobClient>();
var mockBlobContainer = new Mock<ICloudBlobContainer>();
var mockBlob = new Mock<ISimpleCloudBlob>();

mockBlobClient
.Setup(x => x.GetContainerReference(It.IsAny<string>()))
.Returns(() => mockBlobContainer.Object);
mockBlobContainer
.Setup(x => x.GetBlobReference(It.IsAny<string>()))
.Returns(() => mockBlob.Object);
mockBlob
.Setup(x => x.ETag)
.Returns("\"some-etag\"");
mockBlob
.Setup(x => x.Properties)
.Returns(new CloudBlockBlob(new Uri("https://example/blob")).Properties);
mockBlob
.Setup(x => x.OpenWriteAsync(It.IsAny<AccessCondition>()))
.ReturnsAsync(() => new RecordingStream(bytes =>
{
onDisposeHandler?.Invoke(Encoding.UTF8.GetString(bytes));
}));

return new ReposIndexer(
mockSearcher.Object,
new Mock<ILogger<ReposIndexer>>().Object,
mockRepoCache.Object,
mockConfigFileParser.Object,
mockRepoFetcher.Object,
mockBlobClient.Object,
mockConfig.Object);
}
public class TheRunMethod
@@ -71,7 +101,7 @@ public class TheRunMethod
public async Task TestNoDependenciesInFiles()
{
var repo = new WritableRepositoryInformation("owner/test", url: "", stars: 100, description: "", mainBranch: "master");
- var configFileNames = new string[] { "packages.config", "someProjFile.csproj", "someProjFile.props", "someProjFile.targets"};
+ var configFileNames = new string[] { "packages.config", "someProjFile.csproj", "someProjFile.props", "someProjFile.targets" };
var repoFiles = new List<GitFileInfo>()
{
new GitFileInfo("file1.txt", 1),
@@ -82,7 +112,7 @@ public async Task TestNoDependenciesInFiles()
new GitFileInfo(configFileNames[3], 1)
};

- var indexer = CreateIndexer(repo, repoFiles);
+ var indexer = CreateIndexer(repo, repoFiles, onDisposeHandler: null);
await indexer.RunAsync();

var result = repo.ToRepositoryInformation();
@@ -93,8 +123,8 @@ public async Task TestNoDependenciesInFiles()
public async Task TestWithDependenciesInFiles()
{
var repo = new WritableRepositoryInformation("owner/test", url: "", stars: 100, description: "", mainBranch: "master");
- var configFileNames = new string[] { "packages.config", "someProjFile.csproj", "someProjFile.props", "someProjFile.targets"};
- var repoDependencies = new string[] { "dependency1", "dependency2", "dependency3", "dependency4"};
+ var configFileNames = new string[] { "packages.config", "someProjFile.csproj", "someProjFile.props", "someProjFile.targets" };
+ var repoDependencies = new string[] { "dependency1", "dependency2", "dependency3", "dependency4" };
var repoFiles = new List<GitFileInfo>()
{
new GitFileInfo("file1.txt", 1),
@@ -104,12 +134,17 @@ public async Task TestWithDependenciesInFiles()
new GitFileInfo(configFileNames[2], 1),
new GitFileInfo(configFileNames[3], 1)
};

- var indexer = CreateIndexer(repo, repoFiles, (ICheckedOutFile file) =>
+ var writeToBlobCalled = false;
+ var indexer = CreateIndexer(repo, repoFiles, configFileParser: (ICheckedOutFile file) =>
{
// Make sure that the Indexer filters out the non-config files
Assert.True(Array.Exists(configFileNames, x => string.Equals(x, file.Path)));
return repoDependencies;
},
onDisposeHandler: (string serializedText) =>
{
writeToBlobCalled = true;
Assert.Equal(JsonConvert.SerializeObject(new RepositoryInformation[] { repo.ToRepositoryInformation() }), serializedText);
});
await indexer.RunAsync();

@@ -123,6 +158,33 @@ public async Task TestWithDependenciesInFiles()
Assert.Equal(repo.Id, result.Id);
Assert.Equal(repo.Stars, result.Stars);
Assert.Equal(repo.Url, result.Url);

// Make sure the blob has been written
Assert.True(writeToBlobCalled);
}
}
private class RecordingStream : MemoryStream
{
private readonly object _lock = new object();
private Action<byte[]> _onDispose;

public RecordingStream(Action<byte[]> onDispose)
{
_onDispose = onDispose;
}

protected override void Dispose(bool disposing)
{
lock (_lock)
{
if (_onDispose != null)
{
_onDispose(ToArray());
_onDispose = null;
}
}

base.Dispose(disposing);
}
}
}
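The `RecordingStream` helper in the test diff above captures whatever was written to the blob stream at the moment the stream is disposed. A self-contained sketch of that capture-on-dispose idea, using only the base class library, might look like the following. `CapturingStream` and `CapturingStreamDemo` are hypothetical names chosen for illustration, and this version drops the lock used in the commit on the assumption of single-threaded use.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical stand-in for the RecordingStream in the test diff.
public sealed class CapturingStream : MemoryStream
{
    private Action<byte[]> _onDispose;

    public CapturingStream(Action<byte[]> onDispose)
    {
        _onDispose = onDispose;
    }

    protected override void Dispose(bool disposing)
    {
        // Hand over the buffered bytes exactly once, before the base class
        // releases the stream; later Dispose calls find the callback cleared.
        var callback = _onDispose;
        _onDispose = null;
        callback?.Invoke(ToArray());

        base.Dispose(disposing);
    }
}

public static class CapturingStreamDemo
{
    // Writes `text` through a StreamWriter and returns what the stream captured.
    public static string WriteAndCapture(string text)
    {
        string captured = null;

        using (var stream = new CapturingStream(bytes => captured = Encoding.UTF8.GetString(bytes)))
        using (var writer = new StreamWriter(stream))
        {
            writer.Write(text);
        }

        return captured;
    }
}
```

Disposing the `StreamWriter` flushes its buffer and closes the underlying stream, so the callback sees the complete payload. The second `Dispose` call from the outer `using` is a no-op for the callback because it has already been cleared.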