This repository has been archived by the owner on Jul 30, 2024. It is now read-only.
/ NuGet.Jobs Public archive

Add README for the search service (#784)
Add supporting documents for the Azure Search indexes and auxiliary data files
Address NuGet/NuGetGallery#8006
joelverhagen committed Jun 14, 2020
1 parent b9f34ca commit 725e891
Showing 6 changed files with 643 additions and 0 deletions.
111 changes: 111 additions & 0 deletions docs/Azure-Search-indexes.md
@@ -0,0 +1,111 @@
# Azure Search indexes

**Subsystem: Search 🔎**

The search subsystem heavily depends on Azure Search for storing package metadata and performing package queries. Within
a single Azure Search resource, there can be multiple indexes. An index is simply a collection of documents with a
common schema. For the NuGet search subsystem, there are two indexes expected in each Azure Search resource:

- [`search-XXX`](#search-index) - this is the "search" index which contains documents for *discovery* queries
- [`hijack-XXX`](#hijack-index) - this is the "hijack" index which contains documents for *metadata lookup* queries

## Search index

The search index is designed to fulfill queries for package discovery. This is likely the scenario you would think about
first when you imagine how package search would work. It's optimized for searching package metadata fields by one or more
keywords and has a scoring profile that returns the most relevant package first.

This index has up to four documents per package ID. Each of the four ID-specific documents represents a different view
of available package versions. There are two factors for filtering in and out package versions: whether or not to
consider prerelease versions and whether or not to consider SemVer 2.0.0 versions.

This may seem a little strange at first, so it's best to consider an example. Consider a package
[`BaseTestPackage.SearchFilters`](https://www.nuget.org/packages/BaseTestPackage.SearchFilters) that has four versions:

- `1.1.0` - stable, SemVer 1.0.0
- `1.2.0-beta` - prerelease, SemVer 1.0.0
- `1.3.0+metadata` - stable, SemVer 2.0.0 (due to build metadata)
- `1.4.0-delta.4` - prerelease, SemVer 2.0.0 (due to a dot in the prerelease label)

As mentioned before there are up to four documents per package ID. In the case of the example package
`BaseTestPackage.SearchFilters`, there will be four documents, each with a different set of versions included in the
document.

- Stable + SemVer 1.0.0: contains only `1.1.0` ([example query](https://azuresearch-usnc.nuget.org/query?q=packageid:BaseTestPackage.SearchFilters))
- Stable/Prerelease + SemVer 1.0.0: contains `1.1.0` and `1.2.0-beta` ([example query](https://azuresearch-usnc.nuget.org/query?q=packageid:BaseTestPackage.SearchFilters&prerelease=true))
- Stable + SemVer 2.0.0: contains `1.1.0` and `1.3.0+metadata` ([example query](https://azuresearch-usnc.nuget.org/query?q=packageid:BaseTestPackage.SearchFilters&semVerLevel=2.0.0))
- Stable/Prerelease + SemVer 2.0.0: contains all versions ([example query](https://azuresearch-usnc.nuget.org/query?q=packageid:BaseTestPackage.SearchFilters&prerelease=true&semVerLevel=2.0.0))

The four "flavors" of search documents per ID are referred to as **search filters**.
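The way the four views fall out of the two factors can be sketched in a few lines of Python. This is only an illustration of the rules described above, not the indexing code used by the search subsystem. The helper names (`split_version`, `versions_for_filter`) are hypothetical, and the SemVer 2.0.0 check covers only the two conditions mentioned in the example (build metadata, or a dot in the prerelease label).

```python
# Versions of the example package, as listed above.
VERSIONS = ["1.1.0", "1.2.0-beta", "1.3.0+metadata", "1.4.0-delta.4"]


def split_version(version):
    """Split a version string into (core, prerelease label, build metadata)."""
    core_and_prerelease, _, metadata = version.partition("+")
    core, _, prerelease = core_and_prerelease.partition("-")
    return core, prerelease, metadata


def is_prerelease(version):
    return split_version(version)[1] != ""


def is_semver2(version):
    # Simplified rule taken from the example: build metadata, or a dot in
    # the prerelease label, makes the version SemVer 2.0.0.
    _, prerelease, metadata = split_version(version)
    return metadata != "" or "." in prerelease


def versions_for_filter(versions, include_prerelease, include_semver2):
    """Return the versions visible in one of the four views."""
    return [
        v for v in versions
        if (include_prerelease or not is_prerelease(v))
        and (include_semver2 or not is_semver2(v))
    ]


for include_prerelease in (False, True):
    for include_semver2 in (False, True):
        visible = versions_for_filter(VERSIONS, include_prerelease, include_semver2)
        print(include_prerelease, include_semver2, visible)
```

Each package ID gets one document per combination, but only when that combination has at least one visible version, which is why the text says "up to" four documents.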

The documents in the search index are identified (via the `key` property) by a unique string with the following format:

```
{sanitized lowercase ID}-{base64 lowercase ID}-{search filter}
```

The `sanitized lowercase ID` is the lowercase package ID with all characters that are not acceptable for Azure Search
document keys replaced, like dots and non-ASCII word characters (like Chinese characters). In the example key below, the
dot in the package ID becomes an underscore. This component of the document key is included for readability purposes
only.

The `base64 lowercase ID` is the base64 encoding of the package ID's bytes, encoded with UTF-8. This string is
guaranteed to be a 1:1 mapping with the lowercase package ID and is included for uniqueness. The
`HttpServerUtility.UrlTokenEncode` API is used for base64 encoding.

The `search filter` has one of four values:

- `Default` - Stable + SemVer 1.0.0
- `IncludePrerelease` - Stable/Prerelease + SemVer 1.0.0
- `IncludeSemVer2` - Stable + SemVer 2.0.0
- `IncludePrereleaseAndSemVer2` - Stable/Prerelease + SemVer 2.0.0
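As a rough illustration, the mapping from the `prerelease` and `semVerLevel` query parameters (seen in the example queries above) to a search filter name could look like the following Python sketch. The function names are hypothetical, and treating only the exact string `2.0.0` as opting in to SemVer 2.0.0 is an assumption made for brevity; the real service may parse the parameter differently.

```python
from urllib.parse import parse_qs


def select_search_filter(include_prerelease, include_semver2):
    """Pick one of the four search filter names from the two factors."""
    if include_prerelease and include_semver2:
        return "IncludePrereleaseAndSemVer2"
    if include_prerelease:
        return "IncludePrerelease"
    if include_semver2:
        return "IncludeSemVer2"
    return "Default"


def search_filter_for_query(query_string):
    """Derive the search filter from a raw query string (no leading '?')."""
    params = parse_qs(query_string)
    include_prerelease = params.get("prerelease", ["false"])[0].lower() == "true"
    # ASSUMPTION: only the literal value "2.0.0" enables SemVer 2.0.0 versions.
    include_semver2 = params.get("semVerLevel", [""])[0] == "2.0.0"
    return select_search_filter(include_prerelease, include_semver2)


print(search_filter_for_query("q=packageid:BaseTestPackage.SearchFilters&prerelease=true"))
```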

For the package ID `BaseTestPackage.SearchFilters`, the Stable + SemVer 1.0.0 document key would be:

```
basetestpackage_searchfilters-YmFzZXRlc3RwYWNrYWdlLnNlYXJjaGZpbHRlcnM1-Default
```
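The key format can be reproduced with a short Python sketch. This is not the production implementation (which is C#). `url_token_encode` imitates the documented behavior of `HttpServerUtility.UrlTokenEncode` (URL-safe base64 with the `=` padding replaced by a digit giving the padding count). The `sanitize` rule, which replaces every character outside `a-z`, `0-9`, `_` and `-` with an underscore, is an assumption inferred from the example key above and may differ from the real sanitizer in edge cases.

```python
import base64
import re


def url_token_encode(data):
    """Imitate HttpServerUtility.UrlTokenEncode for a bytes value."""
    if not data:
        return ""
    encoded = base64.urlsafe_b64encode(data).decode("ascii")
    stripped = encoded.rstrip("=")
    # The trailing digit is the number of '=' padding characters removed.
    return stripped + str(len(encoded) - len(stripped))


def sanitize(value):
    # ASSUMPTION: inferred from the example key, not taken from the source code.
    return re.sub(r"[^a-z0-9_-]", "_", value)


def search_document_key(package_id, search_filter):
    """Build '{sanitized lowercase ID}-{base64 lowercase ID}-{search filter}'."""
    lower_id = package_id.lower()
    encoded_id = url_token_encode(lower_id.encode("utf-8"))
    return f"{sanitize(lower_id)}-{encoded_id}-{search_filter}"


print(search_document_key("BaseTestPackage.SearchFilters", "Default"))
```

Because the base64 component alone is unique per lowercase ID, two IDs that sanitize to the same readable prefix still get distinct keys.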

Each document contains a variety of metadata fields originating from the latest version in the applicable version list
as well as a field listing all versions. See the
[`NuGet.Services.AzureSearch.SearchDocument.Full`](../src/NuGet.Services.AzureSearch/Models/SearchDocument.cs) class and
its inherited members for a full list of the fields.

Unlisted package versions do not appear in the search index at all.

## Hijack index

The hijack index is used by the gallery to fulfill specific metadata lookup operations. For example, if a
customer is looking for metadata about all versions of the package ID `Newtonsoft.Json`, in certain cases the gallery
will query the search service for this metadata and the search service will use the hijack index to fetch the
data.

This index has one document for every version of every package ID, whether it is unlisted or not. The search service
uses this index to find all versions of a package via the `ignoreFilter=true` parameter, including:

- unlisted packages ([example query](https://azuresearch-usnc.nuget.org/search/query?q=packageid:BaseTestPackage.Unlisted&ignoreFilter=true))
- multiple versions of a single ID ([example query](https://azuresearch-usnc.nuget.org/search/query?q=packageid:BaseTestPackage.SearchFilters&ignoreFilter=true&semVerLevel=2.0.0))

The documents in the hijack index are identified (via the `key` property) by a unique string with the following format:

```
{sanitized ID/version}-{base64 ID/version}
```

The `sanitized ID/version` is the `{lowercase package ID}/{lowercase, normalized version}` string with all characters
that are not acceptable for Azure Search document keys replaced, like dots and non-ASCII word characters (like Chinese
characters). In the example key below, the dots and the slash become underscores. This component of the document key is
included for readability purposes only.

The `base64 ID/version` is the base64 encoding of the previously mentioned concatenation of ID and version, encoded
with UTF-8. This string is guaranteed to be a 1:1 mapping with the lowercase package ID and version and is included
for uniqueness. The `HttpServerUtility.UrlTokenEncode` API is used for base64 encoding.

For the package ID `BaseTestPackage.SearchFilters` and version `1.3.0+metadata`, the document key would be:

```
basetestpackage_searchfilters_1_3_0-YmFzZXRlc3RwYWNrYWdlLnNlYXJjaGZpbHRlcnMvMS4zLjA1
```
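The hijack key can be sketched the same way in Python. As before, this is an illustration rather than the production code. The `sanitize` rule is an assumption inferred from the example key, and `normalize_version` is a deliberately minimal, hypothetical stand-in for `NuGetVersion` normalization: it only lowercases and drops build metadata, which is all the example needs.

```python
import base64
import re


def url_token_encode(data):
    """Imitate HttpServerUtility.UrlTokenEncode for a bytes value."""
    if not data:
        return ""
    encoded = base64.urlsafe_b64encode(data).decode("ascii")
    stripped = encoded.rstrip("=")
    return stripped + str(len(encoded) - len(stripped))


def sanitize(value):
    # ASSUMPTION: inferred from the example key, not taken from the source code.
    return re.sub(r"[^a-z0-9_-]", "_", value)


def normalize_version(version):
    # Simplification: real normalization also handles leading zeroes and
    # four-part versions. Here only build metadata is removed.
    return version.split("+", 1)[0].lower()


def hijack_document_key(package_id, version):
    """Build '{sanitized ID/version}-{base64 ID/version}'."""
    id_and_version = f"{package_id.lower()}/{normalize_version(version)}"
    encoded = url_token_encode(id_and_version.encode("utf-8"))
    return f"{sanitize(id_and_version)}-{encoded}"


print(hijack_document_key("BaseTestPackage.SearchFilters", "1.3.0+metadata"))
```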

Each document contains a variety of metadata fields for the specific package version that the document represents.
See the
[`NuGet.Services.AzureSearch.HijackDocument.Full`](../src/NuGet.Services.AzureSearch/Models/HijackDocument.cs) class and
its inherited members for a full list of the fields.
169 changes: 169 additions & 0 deletions docs/Search-auxiliary-files.md
@@ -0,0 +1,169 @@
# Search auxiliary files

**Subsystem: Search 🔎**

Aside from metadata stored in the [Azure Search indexes](Azure-Search-indexes.md), there is data stored in Azure Blob
Storage for bookkeeping and performance reasons. These data files are called **auxiliary files**. The data files
mentioned here are those explicitly managed by the search subsystem. Other data files exist (manually created,
created by the statistics subsystem, etc.). Those will not be covered here but are mentioned in the job-specific
documentation that uses them as input.

Each search auxiliary file is copied to every region where a [search service](../src/NuGet.Services.SearchService/README.md)
is deployed. For nuget.org, we run search in four regions, so there are four copies of each of these files.

The search auxiliary files are:

- [`downloads/downloads.v2.json`](#download-count-data) - total download count for every package version
- [`owners/owners.v2.json` and change history](#package-ownership-data) - owners for every package ID
- [`verified-packages/verified-packages.v1.json`](#verified-packages-data) - package IDs that are verified
- [`popularity-transfers/popularity-transfers.v1.json`](#popularity-transfer-data) - popularity transfers between package IDs

## Download count data

The `downloads/downloads.v2.json` file has the total download count for all package versions. The total download count
for a package ID as a whole can be calculated simply by adding all version download counts.

The downloads data file looks like this:

```json
{
    "Newtonsoft.Json": {
        "8.0.3": 10508321,
        "9.0.1": 55801938
    },
    "NuGet.Versioning": {
        "5.6.0-preview.3.6558": 988,
        "5.6.0": 10224
    }
}
```

The package ID and version keys are not guaranteed to have the original (author-intended) casing and should be treated
in a case insensitive manner. The version keys will always be normalized via [standard `NuGetVersion` normalization rules](https://docs.microsoft.com/en-us/nuget/concepts/package-versioning#normalized-version-numbers)
(e.g. no build metadata will appear, no leading zeroes, etc.).

If a package ID or version does not exist in the data file, this only indicates that there is no download count data and
does not imply that the package ID or version does not exist on the package source. It is possible for package IDs or
versions that do not exist (perhaps due to deletion) to exist in the data file.

The order of the IDs and versions in the file is undefined.
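A consumer of this file might read it along the lines of the following Python sketch. It is not the `DownloadDataClient` implementation. It only shows case-insensitive lookups and the per-ID total described above, using the sample data from this document. The function names are hypothetical, and lowercasing via `str.lower()` is an assumption about how case-insensitive comparison is done.

```python
import json

SAMPLE = """
{
    "Newtonsoft.Json": {"8.0.3": 10508321, "9.0.1": 55801938},
    "NuGet.Versioning": {"5.6.0-preview.3.6558": 988, "5.6.0": 10224}
}
"""


def load_downloads(text):
    """Index download counts by lowercase ID, then lowercase version."""
    raw = json.loads(text)
    return {
        package_id.lower(): {v.lower(): count for v, count in versions.items()}
        for package_id, versions in raw.items()
    }


def version_downloads(index, package_id, version):
    """Return the count, or None when the file has no data for the version."""
    return index.get(package_id.lower(), {}).get(version.lower())


def total_downloads(index, package_id):
    """Sum all version counts for an ID (0 when the ID is absent)."""
    return sum(index.get(package_id.lower(), {}).values())


downloads = load_downloads(SAMPLE)
print(total_downloads(downloads, "newtonsoft.json"))
print(version_downloads(downloads, "NuGet.Versioning", "5.6.0-PREVIEW.3.6558"))
```

A `None` result only means the file holds no count for that version; as noted above, it says nothing about whether the version exists on the package source.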

This file has a "v2" in the file name because it is the second version of this data. The "v1" format is still produced
by the statistics subsystem and has a less friendly data format.

The class for reading and writing this file to Blob Storage is [`DownloadDataClient`](../src/NuGet.Services.AzureSearch/AuxiliaryFiles/DownloadDataClient.cs).

## Package ownership data

The `owners/owners.v2.json` file contains the owner information about all package IDs. Each time this file is updated,
the set of package IDs that changed is written to a "change history" file with a path pattern like
`owners/changes/TIMESTAMP.json`.

The class for reading and writing these files to Blob Storage is [`OwnerDataClient`](../src/NuGet.Services.AzureSearch/AuxiliaryFiles/OwnerDataClient.cs).

### `owners/owners.v2.json`

The owners data file looks like this:

```json
{
    "Newtonsoft.Json": [
        "dotnetfoundation",
        "jamesnk",
        "newtonsoft"
    ],
    "NuGet.Versioning": [
        "Microsoft",
        "nuget"
    ]
}
```

The package ID key is not guaranteed to have the original (author-intended) casing and should be treated
in a case insensitive manner. The owner values will have the same casing that is shown on NuGetGallery but should be
treated in a case insensitive manner.

If a package ID does not exist in the data file, this indicates that the package ID has no owners (a possible but
relatively rare scenario for NuGetGallery). It is possible for a package ID with no versions to appear in this file.

The order of the IDs and owner usernames in the file is case insensitive ascending lexicographical order.
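The following Python sketch shows one way a reader of this file could honor the case-insensitivity rules and check the documented ordering. It is not the `OwnerDataClient` implementation, the function names are hypothetical, and comparing via `str.lower()` is an assumption that may not match the exact comparer used by the writer for unusual characters.

```python
import json

SAMPLE = """
{
    "Newtonsoft.Json": ["dotnetfoundation", "jamesnk", "newtonsoft"],
    "NuGet.Versioning": ["Microsoft", "nuget"]
}
"""


def load_owners(text):
    """Index owner lists by lowercase package ID."""
    return {package_id.lower(): owners for package_id, owners in json.loads(text).items()}


def owners_of(index, package_id):
    """Return the owners, or an empty list when the ID has no owners."""
    return index.get(package_id.lower(), [])


def is_owner(index, package_id, username):
    """Case-insensitive membership check for an owner username."""
    return username.lower() in (o.lower() for o in owners_of(index, package_id))


def is_sorted_case_insensitive(values):
    """Check case-insensitive ascending order of a list of strings."""
    lowered = [v.lower() for v in values]
    return lowered == sorted(lowered)


owners = load_owners(SAMPLE)
print(owners_of(owners, "nuget.versioning"))
print(is_owner(owners, "NuGet.Versioning", "MICROSOFT"))
```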

This file has a "v2" in the file name because it is the second version of this data. The "v1" format was deprecated when
nuget.org moved from a Lucene-based search service to Azure Search. The "v1" format had a less friendly data format.

### Change history

The change history files do not contain owner usernames for GDPR reasons but mention all of the package IDs that had
ownership changes since the last time that the `owners.v2.json` file was generated. If a package ID is not mentioned in
a file, that means that there were no ownership changes in the time window. An ownership change is defined as one or
more owners being added or removed from the set of owners for that package ID.

Each change history data file has a file name with timestamp format `yyyy-MM-dd-HH-mm-ss-FFFFFFF` (UTC) and a file
extension of `.json`.

The files look like this:

```json
[
    "Newtonsoft.Json",
    "NuGet.Versioning"
]
```

By processing the files in order of their timestamp file name, a rough log of ownership changes can be produced. These
files are currently not read by any job and are produced for future investigative purposes.
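Ordering the change history files can be sketched in Python as follows. The blob names in the example are made up for illustration, and the parsing is an assumption based on the `yyyy-MM-dd-HH-mm-ss-FFFFFFF` format: the .NET `F` specifier omits trailing zeroes, so the fractional part may have anywhere from zero to seven digits, and Python's `datetime` only keeps microseconds, so the seventh digit is dropped.

```python
from datetime import datetime, timezone


def parse_change_file_name(path):
    """Parse 'owners/changes/<timestamp>.json' into a UTC datetime."""
    name = path.rsplit("/", 1)[-1]
    if name.endswith(".json"):
        name = name[: -len(".json")]
    parts = name.split("-")
    year, month, day, hour, minute, second = (int(p) for p in parts[:6])
    # Up to 7 fractional digits (100 ns units); missing digits are zeroes.
    fraction = parts[6] if len(parts) > 6 else ""
    microsecond = int(fraction.ljust(7, "0")) // 10
    return datetime(year, month, day, hour, minute, second, microsecond, tzinfo=timezone.utc)


# Hypothetical blob names, not real data.
blobs = [
    "owners/changes/2020-06-14-09-00-00-5.json",
    "owners/changes/2020-06-13-23-59-59-1234567.json",
    "owners/changes/2020-06-14-09-00-00-05.json",
]

for blob in sorted(blobs, key=parse_change_file_name):
    print(blob)
```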

The package IDs are not guaranteed to have the original (author-intended) casing and should be treated
in a case insensitive manner.

The order of the package IDs in the file is undefined.

## Verified packages data

The `verified-packages/verified-packages.v1.json` data file contains all package IDs that are considered verified by the [prefix reservation feature](https://docs.microsoft.com/en-us/nuget/nuget-org/id-prefix-reservation). This essentially defines the verified checkmark icon in the search UIs.

The data file looks like this:

```json
[
    "Newtonsoft.Json",
    "NuGet.Versioning"
]
```

If a package ID is in the file, then it is verified. The package ID is not guaranteed to have the original
(author-intended) casing and should be treated in a case insensitive manner.

The order of the package IDs is undefined.
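Since the file is a flat list with undefined order, a reader would typically load it into a set for lookups. The Python sketch below illustrates this with hypothetical function names. It is not the `VerifiedPackagesDataClient` implementation, and lowercasing via `str.lower()` is an assumption about how case-insensitive comparison is done.

```python
import json

SAMPLE = '["Newtonsoft.Json", "NuGet.Versioning"]'


def load_verified_packages(text):
    """Load the verified package IDs into a lowercase set."""
    return {package_id.lower() for package_id in json.loads(text)}


def is_verified(verified, package_id):
    """Case-insensitive check for whether an ID is verified."""
    return package_id.lower() in verified


verified = load_verified_packages(SAMPLE)
print(is_verified(verified, "newtonsoft.json"))
```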

The class for reading and writing this file to Blob Storage is [`VerifiedPackagesDataClient`](../src/NuGet.Services.AzureSearch/AuxiliaryFiles/VerifiedPackagesDataClient.cs).

## Popularity transfer data

The `popularity-transfers/popularity-transfers.v1.json` data file has a mapping of all package IDs that have
transferred their popularity to one or more other packages.

The data file looks like this:

```json
{
    "OldPackageA": [
        "NewPackage1",
        "NewPackage2"
    ],
    "OldPackageB": [
        "NewPackage3"
    ]
}
```

For each key-value pair, the package ID key has its popularity transferred to the package ID values. The implementation
of the popularity transfer is out of scope for the data file format. Package IDs that do not appear as a key in this
file do not have their popularity transferred.

The package ID keys and values are not guaranteed to have the original (author-intended) casing and should be treated
in a case insensitive manner.

The order of the package ID keys and values is case insensitive ascending lexicographical order.
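Reading the mapping in both directions can be sketched in Python as below. This only covers the data file format; how popularity is actually transferred is out of scope here, as noted above. The function names are hypothetical, it is not the `PopularityTransferDataClient` implementation, and lowercasing via `str.lower()` is an assumption about how case-insensitive comparison is done.

```python
import json
from collections import defaultdict

SAMPLE = """
{
    "OldPackageA": ["NewPackage1", "NewPackage2"],
    "OldPackageB": ["NewPackage3"]
}
"""


def load_transfers(text):
    """Index outgoing transfers by lowercase source package ID."""
    return {old_id.lower(): new_ids for old_id, new_ids in json.loads(text).items()}


def transfers_from(outgoing, package_id):
    """IDs that receive popularity from package_id (empty when none)."""
    return outgoing.get(package_id.lower(), [])


def invert_transfers(text):
    """Map each lowercase receiving ID to the IDs transferring to it."""
    incoming = defaultdict(list)
    for old_id, new_ids in json.loads(text).items():
        for new_id in new_ids:
            incoming[new_id.lower()].append(old_id)
    return dict(incoming)


outgoing = load_transfers(SAMPLE)
incoming = invert_transfers(SAMPLE)
print(transfers_from(outgoing, "oldpackagea"))
print(incoming.get("newpackage3", []))
```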

The class for reading and writing this file to Blob Storage is [`PopularityTransferDataClient`](../src/NuGet.Services.AzureSearch/AuxiliaryFiles/PopularityTransferDataClient.cs).
@@ -46,6 +46,7 @@
<ItemGroup>
<None Include="App.config" />
<None Include="NuGet.Jobs.Db2AzureSearch.nuspec" />
<None Include="README.md" />
<None Include="Scripts\PostDeploy.ps1" />
</ItemGroup>
<ItemGroup>
1 change: 1 addition & 0 deletions src/NuGet.Jobs.Db2AzureSearch/README.md
@@ -0,0 +1 @@
TODO: https://github.com/NuGet/NuGetGallery/issues/8005
@@ -68,6 +68,7 @@
<CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
</Content>
<Content Include="Settings\local.json" />
<Content Include="README.md" />
<None Include="Web.Debug.config">
<DependentUpon>Web.config</DependentUpon>
</None>