Add README for the search service (#784)

Add supporting documents for the Azure Search indexes and auxiliary data files.
Address NuGet/NuGetGallery#8006
NuGet · Jun 14, 2020 · e9e8569
1 parent 3a54e76
commit e9e8569
Showing 6 changed files with 643 additions and 0 deletions.
diff --git a/docs/Azure-Search-indexes.md b/docs/Azure-Search-indexes.md
@@ -0,0 +1,111 @@
+# Azure Search indexes
+
+**Subsystem: Search 🔎**
+
+The search subsystem heavily depends on Azure Search for storing package metadata and performing package queries. Within
+a single Azure Search resource, there can be multiple indexes. An index is simply a collection of documents with a
+common schema. For the NuGet search subsystem, there are two indexes expected in each Azure Search resource:
+
+- [`search-XXX`](#search-index) - this is the "search" index which contains documents for *discovery* queries
+- [`hijack-XXX`](#hijack-index) - this is the "hijack" index which contains documents for *metadata lookup* queries
+
+## Search index
+
+The search index is designed to fulfill queries for package discovery. This is likely the scenario you would think about
+first when you imagine how package search would work. It's optimized for searching package metadata fields by one or more
+keywords and has a scoring profile that returns the most relevant package first.
+
+This index has up to four documents per package ID. Each of the four ID-specific documents represents a different view
+of available package versions. Two factors decide which package versions a document includes: whether or not to
+consider prerelease versions and whether or not to consider SemVer 2.0.0 versions.
+
+This may seem a little strange at first, so it's best to consider an example. Consider a package
+[`BaseTestPackage.SearchFilters`](https://www.nuget.org/packages/BaseTestPackage.SearchFilters) that has four versions:
+
+- `1.1.0` - stable, SemVer 1.0.0
+- `1.2.0-beta` - prerelease, SemVer 1.0.0
+- `1.3.0+metadata` - stable, SemVer 2.0.0 (due to build metadata)
+- `1.4.0-delta.4` - prerelease, SemVer 2.0.0 (due to a dot in the prerelease label)
+
+As mentioned before, there are up to four documents per package ID. In the case of the example package
+`BaseTestPackage.SearchFilters`, there will be four documents, each with a different set of versions included in the
+document.
+
+- Stable + SemVer 1.0.0: contains only `1.1.0` ([example query](https://azuresearch-usnc.nuget.org/query?q=packageid:BaseTestPackage.SearchFilters))
+- Stable/Prerelease + SemVer 1.0.0: contains `1.1.0` and `1.2.0-beta` ([example query](https://azuresearch-usnc.nuget.org/query?q=packageid:BaseTestPackage.SearchFilters&prerelease=true))
+- Stable + SemVer 2.0.0: contains `1.1.0` and `1.3.0+metadata` ([example query](https://azuresearch-usnc.nuget.org/query?q=packageid:BaseTestPackage.SearchFilters&semVerLevel=2.0.0))
+- Stable/Prerelease + SemVer 2.0.0: contains all versions ([example query](https://azuresearch-usnc.nuget.org/query?q=packageid:BaseTestPackage.SearchFilters&prerelease=true&semVerLevel=2.0.0))
+
+The four "flavors" of search documents per ID are referred to as **search filters**.
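As a rough illustration of how the four views differ, the following Python sketch selects versions for each search filter. It is not the indexing job's implementation: the `SearchFilter` class and helper names are hypothetical, the filter names are the ones listed later in this document, and the SemVer 2.0.0 check is simplified to the two reasons given in the example (build metadata, or a dot in the prerelease label).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchFilter:
    name: str
    include_prerelease: bool
    include_semver2: bool


SEARCH_FILTERS = [
    SearchFilter("Default", False, False),
    SearchFilter("IncludePrerelease", True, False),
    SearchFilter("IncludeSemVer2", False, True),
    SearchFilter("IncludePrereleaseAndSemVer2", True, True),
]


def is_prerelease(version):
    # A prerelease label follows the first "-" in the part before any "+".
    return "-" in version.split("+", 1)[0]


def is_semver2(version):
    # Simplified: build metadata present, or a dot in the prerelease label.
    core, plus, _metadata = version.partition("+")
    _numbers, _dash, label = core.partition("-")
    return bool(plus) or "." in label


def versions_for(search_filter, versions):
    # Keep a version unless the filter excludes its prerelease or SemVer 2.0.0 status.
    return [
        v for v in versions
        if (search_filter.include_prerelease or not is_prerelease(v))
        and (search_filter.include_semver2 or not is_semver2(v))
    ]


versions = ["1.1.0", "1.2.0-beta", "1.3.0+metadata", "1.4.0-delta.4"]
for search_filter in SEARCH_FILTERS:
    print(search_filter.name, versions_for(search_filter, versions))
```

With the four example versions, the `Default` view keeps only `1.1.0` and the `IncludePrereleaseAndSemVer2` view keeps all four, matching the list above.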
+
+The documents in the search index are identified (via the `key` property) by a unique string with the following format:
+
+```
+{sanitized lowercase ID}-{base64 lowercase ID}-{search filter}
+```
+
+The `sanitized lowercase ID` is the lowercase package ID with all characters that are not acceptable for Azure Search
+document keys, like dots and non-ASCII word characters (like Chinese characters), replaced. In the example key below,
+the dot becomes an underscore. This component of the document key is included for readability purposes only.
+
+The `base64 lowercase ID` is the base64 encoding of the lowercase package ID's bytes, encoded with UTF-8. This string is
+guaranteed to be a 1:1 mapping with the lowercase package ID and is included for uniqueness. The
+`HttpServerUtility.UrlTokenEncode` API is used for base64 encoding.
+
+The `search filter` has one of four values:
+
+- `Default` - Stable + SemVer 1.0.0
+- `IncludePrerelease` - Stable/Prerelease + SemVer 1.0.0
+- `IncludeSemVer2` - Stable + SemVer 2.0.0
+- `IncludePrereleaseAndSemVer2` - Stable/Prerelease + SemVer 2.0.0
+
+For the package ID `BaseTestPackage.SearchFilters`, the Stable + SemVer 1.0.0 document key would be:
+
+```
+basetestpackage_searchfilters-YmFzZXRlc3RwYWNrYWdlLnNlYXJjaGZpbHRlcnM1-Default
+```
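The key format above can be approximated in a few lines of Python. This is a sketch under stated assumptions, not the service's code: `url_token_encode` imitates `HttpServerUtility.UrlTokenEncode` (URL-safe base64 with the `=` padding replaced by a digit giving the padding count), and `sanitize` assumes that every character outside `A-Z`, `a-z`, `0-9`, `_` and `-` becomes an underscore, which is inferred from the example key rather than taken from the source. All function names are hypothetical.

```python
import base64
import re


def url_token_encode(data):
    # Imitates HttpServerUtility.UrlTokenEncode for illustration purposes.
    if not data:
        return ""
    encoded = base64.urlsafe_b64encode(data).decode("ascii")
    stripped = encoded.rstrip("=")
    # The trailing digit records how many "=" padding characters were removed.
    return stripped + str(len(encoded) - len(stripped))


def sanitize(value):
    # Assumption: unsafe characters are replaced with "_" (inferred from the example key).
    return re.sub(r"[^A-Za-z0-9_-]", "_", value)


def search_document_key(package_id, search_filter):
    lower_id = package_id.lower()
    unique = url_token_encode(lower_id.encode("utf-8"))
    return f"{sanitize(lower_id)}-{unique}-{search_filter}"


print(search_document_key("BaseTestPackage.SearchFilters", "Default"))
```

For the example package and the `Default` filter, this sketch reproduces the key shown above.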
+
+Each document contains a variety of metadata fields originating from the latest version in the applicable version list
+as well as a field listing all versions. See the
+[`NuGet.Services.AzureSearch.SearchDocument.Full`](../src/NuGet.Services.AzureSearch/Models/SearchDocument.cs) class and
+its inherited members for a full list of the fields.
+
+Unlisted package versions do not appear in the search index at all.
+
+## Hijack index
+
+The hijack index is used by the gallery to fulfill specific metadata lookup operations. For example, if a
+customer is looking for metadata about all versions of the package ID `Newtonsoft.Json`, in certain cases the gallery
+will query the search service for this metadata and the search service will use the hijack index to fetch the
+data.
+
+This index has one document for every version of every package ID, whether it is unlisted or not. The search service
+uses this index to find all versions of a package via the `ignoreFilter=true` parameter, including:
+
+- unlisted packages ([example query](https://azuresearch-usnc.nuget.org/search/query?q=packageid:BaseTestPackage.Unlisted&ignoreFilter=true))
+- multiple versions of a single ID ([example query](https://azuresearch-usnc.nuget.org/search/query?q=packageid:BaseTestPackage.SearchFilters&ignoreFilter=true&semVerLevel=2.0.0))
+
+The documents in the hijack index are identified (via the `key` property) by a unique string with the following format:
+
+```
+{sanitized ID/version}-{base64 ID/version}
+```
+
+The `sanitized ID/version` is the `{lowercase package ID}/{lowercase, normalized version}` string with all characters
+that are not acceptable for Azure Search document keys, like dots and non-ASCII word characters (like Chinese
+characters), replaced. In the example key below, the dots and the slash become underscores. This component of the
+document key is included for readability purposes only.
+
+The `base64 ID/version` is the base64 encoding of the previously mentioned concatenation of ID and version, encoded
+with UTF-8. This string is guaranteed to be a 1:1 mapping with the lowercase package ID and version and is included
+for uniqueness. The `HttpServerUtility.UrlTokenEncode` API is used for base64 encoding.
+
+For the package ID `BaseTestPackage.SearchFilters` and version `1.3.0+metadata`, the document key would be:
+
+```
+basetestpackage_searchfilters_1_3_0-YmFzZXRlc3RwYWNrYWdlLnNlYXJjaGZpbHRlcnMvMS4zLjA1
+```
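The hijack key can be sketched the same way. The assumptions match those stated for the search key: `url_token_encode` imitates `HttpServerUtility.UrlTokenEncode`, and the underscore replacement in `sanitize` is inferred from the example key. Version normalization is deliberately simplified here to dropping build metadata and lowercasing; the full `NuGetVersion` normalization rules are not implemented. All function names are hypothetical.

```python
import base64
import re


def url_token_encode(data):
    # URL-safe base64 with "=" padding replaced by a digit giving the padding count.
    if not data:
        return ""
    encoded = base64.urlsafe_b64encode(data).decode("ascii")
    stripped = encoded.rstrip("=")
    return stripped + str(len(encoded) - len(stripped))


def sanitize(value):
    # Assumption: unsafe characters are replaced with "_" (inferred from the example key).
    return re.sub(r"[^A-Za-z0-9_-]", "_", value)


def hijack_document_key(package_id, version):
    # Simplified normalization: strip build metadata and lowercase the rest.
    normalized = version.split("+", 1)[0].lower()
    raw = f"{package_id.lower()}/{normalized}"
    unique = url_token_encode(raw.encode("utf-8"))
    return f"{sanitize(raw)}-{unique}"


print(hijack_document_key("BaseTestPackage.SearchFilters", "1.3.0+metadata"))
```

For the example ID and version, this sketch reproduces the key shown above.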
+
+Each document contains a variety of metadata fields originating from the latest version in the applicable version list
+as well as a field listing all versions. See the
+[`NuGet.Services.AzureSearch.HijackDocument.Full`](../src/NuGet.Services.AzureSearch/Models/HijackDocument.cs) class and
+its inherited members for a full list of the fields.
diff --git a/docs/Search-auxiliary-files.md b/docs/Search-auxiliary-files.md
@@ -0,0 +1,169 @@
+# Search auxiliary files
+
+**Subsystem: Search 🔎**
+
+Aside from metadata stored in the [Azure Search indexes](Azure-Search-indexes.md), there is data stored in Azure Blob
+Storage for bookkeeping and performance reasons. These data files are called **auxiliary files**. The data files
+mentioned here are those explicitly managed by the search subsystem. Other data files exist (manually created,
+created by the statistics subsystem, etc.). Those will not be covered here but are mentioned in the job-specific
+documentation that uses them as input.
+
+Each search auxiliary file is copied to every region where a [search service](../src/NuGet.Services.SearchService/README.md)
+is deployed. For nuget.org, we run search in four regions, so there are four copies of each of these files.
+
+The search auxiliary files are:
+
+  - [`downloads/downloads.v2.json`](#download-count-data) - total download count for every package version
+  - [`owners/owners.v2.json` and change history](#package-ownership-data) - owners for every package ID
+  - [`verified-packages/verified-packages.v1.json`](#verified-packages-data) - package IDs that are verified
+  - [`popularity-transfers/popularity-transfers.v1.json`](#popularity-transfer-data) - popularity transfers between package IDs
+
+## Download count data
+
+The `downloads/downloads.v2.json` file has the total download count for all package versions. The total download count
+for a package ID as a whole can be calculated simply by adding all version download counts.
+
+The downloads data file looks like this:
+
+```json
+{
+  "Newtonsoft.Json": {
+    "8.0.3": 10508321,
+    "9.0.1": 55801938
+  },
+  "NuGet.Versioning": {
+    "5.6.0-preview.3.6558": 988,
+    "5.6.0": 10224
+  }
+}
+```
+
+The package ID and version keys are not guaranteed to have the original (author-intended) casing and should be treated
+in a case insensitive manner. The version keys will always be normalized via [standard `NuGetVersion` normalization rules](https://docs.microsoft.com/en-us/nuget/concepts/package-versioning#normalized-version-numbers)
+(e.g. no build metadata will appear, no leading zeroes, etc.).
+
+If a package ID or version does not exist in the data file, this only indicates that there is no download count data and
+does not imply that the package ID or version does not exist on the package source. Conversely, package IDs or
+versions that do not exist on the package source (perhaps due to deletion) can still appear in the data file.
+
+The order of the IDs and versions in the file is undefined.
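A minimal sketch of the per-ID total described above, assuming the file content has already been downloaded into a string. The function name is hypothetical. Package IDs are lowercased so that keys differing only by casing are combined, in line with the case insensitivity noted above.

```python
import json
from collections import defaultdict


def total_downloads_by_id(downloads_json):
    # Sum the version download counts under each package ID, ignoring ID casing.
    data = json.loads(downloads_json)
    totals = defaultdict(int)
    for package_id, versions in data.items():
        totals[package_id.lower()] += sum(versions.values())
    return dict(totals)


sample = """
{
  "Newtonsoft.Json": {"8.0.3": 10508321, "9.0.1": 55801938},
  "NuGet.Versioning": {"5.6.0-preview.3.6558": 988, "5.6.0": 10224}
}
"""
print(total_downloads_by_id(sample))
```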
+
+This file has a "v2" in the file name because it is the second version of this data. The "v1" format is still produced
+by the statistics subsystem and has a less friendly data format.
+
+The class for reading and writing this file to Blob Storage is [`DownloadDataClient`](../src/NuGet.Services.AzureSearch/AuxiliaryFiles/DownloadDataClient.cs).
+
+## Package ownership data
+
+The `owners/owners.v2.json` file contains the owner information about all package IDs. Each time this file is updated,
+the set of package IDs that changed is written to a "change history" file with a path pattern like
+`owners/changes/TIMESTAMP.json`.
+
+The class for reading and writing these files to Blob Storage is [`OwnerDataClient`](../src/NuGet.Services.AzureSearch/AuxiliaryFiles/OwnerDataClient.cs).
+
+### `owners/owners.v2.json`
+
+The owners data file looks like this:
+
+```json
+{
+  "Newtonsoft.Json": [
+    "dotnetfoundation",
+    "jamesnk",
+    "newtonsoft"
+  ],
+  "NuGet.Versioning": [
+    "Microsoft",
+    "nuget"
+  ]
+}
+```
+
+The package ID key is not guaranteed to have the original (author-intended) casing and should be treated
+in a case insensitive manner. The owner values will have the same casing that is shown on NuGetGallery but should be
+treated in a case insensitive manner.
+
+If a package ID does not exist in the data file, this indicates that the package ID has no owners (a possible but
+relatively rare scenario for NuGetGallery). It is possible for a package ID with no versions to appear in this file.
+
+The order of the IDs and owner usernames in the file is case insensitive ascending lexicographical order.
+
+This file has a "v2" in the file name because it is the second version of this data. The "v1" format was deprecated when
+nuget.org moved from a Lucene-based search service to Azure Search. The "v1" format had a less friendly data format.
+
+### Change history
+
+The change history files do not contain owner usernames for GDPR reasons but mention all of the package IDs that had
+ownership changes since the last time that the `owners.v2.json` file was generated. If a package ID is not mentioned in
+a file, that means that there were no ownership changes in the time window. An ownership change is defined as one or
+more owners being added or removed from the set of owners for that package ID.
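The definition of an ownership change can be illustrated with a small comparison of two owner snapshots. This is only a sketch of the definition, not a description of how the job computes the change history; the function name is hypothetical, and both package IDs and usernames are compared case insensitively.

```python
def changed_package_ids(old_owners, new_owners):
    # Normalize casing so that differences in casing alone do not count as changes.
    def normalize(data):
        return {
            package_id.lower(): {owner.lower() for owner in owners}
            for package_id, owners in data.items()
        }

    old, new = normalize(old_owners), normalize(new_owners)
    # A package ID changed if its owner set differs; a missing ID means no owners.
    return sorted(
        package_id
        for package_id in old.keys() | new.keys()
        if old.get(package_id, set()) != new.get(package_id, set())
    )


old = {"Newtonsoft.Json": ["jamesnk"], "NuGet.Versioning": ["nuget"]}
new = {"newtonsoft.json": ["JamesNK", "newtonsoft"], "NuGet.Versioning": ["NuGet"]}
print(changed_package_ids(old, new))
```

Here only `newtonsoft.json` is reported, because an owner was added to it, while the owner of `NuGet.Versioning` differs only by casing.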
+
+Each change history data file has a file name with timestamp format `yyyy-MM-dd-HH-mm-ss-FFFFFFF` (UTC) and a file
+extension of `.json`.
+
+The files look like this:
+
+```json
+[
+  "Newtonsoft.Json",
+  "NuGet.Versioning"
+]
+```
+
+By processing the files in order of their timestamp file name, a rough log of ownership changes can be produced. These
+files are currently not read by any job and are produced for future investigative purposes.
+
+The package IDs are not guaranteed to have the original (author-intended) casing and should be treated
+in a case insensitive manner.
+
+The order of the package IDs in the file is undefined.
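The rough log mentioned above could be assembled along these lines. This sketch assumes the change history files have already been read into a dictionary of file name to package ID list; both function names are hypothetical. Because the .NET `F` format specifier omits trailing zeros, the fractional part of a file name may have fewer than seven digits, so the parser pads it. Python datetimes only store microseconds, so the seventh digit is dropped.

```python
from datetime import datetime, timezone


def parse_change_file_name(file_name):
    # Parse a `yyyy-MM-dd-HH-mm-ss-FFFFFFF.json` file name into a UTC datetime.
    stem = file_name[:-len(".json")] if file_name.endswith(".json") else file_name
    parts = stem.split("-")
    year, month, day, hour, minute, second = (int(part) for part in parts[:6])
    fraction = parts[6] if len(parts) > 6 else ""
    # Pad to seven digits (100 ns ticks), then keep six digits (microseconds).
    microsecond = int(fraction.ljust(7, "0")[:6])
    return datetime(year, month, day, hour, minute, second, microsecond, tzinfo=timezone.utc)


def ownership_change_log(files):
    # Walk the files in file name order and emit (timestamp, package ID) pairs.
    log = []
    for file_name in sorted(files):
        timestamp = parse_change_file_name(file_name)
        for package_id in files[file_name]:
            log.append((timestamp, package_id))
    return log


files = {
    "2020-06-14-10-00-00-5.json": ["NuGet.Versioning"],
    "2020-06-14-09-59-59-1234567.json": ["Newtonsoft.Json"],
}
for timestamp, package_id in ownership_change_log(files):
    print(timestamp.isoformat(), package_id)
```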
+
+## Verified packages data
+
+The `verified-packages/verified-packages.v1.json` data file contains all package IDs that are considered verified by the [prefix reservation feature](https://docs.microsoft.com/en-us/nuget/nuget-org/id-prefix-reservation). This essentially defines the verified checkmark icon in the search UIs.
+
+The data file looks like this:
+
+```json
+[
+  "Newtonsoft.Json",
+  "NuGet.Versioning"
+]
+```
+
+If a package ID is in the file, then it is verified. The package ID is not guaranteed to have the original
+(author-intended) casing and should be treated in a case insensitive manner. 
+
+The order of the package IDs is undefined.
+
+The class for reading and writing this file to Blob Storage is [`VerifiedPackagesDataClient`](../src/NuGet.Services.AzureSearch/AuxiliaryFiles/VerifiedPackagesDataClient.cs).
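A consumer of this file mostly needs a case insensitive membership check. A minimal sketch, assuming the file content is available as a string; the function names are hypothetical and are not related to `VerifiedPackagesDataClient`.

```python
import json


def load_verified_ids(verified_json):
    # Store lowercase IDs so lookups do not depend on the casing in the file.
    return {package_id.lower() for package_id in json.loads(verified_json)}


def is_verified(package_id, verified_ids):
    return package_id.lower() in verified_ids


verified_ids = load_verified_ids('["Newtonsoft.Json", "NuGet.Versioning"]')
print(is_verified("newtonsoft.json", verified_ids))
```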
+
+## Popularity transfer data
+
+The `popularity-transfers/popularity-transfers.v1.json` data file has a mapping of all package IDs that have
+transferred their popularity to one or more other packages.
+
+The data file looks like this:
+
+```json
+{
+  "OldPackageA": [
+    "NewPackage1",
+    "NewPackage2"
+  ],
+  "OldPackageB": [
+    "NewPackage3"
+  ]
+}
+```
+
+For each key-value pair, the package ID key has its popularity transferred to the package ID values. The implementation
+of the popularity transfer is out of scope for the data file format. Package IDs that do not appear as a key in this
+file do not have their popularity transferred.
+
+The package ID keys and values are not guaranteed to have the original (author-intended) casing and should be treated
+in a case insensitive manner.
+
+The order of the package ID keys and values is case insensitive ascending lexicographical order.
+
+The class for reading and writing this file to Blob Storage is [`PopularityTransferDataClient`](../src/NuGet.Services.AzureSearch/AuxiliaryFiles/PopularityTransferDataClient.cs).
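The file maps each source package ID to its targets. A consumer that wants the opposite direction (which IDs transfer popularity into a given package) can invert the mapping. The sketch below only illustrates reading the format; it says nothing about how the transfer itself is applied, which the document leaves out of scope. The function name is hypothetical, and IDs are lowercased for case insensitive handling.

```python
import json
from collections import defaultdict


def incoming_transfers(transfers_json):
    # Invert {source ID: [target IDs]} into {target ID: [source IDs]}.
    outgoing = json.loads(transfers_json)
    incoming = defaultdict(set)
    for from_id, to_ids in outgoing.items():
        for to_id in to_ids:
            incoming[to_id.lower()].add(from_id.lower())
    return {to_id: sorted(from_ids) for to_id, from_ids in incoming.items()}


sample = """
{
  "OldPackageA": ["NewPackage1", "NewPackage2"],
  "OldPackageB": ["NewPackage3"]
}
"""
print(incoming_transfers(sample))
```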
diff --git a/src/NuGet.Jobs.Db2AzureSearch/NuGet.Jobs.Db2AzureSearch.csproj b/src/NuGet.Jobs.Db2AzureSearch/NuGet.Jobs.Db2AzureSearch.csproj
@@ -46,6 +46,7 @@
   <ItemGroup>
     <None Include="App.config" />
     <None Include="NuGet.Jobs.Db2AzureSearch.nuspec" />
+    <None Include="README.md" />
     <None Include="Scripts\PostDeploy.ps1" />
   </ItemGroup>
   <ItemGroup>

diff --git a/src/NuGet.Jobs.Db2AzureSearch/README.md b/src/NuGet.Jobs.Db2AzureSearch/README.md
@@ -0,0 +1 @@
+TODO: https://github.com/NuGet/NuGetGallery/issues/8005
diff --git a/src/NuGet.Services.SearchService/NuGet.Services.SearchService.csproj b/src/NuGet.Services.SearchService/NuGet.Services.SearchService.csproj
@@ -68,6 +68,7 @@
       <CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
     </Content>
     <Content Include="Settings\local.json" />
+    <Content Include="README.md" />
     <None Include="Web.Debug.config">
       <DependentUpon>Web.config</DependentUpon>
     </None>