This repository has been archived by the owner on Jul 30, 2024. It is now read-only.
Add README for the search service (#784)
Add supporting documents for the Azure Search indexes and auxiliary data files Address NuGet/NuGetGallery#8006
1 parent b9f34ca, commit 725e891
Showing 6 changed files with 643 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,111 @@ | ||
# Azure Search indexes | ||
|
||
**Subsystem: Search 🔎** | ||
|
||
The search subsystem heavily depends on Azure Search for storing package metadata and performing package queries. Within | ||
a single Azure Search resource, there can be multiple indexes. An index is simply a collection of documents with a | ||
common schema. For the NuGet search subsystem, there are two indexes expected in each Azure Search resource: | ||
|
||
- [`search-XXX`](#search-index) - this is the "search" index which contains documents for *discovery* queries | ||
- [`hijack-XXX`](#hijack-index) - this is the "hijack" index which contains documents for *metadata lookup* queries | ||
|
||
## Search index | ||
|
||
The search index is designed to fulfill queries for package discovery. This is likely the scenario you would think about | ||
first when you imagine how package search would work. It's optimized for searching package metadata fields by one or more | ||
keywords and has a scoring profile that returns the most relevant package first. | ||
|
||
This index has up to four documents per package ID. Each of the four ID-specific documents represents a different view | ||
of available package versions. There are two factors for filtering in and out package versions: whether or not to | ||
consider prerelease versions and whether or not to consider SemVer 2.0.0 versions. | ||
|
||
This may seem a little strange at first, so it's best to consider an example. Consider a package | ||
[`BaseTestPackage.SearchFilters`](https://www.nuget.org/packages/BaseTestPackage.SearchFilters) that has four versions: | ||
|
||
- `1.1.0` - stable, SemVer 1.0.0 | ||
- `1.2.0-beta` - prerelease, SemVer 1.0.0 | ||
- `1.3.0+metadata` - stable, SemVer 2.0.0 (due to build metadata) | ||
- `1.4.0-delta.4` - prerelease, SemVer 2.0.0 (due to a dot in the prerelease label) | ||
|
||
As mentioned before, there are up to four documents per package ID. In the case of the example package | ||
`BaseTestPackage.SearchFilters`, there will be four documents, each with a different set of versions included in the | ||
document. | ||
|
||
- Stable + SemVer 1.0.0: contains only `1.1.0` ([example query](https://azuresearch-usnc.nuget.org/query?q=packageid:BaseTestPackage.SearchFilters)) | ||
- Stable/Prerelease + SemVer 1.0.0: contains `1.1.0` and `1.2.0-beta` ([example query](https://azuresearch-usnc.nuget.org/query?q=packageid:BaseTestPackage.SearchFilters&prerelease=true)) | ||
- Stable + SemVer 2.0.0: contains `1.1.0` and `1.3.0+metadata` ([example query](https://azuresearch-usnc.nuget.org/query?q=packageid:BaseTestPackage.SearchFilters&semVerLevel=2.0.0)) | ||
- Stable/Prerelease + SemVer 2.0.0: contains all versions ([example query](https://azuresearch-usnc.nuget.org/query?q=packageid:BaseTestPackage.SearchFilters&prerelease=true&semVerLevel=2.0.0)) | ||
|
||
The four "flavors" of search documents per ID are referred to as **search filters**. | ||
|
||
The documents in the search index are identified (via the `key` property) by a unique string with the following format: | ||
|
||
``` | ||
{sanitized lowercase ID}-{base64 lowercase ID}-{search filter} | ||
``` | ||
|
||
The `sanitized lowercase ID` is the lowercase package ID with all characters that are not acceptable for Azure Search | ||
document keys, like dots and non-ASCII word characters (like Chinese characters), replaced. In the example key below, | ||
each dot becomes an underscore. This component of the document key is included for readability purposes only. | ||
|
||
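The `base64 lowercase ID` is the base64 encoding of the lowercase package ID's bytes, encoded with UTF-8. This string is | ||
guaranteed to be a 1:1 mapping with the lowercase package ID and is included for uniqueness. The | ||
`HttpServerUtility.UrlTokenEncode` API is used for base64 encoding. It uses the URL-safe base64 alphabet and replaces the | ||
trailing `=` padding with a single digit giving the number of padding characters. | ||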
|
||
The `search filter` has one of four values: | ||
|
||
- `Default` - Stable + SemVer 1.0.0 | ||
- `IncludePrerelease` - Stable/Prerelease + SemVer 1.0.0 | ||
- `IncludeSemVer2` - Stable + SemVer 2.0.0 | ||
- `IncludePrereleaseAndSemVer2` - Stable/Prerelease + SemVer 2.0.0 | ||
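The mapping from search filter to visible versions can be sketched in a few lines of Python. This is only an illustration of the rules described above, not the service's implementation: the helper names are hypothetical, and the version classification is deliberately simplified to the two SemVer 2.0.0 triggers mentioned in the example (build metadata, or a dot in the prerelease label).

```python
def is_prerelease(version: str) -> bool:
    # Anything after '+' is build metadata; a '-' before it starts a prerelease label.
    return "-" in version.split("+", 1)[0]


def is_semver2(version: str) -> bool:
    # Simplified assumption: SemVer 2.0.0 means build metadata or a dotted prerelease label.
    core = version.split("+", 1)[0]
    label = core.split("-", 1)[1] if "-" in core else ""
    return "+" in version or "." in label


# (include prerelease?, include SemVer 2.0.0?) for each search filter value.
SEARCH_FILTERS = {
    "Default": (False, False),
    "IncludePrerelease": (True, False),
    "IncludeSemVer2": (False, True),
    "IncludePrereleaseAndSemVer2": (True, True),
}


def versions_by_filter(versions):
    """Return the versions that each of the four search documents would list."""
    result = {}
    for name, (prerelease, semver2) in SEARCH_FILTERS.items():
        result[name] = [
            v for v in versions
            if (prerelease or not is_prerelease(v)) and (semver2 or not is_semver2(v))
        ]
    return result


views = versions_by_filter(["1.1.0", "1.2.0-beta", "1.3.0+metadata", "1.4.0-delta.4"])
for name, visible in views.items():
    print(name, visible)
```

For `BaseTestPackage.SearchFilters`, this reproduces the four version sets listed earlier in this section.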
|
||
For the package ID `BaseTestPackage.SearchFilters`, the Stable + SemVer 1.0.0 (`Default`) document key would be: | ||
|
||
``` | ||
basetestpackage_searchfilters-YmFzZXRlc3RwYWNrYWdlLnNlYXJjaGZpbHRlcnM1-Default | ||
``` | ||
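As a rough sketch, the key above can be reproduced in Python. The function names are hypothetical, and two details are assumptions inferred from the example key rather than confirmed behavior: unsupported characters are replaced with underscores, and Python's `str.lower()` stands in for the service's lowercasing. `url_token_encode` mimics `HttpServerUtility.UrlTokenEncode` (URL-safe base64 with the padding replaced by a padding count digit).

```python
import base64
import re


def url_token_encode(data: bytes) -> str:
    """URL-safe base64 where the '=' padding is replaced by a digit giving the padding count."""
    if not data:
        return ""
    encoded = base64.urlsafe_b64encode(data).decode("ascii")
    stripped = encoded.rstrip("=")
    return stripped + str(len(encoded) - len(stripped))


def search_document_key(package_id: str, search_filter: str) -> str:
    lowercase_id = package_id.lower()
    # Assumption: characters outside [A-Za-z0-9_-] become underscores, as in the example key.
    sanitized = re.sub(r"[^A-Za-z0-9_-]", "_", lowercase_id)
    unique = url_token_encode(lowercase_id.encode("utf-8"))
    return f"{sanitized}-{unique}-{search_filter}"


key = search_document_key("BaseTestPackage.SearchFilters", "Default")
print(key)
```

The trailing `1` in the base64 component is the padding count, not part of the encoded package ID.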
|
||
Each document contains a variety of metadata fields originating from the latest version in the applicable version list | ||
as well as a field listing all versions. See the | ||
[`NuGet.Services.AzureSearch.SearchDocument.Full`](../src/NuGet.Services.AzureSearch/Models/SearchDocument.cs) class and | ||
its inherited members for a full list of the fields. | ||
|
||
Unlisted package versions do not appear in the search index at all. | ||
|
||
## Hijack index | ||
|
||
The hijack index is used by the gallery to fulfill specific metadata lookup operations. For example, if a | ||
customer is looking for metadata about all versions of the package ID `Newtonsoft.Json`, in certain cases the gallery | ||
will query the search service for this metadata and the search service will use the hijack index to fetch the | ||
data. | ||
|
||
This index has one document for every version of every package ID, whether it is unlisted or not. The search service | ||
uses this index to find all versions of a package via the `ignoreFilter=true` parameter, including: | ||
|
||
- unlisted packages ([example query](https://azuresearch-usnc.nuget.org/search/query?q=packageid:BaseTestPackage.Unlisted&ignoreFilter=true)) | ||
- multiple versions of a single ID ([example query](https://azuresearch-usnc.nuget.org/search/query?q=packageid:BaseTestPackage.SearchFilters&ignoreFilter=true&semVerLevel=2.0.0)) | ||
|
||
The documents in the hijack index are identified (via the `key` property) by a unique string with the following format: | ||
|
||
``` | ||
{sanitized ID/version}-{base64 ID/version} | ||
``` | ||
|
||
The `sanitized ID/version` is the `{lowercase package ID}/{lowercase, normalized version}` string with all characters | ||
that are not acceptable for Azure Search document keys, like dots and non-ASCII word characters (like Chinese | ||
characters), replaced. In the example key below, the dots and the slash become underscores. This component of the | ||
document key is included for readability purposes only. | ||
|
||
The `base64 ID/version` is the base64 encoding of the previously mentioned concatenation of ID and version, encoded | ||
with UTF-8. This string is guaranteed to be a 1:1 mapping with the lowercase package ID and version and is included | ||
for uniqueness. The `HttpServerUtility.UrlTokenEncode` API is used for base64 encoding. | ||
|
||
For the package ID `BaseTestPackage.SearchFilters` and version `1.3.0+metadata` (normalized to `1.3.0`, without build metadata), the document key would be: | ||
|
||
``` | ||
basetestpackage_searchfilters_1_3_0-YmFzZXRlc3RwYWNrYWdlLnNlYXJjaGZpbHRlcnMvMS4zLjA1 | ||
``` | ||
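To illustrate the 1:1 mapping, the base64 component of the key above can be decoded back into the lowercase ID and normalized version. This is a minimal Python sketch with a hypothetical helper that reverses the `UrlTokenEncode` format (the final character is the number of `=` padding characters); it takes the base64 component directly because both key components may themselves contain `-`.

```python
import base64


def url_token_decode(token: str) -> bytes:
    """Reverse of UrlTokenEncode: the final character is the count of '=' padding characters."""
    if not token:
        return b""
    padding = int(token[-1])
    return base64.urlsafe_b64decode(token[:-1] + "=" * padding)


token = "YmFzZXRlc3RwYWNrYWdlLnNlYXJjaGZpbHRlcnMvMS4zLjA1"
decoded = url_token_decode(token).decode("utf-8")
package_id, version = decoded.split("/", 1)
print(package_id, version)
```

The decoded version has no `+metadata` suffix because the key is built from the normalized version.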
|
||
Each document contains a variety of metadata fields originating from the latest version in the application version list | ||
as well as a field listing all versions. See the | ||
[`NuGet.Services.AzureSearch.HijackDocument.Full`](../src/NuGet.Services.AzureSearch/Models/HijackDocument.cs) class and | ||
its inherited members for a full list of the fields. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,169 @@ | ||
# Search auxiliary files | ||
|
||
**Subsystem: Search 🔎** | ||
|
||
Aside from metadata stored in the [Azure Search indexes](Azure-Search-indexes.md), there is data stored in Azure Blob | ||
Storage for bookkeeping and performance reasons. These data files are called **auxiliary files**. The data files | ||
mentioned here are those explicitly managed by the search subsystem. Other data files exist (manually created, | ||
created by the statistics subsystem, etc.). Those will not be covered here but are mentioned in the job-specific | ||
documentation that uses them as input. | ||
|
||
Each search auxiliary file is copied to every region where a [search service](../src/NuGet.Services.SearchService/README.md) | ||
is deployed. For nuget.org, we run search in four regions, so there are four copies of each of these files. | ||
|
||
The search auxiliary files are: | ||
|
||
- [`downloads/downloads.v2.json`](#download-count-data) - total download count for every package version | ||
- [`owners/owners.v2.json` and change history](#package-ownership-data) - owners for every package ID | ||
- [`verified-packages/verified-packages.v1.json`](#verified-packages-data) - package IDs that are verified | ||
- [`popularity-transfers/popularity-transfers.v1.json`](#popularity-transfer-data) - popularity transfers between package IDs | ||
|
||
## Download count data | ||
|
||
The `downloads/downloads.v2.json` file has the total download count for all package versions. The total download count | ||
for a package ID as a whole can be calculated simply by adding all version download counts. | ||
|
||
The downloads data file looks like this: | ||
|
||
```json | ||
{ | ||
"Newtonsoft.Json": { | ||
"8.0.3": 10508321, | ||
"9.0.1": 55801938 | ||
}, | ||
"NuGet.Versioning": { | ||
"5.6.0-preview.3.6558": 988, | ||
"5.6.0": 10224 | ||
} | ||
} | ||
``` | ||
|
||
The package ID and version keys are not guaranteed to have the original (author-intended) casing and should be treated | ||
in a case insensitive manner. The version keys will always be normalized via [standard `NuGetVersion` normalization rules](https://docs.microsoft.com/en-us/nuget/concepts/package-versioning#normalized-version-numbers) | ||
(e.g. no build metadata will appear, no leading zeroes, etc.). | ||
|
||
If a package ID or version does not exist in the data file, this only indicates that there is no download count data and | ||
does not imply that the package ID or version does not exist on the package source. Conversely, package IDs or | ||
versions that no longer exist (perhaps due to deletion) may still appear in the data file. | ||
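A consumer of this file might read it along the following lines. This is a sketch only (the actual reader is the `DownloadDataClient` class mentioned below); the function names are hypothetical, and the caller is assumed to pass an already normalized version string. Keys are lowercased on load so lookups are case insensitive, and a missing entry is reported as `None` ("no data") rather than as a download count of zero.

```python
import json

# Sample content taken from the example above.
SAMPLE = """
{
  "Newtonsoft.Json": {"8.0.3": 10508321, "9.0.1": 55801938},
  "NuGet.Versioning": {"5.6.0-preview.3.6558": 988, "5.6.0": 10224}
}
"""


def load_downloads(text: str) -> dict:
    """Parse the file and lowercase all keys so lookups are case insensitive."""
    downloads = {}
    for package_id, versions in json.loads(text).items():
        by_version = downloads.setdefault(package_id.lower(), {})
        for version, count in versions.items():
            by_version[version.lower()] = count
    return downloads


def version_downloads(downloads: dict, package_id: str, version: str):
    """Return the count for one version, or None when the file has no data for it."""
    return downloads.get(package_id.lower(), {}).get(version.lower())


def total_downloads(downloads: dict, package_id: str) -> int:
    """Total for a package ID is the sum of all of its version counts."""
    return sum(downloads.get(package_id.lower(), {}).values())


downloads = load_downloads(SAMPLE)
print(total_downloads(downloads, "newtonsoft.json"))
print(version_downloads(downloads, "NUGET.VERSIONING", "5.6.0"))
print(version_downloads(downloads, "NuGet.Versioning", "1.0.0"))
```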
|
||
The order of the IDs and versions in the file is undefined. | ||
|
||
This file has a "v2" in the file name because it is the second version of this data. The "v1" format is still produced | ||
by the statistics subsystem and has a less friendly data format. | ||
|
||
The class for reading and writing this file to Blob Storage is [`DownloadDataClient`](../src/NuGet.Services.AzureSearch/AuxiliaryFiles/DownloadDataClient.cs). | ||
|
||
## Package ownership data | ||
|
||
The `owners/owners.v2.json` file contains the owner information about all package IDs. Each time this file is updated, | ||
the set of package IDs that changed is written to a "change history" file with a path pattern like | ||
`owners/changes/TIMESTAMP.json`. | ||
|
||
The class for reading and writing these files to Blob Storage is [`OwnerDataClient`](../src/NuGet.Services.AzureSearch/AuxiliaryFiles/OwnerDataClient.cs). | ||
|
||
### `owners/owners.v2.json` | ||
|
||
The owners data file looks like this: | ||
|
||
```json | ||
{ | ||
"Newtonsoft.Json": [ | ||
"dotnetfoundation", | ||
"jamesnk", | ||
"newtonsoft" | ||
], | ||
"NuGet.Versioning": [ | ||
"Microsoft", | ||
"nuget" | ||
] | ||
} | ||
``` | ||
|
||
The package ID key is not guaranteed to have the original (author-intended) casing and should be treated | ||
in a case insensitive manner. The owner values will have the same casing that is shown on NuGetGallery but should be | ||
treated in a case insensitive manner. | ||
|
||
If a package ID does not exist in the data file, this indicates that the package ID has no owners (a possible but | ||
relatively rare scenario for NuGetGallery). It is possible for a package ID with no versions to appear in this file. | ||
|
||
The IDs and owner usernames in the file are sorted in case-insensitive ascending lexicographical order. | ||
|
||
This file has a "v2" in the file name because it is the second version of this data. The "v1" format, which had a less | ||
friendly data format, was deprecated when nuget.org moved from a Lucene-based search service to Azure Search. | ||
|
||
### Change history | ||
|
||
The change history files do not contain owner usernames for GDPR reasons but mention all of the package IDs that had | ||
ownership changes since the last time that the `owners.v2.json` file was generated. If a package ID is not mentioned in | ||
a file, that means that there were no ownership changes in the time window. An ownership change is defined as one or | ||
more owners being added to or removed from the set of owners for that package ID. | ||
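The definition of an ownership change can be sketched as a comparison of two owner snapshots. This is an illustrative Python sketch, not the code that produces the change history files: the function name and the "new" snapshot are hypothetical, and IDs and usernames are lowercased as a stand-in for case-insensitive comparison. The result is sorted only to make the output deterministic.

```python
def changed_package_ids(old_owners: dict, new_owners: dict) -> list:
    """Return package IDs whose owner set differs between two snapshots."""

    def normalize(owners_by_id):
        # Compare IDs and usernames case insensitively by lowercasing them.
        return {
            package_id.lower(): {owner.lower() for owner in owners}
            for package_id, owners in owners_by_id.items()
        }

    old, new = normalize(old_owners), normalize(new_owners)
    return sorted(
        package_id
        for package_id in old.keys() | new.keys()
        if old.get(package_id, set()) != new.get(package_id, set())
    )


# The old snapshot mirrors the owners example; the new snapshot is made up for illustration.
old_snapshot = {
    "Newtonsoft.Json": ["dotnetfoundation", "jamesnk", "newtonsoft"],
    "NuGet.Versioning": ["Microsoft", "nuget"],
}
new_snapshot = {
    "Newtonsoft.Json": ["dotnetfoundation", "newtonsoft"],
    "nuget.versioning": ["microsoft", "NuGet"],
}

changes = changed_package_ids(old_snapshot, new_snapshot)
print(changes)
```

Only the ID with an added or removed owner is reported. A difference in casing alone does not count as a change.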
|
||
Each change history data file has a file name with timestamp format `yyyy-MM-dd-HH-mm-ss-FFFFFFF` (UTC) and a file | ||
extension of `.json`. | ||
|
||
The files look like this: | ||
|
||
```json | ||
[ | ||
"Newtonsoft.Json", | ||
"NuGet.Versioning" | ||
] | ||
``` | ||
|
||
By processing the files in order of their timestamp file name, a rough log of ownership changes can be produced. These | ||
files are currently not read by any job and are produced for future investigative purposes. | ||
|
||
The package IDs are not guaranteed to have the original (author-intended) casing and should be treated | ||
in a case insensitive manner. | ||
|
||
The order of the package IDs in the file is undefined. | ||
|
||
## Verified packages data | ||
|
||
The `verified-packages/verified-packages.v1.json` data file contains all package IDs that are considered verified by the [prefix reservation feature](https://docs.microsoft.com/en-us/nuget/nuget-org/id-prefix-reservation). This file essentially determines which packages show the verified checkmark icon in the search UIs. | ||
|
||
The data file looks like this: | ||
|
||
```json | ||
[ | ||
"Newtonsoft.Json", | ||
"NuGet.Versioning" | ||
] | ||
``` | ||
|
||
If a package ID is in the file, then it is verified. The package ID is not guaranteed to have the original | ||
(author-intended) casing and should be treated in a case insensitive manner. | ||
|
||
The order of the package IDs is undefined. | ||
|
||
The class for reading and writing this file to Blob Storage is [`VerifiedPackagesDataClient`](../src/NuGet.Services.AzureSearch/AuxiliaryFiles/VerifiedPackagesDataClient.cs). | ||
|
||
## Popularity transfer data | ||
|
||
The `popularity-transfers/popularity-transfers.v1.json` data file has a mapping of all package IDs that have | ||
transferred their popularity to one or more other packages. | ||
|
||
The data file looks like this: | ||
|
||
```json | ||
{ | ||
"OldPackageA": [ | ||
"NewPackage1", | ||
"NewPackage2" | ||
], | ||
"OldPackageB": [ | ||
"NewPackage3" | ||
] | ||
} | ||
``` | ||
|
||
For each key-value pair, the package ID key has its popularity transferred to the package ID values. The implementation | ||
of the popularity transfer is out of scope for the data file format. Package IDs that do not appear as a key in this | ||
file do not have their popularity transferred. | ||
|
||
The package ID keys and values are not guaranteed to have the original (author-intended) casing and should be treated | ||
in a case insensitive manner. | ||
|
||
The package ID keys and values are sorted in case-insensitive ascending lexicographical order. | ||
|
||
The class for reading and writing this file to Blob Storage is [`PopularityTransferDataClient`](../src/NuGet.Services.AzureSearch/AuxiliaryFiles/PopularityTransferDataClient.cs). |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
TODO: https://github.com/NuGet/NuGetGallery/issues/8005 |