Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DRIVERS-2789 Convert BSON Corpus Spec to Markdown #1499

Merged
merged 1 commit into from
Jan 23, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
367 changes: 367 additions & 0 deletions source/bson-corpus/bson-corpus.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,367 @@
# BSON Corpus

- Status: Accepted
- Minimum Server Version: N/A

## Abstract

The official BSON specification does not include test data, so this pseudo-specification describes tests for BSON
encoding and decoding. It also includes tests for MongoDB's "Extended JSON" specification (hereafter abbreviated as
`extjson`).

## Meta

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and
"OPTIONAL" in this document are to be interpreted as described in [RFC 2119](https://www.ietf.org/rfc/rfc2119.txt).

## Motivation for Change

To ensure correct operation, we want drivers to implement identical tests for important features. BSON (and `extjson`)
are critical for correct operation and data exchange, but historically had no common test corpus. This
pseudo-specification provides such tests.

### Goals

- Provide machine-readable test data files for BSON and `extjson` encoding and decoding.
- Cover all current and historical BSON types.
- Define test data patterns for three cases:
- conversion/roundtrip,
- decode errors, and
- parse errors.

### Non-Goals

- Replace or extend the official BSON spec at <http://bsonspec.org>.
- Provide a formal specification for `extjson`.

## Specification

The specification for BSON lives at <http://bsonspec.org>. The `extjson` format specification is
[here](../extended-json.rst).

## Test Plan

This test plan describes a general approach for BSON testing. Future BSON specifications (such as for new types like
Decimal128) may specialize or alter the approach described below.

### Description of the BSON Corpus

This BSON test data corpus consists of a JSON file for each BSON type, plus a `top.json` file for testing the overall,
enclosing document and a `multi-type.json` file for testing a document with all BSON types. There is also a
`multi-type-deprecated.json` that includes deprecated keys.

#### Top level keys

- `description`: human-readable description of what is in the file
- `bson_type`: hex string of the first byte of a BSON element (e.g. "0x01" for type "double"); this will be the
synthetic value "0x00" for "whole document" tests like `top.json`.
- `test_key`: (optional) name of a field in a single-BSON-type `valid` test case that contains the data type being
tested.
- `valid` (optional): an array of validity test cases (see below).
- `decodeErrors` (optional): an array of decode error cases (see below).
- `parseErrors` (optional): an array of type-specific parse error case (see below).
- `deprecated` (optional): this field will be present (and true) if the BSON type has been deprecated (i.e. Symbol,
Undefined and DBPointer)

#### Validity test case keys

Validity test cases include 'canonical' forms of BSON and Extended JSON that are deemed equivalent and may provide
additional cases or metadata for additional assertions. For each case, keys include:

- `description`: human-readable test case label.
- `canonical_bson`: an (uppercase) big-endian hex representation of a BSON byte string. Be sure to mangle the case as
appropriate in any roundtrip tests.
- `canonical_extjson`: a string containing a Canonical Extended JSON document. Because this is itself embedded as a
*string* inside a JSON document, characters like quote and backslash are escaped.
- `relaxed_extjson`: (optional) a string containing a Relaxed Extended JSON document. Because this is itself embedded as
a *string* inside a JSON document, characters like quote and backslash are escaped.
- `degenerate_bson`: (optional) an (uppercase) big-endian hex representation of a BSON byte string that is technically
parseable, but not in compliance with the BSON spec. Be sure to mangle the case as appropriate in any roundtrip tests.
- `degenerate_extjson`: (optional) a string containing an invalid form of Canonical Extended JSON that is still
parseable according to type-specific rules. (For example, "1e100" instead of "1E+100".)
- `converted_bson`: (optional) an (uppercase) big-endian hex representation of a BSON byte string. It may be present for
deprecated types. It represents a possible conversion of the deprecated type to a non-deprecated type, e.g. symbol to
string.
- `converted_extjson`: (optional) a string containing a Canonical Extended JSON document. Because this is itself
embedded as a *string* inside a JSON document, characters like quote and backslash are escaped. It may be present for
deprecated types and is the Canonical Extended JSON representation of `converted_bson`.
- `lossy` (optional) -- boolean; present (and true) iff `canonical_bson` can't be represented exactly with extended JSON
(e.g. NaN with a payload).

#### Decode error case keys

Decode error cases provide an invalid BSON document or field that should result in an error. For each case, keys
include:

- `description`: human-readable test case label.
- `bson`: an (uppercase) big-endian hex representation of an invalid BSON string that should fail to decode correctly.

#### Parse error case keys

Parse error cases are type-specific and represent some input that can not be encoded to the `bson_type` under test. For
each case, keys include:

- `description`: human-readable test case label.
- `string`: a text or numeric representation of an input that can't be parsed to a valid value of the given type.

### Extended JSON encoding, escaping and ordering

Because the `canonical_extjson` and other Extended JSON fields are embedded in a JSON document, all their JSON
metacharacters are escaped. Control characters and non-ASCII codepoints are represented with `\uXXXX`. Note that this
means that the corpus JSON will appear to have double-escaped characters `\\uXXXX`. This is by design to ensure that the
Extended JSON fields remain printable ASCII without embedded null characters to ensure maximum portability to different
language JSON or extended JSON decoders.

There are legal differences in JSON representation that may complicate testing for particular codecs. The JSON in the
corpus may not resemble the JSON generated by a codec, even though they represent the same data. Some known differences
include:

- JSON only requires certain characters to be escaped but allows any character to be escaped.
- The JSON format is *unordered* and whitespace (outside of strings) is not significant.

Implementations using these tests MUST normalize JSON comparisons however necessary for effective comparison.

### Language-specific differences

Some programming languages may not be able to represent or transmit all types accurately. In such cases, implementations
SHOULD ignore (or modify) any tests which are not supported on that platform.

### Testing validity

To test validity of a case in the `valid` array, we consider up to five possible representations:

- Canonical BSON (denoted herein as "cB") -- fully valid, spec-compliant BSON
- Degenerate BSON (denoted herein as "dB") -- invalid but still parseable BSON (bad array keys, regex options out of
order)
- Canonical Extended JSON (denoted herein as "cEJ") -- A string format based on the JSON standard that emphasizes type
preservation at the expense of readability and interoperability.
- Degenerate Extended JSON (denoted herin as "dEJ") -- An invalid form of Canonical Extended JSON that is still
parseable. (For example, "1e100" instead of "1E+100".)
- Relaxed Extended JSON (denoted herein as "rEJ") -- A string format based on the JSON standard that emphasizes
readability and interoperability at the expense of type preservation.

Not all input types will exist for a given test case.

There are two forms of BSON/Extended JSON codecs: ones that have a language-native "intermediate" representation and
ones that do not.

For a codec *without* an intermediate representation (i.e. one that translates directly from BSON to JSON or back), the
following assertions MUST hold (function names are for clarity of illustration only):

- for cB input:
- bson_to_canonical_extended_json(cB) = cEJ
- bson_to_relaxed_extended_json(cB) = rEJ (if rEJ exists)
- for cEJ input:
- json_to_bson(cEJ) = cB (unless lossy)
- for dB input (if it exists):
- bson_to_canonical_extended_json(dB) = cEJ
- bson_to_relaxed_extended_json(dB) = rEJ (if rEJ exists)
- for dEJ input (if it exists):
- json_to_bson(dEJ) = cB (unless lossy)
- for rEJ input (if it exists):
- bson_to_relaxed_extended_json( json_to_bson(rEJ) ) = rEJ

For a codec that has a language-native representation, we want to test both conversion and round-tripping. For these
codecs, the following assertions MUST hold (function names are for clarity of illustration only):

- for cB input:
- native_to_bson( bson_to_native(cB) ) = cB
- native_to_canonical_extended_json( bson_to_native(cB) ) = cEJ
- native_to_relaxed_extended_json( bson_to_native(cB) ) = rEJ (if rEJ exists)
- for cEJ input:
- native_to_canonical_extended_json( json_to_native(cEJ) ) = cEJ
- native_to_bson( json_to_native(cEJ) ) = cB (unless lossy)
- for dB input (if it exists):
- native_to_bson( bson_to_native(dB) ) = cB
- for dEJ input (if it exists):
- native_to_canonical_extended_json( json_to_native(dEJ) ) = cEJ
- native_to_bson( json_to_native(dEJ) ) = cB (unless lossy)
- for rEJ input (if it exists):
- native_to_relaxed_extended_json( json_to_native(rEJ) ) = rEJ

Implementations MAY test assertions in an implementation-specific manner.

### Testing decode errors

The `decodeErrors` cases represent BSON documents that are sufficiently incorrect that they can't be parsed even with
liberal interpretation of the BSON schema (e.g. reading arrays with invalid keys is possible, even though technically
invalid, so they are *not* `decodeErrors`).

Drivers SHOULD test that each case results in a decoding error. Implementations MAY test assertions in an
implementation-specific manner.

### Testing parsing errors

The interpretation of `parseErrors` is type-specific. The structure of test cases within `parseErrors` is described in
[Parse error case keys](#parse-error-case-keys).

Drivers SHOULD test that each case results in a parsing error (e.g. parsing Extended JSON, constructing a language
type). Implementations MAY test assertions in an implementation-specific manner.

#### Top-level Document (type 0x00)

For type "0x00" (i.e. top-level documents), the `string` field contains input for an Extended JSON parser. Drivers MUST
parse the Extended JSON input using an Extended JSON parser and verify that doing so yields an error. Drivers that parse
Extended JSON into language types instead of directly to BSON MAY need to additionally convert the resulting language
type(s) to BSON to expect an error.

Drivers SHOULD also parse the Extended JSON input using a regular JSON parser (not an Extended JSON one) and verify the
input is parsed successfully. This serves to verify that the `parseErrors` test cases are testing Extended JSON-specific
error conditions and that they do not have, for example, unintended syntax errors.

Note: due to the generic nature of these tests, they may also be used to test Extended JSON parsing errors for various
BSON types appearing within a document.

#### Binary (type 0x05)

For type "0x05" (i.e. binary), the rules for handling `parseErrors` are the same as those for
[Top-level Document (type 0x00)](#top-level-document-type-0x00).

#### Decimal128 (type 0x13)

For type "0x13" (i.e. Decimal128), the `string` field contains input for a Decimal128 parser that converts string input
to a binary Decimal128 value (e.g. Decimal128 constructor). Drivers MUST assert that these strings cannot be
successfully converted to a binary Decimal128 value and that parsing the string produces an error.

### Deprecated types

The corpus files for deprecated types are provided for informational purposes. Implementations MAY ignore or modify them
to match legacy treatment of deprecated types. The `converted_bson` and `converted_extjson` fields MAY be used to test
conversion to a standard type or MAY be ignored.

## Prose Tests

The following tests have not yet been automated, but MUST still be tested.

### 1. Prohibit null bytes in null-terminated strings when encoding BSON

The BSON spec uses null-terminated strings to represent document field names and regex components (i.e. pattern and
flags/options). Drivers MUST assert that null bytes are prohibited in the following contexts when encoding BSON (i.e.
creating raw BSON bytes or constructing BSON-specific type classes):

- Field name within a root document
- Field name within a sub-document
- Pattern for a regular expression
- Flags/options for a regular expression

Depending on how drivers implement BSON encoding, they MAY expect an error when constructing a type class (e.g. BSON
Document or Regex class) or when encoding a language representation to BSON (e.g. converting a dictionary, which might
allow null bytes in its keys, to raw BSON bytes).

## Implementation Notes

### A tool for visualizing BSON

The test directory includes a Perl script `bsonview`, which will decompose and highlight elements of a BSON document. It
may be used like this:

```bash
echo "0900000010610005000000" | perl bsonview -x
```

### Notes for certain types

#### Array

Arrays can have degenerate BSON if the array indexes are not set as "0", "1", etc.

#### Boolean

The only valid values are 0 and 1. Other non-zero numbers MUST be interpreted as errors rather than "true" values.

#### Binary

The Base64 encoded text in the extended JSON representation MUST be padded.

#### Code

There are multiple ways to encode Unicode characters as a JSON document. Individual implementers may need to normalize
provided and generated extended JSON before comparison.

#### Decimal

NaN with payload can't be represented in extended JSON, so such conversions are lossy.

#### Double

There is not yet a way to represent Inf, -Inf or NaN in extended JSON. Even if a `$numberDouble` is added, it is
unlikely to support special values with payloads, so such doubles would be lossy when converted to extended JSON.

String representation of doubles is fairly unportable so it's hard to provide a single string that all
platforms/languages will generate. Testers may need to normalize/modify the test cases.

#### String

There are multiple ways to encode Unicode characters as a JSON document. Individual implementers may need to normalize
provided and generated extended JSON before comparison.

#### DBPointer

This type is deprecated. The provided converted form (`converted_bson`) represents them as DBRef documents, but such
conversion is outside the scope of this spec.

#### Symbol

This type is deprecated. The provided converted form converts these to strings, but such conversion is outside the scope
of this spec.

#### Undefined

This type is deprecated. The provided converted form converts these to Null, but such conversion is outside the scope of
this spec.

## Reference Implementation

The Java, C# and Perl drivers.

## Design Rationale

### Use of extjson

Testing conversion requires an "input" and an "output". With a BSON string as both input and output, we can only test
that it roundtrips correctly --we can't test that the decoded value visible to the language is correct.

For example, a pathological encoder/decoder could invert Boolean true and false during decoding and encoding. The BSON
would roundtrip but the program would see the wrong values.

Therefore, we need a separate, semantic description of the contents of a BSON string in a machine readable format.
Fortunately, we already have extjson as a means of doing so. The extended JSON strings contained within the tests adhere
to the Extended JSON Specification.

### Repetition across cases

Some validity cases may result in duplicate assertions across cases, particularly if the `degenerate_bson` field is
different in different cases, but the `canonical_bson` field is the same. This is by design so that each case stands
alone and can be confirmed to be internally consistent via the assertions. This makes for easier and safer test case
development.

## Changelog

- 2024-01-22: Migrated from reStructuredText to Markdown.

- 2023-06-14: Add decimal128 Extended JSON parse tests for clamped zeros with\
very large exponents.

- 2022-10-05: Remove spec front matter and reformat changelog.

- 2021-09-09: Clarify error expectation rules for `parseErrors`.

- 2021-09-02: Add spec and prose tests for prohibiting null bytes in\
null-terminated strings within document field
names and regular expressions. Clarify type-specific rules for `parseErrors`.

- 2017-05-26: Revised to be consistent with Extended JSON spec 2.0: valid case\
fields have changed, as have the test
assertions.

- 2017-01-23: Added `multi-type.json` to test encoding and decoding all BSON\
types within the same document. Amended
all extended JSON strings to adhere to the Extended JSON Specification. Modified the "Use of extjson" section of this
specification to note that canonical extended JSON is now used.

- 2016-11-14: Removed "invalid flags" BSON Regexp case.

- 2016-10-25: Added a "non-alphabetized flags" case to the BSON Regexp corpus\
file; decoders must be able to read
non-alphabetized flags, but encoders must emit alphabetized flags. Added an "invalid flags" case to the BSON Regexp
corpus file.
Loading
Loading