Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds read/write support for v1.1 system symbols #847

Merged
merged 14 commits into from
Nov 6, 2024
Merged

Conversation

zslayton
Copy link
Contributor

@zslayton zslayton commented Nov 4, 2024

  • Adds static v1.1 system symbol table definition
  • Adds SystemSymbol_1_1 variant of RawSymbolRef
  • Renames old RawSymbolRef::as_raw_symbol_token_ref to shorter as_raw_symbol_ref for consistency.
  • Adds read/write support for 0xEE opcode, system symbol values
  • Adds read/write support for system symbols appearing in FlexSym
  • The application-level writer now uses system symbols for its encoding directives

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Copy link
Contributor Author

@zslayton zslayton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ PR tour 🧭

// Before a reader using `AnyEncoding` begins reading, it expects text Ion v1.0.
// At the outset of the stream, it inspects the first bytes to see if the stream is binary or text.
// If it encounters a version marker, the expected version will change.
const INITIAL_ENCODING_EXPECTED: IonEncoding = IonEncoding::Text_1_0;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ When a reader is initialized for a given Decoder, it needs to know what initial context to set up. For the four encoding-specific decoders, this is trivial; AnyEncoding is the only one that needs an explicit hint.

Comment on lines -1864 to +1880
reader.next(context_ref).expect_err("Ion 1.1 IVM should return an error.");
return Ok(())
reader
.next(context_ref)
.expect_err("Ion 1.1 IVM should return an error.");
return Ok(());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ rustfmt, no logic change

Comment on lines -1215 to -1220
#[case::single_address(&[0xE4, 0x07], 1, 1)]
#[case::two_addresses(&[0xE5, 0x07, 0x09], 1, 2)]
#[case::three_addresses(&[0xE6, 0x07, 0x07, 0x09, 0x0B], 2, 3)]
#[case::single_flex_sym(&[0xE7, 0x07], 1, 1)]
#[case::two_flex_syms(&[0xE8, 0x07, 0x09], 1, 2)]
#[case::three_flex_syms(&[0xE9, 0x07, 0x07, 0x09, 0x0B], 2, 3)]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ Previously, this test was only confirming that the annotations sequence parts were the expected length after parsing. I augmented the test to include confirming that the annotations being read were the correct ones.

Comment on lines -354 to -361
assert_eq!(
reader
.next(context)?
.expect_value()?
.read()?
.expect_symbol()?,
"".into()
);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ I rewrote this unit test to loop a list of expected symbols and compare them to the input. This made the test much more concise.

Comment on lines +840 to +842
// In order to minimize the changes needed to introduce a second address space
// for symbols in Ion 1.1, system symbol IDs are resolved eagerly and returned
// as `Text`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's pretty much how I tried to do it when I added read support to ion-java.

Comment on lines 57 to 59
RawSymbolRef::SystemSymbol_1_1(_system_symbol) => {
unreachable!("Ion 1.1 symbol in Ion 1.0 writer")
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to make this provably unreachable with the use of Never or something like that, or would that add too much complexity in the way of generic parameters that are polluting signatures everywhere?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the best we could do without pervasive generics is have a separate enum with only SymbolId/Text and 'downgrade' to that whenever we enter a 1.0-only code path.

src/lazy/encoder/binary/v1_0/value_writer.rs Outdated Show resolved Hide resolved
src/lazy/encoder/binary/v1_1/flex_sym.rs Outdated Show resolved Hide resolved
src/lazy/encoder/binary/v1_1/flex_sym.rs Outdated Show resolved Hide resolved
}
}

pub(crate) trait SystemSymbol: Sized {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is this trait used? I recall seeing SystemSymbol_1_1 used somewhere, but not this trait.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initially I had a SystemSymbol struct that held an (IonVersion, SymbolAddress) tuple. SystemSymbolTable instances would return SystemSymbols, and RawSymbolRef had a SystemSymbol variant that wasn't Ion-1.1-specific.

However, this led to a lot of situations in Ion 1.0 code where it was unclear whether an Ion 1.0 system symbol would be represented as a SymbolId or a SystemSymbol. To eliminate that ambiguity, I split SystemSymbol into SystemSymbol_1_1 and SystemSymbol_1_0 and narrowed RawSymbolRef::SystemSymbol to RawSymbolRef::SystemSymbol_1_1. Then I turned the SystemSymbol type into a trait.

While I think this was a reasonable way to preserve the abstraction, we aren't using the abstraction much in practice right now. In the spirit of YAGNI, I can strip it out for the moment and we can reintroduce it later if/when we have a more concrete use case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not opposed to having it, necessarily, but if we're adding something that isn't used yet, we should probably document that in the code.

src/raw_symbol_ref.rs Outdated Show resolved Hide resolved
src/raw_symbol_ref.rs Outdated Show resolved Hide resolved
src/text/text_formatter.rs Outdated Show resolved Hide resolved
@@ -98,6 +99,12 @@ impl Symbol {
}
}

pub fn system(text: &'static str) -> Symbol {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually something that maybe useful for non-system symbols, right? For example, if I'm implementing WriteAsIon for Ion Schema Language models, there's a whole lot of symbol text that is statically known (i.e. all of the constraint names and other ISL keywords).

(Unfortunately, we can't update impl From<&str> for Symbol in such a way that it specializes for static and non-static strings, so use of a constructor function is required to specifically construct a symbol with static text.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I originally named this method static but... that's a keyword :(

I renamed it static_text--howzat?

Comment on lines +31 to +34
/// A static, read-only map of text to Ion v1.0 system symbol addresses.
/// Because the set of string keys is known at compile time, this map is able to use a
/// perfect hashing function (PHF) to optimize lookup operations for those keys.
pub(crate) static SYSTEM_SYMBOL_TEXT_TO_ID: phf::Map<&str, usize> = phf_map! {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

src/lazy/encoder/writer.rs Show resolved Hide resolved
src/symbol_table.rs Outdated Show resolved Hide resolved
src/raw_symbol_ref.rs Outdated Show resolved Hide resolved
Comment on lines +186 to +205
/// Creates a new system symbol if the provided address is non-zero and in range for the
/// corresponding system symbol table.
fn new(symbol_address: SymbolAddress) -> Option<Self> {
match symbol_address {
0 => None,
address if address >= Self::system_symbol_table().len() => None,
address => Some(Self::new_unchecked(address)),
}
}

/// Constructs a system symbol from the provided address, returning `Ok(symbol)`.
/// If the provided address is zero or out of bounds, returns an `Err`.
#[inline]
fn try_new(symbol_address: SymbolAddress) -> IonResult<Self> {
Self::new(symbol_address).ok_or_else(|| {
ice!(IonError::decoding_error(format!(
"system symbol address {symbol_address} is out of bounds"
)))
})
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From https://doc.rust-lang.org/nomicon/constructors.html

In concrete contexts, a type will provide a static new method for any kind of "default" constructor. This has no relation to new in other languages and has no special meaning. It's just a naming convention.

I was curious about the try_new here, wondered what I was missing. Apparently it's just a convention amongst libraries? I see the pattern in several others including in the standard library.

Also seen here: https://dev.to/cthutu/rust-5-naming-conventions-3cjf

Nothing to do here, just commenting.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apparently it's just a convention amongst libraries?

Yep. Methods that return a Result<T> often have a name starting with try_ to distinguish themselves from sibling methods that return T or Option<T>. For example, into()/try_into() and from()/try_from().

Pulling a snippet from your link for others' convenience:
image

src/text/text_formatter.rs Outdated Show resolved Hide resolved
@zslayton zslayton merged commit 89d3206 into main Nov 6, 2024
32 checks passed
@zslayton zslayton deleted the system-symbols branch November 6, 2024 19:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants