Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Export key registry from icu_datagen #1512

Closed
sffc opened this issue Jan 18, 2022 · 13 comments · Fixed by #1670 or #1714
Closed

Export key registry from icu_datagen #1512

sffc opened this issue Jan 18, 2022 · 13 comments · Fixed by #1670 or #1714
Assignees
Labels
C-data-infra Component: provider, datagen, fallback, adapters S-small Size: One afternoon (small bug fix or enhancement) T-enhancement Type: Nice-to-have but not required

Comments

@sffc
Copy link
Member

sffc commented Jan 18, 2022

We should consider making icu_datagen invokable as a library, instead of only as a command-line tool. icu_datagen is unique in the sense that it is the only place in the code base that has access to all struct definitions for all data keys.

Benefits of doing this:

  1. Enables tooling that uses a Postcard or JSON directory as a data source for conversion to other data serialization formats like CrabBake (see Implementing IterableProvider for FsDataProvider and BlobDataProvider #1506)
  2. Enables detection of hash conflicts globally for all ICU4X data keys

FYI, the probability of a hash collision reaches 50% at about 75000 keys, assuming a perfect hash function. So, fairly distant, but definitely high enough for us to worry about it.

CC @robertbastian @Manishearth

@sffc sffc added T-enhancement Type: Nice-to-have but not required C-data-infra Component: provider, datagen, fallback, adapters S-small Size: One afternoon (small bug fix or enhancement) labels Jan 18, 2022
@sffc sffc changed the title Export library functions from icu_datagen Export key registry from icu_datagen Feb 4, 2022
@sffc
Copy link
Member Author

sffc commented Feb 4, 2022

This is blocked by #1560 and #1561.

@sffc sffc added the blocked A dependency must be resolved before this is actionable label Feb 4, 2022
@sffc sffc self-assigned this Feb 4, 2022
@sffc sffc added this to the ICU4X 0.6 milestone Feb 4, 2022
@Manishearth
Copy link
Member

Manishearth commented Mar 3, 2022

Here's the design I'm thinking of.

The design hinges on KeyRegistry types that are able to map keys to concrete types. The implementation of a registry will basically just be a giant match statement (or a bunch of if/elses), and we provide a macro that makes it easy to define them. We also provide an implementation that makes it easy to combine registries, so that CldrProvider and the uprops stuff can both have their own registries that get tree'd up.

trait KeyRegistry {
    // Calls `ConcreteDataCallback::callback::<M>();` for the appropriate marker type M
    fn apply_for_concrete_data<R>(&self, key: ResourceKey, callback: &mut impl ConcreteDataCallback) -> Result<R, DataError>;
}

// This is a trait since Rust doesn't support closures with generics, but think of it as a closure.
trait ConcreteDataCallback {
   fn callback<M: DataMarker, R>(&mut self) -> Result<R, DataError> where 
     for<'a> &'a <M::Yokeable as Yokeable<'a>>::Output: serde::Serialize,
     for<'de> YokeTraitHack<<M::Yokeable as Yokeable<'de>>::Output>: Deserialize<'de>,
}

So, for example, to handle #1661, one would probably use something like

struct MyCallback(DataPayload<SerializeMarker>);

impl ConcreteDataCallback for MyCallback {
   fn callback<M: ...>() {
     let mut serializer = postcard::Serializer {output: vec![]};
     let serialized = self.0.serialize(&serializer);
     measure_allocations(|| {
           let d: DataMarker<M> = postcard::deserialize(&serializer.output)?;
     })
   }
}

fn check_resource(provider: &impl IterableDynProvider<SerializeMarker>, key: ResourceKey, req: &DataRequest) {
    let erased: DataPayload<SerializeMarker> = provider.load_payload(key, req)?
    REGISTRY.apply_for_concrete_data(MyCallback(erased));
}

The support types for this would live in provider/core behind the serde feature (maybe behind an additional feature?), and the registries would live in provider/cldr and provider/uprops, with provider/datagen using a "combined" registry.

@Manishearth
Copy link
Member

Manishearth commented Mar 3, 2022

It would be nicer to be able to use ResourceMarker here, but it's by no means necessary, the macro that constructs the registry will just need to be a little bit more involved. This does decouple this work from #1560

@Manishearth
Copy link
Member

Manishearth commented Mar 3, 2022

Also, to give an idea of what the macro would look like, it would take something like

create_registry!(CldrRegistry, 
    JapaneseErasV1Marker::KEY => JapaneseErasV1Marker,
    DatePatternsV1Marker::KEY => DatePatternsV1Marker,
   ....
);

and turn it into

struct CldrRegistry;

impl KeyRegistry for CldrRegistry {
    fn apply_for_concrete_data<R>(&self, key: ResourceKey, callback: &mut impl ConcreteDataCallback) -> Result<R, DataError> {
         match k {
              JapaneseErasV1Marker::KEY => callback.callback::<JapaneseErasV1Marker>(),
              DatePatternsV1Marker::KEY => callback.callback::<DatePatternsV1Marker>(),
              ....
              _ => Err(MissingResourceKey)
         }
    }

}

And we might have a convenient KeyRegistry impl for tuples that chains multiple registries together. If we could use ResourceMarker the macro would need less input.

@Manishearth
Copy link
Member

I have a draft here #1664

@sffc
Copy link
Member Author

sffc commented Mar 4, 2022

There are two traits that are interesting as part of the data exporter:

  • serde::Serialize
  • crabbake::Bakeable

The global registry should enable us to "traitify" an AnyProvider or BufferProvider into either of those two exportable traits.

So, essentially, my mental model of the global registry is something more along the lines of

trait Dynify<M: DataMarker> {
    fn dynify_buffer(key: ResourceKey, payload: BufferPayload) -> DataPayload<M>;
    fn dynify_any(key: ResourceKey, payload: AnyPayload) -> DataPayload<M>;
}

And then the source providers like CLDR and UProps would have

impl Dynify<SerializeMarker> for CldrJsonDataProvider
impl Dynify<BakeableMarker> for CldrJsonDataProvider
// ...

except that these impls would be generated as part of the pre-existing impl_dyn_provider!() macro, which is similar to your create_registry!() macro.

The way to "chain" Dynifiers together is to use a ForkByKeyProvider.

In fact, maybe we should even consider putting these two new methods directly into the DynProvider trait, assuming that is compatible with call sites.

To support the #1661 use case, we could have another marker trait like EmptyMarker so that we don't need to allocate a trait object box. You would do your test as

measure_allocations(|| {
  let _: DataPayload<EmptyMarker> = everything_provider.dynify_buffer(key, buffer)?;
}); 

All that said, we should explore the idea of a callback function with a generic parameter. It's a different way of achieving this goal.

@Manishearth
Copy link
Member

Manishearth commented Mar 4, 2022

except that these impls would be generated as part of the pre-existing impl_dyn_provider!() macro, which is similar to your create_registry!() macro.

I don't think this part is possible, fwiw, Rust doesn't let you piecewise build up a function based on possible generic intermediates

You have to have a separate registry that has a list of keys and types; it cannot be done by individual impl_dyn_provider macros. But this can be made a part of your model by implementing Dynify on separate registry types instead.

(It's easier to just have separate registries so those registries are reusable)

To support the #1661 use case, we could have another marker trait like EmptyMarker so that we don't need to allocate a trait object box. You would do your test as

The reason I was a bit wary of this was that this means that non-test code needs to know about measure_allocations(), which means it may need access to the custom allocator, which can get messy. I'd prefer not having to do that. We could still provide a closure hook though.

@Manishearth
Copy link
Member

Here's an immensely illuminating drawing from a meeting between shane and I. I'm going to prototype this next.

image

@Manishearth
Copy link
Member

Got my attempt at #1669

@Manishearth
Copy link
Member

PR: #1670

@Manishearth Manishearth removed the blocked A dependency must be resolved before this is actionable label Mar 9, 2022
@robertbastian
Copy link
Member

I was expecting something like @Manishearth's design. The registry itself should be kept simple and just be a way to iterate over keys with the correct marker types.

@robertbastian
Copy link
Member

I assume the closing was not on purpose

@robertbastian robertbastian reopened this Mar 10, 2022
@Manishearth
Copy link
Member

Ah, this built the registry but didn't export it, I see.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-data-infra Component: provider, datagen, fallback, adapters S-small Size: One afternoon (small bug fix or enhancement) T-enhancement Type: Nice-to-have but not required
Projects
None yet
3 participants