Data ownership for zero-copy data structs #667

sffc · 2021-04-21T02:16:23Z

We use 's in data structs to allow them to borrow data via zero-copy deserialization. If data is statically allocated, we simply point to that pre-loaded buffer, as ICU4C does now. However, how do we handle ownership for dynamically loaded data, such as from a filesystem, RPC service, or network request?

Currently, we always return data structs that own their data ('s: 'static), due to #632. When we fix #632, we will need to have a story for how data providers that dynamically load data, like FsDataProvider, fit into the picture.

We also must think about how 'd, the lifetime of the data structs themselves, fits into the picture.

I see ~~two~~ three directions.

Option 1: FsDataProvider owns the data

Option 1 means FsDataProvider would lazy-load data into memory and then return pointers to it. 's equals the lifetime of the FsDataProvider.

This is challenging because a singleton FsDataProvider may have unbounded memory growth, since we cannot "delete" old data since we can't know if someone still has a reference to it. (See Option 3 for a solution based on reference counting.)

We could limit the scope of this problem by recommending or requiring that:

FsDataProvider can load data for only one locale or component at a time
You create a locally-scoped FsDataProvider instead of creating a global

The other problem with this option is that it basically turns FsDataProvider into an (unbounded) data cache. I was hoping to keep the data cache as a separate component.

Option 2: Data struct owns the data

Option 2 means FsDataProvider would only ever return data that is owned by the data struct, such that the data provider is not tied to the lifetime of the data structs it returns.

This is challenging because after #632 is fixed, we would need a way to programmatically convert a data struct to owned ('s: 'static). We would need to make a trait, say IntoOwned, that converts data structs into 'static, and implement that trait on all data structs and all types that are encapsulated by data structs (like ZeroVec). We would want to make sure that IntoOwned is as efficient as possible, since this would be on a critical path in data loading.

Option 2 could be paired with a cache to help with performance and avoiding repeated loads of data. However, doing so would preclude the possibility of returning 'd (borrowing the data struct), because then we fall into the same trap as Option 1.

Option 3: Reference count the data

With this solution, we remove the 's lifetime from the data provider and "replace" it with a reference-counted buffer. @Manishearth explains how this could work in a reply farther down in this thread.

With reference counted buffers, data caches can safely drop data without invalidating downstream users of that data, because the downstream users will continue to retain an Rc to the data. Once the downstream users are dropped, the reference count will reach 0 and the buffers will be deallocated.

There are two locations where we can add the Rcs:

Option 3.1: Add a variant to ZeroVec and friends which is a Rc<[u8]>
Option 3.2: Store a single Rc<[u8]> alongside the data struct and use self-references

Option 3.1 has several problems, not the least of which is that it forces the Rcs to "infect" all fields of the data struct, which likely comes with a performance impact. The only real disadvantage of Option 3.2 is that self-references require a bit of unsafe code in Rust that may be difficult to reason about. I prefer Option 3.2 and taking the time to verify its correctness.

With this solution, we would also drop 'd, because calling .clone() on a data struct will be a cheap operation, since it won't require allocating any memory.

Summary

Option 2 fits best with my mental model. However, we may want to support both Options 1 and 2, because they are useful in different cases. For example, a data provider that downloads a whole language pack may wish to adopt Option 1, whereas one that pulls from an RPC service may wish to use Option 2.

UPDATE: I now think that Option 3.2 is a workable solution that solves our known problems.

CC @Manishearth @nciric

The text was updated successfully, but these errors were encountered:

zbraniecki · 2021-04-21T02:30:02Z

My initial response is that neither of those options really work optimally unfortunately and I don't know if we can have an optional architecture, but I'd like to try.
It may also be that the answer differs between Rust clients and FFI clients.

For Rust clients, in my ideal world DataProvider owns data that is loaded. The problem of data growth is real and would be addressed either by FIFO cache, or some Rc/Arc.

The reason I would prefer that is that I imagine clients wanting to "just" load DateTimeFormat, or NumberFormat and use it. Such formatter constructor would then ask DataProvider for data and on the first request, the provider loads the data, and on the consecutive the data is already there.
In result I'd expect Provider to return borrowed Cow of the data to the constructor and each DateTimeFormat and NumberFormat instance carries that data around for as long as the provider is alive.

This is also important for runtime processing. I imagine that we may load unparsed patterns, and parse them on the first load, and the consecutive loads would get parsed patterns unless they got GCed out.

I can see a claim that when DTF or NF` go out of scope they decrease the rc count in the provider and if it goes to zero the provider may clean up data from oldest to newest as needed.

The story may not cater well to the FFI scenario where we may need to return owned data over the FFI and allow it to live in the instance.

So, seems like I'm also nudging in the direction of option1+2, but I'm a bit stronger on option 1 than you are.

sffc · 2021-04-21T02:43:55Z

The problem of data growth is real and would be addressed either by FIFO cache, or some Rc/Arc.

Option 1 is incompatible with a FIFO cache on the FsDataProvider layer, because the data provider cannot know whether a downstream formatter is retaining references to data. Rc/Arc might work; I just don't have a design for it yet. I think it would require that the DataProvider .load() function returns something with an Rc/Arc in it. Also, I feel like if you need Rc/Arc, you can probably come up with an architecture that doesn't require it.

each DateTimeFormat and NumberFormat instance carries that data around for as long as the provider is alive.

What do you propose to happen if the provider is destroyed but there are still formatters that have references to data it owns?

This is also important for runtime processing. I imagine that we may load unparsed patterns, and parse them on the first load, and the consecutive loads would get parsed patterns unless they got GCed out.

+1. This is an easier problem to solve, and it is compatible with both Option 1 and Option 2. I don't want to conflate that problem with the problem of raw data ownership (e.g., strings, ZeroVecs).

zbraniecki · 2021-04-21T02:45:59Z

What do you propose to happen if the provider is destroyed but there are still formatters that have references to data it owns?

When writing that sentence my thinking was to make it impossible in Rust for a DTF instance to outlive the provider. It would just require that the lifetime of the &dp is longer than the lifetime of the DTF instance.
Not sure if this is viable.

sffc · 2021-04-21T02:47:02Z

What do you propose to happen if the provider is destroyed but there are still formatters that have references to data it owns?

When writing that sentence my thinking was to make it impossible in Rust for a DTF instance to outlive the provider. It would just require that the lifetime of the &dp is longer than the lifetime of the DTF instance.
Not sure if this is viable.

Okay. This is actually already the case; DateTimeFormat has a 'd lifetime which is tied to the DataProvider argument.

sffc · 2021-04-22T17:41:00Z

@Manishearth: An option 3 would be to put a Vec<u8> in the data request, and data is owned by that buffer. But in that case it might make more sense to just return owned data. Going in this direction could allow the caller to do interesting things like using an arena or their own reference counting.

sffc · 2021-04-22T17:42:38Z

2021-04-22: The only option really off the table is Option 1 without reference counting.

nciric · 2021-04-27T20:49:56Z

We just finished 2yr long project of removing something like Option 1 from our code base - it would be best if we could avoid cache + reference/pointers combination. Major reason is that it's not really a cache, because you can't evict data.

Idea where we have locale based provider is better, but it's just a thin wrapper around general data provider. Users can still shoot themselves in the foot by creating large number of locale based providers, but in that case it's user mistake.

I would fast forward couple iterations and land on a hybrid model, where in some cases we return a reference to a large dataset (like timezone names), and in other cases we return a value (date patterns). Returning a value works great with caching, and we can implement efficient cache eviction rules in that case.

This approach would require having two types of data providers - reference and value based. We also need to denote, possibly in data, which data set is to be returned as reference and which as value. It can also be combined with locale data providers to limit the data growth.

The question is - are we more concerned with performance hit from complicating data provider logic or potential memory inflation due to user error?

Manishearth · 2021-04-28T03:53:34Z

I and @sffc have been discussing Rc-based approaches all day, and most of it has been him suggesting approaches and me saying the approach won't work well 😄

Recap

Firstly, a recap of existing approaches so that we have a list in one place, and their downsides

Option 1: FsDataProvider owns data
- We have to deal with 'd and 's lifetimes everywhere
- FsDataProvider can never throw away anything, if you want to manage the lifecycle of that data, make new FsDataProviders
Option 2: Data structs own data
- Extra parsing/cloning, or Rcing with a cache
- Still need to care about 's if we're trying to be zero copy
Option 3.1 (new): Have an RcCow (RcZeroVec, etc) which can be owned or Rc, use instead of Cow in data structs
- This is pervasive and needs us to update everything, and needs self referential lifetimes
Option 3.2 (new, good): Have some way of carrying around data structs along with an Rc to their backing buffer
- This also needs self referential lifetimes in a complicated way

I won't discuss option 3.1 as much since option 3.2 is better in every way, but we did discuss ideas in that space today. @sffc may write some more about the various options available to us later.

Design of Option 3.2

On to the actual design of Option 3.2, which I may later turn into a design doc to check in.

The rough idea

The typical data struct will look something like this:

struct MyDataStruct<'s> {
   cow: Cow<'s, str>,
   vec: ZeroVec<'s, u32>
   integer: u64,
   // ... 
}

This struct can contain fully owned data, but it may also contain data that has been zero-copy deserialized from a u8 buffer, containing references to that buffer.

Ideally, we'd be able to carry around a struct like

struct SharedMyDataStruct {
   rc: Rc<[u8]>, // or some kind of reference counting
   data: MyDataStruct<'self>
}

such that the MyDataStruct borrows from the rc. We can then cache this buffer in the data provider as necessary. The data provider can also cache MyDataStruct objects alongside the buffer to avoid reparsing.

Of course, 'self doesn't exist in Rust.

Transforming lifetimes

Now, we can implement SharedMyDataStruct using unsafe code to not need 'self. Ultimately this just needs to behave like a MyDataStruct<'self> with a tagged-on destructor. This is easy enough:

struct SharedMyDataStruct {
   rc: Rc<[u8]>, // or some kind of reference counting
   data: MyDataStruct<'static> // the 'static is a lie
}

impl SharedMyDataStruct {
   fn get<'a>(&'a self) -> &'a MyDataStruct<'a> { // note the lifetime!
       // transmute the lifetime of &self.data
   }
}

It's not safe to return MyDataStruct<'_> with lifetime other than 'a there (since we do not statically know the lifetime of the backing buffer, we just know that it is at least 'a)

This is all well and good. Except this can't be done generically -- we will have many data struct types and it would be a huge pain to write this struct for each of them. What we want is:

struct SharedData<T> {
   rc: Rc<[u8]>, // or some kind of reference counting
   data: T<'static> // the 'static is a lie
}

impl<T> SharedData<T> {
   fn get<'a>(&'a self) -> &'a T<'a> {..}
   fn get_mut<'a>(&'a mut self) -> &'a mut T<'a> {..}
}

// some kind of `Deserialize` impl that can construct this object zero-copy?

except of course this isn't expressible, not without GATs. In current Rust there is no way to generically talk about "T, but with a different lifetime" -- T: 'a bounds T but doesn't swap lifetimes.

Build our own GATs

We can, however, manually build this scaffolding:

unsafe trait LifetimeTransform<'a>: 'static {
   type Output: 'a; // MUST be `Self` with the lifetime swapped
   // used by `SharedData::get()`
   fn transform(&'a self) -> &'a Self::Output;
   fn transform_mut(&'a mut self) -> &'a mut Self::Output;
   // used for deserialization. Safety constraint: `Self` must be
   // destroyed before the data `Self::Output` was deserialized
   // from is
   unsafe fn make(Self::Output) -> Self;
}

unsafe impl<'a> LifetimeTransform<'a> for MyDataStruct<'static> {
  type Output = MyDataStruct<'a>
  // impls are transmutes
}

The trait is unsafe to implement since we're imposing an additional constraint on what we expect Output to be. It might be possible to keep the trait safe and just require unsafe in the implementations, but it's easier to err on the side of caution here.

SharedData becomes:

struct SharedData<T> {
   rc: Rc<[u8]>, // or some kind of reference counting
   data: T<'static> // the 'static is a lie
}

impl<T: for<'a> LifetimeTransform<'a>> SharedData<T> {
   fn get<'a>(&'a self) -> &'a T::Output {..}
   fn get_mut<'a>(&'a mut self) -> &'a mut T::Output {..}
}

// some kind of `Deserialize` impl that can construct this object zero-copy using `LifetimeTransform::make()`
// where you zero-copy deserialize a `T`, `LifetimeTransform::make()` it, and wrap it in a `SharedData<T>` with
// the correct `Rc`

Make it pleasant

Firstly we can swap the manual impl of LifetimeTransform with #[derive(LifetimeTransform)]. This custom derive will just use transmutes, but will also assert that all of the child fields have LifetimeTransform implementations that conform to the constraints.

We do care about patching, and SharedData<T> is patchable on its own. We'd make DataPayload<T> be a thin wrapper around SharedData<T>.

We can allow for non-Rc-backed data by simply making SharedData<T> use a nullable rc field. We can also support Arc by making that field be an enum.

We may eventually need a #[derive(IntoFullyOwned)] custom derive that can deep-copy-on-write to construct fully 'static owned data structs.

Major benefits

We get to get rid of both 's and 'd! No more gnarly lifetimes!
We can easily add Arc support if necessary.
Only data providers which wish to cache need to do reference counting.

Lingering concerns

Is get_mut() actually safe? I have this feeling it should return Pin<&mut T::Output>, but I can't figure out a reason why. It should be possible to write to the T, the borrowed data can only be replaced with owned data, not overwritten.

See unicode-org#667

sffc · 2021-07-13T06:50:27Z

Fixed by #745

sffc added A-performance Area: Performance (CPU, Memory) C-data-infra Component: provider, datagen, fallback, adapters A-design Area: Architecture or design discuss-priority Discuss at the next ICU4X meeting labels Apr 21, 2021

sffc self-assigned this Apr 21, 2021

sffc removed the discuss-priority Discuss at the next ICU4X meeting label Apr 22, 2021

sffc added this to the 2021-Q2-m1 milestone Apr 22, 2021

sffc added the question Unresolved questions; type unclear label Apr 22, 2021

sffc added the discuss-priority Discuss at the next ICU4X meeting label Apr 28, 2021

sffc mentioned this issue Apr 28, 2021

Clean up KeyedDataProvider and get_all_resc_keys() #442

Closed

sffc added T-core Type: Required functionality S-large Size: A few weeks (larger feature, major refactoring) and removed discuss-priority Discuss at the next ICU4X meeting question Unresolved questions; type unclear labels Apr 29, 2021

sffc assigned Manishearth Apr 29, 2021

sffc modified the milestones: 2021-Q2-m1, ICU4X 0.3 Apr 29, 2021

sffc mentioned this issue Apr 29, 2021

Clarify, document, and enforce string types in data structs #571

Closed

Manishearth mentioned this issue May 5, 2021

Add yoke crate #705

Merged

This was referenced May 21, 2021

ICU4X data structs don't actually borrow any data #632

Closed

Store DataPayload instead of Cow throughout ICU4X #729

Merged

Manishearth mentioned this issue May 26, 2021

capi: Separate data loading from ICU4X APIs #560

Closed

sffc added a commit to sffc/omnicu that referenced this issue May 27, 2021

Replace DataPayload Deref with .get() throughout ICU4X

72c85fa

See unicode-org#667

This was referenced Jun 17, 2021

Revisiting no_std in ICU4X #812

Closed

Add RcBuf variant to DataPayload #816

Merged

Migrate some data structs to serde(borrow) #820

Merged

sffc closed this as completed Jul 13, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Data ownership for zero-copy data structs #667

Data ownership for zero-copy data structs #667

sffc commented Apr 21, 2021 •

edited

Loading

zbraniecki commented Apr 21, 2021

sffc commented Apr 21, 2021 •

edited

Loading

zbraniecki commented Apr 21, 2021

sffc commented Apr 21, 2021

sffc commented Apr 22, 2021

sffc commented Apr 22, 2021

nciric commented Apr 27, 2021

Manishearth commented Apr 28, 2021 •

edited by sffc

Loading

sffc commented Jul 13, 2021

Data ownership for zero-copy data structs #667

Data ownership for zero-copy data structs #667

Comments

sffc commented Apr 21, 2021 • edited Loading

Option 1: FsDataProvider owns the data

Option 2: Data struct owns the data

Option 3: Reference count the data

Summary

zbraniecki commented Apr 21, 2021

sffc commented Apr 21, 2021 • edited Loading

zbraniecki commented Apr 21, 2021

sffc commented Apr 21, 2021

sffc commented Apr 22, 2021

sffc commented Apr 22, 2021

nciric commented Apr 27, 2021

Manishearth commented Apr 28, 2021 • edited by sffc Loading

Recap

Design of Option 3.2

The rough idea

Transforming lifetimes

Build our own GATs

Make it pleasant

Major benefits

Lingering concerns

sffc commented Jul 13, 2021

sffc commented Apr 21, 2021 •

edited

Loading

sffc commented Apr 21, 2021 •

edited

Loading

Manishearth commented Apr 28, 2021 •

edited by sffc

Loading