Only denormalize results that have an id #1300

Closed
wmertens opened this issue Feb 16, 2017 · 14 comments

Comments

@wmertens
Contributor

In my app, I have a "medium" amount of data (several hundred small-ish objects) coming in from the server, and getting it and handling it with REST+Redux would not be an issue at all.

With Apollo though, the results are normalized and all lists are split out. The result is that I have thousands upon thousands of entries in the Apollo cache, so many that inspecting it crashes the Apollo dev tools.

Furthermore, satisfying a single-object query from the cache takes about 30ms, presumably because the result needs to be pieced back together.

Would it be possible to change the cache so that only objects with a non-null dataIdFromObject are normalized? Maybe even via a function that looks at the schema instead?
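For example, something along these lines is what I'm imagining (a hypothetical sketch only; today an object without an id still gets a path-based store key like 'root.objects.0', and the ask is that it would stay inline in its parent instead):

import ApolloClient from 'apollo-client';

const client = new ApolloClient({
  // Objects with an id get a global cache key and are shared between queries;
  // a null return would mean "don't normalize this object at all".
  dataIdFromObject: obj =>
    obj.id != null ? `${obj.__typename}:${obj.id}` : null,
});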

@calebmer
Contributor

We would still have to piece the query back together when reading from the store no matter how we save the data, so I’m not convinced that this would improve read performance 😣

Maybe you could put together a quick test with ESBench to show that our reading algorithm would be faster on a denormalized data structure? Remember that we can’t write aliases to the store so we would still need an algorithm that reads back data and gives it the proper alias when we return it to the user.

There will always be some read performance hit given that we normalize our data. Perhaps in the future, we could explore a client that doesn’t normalize data at all…

As for devtools performance, that is definitely something we should look at.

@calebmer calebmer added the idea label Feb 16, 2017
@wmertens
Contributor Author

wmertens commented Feb 16, 2017 via email

@calebmer
Contributor

Yep, I agree with your premises, but we’re hesitant to act on this until we see that an alternative where we don’t normalize would actually show a major perf win.

Since we are currently discussing a major refactor of our store code, this is a change we could fit in, if anyone in the community can provide data proving that this difference would actually matter for perf 👍

(The main reason for the current behavior is that it keeps the implementation simpler.)

@wmertens
Contributor Author

wmertens commented Feb 17, 2017 via email

@calebmer
Contributor

calebmer commented Feb 17, 2017

Benchmarking two implementations in apollo-client would take a lot of time. It would be much easier to write a micro-benchmark with https://esbench.com to demonstrate a comparable situation 😊

Here’s the benchmark I would like to see. We want to read the following object out of a cache:

const object = {
  objects: [
    { foo: 1.01, bar: 2.01, baz: 3.01 },
    { foo: 1.02, bar: 2.02, baz: 3.02 },
    { foo: 1.03, bar: 2.03, baz: 3.03 },
    { foo: 1.04, bar: 2.04, baz: 3.04 },
    { foo: 1.05, bar: 2.05, baz: 3.05 },
    // 100+ more objects...
  ],
};

The keys and values are arbitrary. Then we want to test against two different cache shapes: one where the objects array is normalized, and one where it is denormalized. However, here is the catch: the objects are stored under different keys, so foo is stored under a, bar under b, and baz under c. Our two cache shapes would look like:

const cache1 = {
  'root': {
    objects: {
      $type: 'reference',
      ids: [
        'root.objects.0',
        'root.objects.1',
        'root.objects.2',
        'root.objects.3',
        'root.objects.4',
        // 100+ more ids...
      ],
    },
  },
  'root.objects.0': { a: 1.01, b: 2.01, c: 3.01 },
  'root.objects.1': { a: 1.02, b: 2.02, c: 3.02 },
  'root.objects.2': { a: 1.03, b: 2.03, c: 3.03 },
  'root.objects.3': { a: 1.04, b: 2.04, c: 3.04 },
  'root.objects.4': { a: 1.05, b: 2.05, c: 3.05 },
  // 100+ more nodes...
};

const cache2 = {
  'root': {
    objects: [
      { a: 1.01, b: 2.01, c: 3.01 },
      { a: 1.02, b: 2.02, c: 3.02 },
      { a: 1.03, b: 2.03, c: 3.03 },
      { a: 1.04, b: 2.04, c: 3.04 },
      { a: 1.05, b: 2.05, c: 3.05 },
      // 100+ more objects
    ],
  },
};

This case isn’t super realistic given that in a real cache there would be some mix of references and actual arrays. If the difference is minor then I don’t think this should be our top concern as there are much more pressing performance issues in the cache that we need to solve, but if the difference is major, say cache2 performs 50% better than cache1, then we should probably consider this change.
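For concreteness, the two read paths being compared could look roughly like this (a sketch only; readCache1 and readCache2 are hypothetical helpers, and the alias map is the foo→a, bar→b, baz→c mapping described above):

// Query field -> store field, per the aliasing described above.
const aliases = { foo: 'a', bar: 'b', baz: 'c' };

// Re-alias one stored node back into the shape the user asked for.
function realias(node) {
  const out = {};
  for (const field in aliases) out[field] = node[aliases[field]];
  return out;
}

// Normalized shape: follow each reference, then re-alias.
function readCache1(cache) {
  return { objects: cache['root'].objects.ids.map(id => realias(cache[id])) };
}

// Denormalized shape: no reference lookup, but re-aliasing is still needed.
function readCache2(cache) {
  return { objects: cache['root'].objects.map(realias) };
}

If the cost turns out to be dominated by the re-aliasing rather than the reference lookups, that would suggest denormalizing alone doesn't buy much.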

For an example of a benchmark on ESBench, see my immutable data implementations benchmark: https://esbench.com/bench/588a5ee399634800a03476b0

@Poincare
Contributor

Poincare commented Feb 18, 2017

Benchmarking two implementations should actually be pretty easy with the benchmarks we currently have. I think that would work better, since it would capture unforeseen perf differences caused by how things within AC are actually implemented. For example, a denormalized lookup may be faster in raw JS but slower in practice because of how we handle the data within Apollo Client. Agreed, though, that it would take a bit more effort than a simple raw benchmark.

@wmertens
Contributor Author

Aside from the benchmarks, I think a nice way to handle the normalization is as follows:

  • Receive query result
  • For every cacheable object in the result (*)
    • Copy the object into cache, or merge in the values if an object with the same id is already there
    • Create an object consisting of a hidden reference to the cache object plus getters for the GraphQL-defined values (**)
    • Replace the cacheable object with this proxy object
  • Keep this query result in the cache, keyed by variables

So if you get A{id, foo} as a result for q(v1, v2), right now you store q(v1, v2): "see A_id"; instead you would store q(v1, v2): A', where A' = new wrapA(whereYouCopiedA) and class wrapA { constructor(obj) { this.__hidden = obj } get id() { return this.__hidden.id } get foo() { return this.__hidden.foo } } (see the fuller sketch after the notes below).

The net result of this is that you can read query results directly, while storage is still shared between queries, so normalization works as before. The slowness I experience when getting queries from the cache would be completely gone, since fetching results from the cache would be O(1).

Notes:

  • A cacheable object is one that maps to a unique id, so it can be shared and updated between queries. Any other objects are not usefully normalized, since they are only used in one query
  • (*) you can determine from the schema which objects are cacheable
  • (**) you only need to make one such getter object for each cacheable schema section; you can use it as the prototype for all the instances of this data, each with only the hidden reference
  • To know which active queries need updating, you can keep an array of back references for each cached object; after all, the only code that changes the cache is your own
  • Any cacheable subobjects use the same approach, but the wrapper object is stored in the cached object instead of in the query result
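Here is a minimal sketch of the wrapper idea (names like wrapA and wrapAProto are just for illustration): the getters live on one shared prototype per cacheable schema section, and each instance holds only the hidden reference:

// One getter object per cacheable schema section, shared as a prototype.
const wrapAProto = {
  get id() { return this.__hidden.id; },
  get foo() { return this.__hidden.foo; },
};

// Each wrapper instance carries only the hidden reference to the cached object.
function wrapA(cachedObj) {
  const wrapper = Object.create(wrapAProto);
  wrapper.__hidden = cachedObj;
  return wrapper;
}

// The cache stores A once; each query result holds a cheap wrapper.
const cachedA = { id: 'A_id', foo: 42 };
const queryCache = { 'q(v1, v2)': wrapA(cachedA) };

cachedA.foo = 43;                         // mutate the single shared object...
console.log(queryCache['q(v1, v2)'].foo); // 43: every query sees the update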

@wmertens
Contributor Author

wmertens commented Mar 1, 2017

Thinking some more on what I propose above:

  • it won't quite work with sub-queries that have different variables. Two queries can share a query result but use different variables in their sub-queries, so those sub-query results need to be stored in the cached query.
    • query { a { id b(foo: 1) { id bar } boop } } and query { a { id b(foo: 2) { id bar } boop } } can share a, but b_foo1 and b_foo2 need to be stored separately.
    • A sub-wrapper could be used. For example, the first query would be in cache as
      { // object that holds the query cache
        query_a: {
          __hidden: {id, boop}, // shared reference to the cached object with the correct id
          // in the prototype:
          get id() { return this.__hidden.id },
          get boop() { return this.__hidden.boop },
          b: {
            __hidden: {id, bar}, // shared reference to the cached object with the correct id
            // in the prototype:
            get bar() { return this.__hidden.bar },
          },
        },
      }
  • Personally I don't care if my query returns more data than I asked for. So if there are no sub-queries, a wrapper object that grabs only the requested data would not be needed. After all, the only two things you can do when making a query are passing variables and asking for additional fields. Variables are used to key the cache, and additional fields don't impact the fields that were requested (except if you use Object.keys() to iterate, which would be easy to work around).
    • So that would mean the above would instead be
      { // object that holds the query cache
        query_a: {
          __hidden: {id, boop}, // shared reference to the cached object with the correct id
          // in the prototype:
          get id() { return this.__hidden.id },
          get boop() { return this.__hidden.boop },
          b: {id, bar, meep}, // shared reference to the cached object with the correct id; some other query wanted `meep`
        },
      }

@Poincare
Contributor

Poincare commented Mar 2, 2017

@wmertens This approach does sound interesting, especially when dealing with returned lists of objects. Our current approach there is O(n), but this should give us an O(1) read for the list. To clarify: if one query returns a list of, say, to-do items and another query returns an update to a specific to-do item (with a particular id), then we'd only store one "real" object between the two, but two different proxy objects that each hold a reference to the "real" object?

@wmertens
Contributor Author

wmertens commented Mar 2, 2017 via email

@stubailo
Contributor

stubailo commented Mar 7, 2017

@wmertens could you provide an example of a single query and dataset so that we could test out different reading/writing methods?

@wmertens
Contributor Author

wmertens commented Mar 7, 2017

@stubailo I gave @helfer a private link to a project to test things in, is that ok?

@wmertens
Contributor Author

wmertens commented Mar 10, 2017

A quick note: storing the cache this way allows updating cached objects (COs) by mutating the single base CO, but it doesn't allow removing objects easily.

To do that, you could keep a WeakSet (per CO) of the queries that refer to it, so those queries can be traversed and the CO removed from their lists and references (on remove, check each active query against the WeakSet and, if it's a member, traverse it).
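Roughly, as a sketch (removeFromResult is a hypothetical traversal helper; since WeakSets can't be iterated, removal checks each active query for membership):

// Per cached object: the set of query results that reference it.
const referrers = new WeakMap(); // cached object -> WeakSet of query results

function trackReference(cachedObj, queryResult) {
  if (!referrers.has(cachedObj)) referrers.set(cachedObj, new WeakSet());
  referrers.get(cachedObj).add(queryResult);
}

function removeCachedObject(cachedObj, activeQueries) {
  const refs = referrers.get(cachedObj);
  if (!refs) return;
  for (const query of activeQueries) {
    // WeakSets aren't traversable, so test each active query for membership.
    if (refs.has(query)) {
      // Hypothetical: walk the query result and drop the CO from its
      // lists and references.
      removeFromResult(query, cachedObj);
    }
  }
}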

@helfer
Contributor

helfer commented May 3, 2017

Let's track this under #1409

@helfer helfer closed this as completed May 3, 2017