Only denormalize results that have an id #1300

Closed
wmertens opened this issue Feb 16, 2017 · 14 comments

Comments

@wmertens
Contributor

In my app, I have a "medium" amount of data (several hundred small-ish objects) coming in from the server, and getting it and handling it with REST+Redux would not be an issue at all.

With Apollo though, the results are normalized and all lists are split out. The result is that I have thousands upon thousands of entries in the Apollo cache, so many that inspecting it crashes the Apollo dev tools.

Furthermore, satisfying a single-object query from the cache takes about 30ms, presumably because the result needs to be pieced back together.

Would it be possible to change the cache so that only objects with a non-null dataIdFromObject are normalized? Maybe even via a function that looks at the schema instead?
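For example, something along these lines is what I'm imagining (a hypothetical sketch only; today an object without an id still gets a path-based store key like 'root.objects.0', and the ask is that it would stay inline in its parent instead):

import ApolloClient from 'apollo-client';

const client = new ApolloClient({
  // Objects with an id get a global cache key and are shared between queries;
  // a null return would mean "don't normalize this object at all".
  dataIdFromObject: obj =>
    obj.id != null ? `${obj.__typename}:${obj.id}` : null,
});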

@calebmer
Contributor

We would still have to piece the query back together when reading from the store no matter how we save the data, so I’m not convinced that this would improve read performance 😣

Maybe you could put together a quick test with ESBench to show that our reading algorithm would be faster on a denormalized data structure? Remember that we can’t write aliases to the store so we would still need an algorithm that reads back data and gives it the proper alias when we return it to the user.

There will always be some read performance hit given that we normalize our data. Perhaps in the future, we could explore a client that doesn’t normalize data at all…

As for devtools performance, that is definitely something we should look at.

@calebmer calebmer added the idea label Feb 16, 2017
@wmertens
Contributor Author

wmertens commented Feb 16, 2017 via email

@calebmer
Contributor

Yep, I agree with your premises, but we’re hesitant to act on this until we see that an alternative where we don’t normalize would actually show a major perf win.

Since we are currently discussing a major refactor of our store code, this is a change we could fit in, if anyone in the community can provide data proving that this difference would actually matter for perf 👍

(The main reason for the current behavior is that it keeps the implementation simpler.)

@wmertens
Contributor Author

wmertens commented Feb 17, 2017 via email

@calebmer
Contributor

calebmer commented Feb 17, 2017

Benchmarking two implementations in apollo-client would take a lot of time. It would be much easier to write a micro-benchmark with https://esbench.com to demonstrate a comparable situation 😊

Here’s the benchmark I would like to see. We want to read the following object out of a cache:

const object = {
  objects: [
    { foo: 1.01, bar: 2.01, baz: 3.01 },
    { foo: 1.02, bar: 2.02, baz: 3.02 },
    { foo: 1.03, bar: 2.03, baz: 3.03 },
    { foo: 1.04, bar: 2.04, baz: 3.04 },
    { foo: 1.05, bar: 2.05, baz: 3.05 },
    // 100+ more objects...
  ],
};

The keys and values are arbitrary. Then we want to test against two different cache shapes: one where the objects array is normalized, and one where it is denormalized. However, here is the catch: the objects are stored under different keys, so foo is stored under a, bar under b, and baz under c. Our two cache shapes would look like:

const cache1 = {
  'root': {
    objects: {
      $type: 'reference',
      ids: [
        'root.objects.0',
        'root.objects.1',
        'root.objects.2',
        'root.objects.3',
        'root.objects.4',
        // 100+ more ids...
      ],
    },
  },
  'root.objects.0': { a: 1.01, b: 2.01, c: 3.01 },
  'root.objects.1': { a: 1.02, b: 2.02, c: 3.02 },
  'root.objects.2': { a: 1.03, b: 2.03, c: 3.03 },
  'root.objects.3': { a: 1.04, b: 2.04, c: 3.04 },
  'root.objects.4': { a: 1.05, b: 2.05, c: 3.05 },
  // 100+ more nodes...
};

const cache2 = {
  'root': {
    objects: [
      { a: 1.01, b: 2.01, c: 3.01 },
      { a: 1.02, b: 2.02, c: 3.02 },
      { a: 1.03, b: 2.03, c: 3.03 },
      { a: 1.04, b: 2.04, c: 3.04 },
      { a: 1.05, b: 2.05, c: 3.05 },
      // 100+ more objects
    ],
  },
};

This case isn’t super realistic given that in a real cache there would be some mix of references and actual arrays. If the difference is minor then I don’t think this should be our top concern as there are much more pressing performance issues in the cache that we need to solve, but if the difference is major, say cache2 performs 50% better than cache1, then we should probably consider this change.
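For concreteness, the two read paths being compared could look roughly like this (a sketch only; readCache1 and readCache2 are hypothetical helpers, and the alias map is the foo→a, bar→b, baz→c mapping described above):

// Query field -> store field, per the aliasing described above.
const aliases = { foo: 'a', bar: 'b', baz: 'c' };

// Re-alias one stored node back into the shape the user asked for.
function realias(node) {
  const out = {};
  for (const field in aliases) out[field] = node[aliases[field]];
  return out;
}

// Normalized shape: follow each reference, then re-alias.
function readCache1(cache) {
  return { objects: cache['root'].objects.ids.map(id => realias(cache[id])) };
}

// Denormalized shape: no reference lookup, but re-aliasing is still needed.
function readCache2(cache) {
  return { objects: cache['root'].objects.map(realias) };
}

If the cost turns out to be dominated by the re-aliasing rather than the reference lookups, that would suggest denormalizing alone doesn't buy much.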

For an example of a benchmark on ESBench, see my immutable data implementations benchmark: https://esbench.com/bench/588a5ee399634800a03476b0

@Poincare
Contributor

Poincare commented Feb 18, 2017

Benchmarking two implementations should actually be pretty easy with the benchmarks we currently have. I think that would work better, since it would capture unforeseen perf differences caused by how things within AC are actually implemented. For example, a denormalized lookup may be faster in raw JS but slower in practice because of how we handle the data within Apollo Client. Agreed, though, that it would take a bit more effort than a simple raw benchmark.

@wmertens
Contributor Author

Aside from the benchmarks, I think a nice way to handle the normalization is as follows:

  • Receive query result
  • For every cacheable object in the result (*)
    • Copy the object into cache, or merge in the values if an object with the same id is already there
    • Create an object consisting of a hidden reference to the cache object plus getters for the GraphQL-defined values (**)
    • Replace the cacheable object with this proxy object
  • Keep this query result in the cache, keyed by variables

So if you get A{id, foo} as a result for q(v1, v2), right now you store q(v1, v2): "see A_id"; instead you would store q(v1, v2): A', where A' = new wrapA(whereYouCopiedA) and class wrapA { constructor(obj) { this.__hidden = obj } get id() { return this.__hidden.id } get foo() { return this.__hidden.foo } } (see the fuller sketch after the notes below).

The net result of this is that you can read query results directly, while storage is still shared between queries, so normalization works as before. The slowness I experience when getting queries from the cache would be completely gone, since fetching results from the cache would be O(1).

Notes:

  • A cacheable object is one that maps to a unique id, so it can be shared and updated between queries. Any other objects are not usefully normalized, since they are only used in one query
  • (*) you can determine from the schema which objects are cacheable
  • (**) you only need to make one such getter object for each cacheable schema section; you can use it as the prototype for all the instances of this data, each with only the hidden reference
  • To know which active queries need updating, you can keep an array of back references for each cached object; after all, the only code that changes the cache is your own
  • Any cacheable subobjects use the same approach, but the wrapper object is stored in the cached object instead of in the query result
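Here is a minimal sketch of the wrapper idea (names like wrapA and wrapAProto are just for illustration): the getters live on one shared prototype per cacheable schema section, and each instance holds only the hidden reference:

// One getter object per cacheable schema section, shared as a prototype.
const wrapAProto = {
  get id() { return this.__hidden.id; },
  get foo() { return this.__hidden.foo; },
};

// Each wrapper instance carries only the hidden reference to the cached object.
function wrapA(cachedObj) {
  const wrapper = Object.create(wrapAProto);
  wrapper.__hidden = cachedObj;
  return wrapper;
}

// The cache stores A once; each query result holds a cheap wrapper.
const cachedA = { id: 'A_id', foo: 42 };
const queryCache = { 'q(v1, v2)': wrapA(cachedA) };

cachedA.foo = 43;                         // mutate the single shared object...
console.log(queryCache['q(v1, v2)'].foo); // 43: every query sees the update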

@wmertens
Contributor Author

wmertens commented Mar 1, 2017

Thinking some more on what I propose above:

  • it won't quite work with sub-queries that have different variables. Two queries can share a query result but use different variables in their sub-queries, so those sub-query results need to be stored in the cached query.
    • query { a { id b(foo: 1) { id bar } boop } } and query { a { id b(foo: 2) { id bar } boop } } can share a, but b_foo1 and b_foo2 need to be stored separately.
    • A sub-wrapper could be used. For example, the first query would be in cache as
      { // object that holds the query cache
        query_a: {
          __hidden: {id, boop}, // shared reference to the cached object with the correct id
          // in the prototype:
          get id() { return this.__hidden.id },
          get boop() { return this.__hidden.boop },
          b: {
            __hidden: {id, bar}, // shared reference to the cached object with the correct id
            // in the prototype:
            get bar() { return this.__hidden.bar },
          },
        },
      }
  • Personally I don't care if my query returns more data than I asked for. So if there are no sub-queries, a wrapper object that grabs only the requested data would not be needed. After all, the only two things you can do when making a query are passing variables and asking for additional fields. Variables are used to key the cache, and additional fields don't impact the fields that were requested (except if you use Object.keys() to iterate, which would be easy to work around).
    • So that would mean the above would instead be
      { // object that holds the query cache
        query_a: {
          __hidden: {id, boop}, // shared reference to the cached object with the correct id
          // in the prototype:
          get id() { return this.__hidden.id },
          get boop() { return this.__hidden.boop },
          b: {id, bar, meep}, // shared reference to the cached object with the correct id; some other query wanted `meep`
        },
      }

@Poincare
Contributor

Poincare commented Mar 2, 2017

@wmertens This approach does sound interesting, especially when dealing with returned lists of objects. Our current approach there is O(n), but this should give us an O(1) read for the list. To clarify: if one query returns a list of, say, to-do items and another query returns an update to a specific to-do item (with a particular id), then we'd only store one "real" object between the two, but two different proxy objects that each hold a reference to the "real" object?

@wmertens
Contributor Author

wmertens commented Mar 2, 2017 via email

@stubailo
Contributor

stubailo commented Mar 7, 2017

@wmertens could you provide an example of a single query and dataset so that we could test out different reading/writing methods?

@wmertens
Contributor Author

wmertens commented Mar 7, 2017

@stubailo I gave @helfer a private link to a project to test things in, is that ok?

@wmertens
Contributor Author

wmertens commented Mar 10, 2017

A quick note: storing the cache this way allows updating cached objects (COs) by mutating the single base CO, but it doesn't allow removing objects easily.

To do that, you could keep a WeakSet (per CO) of the queries that refer to it, so those queries can be traversed and the CO removed from their lists and references (on remove, check each active query against the WeakSet and, if it's a member, traverse it).
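Roughly, as a sketch (removeFromResult is a hypothetical traversal helper; since WeakSets can't be iterated, removal checks each active query for membership):

// Per cached object: the set of query results that reference it.
const referrers = new WeakMap(); // cached object -> WeakSet of query results

function trackReference(cachedObj, queryResult) {
  if (!referrers.has(cachedObj)) referrers.set(cachedObj, new WeakSet());
  referrers.get(cachedObj).add(queryResult);
}

function removeCachedObject(cachedObj, activeQueries) {
  const refs = referrers.get(cachedObj);
  if (!refs) return;
  for (const query of activeQueries) {
    // WeakSets aren't traversable, so test each active query for membership.
    if (refs.has(query)) {
      // Hypothetical: walk the query result and drop the CO from its
      // lists and references.
      removeFromResult(query, cachedObj);
    }
  }
}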

@helfer
Contributor

helfer commented May 3, 2017

Let's track this under #1409

@helfer helfer closed this as completed May 3, 2017