[WIP]: feat(gatsby): use lokijs for in-memory database #9338
Conversation
This looks good! A lot going on here, so I'd need to do a deeper dive before we'd get this merged--and obviously because it's a WIP!
Left some comments, mostly on lodash.
- loki without indexes is overall slightly faster than master, except when there are many types
- loki with indexes is about 2x faster on sites with 10k pages, and 5x faster with 20k pages. But it is ever so slightly slower when those pages are split across 100 types.

Overall, loki is a big win for sites with lots of pages of the same type. For smaller sites, the difference is negligible.
A caching PR I recently worked on had similar findings, which I think is totally reasonable. If it's mostly the same for smaller sites but it makes Gatsby much more scalable I think that's a big win.
I honestly wonder if using lokijs consistently (e.g. for cache and internal Gatsby utilities) would be worthwhile. I'd lean towards yes!
The fs fallback is interesting, since I used a similar approach for the cache! Basically I kept 100-200 items in memory, and then it fell back to the filesystem as its cache, which seems to work pretty well for large sites.
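For illustration, here's a rough sketch of that hybrid layout. All names, the FIFO eviction, and the JSON-on-disk format are assumptions for the sketch, not details from the actual cache PR:

```js
const fs = require(`fs-extra`)
const path = require(`path`)

// Hypothetical hybrid cache: a small in-memory window in front of a
// filesystem store, so large sites spill to disk instead of exhausting RAM.
class HybridCache {
  constructor(directory, maxInMemory = 200) {
    this.directory = directory
    this.maxInMemory = maxInMemory
    this.memory = new Map() // Map preserves insertion order: oldest first
  }

  fileFor(key) {
    return path.join(this.directory, encodeURIComponent(key) + `.json`)
  }

  async get(key) {
    if (this.memory.has(key)) return this.memory.get(key)
    // Fall back to the filesystem for entries evicted from memory.
    if (await fs.pathExists(this.fileFor(key))) {
      return fs.readJson(this.fileFor(key))
    }
    return undefined
  }

  async set(key, value) {
    this.memory.set(key, value)
    if (this.memory.size > this.maxInMemory) {
      // Evict the oldest inserted entry (FIFO; a real implementation might
      // use a proper LRU). The evicted entry survives on disk.
      const oldest = this.memory.keys().next().value
      this.memory.delete(oldest)
    }
    await fs.outputJson(this.fileFor(key), value)
  }
}
```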
@KyleAMathews not explicitly (although I can), but I used the upper bound of when it ran out of memory as the basis for the number of items. It also depends on machine setup, the size of the items being cached, etc.
@DSchau Thanks for all the feedback! For caching, using loki would be a step up from using pure arrays, but an even bigger win would probably be a real caching library that can also provide persistence: one with LRU, paging to disk, and other cache-specific semantics, assuming such a library exists! (I haven't looked)
@KyleAMathews Memory usage should be less since we're actually able to remove some caches. But I'll double check. What's your favorite memory profiling tool?
That's what I assume too but it's nice to know for sure.
There's probably something on npm, but you can also add a setInterval in the code and log out the memory usage every 100ms or so. You could also log out different events, e.g. when query running starts, so you could visualize things easily.
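A minimal sketch of that polling idea (everything here is illustrative, not code from the PR):

```js
// Sample heap usage every 100ms and log labelled events alongside it,
// so memory can be correlated with bootstrap phases afterwards.
const startTime = Date.now()

function logEvent(name) {
  console.log(`${Date.now() - startTime}ms event=${name}`)
}

const timer = setInterval(() => {
  const { heapUsed, rss } = process.memoryUsage()
  const mb = bytes => (bytes / 1024 / 1024).toFixed(1)
  console.log(`${Date.now() - startTime}ms heapUsed=${mb(heapUsed)}MB rss=${mb(rss)}MB`)
}, 100)

// e.g. logEvent(`query-running-start`) at interesting points,
// and clearInterval(timer) once bootstrap finishes.
```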
const nodes = getNodes()

if (!nodes.some(n => n.internal.type === `ContentfulAsset`)) {
const nodes = getNodesByType(`ContentfulAsset`)
I'll look through the implementation of this as I go through, but would this be worth returning an empty array or something so we can resolve the empty case more easily?
Yeah it would save us all of those checks below
Yep! Good call. I'll return the empty collection
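A sketch of what that might look like, based on the `getNodesByType` shape that appears later in this diff (`coll.data` is loki's backing array of documents):

```js
function getNodesByType(typeName) {
  const coll = db.getCollection(typeName)
  // Return an empty array instead of null so callers can treat
  // "no collection yet" and "collection with no nodes" identically.
  if (!coll) return []
  return coll.data
}
```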
})
} catch (e) {
report.error(
`Error starting DB. Perhaps try deleting ${path.dirname(dbSaveFile)}`
🤔 probably worth just doing it automatically, or at least prompting?
My thoughts too! Glad you agree. I'll implement it
@@ -6,6 +6,7 @@ const crypto = require(`crypto`)
const glob = require(`glob`)
const { store } = require(`../../redux`)
const existsSync = require(`fs-exists-cached`).sync
const createNodeId = require(`../../utils/create-node-id`)
@pieh we need a gatsby-utils stat :) Something one of us can work on soon!
*/
function deleteAllCollections() {
if (db) {
_.forEach(db.listCollections(), collInfo => {
Might be worth benchmarking _.forEach vs. native forEach (in general, not just here!). I tend to prefer to defer to native array methods, but I may be unique in that regard.
I think that it's a safer bet to presume they'll be optimized very efficiently, if they're not already!
Cool, I'm actually a relative newcomer to js so any tips like this are greatly appreciated. I may have gone a bit lodash mad in this PR as I come from a functional background. Is this the native for each you mean? https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Statements/for...of
Silly question, is db.listCollections() an object or an array? I presumed array, but on second glance, it could actually be an object?
If it's the latter, then this _.forEach may actually be preferable!
I'm about to board a plane, but I'll add some more feedback on this tomorrow. Nice work!!
Sorry, forgot to reply. It's an array. I've been using `for (const foo of someArray)` and I'm definitely preferring it
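The two styles discussed above, side by side (purely illustrative):

```js
const _ = require(`lodash`)

const collections = db.listCollections() // an array of collection metadata

// lodash:
_.forEach(collections, collInfo => db.removeCollection(collInfo.name))

// ...or the equivalent native for...of, as used elsewhere in this PR:
for (const collInfo of collections) {
  db.removeCollection(collInfo.name)
}
```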
* by node.id
*/
function updateNode(node, oldNode) {
invariant(node.internal, `node has no "internal" field`)
I like the use of invariant here (and elsewhere!)
Yeah, my approach with invariants like this is to be quite liberal in their application, and then perhaps remove them in the future if this code becomes very stable.
if (!oldNode) {
oldNode = getNode(node.id)
}
const updateNode = _.merge(oldNode, node)
Is this doing anything an Object.assign wouldn't do? (again, sorry, I just don't always love lodash!)
should we actually merge oldNode with the new one? This might cause weird effects, like fields deleted from a node not actually being deleted in the db?
@pieh Great catch again. I just looked at the redux implementation and all it does is overwrite the existing node in redux. Not sure why I did a merge here. I'll change to overwrite.
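To make the difference concrete (illustrative values only): a deep merge resurrects fields the new node dropped, while an overwrite doesn't.

```js
const _ = require(`lodash`)

const oldNode = { id: `1`, draft: true, tags: [`a`] }
const newNode = { id: `1`, tags: [`b`] }

// _.merge: `draft` survives even though the new node no longer has it.
_.merge({}, oldNode, newNode)
// => { id: '1', draft: true, tags: ['b'] }

// Overwrite (what the redux implementation does): deleted fields
// actually disappear, because the new node replaces the old one wholesale.
```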
*/
function getNodesByType(typeName) {
const coll = db.getCollection(typeName)
if (!coll) return null
Yeah, maybe just return empty array here? Makes it easier for a consumer of this "API" to handle the empty and null cases the same way.
addressed in other comment
This looks very promising! It would be great if we could keep the current implementation while this is being worked on, and inject the new one using some kind of feature flag
}

function chooseQueryEngine(queryArgs) {
if (hasPluginFields(queryArgs)) {
This will definitely need to be more robust - our filter schema support is not really that great, but even now we can filter by linked nodes via the ___NODE convention (which needs a custom resolver to access the underlying node)
Great catch. I didn't notice the linked field example. Can you think of any other types of fields that might need to be resolved too? How about mapping fields?
When @KyleAMathews and I were discussing this PR, we decided that the vast majority of sites would not be using pluginFields. I assume they also aren't using linked fields. But I could be wrong in that assumption
We currently don't support filtering on mapped fields (that's what I meant when I said support isn't great), but it's something we would definitely want to be able to do (here's one of the issues where people mention they can't really filter by mapped fields - #7251; I have a WIP branch for this - pieh@ccd110c).
For now only ___NODE fields would be a problem. It's worth looking into date or local File fields, which do have custom resolvers for output, but we don't create a correct filter schema for those right now.
When we create schema type fields with custom resolvers, we should track those fields (similar to how you track fields from plugins)
I hate the "resolve all the things" approach of run-sift, but I get why it exists. It's a shame, because the second we have to resolve all nodes, we lose any ability to perform index based lookups.
For plugin filter fields, I think this will always be the case since their resolvers can do anything. But I wonder if there's an elegant solution for linked and mapped nodes. If we were using a graph database, then we would already have the link setup.
One way or another, we'll need to find a proper solution eventually, because I imagine many sites will want to take advantage of linked and mapped filtering.
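A hedged sketch of how the `chooseQueryEngine` dispatch above might grow to cover this, with `hasResolverFields`, `runSift`, and `runLoki` standing in for the field tracking @pieh suggests (all names beyond the earlier diff are hypothetical):

```js
function chooseQueryEngine(queryArgs) {
  // Resolvers can do anything, so any filter that touches a field backed
  // by a custom resolver (plugin field, ___NODE link, mapped field, ...)
  // must fall back to resolve-all-then-sift.
  if (hasPluginFields(queryArgs) || hasResolverFields(queryArgs)) {
    return runSift
  }
  // Pure data: safe for loki, where indexes can actually help.
  return runLoki
}
```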
@pieh I love the idea of feature flagging this. I'm kicking myself for not writing all this code with that in mind. I'll have a think about how hard it would be to do that.
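One way the flag could look; this sketch assumes an env var and module paths that may not match the final layout (the follow-up work below did land a `GATSBY_DB_NODES` flag):

```js
// db/index.js (hypothetical): choose the nodes backend once, up front.
const backendName = process.env.GATSBY_DB_NODES || `redux`

const backend =
  backendName === `loki`
    ? require(`./loki/nodes`) // assumed path
    : require(`./redux/nodes`) // assumed path

// Callers import this module instead of reaching into redux directly.
module.exports = backend
```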
// if we can query by pure data or whether we have to resolve the
// fields first. See `./run-query.js`
_.forEach(inferredFieldNames, fieldName => {
pluginFieldTracking.add(fieldName)
I think we should namespace fieldNames - if one node type uses a resolver for field `foo`, we don't want to revert to sift when we query another type that uses a field with the same `foo` name
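A sketch of that namespacing (the `typeName` parameter and the `hasPluginFields` shape are assumed for illustration):

```js
// Track `${typeName}.${fieldName}` instead of bare field names, so a
// resolver on Foo.foo doesn't force sift for queries against Bar.foo.
_.forEach(inferredFieldNames, fieldName => {
  pluginFieldTracking.add(`${typeName}.${fieldName}`)
})

function hasPluginFields(typeName, filterFieldNames) {
  return filterFieldNames.some(fieldName =>
    pluginFieldTracking.has(`${typeName}.${fieldName}`)
  )
}
```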
// Allow page creators to specify that they want indexes
// automatically created for their pages.
if (context.useQueryIndex) {
ensureIndexes(coll, lokiArgs)
How expensive is this when the indexes were already applied? We'll definitely run the same query (with a different context) when we use the same template for multiple pages.
This also has the potential to keep unneeded indexes around - for example when we modify filters (remove some fields from a filter) - but I'm not sure if it's worth keeping track of that
@pieh loki handles the case of the index already existing so there's no performance penalty.
But yes, this will keep old indexes around if filters are modified. My solution here for applying queries is very basic, but I think there's a lot more we could do here over time (see original PR description). My suggestion would be to keep as is, but iterate on this based on feedback
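For reference, a sketch of what `ensureIndexes` presumably does; loki's `Collection.ensureIndex` returns early when a clean binary index for the field already exists, which is why repeat queries are cheap:

```js
// Create a loki binary index for each top-level field in the query args.
// Dotted paths (e.g. `foo.bar`) cover nested fields.
function ensureIndexes(coll, lokiArgs) {
  for (const fieldName of Object.keys(lokiArgs)) {
    coll.ensureIndex(fieldName)
  }
}
```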
In preparation for #9338, I've added a new nodes db module that can be implemented by an appropriate db nodes backend (only redux as of now). The loki PR will eventually be implemented so that it can be feature flagged on/off using this functionality. I also added `getNodesByType` since it makes sense even without loki.
**Builds on #9416. Merge that one before this.** In addition to performing the queries, `run-sift.js` also currently converts the results into connections (if it's a connection query). But as part of #9338, we will be using `loki` to perform queries too, so we would need to duplicate the connection logic in both query engines. This PR pulls the connection logic out of run-sift and into `build-node-connections`, which makes sense to me as the query engine doesn't need to know about connections - only whether it should return one result or many (thus the `firstOnly` flag). Another benefit is that the query engine no longer needs to understand page dependencies.
This PR adds a feature flag `GATSBY_DB_NODES` that can be used to change the storage engine for the gatsby data layer (`nodes`):
- `redux` (default), which uses the existing redux implementation, and `sift` for querying
- `loki`, which uses the loki in-memory database to store and query

This PR re-implements the functionality in #9338, but with all tests passing and previous feedback addressed. It should also be easier to review since it builds on several refactorings. Some things to note:
1. I [submitted a PR to lokijs](techfort/LokiJS#718) which still hasn't been merged, though the author says he'll start working on it soon. Therefore, in the interim, I've published [@moocar/lokijs](https://www.npmjs.com/package/@moocar/lokijs).
2. I haven't implemented auto indexing of query fields yet. I'll attack that next.
3. I suggest we ask the community to try out the feature flag once it's merged to get feedback. All the tests pass, but this is such a big change that we'll want to test it gradually.
4. While loki uses the same mongo query language as sift, they do have different behavior. Most of my time on this PR was spent ensuring that loki behaves **exactly** like sift. See [db/loki/nodes-query.js](https://github.com/gatsbyjs/gatsby/blob/cddbe893a4ce638babb1cbe5e5da4c13b6f5e57d/packages/gatsby/src/db/loki/nodes-query.js). But there's a chance a few edge cases have slipped through the cracks.
5. The feature flag works with the tests too: `GATSBY_DB_NODES=loki yarn test`. We should perhaps look into running this on all PRs.
I'll close this PR - I think everything from here was already merged in more incremental chunks in the past couple of weeks
Draft. Don't merge!! Feedback requested. Questions at the bottom.
Summary
One of the most expensive bootstrap phases in Gatsby for large sites is the GraphQL querying phase. There are two aspects to this; this PR addresses the second: querying.
Gatsby stores all its nodes in a redux namespace (`nodes`). To query, it uses sift. Sift doesn't have any concept of indexing, so it performs a brute-force search by iterating over all nodes until it finds a match (or matches). The same applies to sorting. This is very slow on large sites.
This PR speeds up query performance for 10k-page sites by about 2x and for 20k-page sites by about 5x. See recording.md for results.
Loki
The main change is to replace the redux `nodes` namespace with loki.js, and then at query time use loki instead of sift. In redux, nodes are stored in one giant array, whereas with loki I have divided them up into collections per type of node (`internal.type`). This is because almost everywhere in Gatsby, we want to query the nodes of a single type, and almost never all nodes in the system. This means that plugins that currently filter through all nodes looking for those of a certain type can call `getNodesByType()`, which is an O(1) operation.
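A sketch of the per-type layout (assumed shape, not the exact PR code):

```js
// Each internal.type gets its own loki collection, so fetching all nodes
// of a type is a named-collection lookup instead of a scan of every node.
function createNode(node) {
  const type = node.internal.type
  // Create the collection the first time a type is seen.
  const coll = db.getCollection(type) || db.addCollection(type)
  coll.insert(node)
}
```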
Differences
Like sift, loki's query syntax matches mongo's. There is one major difference though: sift can query nested fields using nested data structures, e.g. `{ foo: { bar: { eq: "1" } } }`, whereas loki uses dotted string paths, e.g. `{ "foo.bar": { eq: "1" } }`.
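Converting between the two is just flattening nested objects into dotted key paths. A sketch, assuming every leaf is a comparator object keyed by operator names (the `OPERATORS` list here is illustrative):

```js
const _ = require(`lodash`)

// Hypothetical comparator names, following the examples above.
const OPERATORS = new Set([`eq`, `ne`, `in`, `nin`, `regex`, `glob`, `lt`, `lte`, `gt`, `gte`])

// { foo: { bar: { eq: "1" } } }  =>  { "foo.bar": { eq: "1" } }
function toDottedFields(filter, path = []) {
  const result = {}
  for (const [key, value] of Object.entries(filter)) {
    if (_.isPlainObject(value) && !Object.keys(value).every(k => OPERATORS.has(k))) {
      // Still inside the field path: recurse deeper.
      Object.assign(result, toDottedFields(value, [...path, key]))
    } else {
      // Hit a comparator object: emit the dotted path.
      result[[...path, key].join(`.`)] = value
    }
  }
  return result
}
```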
Nested indexes
Loki didn't support indexes on nested (dotted) fields. I've addressed this in techfort/LokiJS#718. Hopefully it will be merged soon. If not, we can create a fork for now and publish it.
Persistence
Loki is an in-memory DB but supports file serialization. It supports the same autosave functionality that we've built into redux/index.js:saveState() but is supposedly even faster. If we move `pages`, `dataDependencies` etc. into loki, we should be able to remove most of this functionality. Another added benefit is that saves only occur if data is dirty.
Query Benchmark
Added a benchmark project to test query performance.
Questions
1. Should we call `db/index.js` directly from `actions.js` and then emit a `CREATE_NODE` event?
2. Should we move `pages`, `componentDataDependencies` and other redux namespaces into loki instead? Benefits would include dirty-flagged persistence and a potential querying speed-up.
3. I've added `useQueryIndex`, which can be set on the query context. This tells loki to create indexes for all fields in the query. This allows users to create indexes where applicable, which for most sites will only be on node groups of more than 1000 pages or so. But the other option is that we do something more intelligent to auto-set indexes, e.g. flag groups of nodes that are > 1000, and if a query occurs against them, then index the query's fields.
4. Loki stores its own metadata on each document in a `$loki` field. And nodes are mutable, so that field will leak into the rest of Gatsby. It hasn't been an issue so far, but I'm curious if folks think it will be a problem. We could remove the field after query results are returned, but we'd have to clone the whole object, which would be a performance penalty.

Note, the tests are failing. I'm working on fixing them up now but wanted to get the PR up now.