Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Caching #101

Closed
wants to merge 18 commits into from
Closed

Caching #101

wants to merge 18 commits into from

Conversation

jamestalmage
Copy link
Member

So...
Caching makes a pretty big difference.

The AVA test suite went from 42s to 32s on my machine, even with this relatively trivial implementation. It generates a new directory: cwd/.nyc_cache and writes cached files there based on the md5 hash of their pre-instrumented content.

It is probably worth checking out cacha for this.

@floatdrop, my impression of avajs/ava#189 was that it was still a WIP. Where is cacha currently? Is it ready for us to push into production? If not, I am happy to help any way I can.

@floatdrop
Copy link

@jamestalmage cacha is ready, but I could not catch up with rebasing PR.

@@ -1,4 +1,5 @@
.nyc_output
.nyc_cache
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be put in the home directory instead? It's annoying having to .gitignore this in every repo using nyc. ESLint does this by default: https://github.com/eslint/eslint/blob/master/docs/user-guide/command-line-interface.md#--cache-location

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ESLint does this by default

It looks to me like ESLint's default behavior is similar to this one.

Path to the cache location. Can be a file or a directory. If none specified .eslintcache will be used. The file will be created in the directory where the eslint command is executed.

I agree, it is annoying to have to ignore it every time. I am not sure defaulting to the home directory is the best choice though. One advantage of the annoying folder in your project's base directory is that you notice it, and it becomes obvious what it is for. Hiding it away in your home directory puts it out of mind when you are trying to troubleshoot problems. Maybe we could put it in ./node_modules/nyc/.nyc_cache, that way they end up clearing the cache if they rimraf node_modules while troubleshooting.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or better yet, maybe we should put it in:

./node_modules/.cache/nyc, and encourage other modules to use ./node_modules/.cache/<module-name> as well. Then trash node_modules/.cache becomes the de facto way to clear caches when troubleshooting.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That could work too.

@jamestalmage
Copy link
Member Author

Moved this comment to here, since it was a big wall of text and not entirely applicable to this issue:
https://gist.github.com/jamestalmage/758b09c42069ad96c792

@@ -161,7 +161,7 @@ describe('nyc', function () {
})

describe('custom require hooks are installed', function () {
it('wraps modules with coverage counters when the custom require hook compiles them', function () {
/* it('wraps modules with coverage counters when the custom require hook compiles them', function () {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@novemberborn - we still have a few tests that test index.js, including this one. I believe you had a PR that fixed some of these.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea I need to go back and see what's still needed…

This provides a minor speed improvement (~10% on the AVA test suite).

It defers source-map processing, saving sourceMap data alongside instrumented sources in the cache.

SourceMapCache is only instantiated for `report` runs.
@jamestalmage
Copy link
Member Author

Just pushed another commit that squeezes out an additional 2 or 3 seconds from the AVA test suite.

It works by caching source-map information as well. Saved only once when it sees a cache miss for the source. If there is a cache hit, it does not even extract the source map (source map extraction can be expensive for large files). It does not apply source map information when exiting the process, but instead adds a contentHash key to each files coverage report. That value is used to load the source-map from disk whenever report is run.

The only downside to this is that you can no longer use just the contents of .nyc_output/*.json to generate coverage, you also need node_modules/.cache/*.map.

@jamestalmage
Copy link
Member Author

Some stats from running the AVA test suite:

- run time coverage penalty reduction in penalty
no coverage 25 sec - -
nyc master 42 sec 17 sec -
cache instrumentation (1d625e8) 32 sec 7 sec 59%
cache source maps (9b014bb) 30 sec 5 sec 29% (70% overall)
drop lodash (1343d4a) ~29 sec 4 sec 20% (76% overall)
shortcut include glob (ca01f07) 28 sec 3 sec 25% (82% overall)

Dropped AVA test suite coverage penalty from 5 seconds to 4 seconds
@jamestalmage
Copy link
Member Author

The latest commit drops lodash from index.js. I did not remove it from source-map-cache, but that is only loaded once per execution for running reports.

That dropped the coverage penalty from 5s to 4s on the AVA test suite.

lodash is a 12,000 line monster that is slow to eval. The only function we were using in index.js that is not available natively was string.endsWith

@jamestalmage
Copy link
Member Author

See: jamestalmage@f14f98f

It lazy-loads any dependency that we are not guaranteed to need in a forked process, that includes:

  • convert-source-map - not needed if everything is cached
  • rimraf && mkdirp - only needed in the top level process (required moving mkdirp execution to cleanup method, which is not the right spot for it)
  • glob - only needed for --all option
  • stripBom - only needed for addFile(), which appears only to be needed for the --all option

That appears to shave of some time. The AVA test runs go from a shaky 29s/30s to a solid 29s on every run. It breaks a bunch of tests so I am going to hold off on it until this is merged and we develop some more accurate benchmarking.

I think I am reaching the limits of using AVA's test suite and npm run test as a viable benchmarking tool. It only provides a resolution of 1 second, and at most there are only 4.5 more seconds of optimizations we can squeeze from that test suite. Also, with it's sizable dependency graph, previous benchmarks become useless with each npm update. Anybody got any clever ideas for a better way to measure the efficacy of optimizations?

@jamestalmage
Copy link
Member Author

Shaved off another second.
Instead of defaulting to a glob pattern of **, I just shortcut that manually with

if (!this.include || micromatch.any(filename, this.include)...


this.exclude = ['**/node_modules/**'].concat(config.exclude || ['test/**', 'test{,-*}.js'])
if (!Array.isArray(this.exclude)) this.exclude = [this.exclude]
this.exclude = ['**/node_modules/**'].concat(arrify(config.exclude || ['test/**', 'test{,-*}.js']))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could call this._prepGlobPatterns() here rather than reassigning this.exclude?

})
}

NYC.prototype.cleanup = function () {
if (!process.env.NYC_CWD) rimraf.sync(this.tmpDirectory())
if (!process.env.NYC_CWD) rimraf.sync(this.tempDirectory())
}

NYC.prototype._createOutputDirectory = function () {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit of a misnomer now.

@novemberborn
Copy link
Contributor

@jamestalmage

Not unless I can prove it's actually optimizing something. That's why I pushed it to a different branch and abandoned it for the time being.

Cool.

Another attempted optimization that I couldn't prove worthwhile: jamestalmage/nyc@7f80c30

Tried to use minimatch.matcher() to precompile patterns, did not notice an improvement. That one might still be worth pursuing. It falls inside our require hook (that is executed many thousands of times), and the minimatch documentation says it is better to precompile them that way. The documented optimization is for a single pattern, maybe I optimized the array of patterns incorrectly.

I'm just guessing here but that code path looks like it might be optimized quite well by V8. IIRC it also caches regex compilation.

incorporated your feedback. I am using lazy-req now. Provides the same speed benefits as lazily requiring inline, but pushes your list of dependencies to the top.

That's nicer though if the performance penalty isn't too high I'd still rather require them normally. There's now extra weirdness in using SourceMapCache.

it is probably worth you guys putting a second/third/fourth set of eyeballs on the append-transform code to see if I am missing any optimizations

Nothing jumps out at me other than istanbuljs/append-transform@73f7aba#commitcomment-15067878.


I'll try and push up some commits removing lodash usage from source-map-cache and switch the tests to use addMap().

Source maps are applied to a report that was just read from disk. Cloning is
needed for the tests but otherwise unnecessary.

Change path-related tests to look for the path property on the report, without
comparing the report contents. That's silly now that applySourceMaps() doesn't
do the cloning.

No need to copy coverage.b values when mapping branches.
Use 'report' and fileReport' to be more consistent with index.js.
It was only used by the tests, which can use addMap.
This is no longer used. This was also the last runtime dependency on lodash,
which has now been moved to the dev dependencies.
@novemberborn
Copy link
Contributor

I'll try and push up some commits removing lodash usage from source-map-cache and switch the tests to use addMap().

@jamestalmage see jamestalmage#1 :)

@jamestalmage
Copy link
Member Author

OK, I merged @novemberborn's commits simplifying source-map-cache and reverted the lazy-req stuff since lodash is only a dev-dependency now.

I think this PR is in a good place and ready for a merge.

@jamestalmage jamestalmage changed the title WIP: Caching Caching Dec 18, 2015
@sindresorhus
Copy link
Member

👍 I'm not familiar enough with the code base to be able to review this properly, but from a quick skim of the changes it looks pretty good to me. Excited about the perf improvements! Super nice work @jamestalmage :)

@jamestalmage
Copy link
Member Author

Just added a test to create cache collisions. It spawns 4 processes (all wrapped by nyc), and coordinates an attempt to require the same file at exactly the same time using process.hrtime.

I added some logging statements to our cache hook (not included in the commit), and I can verify that all 4 processes see a cache miss (the coordination worked). All 4 processes also successfully write back to the cache, no error is thrown even though they are all trying to write to the same file. It does not appear we have to worry about write collisions, and that competing fs.writeFileSync(..) calls will just overwrite each-other, and the last one "wins". Since we know each competing process is trying to write identical content, I don't think we need to coordinate any further.

@sindresorhus
Copy link
Member

It does not appear we have to worry about write collisions, and that competing fs.writeFileSync(..) calls will just overwrite each-other

That's not my experience. I don't think it locks the file unless you do fs.open, though that might have changed in more recent Node.js versions.

@jamestalmage
Copy link
Member Author

I have tried a few more iterations of my test with various changes trying to cause a problem, but I can't. Even in Node 0.10. I don't know how configstore works, but this comment makes me think the file interactions are more complex than what we are doing here:

In casual testing, though, the race condition I hit more often was caused by the read-mutate-write sequences in set and del – competing processes end up clobbering one another's changes when those steps interleave, so you get data loss. Not as dire as straight-up corrupting the file, but has a way bigger window it can happen in, so ¯_(ツ)_/¯

We don't care about the previous contents of the file, we just clobber with data we know to be good. There is no read-mutate-write sequence

I don't think it locks the file unless you do fs.open

It seems crazy to me that fs.writeFileSync would not lock the file and try to make writes as atomic as possible. Either way, it is frustrating that I can't find much information on how fs operations work at a lower level. Somebody needs to do a blog post... I guess digging into the Node source is always an option.

@sindresorhus
Copy link
Member

We don't care about the previous contents of the file, we just clobber with data we know to be good.

Good point. Guess it's a non-issue in this case then.

It seems crazy to me that fs.writeFileSync would not lock the file and try to make writes as atomic as possible.

That's just what many people have told me. No idea if it's actually accurate.

It seems crazy to me that fs.writeFileSync would not lock the file and try to make writes as atomic as possible. Either way, it is frustrating that I can't find much information on how fs operations work at a lower level. Somebody needs to do a blog post... I guess digging into the Node source is always an option.

Can you open an issue on Node.js about better docs regarding that? I would definitely like to see that too.

@bcoe
Copy link
Member

bcoe commented Dec 19, 2015

@jamestalmage I like the idea of caching for large projects that benefit from it.

It does, however, introduce quite a few more moving parts. I've learned making nyc run in as many setups and environments as possible (cough Windows) that every moving part is going to be more edge-cases for us to debug on various outlier platforms....

So, my ask for caching is that we make it a configuration option, and my temptation in 5.x.x is to default with it to false.

Would love to hear other people's thoughts.

@jamestalmage
Copy link
Member Author

large projects that benefit from it.

The difference is noticeable even with the 50 line module I am working on now.

So, my ask for caching is that we make it a configuration option, and my temptation in 5.x.x is to default with it to false.

I think your concerns are reasonable, especially before we get AppVeyor up and running. I think we should eventually turn it on by default once we see green from AppVeyor, and we have dogfooded it for a while. A feature toggle is certainly a good idea as it gives us a way to eliminate the cache implementation as the source of reported bugs.

@bcoe
Copy link
Member

bcoe commented Dec 19, 2015

@jamestalmage cool, I think we're on the same page 👍 turning this on by default just feels a fairly big change for a minor release, I'd rather let it bake as an optional setting for a bit, then bump to 6.x.x with it enabled by default.

@novemberborn
Copy link
Contributor

So, my ask for caching is that we make it a configuration option, and my temptation in 5.x.x is to default with it to false.

👍

@jamestalmage jamestalmage mentioned this pull request Dec 20, 2015
@bcoe bcoe closed this Dec 21, 2015
jamestalmage added a commit to jamestalmage/nyc that referenced this pull request Dec 22, 2015
This incorporates feedback on the istanbuljs#101 PR from @novemberborn and @sindresorhus

- rename: `createOutputDirectory()` -> `createDatastoreDirectories()`
- rename: `_addSource()` -> `_maybeInstrumentSource()`
- drop `tmpDirectory()` entirely, just use `tempDirectory()`
- make `hashCache` and `loadedMaps` instance members instead of closure references
- use `lazy-req` instead of trying to defer requires inline in the code.

Future Changes:
- I eventually revert out the `lazy-req` stuff once the `source-map-cache` no longer uses `lodash`.
jamestalmage added a commit to jamestalmage/nyc that referenced this pull request Dec 22, 2015
The codebase shifted significantly while I was working on PR istanbuljs#101.
This is a commit after rebasing to fix a few errors that surfaced, and incorporate the new changes in support of Windows.

Includes:

slight tweak to get tests passing in Windows
(cherry picked from commit 25eb6a7)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants