Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add rsync option to package command and improve performance #303

Open
wants to merge 8 commits into
base: master
Choose a base branch
from

Conversation

mike-potter
Copy link
Member

This is a fairly significant patch, but should not affect existing projects. It adds a "packages.rsync" option (False by default). When set to true, the package will be generated in a separate folder (packages.vmData defaults to "build/vm_data") via rsync and the compress option is enabled to generate an archive from the package, which is moved to the normal build/packages destination.

By mounting build/vm_data into the VM in your docker_composer.yml file, this allows the package to be built directly to the VM bypassing NFS to the local filesystem. Only the final archive.tgz is moved into the local file system. The vmData persists so the Build container can still perform additional tasks as needed, such as pushing the package into a remote repo. Or, the local tgz archive can be used to deploy the package.

On a current real project, this reduced the normal "grunt package" command running from the build container from a 40-minute task, down to a TWO minute task.

Given the significant performance improvement, I'm marking this PR for Milestone 1, so let's get reviews of this approach. (FYI it was modeled from some code in a "grunt deploy" task written for another project by Adam)

@mike-potter mike-potter added this to the 1.0.0 milestone Oct 5, 2016
Copy link
Contributor

@grayside grayside left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mike-potter for working on this! It's been a goal of mine for awhile. I'm in the middle of a review, it might take me a few sessions to complete.

@@ -44,15 +44,15 @@ script:
- echo "Working copy disk usage"; du -chs .
- grunt --version
- grunt help
- grunt --quiet --timer
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why remove quiet? Seems like an unrelated change. We might make the test less verbose by setting the environment variable GDT_QUIET.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the --quiet because it was super hard to debug the failing travis tests. Travis already spits out a ton of stuff and seeing what the grunt tasks actually do and if they get errors is important I think.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was --quiet suppressing errors too? In our case it's used to suppress desktop notifications which are irrelevant from inside travis.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The package command had a bug that was not completing all of its tasks, so without seeing the task output it was difficult to debug. Also hard to see the actual tasks being run through the cloud of all the other output generated in the travis log. I don't consider those desktop notifications irrelevant for travis as they aided me in debugging. And it doesn't hurt to have them, right?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both the --quiet option and GDT_QUIET env var are supposed to only affect desktop notifications (i.e. growl). If the option affects the error reporting from Grunt, we should change the option to avoid conflicts.


**rsync**: If set to true, rsync will be used instead of file copying for
improved performance. This automatically enables the `compress` option and will
write the archive to the package destination.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why enable compress by default? The main use case I've had for this requires not doing it, so curious what the thought is. Don't recall if that can be turned off, we should note that in docs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's just a default. You can still set "archive" to false if you don't want it. Just trying to support the best defaults for new projects. I want "rsync=true" to represent the default behavior for "high performance" that enables archive, vm_data, etc.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suppose a project leaves it creating an archive, and turns on rsync. What is the use case we're aiming for? Operational code backups? I'm open to calling this a best default, I'm just trying to understand what we're talking about empowering.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By creating a single archive and using rsync you get the highest performance. There is only a single file (the archive) the needs to be moved out of the VM disk and into the NFS mounted system which is hugely faster than moving the non-archived directory.

What they do with the archive is project specific. I expect we'll eventually add a "deploy" command that does all of the git commit/push work from inside grunt, but for now the archived package can be moved to a hosting site, or uncompressed locally and committed to a production repo.

Basically, we are just trying to minimize the total data flow to NFS along with the individual number of files (since each file has NFS overhead, so lots of small files are really bad performance).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha, writing the tarball being less I/O, offset by the time to create the tarball. Have you noted the time to create the tarball is less than syncing the files individually? I expect so.

package when using the `rsync` option. By selecting a volume mounted in your
VM `/data` area, this can significantly improve performance. The archive is
still moved to the normal package destination when complete. Defaults to
`build/vm_data`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GDT is not targeted at virtual/containerized environments. This piece needs a bit more contextual documentation and to remove /data as an explicit concept.

Seems strange to place it in build/vm_data, why not build/package/tmp or even get a formal /tmp directory generated via os.tmpdir()?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the super important part of this PR. Without the virtual intermediate directory there was no way to get anything performant and people are bypassing gdt on their projects because of this. I know GDT is not "targetting" at containers, which is why this option isn't enabled by default.

The idea here is that in your docker container you mount a volume that gdt can use for various purposes to improve performance. Only tackling the packaging right now, but I can imagine other commands using this also. So don't want it to be within "build/package". The /data is just an example that pertains to our own container, but that isn't anywhere in the GDT except the docs as an example. Since our docs already reference many P2 practices, I don't see this as a problem right now (see the CI docs for example).

Still, specific updates to this doc file are welcome if you want to wordsmith it a bit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to make sure all config options and documentation are not overly bound to containers or virtual environments specifically, even if the changes we have facilitate that.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make this less specific to Docker and the /data path set up by devtools, could this be recast as a temporary directory path (i.e. tempDir rather than vmData)? We can note in the documentation that the purpose for this path is to allow using a path that performs quickly if the project directory lives in a slow performing location (i.e. NFS).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I think calling it a tempDir option would be fine. I'll make that change to the code and docs. Thanks for the suggestion.


## Performance

When running `grunt package` within a docker container, the default file
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we focus more on NFS mounts and less on Docker? I'm fine with Docker in the docs, but it should be almost more like a sidebar detail. It's great that we're looking into accelerating this for use with NFS mounts, it would be great if we could identify similar tricks for Drush Make and Composer (build in an intermediate directory, transfer to the volume mounted space where speedup can better affect the complete codebase.

The major issue that originally drove me to look into rsync and other alternatives to copy was more about size of projects - a built up D8 project or a vanilla Atrium project will break Node under default Node.js settings rather than complete the operation. I'd like to note this -- it's going to be the more common problem than NFS outside Phase2 usage.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I covered this above, but the whole point of this PR was performance. Switching from copy to rsync gives a small performance improvement. The majority of the improvement comes from using the vm_data.

Yes, it's really about NFS and not docker, so the docs can be wordsmithed a bit.

And yes, this vm_data is definitely something we want to incorporate into other areas of gdt. The package command was just the big target because it was taking 40 minutes on a project.

var archive = grunt.config('config.packages.archive') || useRsync;

// When using rsync, generate to a path that can be mounted in the VM.
var vmPath = grunt.config.get('config.packages.vmData') || 'build/vm_data';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per earlier comments, I'd like a more contextual, less tech-specific name for this -- intermediary?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Specific suggestions are welcome.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per other thread, we're calling it tempDir. Just closing the loop here :) Certainly shorter than intermediary.


var excludePaths = ['bower_components', 'node_modules', '.gitkeep'];

if (useRsync) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As previously explored with https://github.com/phase2/grunt-drupal-tasks/blob/master/tasks/make.js#L71, we need a fallback for Windows which so far has been simply "Got Windows? Okay, we'll still use copy!" Because the configuration for copy and rsync are not fully compatible, I've been thinking of rsync not as a mode, but rsync to copy fallback as a breaking change so we can revise configuration requirements. Open to discussion on this point.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code I added makes them compatible. You still use srcFiles, projFiles, and the new excludePaths. The code builds the proper arguments for either copy or rsync. So yes, Windows cannot use the "rsync=true" option.

If there is a simple way to detect if rsync is installed I'd be happy to use that to set the default and have it fallback.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a canRsync utility function used by the code I linked to. I'd like that to be checked as part of the condition whether we are using Rsync. We can make that function smarter to detect if rsync is installed in a follow-up.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, I didn't see the canRsync. I'll add that.

@mike-potter
Copy link
Member Author

FYI: Travis tests are failing because of #304 and not because of anything in this PR.

Copy link
Contributor

@grayside grayside left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Decided to call it there on a review of the changeset, I think I'm going to need to look at the modified version of package.js directly to make sure I'm not misremembering anything.

if (item.substr(0, 1) === '!') {
item = item.slice(1);
item = item.replace('**', '*');
excludePaths.push(item);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really smart, however, the filesystem matching by the Node minimatch library has more breadth of syntax than is typically supported by Bash/Rsync/etc beyond just negation, such as fancy globbing and simple expressions. That's the incompatibility I alluded to earlier. I think we need to confirm that failure is reasonably graceful as a minimum.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but I think we can address that in the documentation. IMHO none of our config options should be tied to a specific implementation (node, rsync, etc). So we should specify how wildcards are handled and then convert to the needed format for either Copy or Rsync.

Right now the only issue is ** vs * and rsync seems to take the ** and still function, so I didn't need to change that.

If people want fancy expressions then we should use some form of RegEx that would be less implementation-specific. But that's work for later. For now if somebody wants to use a fancy Node minimatch expression then they just can't use the new rsync option.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Confirm using a minimatch expression results in rsync failing in a reasonably informative and quick manner.
  • Release notes and documentation detail the new restriction to stick with Bash-compatible file matching patterns with rsync mode.


if (destPath !== finalPath) {
grunt.config('shell.mvArchive', {
command: 'mv ' + destPath + '.tgz ' + finalPath + '.tgz'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of note, mv will not work across NFS volume mounts. We will want to document that limitation a bit to help inform how projects are configured.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So that's what I thought too. But it seems to work across docker volumes. Maybe we should change it from a "mv" to a normal grunt copy. Then delete the tarball from the destPath.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, maybe that restriction is only outside the container. mv is an order of magnitude faster than copy, so if it works I think we should keep it until we have reason to switch to copy.

A bigger issue is windows compat. For portability we should just switch to an fs.rename[Sync].

tasks.push('shell:rsync');
} else {
tasks.push('copy:source');
}
tasks.push('copy:package');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copy:package looks like it's been stripped down to some final copy operations of random project files. Maybe this should be renamed for clarity? copy:package-extras?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I even thought of "copy:projFiles" to be really specific. If we call it "package-extras" then having the config be "projFiles" might be confusing too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems fine. This is a stepping stone to #255, otherwise I'd suggest we drop projFiles and rename the config for clarity.

// Setup default rsync command and options.
var rsync = 'rsync -ahWL --no-l --stats --chmod=Du+rwx ';
for (var pathIndex = 0; pathIndex < excludePaths.length; pathIndex++) {
rsync = rsync + "--exclude " + excludePaths[pathIndex] + ' ';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like only the final excluded file would be included here. Shouldn't we remove the loop and replace with an array join on whitespace?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is adding "--exclude path" to rsync for each item in the array. So maybe a join with "--exclude " would work? Otherwise this loop does actually work fine.

srcFiles.unshift('**');
// Add any additional excludePaths to srcFiles.
for (var i = 0; i < excludePaths.length; i++) {
srcFiles.push('!**/' + excludePaths[i]);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Array merge seems more clean here. Also, something to dedup might be worthwhile to make sure any code that wants to remove exclusions only has to find a single instance of a mask.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I originally looped through excludePaths to add the '!**/' and then merged the array with srcFiles but this seemed cleaner. So I guess I'd welcome a code alternative to review.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like:

var _ = require('lodash'); at the top.

srcFiles = _.union(srcFiles, _.map(excludePaths, function (item) {
  return '!**/' + item;
}));

Union both merges and unique-ifies.

Looks like we should check on updating lodash.

@arithmetric
Copy link
Collaborator

@mike-potter Could we target this for a 1.1.0 milestone? I think the current master version is in good shape for a rc1 release and a stable one after a bit of testing.

@mike-potter mike-potter modified the milestones: 1.1.0, 1.0.0 Oct 14, 2016
@mike-potter
Copy link
Member Author

Sure, we can do 1.1.0. I agree that we need to get 1.0.0 released so people can start using the composer-based support. And adding the "tempdir" option probably opens up other issues we'd want to work on for improving performance elsewhere in GDT that could benefit.

A sample docker-composer.yml entry to mount the vm_data looks like this:
```
volumes:
- /data/PROJECTNAME/package:/var/www/build/vm_data
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've picked up this branch and am trying to pull it along. As part of this I'm reviewing and revising the docs to clarify my game plan. I'm confused why we would volume mount the intermediary directory. Wouldn't that slow down the rsync operation compared to leaving the tempDir somewhere in the container but not in a volume mount where NFS slowdowns can catch it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The /data volume is just in the VM, not on the local Mac NFS system. This simply makes the stuff in /var/www/build/vm_data persistent, similar to what we do with the MySQL database. Volumes are only slow if they map the local Mac folders

@grayside grayside removed this from the 1.1.0 milestone Jan 8, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants