
Pull/Push cached environment using storage #6763

Merged: 11 commits into master, Mar 12, 2020
Conversation

@humitos (Member) commented Mar 9, 2020

Before starting to clone the repository, we check for a cached environment in the storage. If there is one, we pull it and extract it under project.doc_path (using the same structure that we have been using) and use those cached files.

Once the build has finished and was successful, we compress all the environment files (pip cache, checkout, conda and virtualenv environments) and push them to the storage.

The cache is cleaned up when the version is wiped.

This feature is behind a feature flag (CACHED_ENVIRONMENT), so we can select specific projects to start testing this cached environment. It can help projects with big/huge repositories or ones that install a lot of dependencies.
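The pull-and-extract flow described above can be sketched roughly as follows. This is an illustration only: the function name, the `storage` object (assumed to have django-storages-style `exists`/`open` methods), and the paths are assumptions for the example, not the PR's actual code.

```python
import os
import tarfile
import tempfile


def pull_cached_environment(storage, remote_path, doc_path):
    """Restore a previously cached build environment, if one exists.

    Hypothetical sketch of the flow the PR describes: check storage
    for a cached tarball, download it, and extract it under doc_path
    using the same directory structure a fresh build would create.
    """
    if not storage.exists(remote_path):
        return False  # no cache yet: clone and build from scratch

    fd, tmp_filename = tempfile.mkstemp(suffix=".tar")
    with os.fdopen(fd, "wb") as local_fd:
        with storage.open(remote_path, "rb") as remote_fd:
            local_fd.write(remote_fd.read())

    with tarfile.open(tmp_filename) as tar:
        tar.extractall(doc_path)
    os.remove(tmp_filename)
    return True
```

On a cache hit the checkout, virtualenv/conda environment, and pip cache are already in place before the clone step runs; on a miss the build proceeds from scratch.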

@humitos (Member, Author) commented Mar 9, 2020

Some notes to keep in mind while reviewing this PR:

  • I tried to put these methods under the before_vcs and build_complete signals, but the code wasn't really clean. Besides, it needed some other changes to the signals themselves, since we are not passing the required information (e.g. the build object is a dictionary instead of a Build instance)
  • The cached environment is pushed back to the storage only on successful builds, to avoid corrupted data
  • In case of corrupted cached environment, the user can just wipe the version to start from a fresh environment
  • If we want to start testing this out in production, we need to define RTD_BUILD_ENVIRONMENT_STORAGE
  • This solution should work on both Amazon S3 and Microsoft Blob Storage (it uses django-storages behind the scenes)
  • I haven't tested this with a huge repository yet
  • I'm not using compression at all to avoid using CPU cycles for this. We want to go fast here
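The "no compression" note above comes down to opening the tar archive in plain "w" mode instead of "w:gz". A minimal sketch of the push side, with illustrative names (the `storage.save` call assumes a django-storages-like API, and the subdirectory names are examples, not the exact on-disk layout):

```python
import os
import tarfile


def push_cached_environment(storage, doc_path, remote_path, tmp_filename):
    """Archive the environment uncompressed and push it to storage.

    Hypothetical sketch: mode "w" (rather than "w:gz") skips
    compression entirely, trading storage space and bandwidth
    for CPU time, which matches the "we want to go fast" note.
    """
    with tarfile.open(tmp_filename, "w") as tar:  # plain tar, no gzip
        for subdir in ("checkouts", "envs", "conda", ".cache"):  # example layout
            path = os.path.join(doc_path, subdir)
            if os.path.exists(path):
                tar.add(path, arcname=subdir)

    with open(tmp_filename, "rb") as fd:
        storage.save(remote_path, fd)
```

Because only successful builds reach this step, a half-built environment is never uploaded over a good cache.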

@humitos humitos requested review from agjohnson and a team March 9, 2020 23:32
@agjohnson (Contributor) left a comment


This approach is looking great! I noted some changes to be a bit more defensive, but it would be good to hear from someone who has touched the task operations more recently about the order of operations there. It looks okay to me, but I'm a little rusty on potential issues that could stem from tar file operations or storage calls failing.

Two review threads on readthedocs/projects/tasks.py (outdated, resolved)
@humitos humitos marked this pull request as ready for review March 10, 2020 18:49
@humitos humitos requested review from agjohnson and a team March 10, 2020 18:49
@ericholscher (Member) left a comment


This looks like the approach I was thinking would work and not be a ton of effort. This feels like enough to at least start testing, and then we can iterate and improve on it as we go. Hopefully this will at least get us at parity with our scaling builders.

local_fd.write(remote_fd.read())

with tarfile.open(tmp_filename) as tar:
    tar.extractall(self.version.project.doc_path)
Member:

I wonder if it's worth abstracting the archive format here. It looks like shutil has an abstraction over it. I imagine we'll want to play with tar and gzip at least, because I imagine bandwidth and CPU will be a tradeoff here that we want to be able to tweak.

https://docs.python.org/3/library/shutil.html#archiving-operations

@humitos (Member Author):

Hrm... Interesting! I didn't know about shutil.make_archive. I will take a deeper look.

@humitos (Member Author):

Having an abstraction is a good idea to be able to experiment here. However, it seems that shutil.make_archive does not give us much flexibility for this, since it's not very customizable: we can't change lower-level attributes.

Besides, we would need Python 3.8, because since that version it uses the modern PAX format (--xattrs equivalent):

Changed in version 3.8: The modern pax (POSIX.1-2001) format is now used instead of the legacy GNU format for archives created with format="tar".

On the other hand, tarfile allows us to play with gzip, bz2 and lzma, as well as with the compression level.
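The tradeoff being discussed can be illustrated with a small synthetic example; the file names and data below are made up for the sketch. In `tarfile` the compression scheme is part of the mode string, and `compresslevel` is accepted for the "w:gz" and "w:bz2" modes:

```python
import os
import tarfile
import tempfile

# Synthetic example: tarfile puts the compression scheme in the mode
# string and (for gzip/bz2) exposes compresslevel, a knob that
# shutil.make_archive does not let us tune.
workdir = tempfile.mkdtemp()
data = os.path.join(workdir, "data.txt")
with open(data, "w") as f:
    f.write("environment bytes " * 5000)  # highly compressible dummy data

plain = os.path.join(workdir, "env.tar")
fast = os.path.join(workdir, "env.tar.gz")

with tarfile.open(plain, "w") as tar:  # no compression: cheapest on CPU
    tar.add(data, arcname="data.txt")

with tarfile.open(fast, "w:gz", compresslevel=1) as tar:  # fastest gzip level
    tar.add(data, arcname="data.txt")

# For repetitive data, even gzip level 1 is much smaller than plain tar.
assert os.path.getsize(fast) < os.path.getsize(plain)
```

This is why keeping the archive format behind an abstraction is attractive: switching from "w" to "w:gz" (or tuning `compresslevel`) becomes a one-line change when measuring the bandwidth/CPU tradeoff.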

Review thread on readthedocs/projects/tasks.py (resolved)
@agjohnson (Contributor) left a comment


Looks great! Do we need a comparable corporate change?

@humitos (Member Author) commented Mar 12, 2020

@agjohnson there is a small PR that just defines the required settings in the docker-compose environment, and another in -ops

I don't think there are any other differences compared with community.

@humitos humitos merged commit 2091b2d into master Mar 12, 2020
@humitos humitos deleted the humitos/cache-envs-storage branch March 12, 2020 23:42