
Pull/Push cached environment using storage #6763

Merged: 11 commits into master, Mar 12, 2020
Conversation

@humitos (Member) commented Mar 9, 2020

Before starting to clone the repository, we check for a cached environment in the storage. If there is one, we pull it and extract it under project.doc_path (using the same structure that we have been using) and use those cached files.

Once the build has finished and was successful, we compress all the environment files (pip cache, checkout, conda and virtualenv environments) and push them to the storage.

The cache is cleaned up when the version is wiped.

This feature is behind a feature flag (CACHED_ENVIRONMENT), so we can select specific projects to start testing this cached environment. It can help projects with big/huge repositories or ones that install a lot of dependencies.
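The pull-and-extract flow described above can be sketched roughly as follows. This is an illustration only: the function name, the `storage` object (assumed to have django-storages-style `exists`/`open` methods), and the paths are assumptions for the example, not the PR's actual code.

```python
import os
import tarfile
import tempfile


def pull_cached_environment(storage, remote_path, doc_path):
    """Restore a previously cached build environment, if one exists.

    Hypothetical sketch of the flow the PR describes: check storage
    for a cached tarball, download it, and extract it under doc_path
    using the same directory structure a fresh build would create.
    """
    if not storage.exists(remote_path):
        return False  # no cache yet: clone and build from scratch

    fd, tmp_filename = tempfile.mkstemp(suffix=".tar")
    with os.fdopen(fd, "wb") as local_fd:
        with storage.open(remote_path, "rb") as remote_fd:
            local_fd.write(remote_fd.read())

    with tarfile.open(tmp_filename) as tar:
        tar.extractall(doc_path)
    os.remove(tmp_filename)
    return True
```

On a cache hit the checkout, virtualenv/conda environment, and pip cache are already in place before the clone step runs; on a miss the build proceeds from scratch.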

@humitos (Member, Author) commented Mar 9, 2020

Some notes to keep in mind while reviewing this PR:

  • I tried to put these methods under the before_vcs and build_complete signals, but the code wasn't really clean. Besides, it needed some other changes to the signals themselves, since we are not passing the required information (e.g. the build object is a dictionary instead of a Build instance)
  • The cached environment is pushed back to the storage only on successful builds, to avoid corrupted data
  • In case of corrupted cached environment, the user can just wipe the version to start from a fresh environment
  • If we want to start testing this out in production, we need to define RTD_BUILD_ENVIRONMENT_STORAGE
  • This solution should work on both Amazon S3 and Microsoft Blob Storage (it uses django-storages behind the scenes)
  • I haven't tested this with a huge repository yet
  • I'm not using compression at all to avoid using CPU cycles for this. We want to go fast here
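The "no compression" note above comes down to opening the tar archive in plain "w" mode instead of "w:gz". A minimal sketch of the push side, with illustrative names (the `storage.save` call assumes a django-storages-like API, and the subdirectory names are examples, not the exact on-disk layout):

```python
import os
import tarfile


def push_cached_environment(storage, doc_path, remote_path, tmp_filename):
    """Archive the environment uncompressed and push it to storage.

    Hypothetical sketch: mode "w" (rather than "w:gz") skips
    compression entirely, trading storage space and bandwidth
    for CPU time, which matches the "we want to go fast" note.
    """
    with tarfile.open(tmp_filename, "w") as tar:  # plain tar, no gzip
        for subdir in ("checkouts", "envs", "conda", ".cache"):  # example layout
            path = os.path.join(doc_path, subdir)
            if os.path.exists(path):
                tar.add(path, arcname=subdir)

    with open(tmp_filename, "rb") as fd:
        storage.save(remote_path, fd)
```

Because only successful builds reach this step, a half-built environment is never uploaded over a good cache.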

@humitos humitos requested review from agjohnson and a team March 9, 2020 23:32
@agjohnson (Contributor) left a comment


This approach is looking great! I noted some changes to be a bit more defensive, but it would be good to hear from someone who has touched the task operations more recently about the order of operations there. It looks okay to me, but I'm a little rusty on potential issues that could stem from tar file operations or storage calls failing.

Two review threads on readthedocs/projects/tasks.py (outdated, resolved)
@humitos humitos marked this pull request as ready for review March 10, 2020 18:49
@humitos humitos requested review from agjohnson and a team March 10, 2020 18:49
@ericholscher (Member) left a comment


This looks like the approach I was thinking would work and not be a ton of effort. This feels like enough to at least start testing, and then we can iterate and improve on it as we go. Hopefully this will at least get us at parity with our scaling builders.

local_fd.write(remote_fd.read())

with tarfile.open(tmp_filename) as tar:
    tar.extractall(self.version.project.doc_path)
Member:

I wonder if it's worth abstracting the archive format here. It looks like shutil has an abstraction over it. I imagine we'll want to play with tar and gzip at least, because I imagine bandwidth and CPU will be a tradeoff here that we want to be able to tweak.

https://docs.python.org/3/library/shutil.html#archiving-operations

@humitos (Member Author):

Hrm... Interesting! I didn't know about shutil.make_archive. I will take a deeper look.

@humitos (Member Author):

Having an abstraction is a good idea to be able to experiment here. However, it seems that shutil.make_archive does not give us much flexibility for this, since it's not very customizable: we can't change lower-level attributes.

Besides, we would need Python 3.8, because since that version it uses the modern PAX format (--xattrs equivalent):

Changed in version 3.8: The modern pax (POSIX.1-2001) format is now used instead of the legacy GNU format for archives created with format="tar".

On the other hand, tarfile allows us to play with gzip, bz2 and lzma, as well as with the compression level.
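The tradeoff being discussed can be illustrated with a small synthetic example; the file names and data below are made up for the sketch. In `tarfile` the compression scheme is part of the mode string, and `compresslevel` is accepted for the "w:gz" and "w:bz2" modes:

```python
import os
import tarfile
import tempfile

# Synthetic example: tarfile puts the compression scheme in the mode
# string and (for gzip/bz2) exposes compresslevel, a knob that
# shutil.make_archive does not let us tune.
workdir = tempfile.mkdtemp()
data = os.path.join(workdir, "data.txt")
with open(data, "w") as f:
    f.write("environment bytes " * 5000)  # highly compressible dummy data

plain = os.path.join(workdir, "env.tar")
fast = os.path.join(workdir, "env.tar.gz")

with tarfile.open(plain, "w") as tar:  # no compression: cheapest on CPU
    tar.add(data, arcname="data.txt")

with tarfile.open(fast, "w:gz", compresslevel=1) as tar:  # fastest gzip level
    tar.add(data, arcname="data.txt")

# For repetitive data, even gzip level 1 is much smaller than plain tar.
assert os.path.getsize(fast) < os.path.getsize(plain)
```

This is why keeping the archive format behind an abstraction is attractive: switching from "w" to "w:gz" (or tuning `compresslevel`) becomes a one-line change when measuring the bandwidth/CPU tradeoff.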

Review thread on readthedocs/projects/tasks.py (resolved)
@agjohnson (Contributor) left a comment


Looks great! Do we need a comparable corporate change?

@humitos (Member Author) commented Mar 12, 2020

@agjohnson there is a small PR that just defines the required settings in the docker-compose environment, and another in -ops

I don't think there are any other differences compared with community.

@humitos humitos merged commit 2091b2d into master Mar 12, 2020
@humitos humitos deleted the humitos/cache-envs-storage branch March 12, 2020 23:42