forked from titan-data/titan

remote workflows (#46)
Eric Schrock authored and mcred committed Oct 29, 2019
1 parent f9cadf4 commit dafd4d2
Showing 7 changed files with 153 additions and 6 deletions.
7 changes: 7 additions & 0 deletions docs/src/local/commit.rst
Original file line number Diff line number Diff line change
@@ -75,5 +75,12 @@ by running ``titan checkout``::
Here you can see that we stopped the container, swapped out the data, and
started it again. And with that, we're back to the original commit we created.

.. warning::

   Titan's infrastructure has not yet been built for scale. While it should
   work fine for dozens of commits, creating hundreds or thousands of commits
   or repositories may have adverse effects on the system. This will be
   addressed in a future release.

For information on additional local workflows, see the
:ref:`local` section.
18 changes: 17 additions & 1 deletion docs/src/remote/addremove.rst
@@ -3,4 +3,20 @@
Adding and Removing Remotes
===========================

Coming Soon!
Each repository can have zero or more remotes configured. To add a remote,
use :ref:`cli_cmd_remote_add`::

    $ titan remote add s3://bucket/path myrepo

Remotes are specified as URIs, with the first portion defining the provider
(s3 in the above case), and the rest being specific to that provider. By
default, the remote is named `origin`, but you can also assign names to
remotes (required when you have more than one remote).
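
The provider/remainder split described above can be sketched in a few lines
of Python. This is only an illustration, not Titan's own parsing code; the
function name ``split_remote_uri`` is hypothetical.

```python
def split_remote_uri(uri):
    """Split a remote URI into (provider, provider-specific remainder)."""
    provider, separator, rest = uri.partition("://")
    if not separator or not provider:
        raise ValueError(f"remote URI has no provider prefix: {uri!r}")
    # The remainder is opaque here; each provider interprets it on its own.
    return provider.lower(), rest


provider, rest = split_remote_uri("s3://bucket/path")
print(provider)  # s3
print(rest)      # bucket/path
```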

To get a list of remotes, use :ref:`cli_cmd_remote_ls`::

    $ titan remote ls hello-world
    REMOTE                PROVIDER
    origin                s3

Remotes can be removed with the :ref:`cli_cmd_remote_rm` command.
21 changes: 20 additions & 1 deletion docs/src/remote/clone.rst
@@ -3,4 +3,23 @@
Cloning Repositories
====================

Coming soon!
The :ref:`cli_cmd_clone` command will create a new repository using the
configuration from a remote. It is equivalent to creating a new repository with
an identical configuration, adding the remote, and pulling down the latest
commit::

    $ titan clone s3://titan-data-demo/hello-world/postgres hello-world

The Docker configuration is persisted with each commit, so the local repository
uses the configuration recorded in the latest commit.

.. note::

   There is currently no way to override the Docker configuration, for
   example to use a different port or network configuration. This
   capability will be added in a future release.

.. note::

   The clone command currently always uses the latest commit. The ability to
   select a specific commit to use will be added in a future release.
30 changes: 29 additions & 1 deletion docs/src/remote/provider/s3.rst
@@ -3,4 +3,32 @@
S3 Provider
===========

Coming Soon!
The S3 provider stores commits remotely in an S3 bucket. Each commit
is stored as a tar archive, with the commit metadata attached as object
metadata. The URI format is::

    s3://<bucket>/<key>

Commits will be created at ``<key>/<commitid>/<archive>.tar.gz``. The commit
metadata will be stored at the ``<commitid>`` level.
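
As a rough illustration of this layout, the following Python sketch splits an
S3 remote URI into bucket and key and builds the object key for a commit
archive. The helper names, the commit ID ``abc123``, and the archive name
``data`` are all hypothetical; the real archive names are chosen by Titan.

```python
def parse_s3_uri(uri):
    """Split s3://<bucket>/<key> into (bucket, key)."""
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an S3 remote URI: {uri!r}")
    bucket, _, key = uri[len(prefix):].partition("/")
    if not bucket:
        raise ValueError(f"missing bucket in {uri!r}")
    return bucket, key.strip("/")


def commit_archive_key(key, commit_id, archive):
    """Build <key>/<commitid>/<archive>.tar.gz, skipping an empty key."""
    parts = [part for part in (key, commit_id, f"{archive}.tar.gz") if part]
    return "/".join(parts)


bucket, key = parse_s3_uri("s3://titan-data-demo/hello-world/postgres")
# "abc123" and "data" are made-up values used only for illustration.
print(bucket)
print(commit_archive_key(key, "abc123", "data"))
```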

AWS credentials are resolved through the default AWS credential chain at
the time of the push or pull operation, so you must either set the
``AWS_*`` environment variables or use your ``~/.aws`` configuration.
Because the S3 provider uses the standard AWS SDK, all variations of credentials
should be supported, including specifying a profile with ``AWS_PROFILE``.
To pull a commit, you will need ``s3:GetObject`` permissions. To push a commit,
you will need ``s3:PutObject`` permissions.

.. note::

   The S3 provider doesn't currently support MFA (multi-factor authentication).
   If you have a ``session_token`` in your AWS config, then operations will
   fail with an error message indicating the access key could not be found.

The S3 provider relies on basic AWS APIs to implement its functionality, and
as such has limited scalability. For example, finding the latest commit requires
listing all objects, getting metadata iteratively for each one, and comparing
the result. It should only be used for storing relatively small numbers of
commits. Improving this will require a new provider that includes a robust
metadata layer on top of the base S3 functionality.
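
To make the cost concrete, here is a minimal Python sketch of the lookup
pattern described above: list every commit, fetch metadata for each one, and
keep the newest. It runs against an in-memory dictionary rather than S3, and
the ``timestamp`` metadata field and the two callback names are assumptions
made for the example, not Titan's actual schema or API.

```python
from datetime import datetime


def latest_commit(list_commit_ids, get_metadata):
    """Return the ID of the newest commit, or None if there are no commits.

    Needs one metadata request per commit, so the cost grows linearly
    with the number of commits stored in the remote.
    """
    latest_id, latest_time = None, None
    for commit_id in list_commit_ids():
        metadata = get_metadata(commit_id)  # one round trip per commit
        created = datetime.fromisoformat(metadata["timestamp"])
        if latest_time is None or created > latest_time:
            latest_id, latest_time = commit_id, created
    return latest_id


# In-memory stand-in for the remote; IDs and timestamps are made up.
store = {
    "a1": {"timestamp": "2019-10-01T12:00:00"},
    "b2": {"timestamp": "2019-10-29T09:30:00"},
    "c3": {"timestamp": "2019-10-15T18:45:00"},
}
requests = []


def get_metadata(commit_id):
    requests.append(commit_id)
    return store[commit_id]


print(latest_commit(lambda: list(store), get_metadata))  # b2
print(len(requests))  # 3
```

A provider with a dedicated metadata layer could answer the same question
with a single indexed query instead of one request per commit.
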
27 changes: 26 additions & 1 deletion docs/src/remote/provider/ssh.rst
@@ -3,4 +3,29 @@
SSH Provider
============

Coming Soon!
The SSH provider enables commits to be stored on any server where the user
has remote access over SSH. The URI syntax is::

    ssh://user[:password]@host/path

The ``path`` is interpreted as an absolute path unless it starts with ``~``.
The SSH provider uses rsync to copy files to subdirectories within the path,
with metadata being stored in a ``metadata.json`` file. This means that pushes
are always full sends, as titan is sending data to a newly created directory.
Pulls, on the other hand, may not need to transfer all data depending on what
state exists locally.
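
The URI syntax can be illustrated with Python's standard ``urllib.parse``
module. This is a sketch under assumptions rather than Titan's actual
parser: the function name is hypothetical, and it assumes that a
home-relative path is written as ``ssh://user@host/~/dir``.

```python
from urllib.parse import urlsplit


def parse_ssh_uri(uri):
    """Split ssh://user[:password]@host/path into its components."""
    parts = urlsplit(uri)
    if parts.scheme != "ssh" or not parts.username or not parts.hostname:
        raise ValueError(f"not a valid SSH remote URI: {uri!r}")
    path = parts.path
    # Assumption: "/~/dir" means "~/dir" (relative to the user's home);
    # every other path is kept as an absolute path.
    if path.startswith("/~"):
        path = path[1:]
    return {
        "user": parts.username,
        "password": parts.password,  # None when omitted from the URI
        "host": parts.hostname,
        "path": path,
    }


print(parse_ssh_uri("ssh://alice:secret@example.com/var/titan"))
print(parse_ssh_uri("ssh://alice@example.com/~/titan"))
```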

The system must have ``sudo`` installed and the user must have ``sudo``
privileges for running rsync. This enables file ownership and permissions to be
set properly.

If ``password`` is not specified, then the user will be prompted for a password
at the time they do the push or pull operation. Future enhancements will
include the ability to specify an SSH key file instead of using passwords.

Like the S3 provider, the SSH provider has inherent scalability limitations. For
example, finding the latest commit requires listing all commits in the path,
reading the metadata file for each, and comparing the result. It should only be
used for storing relatively small numbers of commits. Improving this will
require a new provider that includes a robust metadata layer on top of the base
SSH functionality.
24 changes: 23 additions & 1 deletion docs/src/remote/pushpull.rst
@@ -3,4 +3,26 @@
Pushing and Pulling
===================

Coming Soon!
The :ref:`cli_cmd_push` and :ref:`cli_cmd_pull` commands form the basis of
sharing data via remote repositories. Unlike git, however, they transfer
only a single commit to or from the remote repository. There is no notion
of pulling "all commits" and then checking out one of them.

Exactly how each provider transfers data varies. Some, like S3, only do full
transfers of data as a single archive. Others, like SSH, use rsync and so may
transfer only the data that has changed.

Each push and pull runs asynchronously in the context of the titan container,
but progress is streamed to the command line while it's being run. In rare
cases, it's possible to exit the CLI while the operation is ongoing. In this
case, you may get a message that an operation is in progress. You can either
wait for it to complete, or abort it with :ref:`cli_cmd_abort`.

While the CLI does not provide full-fledged management of remotes (something
specific to each remote), you can get a list of remote commits using the
:ref:`cli_cmd_remote_log` command.

.. note::

   Titan doesn't currently retry after network errors or other interruptions.
   This capability will be added in a future release.
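
Until then, a caller who scripts pushes and pulls can add retries on the
outside. The Python sketch below shows one generic way to do so with
exponential backoff; it is not part of Titan, and ``retry`` and its
parameters are hypothetical names. The ``operation`` argument can be any
callable, for example one that invokes the titan CLI through ``subprocess``.

```python
import time


def retry(operation, attempts=3, delay=1.0, backoff=2.0, sleep=time.sleep):
    """Call operation() until it succeeds or the attempts are used up."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception:  # real code should catch narrower error types
            if attempt == attempts:
                raise
            sleep(delay)
            delay *= backoff  # wait longer before each further attempt


# Demonstration with a stand-in operation that fails twice, then succeeds.
outcomes = iter([OSError("network down"), OSError("timeout"), "pushed"])


def flaky_push():
    outcome = next(outcomes)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome


print(retry(flaky_push, sleep=lambda seconds: None))  # pushed
```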
32 changes: 31 additions & 1 deletion docs/src/remote/remote.rst
@@ -3,7 +3,37 @@
Remote Repositories
===================

Coming Soon!
While managing data locally on your laptop is all well and good, part of the
power of source code management is the ability to share that data with
others. Much like git, Titan has the notion of `remote repositories` that
act as an endpoint for push and pull.

There are a few important general things to be aware of:

* A Titan commit does not have a strict dependency on the previous commit from
  which it was created. Because Titan commits are much larger than typical
  source code commits, we allow them to be pushed and pulled independently.
  For this reason, :ref:`cli_cmd_clone` and :ref:`cli_cmd_pull` will not pull
  down `all commits`, only the one specified by the user.
* Titan does not support the notion of merging. While concepts like tagging
  and branching will be added over time, generically merging data at the
  on-disk level is not possible.
* Different remote providers have different performance characteristics,
  including whether they support incremental transfers. Some will
  always do a full data transfer, while others have a means to identify
  only changed blocks. Titan is designed to work with small
  datasets (<10GB); using it for anything larger may have adverse
  effects on the system.

.. warning::

   Titan currently ships with two very basic providers, the :ref:`remote_provider_s3`
   and the :ref:`remote_provider_ssh`. These are only introductory providers, designed
   to have zero dependencies on external software. As such, they
   will face challenges across security, performance, and robustness when
   operated at scale in an enterprise setting. As Titan matures, we will be
   working with the community and partners to help develop remote providers
   with more robust capabilities.

.. toctree::
   :maxdepth: 1