JRFC 33 - Repositories #33

Open
jbenet opened this issue Jan 12, 2015 · 1 comment

jbenet commented Jan 12, 2015

This document attempts to specify a general format for repositories
(the git and ipfs kind), in the hope of arriving at a generalized set of good
practices. I am new to many of the intricacies and edge cases, so please suggest
important additions.


Many tools and systems create data repositories with configuration files. The
classic example is git and other VCS tools, but many other systems do the same.
Application changes will necessarily bring about changes to the format of the
repository (e.g. changing how data is stored, or changing the data itself).
These changes must NEVER cause data loss for users, and great care must be taken
to ensure that all format changes are accompanied by migration tools.

As applications grow, different types of storage media or execution strategies
may optimize for different use cases, e.g. "flat files inside .git for git cli"
vs "git repo inside database for fast web server access". No matter the use
case, application implementations should be able to operate with different
concrete versions of the repository, provided suitable adapters exist. This
separation reduces the cost of writing both new storage implementations and new
application implementations.

Terms:

  • repo - a repository, a structured collection of objects, with a
    configuration, e.g. a git repo or an ipfs repo.
  • config - a repository configuration which holds repository options
  • database - a database which holds the repository data. this may be
    a key value store (leveldb), a collection of flat files (.git/objects), a
    relational db (SQLite), etc.
  • address - an identifier of the location of the repository, e.g.
    /Users/jbenet/foo/bar/.git, https://github.com/jbenet/go-ipfs.
  • format - the way in which the data is organized
  • repo version - a number identifying the repo's format. It is easiest if
    these are monotonically increasing integers.
  • concrete repo - the actual repo as stored in storage media. (e.g. posix
    files inside .git/, files and a leveldb, s3, ...)
  • virtual repo - a virtual object which can be manipulated. The distinction
    between concrete and virtual is here so that tools may be written mostly
    to operate on the virtual repo, and remain compatible with a variety of
    repo implementations, through adapters.
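
To make the concrete/virtual distinction more tangible, here is a minimal sketch. The adapter interface (`get`, `put`, `keys`) and both class names are assumptions made up for this example; they are not part of git or ipfs. The point is only that `countObjects` is written once, against the virtual repo, and works with either concrete repo.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Hypothetical adapter interface: get(key), put(key, value), keys().

// Concrete repo kept entirely in memory.
class MemoryAdapter {
  constructor() { this.data = new Map(); }
  get(key) { return this.data.get(key); }
  put(key, value) { this.data.set(key, value); }
  keys() { return [...this.data.keys()].sort(); }
}

// Concrete repo stored as one flat file per key inside a directory.
// Keys must be plain file names (no path separators) in this sketch.
class FlatFileAdapter {
  constructor(dir) {
    this.dir = dir;
    fs.mkdirSync(dir, { recursive: true });
  }
  get(key) {
    const file = path.join(this.dir, key);
    return fs.existsSync(file) ? fs.readFileSync(file, "utf8") : undefined;
  }
  put(key, value) { fs.writeFileSync(path.join(this.dir, key), value); }
  keys() { return fs.readdirSync(this.dir).sort(); }
}

// A tool written against the virtual repo: it only sees the adapter interface.
function countObjects(repo) {
  return repo.keys().length;
}

const memoryRepo = new MemoryAdapter();
const fileRepo = new FlatFileAdapter(fs.mkdtempSync(path.join(os.tmpdir(), "repo-")));
for (const repo of [memoryRepo, fileRepo]) {
  repo.put("a", "first object");
  repo.put("b", "second object");
}
console.log(countObjects(memoryRepo), countObjects(fileRepo));
```

A real adapter would also need to cover deletion, iteration over large stores, and error handling, all of which are left out here.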

Notes

  • repo version MUST be included, and remain readable by all tools
    attempting to modify the repo (e.g. migration tools from any version must
    be able to determine the current version of the repo). Example:
    .go-ipfs/version
  • config and database may both be implemented by the same storage system,
    but it is recommended they are separate, as one might define the other.
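
As an illustration of the first note, a version check could look roughly like the sketch below. It assumes the version is kept in a plain-text file named `version` at the repo root (as in the .go-ipfs/version example) containing a single integer; the function names and the exact file contents are assumptions, not a description of what go-ipfs does.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Assumption: the version lives in a plain-text file named "version" at the
// repo root and holds a single non-negative integer.
function readRepoVersion(repoPath) {
  const raw = fs.readFileSync(path.join(repoPath, "version"), "utf8").trim();
  if (!/^[0-9]+$/.test(raw)) {
    throw new Error(`unreadable repo version: ${JSON.stringify(raw)}`);
  }
  return Number(raw);
}

function writeRepoVersion(repoPath, version) {
  fs.writeFileSync(path.join(repoPath, "version"), `${version}\n`);
}

// Demo against a throwaway directory.
const demoRepo = fs.mkdtempSync(path.join(os.tmpdir(), "repo-"));
writeRepoVersion(demoRepo, 3);
console.log("repository version:", readRepoVersion(demoRepo));
```

Keeping the file this simple is what lets a migration tool from any version read it without knowing anything else about the format.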

Synchronization

Operations on a repo may require synchronization (some repos may support
concurrent modifications, and others require complete mutual exclusion). Repos
which require mutual exclusion must support mechanisms to achieve it (e.g.
.git/index.lock). These may be granular or coarse, but repo formats must define
synchronization, so various implementations can ensure safe, concurrent access.
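
One coarse way to get mutual exclusion on a filesystem-backed repo is an exclusively created lock file, in the spirit of .git/index.lock. The sketch below assumes a repo-wide lock file named `repo.lock`; that name and the `acquireLock` helper are invented for the example, and stale locks left behind by crashed processes are not handled.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Assumption: a coarse, repo-wide lock represented by a file named "repo.lock".
// The "wx" flag makes the open fail with EEXIST if the file already exists,
// so only one caller can create the lock file.
function acquireLock(repoPath) {
  const lockPath = path.join(repoPath, "repo.lock");
  try {
    const fd = fs.openSync(lockPath, "wx");
    fs.writeSync(fd, String(process.pid)); // record the holder for debugging
    fs.closeSync(fd);
  } catch (err) {
    if (err.code === "EEXIST") return null; // someone else holds the lock
    throw err;
  }
  // The returned function releases the lock.
  return () => fs.unlinkSync(lockPath);
}

const lockedRepo = fs.mkdtempSync(path.join(os.tmpdir(), "repo-"));
const release = acquireLock(lockedRepo);
console.log("second attempt while locked:", acquireLock(lockedRepo));
release();
```

A caller that gets `null` back can retry later or report that the repo is in use.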

Migrations

Through the lifetime of an application, repo formats may require
changes. These changes must be accompanied by a "migration tool", which converts
the data from the most recent format version to the new one. Ideally the
migration can be applied in both directions (old <-> new). For example, one
may end up with a set of "repo version migration" tools like the following:

> ls ipfs/bin/repo-migrations
1-to-2
2-to-3
3-to-4
4-to-5
5-to-6
6-to-7

> ipfs/bin/repo-migrations/1-to-2
repository version: 3
already up to date.

> ipfs/bin/repo-migrations/3-to-4
repository version: 3
applying patch: 3-to-4
repository version: 4

> ipfs/bin/repo-migrations/3-to-4 --revert
repository version: 4
applying patch: 4-to-3
repository version: 3

> ipfs/bin/repo-migrations/run 1-to-7
repository version: 3
applying patch: 3-to-4
applying patch: 4-to-5
applying patch: 5-to-6
applying patch: 6-to-7
repository version: 7

It is advised that repo migration tools be virtual repo tools (that is, implemented
to work with the logical repo instead of the concrete data). This makes it possible
to reuse migration tools across repo implementations (with proper adapters).
This may not always be possible; repo-format-specific migration tools might
be necessary.
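
The chained `run` behaviour from the transcript above could be sketched as follows. This is only an illustration under stated assumptions: the repo is a plain in-memory object standing in for a virtual repo, and the `migrations` table, the field names, and the `migrate` function are all made up for the example.

```javascript
// Each migration knows how to move one version up (apply) or down (revert).
// The repo is a plain in-memory object standing in for a virtual repo;
// the data fields are invented for this example.
const migrations = [
  {
    from: 1, to: 2,
    apply: (repo) => { repo.data.createdBy = "unknown"; },
    revert: (repo) => { delete repo.data.createdBy; },
  },
  {
    from: 2, to: 3,
    apply: (repo) => { repo.data.peers = repo.data.peers.split(","); },
    revert: (repo) => { repo.data.peers = repo.data.peers.join(","); },
  },
];

// Walk one step at a time from the repo's current version to the target,
// in either direction, and return the names of the patches that were applied.
function migrate(repo, target) {
  const applied = [];
  while (repo.version !== target) {
    const up = repo.version < target;
    const step = migrations.find((m) => (up ? m.from : m.to) === repo.version);
    if (!step) throw new Error(`no migration from version ${repo.version}`);
    if (up) step.apply(repo); else step.revert(repo);
    repo.version = up ? step.to : step.from;
    applied.push(up ? `${step.from}-to-${step.to}` : `${step.to}-to-${step.from}`);
  }
  return applied;
}

const repo = { version: 1, data: { peers: "a,b" } };
console.log(migrate(repo, 3)); // applies 1-to-2, then 2-to-3
console.log(migrate(repo, 1)); // reverts 3-to-2, then 2-to-1
```

Because each step only touches the virtual repo, the same table could in principle be reused across concrete repo implementations, as long as an adapter exists for each.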

Human inspection

Repo implementations must include tools to transform the data to a human
readable/inspectable structure. This makes it possible for users and application
implementors to debug problems. These tools may be easiest to implement with
a human readable repository format, and conversion tools to convert to/from it.
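
A conversion tool of this kind might look like the sketch below, which uses JSON as the human readable format. The assumptions are that the repo can be viewed as a key/value map and that values are text; binary values would need an extra encoding step that is not shown. `dumpRepo` and `loadRepo` are hypothetical names.

```javascript
// Assumption: the repo is reachable through a minimal key/value view
// (a Map here) with string values, and JSON is the human readable format.
function dumpRepo(version, objects) {
  const entries = {};
  for (const key of [...objects.keys()].sort()) {
    entries[key] = objects.get(key);
  }
  return JSON.stringify({ version, entries }, null, 2);
}

// Inverse of dumpRepo: rebuild the version and the key/value view.
function loadRepo(text) {
  const { version, entries } = JSON.parse(text);
  return { version, objects: new Map(Object.entries(entries)) };
}

const objects = new Map([
  ["config", "name = demo"],
  ["HEAD", "refs/heads/master"],
]);
const text = dumpRepo(3, objects);
console.log(text);
const restored = loadRepo(text);
```

Sorting the keys keeps the dump stable between runs, which makes two dumps easy to compare with ordinary diff tools.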

Corruption

  • corrupted - an unexpected, invalid data state
  • recovery - the process of "uncorrupting" a repository; this may not always be possible.

...

@betabrain

This is very interesting. It made me think of storage in general... Most applications use data structures, config files, databases, or any combination thereof. These represent the virtual repo. The concrete repo is either RAM, a local file, or a service on the other side of a network connection. Here are some thoughts.

What if there was a standard multirepo format?

  • Users would get to choose how and where application data is stored.
  • Developers would not have to think about how to store application data.
  • The on-disk format could be changed to keep up with changing performance requirements as the dataset grows.
  • All migrations / transformations would be handled by the repo.
  • GC would keep track of lossy transformations and refuse to collect dropped information, so as to allow rollbacks.

access

  • To access a repo, only its address should need to be known.
  • The repo should be self-describing, i.e. store its format and version in a standard way.
  • The address should use an open format, e.g. https://github.com/jbenet/multiaddr.
// Sketch of the proposed API; the "multirepo" module is hypothetical.
var repo = require("multirepo").open("~/myrepo");

var users = repo.access("userdb"); // userdb is a relational datastore
users.query("select * from users;");

var notes = repo.access("notes"); // notes is an append-only list
notes.append({ note: "Hello world!", ts: new Date() });

meta layer

  • Keeps a history of all transformations and migrations on the repository.
  • Stores migration procedures.

logical layer

These are the data structures the application reads and writes.

  • Describes the logical data structures.
  • Specifies the data types and schema that go into the data structures.
  • Optionally specifies performance requirements.
> repo logical

name           structure         signature                     entries
config         map               string -> string                    4
index          list              path -> delta                      21
objects        map               sha1 -> commit, tree, blob       2409

concrete layer

  • Describes the on-disk formats, e.g. leveldb, ini-file, etc.
  • Optionally stores past performance data to automatically migrate to a better on-disk format.
> repo concrete

name         backend          size
config       git-config      0.3 K
index        git-index      12.1 K
objects      git-objects    17.9 M

batteries included

  • Provide tools to migrate/transform the data stored (logical layer), e.g.
    https://github.com/jbenet/transformer
  • Create bundles/backups using the right tools for each backend.
  • Suggest the best backend based on performance requirements or past performance data.
> repo migrate-backend objects ipfs-git

> repo concrete

name         backend          size
config       git-config      0.3 K
index        git-index      12.1 K
objects      ipfs-git       21.4 M
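
What `repo migrate-backend` does internally is not specified above, so the following is only a guess at its core loop: copy every entry from the old backend to the new one, then verify the copy before reporting success. The `MapBackend` class and `migrateBackend` function are hypothetical stand-ins for real backends such as git-objects or ipfs-git.

```javascript
// Hypothetical backend: anything exposing keys(), get(key) and put(key, value).
class MapBackend {
  constructor(name) { this.name = name; this.store = new Map(); }
  keys() { return [...this.store.keys()]; }
  get(key) { return this.store.get(key); }
  put(key, value) { this.store.set(key, value); }
}

// Copy every entry, then verify the copy, so the old backend can be kept
// around until the new one is known to be complete.
// Returns the number of entries that were migrated.
function migrateBackend(source, target) {
  for (const key of source.keys()) {
    target.put(key, source.get(key));
  }
  for (const key of source.keys()) {
    if (target.get(key) !== source.get(key)) {
      throw new Error(`verification failed for ${key}`);
    }
  }
  return source.keys().length;
}

const oldBackend = new MapBackend("git-objects");
oldBackend.put("object-1", "first blob");
oldBackend.put("object-2", "second blob");
const newBackend = new MapBackend("ipfs-git");
console.log("entries migrated:", migrateBackend(oldBackend, newBackend));
```

The logical layer stays untouched throughout; only the place where the entries live changes.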
