TOGoS edited this page May 9, 2016 · 14 revisions

ContentCouch is a system (a data model, a set of command-line utilities, and a few other programs) for identifying and storing files and directories by their content, thereby allowing you to create cheap snapshots of your filesystem which can be easily backed up, synchronized, and shared. It is designed around common standards such as URIs and RDF/XML to facilitate integration with other systems.

It is implemented in Java 1.4, with most dependencies included in the repository. Therefore it should be fairly simple to run ContentCouch on any computer with Java.

ContentCouch3 contains additional tools for working with the same data structures and is meant to replace ContentCouch with a better-engineered codebase, but as of 2016 it doesn’t implement all of ContentCouch’s features.

ContentCouch makes use of URNs to identify both files and directories. Files are referenced with URIs of the form

urn:sha1:XXXXXX or urn:bitprint:XXXXX.YYYYY
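
To make the file URI form concrete, here is a minimal Java sketch that derives a urn:sha1 URI from a blob's bytes. It assumes the common convention that the part after urn:sha1: is the unpadded Base32 (RFC 4648) encoding of the 20-byte SHA-1 digest. The class and method names (Sha1Urn, base32, urnForBytes, urnForFile) are hypothetical and not taken from the ContentCouch codebase, and the sketch uses newer Java APIs than Java 1.4 provides.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of ContentCouch.
final class Sha1Urn {
    private static final String BASE32_ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567";

    /** Encodes bytes as RFC 4648 Base32 without '=' padding. */
    static String base32(byte[] data) {
        StringBuilder out = new StringBuilder();
        int buffer = 0; // bits read but not yet emitted
        int bits = 0;   // number of valid low bits in buffer
        for (byte b : data) {
            buffer = (buffer << 8) | (b & 0xFF);
            bits += 8;
            while (bits >= 5) {
                out.append(BASE32_ALPHABET.charAt((buffer >> (bits - 5)) & 31));
                bits -= 5;
            }
            buffer &= (1 << bits) - 1; // keep only the leftover bits
        }
        if (bits > 0) {
            // Pad the final group with zero bits on the right.
            out.append(BASE32_ALPHABET.charAt((buffer << (5 - bits)) & 31));
        }
        return out.toString();
    }

    private static MessageDigest newSha1() {
        try {
            return MessageDigest.getInstance("SHA-1");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 is not available", e);
        }
    }

    /** Assumed URN form: "urn:sha1:" + Base32(SHA-1(content)). */
    static String urnForBytes(byte[] content) {
        return "urn:sha1:" + base32(newSha1().digest(content));
    }

    /** Same as urnForBytes, but streams the file so large blobs are not loaded whole. */
    static String urnForFile(Path file) throws IOException {
        MessageDigest sha1 = newSha1();
        try (InputStream in = Files.newInputStream(file)) {
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                sha1.update(chunk, 0, n);
            }
        }
        return "urn:sha1:" + base32(sha1.digest());
    }

    public static void main(String[] args) {
        // The empty blob: prints urn:sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
        System.out.println(urnForBytes(new byte[0]));
    }
}
```

Because the name is derived only from the content, two files with identical bytes get the same URI regardless of where they are stored or what they are called.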

Directories are stored as RDF documents and referenced with URIs of the form

x-rdf-subject:<RDF document URI> (ccouch also supports the older but equivalent style: x-parse-rdf:urn:sha1:YYYYYY)

ContentCouch can store and check out files by hardlinking, which is enabled by passing the -link option to store, checkout, cache, and/or copy. This should work both on UNIX-like systems and on NTFS under Windows. Hardlinking requires very little extra storage space, even when a multi-gigabyte file is stored in the repository and checked out to several locations on the same partition.
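
The following Java sketch illustrates the general idea of checking out a stored blob by hardlinking. It is not ContentCouch's implementation: it relies on java.nio.file.Files.createLink (Java 7 and later, so newer than Java 1.4), and the class name HardlinkCheckout, the method linkOrCopy, and the fall-back-to-copy behaviour are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper, not part of ContentCouch.
final class HardlinkCheckout {
    /**
     * Makes target a second directory entry for the stored blob.
     * Falls back to a plain copy (an assumption of this sketch) when the
     * filesystem cannot create the link, e.g. across partitions.
     * Returns true if a hard link was created, false if the file was copied.
     */
    static boolean linkOrCopy(Path blob, Path target) throws IOException {
        try {
            Files.createLink(target, blob);
            return true;
        } catch (UnsupportedOperationException | IOException e) {
            Files.copy(blob, target);
            return false;
        }
    }

    /** Round trip in a temp directory; true if the checked-out bytes match the blob. */
    static boolean demo() {
        try {
            Path dir = Files.createTempDirectory("ccouch-sketch");
            Path blob = dir.resolve("blob");
            Path checkedOut = dir.resolve("checked-out.txt");
            Files.write(blob, "hello".getBytes(StandardCharsets.UTF_8));
            boolean linked = linkOrCopy(blob, checkedOut);
            System.out.println(linked ? "hard-linked" : "copied");
            return Arrays.equals(Files.readAllBytes(blob), Files.readAllBytes(checkedOut));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Note that a hard link and the original are the same file on disk, so modifying a linked checkout in place also modifies the stored blob.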

Current Limitations

Currently, ContentCouch does not efficiently store multiple similar versions of a file, as each blob is stored in an individual, uncompressed file (this makes finding blobs very simple and allows hardlinks to any stored blob). The proper solution, I think, is to use an underlying filesystem (such as ZFS) that can automatically share blocks between similar files (also making hardlinks unnecessary). Another solution would be to allow blobs to be defined in terms of other, similar blobs.

The checkout command’s idea of a merge is extremely naive: it simply copies any files from the source that do not exist in the destination folder, and raises an error (or ignores the conflict, if a certain option is given) when identically-named blobs with non-identical content are found. More robust changeset merging is a mid-priority to-do item.
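
The naive merge described above can be sketched roughly as follows. This is an illustration between two ordinary directories, not the actual checkout code (which works from stored directory objects). The class NaiveMerge, the ignoreConflicts flag, and the recursion into subdirectories are assumptions of the sketch.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical illustration, not part of ContentCouch.
final class NaiveMerge {
    /** Copies files missing from dest; same-named files with different content are conflicts. */
    static void merge(Path source, Path dest, boolean ignoreConflicts) throws IOException {
        Files.createDirectories(dest);
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(source)) {
            for (Path src : entries) {
                Path dst = dest.resolve(src.getFileName().toString());
                if (Files.isDirectory(src)) {
                    merge(src, dst, ignoreConflicts);  // assumed: merge subdirectories the same way
                } else if (!Files.exists(dst)) {
                    Files.copy(src, dst);              // new file: just copy it
                } else if (!Arrays.equals(Files.readAllBytes(src), Files.readAllBytes(dst))) {
                    if (!ignoreConflicts) {
                        throw new IOException("Conflict: " + dst + " differs from source");
                    }
                    // Otherwise keep the destination's version untouched.
                }
            }
        }
    }

    private static void write(Path file, String text) throws IOException {
        Files.write(file, text.getBytes(StandardCharsets.UTF_8));
    }

    /** Exercises the copy, conflict, and ignore paths; true if all behave as described. */
    static boolean demo() {
        try {
            Path src = Files.createTempDirectory("merge-src");
            Path dst = Files.createTempDirectory("merge-dst");
            write(src.resolve("a.txt"), "A");
            write(src.resolve("b.txt"), "B");
            write(dst.resolve("b.txt"), "B");
            merge(src, dst, false);                       // identical b.txt is not a conflict
            boolean copied = Files.exists(dst.resolve("a.txt"));

            write(dst.resolve("b.txt"), "changed");
            boolean conflictReported = false;
            try {
                merge(src, dst, false);
            } catch (IOException e) {
                conflictReported = true;
            }

            merge(src, dst, true);                        // conflicts ignored
            String kept = new String(Files.readAllBytes(dst.resolve("b.txt")), StandardCharsets.UTF_8);
            return copied && conflictReported && kept.equals("changed");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Nothing in this scheme reconciles two differing versions of a file; a conflict is either fatal or skipped, which is why the text calls it naive.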

If you have a small collection of files that change often, I recommend tracking them using a more mature DVCS such as Git. If, however, you have a large and growing collection of files that take up a lot of space on disk and rarely change, ContentCouch may be just what you need, despite its limitations.
