diff --git a/merkle-dag-targets.md b/merkle-dag-targets.md
new file mode 100644
index 00000000..4c623cc8
--- /dev/null
+++ b/merkle-dag-targets.md
@@ -0,0 +1,469 @@
+* TAP:
+* Title: Enabling Merkle Tree and DAG Objects as Targets in TUF
+* Version: 1
+* Last-Modified:
+* Author:
+* Type: Standardization
+* Status: Draft
+* Content-Type: markdown
+* Created: 09/05/2022
+* Requires:
+* +TUF-Version:
+* +Post-History:
+
+# Abstract
+
+This TAP proposes extending the TUF specification to support nodes in Merkle
+trees or DAGs as targets in metadata. In doing so, TUF's properties can be
+applied to applications such as Git, IPFS, and OSTree. This document describes
+how the Merkle DAG ecosystems' self-certified hash values or hashing routines
+can be adopted, and the properties applications must have to ensure hash values
+are robust.
+
+# Motivation
+
+The current TUF specification requires `targets` to be files--TUF is designed
+around the assumption that the artifacts being distributed are "regular"
+files. However, emerging applications and ecosystems that defy this assumption
+can still greatly benefit from TUF's security properties.
+
+A Merkle Tree is a hash-based data structure where the leaf nodes are
+identified by the hash of their data and non-leaf nodes are identified by the
+hash of their child nodes. A Merkle Directed Acyclic Graph (DAG) is similar,
+except that non-leaf nodes can also be associated with data, not just the leaf
+nodes. Also, an instance of a Merkle DAG does not need to be balanced, and each
+node can have several parents. Essentially, Merkle DAGs offer more freedom than
+typical Merkle Trees.
+
+Both data structures are used in various applications today. Perhaps the most
+ubiquitous is Git. They are also used in other ecosystems with a focus on
+content addressability, such as the Interplanetary File System (IPFS). In
+IPFS, files are addressed and accessed by the hashes of their content rather
+than by their location. Any file can exist at some specified location, but much
+stronger claims can be made about a file identified by its hash.
+
+These two systems are the driving use cases of this TAP. The core idea of this
+TAP is to apply TUF's properties to objects native to these ecosystems, and
+therefore, it describes how to treat Merkle Tree or DAG nodes as TUF targets.
+
+From the perspective of this TAP, there is no difference in how objects of
+these two data structures are handled. As a result, the terms "tree" and "DAG"
+are used interchangeably throughout this document to refer to an instance of a
+Merkle Tree or DAG. Additionally, "Merkle object" refers to a node in a Merkle
+Tree or DAG.
+
+NOTE: The use cases are still a work in progress.
+
+## Use Case 1: Open Law Library's The Archive Framework
+
+The Open Law Library is an open access publisher that makes laws freely
+accessible to governments and their citizens. They build tools that help
+governments with the drafting, codifying, and publishing aspects of the
+legislative process.
+
+The organization has developed a variant of TUF called The Archive Framework
+(TAF), designed to support Git repositories as targets rather than regular
+files. TAF uses a stand-in file for each repository that records a specific
+commit ID. This file is then used as a target in TUF metadata.
+
+## Use Case 2: IPFS as a Backend for Targets
+
+IPFS builds on a variety of other technologies to provide a peer-to-peer
+protocol that can store and transfer files.
+Files are broken up into multiple blocks that are part of a Merkle DAG. Each
+file is identified either by the root node when there are multiple blocks, or
+by a single node that also contains the data of the file.
+
+Adding support for IPFS to TUF allows developers to distribute files stored on
+IPFS rather than on traditional servers accessed over HTTP. This can be
+achieved in two ways: by abstracting the delivery protocol, and by treating
+IPFS nodes as targets rather than the files they represent.
+
+TUF is already protocol-agnostic, so merely having IPFS as an alternative
+protocol for repository backends requires few or no changes to TUF metadata.
+On the other hand, directly recording IPFS nodes brings TUF in line with other
+attempts to record non-traditional, Merkle DAG targets such as TAF.
+
+## Use Case 3: Distributing Artifacts Using OSTree
+
+OSTree, or libostree, is a library and tool that applies a Git-like model to
+entire, bootable filesystem trees. The project includes utilities for deploying
+these images. It follows a principle similar to Git's, using hash values to
+provide content addressability. OSTree is used by a variety of projects such as
+Flatpak and Fedora CoreOS.
+
+One key use of OSTree is for packaging and distribution operations. With
+OSTree, a package manager can distribute an entire filesystem tree. In some
+cases, such a tree can be the artifact itself, say an operating system image,
+but OSTree also makes it possible to distribute multiple artifacts using a
+single identifier--that of the root of the tree. The entire directory structure
+uses a Merkle tree under the hood.
+
+# Specification
+
+The key differences between regular file targets and Merkle objects are in how
+their hashes are computed and how the TUF verification workflow applies to
+them. As such, the key focus of this document is to articulate what is required
+to design a TUF implementation capable of recording Merkle objects. This TAP
+considers two instantiations of Merkle DAGs--Git and the Interplanetary
+Filesystem (IPFS). These systems differ significantly in the type of data each
+node represents. In Git, each node in the DAG represents a _commit_, or a
+record of changes made, while in IPFS, each DAG node represents a file, or the
+root of a tree of nodes that collectively represent a file. As such, these
+systems are different enough to ensure the contents of this TAP can apply to
+multiple types of Merkle Tree or DAG systems not explicitly considered here.
+
+Presently, each entry in TUF's targets metadata has two key parts--the
+identification of the target, and the characteristics of the target.
+Incorporating Merkle objects will require consideration of both of these
+aspects, as well as of how they are handled during verification.
+
+## Identifying the Target
+
+Currently, file targets are identified by a path that is relative to the
+repository's base URL. As discussed before, a Merkle DAG is a hash-based data
+structure, so every node is associated with a hash value. Therefore, as the
+identifier of each node is ecosystem-specific, the strategy used to identify a
+target node will vary accordingly.
+
+In order to support different Merkle DAG ecosystems, this TAP proposes using
+RFC 3986's URI structure for the target identifier. This has the following
+structure:
+
+```
+<scheme>:<hier-part>
+```
+
+The `scheme` contains a token that uniquely identifies the Merkle DAG ecosystem
+while `hier-part` contains the location or identifier of the specific target.
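+
+As an illustration of how an implementation might consume such identifiers,
+the following is a minimal, hypothetical Python sketch: it splits a target
+identifier into its components and uses the scheme to select
+ecosystem-specific handling. The handler names are illustrative and not part
+of this TAP.
+
+```python
+from urllib.parse import urlsplit
+
+def handle_git_target(identifier): ...   # hypothetical Git-specific logic
+def handle_ipfs_target(identifier): ...  # hypothetical IPFS-specific logic
+
+# The scheme token selects the Merkle DAG ecosystem the target belongs to.
+HANDLERS = {"git": handle_git_target, "ipfs": handle_ipfs_target}
+
+def resolve_target(identifier):
+    parsed = urlsplit(identifier)
+    if parsed.scheme not in HANDLERS:
+        raise ValueError(f"unsupported Merkle DAG ecosystem: {parsed.scheme}")
+    return HANDLERS[parsed.scheme](identifier)
+```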
+
+For example, every Git repository contains a Merkle DAG, in which every node is
+a commit object, and each commit has a unique identifier generated using SHA-1.
+However, while this identifier can point to the commit, it is insufficient to
+locate the repository itself. So, when the Merkle DAG in question is that of a
+Git repository, the target identifier should point to the root of the
+repository as a whole.
+
+```
+git:<location of the repository>
+```
+
+With Git in particular, the URI structure can be extended to contain more
+information. For example, the URI could record a specific _branch_ or _tag_.
+
+```
+git:<location of the repository>?branch=<branch name>
+git:<location of the repository>?tag=<tag name>
+```
+
+On the other hand, IPFS introduces the concept of locating arbitrary artifacts
+by their content, rather than by a particular location. When a file is added to
+IPFS, it is then available at an endpoint that uses the cryptographic hash of
+its contents. This is the notion of content addressability. In this instance,
+it makes sense to use this identifier in TUF metadata.
+
+```
+ipfs:<content identifier of the artifact>
+```
+
+It is important to note that a file can encompass multiple nodes in the IPFS
+Merkle DAG, and in such situations, the identifier should be that of the root
+node, which points to the other nodes that make up the file.
+
+## Recording the Characteristics of the Target
+
+In the current TUF specification, each target entry has the following format:
+
+```
+{
+  "length" : LENGTH,
+  "hashes" : {ALG: HASH, ...},
+  ("custom" : CUSTOM)
+}
+```
+
+While this format does not require any significant changes, the techniques used
+to populate its fields for non-file targets must be carefully designed.
+
+The opaque `custom` field requires no changes to support this TAP.
+
+Discuss: what makes sense for `length` in git repos? The commit object files
+pre-compression?
+
+The `length` field is an integer that captures the length in bytes of the
+target. For artifacts uploaded to IPFS, this field should capture the length of
+the file as a whole.
+
+The `hashes` field points to a dictionary object that captures the
+cryptographic hashes of the target in one or more algorithms. While computing
+these hashes is vital for targets that are regular files, in the case of Merkle
+DAG objects, the identifier of each node is itself a hash value and
+self-certifies the contents associated with the node. Therefore, the
+identifying hash can be used in the `hashes` field.
+
+In the case of Git, this field can contain the identifier of the commit at the
+tip of the branch in question. When the repository is fetched by the client
+during verification, receiving the commit with that specific identifier is akin
+to receiving a file with a particular recorded hash. However, this requires a
+degree of trust in the hash computation mechanism built into the Git
+implementations used while recording the hash and verifying on the client.
+
+For recording artifacts stored in IPFS, a similar approach, in which the
+content identifier is computed by the system, can be used. This identifier is
+not the same as the hash of the artifact itself, but rather identifies the root
+node of the subgraph used to represent the artifact. When using this
+identifier, it is also important to be aware of, and to account for, the
+multibase representation used. Multibase, a protocol that can disambiguate the
+encoding used for base-encoded text, is used by IPFS for its hash values. The
+same hash value can have multiple distinct representations, depending on the
+base.
+If an implementation of this TAP directly uses the values provided by IPFS, it
+is important to choose, and note, a specific base or encoding to avoid
+confusion in the future. The `custom` field can be used to communicate these
+configuration choices.
+
+Also, it is important to remember that the "multihash" system used by IPFS is
+"crypto-agile," meaning its content identification is not locked into one
+cryptographic hash algorithm. As TUF is similarly designed, and does not
+mandate a particular hash algorithm, its metadata structure allows for any
+number of hashes to be recorded for every target. This can be leveraged when
+multiple hash values exist for a particular target.
+
+## Verifying the Target
+
+The verification workflow also depends on the application or ecosystem to
+validate hash values. In a well-designed application or ecosystem, nodes with
+invalid identifiers should not be allowed to exist without causing errors. Git
+is an example of such an ecosystem. If a commit object no longer matches the
+claimed hash value, the main Git implementations immediately flag the issue,
+essentially halting all operations that can apply to the corrupted object.
+
+In such an ecosystem, checking that a node can legally exist in the Merkle
+DAG--that is, that its identifier can be recomputed from its data--is
+equivalent to verifying that a given file has a particular hash value, as in
+the current TUF verification workflow. An example of a robust application is in
+the [appendix](#appendix-ideal-application-behaviour).
+
+This delegates some trust to the implementation. In situations where this is
+not ideal, the node hashes can instead be recalculated manually as part of the
+verification process. Continuing with the example of Git, the hashes of all
+nodes can be recomputed and checked by the client. This is also demonstrated in
+the [appendix](#appendix-ideal-application-behaviour).
+
+# Rationale
+
+This TAP proposes several changes to the general artifact recording process
+currently employed in TUF.
+
+## Use of URIs for Target Identification
+
+This TAP updates the definition of a target identifier so that it is no longer
+only a path relative to a repository; it may also be a URI. URIs are a broadly
+understood and widely accepted way of pointing to different resources. As such,
+URIs are an ideal choice when an identifier must clearly specify the Merkle DAG
+system of a particular target, while also locating the object in question.
+
+## Use of Self-Certified Hash Values
+
+A key change proposed in this TAP is the use of hash values calculated by
+individual Merkle DAG applications or ecosystems rather than those generated
+by the developers using TUF. Yet, as discussed in the security analysis, as
+long as the application is careful with its selection of hash algorithms, the
+only critical element in this change is how the hash values are used in the
+verification workflow. It is vital to always remember that the hash calculation
+is no longer directly controlled by the developer. As such, the hashing
+mechanism must be audited regularly; the application in question must not be
+blindly trusted.
+
+This may not always be possible--the application in question may not be open
+source, or may not be auditable for other reasons. In these situations, it is
+highly recommended that developers not take self-certified hashes at face
+value.
+
+It is also not necessary to take a Merkle DAG application at face value.
+Instead, the TUF implementation in use can be extended to interface with the
+application, re-implementing the hashing mechanism used to record the target's
+hashes. In other words, the TUF implementation reuses the application's hashing
+scheme rather than inventing a new one. This ensures that, for a given
+cryptographic hash algorithm, there are not multiple values for a given object.
+
+This is not a concern for regular files because, when computing their hashes,
+the input can only be structured in one way--the file contents themselves. This
+is not the case for more abstract scenarios such as those of Merkle DAG
+applications. It is possible to use the characteristics of a Merkle DAG node in
+multiple ways when computing its hash. In order to avoid confusion, this TAP
+specifies using the existing hashing routine as long as it is robust and
+secure.
+
+# Security Analysis
+
+There are several considerations to be made when this TAP is applied in
+practice.
+
+## Auditing Hash Computation
+
+Hash calculation is the biggest change proposed in this TAP. In TUF, the
+developers distributing the targets control the hash algorithms used and the
+actual computation. On the other hand, this TAP recommends using the node's
+identifier, which is itself a hash, instead of recording a new hash. This
+transfers control to the _application_. Note that if the distributor also
+controls the application, this is no longer a concern.
+
+Yet, this condition will not always hold. So, it is important to consider
+several factors when choosing to record Merkle DAG objects as targets. As
+always, the algorithm used should result in **unique** hashes for distinct
+objects. Two distinct Merkle DAG nodes should under no circumstances share a
+hash value. Further, the hash value should be **repeatable**. During
+verification, the hash values of the nodes are not calculated again. Instead,
+the client checks that a node exists in the respective Merkle DAG system with
+the expected hash. In this scenario, the client should not recognize the same
+node with a different hash when all other parameters, such as the algorithm
+used, are the same. For example, a poorly written implementation may compute
+one hash on the repository side and a different hash on the client, causing the
+client to consider the recorded hash invalid. The systems considered in this
+TAP, Git and IPFS, do not have this problem, but this is a hypothetical issue
+to be considered when evaluating other Merkle DAG ecosystems.
+
+## Unavailable Resources
+
+The availability of targets in a Merkle DAG context is no different from that
+of regular files. For the metadata to be signed, the specific object must be
+available from the corresponding source. Similarly, verification of the
+metadata is contingent on actually receiving the resource in question--here,
+that takes the form of the specified nodes being present on the client
+post-fetch.
+
+# Adoption Considerations
+
+While this TAP goes into detail about handling Git and IPFS, it should be
+possible to apply the same techniques to other Merkle DAG based systems.
+There are several factors to consider as an adopter looking to implement this
+TAP.
+
+## Applicability of the TAP
+
+An important aspect of applying the ideas in this TAP is ensuring the target
+system is indeed built using a Merkle Tree or DAG. This TAP is **not**, for
+example, generalizable to all version control systems (VCSs). Consider
+Subversion (SVN), an alternative to Git.
+Like Git, SVN has a concept of recording changes, which it calls _revisions_.
+However, SVN **does not** use a Merkle DAG to store these revisions. Instead,
+each revision is identified by an auto-incrementing integer, one more than the
+previous revision. This identifier does not make any claims about the specific
+changes in the revision. Indeed, the identifier is entirely disconnected from
+the contents of the changes contained in the corresponding revision, and using
+it as a self-certified value of a revision, as prescribed in this document for
+Git, is **dangerous**, entirely undermining the security properties offered by
+TUF.
+
+Implementers must therefore be very careful when adopting this TAP for a new
+system. They must be familiar with the characteristic properties of Merkle Tree
+or DAG based systems. If they are implementing this TAP for an existing system
+they do not directly control, they must thoroughly and regularly
+[audit the hash computation](#auditing-hash-computation) mechanisms used by the
+system.
+
+## Registering a Scheme for New Applications
+
+As noted [previously](#identifying-the-target), Merkle DAG objects are
+identified using URIs, where the `scheme` describes the specific ecosystem the
+target belongs to. In order to avoid collisions between these values, adopters
+should inform the broader TUF community of any new applications they implement
+this TAP for. Adopters should announce the new application and the identifier
+they have selected for it via the forums used by the community, such as the
+mailing list, Slack channels, and the monthly community meetings. They can also
+seek the community's feedback in assessing the ecosystem for the applicability
+of this TAP.
+
+# Backwards Compatibility
+
+Discuss: should conforming implementations also implement base TUF for regular
+files?
+
+The changes introduced here are expected to be implemented as extensions to
+existing TUF implementations. Therefore, these implementations can continue
+handling regular file targets as they already do.
+
+However, when multiple TUF implementations exist with varying degrees of
+support for this TAP, or indeed support for different sets of Merkle DAG
+systems, there may be compatibility issues during verification. A
+non-conforming implementation cannot handle TUF metadata that contains Merkle
+DAG targets.
+
+# Augmented Reference Implementation
+
+None at the moment.
+
+# Copyright
+
+This document has been placed in the public domain.
+
+# References
+
+* [Merkle Trees](https://xlinux.nist.gov/dads/HTML/MerkleTree.html)
+* [RFC 3986 - Uniform Resource Identifier (URI): Generic Syntax](https://tools.ietf.org/html/rfc3986)
+* [Interplanetary Filesystem](https://ipfs.io/)
+* [IPFS Merkle DAGs](https://docs.ipfs.io/concepts/merkle-dag/)
+* [IPFS Content Addressing](https://docs.ipfs.io/concepts/content-addressing/)
+* [IPFS Hashing](https://docs.ipfs.io/concepts/hashing/)
+* [Multibase](https://github.com/multiformats/multibase)
+* [Multihash](https://github.com/multiformats/multihash)
+
+# Appendix: Ideal Application Behaviour
+
+As an example of robust application behaviour, consider what happens when a Git
+commit object is manually overwritten with different information.
+
+```bash
+$ cat .git/objects/65/774be295aaf5ac9412ebe81584138643ebded2 | zlib-flate -uncompress
+commit 727tree b4d01e9b0c4a9356736dfddf8830ba9a54f5271c
+author Aditya Sirish 1654557334 -0400
+committer Aditya Sirish 1654557334 -0400
+gpgsig -----BEGIN PGP SIGNATURE-----
+
+ iQEzBAABCAAdFiEE4ylBKZy4wNk9zyesuDEQ0BJUVgQFAmKeipYACgkQuDEQ0BJU
+ VgSmUgf9FSwk2VVPn0vWmFzx6x5JdT9CQ3Tl9cqxug0/Zu8xfesQlMgpcpDDMHSf
+ ZdmGfYaLb7aqSL0jE+pwytAfhGN4xwegqS4/YrzqnPZPjxtj5JlwBVtdMtsYRVHN
+ QvsDBZEYYd/MFGqSyVkJwFAH9idRwdki8wQ/JwtbAf0QIkqWdIORckh75V7VxX1r
+ Rv5jU9luU60NbEzAHa/W3xvfKVgaA4a1VjmS7ATOrAS4maNi+VzXjBnvhmR4z7zS
+ FF4N3QkZ8XwHMu/uuldTq2mB4/uJ/BXP5TNZULn7sbYHKMXrH4ZscqDFplRMeah/
+ XxcVTwUVn2zHdmOMf7xw6goFszPaDg==
+ =lcDr
+ -----END PGP SIGNATURE-----
+
+Initial commit
+
+Signed-off-by: Aditya Sirish
+$ cat .git/objects/65/774be295aaf5ac9412ebe81584138643ebded2 | zlib-flate -uncompress | sha1sum
+65774be295aaf5ac9412ebe81584138643ebded2 - # this matches the commit ID
+$ cp .git/objects/65/774be295aaf5ac9412ebe81584138643ebded2{,.valid} # copy of the original commit object
+```
+
+As seen above, the commit ID is merely the SHA-1 hash of the uncompressed
+contents of the commit object. Now, if we replace the original commit object
+with a new one that is not exactly the same, Git shows an error.
+
+```bash
+$ ls -l .git/objects/65
+-rw-r--r-- 1 saky users 143 Jun 6 19:20 774be295aaf5ac9412ebe81584138643ebded2
+-r--r--r-- 1 saky users 532 Jun 6 19:15 774be295aaf5ac9412ebe81584138643ebded2.valid
+$ cat .git/objects/65/774be295aaf5ac9412ebe81584138643ebded2 | zlib-flate -uncompress
+commit 727tree b4d01e9b0c4a9356736dfddf8830ba9a54f5271c
+author Aditya Sirish 1654557334 -0400
+committer Aditya Sirish 1654557334 -0400
+
+Initial commit
+
+Signed-off-by: Aditya Sirish
+```
+
+In this situation, we have replaced the commit object with one that lacks the
+GPG signature. This modified commit object should in fact have a different ID.
+
+```bash
+$ cat .git/objects/65/774be295aaf5ac9412ebe81584138643ebded2 | zlib-flate -uncompress | sha1sum
+e381ba874033b24fb80efb396b7c167118753b91 - # this does not match the commit ID
+$ git show
+error: hash mismatch 65774be295aaf5ac9412ebe81584138643ebded2
+fatal: bad object HEAD
+```
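+
+The same check can also be performed without relying on Git's own integrity
+checking, which is useful when the local Git implementation itself is not
+trusted. The following is a minimal Python sketch, reusing the commit ID from
+the transcript above; it assumes a SHA-1 repository and a loose (non-packed)
+object, and is illustrative rather than part of this TAP.
+
+```python
+import hashlib
+import zlib
+
+# Loose objects are zlib-compressed and already contain the "<type> <size>\0"
+# header, so the object ID is simply the SHA-1 of the decompressed bytes.
+def recompute_object_id(loose_object_path):
+    with open(loose_object_path, "rb") as f:
+        return hashlib.sha1(zlib.decompress(f.read())).hexdigest()
+
+expected = "65774be295aaf5ac9412ebe81584138643ebded2"
+path = ".git/objects/" + expected[:2] + "/" + expected[2:]
+if recompute_object_id(path) != expected:
+    raise ValueError("hash mismatch for " + expected)
+```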