(WIP) Rewrite TAP 19 to be more focused on TUF properties

Signed-off-by: Aditya Sirish <[email protected]>
theupdateframework · Mar 22, 2023 · commit 111ba4b
1 parent 2fd3632
Showing 1 changed file with 78 additions and 187 deletions.
diff --git a/tap19.md b/tap19.md
@@ -28,31 +28,17 @@ around the assumption that the artifacts being distributed are "regular"
 files. However, emerging applications and ecosystems that defy this assumption
 can still greatly benefit from TUF's security properties.
 
-Content addressed systems are those in which objects are addressed and accessed
-by hashes of their content rather than by reference, such as their location. A
-common paradigm used to achieve content addressability is to represent objects
-via a Merkle Tree or Directed Acyclic Graph (DAG). A Merkle Tree is a hash-based
-data structure where the leaf nodes are identified by the hash of their data and
-non-leaf nodes are identified by the hash of their child nodes. A Merkle DAG is
-similar, except non-leaf nodes can be associated with some data instead of just
-the leaf nodes. Also, an instance of a Merkle DAG does not need to be balanced
-and each node can have several parents. Essentially, Merkle DAGs offer more
-freedom than typical Merkle Trees.
-
-Both data structures are used in various applications today. Perhaps the most
-ubiquitous is Git. They are also used in other ecosystems like the
-Interplanetary File System (IPFS). In IPFS, files are addressed and accessed by
-the hashes of their content rather than by reference, using their location. Any
-file can exist at some specified location, but much stronger claims can be made
-about a file identified by its hash.
-
-These two systems are the driving use cases of this TAP and therefore the
-specification uses Merkle DAG semantics to describe its changes. Further, from
-the perspective of this TAP, there are no differences to how objects of these
-data structures are handled. As a result, the terms "tree" and "DAG" are used
-interchangeably throughout this document to refer to an instance of a Merkle
-Tree or DAG. Additionally, "Merkle object" refers to a node in a Merkle Tree or
-DAG.
+Content addressed systems are those in which objects are addressed or identified
+by a function of their contents. Typically, each artifact is addressed using a
+cryptographic hash of its contents. Due to this characteristic, these systems
+usually verify artifact integrity intrinsically. Some
+examples of content addressed systems are Git and the Interplanetary File System
+(IPFS). In Git, if some object with a particular content address is overwritten
+in the Git object store, any operations that _use_ the corrupted object fail
+because of a hash mismatch. IPFS provides interfaces to store artifacts, at
+which point their hashes are calculated. These hashes can then be used to fetch
+the corresponding artifacts from IPFS. Similar to Git, a corruption of an object
+in the store results in failures when attempting to use that artifact.
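
As a minimal sketch of content addressing, the following Python snippet recomputes the identifier Git assigns to a blob: the SHA-1 of a `blob <size>\0` header followed by the contents. The function names are illustrative and not part of the TAP, and repositories using Git's SHA-256 object format would need a different hash function.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID Git assigns to a blob with these contents."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def matches_address(content: bytes, address: str) -> bool:
    """Check that the content still hashes to the address it is stored under."""
    return git_blob_id(content) == address


address = git_blob_id(b"hello world\n")
print(address)
print(matches_address(b"hello world\n", address))  # intact object
print(matches_address(b"hello w0rld\n", address))  # corrupted object
```

Because the address is derived from the contents, any modification to a stored object shows up as a mismatch the next time the address is recomputed.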
 
 ## Use Case 1: Open Law Library's The Archive Framework
 
@@ -71,21 +57,7 @@ as Targets.
 
 ## Use Case 2: IPFS as a Backend for Targets
 
-IPFS builds on a variety of other technologies to provide a peer-to-peer
-protocol that can store and transfer files. Files are broken up into multiple
-blocks that are part of a Merkle DAG. Each file is identified by either the
-root node when there are multiple blocks, or a single node that also contains
-the data of the file.
-
-Adding support for IPFS to TUF allows developers to distribute files stored on
-IPFS as opposed to traditional servers and distributed via HTTP. This can be
-achieved in two manners: by abstracting the delivery protocol, and by treating
-IPFS nodes as targets rather than the files they represent.
-
-TUF is already protocol agnostic, so merely having IPFS as an alternative
-protocol for repository backends requires few or no changes to TUF metadata.
-On the other hand, directly recording IPFS nodes brings it in line with other
-attempts to record non-traditional, Merkle DAG targets such as TAF.
+TODO: John?
 
 ## Use Case 3: Distributing Artifacts Using OSTree
 
@@ -99,8 +71,7 @@ One key use of OSTree is for packaging and distribution operations. With
 OSTree, a package manager can distribute an entire filesystem tree. In some
 cases, such a tree can be the artifact itself, say an operating system image,
 but OSTree also makes it possible to distribute multiple artifacts using a
-single identifier--that of the root of the tree. The entire directory structure
-uses a Merkle tree under the hood.
+single identifier--that of the root of the tree.
 
 # Specification
 
@@ -142,10 +113,10 @@ TAP proposes using RFC 3986's URI structure for the entry's identifier.
 <scheme>:<hier-part>
 ```
 
-The `scheme` contains a token that uniquely identifies the Merkle DAG ecosystem
-while `hier-part` contains the location or identifier of the specific target.
-In the Git example, the `scheme` may be `git` and the `hier-part` can indicate
-the repository and other information. Note that the specifics of how this TAP
+The `scheme` contains a token that uniquely identifies the ecosystem while
+`hier-part` contains the location or identifier of the specific target. In the
+Git example, the `scheme` may be `git` and the `hier-part` can indicate the
+repository and other information. Note that the specifics of how this TAP
 applies to Git repositories must be recorded in the corresponding POUF, this
 document does not formally specify how it applies to any particular ecosystem.
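
To illustrate the `<scheme>:<hier-part>` structure, here is a small Python sketch that splits an identifier at the first colon and validates the scheme against the RFC 3986 grammar. The function name and the example identifier are hypothetical; the actual form of `hier-part` for any ecosystem would be defined in its POUF.

```python
import re
from typing import Tuple

# RFC 3986: scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
SCHEME_RE = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*")


def split_target_identifier(identifier: str) -> Tuple[str, str]:
    """Split an identifier into (scheme, hier-part) at the first colon."""
    scheme, sep, hier_part = identifier.partition(":")
    if not sep or not SCHEME_RE.fullmatch(scheme):
        raise ValueError(f"not a valid <scheme>:<hier-part> identifier: {identifier!r}")
    # Schemes are case-insensitive; lowercase is the canonical form.
    return scheme.lower(), hier_part


# Hypothetical identifier, for illustration only.
print(split_target_identifier("git:https://example.com/org/repo?branch=main"))
```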
 
@@ -171,72 +142,26 @@ In the current TUF specification, each target entry has the following format:
 }
 ```
 
-The opaque `custom` field requires no change to make this TAP possible.
+The opaque `custom` field requires no explicit changes. An ecosystem may choose
+to define some specific fields within it, and this must be communicated in the
+corresponding POUF.
 
 The `length` field is an integer that captures the length in bytes of the
-target. While this is straightforward for files, it can be more complicated to
-define what the length of Merkle DAG objects are. This field may also be
-entirely dropped if there is no clear value for a particular ecosystem. The
-corresponding POUF must provide a clear direction for populating (or not) this
-field.
-
-The `hashes` field points to a dictionary object that captures the
-cryptographic hashes of the target in one or more algorithms. While vital for
-targets that are regular files, in the case of Merkle DAG objects, the
-identifier self certifies the contents associated with the node. Thus, this
-self certifying value can be used as the hash for the object.
-
-In the case of Git, this field can contain the identifier of the commit at the
-tip of the branch in question. When this repository is fetched by the client
-during verification, receiving the commit with the specific identifier is akin
-to receiving a file with a particular recorded hash. However, this requires a
-degree of trust in the hash computation mechanism built into the Git
-implementations used while recording the hash and verifying on the client.
-
-For recording artifacts stored in IPFS, a similar approach, in which the
-content identifier is computed by the system, can be used. This identifier is
-not the same as the hash of the artifact itself, but rather identifies the root
-node of the subgraph used to represent the artifact. When using this
-identifier, it is also important to be aware of, and to account for, the
-multibase representation used. Multibase, a protocol that can disambiguate the
-encoding used for base-encoded text, is used by IPFS for its hash values. The
-same hash value can have multiple distinct representations, depending on the
-base. If an implementation of this TAP is directly using the values provided by
-IPFS , it is important to note or choose a specific base or encoding to avoid
-confusion in the future. The `custom` field can be used to communicate these
-configuration choices for each object.
-
-Also, it is important to remember that the "multihash" system used by IPFS is
-“crypto-agile,” meaning its  content identifying system is not locked into one
-cryptographic hash algorithm. As TUF is similarly designed, and does not
-mandate a particular hash algorithm, its metadata structure allows for any
-number of hashes to be recorded for every target. This can be leveraged when
-multiple hash values exist for a particular target.
-
-As before, the specific details associated with recording characteristics of an
-artifact are left to the POUF detailing the implementation of the corresponding
-ecosystem or application.
+target. This field may or may not be relevant, depending on the ecosystem. The
+POUF must specify how `length` is to be parsed.
+
+Similarly, the `hashes` field may be unnecessary if the target identifier
+directly uses the ecosystem's hash value. Once again, the ecosystem's POUF must
+specify how `hashes` is to be parsed.
 
 ## Verifying the Target
 
-The verification workflow also depends on the application or ecosystem to
-validate hash values. In a well designed application or ecosystem, nodes with
-invalid identifiers should not be allowed to exist without causing errors. Git
-is an example of such an ecosystem. If a commit object no longer matches the
-claimed hash value, the main Git implementations immediately flag the issue,
-essentially halting all operations that can apply to the corrupted object.
-
-In such an ecosystem, the ability for a node to legally exist in the system with
-its identifier being recomputable for its data is equivalent to verifying a
-given file has a particular hash value, as in the current TUF verification
-workflow. An example of a robust application is in the
-[appendix](#appendix-ideal-application-behaviour).
-
-This delegates some trust to the implementation. In situations where this is
-not ideal, the node hash can be calculated manually as part of the verification
-process as well. Continuing with the example of Git, the hashes of all nodes
-can be verified as part of the verification process. This is also demonstrated
-in the [appendix](#appendix-ideal-application-behaviour).
+As this TAP applies to content addressed systems that enforce artifact
+integrity protections, verification of a target in the TUF sense consists of
+all the checks in the TUF specification except the hash verification of the
+artifact. Instead, the ecosystem is responsible for verifying artifact integrity
+at the time of use of the artifact. Examples of these checks are presented in
+the [appendix](#appendix-application-behavior).
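
The division of responsibility described here could be sketched as follows. This is an illustration under stated assumptions, not an implementation of any TUF client: `verify_target`, `IntegrityCheck`, and the identifier format are all hypothetical, and the sketch assumes the targets metadata has already passed TUF's signature, expiry, and rollback checks.

```python
from typing import Callable, Dict

# Hypothetical type: an ecosystem-provided check that takes the `hier-part`
# of an identifier and reports whether the addressed object is intact.
IntegrityCheck = Callable[[str], bool]


def verify_target(identifier: str,
                  trusted_targets: Dict[str, dict],
                  ecosystem_checks: Dict[str, IntegrityCheck]) -> bool:
    """Accept a target only if TUF metadata lists it and the ecosystem vouches for it."""
    # TUF's part: the identifier must appear in already-verified targets metadata.
    if identifier not in trusted_targets:
        return False
    # Ecosystem's part: integrity is checked by the system named in the scheme.
    scheme, _, hier_part = identifier.partition(":")
    check = ecosystem_checks.get(scheme)
    if check is None:
        return False  # unknown ecosystem: fail closed
    return check(hier_part)


# Illustrative data only; the identifier format would be defined by a POUF.
targets = {"git:example-repo@abc123": {"custom": {}}}
checks = {"git": lambda hier_part: hier_part == "example-repo@abc123"}
print(verify_target("git:example-repo@abc123", targets, checks))
print(verify_target("git:other-repo@def456", targets, checks))
```

Failing closed when no check is registered for a scheme reflects the TAP's warning that skipping TUF's hash verification is only safe when the ecosystem itself enforces integrity.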
 
 # Rationale
 
@@ -251,106 +176,67 @@ and widely accepted method, to point to different resources. Thus, URIs are an
 ideal choice when an identifier must clearly specify the specific system of a
 particular target, while also locating the object in question.
 
-## Use of Self Certified Hash Values
-
-A key change proposed in this TAP is the use of hash values calculated by
-individual content addressed systems such Merkle DAG applications or ecosystems
-rather than those generated by the developers using TUF. Yet, as discussed in
-the security analysis, as long as the application is careful with its selection
-of hash algorithms, the only critical element in this change is how the hash
-values are used in the verification workflow. It is vital to always remember
-that the hash calculation is no longer directly controlled by the developer. As
-a consequence, audits of the mechanism must be regularly performed rather than
-blindly trusting the application in question.
-
-This may not always be possible--the application in question may not be open
-source or not auditable for other reasons. In these situations, it is highly
-recommended that the developers not take self certified hashes at face value.
-
-It is also not necessary to take an application or ecosystem at face value.
-Instead, the TUF implementation in use can be extended to interface with the
-application, re-implementing the hashing mechanism used to record the target's
-hashes. What this means is that the TUF implementation uses the application's
-hashing mechanisms rather than reinventing the wheel. This ensures that for a
-given cryptographic hash algorithm, there are not multiple values for a given
-object.
-
-This is not a concern for regular files because when computing their hashes,
-the inputs can only be structured in one way--the files themselves. This is not
-the case for more abstract object representations that exist in content
-addressed systems such as those of Merkle DAG applications. It is possible to
-use the characteristics of a Merkle DAG node in multiple ways when computing its
-hash. In order to avoid confusion, this TAP specifies using the existing hashing
-routine as long as it is robust and secure.
+## Relinquishing Artifact Integrity Checks
+
+The most significant change proposed in this TAP is the transfer of artifact
+integrity verification from TUF to the ecosystem. This has major implications
+for TUF's security guarantees and it can be catastrophic if the TAP is applied
+to an ecosystem without strong integrity validation properties. The
+[security analysis](#security-analysis) covers the basics of how an ecosystem
+to which this TAP applies computes artifact hashes to verify their integrity.
 
 # Security Analysis
 
 There are several considerations to be made when this TAP is applied in
 practice.
 
-## Auditing Hash Computation
-
-Hash calculation is the biggest change proposed in this TAP. In TUF, the
-developers distributing the targets control the hash algorithms used and the
-actual computation. On the other hand, this TAP recommends using the node's
-identifier, which is itself a hash, instead of recording a new hash. This
-transfers the control to the _application_. Note that if the distributor also
-controls the application, this is not a concern.
-
-Yet, this conditional solution will not always be true. So, it is important to
-consider several factors when choosing to record content addressed objects as
-targets. As always, the algorithm or hashing routine used should result in
-**unique** hashes for distinct objects. Two distinct objects should under no
-circumstances share a hash value. Further, the hash value should be
-**repeatable**. During verification, the hash values of the nodes are not
-necessarily _explicitly_ calculated by the TUF client. Instead, the client
-checks that a node exists in the respective system with the hash and expects the
-system to detect if some other node is masquerading using the provided hash.
-Therefore, the client should not recognize the same node with a different hash
-when all other parameters such as the algorithm used are the same. A poorly
-written implementation may compute one hash at the repository and expect a
-different hash on the client, considering the original hash invalid on the
-client. The systems considered in this TAP, Git and IPFS, do not have this
-problem, but this is one hypothetical issue to be considered when evaluating
-other ecosystems.
+## Auditing Ecosystems and their Hash Computation Routines
+
+As noted before, content addressable systems typically use cryptographic hashes
+over the contents of artifacts. A system that is a legitimate candidate for this
+TAP must be thoroughly audited to validate its hash computation routines and
+artifact integrity checks. Developers are also urged to monitor the development
+of the ecosystem itself to ensure the assumptions of strong artifact integrity
+validations continue to hold.
+
+The hash algorithm must result in **unique** hashes for distinct objects, that
+is, it must be collision resistant. Further, the hash value should be
+**repeatable**: for any artifact, the algorithm should always generate the same
+hash value. TUF implementations already rely on these properties in the hash
+algorithms they select for artifact integrity checks, so the properties must
+also hold in the content addressed ecosystem.
 
 ## Unavailable Resources
 
 The availability of targets in a content addressed context is no different from
 that of regular files. For the metadata to be signed, the specific object must
-be available from the corresponding source. Similarly, verification of the
-metadata is contingent on actually receiving the resource in question--here,
-that takes the form of the specified nodes being present on the client
-post-fetch.
+be available from the corresponding source.
 
 # Adoption Considerations
 
-While this TAP goes into detail about handling Git and IPFS, it should be
-possible to apply the same techniques to other robust content addressable
-systems. There are several factors to consider as an adopter looking to
-implement this TAP.
+There are several factors to consider as an adopter looking to implement this
+TAP.
 
 ## Applicability of the TAP
 
 An important aspect of applying the ideas in this TAP is ensuring the target
 system is indeed content addressable.  This TAP is **not**, for example,
 generalizable to all version control systems (VCSs). Consider Subversion (SVN),
 an alternative to Git. Like Git, SVN has a concept of recording changes which it
-calls _revisions_. However, SVN **does not** use a Merkle DAG to store these
-revisions. Instead, each revision is identified by an auto-incrementing integer,
-one more than the previous revision. This identifier does not make any claims
-about the specific changes in the revision. Indeed, the identifier is entirely
-disconnected from the contents of the changes contained in the corresponding
-revision, and using it as a self certified value of a revision as prescribed in
-this document for Git is **dangerous**, entirely undermining the security
-properties offered by TUF.
+calls _revisions_. However, SVN **does not** use a content addressed store for
+these revisions. Instead, each revision is identified by an auto-incrementing
+integer, one more than the previous revision. This identifier does not make any
+claims about the specific changes in the revision. Indeed, the identifier is
+entirely disconnected from the contents of the changes contained in the
+corresponding revision, and using it as a self certified value of a revision as
+prescribed in this document for Git is **dangerous**, entirely undermining the
+security properties offered by TUF.
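
The difference can be illustrated with a toy model in Python. `CounterStore` is a hypothetical stand-in for sequential revision numbering and does not reflect how SVN is actually implemented; SHA-256 is used only as an example of a content-derived identifier.

```python
import hashlib


class CounterStore:
    """Toy store that numbers entries sequentially, like revision numbers."""

    def __init__(self) -> None:
        self._entries = []

    def add(self, content: bytes) -> int:
        self._entries.append(content)
        return len(self._entries)  # the identifier ignores the content


def content_address(content: bytes) -> str:
    """Identifier derived from the content itself."""
    return hashlib.sha256(content).hexdigest()


honest, tampered = CounterStore(), CounterStore()
rev_honest = honest.add(b"original change")
rev_tampered = tampered.add(b"malicious change")

# Both stores report the same identifier for different contents.
print(rev_honest == rev_tampered)
# Content-derived identifiers differ as soon as the contents differ.
print(content_address(b"original change") == content_address(b"malicious change"))
```

A sequential identifier recorded in TUF metadata would therefore say nothing about which contents a client actually receives.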
 
 Implementers must be therefore very careful with adopting this TAP for a new
 system. They must be familiar with the characteristic properties of content
 addressable systems. If they are implementing this TAP for an existing system
-they do directly control, they must thoroughly and regularly
-[audit the hash computation](#auditing-hash-computation) mechanisms used by the
-system.
+they do not directly control, they must thoroughly and regularly audit the
+[hash computation and integrity checks](#auditing-ecosystems-and-their-hash-computation-routines)
+used by the system.
 
 ## Registering a Scheme for New Applications
 
@@ -388,16 +274,17 @@ This document has been placed in the public domain.
 
 # References
 
-* [Merkle Trees](https://xlinux.nist.gov/dads/HTML/MerkleTree.html)
 * [RFC 3986 - Uniform Resource Identifier (URI): Generic Syntax](https://tools.ietf.org/html/rfc3986)
 * [Interplanetary Filesystem](https://ipfs.io/)
-* [IPFS Merkle DAGs](https://docs.ipfs.io/concepts/merkle-dag/)
 * [IPFS Content Addressing](https://docs.ipfs.io/concepts/content-addressing/)
 * [IPFS Hashing](https://docs.ipfs.io/concepts/hashing/)
-* [Multibase](https://github.com/multiformats/multibase)
-* [Multihash](https://github.com/multiformats/multihash)
 
-# Appendix: Ideal Application Behaviour
+# Appendix: Application Behavior
+
+In this appendix, we consider how our example ecosystems enforce artifact
+integrity.
+
+## Git
 
 For example, consider what happens when a Git commit is manually overwritten
 with different information.
@@ -467,3 +354,7 @@ used in a significant operation, its hash is checked against the contents. As
 such, attempting to `git checkout` a commit that has been tampered with, for
 example, will result in an error. On the other hand, viewing it via `git log`,
 `git show`, and `git cat-file` will not result in an error.
+
+## IPFS
+
+TODO: John?