From e705ce850778a6d2b7006b71e308227f43f4de77 Mon Sep 17 00:00:00 2001
From: Andrew Nesbitt
Date: Mon, 20 May 2019 09:48:34 +0100
Subject: [PATCH] Move documents into /docs folder

---
 README.md                           |  11 +-
 docs/abstractions.md                |  94 ++++++++++++
 blockers.md => docs/blockers.md     |   6 +-
 categories.md => docs/categories.md |   0
 docs/concepts.md                    | 226 ++++++++++++++++++++++++++++
 docs/decentralization.md            | 202 +++++++++++++++++++++++++
 glossary.md => docs/glossary.md     |   0
 docs/linking.md                     |  61 ++++++++
 docs/papers.md                      |  34 +++++
 docs/problems.md                    |  57 +++++++
 docs/tree.md                        |  47 ++++++
 11 files changed, 733 insertions(+), 5 deletions(-)
 create mode 100644 docs/abstractions.md
 rename blockers.md => docs/blockers.md (93%)
 rename categories.md => docs/categories.md (100%)
 create mode 100644 docs/concepts.md
 create mode 100644 docs/decentralization.md
 rename glossary.md => docs/glossary.md (100%)
 create mode 100644 docs/linking.md
 create mode 100644 docs/papers.md
 create mode 100644 docs/problems.md
 create mode 100644 docs/tree.md

diff --git a/README.md b/README.md
index 1d83097..f85e6c2 100644
--- a/README.md
+++ b/README.md
@@ -7,9 +7,16 @@ IPFS Package Managers Special Interest Group
-- [Package Management Glossary](glossary.md)
-- [Package Management Categories](categories.md)
 - [Package Manager list](package-managers)
+- [Package Management Glossary](docs/glossary.md)
+- [Package Management Categories](docs/categories.md)
+- [How IPFS Concepts map to package manager concepts](docs/concepts.md)
+- [Problems with Package Managers](docs/problems.md)
+- [Facilitating the Correct Abstractions](docs/abstractions.md)
+- [Package indexing and linking](docs/linking.md)
+- [Cladistic tree of depths of integration](docs/tree.md)
+- [Decentralized Publishing](docs/decentralization.md)
+- [Academic papers related to package management](docs/papers.md)
 
 ## Integrations

diff --git a/docs/abstractions.md b/docs/abstractions.md
new file mode 100644
index 0000000..4609d56
--- /dev/null
+++ b/docs/abstractions.md
@@ -0,0 +1,94 @@
+# Facilitating the Correct Abstractions
+
+(Acknowledging, of course, the hubris of the title -- we can only hope and try!)
+
+To contribute meaningfully to advancing the state of package management, we must first understand package management.
+
+The first step in understanding package management is to identify its stages. These are the stages I would identify:
+
+- *[human authorship phase is ready to produce a package]*
+- pack content
+- write release metadata (version name, etc)
+- upload content and release metadata
+- *[-- switch from producer to consumer --]*
+- fetch release metadata
+- transitive dependency resolution
+- lockfile creation
+- *[-- possible switch to even further downstream consumer --]*
+- lockfile read and content fetch
+- content unpack
+- *[a new human authorship phase takes over!]*
+
+![cycle](https://user-images.githubusercontent.com/627638/53887868-6f7ce580-4023-11e9-9109-803fbd5a06ef.jpg)
+
+(The image is an earlier visualization of roughly the same concepts, but pictured with the authorship cycle closed.
+Also note this image contains an "install" phase which is elided in the list above,
+or perhaps equates to "content unpack" depending on your POV; and several other steps
+were combined rather than enumerated clearly.)
+
+Understanding these phases of package management, we can begin to identify what
+might be the key concepts of APIs that haul data between each of the steps.
+And understanding what key concepts and data need to be hauled between each
+step gives us a roadmap to how IPFS/IPLD can help haul that data!
+
+---
+
+Now. There are many interesting things in the above:
+
+- Never forget that rather than a list, there is actually a cycle when creation
+  gets involved. I won't talk about this more in this issue, but in the long
+  run, it's incredibly important to mind how we can close this loop.
+
+- Some of these phases are particularly clear in how they can relate to IPFS!
+  For example, uploading of packages and fetching of packages: clearly, these
+  operations can benefit from IPFS by treating it as a simple content bucket
+  that happens to be particularly well decentralized. Since this is already
+  clear, I also won't talk any more about this in this issue.
+
+- You might have noticed I injected some Opinions into a few of the steps.
+  In particular, that ordering of transitive resolution vs lockfile creation
+  vs metadata fetch is not entirely universally adopted! Some systems skip
+  the lockfile concept entirely and re-do dependency resolution every time they're used!
+  Some systems vary in what the lockfile contains (version numbers that are still
+  technically somewhat vague and need centralized/online translation into content,
+  versus content-identifiers/hashes, etc). Of course, systems vary *wildly*
+  in terms of what information they actually act on and what exact logic they
+  use for transitive dependency resolution. And alarmingly, most systems don't
+  clearly separate metadata fetch from resolution processes at all.
+
+---
+
+That last set of things I really want to focus in on.
+
+I think my biggest takeaway by far from the last couple of years of thinking about this whole domain is that segmenting resolve from all other operations is absolutely Of The Essence.
+It's the point that never ceases to be contended, and for fundamental rather than incidental reasons: it is *correct* for different situations and packagers and user stories to use different resolution strategies.
+
+It's also (what a coincidence) the key API concept that lets IPFS help other systems while keeping clear boundaries that let them get on with whatever locally contendable (e.g. language-specific) logic they need to.
+
+---
+
+But here we've got a bummer.
Essentially no modern package managers I can think
of intentionally designed their resolve stages to be separate and pluggable.
+
+The more we encourage separation of resolve from the steps that follow it,
+the clearer it becomes for every system to have lockfiles;
+and the more things have lockfiles, the happier we are,
+because the jump from lockfile to content-addressable distribution system gets
+more incremental and becomes more obviously a right choice.
+But this is already widely clear and quite popular!
+
+More interesting is what happens when we encourage separation of resolve from
+the steps that *precede* it -- namely, from "metadata fetch".
+
+If we can encourage a world of package managers which have clearly delineated
+boundaries between metadata fetch and the evaluation of transitive dependency resolution
+upon that metadata, we both get clearer points for integrating IPFS/IPLD in the
+metadata distribution, AND we provide a huge boost to enabling "reproducible resolve" --
+an issue I've written more about [here](https://repeatr.io/vision/strategy/reproducible-resolve/) in the Timeless Stack docs --
+which sets up the whole world nicely for bigger and better ecosystems of reproducible builds.
+
+---
+
+Thank you for coming to my Github issue / thinkpiece.
+
+Where can we go from here? No idea: I just want to put all these thoughts out there to cook. We'll probably want to consider these in roadmapping anything beyond the most short-term basic content-bucket integrations; and perhaps start circulating concepts like separating resolve from metadata transport sooner rather than later, to prepare the ground for future work in that direction.
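The stage separation argued for above can be made concrete with a small sketch: metadata fetch, transitive resolution, and lockfile creation as separate, swappable steps. All names and data shapes here are hypothetical illustrations, not any real package manager's API.

```python
# Sketch of pluggable package-management stages. The `index` dict stands in
# for a remote registry; a real system would fetch metadata over the network.

def fetch_metadata(index, names, seen=None):
    """Stage 1: fetch release metadata for packages and, transitively, their deps."""
    seen = seen if seen is not None else {}
    for name in names:
        if name in seen:
            continue
        seen[name] = index[name]  # stand-in for a network fetch
        for version_info in index[name].values():
            fetch_metadata(index, version_info["deps"], seen)
    return seen

def resolve_newest(metadata, names):
    """Stage 2: one pluggable resolution strategy -- always pick the highest version."""
    lock = {}
    todo = list(names)
    while todo:
        name = todo.pop()
        if name in lock:
            continue
        version = max(metadata[name])  # naive lexicographic max, for illustration
        lock[name] = version
        todo.extend(metadata[name][version]["deps"])
    return lock

# Toy metadata: package -> version -> {"deps": [...]}
index = {
    "app":    {"1.0.0": {"deps": ["libfoo"]}},
    "libfoo": {"1.0.0": {"deps": []}, "1.0.1": {"deps": []}},
}

metadata = fetch_metadata(index, ["app"])     # metadata fetch, separate from...
lockfile = resolve_newest(metadata, ["app"])  # ...resolution, which writes the lockfile
print(lockfile)  # {'app': '1.0.0', 'libfoo': '1.0.1'}
```

Because `resolve_newest` only sees the already-fetched metadata, a different strategy (oldest-compatible, pinned, offline) can be dropped in without touching the fetch step, which is exactly the boundary where IPFS/IPLD could take over metadata distribution.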
diff --git a/blockers.md b/docs/blockers.md
similarity index 93%
rename from blockers.md
rename to docs/blockers.md
index 9ece04e..d61b6f2 100644
--- a/blockers.md
+++ b/docs/blockers.md
@@ -12,13 +12,13 @@ Filestore expects files to be immutable once added, so rsyncing updates to exist
 
 Adding a directory of files to MFS means calling out to `ipfs files write` for every file; ideally there should be one command to write a directory of files to MFS.
 
-Alternative approach may be to mount MFS as a fuse filesystem (ala https://github.com/tableflip/ipfs-fuse) 
+Alternative approach may be to mount MFS as a fuse filesystem (ala https://github.com/tableflip/ipfs-fuse)
 
 ### Updating rolling changes requires rehashing all files
 
-If there is a regular cron job downloading updates to a mirror with rsync, there's currently no easy way to only re-add the files that have been added/changed/removed without rehashing every file in the whole mirror directory. 
+If there is a regular cron job downloading updates to a mirror with rsync, there's currently no easy way to only re-add the files that have been added/changed/removed without rehashing every file in the whole mirror directory.
 
-Mounting MFS as a fuse filesystem (ala https://github.com/tableflip/ipfs-fuse) and rsyncing directly onto fuse may be one approach. 
+Mounting MFS as a fuse filesystem (ala https://github.com/tableflip/ipfs-fuse) and rsyncing directly onto fuse may be one approach.
 
 Alternatively there could be an `ipfs rsync` command line tool that could talk directly with the rsync server protocol.
diff --git a/categories.md b/docs/categories.md
similarity index 100%
rename from categories.md
rename to docs/categories.md
diff --git a/docs/concepts.md b/docs/concepts.md
new file mode 100644
index 0000000..b0e0c5a
--- /dev/null
+++ b/docs/concepts.md
@@ -0,0 +1,226 @@
+# How IPFS Concepts map to package manager concepts
+
+## The address of a package
+
+HTTP is the most popular method for downloading the actual contents of a package: given a URL, the package manager client makes an HTTP request, and the response body is the package contents, which gets saved to disk.
+
+The package URL usually contains the registry domain, package name and version number:
+
+    http://package-manager-registry.com/package-name/1.0.0.tar.gz
+
+When you want to download a package using IPFS, rather than a URL that contains the name and version, you instead provide a cryptographic hash of the contents that you’d like to receive, for example:
+
+    /ipfs/QmTeHfjrEfVDUDRootgUF45eZoeVxKCy3mjNLA8q5fnc1B
+
+You may notice that, unlike with the URL, there is no domain name. Because that hash uniquely describes the contents, IPFS doesn’t need to load it from a particular server; it can request that package from anyone else who is sharing it on IPFS, without worrying if it has been tampered with.
+
+## Indexing packages
+
+To work out the address of a package that you’d like to download, first you need to find out a few details about it. This is done using a package index, which is one of the primary roles of a registry.
+
+Let’s say you already know the name of the package that you’d like to grab, “libfoobar”. To construct an HTTP URL you’re now only missing one part: the version number. Registries usually provide a way of finding out what the available version numbers for a package are. One way is with a JSON API over HTTP:
+
+    http://package-manager-registry.com/libfoobar/versions
+
+That returns a list of available version numbers:
+
+```
+[
+  {
+    "number": "0.0.1",
+    ...
+  },
+  {
+    "number": "0.2.0",
+    ...
+  },
+  {
+    "number": "1.0.0",
+    ...
+  },
+  {
+    "number": "1.0.1",
+    ...
+  }
+]
+```
+
+To enable clients to download packages over IPFS, you can provide the cryptographic hash (CID) of the contents of each version of the package along with its number:
+
+```
+[
+  {
+    "number": "0.0.1",
+    "cid": "/ipfs/QmTeHfjrEfVDUDRootgUF45eZoeVxKCy3mjNLA8q5fnc1B"
+    ...
+  },
+  {
+    "number": "0.2.0",
+    "cid": "/ipfs/QmTeHfj83a09e6b0da3a6e1163ce53bd03eebfc1c507ds"
+    ...
+  },
+  {
+    "number": "1.0.0",
+    "cid": "/ipfs/QmTeHfjf778979bb559fbd3c384d9692d9260d5123a7b3"
+    ...
+  },
+  {
+    "number": "1.0.1",
+    "cid": "/ipfs/QmTeHfjrEfVDUDRootgUFc9cbd8e968edfa8f22a33cff7"
+    ...
+  }
+]
+```
+
+IPFS doesn’t just have to store the package contents; it can also store the JSON list of available versions:
+
+```
+$ ipfs add versions.json
+=> QmctG9GhPmwyjazcpseajzvSMsj7hh2RTJAviQpsdDBaxz
+```
+
+Doing this introduces some challenges. For one thing, the CID of a list of versions for a package isn’t human readable, so a second way of finding that CID is needed.
+
+One way to solve that would be to create another index of all package names, with the CID of the JSON file listing the versions of each package:
+
+```
+[
+  {
+    "name": "libfoo",
+    "versions": "/ipfs/QmTeHfjrEfVDUDRootgUF45eZoeVxKCy3mjNLA8q5fnc1B"
+    ...
+  },
+  {
+    "name": "libbar",
+    "versions": "/ipfs/QmTeHfj83a09e6b0da3a6e1163ce53bd03eebfc1c507ds"
+    ...
+  },
+  {
+    "name": "libbaz",
+    "versions": "/ipfs/QmTeHfjf778979bb559fbd3c384d9692d9260d5123a7b3"
+    ...
+  }
+]
+```
+
+You could create these linked JSON files manually, but IPFS already has a technology for this: IPLD.
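The two linked JSON files above imply a two-step lookup: package name to the CID of its versions list, then version number to the CID of the release contents. Here is a minimal sketch of that walk, with a dict standing in for IPFS and made-up illustrative CIDs:

```python
# Sketch of resolving a package through linked JSON indexes.
# fetch() stands in for `ipfs cat <cid>`; the CIDs below are placeholders.

def fetch(cid, store):
    return store[cid]  # a real implementation would fetch this from IPFS

store = {
    "/ipfs/QmNamesIndex...": [
        {"name": "libfoo", "versions": "/ipfs/QmLibfooVersions..."},
    ],
    "/ipfs/QmLibfooVersions...": [
        {"number": "1.0.0", "cid": "/ipfs/QmRelease100..."},
        {"number": "1.0.1", "cid": "/ipfs/QmRelease101..."},
    ],
}

# Step 1: name -> CID of the versions list
names = fetch("/ipfs/QmNamesIndex...", store)
versions_cid = next(e["versions"] for e in names if e["name"] == "libfoo")

# Step 2: version number -> CID of the release contents
versions = fetch(versions_cid, store)
release_cid = next(v["cid"] for v in versions if v["number"] == "1.0.1")
print(release_cid)  # /ipfs/QmRelease101...
```

Every hop is itself content-addressed, which is exactly the shape of data IPLD is designed to link natively.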
+
+Another challenge is that every time there’s a new version of “libfoobar” released, the contents of versions.json change, which produces a different hash when added to IPFS. The index of all the package names can be updated after each release as well, producing a merkle tree of package data.
+
+But of course the index of all package names then has the same problem: it gets updated whenever any package has a new version released.
+
+There are a couple of IPFS technologies to help with this: IPNS and DNSLink.
+
+IPNS allows you to create a mutable link to content on IPFS. You can think of an IPNS name as a pointer to an IPFS hash, which may change in the future to point to a different IPFS hash.
+
+For example, we could use IPNS to point to the hash of a JSON file of released versions of libfoo, taking the CID of versions.json and publishing it to IPNS:
+
+```
+$ ipfs name publish QmctG9GhPmwyjazcpseajzvSMsj7hh2RTJAviQpsdDBaxz
+=> Published to QmSRzfkzkgofxg2cWKiqhTQRjscS4DC2c8bAD2TbECJCk6: /ipfs/QmctG9GhPmwyjazcpseajzvSMsj7hh2RTJAviQpsdDBaxz
+```
+
+Now we can use the IPNS address to load versions.json:
+
+```
+$ ipfs cat /ipns/QmSRzfkzkgofxg2cWKiqhTQRjscS4DC2c8bAD2TbECJCk6
+=> [
+     {
+       "number": "0.0.1",
+       "cid": "/ipfs/QmTeHfjrEfVDUDRootgUF45eZoeVxKCy3mjNLA8q5fnc1B"
+       ...
+     },
+     ...
+   ]
+```
+
+After a new version of libfoo is published, we can add the tarball to IPFS and edit versions.json to include it:
+
+```
+[
+  {
+    "number": "0.0.1",
+    "cid": "/ipfs/QmTeHfjrEfVDUDRootgUF45eZoeVxKCy3mjNLA8q5fnc1B"
+    ...
+  },
+  {
+    "number": "0.2.0",
+    "cid": "/ipfs/QmTeHfj83a09e6b0da3a6e1163ce53bd03eebfc1c507ds"
+    ...
+  },
+  {
+    "number": "1.0.0",
+    "cid": "/ipfs/QmTeHfjf778979bb559fbd3c384d9692d9260d5123a7b3"
+    ...
+  },
+  {
+    "number": "1.0.1",
+    "cid": "/ipfs/QmTeHfjrEfVDUDRootgUFc9cbd8e968edfa8f22a33cff7"
+    ...
+  },
+  {
+    "number": "2.0.0",
+    "cid": "/ipfs/Qm384d9692d9260d5123a7b3UFc9cbd8e968edfa8f2233"
+    ...
+  }
+]
+```
+
+and then add the updated versions.json to IPFS:
+
+```
+$ ipfs add versions.json
+=> added QmcCmvc9K7fVeY9xQEj3xoa9HWnxs2M5zX97wTAvTQ61a9 versions.json
+```
+
+Then we can update that same IPNS address to point to the new copy of versions.json:
+
+```
+$ ipfs name publish QmcCmvc9K7fVeY9xQEj3xoa9HWnxs2M5zX97wTAvTQ61a9
+=> Published to QmSRzfkzkgofxg2cWKiqhTQRjscS4DC2c8bAD2TbECJCk6: /ipfs/QmcCmvc9K7fVeY9xQEj3xoa9HWnxs2M5zX97wTAvTQ61a9
+```
+
+The same IPNS address now points to the new versions.json file:
+
+```
+$ ipfs cat /ipns/QmSRzfkzkgofxg2cWKiqhTQRjscS4DC2c8bAD2TbECJCk6
+=> [
+     {
+       "number": "0.0.1",
+       "cid": "/ipfs/QmTeHfjrEfVDUDRootgUF45eZoeVxKCy3mjNLA8q5fnc1B"
+       ...
+     },
+     ...
+     {
+       "number": "2.0.0",
+       "cid": "/ipfs/Qm384d9692d9260d5123a7b3UFc9cbd8e968edfa8f2233"
+       ...
+     }
+   ]
+```
+
+DNSLink gives you similar functionality to IPNS but adds a dependency on DNS. It works by using a domain name instead of a hash, for example:
+
+    /ipns/package-manager-registry.com
+
+To configure that domain name to point to a specific IPFS hash, you add a DNS TXT record with content in the following form:
+
+    dnslink=/ipfs/<CID>
+
+When requesting content from a DNSLink address, IPFS will look up the TXT record on that domain name and then fetch the content for the CID stored within it.
+
+DNSLink can be useful for adding human readable names, as well as adding a layer of social trust for users already familiar with your domain name, and at the time of writing DNSLink is quite a bit faster than using IPNS.
+
+One downside of DNSLink is that updating a DNS record every time you wish to make a change can be fiddly: not every DNS provider has an API that can be used to automate the action, and in some cases DNS propagation can take hours for changes to be visible worldwide.
+
+IPNS names on the other hand have the added benefit that they work purely using IPFS technology and can be resolved without needing any traditional infrastructure, as well as working “offline”.
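The DNSLink convention above is simple enough to sketch directly: the TXT record's value is the literal string `dnslink=` followed by an IPFS path. A minimal parser (reusing the versions.json CID from the examples above as the record value):

```python
# Sketch of parsing a DNSLink TXT record value into an IPFS path.

def parse_dnslink(txt_value):
    """Extract the IPFS path from a dnslink TXT record value."""
    prefix = "dnslink="
    if not txt_value.startswith(prefix):
        raise ValueError("not a dnslink record")
    return txt_value[len(prefix):]

record = "dnslink=/ipfs/QmctG9GhPmwyjazcpseajzvSMsj7hh2RTJAviQpsdDBaxz"
print(parse_dnslink(record))  # /ipfs/QmctG9GhPmwyjazcpseajzvSMsj7hh2RTJAviQpsdDBaxz
```

A registry would update that TXT record after each `ipfs add` of a new versions.json, which is the "fiddly" step the paragraph above describes.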
+## Checking for new versions of packages
+
+TODO
+
+## Publishing packages
+
+TODO
+
+## Verifying package contents
+
+TODO
diff --git a/docs/decentralization.md b/docs/decentralization.md
new file mode 100644
index 0000000..b604b2a
--- /dev/null
+++ b/docs/decentralization.md
@@ -0,0 +1,202 @@
+# Decentralized Publishing
+
+I started thinking about what the parts of a fully decentralized package manager would look like, ignoring to a certain extent what the actual technology doing the work would be, as a follow-up to https://github.com/ipfs/package-managers/issues/50.
+
+In the majority of issues we've discussed mirroring and supporting existing package manager registries, which generally take the form of a centralized authority managing publishing to a particular index.
+
+_The exception being [Registry-less](https://github.com/ipfs/package-managers/blob/master/categories.md#registry-less) package managers like Go, Swift, Carthage, Deno etc that use DNS as the centralized authority, where package/release identifiers are URLs, sometimes with shorthands for GitHub URLs._
+
+Some terminology:
+
+**Release:** The actual artifact of software being published, as source code and/or binary, usually delivered as a compressed archive (tar, zip etc). Releases may contain some metadata about their contents, but do not know their own CID within a content-addressable system like IPFS.
+
+**Version index:** Data for a collection of one or more releases of a particular software project, including the human readable name/number of each release, the content address of the release, and usually some integrity data to confirm the content at that address matches what was expected (with IPFS that comes for free). It may also contain some information about the state of each release (deprecated, insecure etc).
+
+**Version index owner:** A record of one or more identities that can do one or more of the following to a version index:
+- add a new release
+- update an existing release
+- remove an existing release
+
+In a content addressable world, they are actually publishing a new version of the version index when making one of those changes.
+
+**Package index:** Data for a collection of one or more version indexes, including the human readable name for each version index and its content address.
+
+**Package index owner:** A record of one or more identities that can do one or more of the following to a package index:
+- add a new version index to the package index
+- update an existing version index
+- remove an existing version index
+
+In a content addressable world, they are actually publishing a new version of the package index when making one of those changes.
+
+![Screenshot 2019-05-01 at 14 35 13](https://user-images.githubusercontent.com/1060/57025551-53668080-6c2f-11e9-880f-1d095bc93242.png)
+
+## Identity
+
+Identity and permissions around changing indexes depend on the [category](https://github.com/ipfs/package-managers/blob/master/categories.md#implementation-categories) of package manager.
+
+For example:
+
+**File system based:** These usually store releases, version indexes and package indexes all within the same root filesystem, meaning that the administrators of that system usually have permission to update all indexes. Those administrators also have the ability to make more administrators, but must trust them with all indexes.
+
+**Git based:** These usually store releases, version indexes and package indexes all within the same git repository, meaning that the administrators of that repository usually have permission to update all indexes. Those administrators also have the ability to make more administrators, but must trust them with all indexes.
+
+**Database based:** These tend to have more fine-grained permissions: identities can be given limited access to only certain version indexes, and only the ability to update information about those version indexes in the package indexes. Those identities often have the ability to manage access to those version indexes by other identities as well. Access can also be limited to certain actions, for example disabling the deletion of data from indexes. Administrators of the database have access to all indexes.
+
+**Registry-less:** The permission setup can vary with URL-based setups. Administrators who control the DNS for a domain have the ultimate ability to change the data available from URLs on that domain, although usually the web server that the DNS record resolves to takes care of permission setups; any administrator of that web server usually also has access to change the data for that domain.
+
+One difference with registry-less systems is that there is usually no single administrator that has control of DNS for all domains involved (ignoring top level DNS authorities like ICANN), although within a private network a network administrator can override certain levels of DNS configuration, though other security features like TLS may limit functionality in that case.
+
+In a decentralized package manager, identity and access controls require a different approach, due to the lack of an administrator with access over the whole system.
+
+Rather than there being just one version index for a project with a unique, human readable name, there can instead be any number of instances of a version index for a project, each one owned by an identity which factors into the globally unique name for that version index.
+
+Because there is no central naming authority, any identity has the ability to publish a new release of a project and update their own copy of a version index to include that release.
+ +For example, below are two different version indexes created by two different identities, which share the first two releases: + +```json +{ + "owner_identity": "dave:4ba0d198425a112c335a112c334e92848d76905e69a60de3f57e", + "version_index_name": "libfoo", + "releases": [ + { + "number": "1.0.1", + "cid": "Qm2453256c...510b2" + }, + { + "number": "1.0.2", + "cid": "Qm3f53392...4f877" + } + ] +} +``` + +```json +{ + "owner_identity": "lucy:dc99e9aa86fab83a062cff5e0808391757071a3d5dbb942802d5", + "version_index_name": "libfoo", + "releases": [ + { + "number": "1.0.1", + "cid": "Qm2453256c...510b2" + }, + { + "number": "1.0.2", + "cid": "Qm3f53392...4f877" + }, + { + "number": "1.0.3", + "cid": "Qm20d2ca...b41a4" + } + ] +} +``` + +Those indexes are effectively named `%{owner_identity}:%{version_index_name}`, for example `lucy:dc99e9aa86fab83a062cff5e0808391757071a3d5dbb942802d5:libfoo`, but these names are not very meaningful or memorable to humans. + +_Note: I'm skimming over content addressability and versioning of version indexes for simplicity here_ + +Package indexes also have a similar naming approach, where the owner has curated a number of version indexes together. 
Again, any identity has the ability to publish a new package index. For example, two similar package indexes with libfoo from different version indexes:
+
+```json
+{
+  "owner_identity": "molly:f6f7e983afc59354c91673d637c22072ec68f710794f899ae747",
+  "package_index_name": "lib_node_modules",
+  "version_indexes": {
+    "libfoo": "lucy:dc99e9aa86fab83a062cff5e0808391757071a3d5dbb942802d5:libfoo",
+    "libbar": "fred:d0cfc2e5319b82cdc71a33873e826c93d7ee11363f8ac91c4fa3:libbar",
+    "libbaz": "anna:55579b557896d0ce1764c47fed644f9b35f58bad620674af23f3:libbaz"
+  }
+}
+```
+
+```json
+{
+  "owner_identity": "andrew:d979885447a413abb6d606a5d0f45c3b7809e6fde2c83f0df",
+  "package_index_name": "lib_node_modules",
+  "version_indexes": {
+    "libfoo": "dave:4ba0d198425a112c335a112c334e92848d76905e69a60de3f57e:libfoo",
+    "libbar": "fred:d0cfc2e5319b82cdc71a33873e826c93d7ee11363f8ac91c4fa3:libbar",
+    "libbaz": "anna:55579b557896d0ce1764c47fed644f9b35f58bad620674af23f3:libbaz"
+  }
+}
+```
+
+Note here how a package index can provide human-meaningful names for the version indexes it includes. The standalone version index `"dave:4ba0d1...3f57e:libfoo"` is mapped to the short readable name `libfoo`, much in the same way that a version index provides a mapping between the content address of a release and its human-meaningful number: `"dave:4ba0d1...3f57e:libfoo"` maps the content address `Qm3f53392...4f877` to the number `1.0.2`.
+
+This same mapping happens in centralized registries, but it is usually implicit rather than explicit: the npm module `react` is by default a shorthand for https://registry.npmjs.org/react, but an end user may choose to override that default and point to a private registry instead for their project: `https://internal-npm-registry.enterprise.com/react`.
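The naming scheme described above can be sketched in a few lines: a package index maps short local names to fully qualified version index names of the form `owner_identity:index_name`. The identities here are shortened stand-ins, not real keys:

```python
# Sketch of qualified index naming and local-name lookup in a
# decentralized package index. All identities are hypothetical.

package_index = {
    "owner_identity": "molly:f6f7...e747",
    "package_index_name": "lib_node_modules",
    "version_indexes": {
        "libfoo": "lucy:dc99...02d5:libfoo",
        "libbar": "fred:d0cf...4fa3:libbar",
    },
}

def qualified_name(index):
    """Globally unique name for an index: owner identity plus local name."""
    return f"{index['owner_identity']}:{index['package_index_name']}"

def lookup(index, short_name):
    """Translate a human-meaningful short name to a qualified version index name."""
    return index["version_indexes"][short_name]

print(qualified_name(package_index))        # molly:f6f7...e747:lib_node_modules
print(lookup(package_index, "libfoo"))      # lucy:dc99...02d5:libfoo
```

The `lookup` step is the explicit version of what a centralized registry does implicitly when it expands `react` to a registry URL.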
+
+## Dependencies
+
+When a software project declares a dependency on another software project via its package manager configuration, the usual practice is to specify the short human-meaningful name and an acceptable range of releases that it should work with.
+
+Because a release can be included in many version indexes and package indexes, the release itself usually declares its dependency requirements in an index-agnostic way. For example, rather than specifying the full URL (`https://registry.npmjs.org/react`) of a dependency, a short name or alias is used: `react`. This allows different registries to host the same packages without needing to rewrite the metadata for every package with their specific registry details.
+
+This is key for decentralized package managers, as the package index name (`andrew:d979885447a413abb6d606a5d0f45c3b7809e6fde2c83f0df:lib_node_modules`) is much less human-meaningful than a centralized registry URL.
+
+Acceptable version numbers in dependency requirements have similar properties: when declaring a range of acceptable versions, the author is encouraged to be as broad as possible, so as to reduce the likelihood of conflicts with other packages' requirements on that same dependency during resolution.
+
+Unlike package names, version number requirements for dependencies may also attempt to take into account future releases that are likely to work with that release, for example using [semantic versioning](https://semver.org/) to allow for minor patch releases whilst avoiding major breaking-change releases.
+
+From a decentralized package manager's point of view, the key attribute of dependency requirements is that they are agnostic to the exact names of the version and package indexes used, instead preferring the local names (`react`) and number ranges (`>= 1.0.0`) which are defined within the indexes that the end user chooses to use.
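That index-agnostic property can be shown concretely: the same requirement, naming only a short package name and a range, resolved against two different version indexes chosen by the end user. The range handling here is a deliberately naive sketch, not a full semver implementation:

```python
# Sketch of index-agnostic dependency resolution: the requirement ">= 1.0.0"
# says nothing about which version index supplies the candidate releases.

def parse(version):
    """Naive dotted-version parse, e.g. '1.0.3' -> (1, 0, 3)."""
    return tuple(int(p) for p in version.split("."))

def satisfies(version, requirement):
    op, _, target = requirement.partition(" ")
    if op == ">=":
        return parse(version) >= parse(target)
    raise ValueError(f"unsupported operator: {op}")

def pick(available, requirement):
    """Pick the highest available release satisfying the requirement."""
    matches = [v for v in available if satisfies(v, requirement)]
    return max(matches, key=parse) if matches else None

# The same requirement against two different users' chosen version indexes:
lucy_index = ["1.0.1", "1.0.2", "1.0.3"]
dave_index = ["1.0.1", "1.0.2"]
print(pick(lucy_index, ">= 1.0.0"))  # 1.0.3
print(pick(dave_index, ">= 1.0.0"))  # 1.0.2
```

Both users satisfy the same declared requirement, yet may install different releases, which is exactly the flexibility (and the reproducibility hazard) that broad ranges introduce.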
+
+## Discovery
+
+One of the more challenging problems with decentralized package management is discovery: finding information about packages in a network. There are three main categories of discovery:
+
+### Search
+
+Search is the initial starting point of finding an open source package that solves a problem for you, for example "a library to parse xml documents in javascript". This usually involves searching for keyword matches across names, descriptions, tags and other textual metadata, with the aim of finding and comparing details of various matches, resulting in the address of one or more version indexes.
+
+In centralized registries, relevant data from the package index is often replicated to a separate, specialized full-text-search index. More general web search engines like Google also index the contents of individual html package pages.
+
+In a decentralized environment there won't be an automatic index created of every package published, so services may need to be built to help index the variety of new indexes and packages being published, things like:
+
+- opt-in announcement of releases and index updates to indexing services
+- trawling existing known indexes and source code to discover package/version indexes and releases
+- peer-to-peer sharing of package/version indexes and releases, where connected users also keep track of their peers' published package data
+
+### Resolution
+
+Once you have a package that you wish to install, say [`react`](https://www.npmjs.com/package/react), there are two further stages of discovery that happen before the installation is complete:
+
+1. A version index needs to be discovered for that package which provides a list of available releases, of which one will be chosen to be installed that doesn't conflict with other existing package requirements, usually the newest or highest number.
+
+2.
A specific release may declare a number of dependencies by name; [`react@16.8.6`](https://unpkg.com/react@16.8.6/package.json) has four dependencies with version range requirements:
+
+```json
+  "dependencies": {
+    "loose-envify": "^1.1.0",
+    "object-assign": "^4.1.1",
+    "prop-types": "^15.6.2",
+    "scheduler": "^0.13.6"
+  },
+```
+
+For each one of those dependencies the same two discovery steps need to be performed, until the full dependency tree has been resolved.
+
+In a centralized registry this stage of discovery usually involves looking up existing version indexes within the same database; some registries even have specialized APIs to query the indexes in bulk, because the data is all stored in the same place.
+
+This process can also fail in two ways:
+
+1. If you cannot find a version index to match the package name you are searching for. The classic example is [`left-pad`](https://www.theregister.co.uk/2016/03/23/npm_left_pad_chaos/), where a popular package's version index was removed from the central registry completely and many users couldn't successfully install that package.
+
+2. If you cannot find a release number within a version index that satisfies all of the dependency requirements that other packages which depend on it have specified. This may be because an existing release has been removed, or because one of the dependency requirements specifies a release number that conflicts with other dependency requirements within the same tree.
+
+One way that decentralized publishing can help minimize these kinds of discovery problems is via validation on or before the publishing of indexes: ensuring that there are no unreachable parts of a dependency tree within an index, and then automatically advising on how to find and correct those errors if they are caught.
+
+### Updates
+
+Existing software packages often have multiple releases published as bugs are found and fixed or functionality improvements are added.
When these new releases are published, end users of the package need a way of finding out that those new versions are available.
+
+This is usually done by adding the new release to existing version indexes. There are also services that can announce new updates via email, RSS or push notifications, or even as a direct request to update that dependency requirement within a codebase, such as [Dependabot](https://dependabot.com/).
+
+There may also be updates to be discovered about existing releases, such as a state change ("This release has been marked as deprecated"), or a security vulnerability or legal notice that has been published about it. Although the contents of the release are not expected to change, the metadata about it within a version index may be updated to reflect the change.
+
+Whilst centralized registries make it very easy to have update information available to all users, they also make it harder for end users to opt out of getting that update information, which can have a negative impact on the reproducibility of a project's dependency tree, either due to data being removed or new data being added that changes the results of dependency resolution in an undesirable manner.
+
+## Tooling
+
+When it comes to decentralized publishing, and consuming those decentralized indexes, new tooling will need to be built to help encourage the use of standards and compatibility between indexes, as well as extra services for aiding in some of the discovery problems that arise. Experimenting and finding out what those tools might look like would be a useful task before starting to implement decentralized publishing.
+
+One other aspect that I've yet to explore is the idea that version and package index "owners" don't need to be humans, but could instead be software that does the heavy lifting of publishing and curating indexes together.
This could open the door to multiple levels of identities of ownership and connection between different indexes, perhaps even federations of indexes that get combined together by groups or communities of end users.
diff --git a/glossary.md b/docs/glossary.md
similarity index 100%
rename from glossary.md
rename to docs/glossary.md
diff --git a/docs/linking.md b/docs/linking.md
new file mode 100644
index 0000000..f358815
--- /dev/null
+++ b/docs/linking.md
@@ -0,0 +1,61 @@
+# Package indexing and linking
+
+The crux of package management is really the act of maintaining one or more **indexes** of package releases, with some rules around how to group, connect and order releases.
+
+In an index, releases usually have a version identifier (e.g. `1.0.0`) and are grouped together by a project identifier (e.g. `react`), but those identifiers don't have to match up with what the source code contents (often a tarball) of a release believe they are (although they usually do).
+
+Indexes tend to store project/release identifiers, which usually translate into an address that points to where the actual contents of a release are.
+
+This has some benefits: for one, it decouples the index from the storage and transportation of the release content. It also allows the index to be updated, mutated and mirrored without changing any data inside the packages.
+
+Many package managers take further advantage of these indexes when declaring dependencies within a package, using only the bare minimum project identifier with an optional version range selector, relying on the index to provide the available version identifiers within that project identifier group, and then the actual address of the desired release of that dependency once selected.
+
+That separation also requires a lot of trust in both the index and the translation of the addresses the index provides into actual package contents.
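As a concrete sketch of that decoupling (a minimal toy, not any real registry's schema; all names and hashes below are made up), an index is essentially a two-step mapping, and every consumer lookup trusts both steps:

```python
# Toy model of an index decoupled from content storage.
# Step 1: (project, version) -> content address (the index).
# Step 2: content address -> release contents (the content store).

index = {
    "react": {
        "16.8.5": "QmFakeCidForRelease1685",  # made-up addresses
        "16.8.6": "QmFakeCidForRelease1686",
    },
}

content_store = {
    "QmFakeCidForRelease1685": b"tarball bytes for 16.8.5",
    "QmFakeCidForRelease1686": b"tarball bytes for 16.8.6",
}

def fetch_release(project: str, version: str) -> bytes:
    """Translate identifiers to an address, then the address to contents."""
    address = index[project][version]  # trusts the index entry
    return content_store[address]      # trusts the address -> content mapping

assert fetch_release("react", "16.8.6") == b"tarball bytes for 16.8.6"

# If the index entry is removed (cf. left-pad), the content may still
# exist in the store, but consumers can no longer find it:
del index["react"]["16.8.5"]
try:
    fetch_release("react", "16.8.5")
except KeyError:
    pass  # lookup fails even though content_store still holds the bytes
```

Mutating or mirroring `index` never touches `content_store`, which is the benefit described above; the cost is that every lookup depends on the index still holding the entry.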
There are a lot of assumptions that the relevant data that was in the index when a release was initially published will continue to be there.
+
+*Side note: Many package managers address this trust problem by defaulting everyone to a single, large centralized index that only allows certain users to add new data to it. These large indexes often become central points of failure, too large for regular users to mirror and operationally taxing for the maintainers of the index.*
+
+When the contents of a release have so little knowledge of themselves and their context, a release always needs to be paired with a separate index.
+
+So what if... **a package could also be its own index?**
+
+A release can't contain a hash of itself or be updated with information about later releases, but it can contain an IPFS CID of the release that came before it (except for the initial release), creating a linked list of previous releases:
+
+```
+              0.1.1 <---- 0.1.2
+             /
+0.0.1 <--- 0.1.0 <--- 1.0.0 <--- 1.1.0 <--- 1.1.1
+                           \
+                            1.0.1 <--- 1.0.2
+```
+
+#### Why is this useful?
+
+One of the most frequent lookups a package manager will do in an index is to get the list of available releases for a given project identifier; the resulting list of releases is then passed into dependency resolution.
+
+- Reproducibility: If you already have a particular package, the list of available versions that can be pulled from this linked list of previous packages will always be the same.
+
+- Trust: Once published, the list of previous versions for that package is immutable. This reduces the reliance on external indexes, which can change the items in a list of available versions.
+
+- Size: Rather than having one large index that contains all available releases across all packages, this approach breaks it up into many small indexes, one per package, and community indexes then only need to index the leaf nodes of each package tree.
+
+#### What's missing?
+
+- Discovery: Users expect to be able to run `xpm update` and automatically discover newer releases than the ones they already have. This still requires a separate index, although that index only needs to point at the latest version rather than keep track of all possible versions.
+
+#### A note about dependencies
+
+You could use this same method for referencing other packages' linked lists (via IPFS CID) for dependencies: pointing to the tip of a branch from your release, which then points to an immutable, reproducible transitive dependency tree, again without an index:
+
+```
+0.0.1 <--- 0.0.2 <--- 0.1.0 <--- 1.0.0 <--- 1.1.0 <--- 1.1.1
+                /                     /
+   dep@1.0.1 <--- dep@2.0.2 <--- dep@2.1.0 <--- dep@3.0.0
+                     /                  /
+     trans-dep@0.0.1 <--- trans-dep@0.0.1
+```
+
+Whilst this is very reliable for reproducibility, fast-moving communities will often want to pick up the latest versions of transitive dependencies without requiring changes to intermediate dependencies.
+
+This is one of the pain points for [gx](https://github.com/whyrusleeping/gx), as often the developer does not have the ability to publish new releases of an intermediate dependency to the central index.
+
+One solution to this may be to enable users to easily create and manage their own small indexes, possibly even at a per-application level, which would allow easy forking and patching of dependencies without needing to change project identifiers. I'll be writing more on this idea soon.
diff --git a/docs/papers.md b/docs/papers.md
new file mode 100644
index 0000000..da93ae0
--- /dev/null
+++ b/docs/papers.md
@@ -0,0 +1,34 @@
+## Academic papers related to package management
+
+I started making a list of published papers that appear relevant to package manager research and software dependencies, and figured it'd be useful to share it. Feel free to add more in the comments.
+
+- [A Look In the Mirror: Attacks on Package Managers](https://isis.poly.edu/~jcappos/papers/cappos_mirror_ccs_08.pdf)
+- [Automatic Software Dependency Management using Blockchain](http://trap.ncirl.ie/3300/1/gavindmello.pdf)
+- [Reflections on Trusting Trust](https://www.archive.ece.cmu.edu/~ganger/712.fall02/papers/p761-thompson.pdf)
+- [SPAM: a Secure Package Manager](https://cseweb.ucsd.edu/~dstefan/pubs/brown:2017:spam.pdf)
+- [Toward Decentralized Package Management](https://portail.telecom-bretagne.eu/publi/public/fic_download.jsp?id=5756)
+- [Why do software packages conflict?](https://dl.acm.org/citation.cfm?id=2664470)
+- [Dependency solving](https://dl.acm.org/citation.cfm?id=2330431)
+- [A modular package manager architecture](https://dl.acm.org/citation.cfm?id=2401012)
+- [A method to generate traverse paths for eliciting missing requirements](https://dl.acm.org/citation.cfm?id=3290697)
+- [MPM: a modular package manager](https://dl.acm.org/citation.cfm?id=2000255)
+- [A look in the mirror: attacks on package managers](https://dl.acm.org/citation.cfm?id=1455841)
+- [Towards efficient optimization in package management systems](https://dl.acm.org/citation.cfm?id=2568306)
+- [On software component co-installability](https://dl.acm.org/citation.cfm?id=2025149)
+- [Package upgrades in FOSS distributions: details and challenges](https://dl.acm.org/citation.cfm?id=1490292)
+- [On the use of package managers by the C++ open-source community](https://dl.acm.org/citation.cfm?id=3167290)
+- [An adaptive package management system for scheme](https://dl.acm.org/citation.cfm?id=1297093)
+- [NixOS: a purely functional Linux distribution](https://dl.acm.org/citation.cfm?id=1411255)
+- [On the topology of package dependency networks: a comparison of three programming language ecosystems](https://dl.acm.org/citation.cfm?id=3003382)
+- [When It Breaks, It Breaks: How Ecosystem Developers Reason about the Stability of
Dependencies](https://dl.acm.org/citation.cfm?id=2916370)
+- [On the Development and Distribution of R Packages: An Empirical Analysis of the R Ecosystem](https://dl.acm.org/citation.cfm?id=2797476)
+- [Software engineering with reusable components](https://dl.acm.org/citation.cfm?id=260943)
+- [A look at the dynamics of the JavaScript package ecosystem](https://dl.acm.org/citation.cfm?id=2901743)
+- [A historical analysis of Debian package incompatibilities](https://dl.acm.org/citation.cfm?id=2820545)
+- [Mining component repositories for installability issues](https://dl.acm.org/citation.cfm?id=2820524)
+- [The Evolution of Project Inter-dependencies in a Software Ecosystem: The Case of Apache](https://dl.acm.org/citation.cfm?id=2550583)
+- [OPIUM: Optimal Package Install/Uninstall Manager](https://dl.acm.org/citation.cfm?id=1248851)
+- [How the Apache community upgrades dependencies: an evolutionary study](https://dl.acm.org/citation.cfm?id=2821962)
+- [The Spack Package Manager: Bringing Order to HPC Software Chaos](https://tgamblin.github.io/pubs/spack-sc15.pdf)
+- [PubGrub: Next-Generation Version Solving](https://medium.com/@nex3/pubgrub-2fb6470504f)
+- [EasyBuild: Building Software With Ease](https://easybuilders.github.io/easybuild/files/easybuild-PyHPC-SC12_paper.pdf)
diff --git a/docs/problems.md b/docs/problems.md
new file mode 100644
index 0000000..2c0f963
--- /dev/null
+++ b/docs/problems.md
@@ -0,0 +1,57 @@
+# Problems with Package Managers
+
+I thought it would be an interesting exercise to outline problems that package publishers, package consumers and package manager maintainers currently experience, rather than outlining the benefits of IPFS and looking for places where those benefits can be applied to package management.
+
+Doing this may highlight areas where introducing IPFS can provide significant improvements that wouldn't be a clear headline feature of IPFS.
+
+For example, IPFS can offer large savings in bandwidth costs, but almost all the large community package managers are offered free CDN and hosting services by Fastly, Cloudflare, Bintray etc, so package managers don't see that as a problem right now.
+
+Not all problems will be present in all package managers, and IPFS definitely won't be able to fix all of them, but this list may spark some interesting ideas that would otherwise have been missed, as well as helping us to focus our efforts on the key problems that IPFS can help with.
+
+Feel free to add other problems or group similar problems together; this list is off the top of my head and will likely be missing loads of things.
+
+## Package consumers
+
+- Package releases being removed from registries
+- Package maintainers transferring ownership to unknown entities
+- Registry downtime stopping the building/development of software that depends on it
+- Not being able to reproduce a known working set of dependencies at a later date
+- Not being able to opt out of using new releases which have breaking changes
+- Not being able to update/fix/swap old packages deep within a dependency tree when they cause problems
+- Not being able to resolve conflicting dependency requirements when building a dependency tree
+- Not knowing the status of a package (deprecated, unmaintained, broken etc)
+- Not being able to confirm that the contents of a downloaded package are the same as what was originally published by the author
+- Not being able to install more than one version of a dependency at once
+- Not being able to install or build packages whilst offline
+- Not being able to efficiently review the downloaded code within each dependency of an application
+- Not being able to filter packages by compatible licenses
+- Not being able to validate the license/copyright/patent details of a package
+- Hosting internal or private mirrors of registries is time consuming and requires ongoing maintenance and
infrastructure costs
+- Language-level packages that depend on system-level packages do not communicate or automate those dependency links effectively
+- Not being able to find new packages that are relevant to consumers' interests/needs
+- Difficult to know when new releases of packages are published
+- Difficult to know if a new release of a package is stable
+
+## Package Producers
+
+- Communicating with consumers of packages
+- Maintaining compatibility with multiple platforms, architectures and runtimes
+- Vetting contributions for security concerns
+- Coordinating key-signing infrastructure between maintainers is time consuming and error prone
+- Difficult to test a package against a range of different versions of upstream and downstream dependencies
+- Difficult to know what percentage of consumers are using newer releases vs. stuck on old releases due to incompatibilities
+- Difficult to discourage users from using broken/deprecated releases
+
+## Package Manager Maintainers
+
+- Difficulties funding maintenance, development and infrastructure costs
+- A large amount of trust placed in a very small number of maintainers
+- Heavy support burden from both consumers and producers
+- Having to police illegal/pirated/malicious packages in the registry
+- Recovery of data after loss due to server failure or security breach
+- Deploying significant changes can result in community backlash
+- Difficulties communicating with package consumers and producers
+- Difficult to hand over control/trust when maintainers step down
+- Distributing infrastructure globally can be costly/complex
+- Greatly increased infrastructure costs when storing binaries as well as source code
+- Package managers often can't use themselves for managing dependencies, due to bootstrapping issues, resulting in duplicated effort and reduced productivity
diff --git a/docs/tree.md b/docs/tree.md
new file mode 100644
index 0000000..b32e519
--- /dev/null
+++ b/docs/tree.md
@@ -0,0 +1,47 @@
+# Cladistic
tree of depths of integration
+
+Preface: Understanding package managers -- let alone comprehensively enough to begin to describe the design spaces for growing them towards decentralization and imagining the design space of future ones -- is eternally tricky. I've been hoping to get some more stuff rolling in terms of either short checklists, or terminology that we might define to help shortcut design discussions to clarity faster, and so on. This is another attempt, and it certainly won't be an end-all or all-inclusive, but it's a shot.
+
+This is a quick, fairly high-level outline of tactical implementation choices that one could imagine making when designing a package manager with _some_ level of IPFS integration.
+
+---
+
+We could present this as a cladistic tree of tactical choices, one boolean choice at a time:
+
+- using IPFS: NO
+  - (effect: womp womp)
+- using IPFS: YES
+  - aware of IPFS for content: NO
+    - (effect: this is implemented by doing crass bulk snapshotting of some static http filesystem. works, but bulky.)
+  - aware of IPFS for content: YES
+    - (effect: means we've got CIDs in the index metadata.)
+    - uses IPFS for index: NO
+      - (effect: well, okay, at least we provided a good content bucket!)
+      - (effect: presumably this means some centralized index service is in the mix. we don't have snapshotting over it. we're no closer to scoring nice properties like 'reproducible resolve'.)
+    - uses IPFS for index: YES
+      - (effect: pkgman clients can do full offline operation!)
+      - (effect: n.b., it's not clear at this stage whether we're fully decentralized: keep reading.)
+      - index is just bulk files: YES
+        - (effect: easiest to build this way. leaves a lot of work on the client. not necessarily very optimal -- need to get the full index as files even if you're only interested in a subset of it.)
+        - (effect: this does not get us any closer to 'pollination'-readiness or subset-of-index features -- it effectively forces centralization for updating the index file...!!!)
+      - index is just bulk files: NO
+        - (effect: means we've got some index features implemented in IPLD.)
+        - (effect: dedup for storage should be skyrocketing, since IPLD-native implementations naturally get granular copy-on-write goodness.)
+        - (effect: depends on how well it's done...! but this might get us towards subsets, pollination, and real decentralization.)
+      - TODO: EXPAND THIS. What choices are important to get us towards subsets, pollination, and real decentralization?
+
+The "TODO" on the end is intentional :) and should be read as an open invitation for future thoughts. I suspect there's a lot of diverse choices possible here.
+
+---
+
+Another angle for looking at this is to ask which broad categories of operation are part of the duties typically expected of the hosting side of a package manager:
+
+- distributing (read-only) content
+- distributing (read-only) "metadata"/"index" data -- mapping package names to content IDs, etc
+- updating the "metadata"/"index" -- accepting new content, reconciling metadata changes, etc
+
+This is a much more compact view of things, but interestingly, the cladistic tree above maps surprisingly closely to it: as the tree gets deeper, we're basically moving from "distributing content" to "distributing the index" to (at the end, in the TODO-space of the cladistic tree) "distributed publishing".
+
+---
+
+These are just a couple of angles to look at things from. Are there better ways to lay this out? Can we resolve deeper into that tree to see what yet-more-decentralized operations would look like?
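As one small experiment toward the "short checklists" mentioned in the preface, the boolean branch points of the cladistic tree above can be sketched as a checklist that derives some of the effects discussed; the field names and rules here are illustrative, not a settled taxonomy:

```python
# Hypothetical checklist form of the cladistic tree: each boolean is one
# branch point, and properties() derives the effects discussed above.
from dataclasses import dataclass

@dataclass
class Integration:
    uses_ipfs: bool = False
    content_aware: bool = False   # "aware of IPFS for content"
    index_on_ipfs: bool = False   # "uses IPFS for index"
    index_is_ipld: bool = False   # i.e. NOT "index is just bulk files"

    def properties(self) -> list:
        props = []
        if self.uses_ipfs and self.content_aware:
            props.append("CIDs in index metadata")
        if self.uses_ipfs and self.index_on_ipfs:
            props.append("full offline operation")
        if self.uses_ipfs and self.index_is_ipld:
            props.append("granular copy-on-write dedup")
        return props

# A design that snapshots its index to IPFS but keeps it as bulk files
# gets offline operation, yet none of the IPLD-native goodness:
bulk = Integration(uses_ipfs=True, content_aware=True, index_on_ipfs=True)
assert bulk.properties() == ["CIDs in index metadata", "full offline operation"]
```

Extending `properties()` -- say, with rules for "pollination" or subset-of-index support -- would be one way to make the TODO branch of the tree concrete.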