\chapter{The evolution \statusgreen}\label{chap:vision}
This chapter provides background information about the motivation for and evolution of Swarm and its vision today. Section \ref{sec:historical_context} lays out a historical analysis of the World Wide Web, focusing on how it became the place it is today.
Section \ref{sec:fair-data} introduces the concept and emphasises the importance of data sovereignty, collective information, and a \gloss{fair data economy}. It discusses the infrastructure a self-sovereign society will need to collectively host, move, and process data.
Finally, Section \ref{sec:vision} recaps the core values underlying the vision, spells out the requirements of the technology, and establishes the design principles that guide us in manifesting Swarm.
\section{Historical context \statusgreen}\label{sec:historical_context}
\green{}
While the Internet in general---and the \gloss{World Wide Web} (\gloss{WWW}) in particular---has dramatically reduced the costs of disseminating information, these costs are still not zero, and their allocation heavily influences who gets to publish what content and who will consume it.
In order to appreciate the problems we are trying to solve, a little journey into the historical evolution of the \gloss{World Wide Web} proves to be helpful.
\subsection{Web 1.0 \statusgreen}\label{sec:web_1}
In the era of \gloss{Web 1.0}, in order to have your content accessible to the whole world, you would typically fire up a web server or use some free or cheap web hosting space to upload your content, which could then be navigated through a series of HTML pages. If your content was unpopular, you still had to bear the cost of maintaining the server or paying for hosting to keep it accessible. However, true disaster struck when, for one reason or another, it became popular (e.g. you got "slashdotted"). At this point, your traffic bill skyrocketed just before either your server crashed under the load or your hosting provider throttled your bandwidth to the point of making your content essentially unavailable for the majority of your audience. If you wanted to stay popular, you had to invest in high-availability clusters connected with fat pipes; your costs grew together with your popularity, without any obvious way to cover them. There were very few practical ways to allow (let alone require) your audience to directly share the ensuing financial burden.
The prevailing belief at the time was that the \gloss{ISP} would come to the rescue and resolve these challenges. In the early days of the Web revolution, bargaining about peering arrangements between ISPs often centred on where providers and consumers were located and on which ISP was making money from the other's network. Indeed, when there was a sharp imbalance between originators of TCP connection requests (aka SYN packets), it was customary for the ISP originating the request to compensate the recipient ISP. This setup somewhat incentivised the recipient ISP to help support those hosting popular content. In practice, however, this incentive structure usually led to some ISPs putting a free \emph{pr0n} or \emph{warez} server in the server room to tilt back the scales of SYN packet counters. This meant that blogs catering to niche audiences had no way of competing and were generally left out in the cold. Note, however, that back then, creator-publishers still typically owned their content.
\subsection{Web 2.0 \statusgreen}\label{sec:web_2}
The transition to \gloss{Web 2.0} changed much of that. The migration from personal home pages running on one's own server, using Tim Berners-Lee's elegantly simple and accessible hypertext markup language, to server-side scripting using CGI gateways, Perl, and the inexorable PHP, led to a departure from the beautiful idea that anyone could write and run their own website using simple tools. This divergence set the web on a path towards a prohibitively difficult and increasingly complex stack of scripting languages and databases. Suddenly, the World Wide Web wasn't a beginner-friendly place any more. At the same time, new technologies emerged, allowing the creation of web applications with simple user interfaces that enabled unskilled publishers to simply POST their data to the server and divorce themselves from the responsibility of actually delivering the bits to their end users. This marked the birth of Web 2.0.
Capturing the initial maker spirit of the web, sites like MySpace and Geocities now ruled the waves. These platforms offered users a personalised corner of the internet to call their own, complete with as many scrolling text marquees, flashing pink glitter Comic Sans banners, and all the ingenious XSS attacks a script kiddie could dream of. It was like a web within the web---a welcoming and open environment for users to start publishing their own content, increasingly without the need to learn HTML, and without rules. Platforms abounded, and suddenly there was a place for everyone, a phpBB forum for every niche interest imaginable. The web became full of life and the dotcom boom showered Silicon Valley in riches.
Of course, this youthful naivete, this fabulous rainbow-coloured playground, wouldn't last. The notoriously flaky MySpace fell victim to its open policy of allowing scripting, which left users' pages unreliable and rendered the platform virtually unusable. When Facebook arrived with a clean-looking interface that simply worked, it became clear that MySpace's time was up, and people migrated in droves. The popular internet acquired a more self-important undertone, and we filed into the stark white corporate office of Facebook. But there was trouble brewing. While offering this service for "free," Mr. Zuckerberg and others had an agenda. In exchange for hosting our data, we (the dumb f*cks; \citealp{carlson2010ims}) would have to trust him with it. Obviously, we did. For the time being, there was ostensibly no business model beyond luring in more venture finance, amassing huge user-bases, and a "we'll deal with the problem later" attitude. But from the start, extensive and unreadable T\&Cs granted all the content rights to the platforms. While in the Web 1.0 era, it was easy to keep a backup of your website and migrate to a new host, or simply host it from home yourself, now those with controversial views had a new problem to deal with: "deplatforming".
At the infrastructure level, this centralisation became evident through the emergence of unthinkably huge data-centres. Jeff Bezos evolved his book-selling business to become the richest man on Earth by providing solutions for those who struggled with the technical and financial hurdles of implementing increasingly complex and expensive infrastructure. This new constellation of web services was capable of dealing with those irregular traffic spikes which would have crippled widely successful content in the past. Soon enough, a significant portion of the web came to be hosted by a handful of large companies. Corporate acquisitions and an endless stream of VC money led to a greater and greater concentration of power. A forgotten alliance of open-source programmers, creators of the royalty-free Apache web server, and Google, which introduced paradigm-shifting methods for organising and accessing vast amounts of data, dealt a crippling blow to Microsoft's attempt to force the web into a hellish, proprietary existence forever imprisoned in Internet Explorer 6. Of course, Google eventually accepted "parental oversight," shelved its "don't be evil" promise, succumbed to its very own form of megalomania, and began to eat the competition. Steadily, email became Gmail, online ads became AdSense, and Google crept into every facet of daily life on the web.
On the surface, everything was rosy. A technological utopia hyper-connected the world in a way no-one could have imagined. No longer was the web just for academics and the super 1337—it made the sum of human knowledge available to anyone, and now that smartphones had become ubiquitous, it could be accessed anywhere. Wikipedia gave everyone superhuman knowledge, while Google allowed us to find and access it in a moment. Facebook gave us the ability to communicate with everyone we had ever known, for free. However, underneath all this, there was one problem buried just below the glittering facade. Google knew what they were doing. So did Amazon, Facebook and Microsoft. And so did some punks, since 1984.
Once they had acquired a massive number of users, the time had finally come for these behemoth platforms to cut a cheque to investors. They could no longer put off figuring out a business model. To provide value back to shareholders, the platforms turned to advertising revenue as a panacea. Google and the other platforms may have investigated other potential sources of income; however, no significant alternatives were adopted. Now the web started to get complicated, and distracting. Advertisements appeared everywhere, and the pink flashing glitter banners were back, this time pulling your attention from the content you came for to deliver you to the next user acquisition opportunity.
And as if this weren't enough, there were more horrors to come. The Web lost the last shred of its innocence when the proliferation of data became unwieldy, and algorithms were used to "improve" our access to the content we wanted. Now that the platforms had all our data, they were able to analyse it to work out what we wanted to see, seemingly knowing us even better than we knew ourselves. However, there was a hidden catch---these all-encompassing data sets and secret algorithms were available for sale to the highest bidder. Deep-pocketed political organisations could target swing voters with unprecedented precision and efficacy. Cyberspace suddenly became all too real, while consensus reality became a thing of the past. News not only became fake; it evolved into personally targeted manipulation, often nudging us to act against our best interests without us even realising it. The desire to save on hosting costs had turned everyone into a target, a readily controllable puppet.
At the same time, more terrifying revelations lay in wait. The egalitarian ideals that once underpinned the construction of a trustful internet proved to be the most naive of all. The DoD, the very institution that had facilitated its adoption since the early days, now sought to reclaim control. Edward Snowden walked out of the NSA with a virtual stack of documents, the contents of which no one could have imagined (unless, of course, you had thought the Bourne Conspiracy was a documentary). It turned out that the protocols had been subverted, and all the warrant canaries were long dead. The world's major governments had been running a surveillance dragnet on the entire global population—incessantly storing, processing, cataloguing, indexing, and providing on-demand access to the sum total of a person's online activity. It was all available at the touch of an XKeyscore button—an all-seeing, unblinking Optical Nerve determined to "collect it all" and "know it all", no matter the person or the context. Big Brother turned out to look like Sauron.
This gross erosion of privacy, along with many other similar efforts by various power-drunk or megalomaniac state and individual actors across the world to track and censor oppressed people, political adversaries, and journalists, provided impetus for the Tor project. An extraordinary collaboration between the US military, MIT, and the EFF, the Tor project offered a means to obfuscate the origin of a request and deliver content in a protected, anonymous manner. While wildly successful and a household name in some niches, Tor has not seen much mainstream adoption due to its relatively high latency, a result of its inherent inefficiencies.
By the time of Snowden's revelations, the web had become ubiquitous and an integral part of almost every facet of human life, with the vast majority of it operated by large corporations. While reliability problems were now a thing of the past, there was a price to pay. Content producers were offered context-sensitive, targeted advertising models in a Faustian bargain. The offers came with a knowing grin, revealing that these corporations knew content producers had no choice. "We will give you scalable hosting that will cope with any traffic your audience throws at it", they sang, "but in return, you must give us control over your content. We are going to track each member of your audience and collect (and own, *whistle*) as much of their personal data as possible. We will, of course, decide who can and who cannot see it, as is our right, no less. Additionally, we will proactively censor it and share your data with authorities as necessary to protect our business." As a consequence, millions of small content producers created immense value for a handful of mega corporations, getting peanuts in return---typically, free hosting and advertisement. What a deal!
Setting aside the resulting FUD of the Web 2.0 data and news apocalypse that we witness today, there are also a couple of technical problems with the web's underlying architecture. The corporate approach has led to a centralist maxim, where all requests must be routed through some backbone somewhere, to a monolithic data-centre, and then passed around, processed, and finally returned, even if only to send a message to someone in the next room. This client-server architecture has at best flimsy security and has been breached so often that breaches have become the new normal, leaving oil-slicks of unencrypted personal data and even plaintext passwords in its wake, spread all over the web. The last nail in the coffin is the sprawl of incoherent standards and interfaces this has facilitated. Today, spaghetti-code implementations of growing complexity subdivide the web into multifarious micro-services. Even well-funded companies find it increasingly difficult to deal with the development bills, and it is common for fledgling start-ups to drown in a feature-factory sea of spiralling technical debt. A modern web application's stack is, in almost all cases, a cobbled-together Rube Goldberg machine comprising so many moving parts that it is almost impossible even for a supra-nation-state corporation to maintain and develop these implementations without numerous bugs and regular security flaws. Well, except for Google and Amazon, to be honest. At any rate, we're well overdue for a reboot. In the end, it is the data describing our lives that is at stake. They are already trying, but they do not yet have the power to lock us into this mess.
\subsection{Peer-to-peer networks \statusgreen}\label{sec:peer_to_peer}
As the centralised Web 2.0 took over the world, the \gloss{peer-to-peer} (\gloss{P2P}) revolution was also gathering pace, quietly evolving in parallel. P2P traffic rapidly accounted for the majority of packets flowing through the pipes, overtaking the SYN-bait servers mentioned earlier. It proved that by working together to use their hitherto massively underutilised \emph{upstream bandwidth}, end users could achieve the same availability and throughput for their content as previously only achievable with the help of big corporations and their data centres attached to the fattest pipes of the Internet's backbone. What's more, it could be realised at a fraction of the cost. Importantly, users retained far greater control and freedom over their data. Eventually, this mode of data distribution proved to be remarkably resilient even in the face of powerful and well-funded entities' desperate exertions to shut it down.
However, even the most evolved mode of \gloss{P2P} file sharing, tracker-less \gloss{BitTorrent} \citep{pouwelse2005bittorrent} was only that: file-level sharing. This was not at all suitable for providing the kind of interactive, responsive experience that people were coming to expect from web applications on \gloss{Web 2.0}. Additionally, while becoming extremely popular, BitTorrent was not conceived of with economics or game theory in mind. It was very much a product of the era before the world took note of the revolution its namesake would precipitate: that is to say, before anyone understood blockchains and the power of cryptocurrency and incentivisation.
\subsection{The economics of BitTorrent and its limits \statusgreen}
The genius of BitTorrent lies in its clever resource optimisation \citep{cohen2003incentives}: if many clients want to download the same content from a user, it gives them each different parts in the first phase. In the second phase, they can swap the parts between each other in a tit-for-tat fashion until everyone has all the parts. This way, the upstream bandwidth cost for a user hosting content (the \gloss{seeder} in BitTorrent parlance) remains roughly the same, regardless of how many clients download the content simultaneously. This solves the most problematic, ingrained issue of the ancient, centralised, master-and-slave design of \gloss{HTTP}, the protocol underpinning the \gloss{World Wide Web}.
Cheating (i.e.\ feeding your peers with garbage data) is discouraged by the use of hierarchical, piece-wise hashing. Each package offered for download is identified by a single short hash, and any part of it can be cryptographically verified to be a specific component of the package without requiring knowledge of other parts, and incurring only a very small computational overhead.
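To make the verification step concrete, the sketch below (in Go, purely illustrative rather than code taken from any actual BitTorrent client) checks a downloaded piece against the per-piece hash announced in the torrent's metainfo: the piece is accepted only if hashing it reproduces the announced value.
\begin{verbatim}
package main

import (
    "bytes"
    "crypto/sha1"
    "fmt"
)

// verifyPiece accepts a downloaded piece only if its SHA-1 digest
// matches the hash announced for that piece in the torrent metainfo.
// (Illustrative sketch, not code from any particular client.)
func verifyPiece(piece []byte, announced [sha1.Size]byte) bool {
    sum := sha1.Sum(piece)
    return bytes.Equal(sum[:], announced[:])
}

func main() {
    piece := []byte("piece data received from a peer")
    announced := sha1.Sum(piece) // in reality taken from the metainfo
    fmt.Println("piece accepted:", verifyPiece(piece, announced))
}
\end{verbatim}
Because each piece is checked independently against its own hash, garbage data from a cheating peer is detected and discarded at the cost of a single hash computation, without needing any other part of the file.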
But this beautifully simple approach has five consequential shortcomings, all somewhat related \citep[see][]{locher2006free,piatek2007incentives}:
\begin{labelledlist}
\item[\emph{lack of economic incentives}]
There are no built-in incentives for users to seed content for others to download. In particular, there is no way to exchange the upstream bandwidth provided by seeding for the downstream bandwidth required for downloading content. Effectively, the upstream bandwidth spent seeding content to other users goes unrewarded. Because freeing up as much upstream bandwidth as possible can improve the experience of, say, online gaming, switching seeding off can be a rational, if selfish, choice. Add some laziness, and it stays off forever.
\item[\emph{initial latency}]
Typically, downloads start slowly and with some delay. Clients that are further ahead in downloading have significantly more to offer to newcomers than newcomers can offer in return; that is, newcomers have nothing (yet) that those further ahead want to download. The result is that BitTorrent downloads start as a trickle before turning into a full-blown torrent of bits. This peculiarity has severely limited the use of BitTorrent for interactive applications that require both fast responses and high bandwidth, even though it would otherwise constitute a brilliant solution for many games.
\item[\emph{lack of fine-grained content addressing}]
Small \glossplural{chunk} of data can only be shared as part of a larger file; they cannot be pinpointed for targeted retrieval that leaves the rest of the file out, which would optimise access. Peers for a download can only be found by querying the \gloss{distributed hash table} (\gloss{DHT}) for a desired \emph{file}; it is not possible to look for peers at the chunk level. As the advertising of available content happens exclusively at the level of files, this leads to inefficiencies: the same chunks of data can often appear verbatim in multiple files, so while, theoretically, all peers who have a chunk could provide it, there is no way to find those peers without the name or announced hash of the chunk's enveloping file.
\item[\emph{no incentive to keep sharing}]
Nodes are not rewarded for their sharing efforts (storage and bandwidth) once they have achieved their objective, i.e.\ retrieving all desired files from their peers.
\item[\emph{no privacy or ambiguity}]
Nodes publicly advertise exactly the content they are seeding, making it easy for attackers to discover the IP addresses of peers hosting content they would like to see removed. Adversaries can then easily DDoS those peers, while corporations and nation states are able to petition the ISP for the physical location of the connection. This has led to a grey market of VPN providers helping users circumvent such exposure. Although these services offer assurances of privacy, it is usually impossible to verify them, as their systems are typically closed-source.
\end{labelledlist}
While spectacularly popular and useful, BitTorrent is only a rudimentary, albeit undoubtedly ingenious, first step. It is amazing how far we can get simply by sharing our upstream bandwidth, hard-drive space, and tiny amounts of computing power---even despite the lack of proper accounting and indexing. However---surprise!---by adding just a few more emergent technologies to the mix, especially the \gloss{blockchain}, we get something that truly deserves the \gloss{Web 3.0} moniker: a decentralised, censorship-resistant platform for sharing and collectively creating content while retaining full control over it. What's more, the cost of this is almost entirely covered by using and sharing the resources supplied by the breathtakingly powerful, underutilised super-computer (by yesteryear's standards :-) ) that most of us already own.
\subsection{Towards Web 3.0 \statusgreen}\label{sec:towards-web3}
% 0/ intro talk about the limitations and problems with web2 app architecture
% 1/ why the game has changed in a post-satoshi world
% As the blockchain has brought us the ability to
% 2/ why swarm represents a further iteration on this change and makes the whole thing usable, how it overcomes limitations of the blockchain, emphasis the VC problem, talk about making the web fun again
% 3/ some short exploration of current attempts to provide this and their potential limitations, but keep this short, unemotional and unbiased
% 4/ drum up to grand ending of how swarm will provide trustless computing save the world etc. etc.
\begin{centerverbatim}
The Times 03/
Jan/2009 Chancel
lor on brink of
second bailout f
or banks
\end{centerverbatim}
At 6:15 on the 3rd of January 2009, a mysterious Cypherpunk created the first block of a chain that would encircle the entire world, fundamentally changing the way we think about money, forever. The genie of \emph{cryptocurrency} was out of the bottle. Satoshi Nakamoto had achieved something no one else could—the de facto (albeit small scale) disintermediation of banks through decentralised, trustless value transfer. Ever since that moment, we have effectively returned to the gold standard—everyone can now own \emph{sound} money. Money that no-one can multiply or inflate out of your pocket. What's more, now we can even issue our own currency, complete with an arbitrary monetary policy and a globally deployed electronic transmission system. It is still not well understood how much this will change our economies, but the system attracted an unprecedented degree of wealth, withdrawn from traditional asset classes, leading to a total capitalisation of crypto surpassing the staggering one trillion US dollars.
This first step was a monumental turning point. Now we had authentication and value transfer baked into the system at its very core. But as much as it was conceptually brilliant, it had some minor and not-so-minor problems with utility. It allowed the transmission of digital value, one could even 'colour' the coins or transmit short messages like the one above that marks the fateful date of the first block—but that's it. And, regarding scale... every transaction must be stored on every node. Sharding was not built in. Worse, protecting the digital money made it necessary for every node to process exactly the same transactions as every other node, all the time. This was the opposite of a parallelised computing cluster, and millions of times slower.
When Vitalik conceived of Ethereum, he accepted some of these limitations, but the utility of the system took a massive leap. He added the facility for Turing-complete computation via the \gloss{Ethereum Virtual Machine} which enabled a cornucopia of applications that could run in this trustless setting. The concept was at once a dazzling paradigm shift and a consistent evolution of Bitcoin, which itself was based on a tiny virtual machine, with every single transaction really being—unbeknownst to many—a mini program. But Ethereum went all the way, and that again changed everything. The possibilities were numerous and alluring, and \gloss{Web 3.0} was born.
However, there was still a problem to overcome to fully transcend the world of Web 2.0—storing data on the blockchain was prohibitively expensive for anything but a tiny amount. Both Bitcoin and Ethereum had taken the layout of BitTorrent and run with it, complementing the architecture with the capability to transact, but leaving consideration about storing non-systemic data for later. Bitcoin had introduced a less secure second circuit below the distribution of blocks: candidate transactions are shipped around without fanfare, as secondary citizens, literally without protocol. Ethereum went further, separating out the headers from the blocks, creating a third tier that ferried the actual block data as needed. Because both classes of data are essential to the operation of the system, these could be called critical design flaws. Bitcoin's maker probably didn't envision mining becoming the exclusive domain of a highly specialised elite. Any transactor would have been expected to be able to mine their own transactions. Ethereum's developers faced the even harder challenge of data availability, and perhaps assuming it could be addressed later, ignored it for the moment.
In other news, the straightforward data dissemination approach of \gloss{BitTorrent} was successfully implemented for web content distribution by \gloss{ZeroNet} \citep{zeronet}. However, because of the aforementioned issues with BitTorrent, ZeroNet turned out to be unable to support the responsiveness that users of modern web services have come to expect.
In an effort to enable responsive \glossplural{distributed web application} (or \glossplural{dapp}), the \gloss{InterPlanetary File System} or \gloss{IPFS} \citep{ipfs2014} introduced its own major improvements over BitTorrent. One stand-out feature was the highly web-compatible, URL-based retrieval scheme. In addition, the directory of available data---the indexing, organised (as in BitTorrent) as a \gloss{DHT}---was vastly improved, allowing users to also search for small parts of any file, known as \glossplural{chunk}.
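As a rough illustration of chunk-level content addressing (this is not the actual chunking scheme of IPFS or Swarm; the chunk size and hash function below are assumptions made for the example), the following sketch splits a byte stream into fixed-size chunks and derives an address for each chunk from the hash of its content. Identical chunks appearing in different files then resolve to the same address and can, in principle, be fetched from any peer that holds them.
\begin{verbatim}
package main

import (
    "crypto/sha256"
    "fmt"
)

// chunkSize is an assumed, illustrative chunk size; real systems
// use their own sizes (Swarm, for instance, works with 4 KB chunks).
const chunkSize = 4096

// chunkAddresses splits data into fixed-size chunks and returns one
// content address per chunk, here simply the SHA-256 digest of the
// chunk's bytes.
func chunkAddresses(data []byte) [][32]byte {
    var addrs [][32]byte
    for start := 0; start < len(data); start += chunkSize {
        end := start + chunkSize
        if end > len(data) {
            end = len(data)
        }
        addrs = append(addrs, sha256.Sum256(data[start:end]))
    }
    return addrs
}

func main() {
    data := make([]byte, 10000) // placeholder content
    for i, addr := range chunkAddresses(data) {
        fmt.Printf("chunk %d -> %x...\n", i, addr[:8])
    }
}
\end{verbatim}
Because the address is derived purely from the chunk's content, peers can be queried for individual chunks rather than whole files, which is exactly the fine-grained addressing that file-level trackers and DHT lookups lack.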
There are numerous other efforts to fill the gap and provide a worthy Web 3.0 surrogate for the constellation of servers and services that Web 2.0 developers have come to expect—to offer a path to emancipation from the existing dependency on centralised architecture that enables the data reapers. These are not insignificant roles to supplant, as even the simplest web app today relies on a vast array of concepts and paradigms which have to be remapped into the \gloss{trustless} setting of Web 3.0. In many ways, this problem is proving to be perhaps even more nuanced than implementing trustless computation on the blockchain. Swarm responds to this with a variety of carefully designed data structures that enable application developers to recreate familiar Web 2.0 concepts. Swarm reimagines the current offerings of the web and re-implements them based on solid cryptoeconomic foundations.
Imagine a sliding scale, starting on the left with large file size, low frequency of retrieval, and a monolithic \gloss{API}. To the right are small data packets, high frequency of retrieval, and a nuanced API. On this spectrum, file storage and retrieval systems like a POSIX filesystem, S3, Storj, and BitTorrent live on the left-hand side. Key--value stores like LevelDB and databases like MongoDB or Postgres live on the right. Building a useful app requires different modalities scattered across the scale. Furthermore, there must be the ability to combine data where necessary and ensure only authorised parties have access to protected data. In the centralised model, handling these problems is initially easy but gets more difficult with growth, and each range of the scale has a solution from one piece of specialised software or another. However, in the decentralised model, all bets are off. Authorisation must be handled with cryptography, limiting the combination of data. As a result, in the nascent, evolving Web 3.0 stack of today, many solutions deal piecemeal with only part of this spectrum of requirements. In this book, you will learn how Swarm spans the entire spectrum while providing high-level tools for the new guard of Web 3.0 developers. The hope is that, from an infrastructure perspective, working on Web 3.0 will feel like the halcyon days of Web 1.0, while delivering unprecedented levels of agency, availability, security, and privacy.
To address the need for privacy to be baked in at the core level in file-sharing---as it is so effectively attained in Ethereum---Swarm enforces anonymity at an equally fundamental and absolute level. Lessons from Web 2.0 have taught us that trust should be given responsibly and only to those that are deserving of it and will treat it with respect. Data is toxic \citep{schneier2019Jul}, and we must treat it delicately in order to be responsible to ourselves and to those for whom we take responsibility. Later, we will explain how Swarm provides complete and fundamental user privacy.
Of course, to fully transition to a Web 3.0-decentralised world, we must address the dimensions of incentives and trust, which are traditionally "solved" by handing over responsibility to (often untrustworthy) centralised gatekeepers. As we have noted, BitTorrent also grappled with this problem and responded with various seed ratios and private (i.e., centralised) trackers.
The problem of lacking incentives for reliably hosting and storing content is apparent in various projects like ZeroNet or MaidSafe. Incentivising distributed document storage is still a relatively new research field, especially in the context of blockchain technology. The Tor network has seen suggestions for incentivisation schemes \citep{jansen2014onions,ghoshetal2014tor} but they have mostly been academic exercises and not integrated into the heart of the underlying system. Bitcoin has been repurposed to drive other systems like Permacoin \citep{miller2014permacoin}, while some have created their own blockchain, such as Sia \citep{vorick2014sia} or \citet{filecoin2014} for \gloss{IPFS}. BitTorrent is currently testing the waters of blockchain incentivisation with its own token \citep{tron2018,bittorrent2019}. However, even with all of these approaches combined, many hurdles remain to fulfil the specific requirements of a Web 3.0 dapp developer.
Later on, we will explore how Swarm provides a full suite of incentivisation measures and implements other checks and balances to ensure that nodes work together for the benefit of the entire... swarm. This includes the option to rent out large amounts of disk space to those willing to pay for it—irrespective of the popularity of their content—while also enabling the deployment of interactive dynamic content to be stored in the cloud, a feature we call \gloss{upload and disappear}.
The objective of any incentive system for \gloss{peer-to-peer} content distribution is to encourage cooperative behaviour and discourage \gloss{freeriding}: the uncompensated depletion of limited resources. The \gloss{incentive strategy} outlined here aspires to satisfy the following constraints:
\begin{itemize}
\item it is in the node's own interest, regardless of whether or not other nodes follow it
\item it makes it expensive to deplete other nodes' resources
\item it does not impose unreasonable overhead
\item it plays nice with "naive" nodes
\item it rewards those that play nice, including those following this strategy
\end{itemize}
In the context of Swarm, storage and bandwidth are the two most important scarce resources, and this is reflected in our incentives scheme. Bandwidth incentives aim to achieve speedy and reliable data provision, while storage incentives are designed to ensure long-term data preservation. This comprehensive approach caters to all web application development requirements and aligns incentives so that each individual node's actions benefit not only itself, but the whole of the network.
\section{Fair data economy \statusgreen}\label{sec:fair-data}
\green{}
In the era of \gloss{Web 3.0}, the Internet is no longer just a niche space where geeks play, but has become a vital conduit of value creation and has generated a huge share of overall economic activity. Yet, the data economy in its current state is far from fair, as the distribution of the spoils is controlled by those who control the data—mostly companies keeping it to themselves in isolated \glossplural{data silo}. To achieve the goal of a \gloss{fair data economy}, many social, legal, and technological issues will have to be tackled. We will now present some of the issues as they currently exist, and describe how Swarm aims to address them.
\subsection{The current state of the data economy \statusgreen} \label{sec:dataeconomy}
Digital mirror worlds already exist---virtual expanses that contain shadows of physical things, consisting of unimaginably large amounts of data \citep{MirrorWorlds2020Feb}. As more and more data continues to be synced to these parallel worlds, new infrastructure, markets, and business opportunities arise. Only relatively crude methods exist for measuring the size of the data economy as a whole, but one estimate places the financial value of data (including related software and intellectual property) in the USA at \$1.4 trillion to 2 trillion in 2019 \citep{MirrorWorlds2020Feb}. The EU Commission projects the figures for the data economy in the EU27 for 2025 at €829 billion (up from €301 billion in 2018; \citealp{EUDataStrategy2020Feb}).
Despite this huge amount of value, the asymmetric distribution of the wealth generated by the existing data economy has been put forward as a major humanitarian issue \citep{TheWinner2020Feb}. While unceasing increases in quality and quantity of data have led to ever-greater levels of efficiency and productivity, the resulting profits have been distributed unequally. The larger the company's data set, the more it can learn from the data, the more users it can attract, and the more data it can accumulate. Currently, this is most apparent with the dominating large tech companies such as \gloss{FAANG}, but it is predicted to become increasingly significant in non-tech sectors and even in nation states. Hence, companies are racing to become dominant in their respective sectors, granting an advantage to the countries hosting these platforms. Since Africa and Latin America host so few of these, they risk becoming exporters of raw data and paying other countries to import the intelligence derived from it, as warned by the United Nations Conference on Trade and Development \citep{TheWinner2020Feb}. Another problem arises when a large company monopolises a particular data market, as it can become the sole purchaser of data, exerting complete control over price setting. This control opens up the possibility of manipulating the "wages" offered for providing data, thereby keeping them artificially low. In many ways, we are already seeing evidence of this.
% move this?
% As a solution, citizens could organise into "data co-operatives", who would then act as trade unions do in conventional economy.
Flows of data are becoming increasingly blocked and filtered by governments, using the familiar reasoning based on the protection of citizens, sovereignty, and the national economy \citep{VirtualNationalism2020Feb}. Leaks by several security experts have revealed that, for governments to give proper consideration to national security, data should be kept close to home and not left to reside in other countries. GDPR is one such instance of a "digital border" that has been erected—data may leave the EU only if appropriate safeguards are in place. Other countries, such as India, Russia, and China, have implemented their own geographic limitations on data. The EU Commission has pledged to closely monitor the policies of these countries and address any restrictions to data flows during trade negotiations. Additionally, the EU Commission takes the necessary measures within the World Trade Organization \citep{EUWhitePaperAI2020Feb} to advocate for fair and unrestricted data exchange practices.
Despite the growing interest in the ebb and flow of data, the big tech corporations maintain a firm grip on much of it, and the devil is in the details. Swarm's privacy-first model ensures that no data needs to be divulged to any third parties, and everything is end-to-end encrypted out of the box, preventing service providers from aggregating and leveraging giant data sets. As a result, instead of being concentrated with the service provider, control of the data remains decentralised and with the individual to whom it pertains. And with that, so does the power. Expect bad press.
\subsection{The current state and issues of data sovereignty \statusgreen }\label{sec:data-sovereignty}
As a result of the Faustian bargain described above, the current model of the \gloss{World Wide Web} suffers from several flaws. The unforeseen consequences of economies of scale in infrastructure provisioning and of network effects in social media have turned platforms into massive data silos, holding large amounts of user data that are retained, shared, or deleted at the whim of a single organisation. This 'side effect' of the centralised data model empowers large private corporations to collect, aggregate, and analyse user data with their data siphons positioned right at the central bottleneck—the cloud servers where everything meets. This is exactly what David Chaum predicted in 1984, which ignited the Cypherpunk movement, serving as a vital inspiration for Swarm.
The increasing shift from human-mediated interactions to computer-mediated ones, combined with the rise of social media and smartphones, has led to more and more information about our personal and social lives becoming readily accessible to the companies provisioning the data flow. They have access to lucrative data markets where user demographics are linked with underlying behaviours, enabling them to understand users better than they understand themselves. This is the ultimate treasure trove for marketeers.
Data companies, including large tech corporations and other entities involved in collecting, aggregating, and analysing vast amounts of user data, have meanwhile evolved their business models, now focusing on capitalising on data sales rather than the service they initially provided. Their primary source of revenue is now selling the results of user profiling to advertisers, marketeers, and others who seek to "nudge" members of the public. The cycle continues through eerily tailored advertisements served to users on the same platforms, measuring their reactions and using those reactions to better elicit the desired behaviour in future advertisements, thus creating an unending feedback loop. A whole new industry has grown out of this torrent of information; sophisticated systems have emerged to predict, guide, and influence users in order to capture their attention and money. The industry openly and knowingly exploits human cognitive biases, often resorting to highly developed and cynically calculated psychological manipulation. The indisputable truth is that mass manipulation in the name of commerce has led to our modern reality, where not even the most aware can truly exercise their freedom of choice and preserve their intrinsic autonomy of preference regarding consumption of content, goods, and services.
The shift in the platforms' focus towards advertisements rather than their intended primary purposes is also reflected in the declining quality of service for users. The needs of users have become secondary to the needs of the "real" customers: the advertisers. This declining user experience and quality of service is exacerbated in the case of social platforms where the inertia of network effect promotes user lock-in. Correcting these misaligned incentives is imperative to providing users with the same services without the unfortunate incentives inevitably resulting from the centralised data model.
Moreover, the lack of control over one's data has serious consequences on the economic potential of the users. Some refer to this situation, somewhat hysterically, as \gloss{data slavery}. But they are technically correct: our digital twins are held captive by these corporations and exploited for revenue generation. As users, we give up a great deal of agency, as the data we freely share is used to manipulate us and make us less well-informed and free.
The current system of keeping data in disconnected data sets has various drawbacks:
\begin{labelledlist}
\item[\emph{Unequal opportunity}] Centralised entities increase inequality as they siphon away a disproportionate amount of profit from the actual creators of the value.
\item[\emph{Lack of fault tolerance}] These datasets have a single point of failure in terms of technical infrastructure, notably security.
\item[\emph{Corruption}] The concentration of decision-making power makes these datasets easier targets for social engineering, political pressure, and institutionalised corruption.
\item[\emph{Single attack target}] The concentration of large amounts of data under the same security system attracts attacks as it increases the potential reward for hackers.
\item[\emph{Lack of service continuity guarantees}] Service continuity is in the hands of the organisation and is only weakly incentivised by reputation. This introduces the risk of inadvertent termination of service due to bankruptcy or regulatory/legal action.
\item[\emph{Censorship}] Centralised control of data access allows for, and in most cases eventually leads to, decreased freedom of expression.
\item[\emph{Surveillance}] Data flowing through centrally-owned infrastructure offers perfect access to traffic analysis and other methods of monitoring.
\item[\emph{Manipulation}] Monopolisation of the display layer enables data harvesters to manipulate opinions by controlling the presentation, order, and timing of data, calling into question the sovereignty of individual decision-making.
\end{labelledlist}
\subsection{Towards self-sovereign data \statusgreen} \label{sec:selfsovereigndata}
We believe that decentralisation is a game-changer that effectively addresses many of the problems listed above.
We argue that blockchain technology is the final missing piece in the puzzle to realise the cypherpunk ideal of a truly self-sovereign Internet. As outlined in the \emph{Cypherpunk Manifesto} by Eric Hughes in 1993 \citep{hughes1993}, "We must come together and create systems which allow anonymous transactions." One of the goals of this book is to demonstrate how decentralised consensus and peer-to-peer network technologies can be combined to form a rock-solid base-layer infrastructure. This foundation is not only resilient, fault tolerant, and scalable, but also egalitarian and economically sustainable, thanks to a well-designed system of incentives. The low barrier to entry for participants keeps incentives adaptive, leading to prices that
automatically converge to the marginal cost. On top of this comes Swarm's strong value proposition in the domain of privacy and security.
Swarm is a \gloss{Web 3.0} stack that is decentralised, incentivised, and secure. In particular, the platform caters to participants with comprehensive solutions for data storage, transfer, access, and authentication—services that are becoming increasingly crucial in economic interactions. Offering universal access with robust privacy guarantees and no limitations from borders or external restrictions, Swarm fosters the spirit of global voluntarism and serves as the foundational \emph{infrastructure for a self-sovereign digital society}.
\subsection{Artificial intelligence and self-sovereign data \statusgreen} \label{sec:AIdata}
Artificial Intelligence (AI) holds great promise for our society, bringing new business opportunities and augmenting the tool sets used by various professions. On the other hand, it also poses a potential threat, as it might displace certain professions and jobs \citep{Lee2018Sep}.
For the prevalent type of AI, machine learning (ML), and particularly deep learning, three essential "ingredients" are required: computing power, models, and data. Today, computing power is readily available, and specialised hardware is being developed to further facilitate processing. An extensive headhunt for AI talent has been taking place for more than a decade, and a few companies have managed to corner the market for workers with the specialised talents needed to build models and analyse data. However, the dirty secret of today's AI and deep learning is that the algorithms, the 'intelligent math', are already commoditised. They are open source and freely available to everyone. This is not what Google or Palantir make their money from. The true 'magic trick' of these companies lies in getting access to the largest possible data sets to unleash the full potential of their AI systems for profitable gains.
The major players in the data economy have been systematically amassing vast amounts of data. They often offer free applications with some utility such as search engines or social media platforms, while secretly collecting and stockpiling user data without explicit consent or awareness. This monopoly on data has allowed multinational companies to make unprecedented profits, with only feeble motions to share the financial proceeds with the individuals whose data they have sold. Potentially much worse though, this hoarded data remains untapped, depriving both individuals and society as a whole of its potential transformative value.
It is perhaps not a coincidence that the major data and AI "superpowers" are the American and Chinese governments, along with major corporations within their respective borders. An AI arms-race is unfolding in full view of the citizens of the world, leaving most other countries lagging behind as "data colonies" \citep{HarariDavos2020Mar}. There are warnings that the current trajectory will lead to China and the United States accumulating an insurmountable advantage as AI superpowers \citep{Lee2018Sep}.
It doesn't have to be so. In fact, it likely won't because the status quo is a bad deal for billions of people. Decentralised technologies and cryptography offer a path towards data privacy while nurturing a fair data economy that retains the benefits of the current centralised system but without its pernicious drawbacks. This is the goal that many consumer and tech organisations across the globe are aiming for. They are working to support the push-back against the big data behemoths as more users begin to realise that they have been swindled into giving away their data. Swarm's infrastructure will play a pivotal role in facilitating this liberation.
Self-sovereign storage might well be the only way that individuals can reclaim control over their data and privacy. It is the first step towards breaking free from filter bubbles and reconnecting with one's own culture. Addressing various challenges of today's Internet and the distribution and storage of data, Swarm is built for privacy from the ground up, incorporating powerful data encryption and ensuring secure and completely leak-proof communication. Furthermore, it enables users to selectively share specific data with third parties at their discretion, allowing for the possibility of financial compensation in return. Payments and incentives are therefore also integral aspects of Swarm.
As Hughes wrote, "privacy in an open society requires anonymous transaction systems. ... An anonymous transaction system is not a secret transaction system. An anonymous system empowers individuals to reveal their identity when desired and only when desired; this is the essence of privacy."
Using Swarm enables leveraging a fuller set of data to create better services while still having the option to contribute to the global good with self-verifiable anonymisation. It's the best of all worlds.
This newfound availability of data, benefitting young academic students and startups with disruptive ideas in the AI and big-data sectors, has immense potential to drive advancements in the entire field. This holds great promise for progress in science, healthcare, the eradication of poverty, environmental protection, disaster prevention, and more. However, despite the eye-catching successes attracting robber barons and rogue states, many sub-fields are currently facing challenges and stagnation. Swarm's data liberation has the potential to break this impasse and unleash the sector's contributions to various domains.
The facilities that Swarm provides will open up a new set of powerful options for companies and service providers. With the widespread decentralisation of data, we can collectively own the extremely large and valuable data sets that are needed to build state-of-the-art AI models. Embracing data portability, a trend already hinted at in traditional tech, will foster competition and personalised services for individuals. The playing field will be levelled for all, driving innovation in line with the demands of the twenty-first century's third decade.
\subsection{Collective information \statusgreen}\label{sec:collective_information}
While \gloss{collective information} has been steadily accumulating since the inception of the Internet, the concept has only recently gained recognition and is now being discussed under a variety of headings such as \emph{open source}, \emph{fair data}, or \emph{information commons}.
A collective, as defined by Wikipedia, is:
\begin{displayquote}
"A group of entities that share or are motivated by at least one common issue or interest, or work together to achieve a common objective."
\end{displayquote}
The internet allows the formation of collectives on a previously unthinkable scale, transcending geographical location, political convictions, social status, wealth, even general freedom, and other demographics. The data produced by these collectives through interactions on public forums, reviews, votes, code repositories, articles, and polls, along with the emergent metadata, all contribute to collective information. Since most of these interactions are currently facilitated by for-profit entities running centralised servers, the collective information ends up stored in data silos owned by commercial entities, often concentrated in the hands of a few large technology companies. And while the actual work results are often in the open, the metadata, which can often be more valuable, powerful, and potentially dangerous, is usually held and monetised in secrecy.
These "platform economies" have already become essential and are only becoming ever more important in a digitised society. However, the information that commercial players acquire about their users is increasingly being leveraged against the users' best interests. To put it mildly, this calls into question whether these corporations can handle the ethical responsibility that comes with the power of managing our \gloss{collective information}.
While many state actors are trying to obtain unfettered access to the collective mass of individuals' personal data, with some countries going as far as demanding magic key-like back-door access, there are exceptions. Since AI has the potential for misuse and ethically questionable use, a number of countries have started "ethics" initiatives, regulations, and certifications for AI use, such as the German Data Ethics Commission or Denmark's Data Ethics Seal.
Yet, even if corporations could be made to act in a more trustworthy manner, as would befit their great responsibility, the mere existence of \glossplural{data silo} stifles innovation. The basic shape of the client-server architecture itself has led to this problem by defaulting to centralised data storage on "servers" within their "farms" (see \ref{sec:web_1} and \ref{sec:web_2}). In contrast, effective peer-to-peer networks like Swarm (\ref{sec:peer_to_peer}) make it possible to alter the very topology of this architecture, enabling the collective ownership of collective information.
\section{The vision \statusorange}\label{sec:vision}
\begin{displayquote}
Swarm is infrastructure for a self-sovereign society.
\end{displayquote}
\subsection{Values \statusorange}\label{sec:values}
Self-sovereignty implies freedom. Breaking this down, freedom entails the following metavalues:
\begin{labelledlist}
\item[\emph{Inclusivity}] public and permissionless participation
\item[\emph{Integrity}] privacy, provable provenance
\item[\emph{Incentivisation}] alignment of interest of node and network
\item[\emph{Impartiality}] content and value neutrality
\end{labelledlist}
These metavalues can be thought of as systemic qualities which contribute to empowering individuals and collectives to gain self-sovereignty.
Inclusivity means that we aspire to include the underprivileged in the data economy and to lower the barrier to entry for defining complex data flows and building decentralised applications. Swarm is a network with open participation for
providing services and permissionless access to publishing, sharing, and investing your data.
While users have the freedom to express their intentions as "actions" and retain full authority to decide whether to remain anonymous or to share their interactions and preferences, the integrity of their online persona must also be upheld.
Economic incentives ensure that participants' behaviour aligns with the desired emergent behaviour of the network (see \ref{sec:incentivisation}).
Impartiality guarantees the neutrality of content and prevents gate-keeping. It also reaffirms that the other three values are not only necessary but also sufficient: it eliminates any values that may treat any particular group as privileged or express a preference for specific content or data from a particular source.
\subsection{Design principles \statusorange}\label{sec:design-principles}
The information society and its data economy bring about an age where online transactions and big data are indispensable to everyday life, making the technical infrastructure supporting them essential for a functional society. It is imperative, therefore, that this base-layer infrastructure be \emph{future-proof} and equipped with robust guarantees for long-term continuity.
Persistence of the technical infrastructure is achieved by the following generic requirements expressed as \emph{systemic properties}:
\begin{labelledlist}
\item[\emph{Stable}] The specifications and software implementations are stable and resilient to changes in participation, or politics (political pressure, censorship).
\item[\emph{Scalable}] The solution is able to accommodate many orders of magnitude more users and data as it scales, without adversely impacting performance or reliability during mass adoption.
\item[\emph{Secure}] The network is resistant to deliberate attacks, remains impervious to social pressure and political influences, and demonstrates fault tolerance in its technological dependencies (e.g. blockchain, programming languages).
\item[\emph{Self-sustaining}] The solution runs by itself as an autonomous system, not depending on individual or organisational coordination of collective action or any legal entity's business, nor exclusive know-how, hardware, or network infrastructure.
\end{labelledlist}
\subsection{Objectives \statusyellow}\label{sec:objectives}
%\subsubsection{Scope}
When we talk about the "flow of data," a core aspect is how information retains provable integrity across modalities (see Table \ref{tab:scope}). This corresponds to the original Ethereum vision of the \gloss{world computer}, constituting the trustless (i.e. fully trustable) fabric of the coming data scene: a global infrastructure that supports data storage, transfer, and processing.
\begin{table}[!htb]
\centering
\begin{tabular}{c|c|c}
dimension & model & project area\\\hline
%
time & memory & storage \\
space & messaging & communication \\
symbolic & manipulation & processing \\
\end{tabular}
\caption{Swarm's scope and data integrity aspects across 3 dimensions.}
\label{tab:scope}
\end{table}
With the Ethereum blockchain as the CPU of the world computer, Swarm is best thought of as its "hard disk". Of course, this model belies the complex nature of Swarm, which is capable of much more than simple storage, as we will discuss.
The Swarm project sets out to bring this vision to completion and build the world computer's storage and communication.
\subsection{Impact areas \statusorange}
In what follows, we identify feature areas of the product that best express or facilitate the four values of inclusivity, integrity, incentivisation, and impartiality introduced above.
Inclusivity in terms of permissionless participation is best guaranteed by a decentralised peer-to-peer network.
Allowing nodes to provide service and get paid for doing so will offer a zero-cash entry to the ecosystem: new users without currency can serve other nodes until they accumulate enough currency to use services themselves. A decentralised network providing distributed storage without gatekeepers is also inclusive and impartial in that it allows content creators, who would otherwise risk being deplatformed by repressive authorities, to publish without their right to free speech being violated.
The system of economic incentives built into the protocols works best if it tracks the actions that incur costs in the context of peer-to-peer interactions. Bandwidth sharing, as evidenced in message relaying, is one such action, where immediate accounting is possible as a node receives a message that is valuable to it. On the other hand, promissory services, such as the commitment to preserve data over time, must be rewarded only upon verification. In order to avoid the \gloss{tragedy of the commons} problem, such promissory commitments should be safeguarded by enforcing individual accountability through the threat of punitive measures, i.e. by allowing staked insurers.
Integrity is maintained by ensuring easy provability of authenticity while still maintaining anonymity.
Provable inclusion and uniqueness are fundamental to allowing trustless data transformations.
% \subsection{Requirements \statusred}\label{sec:requirements}
\subsection{The future} \label{sec:future}
In today's digital society, many challenges lie ahead for humanity, leaving the future uncertain. Nonetheless, to be sovereign and in control of our destinies, nations and individuals alike must retain access and control over their data and communication.
Swarm's vision and objectives are rooted in the values of the decentralised tech community. Originally conceived as the file storage component in the trinity which would form the world computer alongside Ethereum and Whisper, Swarm embraces its role in building a resilient decentralised digital ecosystem.
It provides the necessary responsiveness for dapps running on users' devices, while also offering incentivised storage utilising various storage infrastructures ranging from smartphones to high-availability clusters. Continuity will be guaranteed with well-designed incentives for bandwidth and storage.
Content creators will receive fair compensation for the content they offer, and content consumers will pay for it. By eliminating the middlemen providers who currently benefit from the network effects, the benefits will be spread throughout the network.
But it will be much more than that. Every individual and every device leaves a trail of data, which is collected and stored in silos whose potential is only partly exploited, and only to the benefit of large players.
Swarm will serve as the go-to platform for digital mirror worlds, providing individuals, societies, and nations with a cloud storage solution that is independent of any one large provider.
% This is especially important for countries currently lagging behind, such as Africa and Latin America.
Individuals will have full control over their own data. They will no longer be trapped in the current system of data slavery, where personal data is exchanged for services on opaque and exploitative platforms. Moreover, they will be able to form data collectives or data co-operatives, sharing their resources in the form of \emph{data commons} to achieve shared objectives.
Nations will establish self-sovereign Swarm clouds as data spaces to cater to the emerging artificial intelligence paradigm—in industry, health, mobility, and other sectors. These clouds will operate in a peer-to-peer manner, potentially within exclusive regions, and third parties will not be able to interfere by monitoring, censoring, or manipulating the flow of data. However, authorised parties will have access to the data, aiming to level the playing field for AI and services based on it.
Swarm can, paradoxically, serve as the "central" place to store data. Embracing this technology will empower individuals and society with robust accessibility, control, and fair value distribution of data, allowing for the leveraging of data for the collective benefit of all.
In the future society, Swarm will become ubiquitous, transparently and securely serving data from individuals and devices to data consumers within the fair data economy.