exp/services/ledgerexporter: Research Spike - Alternative storage methods #4473

Closed
Tracked by #4571
Shaptic opened this issue Jul 21, 2022 · 3 comments

Comments

@Shaptic
Contributor

Shaptic commented Jul 21, 2022

This research spike entails exploring alternative ways to store and distribute unpacked ledger metadata (txmeta).

Currently, we've uploaded unpacked txmeta for pubnet through July 2022 to S3. Here are some stats:

  • ~7TB of storage space
  • ~42M separate objects
  • one directory
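As a rough sanity check on these figures, the average object size and the object count under a per-checkpoint batching scheme can be worked out directly. This is only a sketch: it assumes decimal terabytes, roughly one object per ledger, and 64 ledgers per checkpoint (the Stellar history archive convention); the totals are the approximate ones quoted above, and the function names are made up for illustration.

```python
TOTAL_BYTES = 7 * 10**12      # ~7 TB, decimal units assumed
TOTAL_OBJECTS = 42 * 10**6    # ~42M objects, assumed to be about one per ledger
LEDGERS_PER_CHECKPOINT = 64   # assumption: history-archive checkpoint size


def average_object_size(total_bytes: int, objects: int) -> float:
    """Mean size of a single stored object, in bytes."""
    return total_bytes / objects


def batched_object_count(objects: int, batch: int) -> int:
    """Number of objects left if `batch` ledgers are combined per file (ceiling division)."""
    return -(-objects // batch)


if __name__ == "__main__":
    avg = average_object_size(TOTAL_BYTES, TOTAL_OBJECTS)
    count = batched_object_count(TOTAL_OBJECTS, LEDGERS_PER_CHECKPOINT)
    print(f"average object size: ~{avg / 1000:.0f} KB")
    print(f"objects if batched per checkpoint: {count:,}")
    print(f"average batched object size: ~{avg * LEDGERS_PER_CHECKPOINT / 10**6:.1f} MB")
```

Under these assumptions, each object averages well under a megabyte, which is why combining ledgers per checkpoint is worth evaluating: it cuts the object count by a factor of 64 without changing the total size.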

With that in mind, there are two avenues to this spike:

  • Is there a better way to structure these files? For example, we could have a folder for each checkpoint; we could combine all ledgers in a checkpoint into a single file; etc. The task is to come up with some strategies and analyse the pros/cons of each (things like storage/bandwidth costs, etc.)

  • Is there a better way to distribute these files? A back-of-the-envelope calculation tells us that the majority of the cost of distributing these files comes from egress bandwidth. We also want to let people build & store these files themselves, yet minimize the risk of people using rogue/corrupt/malformed txmeta. The task is to come up with alternative transport layers and analyse their tradeoffs. For example, BitTorrent gives us bandwidth decentralization and integrity, but it's harder to do incremental updates. We can batch torrents by some ledger range, but is that too hard? What happens when we upgrade the meta format? What about IPFS, or other options?
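To make the two structuring ideas from the first bullet concrete, here is a minimal sketch of how object keys could be derived. It assumes 64-ledger checkpoints whose last ledger satisfies `(seq + 1) % 64 == 0` (the history archive convention); the key layouts, the `checkpoints/` prefix, and the `.xdr` suffix are hypothetical and not what the exporter currently writes.

```python
LEDGERS_PER_CHECKPOINT = 64  # assumption: history-archive checkpoint size


def checkpoint_for(seq: int) -> int:
    """Return the checkpoint ledger (last ledger of the checkpoint) containing `seq`."""
    if seq < 1:
        raise ValueError("ledger sequence numbers start at 1")
    return (seq // LEDGERS_PER_CHECKPOINT + 1) * LEDGERS_PER_CHECKPOINT - 1


def per_ledger_key(seq: int) -> str:
    """Option 1 (hypothetical layout): one folder per checkpoint, one object per ledger."""
    return f"checkpoints/{checkpoint_for(seq)}/{seq}.xdr"


def combined_key(seq: int) -> str:
    """Option 2 (hypothetical layout): all ledgers of a checkpoint in a single object."""
    return f"checkpoints/{checkpoint_for(seq)}.xdr"


if __name__ == "__main__":
    for seq in (1, 63, 64, 127, 128):
        print(seq, per_ledger_key(seq), combined_key(seq))
```

Option 1 keeps single-ledger reads cheap but leaves the object count unchanged; option 2 reduces the object count but makes a reader fetch (or range-request) a whole checkpoint to get one ledger.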

@Shaptic
Contributor Author

Shaptic commented Jul 21, 2022

Some initial notes I made regarding using BitTorrent:

There are a few benefits:

  • lower cost, since we wouldn't need to pay S3 storage and egress costs (though we'd still have to host/seed the torrent data somewhere)
  • decentralization in that everyone shares the bandwidth to keep it up
  • trust, in that everyone is using one single source of unpacked meta, so you don't run the risk of some third-party organization uploading sketchy unpacked ledgers, since everyone uses the same torrent (this risk already exists today: for example, some other public Horizon could one day go rogue and start giving people bad info)
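The integrity part of the trust argument comes from per-piece hashing: a torrent's metadata lists a hash for each fixed-size piece (SHA-1 in BitTorrent v1), so every downloader checks data against the same published hashes. The sketch below is a simplified illustration of that idea, not a torrent client; the function names are hypothetical.

```python
import hashlib
from typing import List


def piece_hashes(data: bytes, piece_length: int) -> List[bytes]:
    """Split `data` into fixed-size pieces and return the SHA-1 digest of each."""
    return [
        hashlib.sha1(data[i:i + piece_length]).digest()
        for i in range(0, len(data), piece_length)
    ]


def verify(data: bytes, piece_length: int, expected: List[bytes]) -> bool:
    """True only if every piece of `data` matches the published hash list."""
    return piece_hashes(data, piece_length) == expected


if __name__ == "__main__":
    original = b"unpacked ledger metadata" * 10
    published = piece_hashes(original, 64)   # what the publisher would distribute
    tampered = b"X" + original[1:]
    print(verify(original, 64, published))   # matches the published hashes
    print(verify(tampered, 64, published))   # first piece no longer matches
```

Note that this only shows that the data matches what was published; it says nothing about whether the publisher's data was correct in the first place, which is the "one time reliance" discussed below.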

There are some downsides:

  • a "one time reliance" on SDF as a source, in the sense that people have to trust us about the unpacked meta, but they can also verify it themselves since all of the tools are public (the only barrier is cost).

  • a bigger downside: a torrent can't be updated in real time, though we could have a model where, every X amount of time, we publish a new torrent that contains some new range of unpacked metas
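The periodic-publication model could be sketched as follows. Since a published torrent is immutable, only complete ledger ranges get a torrent, and the most recent ledgers stay unpublished until their range fills up. The batch size and function names here are assumptions for illustration only.

```python
from typing import List, Tuple


def complete_batches(latest_ledger: int, batch_size: int) -> List[Tuple[int, int]]:
    """Inclusive (first, last) ledger ranges that are full and could each become one torrent."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    full = latest_ledger // batch_size
    return [(i * batch_size + 1, (i + 1) * batch_size) for i in range(full)]


def pending_ledgers(latest_ledger: int, batch_size: int) -> int:
    """Ledgers past the last full batch; these need another transport until published."""
    return latest_ledger % batch_size


if __name__ == "__main__":
    # Tiny numbers for readability; a real batch would span many checkpoints.
    print(complete_batches(10, 4))
    print(pending_ledgers(10, 4))
```

The pending tail is where the real-time gap shows up: the larger the batch, the fewer torrents there are to manage, but the longer recent ledgers go without being available over this channel.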

@bartekn
Contributor

bartekn commented Jul 22, 2022

Thinking about it more, there is one more downside. S3 (or any other cloud file store with an optional CDN) can give everyone very quick access (on the order of 100 milliseconds or less) to any ledger meta. That probably won't be possible with BitTorrent. BitTorrent will, however, make replication easier for orgs/people who would like to host fast-access archives.

@Shaptic
Contributor Author

Shaptic commented Jul 25, 2022

That's very true @bartekn: there's an initial startup cost to connect to the swarm before you can download. If you want more than a handful of ledgers, though, that startup time should be amortized and hardly impact the overall time.
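The amortization argument can be put into numbers with a simple model: total time is a one-off startup cost plus a per-ledger transfer time, so the effective per-ledger latency is `startup / n + per_ledger`. The specific values below are placeholders, not measurements; only the ~100 ms S3 figure comes from the discussion above.

```python
def amortized_latency(startup_s: float, per_ledger_s: float, ledgers: int) -> float:
    """Effective seconds per ledger when a one-off startup cost is spread over `ledgers`."""
    if ledgers < 1:
        raise ValueError("need at least one ledger")
    return startup_s / ledgers + per_ledger_s


if __name__ == "__main__":
    STARTUP_S = 30.0      # assumption: time to join the swarm and find peers
    PER_LEDGER_S = 0.01   # assumption: steady-state transfer time per ledger
    S3_PER_LEDGER_S = 0.1 # ~100 ms per ledger, as mentioned above

    for n in (1, 10, 1_000, 100_000):
        torrent = amortized_latency(STARTUP_S, PER_LEDGER_S, n)
        print(f"{n:>7} ledgers: torrent ~{torrent:.4f} s/ledger vs S3 ~{S3_PER_LEDGER_S} s/ledger")
```

With these placeholder numbers, a single-ledger fetch is dominated by startup, while a bulk download is dominated by transfer time, which matches the point that the swarm suits replication better than low-latency random access.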
