
Remove shards #5066

Merged: 15 commits into XRPLF:develop, Aug 3, 2024
Conversation

@thejohnfreeman (Collaborator) commented Jul 13, 2024

This change removes everything related to shards: code, tests, documentation, and configuration settings.

99% of the change is removed lines, and all remaining tests pass. Only 3 test files are removed:

  • nodestore/DatabaseShard_test.cpp
  • rpc/NodeToShardRPC_test.cpp
  • rpc/ShardArchiveHandler_test.cpp

Only 2 test files are modified:

  • nodestore/Database_test.cpp: Remove tests of Database::ledgersPerShard().
  • rpc/RPCCall_test.cpp: Remove tests of download_shard and node_to_shard.


codecov bot commented Jul 13, 2024

Codecov Report

Attention: Patch coverage is 36.23188% with 44 lines in your changes missing coverage. Please review.

Project coverage is 74.8%. Comparing base (eedfec0) to head (dae6d00).

File                                                   Patch %   Lines missing
src/xrpld/app/ledger/detail/LedgerMaster.cpp              0.0%   19
src/xrpld/overlay/detail/PeerImp.cpp                      0.0%   13
src/xrpld/app/rdb/backend/detail/Node.cpp                33.3%    4
...rc/xrpld/app/rdb/backend/detail/SQLiteDatabase.cpp    20.0%    4
src/xrpld/app/main/Application.cpp                        0.0%    2
src/xrpld/app/ledger/detail/InboundLedger.cpp            75.0%    1
src/xrpld/app/misc/NetworkOPs.cpp                         0.0%    1
Additional details and impacted files


@@            Coverage Diff            @@
##           develop   #5066     +/-   ##
=========================================
+ Coverage     71.4%   74.8%   +3.4%     
=========================================
  Files          796     766     -30     
  Lines        67075   63002   -4073     
  Branches     10858    8861   -1997     
=========================================
- Hits         47884   47114    -770     
+ Misses       19191   15888   -3303     
Files Coverage Δ
include/xrpl/protocol/HashPrefix.h 100.0% <ø> (ø)
include/xrpl/protocol/SystemParameters.h 100.0% <ø> (ø)
src/xrpld/app/consensus/RCLConsensus.cpp 64.0% <ø> (ø)
src/xrpld/app/ledger/InboundLedger.h 66.7% <ø> (ø)
src/xrpld/app/ledger/LedgerMaster.h 61.9% <ø> (ø)
src/xrpld/app/ledger/detail/InboundLedgers.cpp 32.7% <ø> (+0.7%) ⬆️
src/xrpld/app/main/Application.h 100.0% <ø> (ø)
src/xrpld/app/main/Main.cpp 79.5% <ø> (-0.1%) ⬇️
src/xrpld/app/misc/SHAMapStoreImp.cpp 75.1% <100.0%> (+1.2%) ⬆️
.../xrpld/app/rdb/backend/detail/PostgresDatabase.cpp 0.0% <ø> (ø)
... and 37 more

... and 9 files with indirect coverage changes


@MarkusTeufelberger (Collaborator)

So what's the replacement method to reliably and trustlessly share a full history node database?

@thejohnfreeman (Collaborator, Author)

@MarkusTeufelberger tarball, rsync, all the most popular methods for copying data. The sharding feature was never completed. What is there to replace?

Resolved review threads: src/test/rpc/ShardArchiveHandler_test.cpp, src/xrpld/rpc/ShardArchiveHandler.h
@Bronek (Collaborator) commented Jul 30, 2024

There's a comment mentioning shards in src/xrpld/nodestore/backend/NuDBFactory.cpp (lines 152-158); it probably needs to be updated.

@Bronek (Collaborator) commented Jul 30, 2024

I also see a mention of RPC::ShardArchiveHandler in src/xrpld/net/detail/HTTPDownloader.cpp (line 127).

@Bronek (Collaborator) commented Jul 30, 2024

> Codecov Report
>
> Attention: Patch coverage is 36.23188% with 44 lines in your changes missing coverage. Please review.
>
> Project coverage is 74.8%. Comparing base (a39720e) to head (dd6b21a). Report is 1 commit behind head on develop.

We have to ignore the "failing" check from codecov, the one that is showing "36% of diff hit (target 71%)". The alternative is to add more code in unit tests, which this PR should not do.

@Bronek (Collaborator) left a comment:

Good change.

@ximinez (Collaborator) left a comment:

A good change, which will simplify a lot of the remaining code. I left a few suggestions and questions.

However, I have to ask, sort of as a follow-up to @MarkusTeufelberger's question: Why not finish the feature? It wasn't bad code, and it had potential for making it a lot easier to manage space, especially as pertains to online_delete. I don't know what else was planned for it, either.

Resolved review threads: src/xrpld/app/rdb/README.md, src/xrpld/nodestore/Database.h, src/xrpld/nodestore/backend/NuDBFactory.cpp
Comment on lines -487 to 472
case protocol::mtGET_PEER_SHARD_INFO_V2:
success = detail::invoke<protocol::TMGetPeerShardInfoV2>(
*header, buffers, handler);
break;
case protocol::mtPEER_SHARD_INFO_V2:
success = detail::invoke<protocol::TMPeerShardInfoV2>(
*header, buffers, handler);
break;
default:
handler.onMessageUnknown(header->message_type);
@ximinez (Collaborator):

I'm a little worried about removing these message handlers without changing the protocol version as an indication that this peer no longer supports shards. However, I checked and onMessageUnknown doesn't do anything, so it looks like the peer won't be punished or booted, so maybe no harm no foul?

@thejohnfreeman (Collaborator, Author):

I have no strong feelings about it. I'm fine incrementing the protocol version if someone can teach me how.

@Bronek (Collaborator):

FWIW @ximinez, exactly that ("the peer won't be punished or booted, so no harm no foul") was my thinking when I reviewed this.

@ximinez (Collaborator):

Here's an example of how one would add a protocol version: ximinez@f978447. But it's not actually useful, because older clients won't speak this protocol version, and newer clients won't support shards anyway, so there's no need to test for it.

I updated the PeerImp::supportsFeature function, which for a "real" new protocol version would be called to check the feature before doing the new function or sending the new message or whatever.
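To make the mechanics concrete, here is a minimal sketch of such a feature gate, modeled on PeerImp::supportsFeature. The ValidatorListPropagation case mirrors the existing pattern in rippled; ProtocolFeature::NoShards and the make_protocol(2, 3) cutoff are invented for illustration:

```cpp
// Sketch only, modeled on PeerImp::supportsFeature in rippled.
// ProtocolFeature::NoShards and the 2.3 cutoff are hypothetical; they
// illustrate how a new protocol version would gate shard behavior.
bool
PeerImp::supportsFeature(ProtocolFeature f) const
{
    switch (f)
    {
        case ProtocolFeature::ValidatorListPropagation:
            return protocol_ >= make_protocol(2, 1);
        case ProtocolFeature::NoShards:  // hypothetical
            // A peer at or above this version is assumed to neither
            // send nor expect shard-related messages.
            return protocol_ >= make_protocol(2, 3);
    }
    return false;
}
```

A caller would then check supportsFeature(ProtocolFeature::NoShards) before sending (or expecting) the removed messages, which is exactly the check that turns out to be unnecessary here.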

Comment on lines -103 to 104
// The second argument of fetch is ignored when not using shards
if (auto obj = context.app.getNodeFamily().db().fetchNodeObject(
locator.getNodestoreHash(), locator.getLedgerSequence()))
@ximinez (Collaborator):

> The second argument of fetch is ignored when not using shards

So does it make sense to remove the second argument to fetchNodeObject everywhere?

@thejohnfreeman (Collaborator, Author):

It does. I started going down that route, but it is all over the place, and I started having second thoughts. We might want to keep this information available for the future. It seems to be a lesson learned the hard way that we often want both the ledger digest and the ledger sequence together. In fact, I added a type in my FLR2 branch just for representing the pair, which I call LedgerIdentifier.
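For concreteness, a minimal sketch of the pairing described above; the real LedgerIdentifier in the FLR2 branch may well differ:

```cpp
// Sketch only; the actual LedgerIdentifier in the FLR2 branch may differ.
// It pairs the two pieces of information that, per the lesson above, we
// often want together.
struct LedgerIdentifier
{
    uint256 digest;         // ledger hash
    std::uint32_t sequence; // ledger sequence; 0 conventionally means "unknown"
};
```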

@ximinez (Collaborator):

That ledger sequence has a default value of 0. Do you know if there are any places that are using the default, or passing 0 in explicitly? If there aren't any, then I'm fine with leaving this as-is.

@thejohnfreeman (Collaborator, Author):

I know there are places that pass 0 explicitly. Here is one I found (in FLR1) in a quick search. I believe there are more. 0 is not a valid sequence though. The genesis ledger starts at 1. 0 is used to mean "unknown".
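As a simplified illustration of the default under discussion (the actual declaration of fetchNodeObject in the nodestore headers takes more parameters and may differ):

```cpp
// Simplified sketch; the real declaration is longer.
// Passing 0, or relying on the default, means "sequence unknown";
// valid sequences start at 1 with the genesis ledger.
std::shared_ptr<NodeObject>
fetchNodeObject(
    uint256 const& hash,
    std::uint32_t ledgerSeq = 0);
```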

@ximinez (Collaborator):

Yeah, that's what I figured.

@scottschurr (Collaborator)

@ximinez asked:

> Why not finish the feature?

In part because the code is particularly stale. Mickey gave me a core dump on the state of shards in November of 2021. No one has done any development (at all) on shards since then. That's a good 2.5 years' worth of staleness. Mickey was hoping that I would pick up development of shards. But I didn't. Ripple had other plans for what I needed to work on.

Also, Ripple ended up throwing its weight into clio, which was described as an alternative to shards as a solution for the full history problem. If you believe that clio solves the full history problem, then shards are a lot less useful.

Another reason not to finish shards is that Mickey was just discovering some additional difficult parts of them. The code is not bad code, but the feature is not actually done. There's no one with a vision for how the feature should be completed.

I hope that history helps.

@thejohnfreeman (Collaborator, Author)

Based on my understanding of the old design, I have reservations.

I imagine that activity among ledger objects follows a power law distribution. Most objects are rarely touched, and a few objects represent most of the touches in any given block. If we pack new objects into a shard until it reaches a size limit, and then seal that shard and distribute it, then most objects in the shard are in the small set of frequently touched objects, but a few objects in the shard will be from the infrequently touched set.

How does this affect the performance of serving particular queries I might have?

  • If I want one particular object as it existed in one particular ledger, then I need a proof path from the header of that ledger to that object. If the shards are storing inner nodes, then the search might involve a path through multiple shards, held by different servers, and shards will fill up quickly because they are storing so much data from inner nodes. If the shards are not storing inner nodes, then they will store less data, but the search will involve even more shards, that are even more likely to be held by different servers, because we have to collect the leaves of the entire subtree to recompute the inner nodes.

  • If I want a complete copy of one ledger, then as time goes on, each object becomes more likely to be the only object of interest in its shard. The number of bytes I'm downloading explodes because most of it is for other ledgers.

I think what we really want is consistent hashing of ledger objects, not bundles of ledger objects that we call shards. But we need consistent hashing in the context of a trustless, decentralized peer-to-peer network with high turnover. There may be good solutions in this problem space. I haven't seen them, but I haven't done any research.
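The comment above names consistent hashing without pinning down a design. As a point of reference, here is a textbook consistent-hash ring in C++; it is purely illustrative (node names and object digests are made up) and sidesteps the trustless, decentralized constraints raised above:

```cpp
// Illustration only: a textbook consistent-hash ring, not code from this
// PR or from rippled. Node names and object digests are made up.
#include <cassert>
#include <cstdint>
#include <functional>
#include <iostream>
#include <map>
#include <string>

class HashRing
{
    // Ring position -> node name. Each node is inserted at several
    // virtual points so key ownership stays roughly balanced.
    std::map<std::uint64_t, std::string> ring_;
    static constexpr int kVirtualNodes = 64;

    static std::uint64_t
    position(std::string const& key)
    {
        return std::hash<std::string>{}(key);
    }

public:
    void
    addNode(std::string const& node)
    {
        for (int i = 0; i < kVirtualNodes; ++i)
            ring_.emplace(position(node + "#" + std::to_string(i)), node);
    }

    void
    removeNode(std::string const& node)
    {
        for (auto it = ring_.begin(); it != ring_.end();)
        {
            if (it->second == node)
                it = ring_.erase(it);
            else
                ++it;
        }
    }

    // The owner of a key is the first node at or after the key's ring
    // position, wrapping around to the start of the ring.
    std::string const&
    owner(std::string const& objectDigest) const
    {
        assert(!ring_.empty());
        auto it = ring_.lower_bound(position(objectDigest));
        if (it == ring_.end())
            it = ring_.begin();
        return it->second;
    }
};

int
main()
{
    HashRing ring;
    ring.addNode("peerA");
    ring.addNode("peerB");
    ring.addNode("peerC");

    for (auto const* digest : {"obj1", "obj2", "obj3", "obj4"})
        std::cout << digest << " -> " << ring.owner(digest) << '\n';

    // Removing a node remaps only the keys that node owned; every other
    // key keeps its owner.
    ring.removeNode("peerB");
    for (auto const* digest : {"obj1", "obj2", "obj3", "obj4"})
        std::cout << digest << " -> " << ring.owner(digest) << '\n';
}
```

That stability under churn is what makes the technique attractive for a high-turnover peer-to-peer network; the hard part the comment points to is getting the same property trustlessly.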

Then, there are my engineering sentiments. I don't like the proliferation of branches that shards introduced. I'd rather just see a clean replacement of the nodestore abstraction.

@ximinez (Collaborator) left a comment:

Looks good.

Thanks for the detailed explanations and historical info @scottschurr and @thejohnfreeman. It sets my mind at ease, and hopefully will be helpful for anyone else asking the same question.

@thejohnfreeman added the "Passed" label (passed code review & PR owner thinks it's ready to merge; perf sign-off may still be required) on Aug 1, 2024
@ximinez merged commit 7d27b11 into XRPLF:develop on Aug 3, 2024
18 of 20 checks passed
@thejohnfreeman deleted the shards branch on Aug 3, 2024
vlntb pushed a commit to vlntb/rippled that referenced this pull request Aug 23, 2024
Labels: Passed, Remove unused code
6 participants