feat(resharding): introduce resharding actor #12217

Merged
merged 53 commits into from
Oct 16, 2024

Trisfald
Contributor

Main changes introduced by this PR:

  • Calling FlatStorageResharder resume and start in all code paths.
  • Added a very thin ReshardingActor and related Sender/Request items. Unfortunately, this meant propagating an extra argument through Client and Chain.
  • Moved FlatStorageResharder into ReshardingManager. It is wrapped in an Option because that makes initialization of Chain easier, and it's not needed in many tests anyway.
  • Refactored FlatStorageResharder to work with a Sender. Functionality is the same as before.
  • A lot of uninteresting changes just to make old tests compile.
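To illustrate the overall shape described in the list above, here is a minimal sketch of a resharder that schedules work through a sender and a thin actor that receives it. It assumes `std::sync::mpsc` as a stand-in for the project's own sender types; `Resharder`, `Actor`, `SplitShardRequest` and `start_split` are hypothetical names for illustration, not the actual nearcore API.

```rust
use std::sync::mpsc;

/// Hypothetical stand-in for `FlatStorageSplitShardRequest`.
#[derive(Debug, PartialEq)]
pub struct SplitShardRequest {
    pub shard_id: u64,
}

/// Hypothetical stand-in for `FlatStorageResharder`: it only knows how to
/// hand work to whoever sits behind the sender.
pub struct Resharder {
    scheduler: mpsc::Sender<SplitShardRequest>,
}

impl Resharder {
    pub fn new(scheduler: mpsc::Sender<SplitShardRequest>) -> Self {
        Self { scheduler }
    }

    /// Schedules the split; returns false if the receiving side is gone.
    pub fn start_split(&self, shard_id: u64) -> bool {
        self.scheduler.send(SplitShardRequest { shard_id }).is_ok()
    }
}

/// Hypothetical stand-in for `ReshardingActor`: drains requests and records
/// which shards it processed.
pub struct Actor {
    inbox: mpsc::Receiver<SplitShardRequest>,
    pub processed: Vec<u64>,
}

impl Actor {
    pub fn new(inbox: mpsc::Receiver<SplitShardRequest>) -> Self {
        Self { inbox, processed: Vec::new() }
    }

    /// Handles every request currently queued, without blocking.
    pub fn drain(&mut self) {
        while let Ok(request) = self.inbox.try_recv() {
            self.processed.push(request.shard_id);
        }
    }
}
```

The resharder never sees the actor directly, which is why tests can swap the receiving side for something simpler.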

Contributor

@wacban left a comment

LGTM

@@ -361,7 +362,9 @@ impl Chain {
let resharding_manager = ReshardingManager::new(
Contributor

Do we even need resharding manager in the view client?

Contributor Author

I don't see a reason. I'll make a note to ask @shreyan-gupta and then remove the manager from there

Contributor

@shreyan-gupta Oct 15, 2024

Shouldn't be needed, but we would have to make it Option then?

@@ -52,13 +54,24 @@ use near_store::{ShardUId, StorageError};
pub struct FlatStorageResharder {
runtime: Arc<dyn RuntimeAdapter>,
resharding_event: Arc<Mutex<Option<FlatStorageReshardingEventStatus>>>,
scheduler: messaging::Sender<FlatStorageSplitShardRequest>,
Contributor

nit: Maybe import messaging? I don't think there is any ambiguity about it.

Comment on lines +65 to +67
/// * `runtime`: runtime adapter
/// * `scheduler`: component used to schedule the background tasks
/// * `controller`: manages the execution of the background tasks
Contributor

That's pretty!

-fn split_shard_task(resharder: FlatStorageResharder, controller: FlatStorageResharderController) {
-    let task_status = split_shard_task_impl(resharder.clone(), controller.clone());
+pub fn split_shard_task(resharder: FlatStorageResharder) {
+    let task_status = split_shard_task_impl(resharder.clone());
Contributor

Sanity check: What's up with the clones?

Contributor Author

Good catch, it's not really needed anymore. Reference is enough.
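As a sketch of the point about the clones: when the inner function only reads shared state, it can borrow the resharder instead of taking a clone. The function names mirror the ones in the diff, but the bodies and the `status` field are assumptions made up for illustration.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for `FlatStorageResharder`; cloning it only bumps
/// reference counts, yet the clone is still unnecessary for a borrowed call.
#[derive(Clone)]
pub struct Resharder {
    status: Arc<Mutex<Vec<String>>>,
}

impl Resharder {
    pub fn new() -> Self {
        Self { status: Arc::new(Mutex::new(Vec::new())) }
    }

    /// Returns a snapshot of the recorded steps.
    pub fn log(&self) -> Vec<String> {
        self.status.lock().unwrap().clone()
    }
}

/// The implementation borrows the resharder: no ownership is needed.
fn split_shard_task_impl(resharder: &Resharder) -> bool {
    resharder.status.lock().unwrap().push("split done".to_string());
    true
}

/// The outer task owns the resharder and lends it to the implementation.
pub fn split_shard_task(resharder: Resharder) -> bool {
    let succeeded = split_shard_task_impl(&resharder);
    resharder.status.lock().unwrap().push("cleanup done".to_string());
    succeeded
}
```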

@@ -501,7 +501,7 @@ pub enum FlatStorageReshardingTaskStatus {

/// Helps control the flat storage resharder operation. More specifically,
/// it has a way to know when the background task is done or to interrupt it.
-#[derive(Clone)]
+#[derive(Clone, Debug)]
Contributor

I'm surprised that sender and receiver implement Debug but alright.

Contributor Author

Caught me by surprise as well

}

impl ReshardingManager {
pub fn new(
store: Store,
epoch_manager: Arc<dyn EpochManagerAdapter>,
runtime_adapter: Arc<dyn RuntimeAdapter>,
Contributor

Just for my education, why do you need the runtime adapter in this PR?

Contributor Author

The reason is that FlatStorageResharder is now owned and created by ReshardingManager, and FlatStorageResharder needs the runtime.

ReshardingEventType::SplitShard(split_shard_event.clone()),
&next_shard_layout,
)?,
None => tracing::info!(target: "resharding", "flat storage resharder not initialized"),
Contributor

should that be tracing::error?

}

pub fn handle_flat_storage_split_shard_request(&mut self, msg: FlatStorageSplitShardRequest) {
split_shard_task(msg.resharder);
Contributor

nit: Would it be any nicer as a method? msg.resharder.split_shard_task() ?

Contributor Author

Yes, I did the change and it looks nicer to me


#[derive(actix::Message, Clone, Debug)]
#[rtype(result = "()")]
pub struct FlatStorageSplitShardRequest {
Contributor

Can you add comments to the pub structs?


#[derive(Clone, near_async::MultiSend, near_async::MultiSenderFrom)]
pub struct ReshardingSender {
pub flat_storage_split_shard_send: Sender<FlatStorageSplitShardRequest>,
Contributor

In the future we'd want a sender for any type of resharding request and SplitShard would be one of the variants in the Request enum. No need to do it now though.
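A possible shape for that future request enum, purely as a sketch: `ReshardingRequest`, its fields, and the `Cleanup` variant are invented here to show how a single sender type could carry several kinds of job.

```rust
/// Hypothetical request enum: one message type, one variant per resharding job.
#[derive(Debug, Clone, PartialEq)]
pub enum ReshardingRequest {
    /// Split the flat storage of a parent shard (the only job in this PR).
    SplitShard { parent_shard: u64 },
    /// A made-up second variant, only to show how the enum would grow.
    Cleanup { shard: u64 },
}

/// A single handler dispatches on the variant.
pub fn handle(request: &ReshardingRequest) -> String {
    match request {
        ReshardingRequest::SplitShard { parent_shard } => {
            format!("splitting shard {parent_shard}")
        }
        ReshardingRequest::Cleanup { shard } => format!("cleaning up shard {shard}"),
    }
}
```

With this layout, adding a job means adding a variant and a match arm rather than another sender field.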

ReshardingEventType::SplitShard(split_shard_event.clone()),
&next_shard_layout,
)?,
None => tracing::error!(target: "resharding", "flat storage resharder not initialized"),
Contributor

Shouldn't we panic if the resharder is not initialized? Or does this depend on protocol versioning? If so, it shouldn't be logged as an error, right?

Contributor

Converting resharding sender to non-option should remove this case.

@@ -609,7 +605,10 @@ mod tests {
}

/// Generic test setup.
-fn create_fs_resharder(shard_layout: ShardLayout) -> (Chain, FlatStorageResharder) {
+fn create_fs_resharder(
Contributor

nit. what does fs stand for here? if flat_storage, can we expand the name?

Contributor Author

it stands for flat storage yes

function name aged quite badly, I'll change it!

@@ -23,15 +27,27 @@ pub struct ReshardingManager {
/// A handle that allows the main process to interrupt resharding if needed.
/// This typically happens when the main process is interrupted.
pub resharding_handle: ReshardingHandle,
/// Takes care of performing resharding on the flat storage.
pub flat_storage_resharder: Option<FlatStorageResharder>,
Contributor

I would prefer to make this non-optional in the future (maybe add a TODO, since the Option is convenient for this PR). As far as I know, resharding will fail if this is not initialized. Or can it still succeed?

Contributor

I would also prefer to have this non-optional. I think we should address this as part of this PR itself. For the places where we don't have a sender or don't need to pass one, we can just pass in a noop() sender.

Contributor Author

I'll try to remove the Option and pass a noop()!
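The idea of replacing the Option with a no-op sender can be sketched roughly as follows. The `CanSend` trait here is a simplified local definition, and `NoopSender`, `RecordingSender` and `Manager` are hypothetical names; the real sender types in the codebase may look different.

```rust
use std::sync::{Arc, Mutex};

/// Minimal sending abstraction, defined locally for this sketch.
pub trait CanSend<M>: Send + Sync {
    fn send(&self, message: M);
}

/// Sender that silently drops every message.
pub struct NoopSender;

impl<M> CanSend<M> for NoopSender {
    fn send(&self, _message: M) {}
}

/// Sender that records messages, handy for assertions in tests.
#[derive(Default)]
pub struct RecordingSender {
    pub sent: Mutex<Vec<u64>>,
}

impl CanSend<u64> for RecordingSender {
    fn send(&self, message: u64) {
        self.sent.lock().unwrap().push(message);
    }
}

/// Hypothetical manager: the sender field is always present, so callers
/// never need a `None` branch.
pub struct Manager {
    sender: Arc<dyn CanSend<u64>>,
}

impl Manager {
    pub fn new(sender: Arc<dyn CanSend<u64>>) -> Self {
        Self { sender }
    }

    pub fn start_resharding(&self, shard_id: u64) {
        self.sender.send(shard_id);
    }
}
```

Code paths that do not care about resharding construct the manager with `NoopSender`, which removes the "not initialized" case from the call site.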

codecov bot commented Oct 15, 2024

Codecov Report

Attention: Patch coverage is 89.74359% with 24 lines in your changes missing coverage. Please review.

Project coverage is 71.72%. Comparing base (3cb74c2) to head (93b7556).
Report is 2 commits behind head on master.

Files with missing lines                     Patch %   Lines
chain/chain/src/flat_storage_resharder.rs    89.50%    12 Missing and 5 partials ⚠️
chain/chain/src/flat_storage_creator.rs      33.33%    2 Missing ⚠️
tools/database/src/resharding_v2.rs          0.00%     2 Missing ⚠️
chain/chain/src/resharding/manager.rs        92.30%    0 Missing and 1 partial ⚠️
chain/client/src/test_utils/test_env.rs      0.00%     1 Missing ⚠️
tools/speedy_sync/src/main.rs                0.00%     1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##           master   #12217   +/-   ##
=======================================
  Coverage   71.72%   71.72%           
=======================================
  Files         833      835    +2     
  Lines      166700   166684   -16     
  Branches   166700   166684   -16     
=======================================
- Hits       119573   119562   -11     
+ Misses      41902    41901    -1     
+ Partials     5225     5221    -4     
Flag                      Coverage Δ
backward-compatibility    0.17% <0.00%> (-0.01%) ⬇️
db-migration              0.17% <0.00%> (-0.01%) ⬇️
genesis-check             1.25% <0.00%> (-0.01%) ⬇️
integration-tests         38.88% <58.54%> (+0.15%) ⬆️
linux                     71.39% <83.33%> (-0.02%) ⬇️
linux-nightly             71.32% <89.74%> (+<0.01%) ⬆️
macos                     53.82% <82.96%> (-0.50%) ⬇️
pytests                   1.57% <0.00%> (-0.01%) ⬇️
sanity-checks             1.37% <0.00%> (-0.01%) ⬇️
unittests                 65.52% <82.96%> (-0.03%) ⬇️
upgradability             0.21% <0.00%> (-0.01%) ⬇️

Contributor

@shreyan-gupta left a comment

Looks good overall but have a couple of comments

 let resharder = self.clone();
-let task = Box::new(move || split_shard_task(resharder, controller));
-scheduler.schedule(task);
+self.scheduler.send(FlatStorageSplitShardRequest { resharder });
Contributor

This does not look right, why are we sending an instance of resharder along with request? Can the actor hold an instance of resharder as a member instead?

Contributor Author

Yes, it should be easy to make the actor hold a copy of the resharder.
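For reference, the alternative suggested in this thread could look roughly like the sketch below: the actor owns the resharder and the message carries only data. All types are hypothetical stand-ins, and the sketch ignores how the actor and resharder would be wired together at construction time.

```rust
/// Hypothetical stand-in for `FlatStorageResharder`.
pub struct Resharder {
    pub completed: Vec<u64>,
}

impl Resharder {
    pub fn split_shard(&mut self, shard_id: u64) {
        self.completed.push(shard_id);
    }
}

/// The message carries only data describing the job, not the worker.
pub struct SplitShardRequest {
    pub shard_id: u64,
}

/// Hypothetical actor that owns its resharder as a member.
pub struct Actor {
    resharder: Resharder,
}

impl Actor {
    pub fn new(resharder: Resharder) -> Self {
        Self { resharder }
    }

    /// Runs the split on the resharder the actor already holds.
    pub fn handle(&mut self, request: SplitShardRequest) {
        self.resharder.split_shard(request.shard_id);
    }

    pub fn completed(&self) -> &[u64] {
        &self.resharder.completed
    }
}
```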

split_shard_request: Arc<RwLock<VecDeque<FlatStorageSplitShardRequest>>>,
}

impl CanSend<FlatStorageSplitShardRequest> for MockReshardingSender {
Contributor

This is the old framework for tests, from before test loop. Let's try to keep it as lean as possible and not introduce any new functionality here. It would be really really really really really really great if we could just have a noop sender, kill all this code, and say we don't support resharding in the old framework.

Contributor Author

I'll try to kill this code and replace with the least amount of code just to make stuff compile

}

pub fn handle_flat_storage_split_shard_request(&mut self, msg: FlatStorageSplitShardRequest) {
msg.resharder.split_shard_task();
Contributor

I'm honestly not too happy that we are calling a function from resharder here.

To reduce the mental burden of identifying what happens where, ideally all the logic for the heavy task of splitting would be part of ReshardingActor. This is too much indirection to deal with.

Is it possible for us to push the logic of splitting the flat storage here?

Contributor Author

I thought the general sentiment was to keep the actor as simple as possible, thus all the work and unit testing is done in the resharder. Any other thoughts?

Contributor

Would love input from other folks, @wacban @Longarithm!

Contributor

I don't have strong feelings either way. My only recommendation would be to keep the actor API part (typically the handle method) as slim as possible. This way we can properly unit test the actual implementation without needing to spin up the actor or worry about async. Something like this:

fn handle(request) {
    the_actual_implementation(request);
}

mod tests {
    // Here the_actual_implementation can be unit tested without needing to bootstrap an entire actor.
}
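A compilable version of that pseudocode might look like the sketch below. The request type, the key-splitting logic and the actor are all hypothetical; the only point is that `handle` forwards to a plain function that can be tested on its own.

```rust
/// Hypothetical request type.
pub struct SplitRequest {
    pub keys: Vec<u32>,
    pub boundary: u32,
}

/// The real work: a plain function with no actor or async machinery.
/// Keys below `boundary` go to the left child, the rest to the right child.
pub fn split_keys(request: &SplitRequest) -> (Vec<u32>, Vec<u32>) {
    request.keys.iter().copied().partition(|&key| key < request.boundary)
}

/// Hypothetical actor; the handler only forwards to the implementation.
#[derive(Default)]
pub struct Actor {
    pub last_result: Option<(Vec<u32>, Vec<u32>)>,
}

impl Actor {
    pub fn handle(&mut self, request: SplitRequest) {
        self.last_result = Some(split_keys(&request));
    }
}
```

A unit test can call `split_keys` directly, so the actor only needs a smoke test for the forwarding.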

@@ -401,6 +405,7 @@ pub fn start_with_config_and_synchronization(
partial_witness_actor.clone().with_auto_span_context().into_multi_sender(),
true,
None,
resharding_sender.into_multi_sender(),
Contributor

Ugh, we need a better way to handle adding new actors and new actix requests. Note to self; I'll see if I can come up with better ways of handling all actor senders in a single unified struct.

FlatStorageResharder::new(
runtime_adapter,
sender.into_sender(),
FlatStorageResharderController::from_resharding_handle(resharding_handle.clone()),
Contributor

I'm still unsure how the controller works and what we do with the sender and receiver here. I'll have to go back and read up on how crossbeam channels work, but ideally we should not introduce a new framework here, given that we already have actors and message senders. Thoughts?

Contributor Author

I kept the channels because I thought they could come in handy, but they did not, and I planned to remove them in another PR. I can push the change here instead.
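One way a background task can be interrupted without channels is a shared atomic flag that the task polls between batches. The sketch below assumes that approach; `CancelHandle`, `TaskStatus` and `run_task` are hypothetical names and do not describe how the controller in this PR actually works.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical cancellation handle shared between the main process and a task.
#[derive(Clone, Default)]
pub struct CancelHandle {
    cancelled: Arc<AtomicBool>,
}

impl CancelHandle {
    pub fn cancel(&self) {
        self.cancelled.store(true, Ordering::Relaxed);
    }

    pub fn is_cancelled(&self) -> bool {
        self.cancelled.load(Ordering::Relaxed)
    }
}

/// Outcome of the task, with the number of batches that were processed.
#[derive(Debug, PartialEq)]
pub enum TaskStatus {
    Successful { done: usize },
    Cancelled { done: usize },
}

/// Processes the batches in order, checking the handle before each one.
pub fn run_task(
    batches: &[u32],
    handle: &CancelHandle,
    mut process: impl FnMut(u32),
) -> TaskStatus {
    let mut done = 0;
    for &batch in batches {
        if handle.is_cancelled() {
            return TaskStatus::Cancelled { done };
        }
        process(batch);
        done += 1;
    }
    TaskStatus::Successful { done }
}
```

The task stops at the next batch boundary after `cancel` is called, so the batch in flight always completes.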

@Trisfald
Contributor Author

I've addressed most of the comments!

I left out two things:

  • Making the ReshardingActor hold a resharder instance: it's a bigger change than expected because it introduces a circular dependency between the two
  • Merging ReshardingActor with FlatStorageResharder: resharder is used both by the actor and by the logic which handles flat state status at startup, so it is convenient to have this separation at the moment

I plan to do further refactoring and improvements in separate PRs

@Trisfald added this pull request to the merge queue Oct 16, 2024
Merged via the queue into near:master with commit aa48e52 Oct 16, 2024
24 of 25 checks passed
@Trisfald deleted the flat-storage-resharding-actor branch October 16, 2024 20:22
4 participants