
Proper handling for legacy parachain leases with gaps in coretime migration #426

Merged

Conversation

@tdimitrov (Contributor) commented Aug 12, 2024

While testing #401 I noticed that there are two parachains with gaps in their leases which are not migrated to coretime correctly: para ids 3359 and 3388.

The problem is that the migration obtains the list of parachain ids to be migrated from `paras::Parachains` (here), and there are no entries there for para ids which are not active at the moment.

Paras 3359 and 3388 are not active in the current lease period (excerpt from `slots::leases()`):

[
      3,359
    ]
    [
      null
      [
        14xQXJdUDC1pzyt8y3z27ANiUBgP7zTSaYutaLELJoyQrdLP
        2,010,000,000,000
      ]
      [
        14xQXJdUDC1pzyt8y3z27ANiUBgP7zTSaYutaLELJoyQrdLP
        2,010,000,000,000
      ]
      <skipped>

and:

[
      3,388
    ]
    [
      null
      [
        1TpMimWf8NvrusAQ36hVdgMpYhhtojg5Nw41CuVxG2zHPDg
        31,000,000,000,000
      ]
      [
        1TpMimWf8NvrusAQ36hVdgMpYhhtojg5Nw41CuVxG2zHPDg
        31,000,000,000,000
      ]
      <skipped>

And they have no entries in `paras::parachains()`:

[ 1,000 1,001 1,002 1,004 2,000 2,002 2,004 2,006 2,008 2,012 2,013 2,025 2,026 2,030 2,031 2,032 2,034 2,035 2,037 2,040 2,043 2,046 2,048 2,051 2,053 2,056 2,086 2,090 2,091 2,092 2,093 2,094 2,101 2,104 2,106 3,333 3,338 3,340 3,344 3,345 3,346 3,354 3,358 3,366 3,367 3,369 3,370 3,373 3,375 3,377 ]

As a result, the migration skips them, which is wrong.

A proper fix for this issue would require a new polkadot-sdk release and a version bump in the runtimes repo, which takes time and effort. To avoid this, the PR moves the coretime migration from polkadot-sdk to the fellowship repo.
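A sketch of the underlying idea (illustrative, not this PR's exact code): derive the set of paras to migrate from the `slots::Leases` storage map itself rather than from `paras::Parachains`, so paras whose lease only starts in a future period are not skipped.

```rust
use polkadot_primitives::Id as ParaId;
use polkadot_runtime_common::slots;

/// Hypothetical helper: every key in `slots::Leases` with at least one
/// `Some` entry holds a lease, whether or not the para is active in the
/// current lease period; this is what would catch 3359 and 3388.
fn paras_to_migrate<T: slots::Config>() -> impl Iterator<Item = ParaId> {
    slots::Leases::<T>::iter()
        .filter(|(_, periods)| periods.iter().any(Option::is_some))
        .map(|(para_id, _)| para_id)
}
```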

@tdimitrov (Contributor Author)

This is a draft because I am not happy with the `already_migrated` function in the PR.

@tdimitrov (Contributor Author)

As discussed offline with @eskimor, we'll move the whole coretime migration to the fellowship repo so that a proper fix can be made.
The PR is tested with this semi-automated test: https://github.com/tdimitrov/polkadot-fellows-runtimes/
I'll extend the repo with some more asserts, but meanwhile the PR is ready for review.

@tdimitrov tdimitrov marked this pull request as ready for review August 14, 2024 07:22
});

let reservation_content = message_content.clone().chain(reservations).collect();
let pool_content = message_content.clone().chain(pool).collect();
Contributor:

Pool will be empty. Do we send an empty message here?

Contributor Author:

Hmm, I missed that. I guess we do send an empty message. I'll fix it.

Contributor Author:

Fixed
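For reference, a minimal sketch of the shape of that fix (illustrative names; `build_pool_message` and the `u32` stand-in for XCM instructions are assumptions, not the PR's code): peek at the iterator first and skip building the XCM when the pool is empty.

```rust
/// Hypothetical guard: an empty pool would otherwise still produce a
/// message containing only the shared prefix instructions.
fn build_pool_message<I: Iterator<Item = u32>>(prefix: Vec<u32>, pool: I) -> Option<Vec<u32>> {
    let mut pool = pool.peekable();
    pool.peek()
        .is_some()
        .then(|| prefix.into_iter().chain(pool).collect())
}

fn main() {
    // No pool entries on Polkadot at migration time: no message is built.
    assert_eq!(build_pool_message(vec![1, 2], std::iter::empty()), None);
}
```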

Some(mk_coretime_call::<T>(CoretimeCalls::SetLease(p.into(), time_slice)))
});

let core_count: u16 = configuration::ActiveConfig::<T>::get()
Contributor:

Hmm, I'm on another call in a minute, but please check that this core handling is actually correct. This `core_count` value should read 0 on Polkadot at the time of the migration.

@tdimitrov (Contributor Author) Aug 14, 2024:

It's not 0. We update it on lines 211-214 in the same file:
https://github.com/polkadot-fellows/runtimes/pull/426/files#diff-6a311f9b6c4cd2a09f397e5e0c90ea9193ba103bf277bb88de6bd223148af52eR211-R214

The actual value is 52:

coretime-migration  TRACE: Set core count to 52. legacy paras count is  52

The reserved cores are 0 though, because `legacy_paras_count == core_count`. I have an assert for this in the test:
https://github.com/tdimitrov/polkadot-coretime-launch-test/blob/a8e7aba7d6a741b10bd5330ae3119ef15a018711/coretime-migration-check/main.js#L158-L160

Is this correct?

Contributor:

> The reserved cores are 0 though

What do you mean by reserved cores? The system chains?

Generally your code is correct.

Contributor Author:

Yes, the system chains.

Contributor Author:

No, wait. These are reserved for on-demand, if I am not mistaken.
We reserve cores for system chains, then for lease cores, and what's left is for on-demand. Right, @eskimor?

@seadanda (Contributor) Aug 14, 2024:

Pre-migration the system chains just have leases as if they're regular parachains. They're included in the 52, except for the coretime chain, since it isn't live yet. It will be registered by the time this migration runs, so it'll detect 53 leases (maybe more depending on how the auctions go).

- reservations: Asset Hub, Collectives, Bridge Hub, People Chain and Coretime Chain, for a total of 5 (you'll see 4 at the minute in your tests)
- leases: 48 (plus any new ones from auctions, minus any that end between now and when it runs)

`legacy_paras_count` isn't the same as `leases` though; I think the assertion you're expecting is `system_chains.len() + leases.len() == core_count`, since `core_count` is equal to `legacy_paras_count` by construction (since `num_cores` is zero atm).
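In other words (a sketch of the suggested assertion; the counts are the ones quoted above and will shift with auctions):

```rust
fn main() {
    let system_chains: u16 = 5; // Asset Hub, Collectives, Bridge Hub, People, Coretime
    let leases: u16 = 48;       // ordinary lease holders at migration time
    let core_count: u16 = 53;   // equals legacy_paras_count by construction
    assert_eq!(system_chains + leases, core_count);
}
```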

relay/polkadot/src/lib.rs (outdated review comment, resolved)
relay/polkadot/Cargo.toml (outdated review comment, resolved)
relay/polkadot/src/coretime_migration.rs (outdated review comment, resolved)
relay/polkadot/src/coretime_migration.rs (outdated review comment, resolved)
Comment on lines +251 to +257
let valid_until: u32 = match valid_until.try_into() {
Ok(val) => val,
Err(_) => {
log::error!("Converting block number to u32 failed!");
return None
},
};
Contributor:

Suggested change (delete these lines):
-let valid_until: u32 = match valid_until.try_into() {
-	Ok(val) => val,
-	Err(_) => {
-		log::error!("Converting block number to u32 failed!");
-		return None
-	},
-};

block number is u32

Contributor Author:

It doesn't compile, due to `expected u32, found associated type` errors. Let's keep it as it is.
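A standalone illustration of why the fallible conversion stays (assumed shape: in the generic migration, `valid_until` is a `BlockNumberFor<T>`, an associated type, so only a `TryInto<u32>` bound is available even though the concrete type on Polkadot is `u32`):

```rust
fn to_u32<B: TryInto<u32>>(valid_until: B) -> Option<u32> {
    // The compiler only knows the bound, not the concrete type, so a
    // plain assignment is rejected and the conversion can fail.
    valid_until.try_into().ok()
}

fn main() {
    assert_eq!(to_u32(100u64), Some(100));
    assert_eq!(to_u32(u64::MAX), None);
}
```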

Comment on lines 260 to 261
let round_up = if valid_until % TIME_SLICE_PERIOD > 0 { 1 } else { 0 };
let time_slice = valid_until / TIME_SLICE_PERIOD + TIME_SLICE_PERIOD * round_up;
Contributor:

Suggested change:
-let round_up = if valid_until % TIME_SLICE_PERIOD > 0 { 1 } else { 0 };
-let time_slice = valid_until / TIME_SLICE_PERIOD + TIME_SLICE_PERIOD * round_up;
+let time_slice = (valid_until + TIME_SLICE_PERIOD - 1) / TIME_SLICE_PERIOD;

Contributor Author:

I think we do a different kind of round-up here. `time_slice` is not a number of periods but a lifetime in blocks (divided by 80). Your formula works only for the 'number of periods' case.

Contributor Author:

I'm confused... why do we multiply the round-up by TIME_SLICE_PERIOD?

Shouldn't it be `let time_slice = valid_until / TIME_SLICE_PERIOD + round_up;`, which is equivalent to your formula?

Contributor:

round_up is just 0 or 1. It leads to the same result as what I have written there.

Contributor Author:

It doesn't. `round_up` is multiplied by TIME_SLICE_PERIOD. Have a look: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=a5ce70639f2ee8a169e5d618e707ed74

Contributor:

Basti's version is right. Integer division is just the floor; adding TIME_SLICE_PERIOD rounds up, and subtracting 1 makes sure it doesn't round up when v % T == 0.
The original code is adding 80 timeslices to the end of the lease, rather than one.

Contributor Author:

Fixed
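For the record, a tiny reproduction of the difference (equivalent to the playground link above; `TIME_SLICE_PERIOD` is 80, matching the "divided by 80" in the discussion):

```rust
const TIME_SLICE_PERIOD: u32 = 80;

fn buggy(valid_until: u32) -> u32 {
    let round_up = if valid_until % TIME_SLICE_PERIOD > 0 { 1 } else { 0 };
    valid_until / TIME_SLICE_PERIOD + TIME_SLICE_PERIOD * round_up
}

fn fixed(valid_until: u32) -> u32 {
    (valid_until + TIME_SLICE_PERIOD - 1) / TIME_SLICE_PERIOD
}

fn main() {
    // One block past a timeslice boundary: the buggy formula jumps a
    // whole period (80 timeslices) instead of one timeslice.
    assert_eq!(buggy(81), 81);
    assert_eq!(fixed(81), 2);
    // Exact multiples must not round up.
    assert_eq!(fixed(160), 2);
}
```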

relay/polkadot/src/coretime_migration.rs (outdated review comment, resolved)
relay/polkadot/src/coretime_migration.rs (outdated review comment, resolved)

let pool_content = message_content.clone().chain(pool).collect();
let leases_content_1 = message_content
.clone()
.chain(leases.by_ref().take(legacy_paras_count / 2)) // split in two messages to avoid overweighted XCM
Contributor:

Suggested change:
-	.chain(leases.by_ref().take(legacy_paras_count / 2)) // split in two messages to avoid overweighted XCM
+	.chain(leases.by_ref().take(leases.len() / 2)) // split in two messages to avoid overweighted XCM

`legacy_paras_count` includes system parachains; `leases` does not.

Contributor Author:

Yes, but `leases` is an iterator, and it's a bit involved to get its length. Let's keep it that way. It's not a big deal that the messages aren't equally distributed, as long as they are sent in two passes (it's just one ParaId which causes the overweight).
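A small demonstration of why the uneven split is still safe (illustrative values, not the PR's code): `by_ref().take(n)` consumes the first n items, and the remainder drains into the second message, so every lease lands in exactly one of the two XCMs.

```rust
fn main() {
    let leases = vec![2000u32, 2002, 3359, 3377, 3388];
    let legacy_paras_count = 7; // larger than leases.len(): it includes system chains
    let mut iter = leases.into_iter();
    let first: Vec<u32> = iter.by_ref().take(legacy_paras_count / 2).collect();
    let second: Vec<u32> = iter.collect();
    // The halves are uneven (3 vs 2), but nothing is lost or duplicated.
    assert_eq!(first.len() + second.len(), 5);
}
```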


let config = configuration::ActiveConfig::<T>::get();
// num_cores was on_demand_cores until now:
for on_demand in 0..config.scheduler_params.num_cores {
Contributor:

On Polkadot this is not the path of this config entry; checking the chain state, it's at `config.coretime_cores`. Is there a migration ahead of this one to change the configuration in storage?

@tdimitrov (Contributor Author) Aug 15, 2024:

`parachains_configuration::migration::v12::MigrateToV12<Runtime>` does this change:
https://github.com/paritytech/polkadot-sdk/blob/ebf4f8d2d590f41817d5d38b2d9b5812a46f2342/polkadot/runtime/parachains/src/configuration/migration/v12.rs#L146-L158

It gets executed before the coretime migration.
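Roughly, the relevant part of that change (struct shapes heavily simplified here; see the linked v12 migration for the real code):

```rust
struct OldConfig { coretime_cores: u32 }
struct SchedulerParams { num_cores: u32 }
struct NewConfig { scheduler_params: SchedulerParams }

// The old top-level `coretime_cores` value becomes
// `scheduler_params.num_cores` in the v12 layout.
fn migrate_config(old: OldConfig) -> NewConfig {
    NewConfig { scheduler_params: SchedulerParams { num_cores: old.coretime_cores } }
}

fn main() {
    let new = migrate_config(OldConfig { coretime_cores: 52 });
    assert_eq!(new.scheduler_params.num_cores, 52);
}
```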

Contributor:

Still, the comment is actually not correct for Polkadot. It used to be `coretime_cores`, not `on_demand_cores`, and there was no on-demand core before, thus nothing to migrate.

Contributor:

(This loop will do nothing, which is fine)

Contributor Author:

I've removed the comment.

@seadanda (Contributor) left a comment:

Looks good

relay/polkadot/src/coretime_migration.rs (outdated review comment, resolved)

@seadanda mentioned this pull request Aug 15, 2024
@seadanda (Contributor) left a comment:

Re-approving to trigger a CI run with the broken runner disabled

@tdimitrov (Contributor Author)

There was a try-runtime check verifying that the relay chain is sending XCM messages to the broker chain. It was using `dmq_contents`, which is not accessible from the runtime crate. Since I have tested the migration with chopsticks, I think it's safe to drop it.

Please see this commit for details: 4675183

@seadanda (Contributor)

LGTM. zombienet is failing in CI, but it seems unrelated.

@seadanda (Contributor)

/merge

@fellowship-merge-bot (Contributor)

Enabled auto-merge in Pull Request

Available commands
  • /merge: Enables auto-merge for Pull Request
  • /merge cancel: Cancels auto-merge for Pull Request
  • /merge help: Shows this menu

For more information see the documentation

@fellowship-merge-bot fellowship-merge-bot bot enabled auto-merge (squash) August 15, 2024 21:04
@fellowship-merge-bot fellowship-merge-bot bot merged commit ffe6a36 into polkadot-fellows:main Aug 15, 2024
47 checks passed
github-merge-queue bot pushed a commit to paritytech/polkadot-sdk that referenced this pull request Aug 20, 2024
nazar-pc pushed a commit to nazar-pc/polkadot-sdk that referenced this pull request Aug 21, 2024