
epoch stale after mainchain rpc massive restart #2559

Closed
532910 opened this issue Sep 8, 2023 · 10 comments
Labels
blocked (Can't be done because of something), bug (Something isn't working), I4 (No visible changes), neofs-ir (Inner Ring node application issues), S4 (Routine), U2 (Seriously planned)

Comments


532910 commented Sep 8, 2023

To reproduce: stop all mainchain RPC nodes (for an upgrade, for example), then start them again.

Workaround: restart all IR nodes.

532910 added the triage label Sep 8, 2023
roman-khimov added the bug (Something isn't working) and neofs-ir (Inner Ring node application issues) labels and removed the triage label Dec 7, 2023
roman-khimov added this to the v0.40.0 milestone Dec 7, 2023
@roman-khimov (Member)

Containers can't be created in this case either, until all nodes are restarted.

@carpawell (Member)

> stop all mainchain rpc

@532910, how long were the RPCs down? Seconds? Minutes?

> until all nodes are restarted

@roman-khimov, which nodes exactly? IR? SN?

532910 (Author) commented Dec 18, 2023

> how long

Let's start with minutes.

> what exact nodes? IR? SN?

IR, I believe.

roman-khimov added the U2 (Seriously planned), S4 (Routine), and I4 (No visible changes) labels Dec 21, 2023
@carpawell (Member)

No progress so far: a single-node local consensus setup does not allow reproducing this. The best I can do here is wait until the next update and check the logs/profiles. If that is unacceptable, I may try a local 4-out-of-7 node consensus setup.

carpawell added a commit that referenced this issue Jan 17, 2024
Scenario:
0. at least one subscription has been performed
1. another subscription is being done
2. a notification from one of the `0.` point's subs is received

If `2.` happens between `0.` and `1.`, a deadlock appears: the notification
routing process is blocked on the subscription lock, while that lock cannot
be released because the subscription RPC cannot complete before the
just-arrived notification is handled (read from the neo-go subscription
channel).

Relates #2559.

Signed-off-by: Pavel Karpy <[email protected]>
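
To illustrate the cycle described in this commit message, here is a minimal, hypothetical Go sketch of the locking pattern; the type and function names are illustrative and are not the actual neofs-node identifiers:

```go
package subscriber

import "sync"

// subs is a hypothetical stand-in for the subscription manager; it only
// models the locking pattern described in the commit message above.
type subs struct {
	mu       sync.Mutex
	channels map[string]chan any // subscription ID -> notification channel
}

// routeNotifications runs in the single goroutine that also reads RPC
// responses from the websocket connection to the RPC node.
func (s *subs) routeNotifications(in <-chan any) {
	for n := range in {
		s.mu.Lock() // (a) blocks while subscribe() below holds the lock
		for _, ch := range s.channels {
			ch <- n
		}
		s.mu.Unlock()
	}
}

// subscribe performs a subscription RPC while holding the lock.
func (s *subs) subscribe(id string, rpc func() error) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	// (b) the RPC response is read by the same goroutine that is stuck at
	// (a), so this call never returns and the lock is never released.
	if err := rpc(); err != nil {
		return err
	}
	s.channels[id] = make(chan any, 1)
	return nil
}
```

A mass RPC restart forces every client to resubscribe while old notifications may still be in flight, which plausibly makes exactly this overlap much more likely.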
carpawell added a commit that referenced this issue Jan 23, 2024
Scenario:
0. at least one subscription has been performed
1. another subscription is being done
2. a notification from one of the `0.` point's subs is received

If `2.` happens between `0.` and `1.`, a deadlock appears: the notification
routing process is blocked on the subscription lock, while that lock cannot
be released because the subscription RPC cannot complete before the
just-arrived notification is handled (read from the neo-go subscription
channel).

`switchLock` does the same thing for `routeNotifications`: it ensures that no
routine is changing (or will be changing) the subscription channels, even
though `subs`'s lock was created for this purpose initially.

Relates #2559.

Signed-off-by: Pavel Karpy <[email protected]>
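
Continuing the hypothetical sketch above, the general way to break such a cycle is to avoid holding the routing lock across the RPC round-trip; this is only an illustration of the idea, not the actual `switchLock` patch:

```go
// subscribeFixed performs the blocking RPC before taking the lock, so
// routeNotifications stays free to drain pending notifications while the
// subscription request is in flight.
func (s *subs) subscribeFixed(id string, rpc func() error) error {
	if err := rpc(); err != nil {
		return err
	}
	s.mu.Lock()
	s.channels[id] = make(chan any, 1)
	s.mu.Unlock()
	return nil
}
```

The remaining gap (notifications for the new subscription arriving before its channel is registered) is presumably why a dedicated mechanism such as `switchLock` is needed rather than this naive reordering.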
carpawell added a commit that referenced this issue Jan 24, 2024
roman-khimov added the blocked (Can't be done because of something) label Jan 29, 2024
@roman-khimov (Member)

Can't be reproduced at this stage; waiting for another case in some network.

roman-khimov modified the milestones: v0.40.0 → v0.41.0 Jan 29, 2024
roman-khimov modified the milestones: v0.41.0 → v0.42.0 Mar 22, 2024
@roman-khimov (Member)

Seems to be reproducible if 2/3 RPC nodes go offline for some time.

roman-khimov modified the milestones: v0.42.0 → v0.43.0 May 22, 2024
@evgeniiz321

Got a stable reproduction during the new payment tests (tests.payment.test_container_payments.TestContainerPayments#test_container_payments): https://rest.fs.neo.org/HXSaMJXk2g8C14ht8HSi7BBaiYZ1HeWh2xnWPGQCg4H6/477-1722817756/index.html#suites/44e11ced39071e8d0cfc11c5b94622ba/d3f9a85ef9ecb46a/.

roman-khimov modified the milestones: v0.43.0 → v0.44.0 Aug 20, 2024
carpawell (Member) commented Sep 27, 2024

@evgeniiz321, as I understand it, in your case you want to make the test faster, so you change the epoch duration to 20 blocks, right? I do not see any epoch ticks after the change. It is impossible to apply a new epoch duration immediately: we have no notifications about config changes, so we cannot recalculate the next block for epoch handling. As I understand it, you wait for a new epoch for no longer than 60 seconds; a new epoch will probably not happen in that window if a 240-second epoch was in effect before (1-second blocks with the default 240-block epoch duration). Can you either tick the epoch manually after tuning the network setting, or increase the maximum wait to 240 seconds?
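
To make the timing concrete, here is a back-of-the-envelope sketch; the constants are the values assumed in this comment (1-second blocks, 240-block default epoch), not read from any real config:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	const (
		blockTime        = 1 * time.Second  // assumed block interval
		oldEpochDuration = 240              // assumed old epoch length, in blocks
		testTimeout      = 60 * time.Second // the test's maximum wait
	)
	// Worst case: the epoch-duration setting is changed right after a tick,
	// but the next tick is still scheduled on the old 240-block cadence.
	nextTick := time.Duration(oldEpochDuration) * blockTime
	fmt.Printf("next epoch tick in up to %v, but the test waits only %v\n",
		nextTick, testTimeout)
}
```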

And yes, this is not related to the original flapping problem.

roman-khimov modified the milestones: v0.44.0 → v0.45.0 Nov 8, 2024
carpawell (Member) commented Nov 11, 2024

We have not seen this exact issue for a long time, so I am closing it; it can be reopened once it happens again.

carpawell closed this as not planned (won't fix, can't repro, duplicate, stale) Nov 11, 2024
carpawell removed this from the v0.45.0 milestone Nov 11, 2024
532910 reopened this Dec 4, 2024
@carpawell (Member)

Still not the same case. See #3007, which is more related to this situation.

carpawell closed this as not planned (won't fix, can't repro, duplicate, stale) Dec 4, 2024