Skip to content
This repository has been archived by the owner on Nov 6, 2020. It is now read-only.

Light client sometimes deadlocks from incoming RPC requests #8918

Closed
tomaka opened this issue Jun 18, 2018 · 5 comments
Closed

Light client sometimes deadlocks from incoming RPC requests #8918

tomaka opened this issue Jun 18, 2018 · 5 comments
Labels
F2-bug 🐞 The client fails to follow expected behavior. M4-core ⛓ Core client code / Rust. P5-sometimesoon 🌲 Issue is worth doing soon.
Milestone

Comments

@tomaka
Copy link
Contributor

tomaka commented Jun 18, 2018

Sometimes, when making an RPC request, the light client doesn't answer and becomes unresponsive through the JSON-RPC interface.

The logs don't show anything out of the ordinary, except a lot of:

17:02:35 IO Worker #1 TRACE network Session read error: 3:Some(0x7275325d87bae765ff85719c63cf68687429d874386a1e4106a108d87f6c0abeadce7b29eb44372993d33ad78ead7ce12065ad2ae3fab207e3c8c07ca48fa245) (Ok(V4(13.231.166.14:30303))) Error(Disconnect(TooManyPeers), State { next_error: None, backtrace: None })

However this seems to happen even when the node was running correctly.

If I had an hypothesis, it would be that we disconnect from the peer we're making a request from. But I'm really not sure that this is the problem.

cc @Tbaut @amaurymartiny @rphmeier

@tomaka tomaka added F2-bug 🐞 The client fails to follow expected behavior. P5-sometimesoon 🌲 Issue is worth doing soon. M4-core ⛓ Core client code / Rust. labels Jun 18, 2018
@5chdn 5chdn added this to the 1.12 milestone Jun 24, 2018
@tomusdrw
Copy link
Collaborator

Could be fixed by #8977 ?

@niklasad1
Copy link
Collaborator

niklasad1 commented Jun 27, 2018

@tomusdrw
AFAIK the light client has its own implementation chain_info() (has limited trait) and is not using ethcore::blockchain::BlockChain so I guess they are not related!

@rphmeier Can you confirm or explain how wrong I am? :)

@Tbaut
Copy link
Contributor

Tbaut commented Jul 3, 2018

I could reproduce and trace. Any RPC such as getBalance don't get answered.
Here is a

  • gist of 1min of logs when it happens with -l pip,rpc,network
  • gist of the same 1min of logs when it happens with network filtered out for readability
  • gist of another time it happens with -l less,light_fetch,on_demand,pip_provider

Note that the last one, the problem seems to be the peer discovery.. oscillating between 1 and 0.

@rphmeier
Copy link
Contributor

rphmeier commented Jul 3, 2018

@niklasad1 this is correct

@5chdn 5chdn modified the milestones: 2.0, 2.1 Jul 17, 2018
@cheme cheme self-assigned this Aug 20, 2018
@niklasad1 niklasad1 removed their assignment Aug 20, 2018
andresilva pushed a commit that referenced this issue Sep 4, 2018
This PR is fixing deadlock for #8918 

It avoids some recursive calls on light_sync by making state check optional for Informant.

The current behavior is to display the information when informant checks if block is major version.
This change a bit the informant behavior, but not on most cases.

To remember where and how this kind of deadlock are likely to happen (not seen with Parkinglot deadlock detection because it uses std condvar), I am adding a description of the deadlock.
Also, for the reviewers there may be better solution than modifying the informant.

### Thread1 

- ethcore/sync/light_sync/mod.rs

A call to the light handler through any Io (having a loop of rpc query running on like client makes the dead lock way more likely).
At the end of those calls we systematically call `maintain_sync` method.

Here maintain_sync locks `state` (it is the deadlock cause), with a write purpose

`maintain_sync` -> `begin_search` with the state locked open

`begin_search` -> lightcliennt `flush_queue` method

- ethcore/light/src/client/mod.rs

`flush_queue` -> `flush` on queue (HeaderQueue aka VerificationQueue of headers)

- ethcore/src/verification/queue/mod.rs

Condition there is some unverified or verifying content

`flush` wait on a condvar until the queue is empty. The only way to unlock condvar is that worker is empty and unlock it (so thread 2 is Verification worker).

### Thread2

A verification worker at the end of a verify loop (new block).

- ethcore/src/verification/queue/mod.rs

thread loops on `verify` method.

End of loop condition is_ready -> Import the block immediately 

calls `set_sync` on QueueSignal which send a BlockVerified ClientIoMessage in inner channel (IoChannel of ClientIoMessage) using `send_sync`

- util/io/src/service_mio.rs

IoChannel `send_sync` method calls all handlers with `message` method; one of the handlers is ImportBlocks IoHandler (with a single inner Client service field)

- ethcore/light/src/client/service.rs

`message` trigger inner method `import_verified`

- core/light/src/client/mod.rs

`import_verified` at the very end notify the listeners of a new_headers, one of the listeners is Informant `listener` method

- parity/informant.rs

`newHeaders` run up to call to `is_major_importing` on its target (again clinet)

-  ethcore/sync/src/light_sync/mod.rs

Here `is_major_importing` tries to get state lock (read purpose only) but cannot because of previous state lock, thus deadlock
@niklasad1
Copy link
Collaborator

Closed by #9385

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
F2-bug 🐞 The client fails to follow expected behavior. M4-core ⛓ Core client code / Rust. P5-sometimesoon 🌲 Issue is worth doing soon.
Projects
None yet
Development

No branches or pull requests

7 participants