Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fatal error in Raiden wizard testing #4451

Closed
pirapira opened this issue Jul 25, 2019 · 11 comments · Fixed by #4693
Closed

Fatal error in Raiden wizard testing #4451

pirapira opened this issue Jul 25, 2019 · 11 comments · Fixed by #4693
Assignees
Milestone

Comments

@pirapira
Copy link
Contributor

Problem Definition

I saw a raiden node crash for ValueError: Channel is not in an open state..

The console output is copied in https://gist.github.com/pirapira/a59f2cd6ff736aef4233f242db0cd366

System Description

Here add a detailed description of your system, e.g. output of the following script:

% uname -a                                                                                                                                                                                                       ~/Downloads/home/pirate/Documents/Raiden/Binaries/Wizard
command -v solc && solc --version
command -v geth && geth version
command -v parity && parity --version
command -v raiden && raiden version
[ -d .git ] && git rev-parse HEAD
Linux skk 5.0.0-21-generic #22-Ubuntu SMP Tue Jul 2 13:27:33 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
/home/yoichi/bin/solc
solc, the solidity compiler commandline interface
Version: 0.5.4+commit.9549d8ff.Linux.g++
/usr/bin/geth
INFO [07-25|12:08:25.880] Bumping default cache on mainnet         provided=1024 updated=4096
Geth
Version: 1.9.1-stable
Git Commit: b7b2f60f866d92e75db5bfe5be51b48eadeb3150
Architecture: amd64
Protocol Versions: [63]
Network Id: 1
Go Version: go1.11.5
Operating System: linux
GOPATH=
GOROOT=/usr/lib/go-1.11
/usr/bin/parity
Parity Ethereum
  version Parity-Ethereum/v2.4.3-beta-33860c3-20190322/x86_64-linux-gnu/rustc1.33.0
Copyright 2015-2019 Parity Technologies (UK) Ltd.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

By Wood/Paronyan/Kotewicz/Drwięga/Volf
   Habermeier/Czaban/Greeff/Gotchac/Redmann

@pirapira
Copy link
Contributor Author

  File "raiden/connection_manager.py", line 322, in _join_partner
  File "raiden/api/python.py", line 610, in set_total_channel_deposit
ValueError: Channel is not in an open state.

are the relevant lines.

@lullis
Copy link
Contributor

lullis commented Aug 23, 2019

This is almost impossible to debug without log files, no?

@pirapira
Copy link
Contributor Author

I don't know if I can reproduce this. So the first step is to reproduce the problem using the versions used when the fatal error was spotted.

@Dominik1999 Dominik1999 assigned lullis and unassigned rakanalh Aug 29, 2019
@rakanalh rakanalh assigned LefterisJP and unassigned lullis Aug 30, 2019
LefterisJP added a commit to LefterisJP/raiden that referenced this issue Aug 30, 2019
Maybe fix raiden-network#4451

If a channel is not open, or is in a closing state (still open but
waiting on the closing transaction result) the `set_total_deposit`
function was throwing a `ValueError`.

That is a valid and recoverable race condition so this commit
introduces an appropriate error and handles it in all places where it
can be thrown.
@pirapira
Copy link
Contributor Author

@LefterisJP
Copy link
Contributor

More info needed

All right so I had a look at the log files and I am confused.

This should not happen, not according to the code and the logs.

But I can't be sure about the code since I can't seem to find which release you were testing @pirapira.

I thought this is testing a particular release but no release contains the raise ValueErrror() statement for this problem at line 610 of raiden/api/python.py

2019-07-25 10:03:29.719679 [info     ] 127.0.0.1 - - [2019-07-25 12:03:29] "GET /api/v1/channels HTTP/1.1" 200 1707 0.003652 [raiden.api.rest.pywsgi] 
Traceback (most recent call last):
  File "src/gevent/greenlet.py", line 716, in gevent._greenlet.Greenlet.run
  File "raiden/connection_manager.py", line 322, in _join_partner
  File "raiden/api/python.py", line 610, in set_total_channel_deposit
ValueError: Channel is not in an open state.

When I checked the logs I see the version shown as: 0.100.5.dev113+g2ba85cbee

Which means this is not a release. Plus I can't find the commit g2ba85cbee anywhere.

Can you help me and tell me the exact commit or release you were testing? Also why were you not testing a release? I thought that the raiden wizard downloads specific Raiden releases.

Current findings

Now ... onto why I can't see why this is possible at least according to the code in current develop and the logs.

I visualized the logs with our tool and the following parameters:

./tools/debugging/json-log-to-html.py -o yoichi.html ~/wizard_error.log -l filelock -l raiden.api.rest.pywsgi -l raiden.utils.filters --no-time-diff

Yoichi's node is 0x16Fa85F3a477B95c7180209B25ef0291ade95e3F
Partner is 0x5aC194D2E9089B88f17B6047f6c654EE6DE52871
Token network is: 0xf5D654f8Fc609117E9ce77b09798ee09E47F3b97

At line 11286 of the generated logs, just right before the crash that channel is shown to be open when the channel list is queried via the API.

And then at line 11296 the crash happens.

The crash happens due to the unhandled ValueError which is thrown if the channel is not in an open state. The local state is what queried, and not the blockchain.

This is how the query looks like:

def get_status(channel_state: NettingChannelState) -> ChannelState:
if channel_state.settle_transaction:
finished_successfully = (
channel_state.settle_transaction.result == TransactionExecutionStatus.SUCCESS
)
running = channel_state.settle_transaction.finished_block_number is None
if finished_successfully:
result = ChannelState.STATE_SETTLED
elif running:
result = ChannelState.STATE_SETTLING
else:
result = ChannelState.STATE_UNUSABLE
elif channel_state.close_transaction:
finished_successfully = (
channel_state.close_transaction.result == TransactionExecutionStatus.SUCCESS
)
running = channel_state.close_transaction.finished_block_number is None
if finished_successfully:
result = ChannelState.STATE_CLOSED
elif running:
result = ChannelState.STATE_CLOSING
else:
result = ChannelState.STATE_UNUSABLE
else:
result = ChannelState.STATE_OPENED
return result

The exception contains the channel state:

NettingChannelState(canonical_identifier=CanonicalIdentifier(chain_identifier=5, token_network_address=b'\xf5\xd6T\xf8\xfc`\x91\x17\xe9\xcew\xb0\x97\x98\xee\t\xe4\x7f;\x97', channel_identifier=10), open_transaction=TransactionExecutionStatus(started_block_number=None, finished_block_number=1001986, result='success'), close_transaction=None, settle_transaction=None, update_transaction=None)])>

and in there you can see that close_transaction and settle_transaction are None.

The above in combination with the mistaken source code lines lead me to believe the problem may lie in the code of the commit tested which is why I want to find out what commit that is.

@pirapira
Copy link
Contributor Author

pirapira commented Sep 2, 2019

Can you help me and tell me the exact commit or release you were testing? Also why were you not testing a release? I thought that the raiden wizard downloads specific Raiden releases.

@taleldayekh sent me an email on 24.7.2019 and it contained the following link:
https://send.firefox.com/download/0733fba87dc12c12/#2Gn8lAQ_bn_du8euJZyfkQ

@pirapira
Copy link
Contributor Author

pirapira commented Sep 2, 2019

but the link has expired. I'm sending @LefterisJP the executable I ran.

@LefterisJP
Copy link
Contributor

It seems that the version of Raiden wizard you sent me is installing the Raiden nightly from 24/07/2019.

This is the commit.

The line number matches there but the code is not that much different and the same question I posed above stands. I don't see how this can happen as both close_transaction and settle_transaction are None.

Perhaps the DB can contain more data? It should be under ~/.raiden

@pirapira
Copy link
Contributor Author

pirapira commented Sep 2, 2019

I sent the db files as well.

@pirapira
Copy link
Contributor Author

pirapira commented Sep 2, 2019

db.zip

@LefterisJP
Copy link
Contributor

Okay so I misunderstood the logs it seems.

The following:

NettingChannelState(canonical_identifier=CanonicalIdentifier(chain_identifier=5, token_network_address=b'\xf5\xd6T\xf8\xfc`\x91\x17\xe9\xcew\xb0\x97\x98\xee\t\xe4\x7f;\x97', channel_identifier=10), open_transaction=TransactionExecutionStatus(started_block_number=None, finished_block_number=1001986, result='success'), close_transaction=None, settle_transaction=None, update_transaction=None)])>

is not the channel state of the channel that lead to the crush! It's part of the repr of the connection manager which shows (I have no idea why...) the open channels. That log entry is so confusing and cost me a hell of a lot of time ...

def __repr__(self) -> str:
if self.raiden.wal is None:
return (
f"{self.__class__.__name__}(target={self.initial_channel_target} "
"WAL not initialized)"
)
open_channels = views.get_channelstate_open(
chain_state=views.state_from_raiden(self.raiden),
payment_network_address=self.registry_address,
token_address=self.token_address,
)
return (
f"{self.__class__.__name__}(target={self.initial_channel_target} "
+ f"channels={len(open_channels)}:{open_channels!r})"
)

So in conclusion ... another channel was closed/closing. I can't say which one since:

  1. The connection manager is not part of the state machine and can't see information about the attempted funding there
  2. There is no logging in the _join_partner() function and as such can't see which is the channel for which the deposit failed.

I have a good guess, since a little less than 3 mins before the crash Yoichi seems to have closed channel with identifier 11.

So with the lack of logs I can only assume that it was a timing problem. Before the closing was processed the connection manager entered here:

open_channels = views.get_channelstate_open(

and counted the channel in the open channels.

Later down the line a greenlet was spawned to further fund this channel here

And by the time we got to the total deposit for the channel the state had switched to closing.

So in any case the PR I made should handle this.

LefterisJP added a commit to LefterisJP/raiden that referenced this issue Sep 3, 2019
If I had these logs when debugging this:
raiden-network#4451 (comment)
things would have gone much faster.
LefterisJP added a commit that referenced this issue Sep 3, 2019
Maybe fix #4451

If a channel is not open, or is in a closing state (still open but
waiting on the closing transaction result) the `set_total_deposit`
function was throwing a `ValueError`.

That is a valid and recoverable race condition so this commit
introduces an appropriate error and handles it in all places where it
can be thrown.
hackaugusto pushed a commit that referenced this issue Sep 3, 2019
If I had these logs when debugging this:
#4451 (comment)
things would have gone much faster.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants