Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

getPastEvents throws uncaught exception when processing invalid UTF-8 characters [1.0.0-beta.34] #1610

Closed
dskvr opened this issue May 4, 2018 · 12 comments
Assignees
Labels
1.x 1.0 related issues 2.x 2.0 related issues Needs Clarification Requires additional input Stale Has not received enough activity

Comments

@dskvr
Copy link

dskvr commented May 4, 2018

Problem

  • Solidity contract methods can accept invalid UTF-8 characters from user input (by design?)
  • Solidity contracts can log an invalid UTF-8 character
  • Web3js aggregates logs
  • Javascript is unicode, doesn't play nice with invalid UTF-8 characters.
  • If Web3js encounters a log with an invalid UTF-8 character, it has an irrecoverable error
  • utf-8 module's utf8.decode() utilized by Web3 throws an uncatchable error if there is an invalid utf-8 character.
  • The uncaught exception prevents the developer from proceeding

Impact

Contracts with lack of or improper user input validation on public methods.

What I expect to happen

Web3js catches the error and skips the record

What actually happens

Irrecoverable error

Possible Solution

Catch the error and skip the record

@dskvr dskvr changed the title getPastEvents throws uncaught exception when processing invalid UTF-8 characters getPastEvents throws uncaught exception when processing invalid UTF-8 characters [1.0.0-beta.34] May 4, 2018
@dskvr
Copy link
Author

dskvr commented May 4, 2018

utf8 throws an uncatchable exception when it cannot decode hex.

From UTF8:

Decodes any given UTF-8-encoded string (byteString) as UTF-8, and returns the UTF-8-decoded version of the string. It throws an error when malformed UTF-8 is detected. (If you need to be able to decode encoded non-scalar values as well, use WTF-8 instead.)

For my implementation purposes I have patched line 218 of utils:
https://github.com/ethereum/web3.js/blob/1.0/packages/web3-utils/src/utils.js#L218

With

let returnValue;
try {
  returnValue = utf8.decode(str);
}
catch(e) {
  returnValue = "";
}
finally {
  return returnValue;
}

commit (don't mind the meta)

I'm not going to be submitting a PR because this is not a universally safe patch, but works for my implementation. Leaving open.

here's the hex that broke web3 for me:
0x0e00000000000f8c0000000000000005000f0000070000000909

Cheers.

@hexagon6
Copy link

Maybe https://github.com/mathiasbynens/wtf-8 might be useful for that case?

@dskvr
Copy link
Author

dskvr commented Jun 19, 2018

@hexagon6 Agreed, but this would be a bigger change than the above hack.

@nivida nivida self-assigned this Aug 9, 2018
@nivida nivida added the Bug Addressing a bug label Aug 9, 2018
@D-Nice
Copy link

D-Nice commented Oct 2, 2018

This would appear to be a serious bug that really needs to be fixed, as any web3 dependent service using this library, which takes any form of user input, could be taken down from this bug.

@nivida nivida added the In Progress Currently being worked on label May 2, 2019
@nivida nivida added 1.x 1.0 related issues 2.x 2.0 related issues labels Jun 20, 2019
@ricmoo
Copy link
Contributor

ricmoo commented Nov 20, 2019

I am considering adding an “unsafeString” type or something to ethers.js that would help web3 deal with this issue, which would allow for invalid surrogate pairs, overlong sequences and invalid ordinal bits set.

It is a hard problem to solve, because not sanitizing the output can lead to serious security implications for dapps; if you were to hash the string, it would not match, for example, which may be used in a commit-reveal. If I can trick you into committing to a wrong hash, you cannot reveal, which may mean you cannot properly challenge, for example.

I would also recommend that Web3.js also add UTF8 sanitization to all contract calls, so it fails before it hits the blockchain.

That said, there are still multiple ways to try interpreting invalid UTF8 sequences. The current UTF8 library allows a “allowInvalid” flag (which is never set to true, but it was easy to add and potentially useful) which uses the “ignore” strategy (which simply drops the naughty bits), but for this I think the replace strategy makes more sense, replacing the invalid sequence with the placeholder, And continues to the next octet with the ordinal but cleared... I believe (please correct me if I’m wrong), that’s what the native JS string does.

I think for now, you could also use the “bytes” type for decoding, which doesn’t care about data representation. But that needs to be further in, since the contract sighash needs the string “string” to match.

Still thinking about this though...

@nivida
Copy link
Contributor

nivida commented Nov 21, 2019

I would also recommend that Web3.js also add UTF8 sanitization to all contract calls, so it fails before it hits the blockchain.

Agreed 👌

but for this I think the replace strategy makes more sense, replacing the invalid sequence with the placeholder, And continues to the next octet with the ordinal but cleared... I believe (please correct me if I’m wrong), that’s what the native JS string does.

I totally agree with you on this!

I think for now, you could also use the “bytes” type for decoding, which doesn’t care about data representation

Yep, this should work.

I was checking if there are other libs who do have the replacement logic as for example python does have (@ricmoo showed me the functionalities python has). But sadly there is currently no lib which does handle this as I would expect.

Luckily is @ricmoo already creating such an improved UTF-8 lib based on already existing code 💪

I will wait until Richard has finished his lib and will use it for the web3.utils.toUtf8call and all other utility functions we have. This means this specific issue will be fixed as soon as @ricmoo has finished his lib in a stable manner or the new AbiCoder v5 which does include this fix got released as a stable version.

@cag
Copy link

cag commented Apr 16, 2020

Copying over my comments on how an API design change might look to work with this. I assume the use of the TextEncoder standard in the comments, but probably the ethers.js codec can be swapped in if that is the way forward here.


Referring to getPastEvents, I would propose something like the following:

Add an option, acknowledging that parsing event data might fail due to string processing, maybe called stringParsingErrorMode or something. A parsing error might occur for some non-empty subset of the events being gathered. We have replace which can just replace the bad sequence starts with �, skip which skips that subset and returns the events for which string parsing didn't fail, or error which rejects with the encoding error encountered. It's fine if the default behavior remains error, as long as these other modes are supported.

In particular, skip would be nice for dapps that do expect arbitrary input data from events, but do not want to be DOS'd by a bad event.

@cgewecke
Copy link
Collaborator

Just a note: we tried to reproduce this bug in #3497 without success.

If anyone has a reproduction case, please advise.

@cgewecke cgewecke added Needs Clarification Requires additional input and removed Bug Addressing a bug In Progress Currently being worked on labels Jun 19, 2020
@github-actions
Copy link

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions

@github-actions github-actions bot added the Stale Has not received enough activity label Jul 20, 2020
@dskvr
Copy link
Author

dskvr commented Jul 22, 2020

@cgewecke At one point I had the exact block number and event that caused utf8 to throw exception, but it has been more than 3 2 years since I reported this.

@github-actions github-actions bot removed the Stale Has not received enough activity label Jul 23, 2020
@dskvr
Copy link
Author

dskvr commented Aug 4, 2020

@cgewecke In the original thread I mentioned a hex that brought up the issue for me. You could include that hex string in a unit test against whatever module is presently implemented for utf-8

0x0e00000000000f8c0000000000000005000f0000070000000909

@github-actions
Copy link

github-actions bot commented Oct 4, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions. If you believe this was a mistake, please comment.

@github-actions github-actions bot added the Stale Has not received enough activity label Oct 4, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
1.x 1.0 related issues 2.x 2.0 related issues Needs Clarification Requires additional input Stale Has not received enough activity
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants