Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bpo-43086: Added handling for excess data in binascii.a2b_base64 #24402

Merged
merged 26 commits into from
Jul 19, 2021

Conversation

idan22moral
Copy link
Contributor

@idan22moral idan22moral commented Jan 31, 2021

Currently, when providing binascii.a2b_base64() base-64 input with excess data after the padding (=/==), the excess data is ignored.

Example:

import binascii
binascii.a2b_base64(b'aGVsbG8=')       # b'hello' (valid)
binascii.a2b_base64(b'aGVsbG8==')      # b'hello' (ignoring data)
binascii.a2b_base64(b'aGVsbG8=python') # b'hello' (ignoring data)

Note: MANY libraries (such as the all-time favorite base64) use this function as their decoder.

Why is it problematic:

  • User input can contain additional data after base64 data, which can lead to unintended behavior in products.
  • Well-crafted user input can be used to bypass conditions in code (example in the referenced tweet).
  • Can be used to target vulnerable libraries and bypass authentication mechanism such as JWT (potentially).

The logic behind my fix PR on GitHub:

  • Before deciding to finish the function (after knowing the fact that we passed the data padding),
    we should check if there's no more data after the padding.
  • If excess data exists, we should raise an error, free the allocated writer, and return null.
  • Else, everything's fine, and we can proceed to the function's end as previously.

Though not publicly disclosed, this behavior can lead to security issues in heavily-used projects.
Preventing this behavior sounds more beneficial than harmful, since there's no known good usage for this behavior.

From what I read, the python implementation in not so close (when speaking about this case of course) to the base64 RFC.
(link: https://tools.ietf.org/html/rfc4648#section-3.3)

Thanks to Ori Damari for bringing this behavior up,
and thanks to Ryan Mast, and many of the other great guys for discussing the problem in the comments.

Link to the tweet


Idan Moral
Twitter: https://twitter.com/idan_moral
GitHub: https://github.com/idan22moral

https://bugs.python.org/issue43086

@the-knights-who-say-ni
Copy link

Hello, and thanks for your contribution!

I'm a bot set up to make sure that the project can legally accept this contribution by verifying everyone involved has signed the PSF contributor agreement (CLA).

Recognized GitHub username

We couldn't find a bugs.python.org (b.p.o) account corresponding to the following GitHub usernames:

@idan22moral

This might be simply due to a missing "GitHub Name" entry in one's b.p.o account settings. This is necessary for legal reasons before we can look at this contribution. Please follow the steps outlined in the CPython devguide to rectify this issue.

You can check yourself to see if the CLA has been received.

Thanks again for the contribution, we look forward to reviewing it!

@github-actions
Copy link

github-actions bot commented Mar 5, 2021

This PR is stale because it has been open for 30 days with no activity.

@github-actions github-actions bot added the stale Stale PR or inactive for long period of time. label Mar 5, 2021
@idan22moral
Copy link
Contributor Author

Any feedback? 😅

@github-actions github-actions bot removed the stale Stale PR or inactive for long period of time. label Mar 7, 2021
@gpshead gpshead self-assigned this Mar 13, 2021
Copy link
Member

@gpshead gpshead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a regression test to Lib/test/test_binascii.py?

Modules/binascii.c Outdated Show resolved Hide resolved
@bedevere-bot
Copy link

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

@gpshead
Copy link
Member

gpshead commented Mar 13, 2021

In addition to the above, this PR as is causes a bunch of test suite failures that need to be addressed.

@idan22moral
Copy link
Contributor Author

@gpshead While writing the tests for my PR I noticed something:
binascii.b2a_base64 encodes a given payload and adds a newline at the end (no idea what's the reason behind it).
When trying to decode ab==\n, the Excess data after padding ... error raises, which makes the relation between b2a and a2b broken.

I thought about ignoring newlines (or any other character in table_a2b_base64 that's mapped to -1) after the padding,
but the same security problem raises, since it's still excess data.

So, I'm opening this to discussion - should the b2a method be re-implemented too?

@gpshead
Copy link
Member

gpshead commented Mar 13, 2021

b2a_base64 has always added a newline by default. (digging into why would require code history spelunking probably back to the 1990s; I could guess but it doesn't really matter). The ability to disable adding the newline was added in 3.6 via https://bugs.python.org/issue25357 as the base64 module does not want the newline.

Decoder wise it is a good idea for a2b_base64 to accept what b2a_base64 produces by default so I recommend we continue to accept a single trailing \n if present. We're still much better off from a security perspective being strict and only having a mere two possible variants of input ascii data that lead to the same bytes output instead of today's infinite possibilities by ignoring all trailing data after padding = signs.

I suggest adding an additional keyword only parameter as a flag when implementing this:

# Ignores a single newline at the end, regardless of `=`.
a2b_base64(ascii, ignore_trailing_newline=True)  # Make this the default.
# and
a2b_base64(ascii, ignore_trailing_newline=False)   # No extra data is allowed ever (raises an Error).

There will be a minority of applications out there that want the extra data ignoring behavior. This allows them to retain that without implementing their function wrapper to discard it first. If I'm understanding the current behavior correctly, people who want their code to have today's existing behavior of ignoring all trailing data can use and be compatible with the library before and after this fix:

binary = a2b_base64(ascii.split('=', 1)[0])

@gpshead
Copy link
Member

gpshead commented Mar 13, 2021

Adding such a ignore_trailing_newline=True parameter to allow disabling the newline behavior is probably best done in a followup PR for this issue as the new parameter can't be backported to stable releases.

The reason I recommend that default behavior at all is to maintain compatibility with the expectations of existing code including the round trip from b2a -> a2b working.

If we ever want to change the default newline behavior in the future, it should be done on both b2a and a2b with our usual "couple of release cycles" deprecation period. I'm not convinced the mere newline behavior default warrants changing.

@gpshead
Copy link
Member

gpshead commented Mar 13, 2021

It is also plausible that other users of a2b_base64 in the stdlib may want to ignore more padding and should be updated to use the split trick or similar. For example https://tools.ietf.org/html/rfc4648#section-3.3 suggests MIME and thus the email package is likely one that prefers leniency. Code inspection to find out would be required if no tests explicitly cover these situations.

@idan22moral
Copy link
Contributor Author

@gpshead Finally had the time to come back to this subject.
I thought about it for a few more days and here are some more insights:

In the current implementation, every character that isn't one of the following is ignored:

''.join([chr(_) if (table_a2b_base64[_] == -1) else '' for _ in range(128)])
# '\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*,-.:;<>?@[\\]^_`{|}~\x7f

Which can lead to the same problem that we talked about earlier.
For example:

a2b_base64(b'ab==') # b'i'
a2b_base64(b'a:(){:|:&};:b==') # b'i'

So, if we combine your solution with this "bigger" problem, the solution is much simpler,
and it is to raise an exception when:

if (strict_mode && table_a2b_base64[ascii_data[i]] == -1) {
    binascii_state *state = PyModule_GetState(module);
    if (state) {
        PyErr_SetString(state->Error, "Only base64 data is allowed when using strict mode");
    }
    ...
}

So the parameter would be strict_mode, which by default will be False.
(Strict Mode - non-base64 data in not allowed)
What's your opinion on this matter?

@idan22moral
Copy link
Contributor Author

idan22moral commented Mar 29, 2021

Forgot to mention that the strict_mode flag controls the excess data behavior too,
so when it's set to True and there's excess data (even newlines) - an exception will raise.

The counterpart function (b2a) will have the same flag to respond to the new update, ofc.
Edit: b2a already has the newline flag, so strict_mode is not necessary there.

@idan22moral
Copy link
Contributor Author

Figured that implementing it will save some time later if the idea is approved, so I did it.

Some examples:

>>> from binascii import a2b_base64
>>> a2b_base64("Yb==\n")
b'a'
>>> a2b_base64("Yb==")
b'a'
>>> a2b_base64("Yb :(){:|:&};: ==")
b'a'
>>> a2b_base64("Yb :(){:|:&};: ==", strict_mode=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
binascii.Error: Only base64 data is allowed when using strict mode
>>> a2b_base64("Y b==", strict_mode=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
binascii.Error: Only base64 data is allowed when using strict mode
>>> a2b_base64("Yb==\n", strict_mode=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
binascii.Error: Excess data after padding is not allowed when using strict mode
>>> a2b_base64("Y\x00b==", strict_mode=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
binascii.Error: Only base64 data is allowed when using strict mode
>>> a2b_base64("Y\nb==", strict_mode=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
binascii.Error: Only base64 data is allowed when using strict mode

Documentation:

a2b_base64(data, /, *, strict_mode=False)
    Decode a line of base64 data.
    
    strict_mode
      When set to True, bytes that are not part of the base64 standard are not allowed.
      The same applies to excess data after padding (= / ==).

@gpshead
Copy link
Member

gpshead commented Mar 31, 2021

I haven't had a chance to look this over yet, but I like where you've headed with this and the addition of strict_mode.

@idan22moral
Copy link
Contributor Author

Awesome!
Let me know how we proceed with this issue when you have the time 🙂

@idan22moral idan22moral requested a review from gpshead July 10, 2021 19:09
@idan22moral
Copy link
Contributor Author

idan22moral commented Jul 14, 2021

@gpshead can you re-review my changes?

@gpshead
Copy link
Member

gpshead commented Jul 19, 2021

FWIW I believe my change of the error message from 'Malformed padding' to 'Leading padding' in this PR was half incorrect. I overlooked the label and goto that caused that message to be used in multiple circumstances. The later goto malformed_padding should probably just be turned into one raising its own unique error message.

not a huge deal, but a potentially misleading error message in the = in the middle of data circumstance.

shihai1991 added a commit to shihai1991/cpython that referenced this pull request Jul 20, 2021
* origin/main: (1146 commits)
  bpo-42064: Finalise establishing sqlite3 global state (pythonGH-27155)
  bpo-44678: Separate error message for discontinuous padding in binascii.a2b_base64 strict mode (pythonGH-27249)
  correct spelling (pythonGH-27076)
  bpo-44524: Add missed __name__ and __qualname__ to typing module objects (python#27237)
  bpo-27513: email.utils.getaddresses() now handles Header objects (python#13797)
  Clean up comma usage in Doc/library/functions.rst (python#27083)
  bpo-42238: Fix small rst issue in NEWS.d/. (python#27238)
  bpo-41972: Tweak fastsearch.h string search algorithms (pythonGH-27091)
  bpo-44340: Add support for building with clang full/thin lto (pythonGH-27231)
  bpo-44661: Update property_descr_set to use vectorcall if possible. (pythonGH-27206)
  bpo-44645: Check for interrupts on any potentially backwards edge (pythonGH-27216)
  bpo-41546: make pprint (like print) not write to stdout when it is None (pythonGH-26810)
  bpo-44554: refactor pdb targets (and internal tweaks) (pythonGH-26992)
  bpo-43086: Add handling for out-of-spec data in a2b_base64 (pythonGH-24402)
  bpo-44561: Update hyperlinks in Doc/distributing/index.rst (python#27032)
  bpo-42355: symtable.get_namespace() now checks whether there are multiple or any namespaces found (pythonGH-23278)
  bpo-44654: Do not export the union type related symbols (pythonGH-27223)
  bpo-44633: Fix parameter substitution of the union type with wrong types. (pythonGH-27218)
  bpo-44654: Refactor and clean up the union type implementation (pythonGH-27196)
  bpo-20291: Fix MSVC warnings in getargs.c (pythonGH-27211)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type-feature A feature request or enhancement type-security A security issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants