Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow Base64Decoder to ignore space chars, add IsValid methods and tests #78951

Closed
wants to merge 1 commit into from

Conversation

heathbm
Copy link
Contributor

@heathbm heathbm commented Nov 29, 2022

Implementation of api-approved issue: #76020

Chars now ignored:
9: Line feed
10: Horizontal tab
13: Carriage return
32: Space
-- Vertical tab omitted
@dotnet-issue-labeler
Copy link

Note regarding the new-api-needs-documentation label:

This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, to please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change.

@ghost ghost added the community-contribution Indicates that the PR has been added by a community member label Nov 29, 2022
@ghost
Copy link

ghost commented Nov 29, 2022

Tagging subscribers to this area: @dotnet/area-system-memory
See info in area-owners.md if you want to be subscribed.

Issue Details

Implementation of api-approved issue: #76020

Author: heathbm
Assignees: -
Labels:

area-System.Memory, new-api-needs-documentation

Milestone: -

@@ -35,6 +35,215 @@ public static partial class Base64
/// or if the input is incomplete (i.e. not a multiple of 4) and <paramref name="isFinalBlock"/> is <see langword="true"/>.
/// </returns>
public static unsafe OperationStatus DecodeFromUtf8(ReadOnlySpan<byte> utf8, Span<byte> bytes, out int bytesConsumed, out int bytesWritten, bool isFinalBlock = true)
{
// Validation must occur prior to decoding as the actual length will impact future calculations
bool containsIgnoredBytes = utf8.IndexOfAny(BytesToIgnore) != -1;
Copy link
Member

@stephentoub stephentoub Nov 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I expect this is going to regress performance, which we want to avoid. It should be possible to structure this such that we only pay additional cost in the existing implementation when it encounters an invalid character according to its current definition, which includes whitespace, at which point it can fall back to an implementation that allows for whitespace but is slower.

if (containsIgnoredBytes)
{
// Create a new span without bytes to be ignored
Span<byte> utf8WithIgnoredBytesRemoved = stackalloc byte[utf8.Length];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not safe and could easily stack overflow.

Copy link
Member

@stephentoub stephentoub left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your interest in working on this. I only took a cursory look, but I think the whole approach taken here needs to be revisited.

@ghost ghost added the needs-author-action An issue or pull request that requires more info or actions from the author. label Nov 29, 2022
if (containsIgnoredBytes)
{
// Create a new span without bytes to be ignored
Span<byte> utf8WithIgnoredBytesRemoved = stackalloc byte[utf8.Length];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What will happen if utf8.Length is huge? Stackoverflow. Stack allocation must be guarded by some reasonable threshold (e.g. 256 byte), otherwise fall back to renting an array from the ArrayPool or if considered as rare-path just allocate (ToArray).

@@ -35,6 +35,215 @@ public static partial class Base64
/// or if the input is incomplete (i.e. not a multiple of 4) and <paramref name="isFinalBlock"/> is <see langword="true"/>.
/// </returns>
public static unsafe OperationStatus DecodeFromUtf8(ReadOnlySpan<byte> utf8, Span<byte> bytes, out int bytesConsumed, out int bytesWritten, bool isFinalBlock = true)
{
// Validation must occur prior to decoding as the actual length will impact future calculations
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will most likely cause a perf-regression, as decoding changed from $O(n)$ to be $O(2n)$ with this upfront check. If there's space encountered it's even $O(3n)$ due the "copy and remove space" loop.

Maybe it's better to just start decoding w/o any upfront check or copy, and if an invalid input is encountered fall back and decide what to do. If it's truly invalid then exit. If it's space, then ignore that position and continue. Thus the $O(n)$ remains.

To implement this you can split the decoding into two methods. One "driver" method, and one "worker" method. This is only relevant for the invalid-case, as then the driver can evaluate the state and call the worker again. Look at the buffer-chain examples / tests to see how someting like this can be done.

}

// Check for invalid chars
int indexOfFirstNonBase64 = base64Text.IndexOfAnyExcept(ValidBase64CharsSortedAsc);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should use #68328 (comment)
(see linked PRs for examples).


// Check for invalid chars
int indexOfFirstNonBase64 = base64Text.IndexOfAnyExcept(ValidBase64CharsSortedAsc);
if (indexOfFirstNonBase64 > -1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (indexOfFirstNonBase64 > -1)
if (indexOfFirstNonBase64 >= 0)

Is a tiny little bit better at CPU-level as comparisons with 0 followed by a jump (branch) can be optimized by CPUs.

int paddingCount = 0;

// Check if there are chars that need to be ignored while determining the length
if (base64Text.IndexOfAny(CharsToIgnore) > -1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (base64Text.IndexOfAny(CharsToIgnore) > -1)
if (base64Text.IndexOfAny(CharsToIgnore) >= 0)

And use IndexOfAnyValues (see above).

/// <param name="base64TextUtf8">The input span which contains UTF-8 encoded text in base64 that needs to be validated.</param>
/// <param name="decodedLength">The maximum length (in bytes) if you were to decode the base 64 encoded text <paramref name="base64TextUtf8"/> within a byte span.</param>
/// <returns>true if <paramref name="base64TextUtf8"/> is decodable; otherwise, false.</returns>
public static unsafe bool IsValid(ReadOnlySpan<byte> base64TextUtf8, out int decodedLength)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comments as above apply.

@gfoidl
Copy link
Member

gfoidl commented Nov 29, 2022

@stephentoub race condition -> you won 😄

@heathbm
Copy link
Contributor Author

heathbm commented Nov 29, 2022

Thanks for your interest in working on this. I only took a cursory look, but I think the whole approach taken here needs to be revisited.

Thanks for the feedback @stephentoub and @gfoidl , I will use it to improve to make better contributions in the future.

@stephentoub
Copy link
Member

Thanks, @heathbm. I do appreciate your efforts here. Are you planning to take another look at this one, or you're going to leave it for someone else and look at other things?

@heathbm
Copy link
Contributor Author

heathbm commented Nov 30, 2022

Thanks, @heathbm. I do appreciate your efforts here. Are you planning to take another look at this one, or you're going to leave it for someone else and look at other things?

@stephentoub I'm currently working on a new PR that does retain the O(1n) performance characteristic using the existing implementation as outlined in the feedback. Since I am not assigned to the issue, if anyone submits a complete PR before I do, I will gladly step aside however.

@ghost ghost locked as resolved and limited conversation to collaborators Dec 30, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-System.Memory community-contribution Indicates that the PR has been added by a community member needs-author-action An issue or pull request that requires more info or actions from the author. new-api-needs-documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants