Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BinaryReader.Read(byte[], int, int) is a leaky abstraction #68139

Open
masonwheeler opened this issue Apr 17, 2022 · 6 comments
Open

BinaryReader.Read(byte[], int, int) is a leaky abstraction #68139

masonwheeler opened this issue Apr 17, 2022 · 6 comments
Milestone

Comments

@masonwheeler
Copy link
Contributor

masonwheeler commented Apr 17, 2022

Description

System.IO.BinaryReader provides an abstraction layer over Stream that takes several common reading tasks and makes them Just Work without having to deal with all the fiddly details of working with the limited Stream API. Among these operations are two ways to read from the stream into an array: ReadBytes will read a specified number of bytes and allocate a new array, while the Read(byte[], int, int) overload will read a specified number of bytes into a pre-existing array.

Unfortunately, the two methods have wildly different semantics. While ReadBytes works exactly as expected, always returning the specified number of bytes, (unless the underlying stream reaches its end or some error occurs in the stream,) the Read overload is basically just a pass-through to the underlying stream's similarly-named method. This means that if the underlying stream's implementation is such that Read has a limit on the number of bytes it will process at one time, BinaryReader will shove this same limitation right in your face, rather than taking care of it for you under the hood which is the entire point of BinaryReader existing in the first place.

Reproduction Steps

using System.IO;
using System.IO.Compression;

class Repro
{
    private readonly byte[] _data;

    public Repro()
    {
        using var stream = new MemoryStream();
        using var compressor = new BrotliStream(stream, CompressionMode.Compress);
        using var writer = new StreamWriter(compressor);
        for (int i = 0; i < 250_000; i++)
        {
            writer.Write(i);
        }
        compressor.Flush();
        _data = stream.ToArray();
    }

    public void Test()
    {
        var bytes = T1();
        var buffer = new byte[bytes.Length];
        T2(buffer);
    }

    private byte[] T1()
    {
        using var ms = new MemoryStream(_data);
        using var decompressor = new BrotliStream(ms, CompressionMode.Decompress);
        using var reader = new BinaryReader(decompressor);
        return reader.ReadBytes(1_000_000);
    }

    private void T2(byte[] buffer)
    {
        using var ms = new MemoryStream(_data);
        using var decompressor = new BrotliStream(ms, CompressionMode.Decompress);
        using var reader = new BinaryReader(decompressor);
        var bytesRead = reader.Read(buffer, 0, buffer.Length);
        if (bytesRead != buffer.Length)
        {
            throw new Exception($"The abstraction is leaking!  Requested {buffer.Length} bytes but only got {bytesRead} of them.");
        }
    }
}

class Program
{
    static void Main()
    {
        new Repro().Test();
    }
}

Expected behavior

The test should run successfully.

Actual behavior

The test throws an exception, because BrotliStream.Read does not return the full requested number of bytes, and BinaryReader fails to compensate for this.

Regression?

No response

Known Workarounds

At first glance, it looks like this should be able to be worked around trivially with a while loop and a running total, but in my testing I'm finding cases where this does not produce correct results. It's difficult to say exactly why, as it works most of the time, and I'm dealing with large (1 MB+) blocks of binary data. All I know is, with the data I have, ReadBytes() always works correctly but generates massive amounts of garbage, and attempting to refactor it to improve memory performance by reusing the buffer occasionally corrupts data.

    void ReadNextBlock()
    {
            var size = _reader.Read7BitEncodedInt();
            var buffer = _memStream.GetBuffer();
            if (buffer == null || buffer.Length < size) { // this branch always works correctly
                buffer = _reader.ReadBytes(size);
                _memStream.SetLength(0);
                _memStream.Write(buffer, 0, size - sizeof(int));
            } else { // this branch *usually* works correctly but sometimes fails CRC validation
                int bytesRead = 0;
                while (bytesRead < size) {
                    bytesRead += _reader.Read(buffer, bytesRead, size - bytesRead);
                }
                _memStream.SetLength(size - sizeof(int));
            }
            ValidateBlockCrc(buffer, size); // CRC checksum validation to make sure the block is good
            _memStream.Position = 0;
    }

Configuration

Completely standard install of .NET 6, Visual Studio 2022, Windows 10.

Other information

No response

@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Apr 17, 2022
@dotnet-issue-labeler
Copy link

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@ghost
Copy link

ghost commented Apr 17, 2022

Tagging subscribers to this area: @dotnet/area-system-io
See info in area-owners.md if you want to be subscribed.

Issue Details

Description

System.IO.BinaryReader provides an abstraction layer over Stream that takes several common reading tasks and makes them Just Work without having to deal with all the fiddly details of working with the limited Stream API. Among these operations are two ways to read from the stream into an array: ReadBytes will read a specified number of bytes and allocate a new array, while the Read(byte[], int, int) overload will read a specified number of bytes into a pre-existing array.

Unfortunately, the two methods have wildly different semantics. While ReadBytes works exactly as expected, always returning the specified number of bytes, (unless the underlying stream reaches its end or some error occurs in the stream,) the Read overload is basically just a pass-through to the underlying stream's similarly-named method. This means that if the underlying stream's implementation is such that Read has a limit on the number of bytes it will process at one time, BinaryReader will shove this same limitation right in your face, rather than taking care of it for you under the hood which is the entire point of BinaryReader existing in the first place.

Reproduction Steps

using System.IO;
using System.IO.Compression;

class Repro
{
    private readonly byte[] _data;

    public Repro()
    {
        using var stream = new MemoryStream();
        using var compressor = new BrotliStream(stream, CompressionMode.Compress);
        using var writer = new StreamWriter(compressor);
        for (int i = 0; i < 250_000; i++)
        {
            writer.Write(i);
        }
        compressor.Flush();
        _data = stream.ToArray();
    }

    public void Test()
    {
        var bytes = T1();
        var buffer = new byte[bytes.Length];
        T2(buffer);
    }

    private byte[] T1()
    {
        using var ms = new MemoryStream(_data);
        using var decompressor = new BrotliStream(ms, CompressionMode.Decompress);
        using var reader = new BinaryReader(decompressor);
        return reader.ReadBytes(1_000_000);
    }

    private void T2(byte[] buffer)
    {
        using var ms = new MemoryStream(_data);
        using var decompressor = new BrotliStream(ms, CompressionMode.Decompress);
        using var reader = new BinaryReader(decompressor);
        var bytesRead = reader.Read(buffer, 0, buffer.Length);
        if (bytesRead != buffer.Length)
        {
            throw new Exception($"The abstraction is leaking!  Requested {buffer.Length} bytes but only got {bytesRead} of them.");
        }
    }
}

class Program
{
    static void Main()
    {
        new Repro().Test();
    }
}

Expected behavior

The test should run successfully.

Actual behavior

The test throws an exception, because BrotliStream.Read does not return the full requested number of bytes, and BinaryReader fails to compensate for this.

Regression?

No response

Known Workarounds

At first glance, it looks like this should be able to be worked around trivially with a while loop and a running total, but in my testing I'm finding cases where this does not produce correct results. It's difficult to say exactly why, as it works most of the time, and I'm dealing with large (1 MB+) blocks of binary data. All I know is, with the data I have, ReadBytes() always works correctly but generates massive amounts of garbage, and attempting to refactor it to improve memory performance by reusing the buffer occasionally corrupts data.

    void ReadNextBlock()
    {
            var size = _reader.Read7BitEncodedInt();
            var buffer = _memStream.GetBuffer();
            if (buffer == null || buffer.Length < size) { // this branch always works correctly
                buffer = _reader.ReadBytes(size);
                _memStream.SetLength(0);
                _memStream.Write(buffer, 0, size - sizeof(int));
            } else { // this branch *usually* works correctly but sometimes fails CRC validation
                int bytesRead = 0;
                while (bytesRead < size) {
                    bytesRead += _reader.Read(buffer, bytesRead, size - bytesRead);
                }
                _memStream.SetLength(size - sizeof(int));
            }
            ValidateBlockCrc(buffer, size); // CRC checksum validation to make sure the block is good
            _memStream.Position = 0;
    }

Configuration

Completely standard install of .NET 6, Visual Studio 2022, Windows 10.

Other information

No response

Author: masonwheeler
Assignees: -
Labels:

area-System.IO, untriaged

Milestone: -

@huoyaoyuan
Copy link
Member

By documentation, the Read method Reads bytes from the underlying stream and advances the current position of the stream.. It's expected to just do the same thing of underlying stream. It's of course not useful, but behaves correctly.

Improvement of stream to fulfill the buffer is tracked in #58216.

@adamsitnik adamsitnik removed the untriaged New issue has not been triaged by the area owner label Apr 20, 2022
@adamsitnik adamsitnik added this to the 7.0.0 milestone Apr 20, 2022
@adamsitnik
Copy link
Member

@eerhardt I've allowed myself to assign you as you are going to solve this issue by solving #16598

@eerhardt
Copy link
Member

eerhardt commented May 19, 2022

Unfortunately, #16598 isn't going to solve this issue.

From looking at BinaryReader, it is a bit confusing at which APIs treat count (or buffer.Length) parameters as the actual count to read, and which are treated as a "maximum count".

Actual Count APIs:

  1. public virtual byte[] ReadBytes(int count)
  2. public virtual int Read(char[] buffer, int index, int count)
  3. public virtual int Read(Span<char> buffer)
  4. public virtual char[] ReadChars(int count)

These are either char APIs, or in the case of byte[] ReadBytes, you are requesting exactly how many bytes you want read.

Count is a "maximum" APIs:

  1. public virtual int Read(byte[] buffer, int index, int count)
  2. public virtual int Read(Span<byte> buffer)

These are just thin wrappers over Stream methods with the same signatures.

@huoyaoyuan is correct above that the documentation doesn't say the bottom APIs need to read exactly count (or buffer.Length) bytes. And in the Returns section it calls out:

The number of bytes read into buffer. This might be less than the number of bytes requested if that many bytes are not available, or it might be zero if the end of the stream is reached.

https://docs.microsoft.com/en-us/dotnet/api/system.io.binaryreader.read?view=net-6.0#system-io-binaryreader-read(system-byte()-system-int32-system-int32)
https://docs.microsoft.com/en-us/dotnet/api/system.io.binaryreader.read?view=net-6.0#system-io-binaryreader-read(system-span((system-byte)))

With #16598, you will be able to accomplish what you want above by saying reader.BaseStream.ReadExactly(buffer). And since all the above methods are virtual, you can override them in a derived BinaryReader class call BaseStream.ReadExactly yourself, so the code consuming BinaryReader doesn't need to change.

One potential here is to add ReadExactly and/or ReadAtLeast to BinaryReader as thin wrappers over BaseStream.ReadXXX, but I'm not sure it is worth it. We added the span-based Read APIs with #22851, but that was part of an overall "spanify all the System.* APIs".

@adamsitnik @jozkee @stephentoub - thoughts here? Do you think it would be worth it to add ReadExactly and/or ReadAtLeast to BinaryReader?

@adamsitnik
Copy link
Member

Do you think it would be worth it to add ReadExactly and/or ReadAtLeast to BinaryReader?

It's not a bad idea, but I would prefer to wait until the issue receives feedback from more customers.

@adamsitnik adamsitnik modified the milestones: 7.0.0, Future Jun 22, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants