Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[API Proposal]: Add ZlibEncoder and ZlibDecoder to System.IO.Compression. #62113

Open
AraHaan opened this issue Nov 28, 2021 · 18 comments
Open
Labels
api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.IO.Compression
Milestone

Comments

@AraHaan
Copy link
Member

AraHaan commented Nov 28, 2021

Background and motivation

Currently there is no non-stream based apis to zlib compression. As such I feel like an encoder / decoder implementation is needed similar to the Brotli implementation.

The brotli implementation also uses the encoder / decoder internally in the streams, which can help make the implementations of the zlib based streams (GZipStream, DeflateStream, and ZLibStream) better.

A single ZlibEncoder and ZlibDecoder that takes a class of values (ZlibOptions), where ZlibOptions also has subclasses named DeflateOptions, and GZipOptions where only the window bits are different.

This issue partially addresses:

Implementing this issue should also resolve this one as well:

I currently have a baseline implementation locally of this (except for the stream changes that would need to be done), and it should be ready by the time .NET 7 goes into an api freeze (until .NET 8's development starts).

API Proposal

namespace System.Buffers
{
    public enum OperationStatus
    {
        Error = 4,
    }
}
namespace System.IO.Compression
{
    public sealed class ZlibResult : System.IDisposable
    {
        public ZlibResult() { }
        public bool IsDisposed { get { throw null; } }
        public int LastBytesWritten { get { throw null; } }
        public int LastBytesRead { get { throw null; } }
        public int TotalBytesWritten { get { throw null; } }
        public int TotalBytesRead { get { throw null; } }
        public System.Buffers.OperationStatus Status { get { throw null; } }
        public void Dispose() { throw null; }
    }
    public class ZlibEncoder
    {
        public ZlibEncoder() { }
        public ZlibOptions? Options { get { throw null; } set { throw null; } }
        public bool TryCompress(ref ZlibResult zlibResult, ReadOnlySpan<byte> source, Span<byte> dest) { throw null; }
        public void Compress(ref ZlibResult zlibResult, ReadOnlySpan<byte> source, Span<byte> dest) { throw null; }
#pragma warning disable CS3002 // Return type is not CLS-compliant
        public uint CalculateChecksum(ReadOnlySpan<byte> source) { throw null; }
#pragma warning restore CS3002 // Return type is not CLS-compliant
    }
    public class ZlibDecoder
    {
        public ZlibDecoder() { }
        public ZlibOptions? Options { get { throw null; } set { throw null; } }
        public bool TryDecompress(ref ZlibResult zlibResult, ReadOnlySpan<byte> source, Span<byte> dest) { throw null; }
        public void Decompress(ref ZlibResult zlibResult, ReadOnlySpan<byte> source, Span<byte> dest) { throw null; }
#pragma warning disable CS3002 // Return type is not CLS-compliant
        public uint CalculateChecksum(ReadOnlySpan<byte> source) { throw null; }
#pragma warning restore CS3002 // Return type is not CLS-compliant
    }
}

API Usage

// allocate source and dest arrays.
var zlibEncoder = new ZlibEncoder()
{
    Options = new ZlibOptions(ZlibCompressionLevel.Level5),
};
uint adler = zlibEncoder.CalculateChecksum(source);
using var zlibResult = new ZlibResult();
bool result = zlibEncoder.TryCompress(ref zlibResult, source, dest);
if (!result)
{
    Console.WriteLine("Compression failed.");
}
else
{
    Console.WriteLine($"Adler-32: {adler}, Bytes Compressed: {zlibResult.LastBytesWritten}.");
}

// allocate new source array to compare against the original one.
var zlibDecoder = new ZlibDecoder()
{
    Options = new ZlibOptions(),
};
using var zlibResult2 = new ZlibResult();
result = zlibDecoder.TryDecompress(ref zlibResult2, dest, decSource);
uint adler2 = zlibDecoder.CalculateChecksum(decSource);
if (!result)
{
    Console.WriteLine("Decompression failed.");
}
else
{
    Console.WriteLine($"Adler-32: {adler2}, Bytes Decompressed: {zlibResult2.LastBytesWritten}.");
}

if (adler.Equals(adler2))
{
    Console.WriteLine("Decompressed contents verified to be correct.");
}

Alternative Designs

I wanted to use the System.Buffers.OperationStatus enum, but I felt it lacked enough members to denote the zlib specific status codes (where they would be used for zlib, deflate, and gzip compression / decompression). Likewise adler32's and the total_out member from zlib are represented as unsigned, I do not know if I should have the encoder / decoder output signed, or leave them unsigned and keep the CLS-Compliancy issues suppressed from it.

Risks

Minimal

Changelog

  • Replaced ZlibOperationStatus with System.Buffers.OperationStatus.
  • Removed adler32 unsigned output parameter from (Try)Compress and (Try)Decompress and moved it to a separate CalculateChecksum method.
  • Removed bytesWritten unsigned output parameter from (Try)Compress and (Try)Decompress and moved it to a property that stores the last amount of bytes written by (Try)Compress and (Try)Decompress.
  • added public Options property that can get and change the options entirely in the encoder/decoder instance.
  • changed from structs to classes due to an issue with compiling.
  • Change LastBytesWritten in Encoder and Decoder from uint to int.
  • Removed all properties in Encoder and Decoder except for the options property.
  • Added ZlibResult type that can be used with both Encoder and Decoder.
  • Remove Disposable interface on both Encoder and Decoder and make the ZlibResult type Disposable.
  • Modify TryDecompress/TryCompress to accept in an reference to the ZlibResult type that it will use for compressing or decompressing.
@AraHaan AraHaan added the api-suggestion Early API idea and discussion, it is NOT ready for implementation label Nov 28, 2021
@dotnet-issue-labeler dotnet-issue-labeler bot added area-System.IO.Compression untriaged New issue has not been triaged by the area owner labels Nov 28, 2021
@ghost
Copy link

ghost commented Nov 28, 2021

Tagging subscribers to this area: @dotnet/area-system-io-compression
See info in area-owners.md if you want to be subscribed.

Issue Details

Background and motivation

Currently there is no non-stream based apis to zlib compression. As such I feel like an encoder / decoder implementation is needed similar to the Brotli implementation.

The brotli implementation also uses the encoder / decoder internally in the streams, which can help make the implementations of the zlib based streams (GZipStream, DeflateStream, and ZLibStream) better.

A single ZlibEncoder and ZlibDecoder that takes a class of values (ZlibOptions), where ZlibOptions also has subclasses named DeflateOptions, and GZipOptions where only the window bits are different.

Dependency issues that will need to be addressed before this:
#42820

Implementing this issue should also resolve this one as well:
#39327

I currently have a baseline implementation locally of this (except for the stream changes that would need to be done), and it should be ready by the time .NET 7 goes into an api freeze (until .NET 8's development starts).

API Proposal

    public enum ZlibOperationStatus
    {
        VersionError = -6,
        BufError,
        MemError,
        DataError,
        StreamError,
        ErrorNo,
        Ok,
        Done = Ok,
        StreamEnd,
        NeedDictionary,
        DestinationTooSmall,
        OperationNotCompression,
        OperationNotDecompression,
        Disposed,
    }
    public struct ZlibEncoder : System.IDisposable
    {
        public ZlibEncoder(ZlibOptions options) { }
        public bool IsDisposed { get { throw null; } }
        public void Dispose() { throw null; }
#pragma warning disable CS3001 // Argument type is not CLS-compliant
        public bool TryCompress(System.ReadOnlySpan<byte> source, System.Span<byte> dest, out uint adler32, out uint bytesWritten) { throw null; }
#pragma warning restore CS3001 // Argument type is not CLS-compliant
#pragma warning disable CS3001 // Argument type is not CLS-compliant
        public ZlibOperationStatus Compress(System.ReadOnlySpan<byte> source, System.Span<byte> dest, out uint adler32, out uint bytesWritten) { throw null; }
#pragma warning restore CS3001 // Argument type is not CLS-compliant
    }
    public struct ZlibDecoder : System.IDisposable
    {
        public ZlibDecoder(ZlibOptions options) { }
        public bool IsDisposed { get { throw null; } }
        public void Dispose() { throw null; }
#pragma warning disable CS3001 // Argument type is not CLS-compliant
        public bool TryDecompress(ReadOnlySpan<byte> source, Span<byte> dest, out uint adler32, out uint bytesWritten, out uint avail) { throw null; }
#pragma warning restore CS3001 // Argument type is not CLS-compliant
#pragma warning disable CS3001 // Argument type is not CLS-compliant
        public ZlibOperationStatus Decompress(ReadOnlySpan<byte> source, Span<byte> dest, out uint adler32, out uint bytesWritten, out uint avail) { throw null; }
#pragma warning restore CS3001 // Argument type is not CLS-compliant
    }

API Usage

// allocate source and dest arrays.
using var zlibEncoder = new ZlibEncoder(new ZlibOptions(ZlibCompressionLevel.Level5));
bool result = zlibEncoder.TryCompress(source, dest, out uint adler, out uint bytesWritten);
if (!result)
{
    Console.WriteLine("Compression failed.");
}
else
{
    Console.WriteLine($"Adler-32: {adler}, Bytes Compressed: {bytesWritten}.");
}

// allocate new source array to compare against the original one.
using var zlibDecoder = new ZlibDecoder(new ZlibOptions());
result = zlibDecoder.TryDecompress(dest, decSource, out adler, out bytesWritten, out uint avail);
if (!result)
{
    Console.WriteLine("Decompression failed.");
}
else
{
    Console.WriteLine($"Adler-32: {adler}, Bytes Decompressed: {bytesWritten}, Available: {avail}.");
}

Alternative Designs

I wanted to use the System.Buffers.OperationStatus enum, but I felt it lacked enough members to denote the zlib specific status codes (where they would be used for zlib, deflate, and gzip compression / decompression). Likewise adler32's and the total_out member from zlib are represented as unsigned, I do not know if I should have the encoder / decoder output signed, or leave them unsigned and keepthe CLS-Compliancy issues suppressed from it.

Risks

Minimal

Author: AraHaan
Assignees: -
Labels:

api-suggestion, area-System.IO.Compression, untriaged

Milestone: -

@AraHaan
Copy link
Member Author

AraHaan commented Nov 29, 2021

I have modified the proposal somewhat to add a few things I missed to ensure that TryDecompress will never throw or get exceptions that it would need to catch.

@carlossanlop
Copy link
Member

@AraHaan thank you so much for the detailed proposal.

This proposal would partially address both #42820 and #39327 . I don't think this PR depends on the first one, we can consider that one the "uber issue", because it was attempting to address the same you're addressing, but for all compression stream classes.

I see you added your own ZLib-specific operation status enums. Contrary to BrotliDecoder/BrotliEncoder, which reused System.Buffers.OperationStatus:

Some feedback on the new enum values:

  • Ok and Done seem redundant. We should simply keep Done, just like OperationStatus.Done.
  • DestinationTooSmall seems fine. It's aligned to OperationStatus.DestinationTooSmall already.
  • I'm not sure we need OperationNotCompression, OperationNotDecompression, Disposed, InitError, ErrorNo. Shouldn't those throw instead?
  • What does VersionError mean?
  • DataError, StreamEnd, StreamError, BufError, MemError, NeedDictionary, could be aligned to OperationStatus.InvalidData.
  • I think DataError could be OperationStatus.NeedMoreData.

What's preventing us from just reusing OperationStatus?

@AraHaan
Copy link
Member Author

AraHaan commented Nov 29, 2021

I'm not sure we need OperationNotCompression, OperationNotDecompression, Disposed, InitError, ErrorNo. Shouldn't those throw instead?

We would need these for TryDecompress / TryCompress where it would be expansive to throw and then "catch" the exception within the Try version.
VersionError is an zlib error code for when the expected zlib version and the runtime zlib version does not match (on the major version I think).

I guess OperationStatus could add an generic Error onto it at the end of it for these above (see edited proposal above).

According to the Manual DataError is for when "inflate detects an error in the zlib compressed data format, which means that either the data is not a zlib stream to begin with, or that the data was corrupted somewhere along the way since it was compressed."

Also in that manual StreamEnd is returned when inflate/deflate is at the end of the zlib/deflate/gzip stream.

https://zlib.net/zlib_how.html is where I get some of the details as well too.

As for BufError, that might fall under NeedMoreData.

Edit: I have also removed ZlibOperationStatus with a minor change to System.Buffers.OperationStatus.

cc: @carlossanlop I think this might just now be ready for review (hopefully).

@AraHaan
Copy link
Member Author

AraHaan commented Feb 8, 2022

cc: @dotnet/area-system-io-compression I think this is ready for review.

@AraHaan
Copy link
Member Author

AraHaan commented Feb 17, 2022

Anything else stalling this?

@rhuijben
Copy link

I would like to see some way how many bytes of the original source were read to generate the output, to allow the next decompress to continue right there. Currently the implementation only shows the output buffer results, while the input buffer is just as useful.

This is especially so in case some other information (e.g. not compressed) is written right after the zlib data, such as within Git package files.
The current implementation is nice if you have some outer frame handling that handles/hides overreading input for you, but the zlib streaming was designed to do the framing for you.

@AraHaan
Copy link
Member Author

AraHaan commented Apr 11, 2022

I think that can be accessed with the LastBytesWritten property (I think it gets changed on every pass).

However I do agree with another thing as well, perhaps there should be an TotalBytesWritten as well for the total amount written in the instance of the Encoder/Decoder.

And possibly an Reset function that can act as an way to reset those 2 get only properties for those who want to be able to reuse it to compress/decompress multiple different sets of input data.

@rhuijben
Copy link

rhuijben commented May 9, 2022

@AraHaan I also need the LastBytesRead for this case. (The NextInIndex/ next_in_index as it is called in most zlib wrappers), to allow continuing right after the input that was processed. Not just NextOut*.

@wegylexy
Copy link
Contributor

Using streams, we need to scavenge the input buffer for raw processing as GIT packs delimit compressed data with uncompressed headers without specifying the compressed size. See #61405 for a working hack.
Decompress() in the proposed API would need to return the number of bytes read from the source and that written to the destination, along with the operation status.

public sealed struct OperationResult
{
    public readonly OperationStatus Status;
    public readonly int Read, Written;
}

@AraHaan
Copy link
Member Author

AraHaan commented May 23, 2022

Also we have to consider if Read/Written should be int, uint, long or, ulong for large data sets (since large files can also be compressed that would normally overflow an Int32 (some I think can even overflow an int64).

However the native implementation I believe uses uint's for the sizes (size_t or something along the lines of that).

So it is something to consider to support files over 2 gigabytes in length as well, however such support is tricky to get correct.

Also the implementation will also require that adler32 be exposed (currently only crc32 is exposed) in System.IO.Compression.Native to work (see AraHaan@bebf43f for all the code changes that would need to be made in System.IO.Compression.Native and System.IO.Compression).

I would love to move ZlibEncoder and ZlibDecoder to System.IO.Compression.ZlibEncoder but it would need InternalsVisibleTo on it to see types in System.IO.Compression. A con to this would be that it would be another library to maintain in dotnet/runtime.

@rhuijben
Copy link

Both Stream and Span use Int as transfer format. ZLib uses an 32 bit value, which overflows in some well defined cases where third party code depends on.

Crc32 is now also exposed via System.IO.Hashing, so maybe Adler32 belongs there too?
(I see some minor sorting issues in the .def files in your current development version. Not sure if sorting adler before crc would be necessary, but it breaks the alphabetic ordering in that region)

@AraHaan
Copy link
Member Author

AraHaan commented May 24, 2022

crc32 is exported in System.IO.Compression.Native as well, but with my changes I also added adler32 which is also a part of the native zlib impl as well.

@AraHaan
Copy link
Member Author

AraHaan commented May 31, 2022

I guess an option could be to replace the proposed LastBytesRead/LastBytesWritten properties on the encoder / decoder with an ZlibResult struct that is as follows:

public sealed struct ZlibResult
{
    public readonly OperationStatus Status { get; } // internal setter.
    public readonly int Read { get; } // internal setter.
    public readonly int Written { get; } // internal setter.
}

And then the operation function could return an instance of the struct that contains the information they need.

@wegylexy
Copy link
Contributor

That's what I said 8 days ago.

@jozkee jozkee added this to the 8.0.0 milestone Jun 22, 2022
@ghost ghost removed the untriaged New issue has not been triaged by the area owner label Jun 22, 2022
@AraHaan
Copy link
Member Author

AraHaan commented Jul 21, 2022

I am going to work on this a bit more and will consider moving the properties to an ZlibResult type that is Disposable instead so that way the Encoder and decoder types could be non-disposables..

@AraHaan
Copy link
Member Author

AraHaan commented Jul 22, 2022

Alright just got done with modifying the API locally.

AraHaan added a commit to AraHaan/runtime that referenced this issue Aug 29, 2022
New public types:
- System.IO.Compression.ZlibResult
- System.IO.Compression.ZlibEncoder
- System.IO.Compression.ZlibDecoder

Implements: dotnet#62113.
@AraHaan
Copy link
Member Author

AraHaan commented Jul 15, 2024

@carlossanlop

* What does `VersionError` mean?

if the zlib library version is incompatible with the version assumed by the caller can also happen when This can also occur if the size of the z_stream structure made in the C program using the description in zlib.h does not match the size of the compiled z_stream structure libz.so.1. from here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.IO.Compression
Projects
None yet
Development

No branches or pull requests

5 participants