This repository has been archived by the owner on Nov 14, 2024. It is now read-only.

adding gzip compression support to the schema #4220

Closed
wants to merge 12 commits into from

Conversation


@mmigdiso mmigdiso commented Sep 9, 2019

Goals (and why): Adding Gzip support to the schema. Currently AtlasDB supports the LZ4 algorithm; in some cases, gzip can be more useful, as it is more than twice as space-efficient (especially on JSON streams).

Implementation Description (bullets): The LZ4 and Gzip compressing streams now derive from the same abstract class.

Testing (What was existing testing like? What have you done to improve it?): The existing unit test class for LZ4 has been generalized to also cover Gzip compression.

Concerns (what feedback would you like?):

Where should we start reviewing?:

Priority (whenever / two weeks / yesterday): two weeks

@palantirtech
Member

Thanks for your interest in palantir/atlasdb, @mmigdiso! Before we can accept your pull request, you need to sign our contributor license agreement - just visit https://cla.palantir.com/ and follow the instructions. Once you sign, I'll automatically update this pull request.


changelog-app bot commented Sep 9, 2019

Generate changelog in changelog/@unreleased

Type

  • Feature
  • Improvement
  • Fix
  • Break
  • Deprecation
  • Manual task
  • Migration

Description

adding gzip compression support to the schema


Murat Migdisoglu added 3 commits September 10, 2019 11:31
Gzip input may throw an exception if it can not find the gzip magic chars.
Contributor

@j-baker j-baker left a comment


When we talked about this, we agreed that for this to go in AtlasDB, the decompressor needs to be able to decide whether it's reading LZ4 or GZIP and choose between them respectively - the straightforward migration is where the value add is. This PR currently does not do this, which means that the complexity is probably considerably higher than we'd want.
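For readers following the thread, the kind of format sniffing j-baker describes can be sketched roughly as below. This is illustrative only, not AtlasDB's API: the class name, the `String` return value, and the helper are hypothetical (in the PR itself, the `ClientCompressor` enum plays this role), and the magic byte sequences are the standard gzip header bytes (RFC 1952) and the `"LZ4Block"` prefix written by `LZ4BlockOutputStream`.

```java
import java.io.IOException;
import java.io.PushbackInputStream;
import java.util.Arrays;

public class CompressorSniffer {
    // GZIP streams begin with the two magic bytes 0x1f 0x8b (RFC 1952).
    static final byte[] GZIP_MAGIC = {(byte) 0x1f, (byte) 0x8b};
    // LZ4BlockOutputStream writes the ASCII prefix "LZ4Block".
    static final byte[] LZ4_MAGIC = {'L', 'Z', '4', 'B', 'l', 'o', 'c', 'k'};

    /** Peeks at the stream and reports which format it looks like, consuming nothing. */
    static String sniff(PushbackInputStream in) throws IOException {
        int maxLen = Math.max(GZIP_MAGIC.length, LZ4_MAGIC.length);
        byte[] head = new byte[maxLen];
        int n = 0;
        while (n < maxLen) { // loop: read() may return fewer bytes than requested
            int r = in.read(head, n, maxLen - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        in.unread(head, 0, n); // push bytes back so the real decompressor sees a full stream
        if (startsWith(head, n, LZ4_MAGIC)) {
            return "LZ4";
        }
        if (startsWith(head, n, GZIP_MAGIC)) {
            return "GZIP";
        }
        return "NONE";
    }

    private static boolean startsWith(byte[] head, int len, byte[] magic) {
        return len >= magic.length && Arrays.equals(Arrays.copyOf(head, magic.length), magic);
    }
}
```

The PushbackInputStream must be constructed with a buffer at least `maxLen` bytes long so the peeked bytes can all be unread.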

Contributor

@schlosna schlosna left a comment


For stream heavy workloads, we may want to consider pooling Deflater/Inflater instances, as we've seen heavy GC Finalizer pressure. This is better in OpenJDK 11+ with https://bugs.openjdk.java.net/browse/JDK-8185582, but not all AtlasDB users are running 11+ (though most that would use this feature likely are).
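A minimal sketch of the kind of pooling schlosna suggests (the class and its shape are hypothetical, not AtlasDB or OpenJDK code): a returned instance is reset() so it can be reused, and overflow instances call end() to release native zlib memory eagerly instead of relying on the finalizer.

```java
import java.util.ArrayDeque;
import java.util.zip.Inflater;

/** Hypothetical sketch of a tiny Inflater pool to reduce finalizer pressure. */
public class InflaterPool {
    private final ArrayDeque<Inflater> pool = new ArrayDeque<>();
    private final int maxPooled;

    public InflaterPool(int maxPooled) {
        this.maxPooled = maxPooled;
    }

    public synchronized Inflater acquire() {
        Inflater cached = pool.pollFirst();
        // nowrap = true, matching what GZIPInputStream uses internally
        return cached != null ? cached : new Inflater(true);
    }

    public synchronized void release(Inflater inflater) {
        if (pool.size() < maxPooled) {
            inflater.reset();  // clear state so the instance can be reused
            pool.addFirst(inflater);
        } else {
            inflater.end();    // free native zlib memory now, not via the finalizer
        }
    }
}
```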

super(new GzipStreamEnumeration(in, bufferSize));
}

protected static class GzipStreamEnumeration implements Enumeration<InputStream> {
Contributor

Implementing Enumeration for 3 elements is the Java 1.0 equivalent of implementing an Iterable for the same thing.

What you do here is create a list, turn it into an iterator, and then implement an enumeration around the iterator. Instead, just do Collections.enumeration(theListYouHad) in order to convert the list into an enumeration.
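The suggested conversion, sketched with hypothetical names: Collections.enumeration wraps the list directly, so no hand-rolled Enumeration is needed to drive a SequenceInputStream.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.util.Collections;
import java.util.List;

public class EnumerationDemo {
    /** Concatenates streams via Collections.enumeration instead of a hand-rolled Enumeration. */
    static InputStream concat(List<InputStream> parts) {
        return new SequenceInputStream(Collections.enumeration(parts));
    }

    /** Drains the stream into a String, for demonstration. */
    static String readAll(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        for (int c; (c = in.read()) != -1; ) {
            sb.append((char) c);
        }
        return sb.toString();
    }
}
```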

Author

The reason for the Enumeration implementation lies in the overridden nextElement() method: the trailer has to be generated after the content stream (the deflater) is exhausted, in order to properly calculate the CRC and the other trailer fields.
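For context, the gzip trailer (RFC 1952) is the CRC-32 of the uncompressed data followed by the uncompressed length mod 2^32, both little-endian, which is why it can only be produced once the content has been fully read. A small illustrative helper (hypothetical, not PR code):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.zip.CRC32;

public class GzipTrailer {
    /**
     * Builds the 8-byte gzip trailer: CRC-32 of the uncompressed bytes,
     * then the uncompressed length mod 2^32, both little-endian (RFC 1952).
     * It can only be computed after all content bytes have been seen,
     * which is why the trailer stream must be created lazily.
     */
    static byte[] trailerFor(byte[] uncompressed) {
        CRC32 crc = new CRC32();
        crc.update(uncompressed, 0, uncompressed.length);
        return ByteBuffer.allocate(8)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putInt((int) crc.getValue())
                .putInt(uncompressed.length)
                .array();
    }
}
```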

Contributor

You still don't need to implement enumeration for that, though. That said, using suppliers makes sense now.

Contributor

@j-baker j-baker Oct 9, 2019

So explicitly, I'm proposing you write something like:

List<Supplier<InputStream>> streams = ImmutableList.of(
    this::createHeaderStream,
    () -> countingStream,
    () -> createTrailerStream(crc, countingStream));
Enumeration<InputStream> inputStream = Collections.enumeration(Lists.transform(streams, Supplier::get)); 

Author

@mmigdiso mmigdiso Oct 9, 2019

But since GzipCompressingInputStream extends SequenceInputStream, the first call in the constructor needs to be super(). So I cannot initialize the input stream enumeration above in the GzipCompressingInputStream constructor; hence the Enumeration implementation.
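One way around the super()-must-be-first constraint, sketched with hypothetical names: compute the Enumeration in a static helper, which is evaluated before the superclass constructor runs and needs no instance state. The anonymous Enumeration over suppliers preserves the laziness required for the trailer (Guava's Lists.transform view achieves the same thing).

```java
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.util.Enumeration;
import java.util.Iterator;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical sketch: a static helper builds the Enumeration passed to super(). */
public class LazyConcatStream extends SequenceInputStream {
    public LazyConcatStream(List<Supplier<InputStream>> parts) {
        // The helper call is evaluated before SequenceInputStream's constructor runs.
        super(lazyEnumeration(parts));
    }

    private static Enumeration<InputStream> lazyEnumeration(List<Supplier<InputStream>> parts) {
        Iterator<Supplier<InputStream>> it = parts.iterator();
        return new Enumeration<InputStream>() {
            @Override
            public boolean hasMoreElements() {
                return it.hasNext();
            }

            @Override
            public InputStream nextElement() {
                return it.next().get(); // lazy: each stream is created only when needed
            }
        };
    }
}
```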

byte[] decompressedData = new byte[17 * BLOCK_SIZE];
int bytesRead = ByteStreams.read(decompressingStream, decompressedData, 0, decompressedData.length);
assertEquals(uncompressedData.length, bytesRead);
assertArrayEquals(uncompressedData, Arrays.copyOf(decompressedData, bytesRead));
Contributor

instead of arrayEquals, assertArrayEquals and the other JUnit asserts, please can we use AssertJ's assertThat methods? Leads to better perf. assertThat(decompressedData).startsWith(uncompressedData) might well be possible.

Author

Makes sense, but I just moved the existing LZ4CompressionTests into this abstract class to generalize the test cases and to cover the gzip implementation with the same unit tests that were covering LZ4. I therefore didn't touch the logic of the existing test cases, because they were my checkpoint for assuring the correctness of the new compression implementation. I'm a bit wary of modifying the unit tests during a refactor.

@mmigdiso mmigdiso requested a review from j-baker October 7, 2019 15:11
public enum ClientCompressor {
GZIP(GzipCompressingInputStream.class, GZIPInputStream.class, GzipCompressingInputStream.GZIP_HEADER),
LZ4(LZ4CompressingInputStream.class, LZ4BlockInputStream.class,
new byte[] {'L', 'Z', '4', 'B', 'l', 'o', 'c', 'k'}),
Contributor

"LZ4Block".getBytes(StandardCharsets.UTF_8)

Author

This looks nicer, but it is different from the original implementation. Unfortunately, the access modifier does not allow us to use it:
https://github.com/lz4/lz4-java/blob/d43546e24388533eebd40fccb4be5468f0411788/src/java/net/jpountz/lz4/LZ4BlockOutputStream.java#L37

Comparator.comparingInt(x -> x.magic.length)).get().magic.length;
buff.mark(maxLen);
byte[] headerBuffer = new byte[maxLen];
int len = buff.read(headerBuffer);
Contributor

This actually doesn't do the right thing: buff.read can return any number of bytes, so it doesn't have to fill the header buffer.

Contributor

@j-baker j-baker Oct 8, 2019

I would be tempted to write a method like:

private static boolean startsWith(InputStream stream, byte[] maybePrefix) {
    stream.mark(maybePrefix.length); // mark(int) requires a read-ahead limit
    try {
        for (int i = 0; i < maybePrefix.length; i++) {
            // read() returns an int in [0, 255]; mask the byte to avoid sign-extension bugs
            if (stream.read() != (maybePrefix[i] & 0xff)) {
                return false;
            }
        }
        return true;
    } catch (IOException e) {
        throw new RuntimeException(e);
    } finally {
        try {
            stream.reset(); // reset() also throws IOException, so it needs its own handling
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}

Author

But the code does not expect headerBuffer to be filled; that is why we have the len parameter, which is passed to the matchMagic method later on.
On the other hand, since we are abstracting over InputStream, read() and read(byte[]) have different performance characteristics for FileInputStream (at least in the number of system calls, even if we account for the file system cache).

Contributor

@j-baker j-baker Oct 9, 2019

I guess I'm saying you have a correctness bug right now. Also, you're only calling the single-byte read method about 8 times, so the performance difference is roughly zero.

Author

@mmigdiso mmigdiso Oct 9, 2019

I get what you mean. What do you think about using ByteStreams.read(buff, headerBuffer, 0, maxLen) and also adding a filter for magic sequences longer than the number of bytes read from the stream: compressors.stream().filter(t -> t.magic.length <= len)?
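For reference, Guava's ByteStreams.read loops until it has read the requested number of bytes or hits end-of-stream, which is what fixes the short-read bug discussed above. A plain-JDK sketch of the same contract (on Java 9+, InputStream.readNBytes(buf, off, len) behaves the same way):

```java
import java.io.IOException;
import java.io.InputStream;

public class ReadFully {
    /**
     * Reads up to len bytes, looping until the buffer is full or EOF,
     * mirroring Guava's ByteStreams.read. Returns the number of bytes read.
     */
    static int read(InputStream in, byte[] buf, int off, int len) throws IOException {
        int total = 0;
        while (total < len) {
            int r = in.read(buf, off + total, len - total);
            if (r < 0) {
                break; // EOF before the buffer was filled
            }
            total += r;
        }
        return total;
    }
}
```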

GZIP(GzipCompressingInputStream.class, GZIPInputStream.class, GzipCompressingInputStream.GZIP_HEADER),
LZ4(LZ4CompressingInputStream.class, LZ4BlockInputStream.class,
new byte[] {'L', 'Z', '4', 'B', 'l', 'o', 'c', 'k'}),
NONE(null, null, new byte[] {});
Contributor

this code is quite scary, because the empty prefix you provide for NONE actually matches all magic byte sequences

Author

Two things avoid this situation (if the comment above was about the NONE (new byte[] {}) part):
1. there is a magic.length > 0 check in the return statement of the matchMagic method;
2. values() returns the enum constants in the order they are defined, and NONE is defined last.
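Sketched concretely (hypothetical helper, mirroring the guard described above): an empty magic sequence never matches, so NONE can only ever be selected as the fallback.

```java
import java.util.Arrays;

public class MagicMatch {
    /** An empty magic never "matches", so the NONE sentinel cannot shadow real formats. */
    static boolean matchMagic(byte[] header, int len, byte[] magic) {
        return magic.length > 0
                && len >= magic.length
                && Arrays.equals(Arrays.copyOf(header, magic.length), magic);
    }
}
```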

@mmigdiso mmigdiso requested a review from j-baker October 8, 2019 23:03
Comparator.comparingInt((ClientCompressor t) -> t.magic.length).reversed()
).collect(
Collectors.toList());
int maxLen = compressors.get(0).magic.length;
Contributor

int maxLen = Arrays.stream(ClientCompressor.values()).mapToInt(c -> c.magic.length).max().orElse(0);

Contributor

can avoid all of the sorting and reversing stuff :)

Author

@mmigdiso mmigdiso Oct 9, 2019

Hey @j-baker, but I need to start from the longest prefix; otherwise a shorter magic sequence that is a prefix of another magic sequence would cause a bug. The purpose of this code was not only to find the maximum length.
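The longest-prefix-first ordering can be illustrated as follows (hypothetical names): a one-byte magic that happens to be a prefix of the two-byte gzip magic must not win the match.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class LongestFirst {
    /** Orders magic prefixes longest-first so a shorter prefix of a longer one cannot win. */
    static List<byte[]> longestFirst(List<byte[]> magics) {
        return magics.stream()
                .sorted(Comparator.comparingInt((byte[] m) -> m.length).reversed())
                .collect(Collectors.toList());
    }

    /** Returns the longest magic that prefixes the header, or null if none match. */
    static byte[] firstMatch(byte[] header, int len, List<byte[]> magics) {
        for (byte[] magic : longestFirst(magics)) {
            if (len >= magic.length
                    && Arrays.equals(Arrays.copyOf(header, magic.length), magic)) {
                return magic;
            }
        }
        return null;
    }
}
```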

@mmigdiso mmigdiso requested a review from j-baker October 10, 2019 09:05
@tpetracca
Contributor

This was merged via #4311

@tpetracca tpetracca closed this Oct 19, 2019