This repository has been archived by the owner on Nov 14, 2024. It is now read-only.

adding gzip compression support to the schema #4220

Closed
wants to merge 12 commits into from

Conversation


@mmigdiso mmigdiso commented Sep 9, 2019

Goals (and why): Adding Gzip support to the schema. Currently AtlasDB supports the LZ4 algorithm; in some cases, gzip can be more useful, as it is more than twice as space-efficient (especially on JSON streams).

Implementation Description (bullets): The LZ4 and Gzip compressing streams now derive from the same abstract class.

Testing (What was existing testing like? What have you done to improve it?): The existing unit test class for LZ4 has been generalized to also cover Gzip compression.

Concerns (what feedback would you like?):

Where should we start reviewing?:

Priority (whenever / two weeks / yesterday): two weeks

@palantirtech
Member

Thanks for your interest in palantir/atlasdb, @mmigdiso! Before we can accept your pull request, you need to sign our contributor license agreement - just visit https://cla.palantir.com/ and follow the instructions. Once you sign, I'll automatically update this pull request.


changelog-app bot commented Sep 9, 2019

Generate changelog in changelog/@unreleased

Type

  • Feature
  • Improvement
  • Fix
  • Break
  • Deprecation
  • Manual task
  • Migration

Description

adding gzip compression support to the schema


Murat Migdisoglu added 3 commits September 10, 2019 11:31
Gzip input may throw an exception if it can not find the gzip magic chars.
Contributor

@j-baker j-baker left a comment


When we talked about this, we agreed that for this to go in AtlasDB, the decompressor needs to be able to decide whether it's reading LZ4 or GZIP and choose between them respectively - the straightforward migration is where the value add is. This PR currently does not do this, which means that the complexity is probably considerably higher than we'd want.
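For readers following the thread, the kind of format sniffing j-baker describes can be sketched roughly as below. This is illustrative only, not AtlasDB's API: the class name, the `String` return value, and the helper are hypothetical (in the PR itself, the `ClientCompressor` enum plays this role), and the magic byte sequences are the standard gzip header bytes (RFC 1952) and the `"LZ4Block"` prefix written by `LZ4BlockOutputStream`.

```java
import java.io.IOException;
import java.io.PushbackInputStream;
import java.util.Arrays;

public class CompressorSniffer {
    // GZIP streams begin with the two magic bytes 0x1f 0x8b (RFC 1952).
    static final byte[] GZIP_MAGIC = {(byte) 0x1f, (byte) 0x8b};
    // LZ4BlockOutputStream writes the ASCII prefix "LZ4Block".
    static final byte[] LZ4_MAGIC = {'L', 'Z', '4', 'B', 'l', 'o', 'c', 'k'};

    /** Peeks at the stream and reports which format it looks like, consuming nothing. */
    static String sniff(PushbackInputStream in) throws IOException {
        int maxLen = Math.max(GZIP_MAGIC.length, LZ4_MAGIC.length);
        byte[] head = new byte[maxLen];
        int n = 0;
        while (n < maxLen) { // loop: read() may return fewer bytes than requested
            int r = in.read(head, n, maxLen - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        in.unread(head, 0, n); // push bytes back so the real decompressor sees a full stream
        if (startsWith(head, n, LZ4_MAGIC)) {
            return "LZ4";
        }
        if (startsWith(head, n, GZIP_MAGIC)) {
            return "GZIP";
        }
        return "NONE";
    }

    private static boolean startsWith(byte[] head, int len, byte[] magic) {
        return len >= magic.length && Arrays.equals(Arrays.copyOf(head, magic.length), magic);
    }
}
```

The PushbackInputStream must be constructed with a buffer at least `maxLen` bytes long so the peeked bytes can all be unread.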

Contributor

@schlosna schlosna left a comment


For stream heavy workloads, we may want to consider pooling Deflater/Inflater instances, as we've seen heavy GC Finalizer pressure. This is better in OpenJDK 11+ with https://bugs.openjdk.java.net/browse/JDK-8185582, but not all AtlasDB users are running 11+ (though most that would use this feature likely are).
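A minimal sketch of the kind of pooling schlosna suggests (the class and its shape are hypothetical, not AtlasDB or OpenJDK code): a returned instance is reset() so it can be reused, and overflow instances call end() to release native zlib memory eagerly instead of relying on the finalizer.

```java
import java.util.ArrayDeque;
import java.util.zip.Inflater;

/** Hypothetical sketch of a tiny Inflater pool to reduce finalizer pressure. */
public class InflaterPool {
    private final ArrayDeque<Inflater> pool = new ArrayDeque<>();
    private final int maxPooled;

    public InflaterPool(int maxPooled) {
        this.maxPooled = maxPooled;
    }

    public synchronized Inflater acquire() {
        Inflater cached = pool.pollFirst();
        // nowrap = true, matching what GZIPInputStream uses internally
        return cached != null ? cached : new Inflater(true);
    }

    public synchronized void release(Inflater inflater) {
        if (pool.size() < maxPooled) {
            inflater.reset();  // clear state so the instance can be reused
            pool.addFirst(inflater);
        } else {
            inflater.end();    // free native zlib memory now, not via the finalizer
        }
    }
}
```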

super(new GzipStreamEnumeration(in, bufferSize));
}

protected static class GzipStreamEnumeration implements Enumeration<InputStream> {
Contributor

Implementing Enumeration for 3 elements is the Java 1.0 equivalent of implementing an Iterable for the same thing.

What you do here is create a list, turn it into an iterator, and then implement an enumeration around the iterator. Instead, just do Collections.enumeration(theListYouHad) in order to convert the list into an enumeration.
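The suggested conversion, sketched with hypothetical names: Collections.enumeration wraps the list directly, so no hand-rolled Enumeration is needed to drive a SequenceInputStream.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.util.Collections;
import java.util.List;

public class EnumerationDemo {
    /** Concatenates streams via Collections.enumeration instead of a hand-rolled Enumeration. */
    static InputStream concat(List<InputStream> parts) {
        return new SequenceInputStream(Collections.enumeration(parts));
    }

    /** Drains the stream into a String, for demonstration. */
    static String readAll(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        for (int c; (c = in.read()) != -1; ) {
            sb.append((char) c);
        }
        return sb.toString();
    }
}
```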

Author

The reason for the Enumeration implementation lies in the overridden nextElement() method: the trailer has to be generated after the content stream (the deflater) is exhausted, in order to properly calculate the CRC and the other trailer fields.
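For context, the gzip trailer (RFC 1952) is the CRC-32 of the uncompressed data followed by the uncompressed length mod 2^32, both little-endian, which is why it can only be produced once the content has been fully read. A small illustrative helper (hypothetical, not PR code):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.zip.CRC32;

public class GzipTrailer {
    /**
     * Builds the 8-byte gzip trailer: CRC-32 of the uncompressed bytes,
     * then the uncompressed length mod 2^32, both little-endian (RFC 1952).
     * It can only be computed after all content bytes have been seen,
     * which is why the trailer stream must be created lazily.
     */
    static byte[] trailerFor(byte[] uncompressed) {
        CRC32 crc = new CRC32();
        crc.update(uncompressed, 0, uncompressed.length);
        return ByteBuffer.allocate(8)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putInt((int) crc.getValue())
                .putInt(uncompressed.length)
                .array();
    }
}
```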

Contributor

You still don't need to implement enumeration for that, though. That said, using suppliers makes sense now.

Contributor

@j-baker j-baker Oct 9, 2019

So explicitly, I'm proposing you write something like:

List<Supplier<InputStream>> streams = ImmutableList.of(
    this::createHeaderStream,
    () -> countingStream,
    () -> createTrailerStream(crc, countingStream));
Enumeration<InputStream> inputStream = Collections.enumeration(Lists.transform(streams, Supplier::get)); 

Author

@mmigdiso mmigdiso Oct 9, 2019

But since GzipCompressingInputStream extends SequenceInputStream, the first call in the constructor needs to be super(). So I cannot initialize the input stream enumeration above in the GzipCompressingInputStream constructor; hence the Enumeration implementation.
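One way around the super()-must-be-first constraint, sketched with hypothetical names: compute the Enumeration in a static helper, which is evaluated before the superclass constructor runs and needs no instance state. The anonymous Enumeration over suppliers preserves the laziness required for the trailer (Guava's Lists.transform view achieves the same thing).

```java
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.util.Enumeration;
import java.util.Iterator;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical sketch: a static helper builds the Enumeration passed to super(). */
public class LazyConcatStream extends SequenceInputStream {
    public LazyConcatStream(List<Supplier<InputStream>> parts) {
        // The helper call is evaluated before SequenceInputStream's constructor runs.
        super(lazyEnumeration(parts));
    }

    private static Enumeration<InputStream> lazyEnumeration(List<Supplier<InputStream>> parts) {
        Iterator<Supplier<InputStream>> it = parts.iterator();
        return new Enumeration<InputStream>() {
            @Override
            public boolean hasMoreElements() {
                return it.hasNext();
            }

            @Override
            public InputStream nextElement() {
                return it.next().get(); // lazy: each stream is created only when needed
            }
        };
    }
}
```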

byte[] decompressedData = new byte[17 * BLOCK_SIZE];
int bytesRead = ByteStreams.read(decompressingStream, decompressedData, 0, decompressedData.length);
assertEquals(uncompressedData.length, bytesRead);
assertArrayEquals(uncompressedData, Arrays.copyOf(decompressedData, bytesRead));
Contributor

instead of arrayEquals, assertArrayEquals and the other JUnit asserts, please can we use AssertJ's assertThat methods? Leads to better perf. assertThat(decompressedData).startsWith(uncompressedData) might well be possible.

Author

Makes sense, but I just moved the existing LZ4CompressionTests into this abstract class to generalize the test cases and to cover the gzip implementation with the same unit tests that were covering LZ4. I therefore didn't touch the logic of the existing test cases, because they were my checkpoint for assuring the correctness of the new compression implementation. I'm a bit wary of modifying the unit tests during a refactor.

@mmigdiso mmigdiso requested a review from j-baker October 7, 2019 15:11
public enum ClientCompressor {
GZIP(GzipCompressingInputStream.class, GZIPInputStream.class, GzipCompressingInputStream.GZIP_HEADER),
LZ4(LZ4CompressingInputStream.class, LZ4BlockInputStream.class,
new byte[] {'L', 'Z', '4', 'B', 'l', 'o', 'c', 'k'}),
Contributor

"LZ4Block".getBytes(StandardCharsets.UTF_8)

Author

This looks nicer, but it is different from the original implementation. Unfortunately, the access modifier does not allow us to use it:
https://github.com/lz4/lz4-java/blob/d43546e24388533eebd40fccb4be5468f0411788/src/java/net/jpountz/lz4/LZ4BlockOutputStream.java#L37

Comparator.comparingInt(x -> x.magic.length)).get().magic.length;
buff.mark(maxLen);
byte[] headerBuffer = new byte[maxLen];
int len = buff.read(headerBuffer);
Contributor

This actually doesn't do the right thing: buff.read can return any number of bytes, so it doesn't have to fill the header buffer.

Contributor

@j-baker j-baker Oct 8, 2019

I would be tempted to write a method like:

private static boolean startsWith(InputStream stream, byte[] maybePrefix) {
    stream.mark(maybePrefix.length); // mark(int) requires a read-ahead limit
    try {
        for (int i = 0; i < maybePrefix.length; i++) {
            // read() returns an int in [0, 255]; mask the byte to avoid sign-extension bugs
            if (stream.read() != (maybePrefix[i] & 0xff)) {
                return false;
            }
        }
        return true;
    } catch (IOException e) {
        throw new RuntimeException(e);
    } finally {
        try {
            stream.reset(); // reset() also throws IOException, so it needs its own handling
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}

Author

But the code does not expect headerBuffer to be filled; that is why we have the len parameter, which is passed to the matchMagic method later on.
On the other hand, since we are abstracting over InputStream, read() and read(byte[]) have different performance characteristics for FileInputStream (at least in the number of system calls, even if we account for the file system cache).

Contributor

@j-baker j-baker Oct 9, 2019

I guess I'm saying you have a correctness bug right now. Also, you're only calling the single-byte read method about 8 times, so the performance difference is roughly zero.

Author

@mmigdiso mmigdiso Oct 9, 2019

I get what you mean. What do you think about using ByteStreams.read(buff, headerBuffer, 0, maxLen) and also adding a filter for magic sequences longer than the number of bytes read from the stream: compressors.stream().filter(t -> t.magic.length <= len)?
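For reference, Guava's ByteStreams.read loops until it has read the requested number of bytes or hits end-of-stream, which is what fixes the short-read bug discussed above. A plain-JDK sketch of the same contract (on Java 9+, InputStream.readNBytes(buf, off, len) behaves the same way):

```java
import java.io.IOException;
import java.io.InputStream;

public class ReadFully {
    /**
     * Reads up to len bytes, looping until the buffer is full or EOF,
     * mirroring Guava's ByteStreams.read. Returns the number of bytes read.
     */
    static int read(InputStream in, byte[] buf, int off, int len) throws IOException {
        int total = 0;
        while (total < len) {
            int r = in.read(buf, off + total, len - total);
            if (r < 0) {
                break; // EOF before the buffer was filled
            }
            total += r;
        }
        return total;
    }
}
```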

GZIP(GzipCompressingInputStream.class, GZIPInputStream.class, GzipCompressingInputStream.GZIP_HEADER),
LZ4(LZ4CompressingInputStream.class, LZ4BlockInputStream.class,
new byte[] {'L', 'Z', '4', 'B', 'l', 'o', 'c', 'k'}),
NONE(null, null, new byte[] {});
Contributor

this code is quite scary, because the empty prefix you provide for NONE actually matches all magic byte sequences

Author

Two things avoid this situation (if the comment above was about the NONE (new byte[] {}) part):
1. there is a magic.length > 0 check in the return statement of the matchMagic method;
2. values() returns the enum constants in the order they are defined, and NONE is defined last.
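Sketched concretely (hypothetical helper, mirroring the guard described above): an empty magic sequence never matches, so NONE can only ever be selected as the fallback.

```java
import java.util.Arrays;

public class MagicMatch {
    /** An empty magic never "matches", so the NONE sentinel cannot shadow real formats. */
    static boolean matchMagic(byte[] header, int len, byte[] magic) {
        return magic.length > 0
                && len >= magic.length
                && Arrays.equals(Arrays.copyOf(header, magic.length), magic);
    }
}
```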

@mmigdiso mmigdiso requested a review from j-baker October 8, 2019 23:03
Comparator.comparingInt((ClientCompressor t) -> t.magic.length).reversed()
).collect(
Collectors.toList());
int maxLen = compressors.get(0).magic.length;
Contributor

int maxLen = Arrays.stream(ClientCompressor.values()).mapToInt(c -> c.magic.length).max().orElse(0);

Contributor

can avoid all of the sorting and reversing stuff :)

Author

@mmigdiso mmigdiso Oct 9, 2019

Hey @j-baker, but I need to start from the longest prefix; otherwise a shorter magic sequence that is a prefix of another magic sequence would cause a bug. The purpose of this code was not only to find the maximum length.
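The longest-prefix-first ordering can be illustrated as follows (hypothetical names): a one-byte magic that happens to be a prefix of the two-byte gzip magic must not win the match.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class LongestFirst {
    /** Orders magic prefixes longest-first so a shorter prefix of a longer one cannot win. */
    static List<byte[]> longestFirst(List<byte[]> magics) {
        return magics.stream()
                .sorted(Comparator.comparingInt((byte[] m) -> m.length).reversed())
                .collect(Collectors.toList());
    }

    /** Returns the longest magic that prefixes the header, or null if none match. */
    static byte[] firstMatch(byte[] header, int len, List<byte[]> magics) {
        for (byte[] magic : longestFirst(magics)) {
            if (len >= magic.length
                    && Arrays.equals(Arrays.copyOf(header, magic.length), magic)) {
                return magic;
            }
        }
        return null;
    }
}
```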

@mmigdiso mmigdiso requested a review from j-baker October 10, 2019 09:05
@tpetracca
Contributor

This was merged via #4311

@tpetracca tpetracca closed this Oct 19, 2019