
[RFC] Splittable Format and API #654

Merged: 14 commits merged into facebook:dev on May 8, 2017

Conversation

sean-purcell
Contributor

This is an initial implementation of a zstd splittable format and API, as per #395. Comments and feedback are appreciated.

@Cyan4973
Contributor

For clarity of the repository, we want to move the seekable format and demo into contrib;
it's not validated as part of Zstandard yet (that will come at a later stage).


__`Skippable_Magic_Number`__

Value : 0x184D2A5?, which means any value from 0x184D2A50 to 0x184D2A5F.
Cyan4973 (Contributor), Apr 11, 2017

All 16 values are fine for "skippable frames" in general,
but for this specification in particular,
should we pin down a specific number instead?
Maybe one that is less likely to be used, such as *E?
cc @terrelln for reference (he did something equivalent for pzstd).

Contributor Author

*E sounds reasonable
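
For illustration, the pinned value could then be defined along these lines (the macro name is hypothetical; only the 0x184D2A5E value follows from the discussion above):

/* seek table is stored in a skippable frame tagged with the *E variant */
#define ZSTD_SEEKABLE_SKIPPABLE_MAGIC 0x184D2A5EU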

#### `Seek_Table_Footer`
The seek table footer format is as follows:

|`Number_Of_Chunks`|`Seek_Table_Descriptor`|`Seekable_Magic_Number`|
Contributor

It seems Seekable_Magic_Number is not defined afterwards;
maybe it's implied that it is the same as the Skippable_Magic_Number in the previous paragraph?
Even if that's so, it needs to be described.

Also interesting: explain the reasoning for this construction.

For reference: thejoshwolfe/yauzl#48 (comment),
showing that there is nothing obvious and even commonly distributed formats can get it wrong.

Contributor Author

Oops, it does have a value that I forgot to include.
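
Regarding the reasoning for the construction asked about above: placing the magic number at the very end lets a decoder find the seek table with a single seek to a fixed-size footer, with no scan of the file. A minimal sketch, assuming a 9-byte footer of Number_Of_Chunks (4 bytes) | Seek_Table_Descriptor (1 byte) | Seekable_Magic_Number (4 bytes), all little-endian, and an illustrative magic value:

#include <stdio.h>
#include <stdint.h>

#define SEEKABLE_MAGIC_ASSUMED 0x8F92EAB1U   /* illustrative value */
#define SEEKABLE_FOOTER_SIZE 9

/* Returns 0 on success and fills in the chunk count and descriptor byte. */
static int read_seek_table_footer(FILE* f, uint32_t* numChunks, uint8_t* descriptor)
{
    unsigned char buf[SEEKABLE_FOOTER_SIZE];
    if (fseek(f, -(long)SEEKABLE_FOOTER_SIZE, SEEK_END) != 0) return -1;
    if (fread(buf, 1, SEEKABLE_FOOTER_SIZE, f) != SEEKABLE_FOOTER_SIZE) return -1;

    {   uint32_t const magic = (uint32_t)buf[5] | ((uint32_t)buf[6] << 8)
                             | ((uint32_t)buf[7] << 16) | ((uint32_t)buf[8] << 24);
        if (magic != SEEKABLE_MAGIC_ASSUMED) return -1;   /* not a seekable file */
    }
    *numChunks  = (uint32_t)buf[0] | ((uint32_t)buf[1] << 8)
                | ((uint32_t)buf[2] << 16) | ((uint32_t)buf[3] << 24);
    *descriptor = buf[4];
    return 0;
}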


If the checksum flag is set, each of the seek table entries contains a 4 byte checksum of the uncompressed data contained in its chunk.

`Reserved_Bits` are not currently used but may be used in the future for breaking changes, so a compliant decoder should ensure they are set to 0. `Unused_Bits` may be used in the future for non-breaking changes, so a compliant decoder should not interpret these bits.
Contributor

👍
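
A small sketch of what a compliant decoder might do with the descriptor byte (bit positions are an assumption for illustration: checksum flag in the top bit, Reserved_Bits below it, Unused_Bits at the bottom):

static int parse_seek_table_descriptor(unsigned char d, int* checksumFlag)
{
    *checksumFlag = (d >> 7) & 1;     /* per-chunk checksums present */
    if (d & 0x7C) return -1;          /* Reserved_Bits set: refuse to decode (breaking change) */
    /* low 2 bits are Unused_Bits: ignored, reserved for non-breaking extensions */
    return 0;
}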


`Seek_Table_Entries` consists of `Number_Of_Chunks` (one for each chunk in the data, not including the seek table frame) entries of the following form, in sequence:

|`Compressed_Size`|`Decompressed_Size`|`[Checksum]`|
Contributor

Just a question: what about the initial idea of working with offsets?

Contributor Author

I decided on sizes because the main upside of offsets is not having to parse the entire seek table, and I can't think of scenarios where that's very useful (in large files we have to do a lot of work/reading anyway; in small files the seek table is small). Using offsets would also either make the table bigger (8 bytes per field) or complicate the format. As is, a max chunk size of 4 GB seems reasonable.

We could change this, however, if the ability to seek into the middle of the jump table turns out to be desirable.

Contributor

I agree.
The main benefit of using offsets is that it keeps the table on disk, preserving memory, by reading only the required offsets when needed.
I can think of a few scenarios that would benefit from this construction, but they all involve specific situations (low memory) we should not worry about at this stage. There's ample time to decide whether it's a worthwhile scenario in some distant future.
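
For reference, rebuilding absolute offsets from the per-chunk sizes is a single prefix-sum pass at load time, so the size-based table costs nothing beyond that initial parse. A minimal sketch (struct and field names are illustrative, not the PR's):

#include <stdint.h>

typedef struct {
    uint64_t cOffset;   /* compressed offset of the start of this chunk   */
    uint64_t dOffset;   /* decompressed offset of the start of this chunk */
} seekEntryOffsets_t;   /* illustrative name */

/* Convert per-chunk (compressedSize, decompressedSize) pairs into cumulative
 * offsets. entries[] must hold numChunks + 1 slots so that the last slot
 * marks the end of the data, mirroring the "extra entry" trick used later
 * in this PR. */
static void build_offsets(const uint32_t* cSizes, const uint32_t* dSizes,
                          seekEntryOffsets_t* entries, uint32_t numChunks)
{
    uint64_t cPos = 0, dPos = 0;
    uint32_t i;
    for (i = 0; i < numChunks; i++) {
        entries[i].cOffset = cPos;
        entries[i].dOffset = dPos;
        cPos += cSizes[i];
        dPos += dSizes[i];
    }
    entries[numChunks].cOffset = cPos;  /* end-of-data sentinel */
    entries[numChunks].dOffset = dPos;
}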

U32 hi = table->tableLen;

while (lo + 1 < hi) {
    U32 mid = lo + ((hi - lo) >> 1);
Contributor

(minor) const whenever you can
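
For instance, the midpoint in the quoted binary search is never reassigned, so it can be const. A sketch of the suggested shape (the comparison against dOffset is assumed for illustration):

U32 lo = 0;
U32 hi = table->tableLen;
while (lo + 1 < hi) {
    U32 const mid = lo + ((hi - lo) >> 1);   /* const: computed once per iteration */
    if (table->entries[mid].dOffset <= offset) lo = mid;
    else hi = mid;
}
/* lo now indexes the frame containing `offset` */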

#define ZSTD_SEEKABLE_MAXCHUNKS 0x8000000U

/* 0xFE03F607 is the largest number x such that ZSTD_compressBound(x) fits in a 32-bit integer */
#define ZSTD_SEEKABLE_MAX_CHUNK_DECOMPRESSED_SIZE 0xFE03F607
Contributor

Could we get a static assert on this value?

ZSTD_compressBound() was recently changed, and it may be changed again in the future.
In case of a future change, it's necessary to ensure that ZSTD_compressBound(ZSTD_SEEKABLE_MAX_CHUNK_DECOMPRESSED_SIZE) doesn't overflow.

Note that it may require exposing a macro version of ZSTD_compressBound().
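
A sketch of such a check, using the block-scoped enum trick used elsewhere in zstd for ZSTD_STATIC_ASSERT; ZSTD_COMPRESSBOUND is assumed here to be the macro version of ZSTD_compressBound() mentioned above, which would still need to be exposed:

#define ZSTD_SEEKABLE_STATIC_ASSERT(c) \
    { enum { ZSTD_seekable_static_assert = 1/(int)(!!(c)) }; }

/* e.g. at the top of the seekable compressor's init function: */
ZSTD_SEEKABLE_STATIC_ASSERT(
    ZSTD_COMPRESSBOUND(ZSTD_SEEKABLE_MAX_CHUNK_DECOMPRESSED_SIZE) <= 0xFFFFFFFFU);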

/*-****************************************************************************
* Seekable Format
*
* The seekable format splits the compressed data into a series of "chunks",
Contributor

naming: "chunks" or "frames"?

Contributor Author

I guess the meaning lines up well with regular zstd frames; I can change "chunks" to "frames".

ZSTDLIB_API ZSTD_seekable_DStream* ZSTD_seekable_createDStream(void);
ZSTDLIB_API size_t ZSTD_seekable_freeDStream(ZSTD_seekable_DStream* zds);

/*===== Seekable decompression functions =====*/
Cyan4973 (Contributor), Apr 11, 2017

The decoding API feels slightly awkward overall.
I suspect it's designed to look like the generic streaming API, which works from buffer to buffer.

But it could be that FILE* operations are in fact a better fit for the job.
That could also be 2 different sets of functions.

These are difficult questions.
We will probably need to discuss this a bit together to converge on a design.
A useful tool for this is to develop sample user programs to get a "feel" for how usable the API is.

Contributor Author

Yeah, it would be much cleaner with FILE* objects; however, I wasn't sure whether that would be too restrictive for layers built on top.

Contributor

It certainly is restrictive, and that's why we should have a brainstorm session about it.

Contributor

If you work with FILE* objects, you could allow advanced usage like:

typedef void(*ZSTD_seekable_read)(void* opaque, void* buffer, size_t n);
typedef void(*ZSTD_seekable_seek)(void* opaque, unsigned long long offset); /* Careful with files > 4 GiB and fseek() */
typedef struct {
  void* opaque;
  ZSTD_seekable_read read;
  ZSTD_seekable_seek seek;
} ZSTD_seekable_CustomFile;

Users will probably also want to be able to read the seek table.
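
For illustration, a minimal FILE*-backed implementation of those callbacks might look like this (sketch only; error handling is omitted, and fseeko is used to avoid the >4 GiB fseek() caveat noted above):

#include <stdio.h>

static void FILE_read(void* opaque, void* buffer, size_t n)
{
    size_t const r = fread(buffer, 1, n, (FILE*)opaque);
    (void)r;   /* a real implementation would report short reads */
}

static void FILE_seek(void* opaque, unsigned long long offset)
{
    fseeko((FILE*)opaque, (off_t)offset, SEEK_SET);   /* 64-bit offsets on POSIX */
}

/* usage:
 *   FILE* const f = fopen("file.zst", "rb");
 *   ZSTD_seekable_CustomFile const src = { f, FILE_read, FILE_seek };
 */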

*
* Data streamed to the seekable compressor will automatically be split into
* frames of size `maxFrameSize` (provided in ZSTD_seekable_initCStream()),
* or if none is provided, will be cut off whenver ZSTD_endFrame() is called
Contributor

whenever ZSTD_seekable_endFrame()

* small, a size hint for how much data to provide.
* An error code may also be returned, checkable with ZSTD_isError()
*
* Use ZSTD_initDStream to prepare for a new decompression operation using the
Contributor

ZSTD_seekable_initDStream()

if (zcs->framelog.size == ZSTD_SEEKABLE_MAXFRAMES)
    return ERROR(frameIndex_tooLarge);

zcs->framelog.entries[zcs->framelog.size] = (framelogEntry_t)
Contributor

Do we need to support C versions older than C99?
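
(For context, the quoted assignment uses a C99 compound literal; if pre-C99 compilers had to be supported, the usual workaround is a named temporary. Sketch with assumed field names:)

/* C99 (as in the PR):
 *   zcs->framelog.entries[zcs->framelog.size] = (framelogEntry_t){ cSize, dSize, checksum };
 * C89-compatible alternative:
 */
{   framelogEntry_t entry;
    entry.cSize    = cSize;      /* field names are assumptions for illustration */
    entry.dSize    = dSize;
    entry.checksum = checksum;
    zcs->framelog.entries[zcs->framelog.size] = entry;
}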

/* grow the buffer if required */
if (zcs->framelog.size == zcs->framelog.capacity) {
    /* exponential size increase for constant amortized runtime */
    size_t const newCapacity = zcs->framelog.capacity * 2;
Contributor

/* restrict range to the end of the file, of non-negative size */
rangeStart = MIN(rangeStart, zds->seekTable.entries[zds->seekTable.tableLen].dOffset);
rangeEnd = MIN(rangeEnd, zds->seekTable.entries[zds->seekTable.tableLen].dOffset);
rangeEnd = MAX(rangeEnd, rangeStart);
Contributor

rangeEnd = MIN(rangeEnd, zds->seekTable.entries[zds->seekTable.tableLen].dOffset);
rangeStart = MIN(rangeStart, rangeEnd);


{   /* Allocate an extra entry at the end so that we can do size
     * computations on the last element without special case */
    seekEntry_t* entries = malloc(sizeof(seekEntry_t) * (numFrames + 1));
Contributor

I think some compilers will complain that you don't cast to a seekEntry_t*.
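
i.e. something along these lines keeps C++ and stricter C compilers happy:

seekEntry_t* const entries =
        (seekEntry_t*)malloc(sizeof(seekEntry_t) * (numFrames + 1));
if (entries == NULL) return ERROR(memory_allocation);   /* error path assumed */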

zds->targetStart - zds->decompressedOffset;
size_t const prevInputPos = input->pos;

ZSTD_outBuffer outTmp = {
Contributor

Same question here about C99.

zds->stage = zsds_seek;

/* force a seek first */
zds->curFrame = (U32) -1;
Contributor

No space between cast and value

jt->entries[zds->curFrame + 1].cOffset)
return ERROR(corruption_detected);
ZSTD_resetDStream(zds->dstream);
zds->stage = zsds_seek;
Contributor

Why do you need to seek after this? You've already asserted you are in the correct spot, right?

Contributor Author

The next frame to read might not be the frame immediately after, e.g. in the case of interleaved skippable frames. This will seek over them.

Contributor

Right, since dOffset == 0 for skippable frames. Can you add a comment that explains that?

Contributor Author

Not quite: dSize == 0 for skippable frames, which translates to entries[i].dOffset == entries[i+1].dOffset when i is the index of a skippable frame. Using offsets in the decoder even though the format uses sizes is a bit confusing, but necessary for the decoder to be efficient. I'll add a comment explaining why the seek is necessary.
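
To make the invariant concrete, here is what the reconstructed offsets look like with a skippable frame interleaved between two data frames (values are illustrative):

/* frame index :   0 (data)    1 (skippable)    2 (data)
 * cSize       :   1000        20               1200
 * dSize       :   4096        0                4096
 *
 * reconstructed offsets (one extra terminating entry):
 *   entries[0].dOffset = 0
 *   entries[1].dOffset = 4096    <- frame 1 is skippable: dSize == 0, so...
 *   entries[2].dOffset = 4096    <- ...entries[1].dOffset == entries[2].dOffset
 *   entries[3].dOffset = 8192
 *
 * A request for decompressed position 5000 resolves to frame 2, and the
 * decoder must still seek past frame 1 in the compressed stream to reach it.
 */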

ZSTDLIB_API size_t ZSTD_seekable_getFrameCompressedSize(ZSTD_seekable* const zs, unsigned frameIndex);
ZSTDLIB_API size_t ZSTD_seekable_getFrameDecompressedSize(ZSTD_seekable* const zs, unsigned frameIndex);

ZSTDLIB_API unsigned ZSTD_seekable_offsetToFrame(ZSTD_seekable* const zs, unsigned long long offset);
Contributor

ZSTD_seekable_offsetToFrameIndex for consistent naming

/* Limit the maximum size to avoid any potential issues storing the compressed size */
#define ZSTD_SEEKABLE_MAX_FRAME_DECOMPRESSED_SIZE 0x80000000U

#define ZSTD_SEEKABLE_FRAMEINDEX_TOOLARGE (0ULL-2)
Contributor

This macro value is of a different nature.
I would rather #define it closer to the family of functions that need it.

@sean-purcell
Contributor Author

Demo using parallel_processing.c, which sums up the values of all the bytes in a seekable compressed file by processing frames in parallel using a thread pool:

# create test file
devvm27563 examples [splittable] $ ../../../tests/datagen -g4096M > test.dat
devvm27563 examples [splittable ?] $ gcc -x c - -O3 <<EOF
> #include <stdio.h>
> int main() {
> char c;
> unsigned long long total = 0;
> while ((c = fgetc(stdin)) != EOF) { total += c; }
> printf("Sum: %llu\n", total);
> return 0;
> }
> EOF
# compute the expected answer with a test program
devvm27563 examples [splittable ?] $ cat test.dat | ./a.out
Sum: 240342218668
# create seekable file (compression level 5)
devvm27563 examples [splittable ?] $ time ./seekable_compression test.dat $((16 * 1024 * 1024))

real    1m29.590s
user    1m27.569s
sys     0m2.005s
# test with 1, 2, 4, 8, 16 threads (on a 24 core machine)
devvm27563 examples [splittable ?] $ time ./parallel_processing test.dat.zst 1
Sum: 240342218668

real    0m9.352s
user    0m8.946s
sys     0m0.399s
devvm27563 examples [splittable ?] $ time ./parallel_processing test.dat.zst 2
Sum: 240342218668

real    0m4.701s
user    0m9.004s
sys     0m0.383s
devvm27563 examples [splittable ?] $ time ./parallel_processing test.dat.zst 4
Sum: 240342218668

real    0m2.485s
user    0m9.426s
sys     0m0.421s
devvm27563 examples [splittable ?] $ time ./parallel_processing test.dat.zst 8
Sum: 240342218668

real    0m1.289s
user    0m9.443s
sys     0m0.499s
devvm27563 examples [splittable ?] $ time ./parallel_processing test.dat.zst 16
Sum: 240342218668

real    0m0.768s
user    0m10.493s
sys     0m0.709s

As the timing results show, decompression and processing parallelize very efficiently, gaining a ~2x speed-up each time the number of threads is doubled.
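
Because each frame decompresses independently, the worker pattern is simple: partition the frame indices across threads and let each thread decode and sum its own frames with a private decompression stream. A rough outline (not the PR's parallel_processing.c; the decompress_frame() helper is hypothetical, standing in for the seekable decoding calls discussed above):

#include <pthread.h>
#include <stdlib.h>

/* Hypothetical helper: decompress frame `frameIndex` of the file at `path`
 * into a freshly allocated buffer; returns the decompressed size. */
extern size_t decompress_frame(const char* path, unsigned frameIndex, unsigned char** out);

typedef struct {
    const char* path;
    unsigned firstFrame, lastFrame;   /* half-open range assigned to this worker */
    unsigned long long sum;           /* per-worker partial result */
} job_t;

static void* sum_worker(void* arg)
{
    job_t* const job = (job_t*)arg;
    unsigned f;
    for (f = job->firstFrame; f < job->lastFrame; f++) {
        unsigned char* buf = NULL;
        size_t const n = decompress_frame(job->path, f, &buf);
        size_t i;
        for (i = 0; i < n; i++) job->sum += buf[i];
        free(buf);
    }
    return NULL;
}

/* main(): split [0, numFrames) into NB_THREADS ranges, pthread_create() one
 * sum_worker per range, pthread_join() them all, and add up the partial sums. */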

@Cyan4973
Contributor

Yes, that's a great example, very clear.
Hopefully, it also helps to get a user-side feel for the API in such a use case.

Obviously, now that you have successfully completed parallel processing,
someone will wonder: why not compression in parallel too...

@sean-purcell
Contributor Author

Demo using parallel_compression.c, which performs the compression manually and uses the seekable API to construct a seek table from it (usage: ./parallel_compression FILE BLOCK_SIZE NB_THREADS).

devvm25806 examples [splittable ?] $ ./datagen -g4192M > test.dat
devvm25806 examples [splittable !?] $ time ./parallel_compression test.dat 1048576 1

real    1m28.619s
user    1m25.908s
sys     0m5.726s
devvm25806 examples [splittable ?] $ time ./parallel_compression test.dat 1048576 2

real    0m45.397s
user    1m25.755s
sys     0m5.411s
devvm25806 examples [splittable ?] $ time ./parallel_compression test.dat 1048576 4

real    0m24.319s
user    1m27.992s
sys     0m5.293s
devvm25806 examples [splittable ?] $ time ./parallel_compression test.dat 1048576 8

real    0m13.489s
user    1m28.168s
sys     0m5.592s
devvm25806 examples [splittable ?] $ time ./parallel_compression test.dat 1048576 16

real    0m8.563s
user    1m30.531s
sys     0m5.454s

The files can be decompressed using zstd and are compatible with the seekable capabilities of parallel_processing and seekable_decompression.

@Cyan4973
Contributor

Cyan4973 commented May 3, 2017

To be merged after release v1.2.0

@Cyan4973 Cyan4973 merged commit d47709b into facebook:dev May 8, 2017
@azat mentioned this pull request on Sep 16, 2018