
[RFC] Splittable Format and API #654

Merged: 14 commits merged into facebook:dev on May 8, 2017

Conversation

sean-purcell
Contributor

This is an initial implementation of a zstd splittable format and API, as per #395. Comments and feedback are appreciated.

@Cyan4973
Contributor

For clarity of the repository, we want to move the seekable format and demo into contrib;
it's not validated as part of Zstandard yet (that will come at a later stage).


__`Skippable_Magic_Number`__

Value : 0x184D2A5?, which means any value from 0x184D2A50 to 0x184D2A5F.
Cyan4973 (Contributor), Apr 11, 2017

All 16 values are fine for "skippable frames" in general,
but for this specification in particular,
should we pin down a specific number instead?
Maybe one that is less likely to be used, such as *E?
cc @terrelln for reference (he did something equivalent for pzstd).

Contributor Author

*E sounds reasonable
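
For illustration, the pinned value could then be defined along these lines (the macro name is hypothetical; only the 0x184D2A5E value follows from the discussion above):

/* seek table is stored in a skippable frame tagged with the *E variant */
#define ZSTD_SEEKABLE_SKIPPABLE_MAGIC 0x184D2A5EU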

#### `Seek_Table_Footer`
The seek table footer format is as follows:

|`Number_Of_Chunks`|`Seek_Table_Descriptor`|`Seekable_Magic_Number`|
Contributor

It seems Seekable_Magic_Number is not defined afterwards;
maybe it's implied that it is the same as the Skippable_Magic_Number in the previous paragraph?
Even if that's so, it needs to be described.

Also interesting: explain the reasoning for this construction.

For reference: thejoshwolfe/yauzl#48 (comment),
showing that there is nothing obvious and even commonly distributed formats can get it wrong.

Contributor Author

Oops, it does have a value that I forgot to include.
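
Regarding the reasoning for the construction asked about above: placing the magic number at the very end lets a decoder find the seek table with a single seek to a fixed-size footer, with no scan of the file. A minimal sketch, assuming a 9-byte footer of Number_Of_Chunks (4 bytes) | Seek_Table_Descriptor (1 byte) | Seekable_Magic_Number (4 bytes), all little-endian, and an illustrative magic value:

#include <stdio.h>
#include <stdint.h>

#define SEEKABLE_MAGIC_ASSUMED 0x8F92EAB1U   /* illustrative value */
#define SEEKABLE_FOOTER_SIZE 9

/* Returns 0 on success and fills in the chunk count and descriptor byte. */
static int read_seek_table_footer(FILE* f, uint32_t* numChunks, uint8_t* descriptor)
{
    unsigned char buf[SEEKABLE_FOOTER_SIZE];
    if (fseek(f, -(long)SEEKABLE_FOOTER_SIZE, SEEK_END) != 0) return -1;
    if (fread(buf, 1, SEEKABLE_FOOTER_SIZE, f) != SEEKABLE_FOOTER_SIZE) return -1;

    {   uint32_t const magic = (uint32_t)buf[5] | ((uint32_t)buf[6] << 8)
                             | ((uint32_t)buf[7] << 16) | ((uint32_t)buf[8] << 24);
        if (magic != SEEKABLE_MAGIC_ASSUMED) return -1;   /* not a seekable file */
    }
    *numChunks  = (uint32_t)buf[0] | ((uint32_t)buf[1] << 8)
                | ((uint32_t)buf[2] << 16) | ((uint32_t)buf[3] << 24);
    *descriptor = buf[4];
    return 0;
}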


If the checksum flag is set, each of the seek table entries contains a 4 byte checksum of the uncompressed data contained in its chunk.

`Reserved_Bits` are not currently used but may be used in the future for breaking changes, so a compliant decoder should ensure they are set to 0. `Unused_Bits` may be used in the future for non-breaking changes, so a compliant decoder should not interpret these bits.
Contributor

👍
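
A small sketch of what a compliant decoder might do with the descriptor byte (bit positions are an assumption for illustration: checksum flag in the top bit, Reserved_Bits below it, Unused_Bits at the bottom):

static int parse_seek_table_descriptor(unsigned char d, int* checksumFlag)
{
    *checksumFlag = (d >> 7) & 1;     /* per-chunk checksums present */
    if (d & 0x7C) return -1;          /* Reserved_Bits set: refuse to decode (breaking change) */
    /* low 2 bits are Unused_Bits: ignored, reserved for non-breaking extensions */
    return 0;
}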


`Seek_Table_Entries` consists of `Number_Of_Chunks` (one for each chunk in the data, not including the seek table frame) entries of the following form, in sequence:

|`Compressed_Size`|`Decompressed_Size`|`[Checksum]`|
Contributor

Just a question: what about the initial idea of working with offsets?

Contributor Author

I decided on sizes because the main upside of offsets is not having to parse the entire seek table, and I can't think of scenarios where that's very useful (in large files we have to do a lot of work/reading anyway; in small files the seek table is small). Using offsets would also either make the table bigger (8 bytes per field) or complicate the format. As is, a max chunk size of 4 GB seems reasonable.

We could change this, however, if the ability to seek into the middle of the jump table turns out to be desirable.

Contributor

I agree.
The main benefit of using offsets is that it keeps the table on disk, preserving memory, by reading only the required offsets when needed.
I can think of a few scenarios that would benefit from this construction, but they all involve specific situations (low memory) we should not worry about at this stage. There's ample time to decide whether it's a worthwhile scenario in some distant future.
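
For reference, rebuilding absolute offsets from the per-chunk sizes is a single prefix-sum pass at load time, so the size-based table costs nothing beyond that initial parse. A minimal sketch (struct and field names are illustrative, not the PR's):

#include <stdint.h>

typedef struct {
    uint64_t cOffset;   /* compressed offset of the start of this chunk   */
    uint64_t dOffset;   /* decompressed offset of the start of this chunk */
} seekEntryOffsets_t;   /* illustrative name */

/* Convert per-chunk (compressedSize, decompressedSize) pairs into cumulative
 * offsets. entries[] must hold numChunks + 1 slots so that the last slot
 * marks the end of the data, mirroring the "extra entry" trick used later
 * in this PR. */
static void build_offsets(const uint32_t* cSizes, const uint32_t* dSizes,
                          seekEntryOffsets_t* entries, uint32_t numChunks)
{
    uint64_t cPos = 0, dPos = 0;
    uint32_t i;
    for (i = 0; i < numChunks; i++) {
        entries[i].cOffset = cPos;
        entries[i].dOffset = dPos;
        cPos += cSizes[i];
        dPos += dSizes[i];
    }
    entries[numChunks].cOffset = cPos;  /* end-of-data sentinel */
    entries[numChunks].dOffset = dPos;
}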

U32 hi = table->tableLen;

while (lo + 1 < hi) {
    U32 mid = lo + ((hi - lo) >> 1);
Contributor

(minor) const whenever you can
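
For instance, the midpoint in the quoted binary search is never reassigned, so it can be const. A sketch of the suggested shape (the comparison against dOffset is assumed for illustration):

U32 lo = 0;
U32 hi = table->tableLen;
while (lo + 1 < hi) {
    U32 const mid = lo + ((hi - lo) >> 1);   /* const: computed once per iteration */
    if (table->entries[mid].dOffset <= offset) lo = mid;
    else hi = mid;
}
/* lo now indexes the frame containing `offset` */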

#define ZSTD_SEEKABLE_MAXCHUNKS 0x8000000U

/* 0xFE03F607 is the largest number x such that ZSTD_compressBound(x) fits in a 32-bit integer */
#define ZSTD_SEEKABLE_MAX_CHUNK_DECOMPRESSED_SIZE 0xFE03F607
Contributor

Could we get a static assert on this value?

ZSTD_compressBound() was recently changed, and it may be changed again in the future.
In case of a future change, it's necessary to ensure that ZSTD_compressBound(ZSTD_SEEKABLE_MAX_CHUNK_DECOMPRESSED_SIZE) doesn't overflow.

Note that it may require exposing a macro version of ZSTD_compressBound().
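
A sketch of such a check, using the block-scoped enum trick used elsewhere in zstd for ZSTD_STATIC_ASSERT; ZSTD_COMPRESSBOUND is assumed here to be the macro version of ZSTD_compressBound() mentioned above, which would still need to be exposed:

#define ZSTD_SEEKABLE_STATIC_ASSERT(c) \
    { enum { ZSTD_seekable_static_assert = 1/(int)(!!(c)) }; }

/* e.g. at the top of the seekable compressor's init function: */
ZSTD_SEEKABLE_STATIC_ASSERT(
    ZSTD_COMPRESSBOUND(ZSTD_SEEKABLE_MAX_CHUNK_DECOMPRESSED_SIZE) <= 0xFFFFFFFFU);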

/*-****************************************************************************
* Seekable Format
*
* The seekable format splits the compressed data into a series of "chunks",
Contributor

naming: "chunks" or "frames"?

Contributor Author

I guess the meaning lines up well with regular zstd frames; I can change "chunks" to "frames".

ZSTDLIB_API ZSTD_seekable_DStream* ZSTD_seekable_createDStream(void);
ZSTDLIB_API size_t ZSTD_seekable_freeDStream(ZSTD_seekable_DStream* zds);

/*===== Seekable decompression functions =====*/
Cyan4973 (Contributor), Apr 11, 2017

The decoding API feels slightly awkward overall.
I suspect it's designed to look like the generic streaming API, which works from buffer to buffer.

But it could be that FILE* operations are in fact a better fit for the job.
That could also be 2 different sets of functions.

These are difficult questions.
We will probably need to discuss this a bit together to converge on a design.
A useful tool for this is to develop sample user programs to get a "feel" for how usable the API is.

Contributor Author

Yeah, it would be much cleaner with FILE* objects; however, I wasn't sure whether that would be too restrictive for layers built on top.

Contributor

It certainly is restrictive, and that's why we should have a brainstorm session about it.

Contributor

If you work with FILE* objects, you could allow advanced usage like:

typedef void(*ZSTD_seekable_read)(void* opaque, void* buffer, size_t n);
typedef void(*ZSTD_seekable_seek)(void* opaque, unsigned long long offset); /* Careful with files > 4 GiB and fseek() */
typedef struct {
  void* opaque;
  ZSTD_seekable_read read;
  ZSTD_seekable_seek seek;
} ZSTD_seekable_CustomFile;

Users will probably also want to be able to read the seek table.
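
For illustration, a minimal FILE*-backed implementation of those callbacks might look like this (sketch only; error handling is omitted, and fseeko is used to avoid the >4 GiB fseek() caveat noted above):

#include <stdio.h>

static void FILE_read(void* opaque, void* buffer, size_t n)
{
    size_t const r = fread(buffer, 1, n, (FILE*)opaque);
    (void)r;   /* a real implementation would report short reads */
}

static void FILE_seek(void* opaque, unsigned long long offset)
{
    fseeko((FILE*)opaque, (off_t)offset, SEEK_SET);   /* 64-bit offsets on POSIX */
}

/* usage:
 *   FILE* const f = fopen("file.zst", "rb");
 *   ZSTD_seekable_CustomFile const src = { f, FILE_read, FILE_seek };
 */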

*
* Data streamed to the seekable compressor will automatically be split into
* frames of size `maxFrameSize` (provided in ZSTD_seekable_initCStream()),
* or if none is provided, will be cut off whenver ZSTD_endFrame() is called
Contributor

whenever ZSTD_seekable_endFrame()

* small, a size hint for how much data to provide.
* An error code may also be returned, checkable with ZSTD_isError()
*
* Use ZSTD_initDStream to prepare for a new decompression operation using the
Contributor

ZSTD_seekable_initDStream()

if (zcs->framelog.size == ZSTD_SEEKABLE_MAXFRAMES)
    return ERROR(frameIndex_tooLarge);

zcs->framelog.entries[zcs->framelog.size] = (framelogEntry_t)
Contributor

Do we need to support C versions older than C99?
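
(For context, the quoted assignment uses a C99 compound literal; if pre-C99 compilers had to be supported, the usual workaround is a named temporary. Sketch with assumed field names:)

/* C99 (as in the PR):
 *   zcs->framelog.entries[zcs->framelog.size] = (framelogEntry_t){ cSize, dSize, checksum };
 * C89-compatible alternative:
 */
{   framelogEntry_t entry;
    entry.cSize    = cSize;      /* field names are assumptions for illustration */
    entry.dSize    = dSize;
    entry.checksum = checksum;
    zcs->framelog.entries[zcs->framelog.size] = entry;
}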

/* grow the buffer if required */
if (zcs->framelog.size == zcs->framelog.capacity) {
    /* exponential size increase for constant amortized runtime */
    size_t const newCapacity = zcs->framelog.capacity * 2;
Contributor

/* restrict range to the end of the file, of non-negative size */
rangeStart = MIN(rangeStart, zds->seekTable.entries[zds->seekTable.tableLen].dOffset);
rangeEnd = MIN(rangeEnd, zds->seekTable.entries[zds->seekTable.tableLen].dOffset);
rangeEnd = MAX(rangeEnd, rangeStart);
Contributor

rangeEnd = MIN(rangeEnd, zds->seekTable.entries[zds->seekTable.tableLen].dOffset);
rangeStart = MIN(rangeStart, rangeEnd);


{   /* Allocate an extra entry at the end so that we can do size
     * computations on the last element without special case */
    seekEntry_t* entries = malloc(sizeof(seekEntry_t) * (numFrames + 1));
Contributor

I think some compilers will complain that you don't cast to a seekEntry_t*.
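
i.e. something along these lines keeps C++ and stricter C compilers happy:

seekEntry_t* const entries =
        (seekEntry_t*)malloc(sizeof(seekEntry_t) * (numFrames + 1));
if (entries == NULL) return ERROR(memory_allocation);   /* error path assumed */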

zds->targetStart - zds->decompressedOffset;
size_t const prevInputPos = input->pos;

ZSTD_outBuffer outTmp = {
Contributor

Same question here about C99.

zds->stage = zsds_seek;

/* force a seek first */
zds->curFrame = (U32) -1;
Contributor

No space between cast and value

jt->entries[zds->curFrame + 1].cOffset)
return ERROR(corruption_detected);
ZSTD_resetDStream(zds->dstream);
zds->stage = zsds_seek;
Contributor

Why do you need to seek after this? You've already asserted you are in the correct spot, right?

Contributor Author

The next frame to read might not be the frame immediately after, e.g. in the case of interleaved skippable frames. This will seek over them.

Contributor

Right, since dOffset == 0 for skippable frames. Can you add a comment that explains that?

Contributor Author

Not quite: dSize == 0 for skippable frames, which translates to entries[i].dOffset == entries[i+1].dOffset when i is the index of a skippable frame. Using offsets in the decoder even though the format uses sizes is a bit confusing, but necessary for the decoder to be efficient. I'll add a comment explaining why the seek is necessary.
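
To make the invariant concrete, here is what the reconstructed offsets look like with a skippable frame interleaved between two data frames (values are illustrative):

/* frame index :   0 (data)    1 (skippable)    2 (data)
 * cSize       :   1000        20               1200
 * dSize       :   4096        0                4096
 *
 * reconstructed offsets (one extra terminating entry):
 *   entries[0].dOffset = 0
 *   entries[1].dOffset = 4096    <- frame 1 is skippable: dSize == 0, so...
 *   entries[2].dOffset = 4096    <- ...entries[1].dOffset == entries[2].dOffset
 *   entries[3].dOffset = 8192
 *
 * A request for decompressed position 5000 resolves to frame 2, and the
 * decoder must still seek past frame 1 in the compressed stream to reach it.
 */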

ZSTDLIB_API size_t ZSTD_seekable_getFrameCompressedSize(ZSTD_seekable* const zs, unsigned frameIndex);
ZSTDLIB_API size_t ZSTD_seekable_getFrameDecompressedSize(ZSTD_seekable* const zs, unsigned frameIndex);

ZSTDLIB_API unsigned ZSTD_seekable_offsetToFrame(ZSTD_seekable* const zs, unsigned long long offset);
Contributor

ZSTD_seekable_offsetToFrameIndex for consistent naming

/* Limit the maximum size to avoid any potential issues storing the compressed size */
#define ZSTD_SEEKABLE_MAX_FRAME_DECOMPRESSED_SIZE 0x80000000U

#define ZSTD_SEEKABLE_FRAMEINDEX_TOOLARGE (0ULL-2)
Contributor

This macro value is of a different nature.
I would rather #define it closer to the family of functions that need it.

@sean-purcell
Contributor Author

Demo using parallel_processing.c, which sums up the values of all the bytes in a seekable compressed file by processing frames in parallel using a thread pool:

# create test file
devvm27563 examples [splittable] $ ../../../tests/datagen -g4096M > test.dat
devvm27563 examples [splittable ?] $ gcc -x c - -O3 <<EOF
> #include <stdio.h>
> int main() {
> char c;
> unsigned long long total = 0;
> while ((c = fgetc(stdin)) != EOF) { total += c; }
> printf("Sum: %llu\n", total);
> return 0;
> }
> EOF
# compute the expected answer with a test program
devvm27563 examples [splittable ?] $ cat test.dat | ./a.out
Sum: 240342218668
# create seekable file (compression level 5)
devvm27563 examples [splittable ?] $ time ./seekable_compression test.dat $((16 * 1024 * 1024))

real    1m29.590s
user    1m27.569s
sys     0m2.005s
# test with 1, 2, 4, 8, 16 threads (on a 24 core machine)
devvm27563 examples [splittable ?] $ time ./parallel_processing test.dat.zst 1
Sum: 240342218668

real    0m9.352s
user    0m8.946s
sys     0m0.399s
devvm27563 examples [splittable ?] $ time ./parallel_processing test.dat.zst 2
Sum: 240342218668

real    0m4.701s
user    0m9.004s
sys     0m0.383s
devvm27563 examples [splittable ?] $ time ./parallel_processing test.dat.zst 4
Sum: 240342218668

real    0m2.485s
user    0m9.426s
sys     0m0.421s
devvm27563 examples [splittable ?] $ time ./parallel_processing test.dat.zst 8
Sum: 240342218668

real    0m1.289s
user    0m9.443s
sys     0m0.499s
devvm27563 examples [splittable ?] $ time ./parallel_processing test.dat.zst 16
Sum: 240342218668

real    0m0.768s
user    0m10.493s
sys     0m0.709s

As the timing results show, decompression and processing parallelize very efficiently, gaining a ~2x speed-up each time the number of threads is doubled.
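
Because each frame decompresses independently, the worker pattern is simple: partition the frame indices across threads and let each thread decode and sum its own frames with a private decompression stream. A rough outline (not the PR's parallel_processing.c; the decompress_frame() helper is hypothetical, standing in for the seekable decoding calls discussed above):

#include <pthread.h>
#include <stdlib.h>

/* Hypothetical helper: decompress frame `frameIndex` of the file at `path`
 * into a freshly allocated buffer; returns the decompressed size. */
extern size_t decompress_frame(const char* path, unsigned frameIndex, unsigned char** out);

typedef struct {
    const char* path;
    unsigned firstFrame, lastFrame;   /* half-open range assigned to this worker */
    unsigned long long sum;           /* per-worker partial result */
} job_t;

static void* sum_worker(void* arg)
{
    job_t* const job = (job_t*)arg;
    unsigned f;
    for (f = job->firstFrame; f < job->lastFrame; f++) {
        unsigned char* buf = NULL;
        size_t const n = decompress_frame(job->path, f, &buf);
        size_t i;
        for (i = 0; i < n; i++) job->sum += buf[i];
        free(buf);
    }
    return NULL;
}

/* main(): split [0, numFrames) into NB_THREADS ranges, pthread_create() one
 * sum_worker per range, pthread_join() them all, and add up the partial sums. */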

@Cyan4973
Contributor

Yes, that's a great example, very clear.
Hopefully, it also helps to get a user-side feel for the API in such a use case.

Obviously, now that you have successfully completed parallel processing,
someone will wonder: why not compression in parallel too...

@sean-purcell
Contributor Author

Demo using parallel_compression.c, which performs the compression manually and uses the seekable API to construct a seek table from it (usage: ./parallel_compression FILE BLOCK_SIZE NB_THREADS).

devvm25806 examples [splittable ?] $ ./datagen -g4192M > test.dat
devvm25806 examples [splittable !?] $ time ./parallel_compression test.dat 1048576 1

real    1m28.619s
user    1m25.908s
sys     0m5.726s
devvm25806 examples [splittable ?] $ time ./parallel_compression test.dat 1048576 2

real    0m45.397s
user    1m25.755s
sys     0m5.411s
devvm25806 examples [splittable ?] $ time ./parallel_compression test.dat 1048576 4

real    0m24.319s
user    1m27.992s
sys     0m5.293s
devvm25806 examples [splittable ?] $ time ./parallel_compression test.dat 1048576 8

real    0m13.489s
user    1m28.168s
sys     0m5.592s
devvm25806 examples [splittable ?] $ time ./parallel_compression test.dat 1048576 16

real    0m8.563s
user    1m30.531s
sys     0m5.454s

The files can be decompressed using zstd and are compatible with the seekable capabilities of parallel_processing and seekable_decompression.

@Cyan4973
Contributor

Cyan4973 commented May 3, 2017

To be merged after release v1.2.0

@Cyan4973 Cyan4973 merged commit d47709b into facebook:dev May 8, 2017
@azat mentioned this pull request on Sep 16, 2018