Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multiple calls to Write() has unexpected overheads? #22

Closed
jimsmart opened this issue Jan 15, 2018 · 10 comments
Closed

Multiple calls to Write() has unexpected overheads? #22

jimsmart opened this issue Jan 15, 2018 · 10 comments
Labels
blocker Something that needs to be fixed before next release bug

Comments

@jimsmart
Copy link

jimsmart commented Jan 15, 2018

If I make multiple calls to Write, the resulting compressed data stream length is much greater than buffering the data and making a single call to Write.

— Is this expected behaviour? The reason I ask is that if chained with other writers, one cannot make any guarantees as to how they may parcel up their data.

Code (error handling omitted):

func main() {

	b := &bytes.Buffer{}
	for i := 0; i < 500; i++ {
		b.Write([]byte("Hello World! "))
	}
	data1 := b.Bytes()
	fmt.Println("data len", len(data1))

	// Compress 1
	buffer1 := &bytes.Buffer{}
	w1 := zstd.NewWriterLevel(buffer1, CompressionLevel)
	w1.Write(data1)
	w1.Close()

	fmt.Println("buffer1 len", buffer1.Len())

	// Compress 2
	buffer2 := &bytes.Buffer{}
	w2 := zstd.NewWriterLevel(buffer2, CompressionLevel)
	for i := 0; i < 500; i++ {
		w2.Write([]byte("Hello World! "))
	}
	w2.Close()

	fmt.Println("buffer2 len", buffer2.Len())
}

Output:

data len 6500
buffer1 len 33
buffer2 len 5014

Regards

@jimsmart jimsmart changed the title Multiple calls to Write() not compressing as continuous stream? Multiple calls to Write() has great overheads? Jan 15, 2018
@jimsmart
Copy link
Author

Possibly related?
facebook/zstd#206

@jimsmart
Copy link
Author

jimsmart commented Jan 15, 2018

Ok, the best thing to do if one cannot ensure good-sized calls to Write, is to use a bufio.Writer like this:-

func main() {

	b := &bytes.Buffer{}
	for i := 0; i < 500; i++ {
		b.Write([]byte("Hello World! "))
	}
	data1 := b.Bytes()
	fmt.Println("data len", len(data1))

	// Compress 1
	buffer1 := &bytes.Buffer{}
	w1 := zstd.NewWriterLevel(buffer1, CompressionLevel)
	w1.Write(data1)
	w1.Close()

	fmt.Println("Buffer1 len", buffer1.Len())

	// Compress 2
	buffer2 := &bytes.Buffer{}
	w2 := zstd.NewWriterLevel(buffer2, CompressionLevel)
	bw := bufio.NewWriter(w2) // default buffer size = 4k
	// bw := bufio.NewWriterSize(w2, 8192) // buffer size = 8k
	for i := 0; i < 500; i++ {
		bw.Write([]byte("Hello World! "))
	}
	bw.Flush()
	w2.Close()

	fmt.Println("Buffer2 len", buffer2.Len())
}

Output:

data len 6500
Buffer1 len 33
Buffer2 len 44

It's not so elegant, but ¯\(ツ)

— Hope that helps someone!

@jimsmart jimsmart changed the title Multiple calls to Write() has great overheads? Multiple calls to Write() has unexpected overheads? Jan 15, 2018
@valyala
Copy link
Contributor

valyala commented Mar 31, 2018

@jimsmart , try gozstd.Writer. It uses another underlying zstd API, which should have lower overhead.

@rgeronimi
Copy link

The zstd bug referenced above (facebook/zstd#206) has been closed. Is this issue still ongoing? If yes, the need to wrap the writer with a buffer shall be documented, this is a pretty subtle usage advice.

@Viq111
Copy link
Collaborator

Viq111 commented Dec 4, 2018

Hi @rgeronimi,

I checked again previous result and indeed you'd currently have the same results.
zstd (this lib) & gozstd (from @valyala above) use 2 slightly different C zstd API with slightly different directions.

(this) zstd library uses ZSTD_compressContinue which basically use buffer-less zstd streaming compression, meaning we have complete control over memory at the expense of needing to do buffers Go-side if you want to optimize for compression size on small inputs.

gozstd uses ZSTD_compressStream which abstract that buffer logic into the C code (at the cost of having less control over memory consumption C land)

Hope this helps!

@Viq111 Viq111 added the question label Dec 4, 2018
@valyala
Copy link
Contributor

valyala commented Dec 5, 2018

The following limitations for ZSTD_compressContinue look scary:

  • ZSTD_compressContinue() presumes prior input is still accessible and unmodified (up to maximum distance size, see WindowLog). It remembers all previous contiguous blocks, plus one separated memory segment (which can itself consists of multiple contiguous blocks)
  • ZSTD_compressContinue() detects that prior input has been overwritten when src buffer overlaps. In which case, it will "discard" the relevant memory section from its history.

As I understand, they mean two things:

  • zstd could use garbage as a dictionary from the previous buffers used in the ZSTD_compressContinue call if address of these buffers doesn't match the address of the current buffer. This also may lead to segmentation fault if the underlying memory of the previous buffer has been unmapped from the process address space.
  • zstd may have bad compression rate, since it discards dictionary data from the previously compressed block if the buffer passed to the func is re-used.

cc'ing @Cyan4973 for further clarification.

@Cyan4973
Copy link

Cyan4973 commented Dec 5, 2018

@Viq111 explanations are correct.

ZSTD_compressContinue() is a fairly low level function, designed for systems which need absolute control over memory allocation. It requires a fairly good control over buffer content and lifetime. To be fair, it's more targeted at embedded environments than managed languages, but I'm not a qualified expert to tell if this is a good fit or not for go.

When in doubt, prefer using ZSTD_compressStream(). It's safer to use, and abstract all the machinery, at the cost of also managing its own internal buffers.

@rgeronimi
Copy link

rgeronimi commented Dec 5, 2018

ZSTD_compressContinue() is a fairly low level function, designed for systems which need absolute control over memory allocation. It requires a fairly good control over buffer content and lifetime. To be fair, it's more targeted at embedded environments than managed languages, but I'm not a qualified expert to tell if this is a good fit or not for go.

This depends on what the Go wrapper code does. I just checked it and it transmits directly the user-provided buffer as a C pointer. If I understand what @valyala wrote, this could be a critical bug as the zstd C code is expecting this buffer to remain accessible after the function returns. If true, this would have the potential for data corruptions, process crashes, and hard-to-reproduce cases.

When in doubt, prefer using ZSTD_compressStream(). It's safer to use, and abstract all the machinery, at the cost of also managing its own internal buffers.

@Viq111
Copy link
Collaborator

Viq111 commented Dec 5, 2018

Reading back at the code, we started implementing the go wrapper at zstd v0.5 which only had the ZBUFF_decompressContinue methods indeed: https://github.com/facebook/zstd/blob/201433a7f713af056cc7ea32624eddefb55e10c8/lib/zstd_buffered.h#L79

It may actually be also the issue for #39

If anyone could put up a PR for migrating to ZSTD_compressStream, we are accepting all contributions!

Otherwise I can also look into it as it seems it could bit a couple of people using the streaming interface

@Viq111 Viq111 added bug and removed question labels Dec 5, 2018
@rgeronimi
Copy link

We don't have the skillset to zoom into that soon. For storage tasks (e.g., blob storage in DB or compressed custom backup) this bug is a showstopper unfortunately.

@Viq111 Viq111 added the blocker Something that needs to be fixed before next release label Dec 28, 2018
@Viq111 Viq111 mentioned this issue Dec 28, 2018
rasky added a commit to rasky/zstd that referenced this issue Nov 20, 2019
Until DataDog#22 is fixed, this library is using zstd in a way that can cause data corruption, as confirmed by zstd maintainer himself. I think this is critical enough that should be mentioned at the top of README and the Go community should be alerted until the bug is fixed
Viq111 added a commit that referenced this issue Dec 3, 2019
@Viq111 Viq111 closed this as completed in ca147a0 Mar 19, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
blocker Something that needs to be fixed before next release bug
Projects
None yet
Development

No branches or pull requests

5 participants