command: use external sort for comparison in sync #483
I ran a few tests with sync (at the time it used encoding/xml; it now uses encoding/json). I created 1,000,000 files of 16 bytes each. With the old version, the sync operation took about 500 seconds and used about 10 GB of RAM. I'm planning to set the default chunk size to 100,000, which uses up to 2 GB of RAM. Together with the files' contents (when real files of significant size are used, not merely 16 bytes), I guess 8 GB of RAM would be sufficient in general.
Updates
It was using encoding/xml, but now uses the encoding/json format to write to disk.
1,000,000 objects, each one 16 bytes
For syncing 1,000,000 local objects to S3:
When all those 1,000,000 objects are supposed to be uploaded
When none of the objects are supposed to be uploaded, with
Extra commentaries
9.1 GB semi-natural directory
A directory of 9.1 GB in 52250 objects. The directory content consists of some repositories cloned from GitHub, an .iso file, and a script file:
QuantumLibraries
Telegram
digitalgov.gov
fireship.io
go
openimage.py
s5cmd
sql-server-samples
ubuntu.iso
Hyperfine single run results (no reproducible difference):
'Old internal sort' ran
I guess using StableSort would make a difference only if filepath.ToSlash were called before the sort operation, not after. Furthermore, I think using filepath.ToSlash is problematic, because it potentially breaks the assumption that both of the lists/channels are sorted.
For example, on a Windows device we may have such an order:
So I propose these changes as a fix: https://github.com/Kucukaslan/s5cmd/commit/40a22c91df08cea3f150455f6c2d060480ba745c
As an additional benefit, this reduces the risk of blocking sends to those channels.
I've implemented another variant that uses gob encoding instead of json encoding in this branch. I've also written test code to compare their respective encoding/decoding speeds. In my first tests, json encoding was much faster than gob encoding: for 1 million objects, json encoding took about 8.5 seconds, while gob encoding took about 30 seconds. After perusing their implementations, I identified the main reason for the difference: gob's default time.Time encoding seems to be slower than storing the time as an RFC3339Nano formatted string. With that change, gob takes about 10 seconds for the same test values. I currently do not know how the encoded slice sizes will affect the external sorting.
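test code:
// Relies on Object, ObjectType, FromBytes and the url package from the branch,
// plus the standard fmt, testing and time packages.
func TestFromToBytes(t *testing.T) {
	key := "s3://bucket/key/w*ldcard/\net"
	size := int64(53861)
	modTime := time.Now()
	ty := ObjectType{1}
	count := 1000000

	u, err := url.New(key, url.WithRaw(true))
	if err != nil {
		// Stop here; continuing would use a nil URL.
		t.Fatal(err)
	}

	o1 := Object{
		URL:     u,
		Size:    size,
		ModTime: &modTime,
		Type:    ty,
	}

	b := o1.ToBytes()
	fmt.Println(len(b), b)

	// Time `count` encode/decode round trips.
	start := time.Now()
	for i := 0; i < count; i++ {
		_ = FromBytes(o1.ToBytes()).(Object)
	}
	elapsed := time.Since(start)
	fmt.Printf("Processing %d objects took %s\n", count, elapsed)

	// Deliberate failure, presumably so that the timing output is always shown.
	t.Fail()
}
benchmark
cpu: Intel(R) Core(TM) i9-8950HK CPU @ 2.90GHz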
func BenchmarkFromToBytes(b *testing.B) {
	key := "s3://bucket/key/w*ldcard/\net"
	size := int64(53861)
	modTime := time.Now()
	ty := ObjectType{1}

	u, err := url.New(key, url.WithRaw(true))
	if err != nil {
		// Stop here; continuing would use a nil URL.
		b.Fatal(err)
	}

	o1 := Object{
		URL:     u,
		Size:    size,
		ModTime: &modTime,
		Type:    ty,
	}

	// Exclude the setup above from the measurement.
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		_ = FromBytes(o1.ToBytes()).(Object)
	}
}
Thanks a lot for trying gob for encoding/decoding. Let's go with gob, because:
…ding to increase speed
This change reduced the time it takes for 1,000,000 encode/decode operations from 30 seconds to 10 seconds.
My pleasure. I've rebased the gob encoding branch onto this PR. P.S. The time is converted to an RFC3339Nano formatted string before gob encoding is called, and parsed from that string after gob decoding.
Hello. Any chance this fix will be merged and released? It's very useful for us.
I'd also be interested in seeing this merged!
Sorry for the long wait @kucukaslan, and thank you for your hard work on this 😄
It was my pleasure, thanks 😊
I love you for this change |
It uses external sort instead of in-memory sort to dramatically reduce memory usage at the expense of speed.
It uses the encoding/gob format to store data on disk and restore it from there.
Fixes #441
Fixes #447