Datastore benchmarks #4870

Closed
schomatis opened this issue Mar 23, 2018 · 34 comments
Labels
topic/badger Topic badger topic/datastore Topic datastore topic/perf Performance

Comments

@schomatis
Contributor

Considering that the default datastore will be transitioning from flatfs to badger soon (#4279), it would be useful to have an approximate idea of the performance gains (which are considerable) and also of the losses (e.g., #4298).

I'll be working in a separate repository (it can be integrated here later) to develop some profiling statistics; this should also help build a better understanding of Badger's internals (at least that's my expectation).

I'm open to (and in need of) suggestions for use cases to test (random reads, rewriting/deleting the same value several times, GC'ing while reading/writing, etc).

@leerspace
Contributor

I think the time to complete some common operations would be interesting for repos with varying sizes, pin counts, and amounts of unpinned content:

  • ipfs pin ls
  • ipfs add-ing various directory and file mixes
  • ipfs repo verify
  • ipfs repo gc

If memory utilization comparisons can also be made, I think that would be useful. I've seen some really high memory usage from the IPFS daemon in certain cases when working with large 300+ GB repos, which I attributed to badger, but I haven't gone back to test with flatfs to see whether it actually was badger or just having a large repo (or something else).
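For what it's worth, here is a minimal sketch (my own, not existing tooling) of how the CLI operations listed above could be timed; the specific commands and the way the repo is prepared for each run are assumptions to be varied per test:

package main

import (
    "fmt"
    "os/exec"
    "time"
)

// timeCmd runs a command and prints how long it took.
func timeCmd(name string, args ...string) {
    start := time.Now()
    out, err := exec.Command(name, args...).CombinedOutput()
    if err != nil {
        fmt.Printf("%s %v failed: %v\n%s\n", name, args, err, out)
        return
    }
    fmt.Printf("%s %v took %v\n", name, args, time.Since(start))
}

func main() {
    // Operations suggested above; repo size, pin count and the amount of
    // unpinned content would be prepared externally before each run.
    timeCmd("ipfs", "pin", "ls")
    timeCmd("ipfs", "repo", "verify")
    timeCmd("ipfs", "repo", "gc")
}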

@Kubuxu
Member

Kubuxu commented Mar 24, 2018

@schomatis see also https://github.com/Kubuxu/go-ds-bench

It is a bit old but should work after a few fixes.

@schomatis
Contributor Author

See dgraph-io/badger#446 for a discussion of search key performance in the IPFS case.

@manishrjain

@leerspace: Btw, Badger's memory usage can be reduced via options, e.g. by mmap-ing the LSM tree instead of loading it into RAM, or by keeping the value log on disk instead of mmap-ing it.
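For reference, a minimal sketch of what those options look like against the Badger v1.x API (field names as of that era; the paths are placeholders):

package main

import (
    "log"

    badger "github.com/dgraph-io/badger"
    "github.com/dgraph-io/badger/options"
)

func main() {
    opts := badger.DefaultOptions
    opts.Dir = "/path/to/badger"      // placeholder
    opts.ValueDir = "/path/to/badger" // placeholder

    // mmap the LSM tree instead of loading all tables into RAM
    opts.TableLoadingMode = options.MemoryMap
    // keep the value log on disk instead of mmap-ing it
    opts.ValueLogLoadingMode = options.FileIO

    db, err := badger.Open(opts)
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()
}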

@recoilme

Hello,
I am totally new to IPFS. I am the developer of the https://github.com/recoilme/slowpoke datastore.
Maybe you could use it instead of badger? Slowpoke has ideas similar to Badger's, but without the boring LSM tree.

@recoilme

@schomatis I created a PR with a slowpoke test: https://github.com/schomatis/datastore_benchmarks/pull/1

Quick summary
Slowpoke:
Put time: 1.16221931s
Get time: 805.776917ms
Db size: 1 048 570 bytes (Zero bytes overhead)
Index size: 5 033 136 bytes

Badger:
Put time: 902.318742ms
Get time: 723.95486ms
Vlog size: 7 247 634 bytes
Sst size: 6 445 276 bytes

Slowpoke looks a little slower than badger, but not dramatically so, and it has many other advantages.
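For context, a rough sketch (not the code from the linked PR) of the kind of put/get timing loop behind numbers like these, using Badger's v1.x transaction API; key count, value size and paths are arbitrary assumptions:

package main

import (
    "fmt"
    "log"
    "math/rand"
    "time"

    badger "github.com/dgraph-io/badger"
)

func main() {
    opts := badger.DefaultOptions
    opts.Dir = "/tmp/badger-bench" // placeholder
    opts.ValueDir = "/tmp/badger-bench"
    db, err := badger.Open(opts)
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    const n = 100000
    val := make([]byte, 10)
    rand.Read(val)

    // Put phase: one transaction per key, timed as a whole.
    start := time.Now()
    for i := 0; i < n; i++ {
        key := []byte(fmt.Sprintf("key-%d", i))
        if err := db.Update(func(txn *badger.Txn) error {
            return txn.Set(key, val)
        }); err != nil {
            log.Fatal(err)
        }
    }
    fmt.Println("Put time:", time.Since(start))

    // Get phase: read every key back and touch its value.
    start = time.Now()
    for i := 0; i < n; i++ {
        key := []byte(fmt.Sprintf("key-%d", i))
        if err := db.View(func(txn *badger.Txn) error {
            item, err := txn.Get(key)
            if err != nil {
                return err
            }
            _, err = item.Value()
            return err
        }); err != nil {
            log.Fatal(err)
        }
    }
    fmt.Println("Get time:", time.Since(start))
}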

@magik6k
Member

magik6k commented Apr 20, 2018

@recoilme how does slowpoke scale in the few-TB to PB range? Btw, I'd put this discussion in a separate issue as it's kind of off-topic here.

@recoilme

Ok @magik6k, please link me to it.

In general, slowpoke may be a little slower than badger on synthetic benchmarks, but it scales better on big databases. Slowpoke is a proxy to the filesystem, like flatfs + indexes + memory management,
or like badger without the LSM tree inside. Each table works in a separate goroutine, and each key store is a map holding the value's size and address in the file.
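A toy illustration (not slowpoke's actual code) of the layout described: values are appended to a plain file, and an in-memory map records each key's offset and size:

package main

import (
    "fmt"
    "io"
    "os"
)

// entry records where a value lives in the append-only data file.
type entry struct {
    off  int64
    size int64
}

type store struct {
    f     *os.File
    index map[string]entry // key -> location of its value on disk
}

func open(path string) (*store, error) {
    f, err := os.OpenFile(path, os.O_CREATE|os.O_RDWR|os.O_APPEND, 0644)
    if err != nil {
        return nil, err
    }
    return &store{f: f, index: make(map[string]entry)}, nil
}

func (s *store) put(key string, val []byte) error {
    off, err := s.f.Seek(0, io.SeekEnd) // the value is appended at the current end
    if err != nil {
        return err
    }
    if _, err := s.f.Write(val); err != nil {
        return err
    }
    s.index[key] = entry{off: off, size: int64(len(val))}
    return nil
}

func (s *store) get(key string) ([]byte, bool, error) {
    e, ok := s.index[key]
    if !ok {
        return nil, false, nil
    }
    buf := make([]byte, e.size)
    _, err := s.f.ReadAt(buf, e.off)
    return buf, true, err
}

func main() {
    s, err := open("/tmp/toy.db") // placeholder path
    if err != nil {
        panic(err)
    }
    defer s.f.Close()
    if err := s.put("k1", []byte("hello")); err != nil {
        panic(err)
    }
    v, ok, err := s.get("k1")
    fmt.Println(ok, string(v), err)
}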

@kevina
Contributor

kevina commented May 2, 2018

@recoilme please also test "repo gc" when, say, 99% of the repo is not pinned.

Badger seems an order of magnitude (at least) slower than flatfs (i.e. slowpoke), but this needs verification.

@schomatis schomatis added topic/datastore Topic datastore topic/badger Topic badger labels May 3, 2018
@schomatis schomatis changed the title [badger] Datastore benchmarks Datastore benchmarks May 3, 2018
@schomatis
Contributor Author

@kevina Could you please provide a simple example of a GC operation that takes an order of magnitude longer than with flatfs, so I can take a deeper look into this performance issue?

@kevina
Contributor

kevina commented May 3, 2018

@schomatis on an empty repo do a

ipfs ls --resolve-type=false QmXNHWdf9qr7A67FZQTFVb6Nr1Vfp4Ct3HXLgthGG61qy1

This could take a very long time; you may need to use #4979.

Then do an "ipfs repo gc".

Note: if you use flatfs, you should turn the sync option off before populating the datastore to help with performance.

I have not done the former test yet. I am guessing you should see the same problem with any repo that contains lots of (over 100k) small objects with very few of them pinned.

Also see #4908.

@kevina
Contributor

kevina commented May 3, 2018

@schomatis here is a script to reproduce the problem

#!/bin/bash

# requires 'jq': https://stedolan.github.io/jq/

set -e

# the 'random' helper is built by the go-ipfs sharness test deps;
# avoid RANDOM as the variable name since bash treats it specially
RANDOM_BIN=$GOPATH/src/github.com/ipfs/go-ipfs/test/bin/random
TMP=/aux/scratch/tmp

RNDFILE="$TMP/128MB-rnd-file"
"$RANDOM_BIN" 134217728 > "$RNDFILE"

export IPFS_PATH="$TMP/ipfs-tmp"

ipfs init > /dev/null
# temporarily disable flatfs sync to speed up the add
mv "$TMP/ipfs-tmp"/config "$TMP/ipfs-tmp"/config.bk
jq '.Datastore.Spec.mounts[0].child.sync = false' "$TMP/ipfs-tmp"/config.bk > "$TMP/ipfs-tmp"/config
ipfs add --pin=false --chunker=size-1024 "$RNDFILE"
mv "$TMP/ipfs-tmp"/config.bk "$TMP/ipfs-tmp"/config
echo "calling repo gc, default config"
time ipfs repo gc > /dev/null

rm -rf "$TMP/ipfs-tmp"

ipfs init -p badgerds > /dev/null
ipfs add --pin=false --chunker=size-1024 "$RNDFILE"
echo "calling repo gc, badgerds"
time ipfs repo gc > /dev/null

rm -rf "$TMP/ipfs-tmp"

rm "$RNDFILE"

When TMP pointed to "/tmp", which uses tmpfs, the "repo gc" was fine. It was only when TMP pointed to a non-memory filesystem that it was very slow. I'll try to let it complete and report the results.

@kevina
Contributor

kevina commented May 3, 2018

Okay, I gave up and killed "repo gc" when badgerds was used. Here are the results:
Flatfs: 47 sec
Badgerds: >26m17s, or 1577s (killed the process)

So badgerds is at least 30 times slower.

@schomatis
Contributor Author

Great work @kevina! Thanks for the test script. I'll try to reproduce it on my end and see if I can pinpoint the bottleneck on the Badger implementation.

The GC is supposed to be slower in Badger due to the added complexity of checking for the deletion marks in the value log, but an order of magnitude slower (or more) would be too much to ask the user to bear during this transition.

@manishrjain

manishrjain commented May 4, 2018

Let me see if I can run this. I'm doing a whole bunch of improvements related to GC and versioning. I feel those should fix this up nicely.

What version of Badger are you guys on?

Update: When I run the script by @kevina above, it fails with

$ ./t.sh                                                                      ~/test
./t.sh: line 11: 20034: command not found

My go-ipfs/test/bin doesn't have a random binary, after make install. What am I missing?

@kevina

This comment has been minimized.

@schomatis
Contributor Author

Hi @manishrjain, thanks for stepping in here. The master branch of IPFS is using v1.3.0; I'm not sure what version @kevina used for the tests, but feel free to assume we're using the latest version in Badger's master branch. Badger is still not the default datastore, so we have some latitude here.

To use the latest version of Badger inside IPFS you can run the following commands:

export BADGER_HASH=QmdKhi5wUQyV9i3GcTyfUmpfTntWjXu8DcyT9HyNbznYrn
rm -rfv $GOPATH/src/gx/ipfs/$BADGER_HASH
git clone https://github.com/dgraph-io/badger $GOPATH/src/gx/ipfs/$BADGER_HASH/badger
wget https://ipfs.io/ipfs/$BADGER_HASH/badger/package.json -O $GOPATH/src/gx/ipfs/$BADGER_HASH/badger/package.json
cd $GOPATH/src/github.com/ipfs/go-ipfs
make install

Sorry for the convoluted commands; the current tool for working with development packages in gx (gx-go link) has an issue at the moment that prevents a more elegant solution (AFAIK).

@kevina
Contributor

kevina commented May 4, 2018

@schomatis I am using master at Fri Apr 27 12:52:01 2018 +0900 commit 2162f7e.

@schomatis
Contributor Author

@manishrjain The random util as mentioned by @kevina needs to be installed separately (it's a dependency of the sharness tests):

cd $GOPATH/src/github.com/ipfs/go-ipfs/test/sharness/
make deps

@schomatis
Contributor Author

@schomatis I am using master at Fri Apr 27 12:52:01 2018 +0900 commit 2162f7e.

Great, so that should indeed be Badger's latest stable version, 1.3.0.

@manishrjain

I pulled in Badger's master head; not sure if it makes any difference here.

I set SyncWrites to false, which sped up the repo gc command significantly -- for GC, having async writes makes sense anyway. So, you might want to turn async writes on for the GC command.

Another thing I noticed is that each "removed Q.." is run as a separate transaction, serially. If deletions could be batched up in one transaction, that would speed things up a lot. Though, looking at the code, it might be easier to run bs.DeleteBlock(k) concurrently in multiple goroutines. The input and output are already done via channels, so this should be feasible.
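A sketch of the batching idea (not the actual go-ipfs or go-ds-badger code): delete keys in fixed-size chunks, committing one Badger transaction per chunk instead of one per deleted block. The chunk size is an arbitrary assumption:

package gcsketch

import badger "github.com/dgraph-io/badger"

// deleteBatched removes keys in chunks, one transaction per chunk, instead of
// one transaction per key. Very large transactions can hit badger.ErrTxnTooBig,
// which is why the keys are split up rather than deleted in a single Update.
func deleteBatched(db *badger.DB, keys [][]byte, chunk int) error {
    for start := 0; start < len(keys); start += chunk {
        end := start + chunk
        if end > len(keys) {
            end = len(keys)
        }
        batch := keys[start:end]
        err := db.Update(func(txn *badger.Txn) error {
            for _, k := range batch {
                if err := txn.Delete(k); err != nil {
                    return err
                }
            }
            return nil
        })
        if err != nil {
            return err
        }
    }
    return nil
}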

On my laptop, with $TMP set to $HOME/test/tmp:

$ ipfs add --pin=false --chunker=size-1024 128MB-rnd-file                                  ~/test/tmp
Creating badger
added QmfYcagv5oqa2cKKaxiDBJaU5bXu4VwKTUNz5h1DQx3Uye 128MB-rnd-file

 19:57:59  ~/test/tmp 
$ time ipfs repo gc > /dev/null                                                            ~/test/tmp
ipfs repo gc > /dev/null  11.28s user 1.47s system 179% cpu 7.085 total

@schomatis
Contributor Author

I can confirm that setting SyncWrites to false brings Badger's GC time down to almost that of flatfs. Still, in the example provided, 128MB is below the 1GB ValueLogFileSize threshold, so no GC is actually triggered on the Badger side; only the deletion writes are issued on the IPFS side. Lowering ValueLogFileSize to force Badger's GC increases the ipfs repo gc time for the badger datastore, but only to 1.25-1.5x. More tests with bigger repo sizes are needed, but this is very promising.

Thanks a lot @manishrjain!! This is a big win, please let me know if I can be of help with the ongoing GC improvements of dgraph-io/badger#454.

@kevina
Contributor

kevina commented May 4, 2018

@schomatis the idea behind adding a small file with very small chunks was to approximate how a very large shared directory will be stored.

@schomatis
Contributor Author

@kevina I see. My comment above is not about the file size (or chunk size) itself, but rather that it is necessary to surpass ValueLogFileSize (whatever that size is) to make sure there is more than one value log file, so that Badger actually does a GC over one of them. (If there is only one log file, as in this example, only the deletion writes are measured, not Badger's search time when it inspects a value log file to see which keys are deleted and which aren't.)
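To make that concrete, a hedged sketch against the Badger v1.x API (the sizes and discard ratio are arbitrary choices) of lowering ValueLogFileSize so that several value log files exist, and then asking Badger to collect them:

package main

import (
    "log"

    badger "github.com/dgraph-io/badger"
)

func main() {
    opts := badger.DefaultOptions
    opts.Dir = "/path/to/badger" // placeholder
    opts.ValueDir = "/path/to/badger"
    opts.SyncWrites = false          // as discussed above; trades durability for speed
    opts.ValueLogFileSize = 16 << 20 // 16 MiB instead of the ~1 GiB default, so several vlog files get created

    db, err := badger.Open(opts)
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    // ... write and delete enough data to fill more than one value log file ...

    // Rewrite value log files that are at least half garbage; stop once
    // Badger reports there is nothing left to rewrite.
    for {
        if err := db.RunValueLogGC(0.5); err != nil {
            break
        }
    }
}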

@recoilme

recoilme commented May 4, 2018

Sorry, just my 5 cents about nosync:

Setting SyncWrites to false == corrupted database.
Without fsync, data is not durably stored; if the app/OS crashes or shuts down unexpectedly, all data will be lost and the database may (will, in the case of LSM-tree based DBs) be corrupted.

Async writes to a file may lead to corrupted data too.
For async writes to a file you must use only one file descriptor per file and mutexes, like in this library: https://github.com/recoilme/syncfile
Or goroutines, like in https://github.com/recoilme/slowpoke
But how to do that with badger? You may not open Badger's files from another file descriptor with a guarding library like syncfile.

Slowpoke doesn't have a "nosync" option, but it has batch writes (the Sets method) with an fsync at the end. It works like a transaction. It also has DeleteFile and Close methods. It may store unpinned/pinned items separately, and you may close (free keys from memory) data that is not needed and delete all files with unneeded data quickly and safely.
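A tiny toy example (not slowpoke's actual API) of the batch-write idea described above: several records are written and a single fsync at the end makes the whole batch durable together:

package main

import (
    "fmt"
    "log"
    "os"
)

// writeBatch appends several key/value records and calls fsync once at the end,
// so the batch is persisted (or lost) as a unit.
func writeBatch(f *os.File, records map[string][]byte) error {
    for k, v := range records {
        if _, err := fmt.Fprintf(f, "%s=%x\n", k, v); err != nil {
            return err
        }
    }
    return f.Sync() // single fsync for the whole batch
}

func main() {
    f, err := os.OpenFile("/tmp/batch.log", os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0644) // placeholder path
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()
    if err := writeBatch(f, map[string][]byte{"k1": []byte("v1"), "k2": []byte("v2")}); err != nil {
        log.Fatal(err)
    }
}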

@manishrjain

Thanks a lot @manishrjain!! This is a big win, please let me know if I can be of help with the ongoing GC improvements of dgraph-io/badger#454.

Thanks for the offer, @schomatis. I could definitely use your help on various issues. That PR you referenced is part of a v2.0 launch. Want to chat over email? Mine is my first name at dgraph.io.

@schomatis
Contributor Author

@recoilme That's a good point on syncing I/O (I'll keep it in mind for the Badger tests); also, thanks for submitting the PR with the slowpoke benchmarks.

Please note that this issue concerns the Badger benchmarks as a datastore for IPFS. If you would like to add slowpoke as an experimental datastore for IPFS, I'm all for it, but please open a new issue to discuss that in detail.

Regarding the benchmark results, I would like to point out that although performance is the main motivation for transitioning to Badger, there are other aspects of choosing a DB that in my opinion are also important to consider: how many developers are actively working on it, how many users in the community are using/testing it, who else has adopted it as their default DB, what documentation the project has, and how well the system can adapt (to the IPFS use case). I think all of those (and I'm missing many more) are important aspects to consider beyond the fact that a benchmark might suggest that one DB outperforms another by ~5% in some particular store/search scenario.

@recoilme

recoilme commented May 8, 2018

@schomatis Thanks for the detailed answer. It seems to me that badger is an excellent choice, but I would be happy to add slowpoke as an experimental datastore for research. I just want to solve some problems specific to IPFS in my storage because it's interesting. I'll implement the datastore interface after my vacation and open an issue for discussion if you like.

@schomatis
Contributor Author

@recoilme Great!

@ajbouh

ajbouh commented Jul 17, 2018

Is anyone actively working on this?

@schomatis
Contributor Author

Hey @ajbouh, I am (sort of, I've been distracted with other issues), but any help is more than welcome :)

If you were just asking to use Badger as your datastore you can enable it as an experimental feature.

@ajbouh

ajbouh commented Jul 17, 2018

I'm trying to make IPFS performance benchmarks real partially in service of:

And partly to help support some of the awesome efforts already underway like:

@schomatis
Contributor Author

IPFS + TensorFlow 😍

So, these were just some informal benchmarks to get an idea of the already known benefits of using Badger over a flat architecture, and as they stand they are just an incomplete skeleton. As mentioned by @Stebalien in #4279 (comment) the priority right now is on making Badger as stable as possible, so benchmarking performance is not really at the top of the list at the moment. That being said, if you want to continue with this work feel free to write me and we could coordinate a plan forward so I could guide you through it.

@magik6k magik6k added the topic/perf Performance label Nov 4, 2018
@Stebalien
Member

Covered by #6523.
