This repository has been archived by the owner on Apr 29, 2020. It is now read-only.

perf: concurrent file import #41

Merged: 3 commits into master on Nov 27, 2019

Conversation

@achingbrain (Collaborator) commented Nov 27, 2019

Adds two new options:

`fileImportConcurrency`

This controls the number of files that are imported concurrently. You may wish to set this high if you are importing lots of small files.

`blockWriteConcurrency`

This controls how many blocks from each file we hash and write to disk at the same time. Setting this high when writing large files will significantly increase import speed, though having it high when `fileImportConcurrency` is also high can swamp the process.

'High' is relative to the speed of your hardware (CPU and disk).

It also:

  1. Flattens module options, because validating deep objects was clunky and access to config sub-objects within this module wasn't well separated
  2. Replaces `superstruct` and `deep-extend` with `merge-options`, which is better suited to merging options and is smaller
  3. Replaces the `async-iterator-*` modules with versions from the more zeitgeisty `it-*` namespace

Supersedes #38, sort of. There's no batching, but atomicity guarantees are maintained and performance gains are broadly similar with the right tuning.
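
For readers wanting to try these, a minimal usage sketch follows. The option names and proposed defaults come from this PR; the surrounding call shape (the importer's `(source, ipld, options)` signature and the pre-configured `ipld` instance) is assumed from the module's README of this era rather than taken from this thread:

```js
const importer = require('ipfs-unixfs-importer')

// `ipld` is assumed to be a pre-configured IPLD instance over your blockstore
async function importFiles (ipld) {
  const source = [{
    path: 'foo.txt',
    content: Buffer.from('hello world')
  }]

  const options = {
    fileImportConcurrency: 50, // files imported at the same time (proposed default)
    blockWriteConcurrency: 10  // blocks per file hashed & written at the same time (proposed default)
  }

  // the importer yields one entry per imported file/directory
  for await (const entry of importer(source, ipld, options)) {
    console.log(entry.path, entry.cid.toString())
  }
}
```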

"multihashes": "~0.4.14",
"nyc": "^14.0.0",
"sinon": "^7.1.0"
},
"dependencies": {
"async-iterator-all": "^1.0.0",
"async-iterator-batch": "~0.0.1",
"async-iterator-first": "^1.0.0",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to get PRs to js-ipfs and js-ipfs-http-client (and any other projects in our dependency tree) to switch them there too, so we don't end up with multiple copies of the same thing.

achingbrain (Collaborator, Author):
Sure, I'll do that in the PR that adds support for the new options.


```diff
 module.exports = async function * (source, reduce) {
   const roots = []

-  for await (const chunk of batch(source, Infinity)) {
+  for await (const chunk of batch(source, Number.MAX_SAFE_INTEGER)) {
```

Contributor:

Just use `it-all`?

Contributor:

...I mean, unless you actually want to receive a batch of `Number.MAX_SAFE_INTEGER` items at maximum 🤪

achingbrain (Collaborator, Author):

Historically, I think the assumption has been that `source` could return variable-length arrays, so we use a batcher that flattens them first, though the tests seem to pass with a regular batcher so I'm not sure that's true any more.

In fact, I think it only ever worked because of how `Array.concat` treats its arguments (e.g. not rejecting non-array arguments).

`it-batch` (internal buffer uses `Array.push` and yields when the buffer gets big enough):

```
input                         batcher                         output
child, child, child, child -> [child, child, child, child] -> [child, child, child, child]
```

`it-flat-batch` (internal buffer uses `Array.concat`):

```
input                             batcher                         output
[child, child, child], [child] -> [child, child, child, child] -> [child, child, child, child]
```

`it-all` (like `it-batch` but with an unbounded buffer size):

```
input                         output
child, child, child, child -> [child, child, child, child]
```

What fun.

TBH I don't know why we have this builder; it's not terribly useful, and you can construct the same DAG with the balanced builder and a really high `maxChildrenPerNode` value.
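
To make the comparison concrete, here's a small sketch of the `it-all`/`it-batch` behaviours described above (my example, not from the thread; it assumes the published APIs of those two modules):

```js
const all = require('it-all')
const batch = require('it-batch')

async function * source () {
  yield 'child1'
  yield 'child2'
  yield 'child3'
}

async function demo () {
  // it-all buffers the entire iterable and resolves to a single array
  console.log(await all(source())) // ['child1', 'child2', 'child3']

  // it-batch with a huge batch size buffers everything too, but still
  // yields the buffered items as batches (here: exactly one batch)
  for await (const chunk of batch(source(), Number.MAX_SAFE_INTEGER)) {
    console.log(chunk) // ['child1', 'child2', 'child3']
  }
}

demo().catch(console.error)
```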

Review comment on `README.md`:

```
- `maxChunkSize` (positive integer, defaults to `262144`): the maximum chunk size
- `avgChunkSize` (positive integer, defaults to `262144`): the average chunk size (rabin chunker only)
- `minChunkSize` (positive integer): the minimum chunk size (rabin chunker only)
- `maxChunkSize` (positive integer, defaults to `262144`): the maximum chunk size
```

Contributor:

Please can we get an accompanying PR to js-ipfs?

Contributor (on the same `README.md` lines):

The dirBuilder options (listed later here) have not been flattened... intentional?

achingbrain (Collaborator, Author) commented Nov 27, 2019:

Just a mistake. Internally it's been flattened for ages, so the dirBuilder options haven't worked for a while. Oops.
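
For illustration, a sketch of what the flattening means for callers. Only the flat option names appear in this PR's diffs; the nested shape and the `minChunkSize` value below are my assumptions about the old layout:

```js
// Old, nested layout (assumed shape, for illustration only):
const before = {
  chunker: 'rabin',
  chunkerOptions: {
    maxChunkSize: 262144,
    avgChunkSize: 262144,
    minChunkSize: 16384 // hypothetical value
  }
}

// New, flattened layout (option names from the README diff above):
const after = {
  chunker: 'rabin',
  maxChunkSize: 262144,
  avgChunkSize: 262144,
  minChunkSize: 16384 // hypothetical value
}
```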

@alanshaw (Contributor) commented:

Got any perf stats we can ogle at?

achingbrain added a commit to ipfs/js-ipfs that referenced this pull request on Nov 27, 2019:

Adds two new arguments to the cli & http interface:

`--file-import-concurrency` and `--block-write-concurrency`

See ipfs-inactive/js-ipfs-unixfs-importer#41 for further discussion.
@achingbrain (Collaborator, Author) commented Nov 27, 2019:

> Got any perf stats we can ogle at?

statto

Data generation: 1000 files of 1kB-1MB in size, filled with random(ish) data:

```bash
$ cat gen-data.sh
#!/bin/bash

mkdir -p data

for n in {1..1000}; do
    dd if=/dev/urandom of=data/file-$( printf %03d "$n" ).bin bs=1 count=$( shuf -i 1024-1048576 -n 1 )
done
```

Generated data:

```console
$ du -hs ./data
525M	./data
```

go-IPFS (IDK why it thinks the data folder is smaller than it is):

```console
$ time ipfs add -r ./data
...
added QmSgH3zFNiMNyH2es2TrkZmjPTETVXvdGQHkA1hPS2n6sL data
 501.99 MiB / 501.99 MiB [============================] 100.00%
real	0m55.654s
user	0m4.931s
sys	0m5.502s
```

js-IPFS master:

```console
$ time jsipfs add -r ./data
added QmSgH3zFNiMNyH2es2TrkZmjPTETVXvdGQHkA1hPS2n6sL data

real	0m52.505s
user	0m13.482s
sys	0m5.760s
```

File & block concurrency = 1 (e.g. how it was before this PR):

$ time jsipfs add -r --file-import-concurrency=1 --block-write-concurrency=1 ./data
...
added QmSgH3zFNiMNyH2es2TrkZmjPTETVXvdGQHkA1hPS2n6sL data

real	0m52.650s
user	0m14.849s
sys	0m5.410s

Proposed default values:

```console
$ time jsipfs add -r --file-import-concurrency=50 --block-write-concurrency=10 ./data
...
added QmSgH3zFNiMNyH2es2TrkZmjPTETVXvdGQHkA1hPS2n6sL data

real	0m25.814s
user	0m11.552s
sys	0m5.262s
```

@achingbrain merged commit 68ac8cc into master on Nov 27, 2019
@achingbrain deleted the concurrent-file-import branch on Nov 27, 2019
achingbrain added a commit to ipfs/js-ipfs that referenced this pull request on Dec 11, 2019:

* perf: expose importer concurrency controls when adding files

Adds two new arguments to the cli & http interface:

`--file-import-concurrency` and `--block-write-concurrency`

See ipfs-inactive/js-ipfs-unixfs-importer#41 for further discussion.

* chore: update deps, remove unused

* fix: pass args from http

* fix: hard code file concurrency for http requests

* fix: fix up chunker parsing tests

* chore: use ipfs-http-client branch temporarily

* chore: increase bundlesize by 1kb

* chore: update dep

* chore: increase bundle size again