Adds two new options:

- `fileImportConcurrency`: controls the number of files that are imported concurrently. You may wish to set this high if you are importing lots of small files.
- `blockWriteConcurrency`: controls how many blocks from each file we write to disk at the same time. Setting this high when writing large files will significantly increase import speed, though having it high when `fileImportConcurrency` is also high can swamp the process.

It also:

1. Flattens module options, because validating deep objects was clunky and the separation of access to config sub-objects within this module isn't very good
1. Replaces `superstruct` and `deep-extend` with `merge-options`, which is better suited for merging options and is smaller
1. Replaces `async-iterator-*` modules with the more zeitgeisty `it-*` namespace

Supersedes #38, sort of.
Force-pushed from 54a39d5 to 40908cc
"multihashes": "~0.4.14", | ||
"nyc": "^14.0.0", | ||
"sinon": "^7.1.0" | ||
}, | ||
"dependencies": { | ||
"async-iterator-all": "^1.0.0", | ||
"async-iterator-batch": "~0.0.1", | ||
"async-iterator-first": "^1.0.0", |
It would be good to get PRs to js-ipfs and js-ipfs-http-client (and any other projects in our dependency tree) to switch them over there too, so we don't end up with multiple copies of the same thing.
Sure - I'll do that in the PR that adds support for the new options
`src/dag-builder/file/flat.js` (outdated)

```diff
 module.exports = async function * (source, reduce) {
   const roots = []

-  for await (const chunk of batch(source, Infinity)) {
+  for await (const chunk of batch(source, Number.MAX_SAFE_INTEGER)) {
```
Just use `it-all`?
...I mean unless you actually want to receive a batch of `Number.MAX_SAFE_INTEGER` items at maximum 🤪
Historically I think the assumption has been that `source` could return variable-length arrays, so we use a batcher that flattens them first, though the tests seem to pass with a regular batcher so I'm not sure that's true any more.

In fact I think it only ever worked because of how `Array.concat` treats its arguments (e.g. not rejecting non-array arguments).
it-batch (internal buffer uses `Array.push` and yields when the buffer gets big enough)

```
input                           batcher                          output
child, child, child, child  ->  [child, child, child, child] ->  [child, child, child, child]
```

it-flat-batch (internal buffer uses `Array.concat`)

```
input                              batcher                          output
[child, child, child], [child] ->  [child, child, child, child] ->  [child, child, child, child]
```

it-all (like it-batch but with an unbounded buffer size)

```
input                           output
child, child, child, child  ->  [child, child, child, child]
```
What fun.
TBH I don't know why we have this builder, it's not terribly useful and you can construct the same DAG with the balanced builder and a really high `maxChildrenPerNode` value.
```
- `maxChunkSize` (positive integer, defaults to `262144`): the maximum chunk size
- `avgChunkSize` (positive integer, defaults to `262144`): the average chunk size (rabin chunker only)
- `minChunkSize` (positive integer): the minimum chunk size (rabin chunker only)
- `maxChunkSize` (positive integer, defaults to `262144`): the maximum chunk size
```
Please can we get an accompanying PR to js-ipfs?
`dirBuilder` options (listed later here) have not been flattened... intentional?
Just a mistake. Internally it's been flattened for ages so the `dirBuilder` options haven't worked for a while. Oops.
Got any perf stats we can ogle at?
Adds two new arguments to the cli & http interface: `--file-import-concurrency` and `--block-write-concurrency`. See ipfs-inactive/js-ipfs-unixfs-importer#41 for further discussion.
Data generation, 1000x files 1k-1MB in size filled with random(ish) data:

```console
$ cat gen-data.sh
#!/bin/bash
mkdir -p data
for n in {1..1000}; do
  dd if=/dev/urandom of=data/file-$( printf %03d "$n" ).bin bs=1 count=$( shuf -i 1024-1048576 -n 1 )
done
```

Generated data:

```console
$ du -hs ./data
525M    ./data
```

go-IPFS (IDK why it thinks the data folder is smaller than it is):

```console
$ time ipfs add -r ./data
...
added QmSgH3zFNiMNyH2es2TrkZmjPTETVXvdGQHkA1hPS2n6sL data
 501.99 MiB / 501.99 MiB [============================] 100.00%

real    0m55.654s
user    0m4.931s
sys     0m5.502s
```

js-IPFS master:

```console
$ time jsipfs add -r ./data
added QmSgH3zFNiMNyH2es2TrkZmjPTETVXvdGQHkA1hPS2n6sL data

real    0m52.505s
user    0m13.482s
sys     0m5.760s
```

File & block concurrency = 1 (e.g. how it was before this PR):

```console
$ time jsipfs add -r --file-import-concurrency=1 --block-write-concurrency=1 ./data
...
added QmSgH3zFNiMNyH2es2TrkZmjPTETVXvdGQHkA1hPS2n6sL data

real    0m52.650s
user    0m14.849s
sys     0m5.410s
```

Proposed default values:

```console
$ time jsipfs add -r --file-import-concurrency=50 --block-write-concurrency=10 ./data
...
added QmSgH3zFNiMNyH2es2TrkZmjPTETVXvdGQHkA1hPS2n6sL data

real    0m25.814s
user    0m11.552s
sys     0m5.262s
```
* perf: expose importer concurrency controls when adding files

  Adds two new arguments to the cli & http interface: `--file-import-concurrency` and `--block-write-concurrency`. See ipfs-inactive/js-ipfs-unixfs-importer#41 for further discussion.

* chore: update deps, remove unused
* fix: pass args from http
* fix: hard code file concurrency for http requests
* fix: fix up chunker parsing tests
* chore: use ipfs-http-client branch temporarily
* chore: increase bundlesize by 1kb
* chore: update dep
* chore: increase bundle size again
Adds two new options:

- `fileImportConcurrency`: controls the number of files that are imported concurrently. You may wish to set this high if you are importing lots of small files.
- `blockWriteConcurrency`: controls how many blocks from each file we hash and write to disk at the same time. Setting this high when writing large files will significantly increase import speed, though having it high when `fileImportConcurrency` is also high can swamp the process. 'High' is relative to your hardware speed (CPU & disk).

It also:

1. Replaces `superstruct` and `deep-extend` with `merge-options`, which is better suited for merging options and is smaller
1. Replaces `async-iterator-*` modules with versions from the more zeitgeisty `it-*` namespace

Supersedes #38, sort of. No batching, but atomicity guarantees are maintained and performance gains are broadly similar with the right tuning.