This repository has been archived by the owner on Apr 29, 2020. It is now read-only.

perf: concurrent file import #41

Merged: 3 commits into master on Nov 27, 2019

Conversation

@achingbrain (Collaborator) commented Nov 27, 2019

Adds two new options:

`fileImportConcurrency`

This controls the number of files that are imported concurrently. You may wish to set this high if you are importing lots of small files.

`blockWriteConcurrency`

This controls how many blocks from each file we hash and write to disk at the same time. Setting this high when writing large files will significantly increase import speed, though having it high when `fileImportConcurrency` is also high can swamp the process.

'High' is relative to the speed of your hardware (CPU and disk).

It also:

  1. Flattens module options, because validating deep objects was clunky and access to config sub-objects within this module wasn't well separated
  2. Replaces `superstruct` and `deep-extend` with `merge-options`, which is better suited to merging options and is smaller
  3. Replaces the `async-iterator-*` modules with versions from the more zeitgeisty `it-*` namespace

Supersedes #38, sort of. There's no batching, but atomicity guarantees are maintained and performance gains are broadly similar with the right tuning.
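
For readers wanting to try these, a minimal usage sketch follows. The option names and proposed defaults come from this PR; the surrounding call shape (the importer's `(source, ipld, options)` signature and the pre-configured `ipld` instance) is assumed from the module's README of this era rather than taken from this thread:

```js
const importer = require('ipfs-unixfs-importer')

// `ipld` is assumed to be a pre-configured IPLD instance over your blockstore
async function importFiles (ipld) {
  const source = [{
    path: 'foo.txt',
    content: Buffer.from('hello world')
  }]

  const options = {
    fileImportConcurrency: 50, // files imported at the same time (proposed default)
    blockWriteConcurrency: 10  // blocks per file hashed & written at the same time (proposed default)
  }

  // the importer yields one entry per imported file/directory
  for await (const entry of importer(source, ipld, options)) {
    console.log(entry.path, entry.cid.toString())
  }
}
```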

"multihashes": "~0.4.14",
"nyc": "^14.0.0",
"sinon": "^7.1.0"
},
"dependencies": {
"async-iterator-all": "^1.0.0",
"async-iterator-batch": "~0.0.1",
"async-iterator-first": "^1.0.0",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to get PRs to js-ipfs and js-ipfs-http-client (and any other projects in our dependency tree) to switch them there too, so we don't end up with multiple copies of the same thing.

achingbrain (Collaborator, Author):
Sure, I'll do that in the PR that adds support for the new options.


```diff
 module.exports = async function * (source, reduce) {
   const roots = []

-  for await (const chunk of batch(source, Infinity)) {
+  for await (const chunk of batch(source, Number.MAX_SAFE_INTEGER)) {
```

Contributor:

Just use `it-all`?

Contributor:

...I mean, unless you actually want to receive a batch of `Number.MAX_SAFE_INTEGER` items at maximum 🤪

achingbrain (Collaborator, Author):

Historically, I think the assumption has been that `source` could return variable-length arrays, so we use a batcher that flattens them first, though the tests seem to pass with a regular batcher so I'm not sure that's true any more.

In fact, I think it only ever worked because of how `Array.concat` treats its arguments (e.g. not rejecting non-array arguments).

`it-batch` (internal buffer uses `Array.push` and yields when the buffer gets big enough):

```
input                         batcher                         output
child, child, child, child -> [child, child, child, child] -> [child, child, child, child]
```

`it-flat-batch` (internal buffer uses `Array.concat`):

```
input                             batcher                         output
[child, child, child], [child] -> [child, child, child, child] -> [child, child, child, child]
```

`it-all` (like `it-batch` but with an unbounded buffer size):

```
input                         output
child, child, child, child -> [child, child, child, child]
```

What fun.

TBH I don't know why we have this builder; it's not terribly useful, and you can construct the same DAG with the balanced builder and a really high `maxChildrenPerNode` value.
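
To make the comparison concrete, here's a small sketch of the `it-all`/`it-batch` behaviours described above (my example, not from the thread; it assumes the published APIs of those two modules):

```js
const all = require('it-all')
const batch = require('it-batch')

async function * source () {
  yield 'child1'
  yield 'child2'
  yield 'child3'
}

async function demo () {
  // it-all buffers the entire iterable and resolves to a single array
  console.log(await all(source())) // ['child1', 'child2', 'child3']

  // it-batch with a huge batch size buffers everything too, but still
  // yields the buffered items as batches (here: exactly one batch)
  for await (const chunk of batch(source(), Number.MAX_SAFE_INTEGER)) {
    console.log(chunk) // ['child1', 'child2', 'child3']
  }
}

demo().catch(console.error)
```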

Review comment on `README.md`:

```
- `maxChunkSize` (positive integer, defaults to `262144`): the maximum chunk size
- `avgChunkSize` (positive integer, defaults to `262144`): the average chunk size (rabin chunker only)
- `minChunkSize` (positive integer): the minimum chunk size (rabin chunker only)
- `maxChunkSize` (positive integer, defaults to `262144`): the maximum chunk size
```

Contributor:

Please can we get an accompanying PR to js-ipfs?

Contributor (on the same `README.md` lines):

The dirBuilder options (listed later here) have not been flattened... intentional?

achingbrain (Collaborator, Author) commented Nov 27, 2019:

Just a mistake. Internally it's been flattened for ages, so the dirBuilder options haven't worked for a while. Oops.
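
For illustration, a sketch of what the flattening means for callers. Only the flat option names appear in this PR's diffs; the nested shape and the `minChunkSize` value below are my assumptions about the old layout:

```js
// Old, nested layout (assumed shape, for illustration only):
const before = {
  chunker: 'rabin',
  chunkerOptions: {
    maxChunkSize: 262144,
    avgChunkSize: 262144,
    minChunkSize: 16384 // hypothetical value
  }
}

// New, flattened layout (option names from the README diff above):
const after = {
  chunker: 'rabin',
  maxChunkSize: 262144,
  avgChunkSize: 262144,
  minChunkSize: 16384 // hypothetical value
}
```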

@alanshaw (Contributor) commented:

Got any perf stats we can ogle at?

achingbrain added a commit to ipfs/js-ipfs that referenced this pull request on Nov 27, 2019:

Adds two new arguments to the cli & http interface:

`--file-import-concurrency` and `--block-write-concurrency`

See ipfs-inactive/js-ipfs-unixfs-importer#41 for further discussion.
@achingbrain (Collaborator, Author) commented Nov 27, 2019:

> Got any perf stats we can ogle at?

statto

Data generation: 1000 files of 1kB-1MB in size, filled with random(ish) data:

```bash
$ cat gen-data.sh
#!/bin/bash

mkdir -p data

for n in {1..1000}; do
    dd if=/dev/urandom of=data/file-$( printf %03d "$n" ).bin bs=1 count=$( shuf -i 1024-1048576 -n 1 )
done
```

Generated data:

```console
$ du -hs ./data
525M	./data
```

go-IPFS (IDK why it thinks the data folder is smaller than it is):

```console
$ time ipfs add -r ./data
...
added QmSgH3zFNiMNyH2es2TrkZmjPTETVXvdGQHkA1hPS2n6sL data
 501.99 MiB / 501.99 MiB [============================] 100.00%
real	0m55.654s
user	0m4.931s
sys	0m5.502s
```

js-IPFS master:

```console
$ time jsipfs add -r ./data
added QmSgH3zFNiMNyH2es2TrkZmjPTETVXvdGQHkA1hPS2n6sL data

real	0m52.505s
user	0m13.482s
sys	0m5.760s
```

File & block concurrency = 1 (e.g. how it was before this PR):

$ time jsipfs add -r --file-import-concurrency=1 --block-write-concurrency=1 ./data
...
added QmSgH3zFNiMNyH2es2TrkZmjPTETVXvdGQHkA1hPS2n6sL data

real	0m52.650s
user	0m14.849s
sys	0m5.410s

Proposed default values:

```console
$ time jsipfs add -r --file-import-concurrency=50 --block-write-concurrency=10 ./data
...
added QmSgH3zFNiMNyH2es2TrkZmjPTETVXvdGQHkA1hPS2n6sL data

real	0m25.814s
user	0m11.552s
sys	0m5.262s
```

@achingbrain merged commit 68ac8cc into master on Nov 27, 2019
@achingbrain deleted the concurrent-file-import branch on Nov 27, 2019
achingbrain added a commit to ipfs/js-ipfs that referenced this pull request on Dec 11, 2019:

* perf: expose importer concurrency controls when adding files

Adds two new arguments to the cli & http interface:

`--file-import-concurrency` and `--block-write-concurrency`

See ipfs-inactive/js-ipfs-unixfs-importer#41 for further discussion.

* chore: update deps, remove unused

* fix: pass args from http

* fix: hard code file concurrency for http requests

* fix: fix up chunker parsing tests

* chore: use ipfs-http-client branch temporarily

* chore: increase bundlesize by 1kb

* chore: update dep

* chore: increase bundle size again