Implement parallelism for the distributedLoad command #9895

Merged
merged 6 commits into from
Sep 20, 2019

Conversation

liuhongtong
Contributor

@liuhongtong liuhongtong commented Sep 16, 2019

Previously, distributedLoad traversed the specified path, submitted a load job for a single file to the job master, and waited for that job to complete before submitting the next one. This serial process could make performance unacceptable.
Now distributedLoad traverses the specified path, submits a batch of file load jobs to the job master, and submits a new job whenever one completes, so the loads run concurrently.

Signed-off-by: liuhongtong [email protected]

Fix #9791
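
For illustration, a minimal sketch of the batch-then-refill submission pattern described above. This is not the code from this PR: `SlidingWindowLoad`, `loadAll`, and `loadFile` are hypothetical names, and a local thread pool stands in for the job master. It only shows the idea of keeping a bounded number of load jobs active and submitting a new one each time one completes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

final class SlidingWindowLoad {
  // Demonstration-only counters that record the peak number of concurrent jobs.
  static final AtomicInteger RUNNING = new AtomicInteger();
  static final AtomicInteger MAX_RUNNING = new AtomicInteger();

  /** Hypothetical stand-in for "submit a load job for one file and wait for it". */
  static String loadFile(String path) throws InterruptedException {
    int now = RUNNING.incrementAndGet();
    MAX_RUNNING.accumulateAndGet(now, Math::max);
    Thread.sleep(10); // pretend the load takes some time
    RUNNING.decrementAndGet();
    return path;
  }

  /** Loads all paths while keeping at most maxActiveJobs jobs in flight. */
  static List<String> loadAll(List<String> paths, int maxActiveJobs)
      throws InterruptedException, ExecutionException {
    ExecutorService pool = Executors.newFixedThreadPool(maxActiveJobs);
    CompletionService<String> done = new ExecutorCompletionService<>(pool);
    List<String> completed = new ArrayList<>();
    Iterator<String> pending = paths.iterator();
    int active = 0;
    try {
      // Fill the initial batch.
      while (active < maxActiveJobs && pending.hasNext()) {
        String p = pending.next();
        done.submit(() -> loadFile(p));
        active++;
      }
      // Each time a job completes, submit the next pending file, if any.
      while (active > 0) {
        completed.add(done.take().get());
        active--;
        if (pending.hasNext()) {
          String p = pending.next();
          done.submit(() -> loadFile(p));
          active++;
        }
      }
    } finally {
      pool.shutdown();
    }
    return completed;
  }

  public static void main(String[] args) throws Exception {
    List<String> paths = Arrays.asList("/a", "/b", "/c", "/d", "/e", "/f");
    List<String> loaded = loadAll(paths, 3);
    System.out.println(loaded.size() + " files loaded, peak concurrency " + MAX_RUNNING.get());
  }
}
```

Compared with the serial approach, the total time in this sketch is bounded by the slowest chain of jobs in the window rather than by the sum of all job durations.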

@alluxio-bot
Contributor

Automated checks report:

  • AmplabJenkins build check: PENDING
    • We were not able to detect AmplabJenkins test results on this PR. Status will update when testing completes.
  • Commits associated with Github account: PASS
  • PR title follows the conventions: PASS

Some checks failed. Please fix the reported issues and reply 'alluxio-bot, check this please' to re-run checks.

@apc999 apc999 changed the title from "Optimize distributedLoad #9791" to "Optimize performance of distributedLoad" Sep 16, 2019
@apc999 apc999 self-requested a review September 16, 2019 17:46
@gpang
Contributor

gpang commented Sep 16, 2019

@liuhongtong Thanks for this improvement! Could you provide some more information in the PR description? Thanks!

@AmplabJenkins

Merged build finished. Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Alluxio-Pull-Request-Builder/5532/

@apc999 apc999 requested a review from calvinjia September 16, 2019 23:01
Contributor

@calvinjia calvinjia left a comment

@liuhongtong Thanks for the substantial improvement to this call. Do you mind taking a look at PersistCommand and seeing if that pattern can apply to distributedload?

Unrelated to this PR - We have a few implementations of multi-threaded CLI and it would be good to consolidate them into a single framework.

.hasArg(true)
.desc("number of replicas to have for each block of the loaded file")
.build();
public static final Option THREAD_OPTION =
Contributor

nit: THREADS_OPTION to be consistent

Contributor Author

Do you mean a constant value, as in the C language?

@@ -35,7 +43,28 @@
*/
@ThreadSafe
public final class DistributedLoadCommand extends AbstractFileSystemCommand {
private static final String REPLICATION = "replication";
public static final Option REPLICATION =
Contributor

nit: REPLICATION_OPTION to be consistent

@liuhongtong
Contributor Author

@gpang Previously, distributedLoad traversed the specified path, submitted a load job for a single file to the job master, and waited for that job to complete before submitting the next one. This serial process could make performance unacceptable.
Now distributedLoad traverses the specified path, submits a batch of file load jobs to the job master, and submits a new job whenever one completes, so the loads run concurrently.

Signed-off-by: liuhongtong <[email protected]>
@liuhongtong
Contributor Author

liuhongtong commented Sep 17, 2019

Test setup: loading 5 TiB of data from a large HDFS cluster with 30 Alluxio workers and job workers, with no other I/O in Alluxio during the test.
NIC: 10000 Mb/s

load test

date; alluxio fs load /ns1_lht_test_alluxio; date
Sun Aug 25 17:13:07 CST 2019
Aug 26, 2019 7:18:42 AM
about 14h 5m

distributedLoad test

date; alluxio fs distributedLoad /ns1_lht_test_alluxio; date
Mon Aug 26 08:57:36 CST 2019
Mon Aug 26 18:07:15 CST 2019
about 9h

new distributedLoad test

default (128 threads)
time alluxio fs distributedLoad /ns1_lht_test_alluxio/
real	5m11.984s

256 threads
time alluxio fs distributedLoad /ns1_lht_test_alluxio/ -thread 256
real	4m29.423s

512 threads
time alluxio fs distributedLoad /ns1_lht_test_alluxio/ -thread 512
real	4m7.610s
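
For a rough sense of scale, the timings above can be turned into speedup ratios. The sketch below is only arithmetic on the reported numbers; it assumes the start and end timestamps of each run are in the same time zone, and the class and method names are hypothetical.

```java
import java.time.Duration;
import java.time.LocalDateTime;

final class LoadSpeedup {
  /** Wall-clock seconds between two timestamps (assumed to share a time zone). */
  static double seconds(LocalDateTime start, LocalDateTime end) {
    return Duration.between(start, end).toMillis() / 1000.0;
  }

  /** How many times faster the new run is than the baseline run. */
  static double speedup(double baselineSeconds, double newSeconds) {
    return baselineSeconds / newSeconds;
  }

  public static void main(String[] args) {
    // Old serial distributedLoad: Aug 26 08:57:36 to Aug 26 18:07:15.
    double serial = seconds(
        LocalDateTime.of(2019, 8, 26, 8, 57, 36),
        LocalDateTime.of(2019, 8, 26, 18, 7, 15));

    // New distributedLoad, "real" times reported by the time command.
    double threads128 = 5 * 60 + 11.984;
    double threads256 = 4 * 60 + 29.423;
    double threads512 = 4 * 60 + 7.610;

    System.out.printf("serial run: %.0f s%n", serial);
    System.out.printf("128 threads: %.1fx faster%n", speedup(serial, threads128));
    System.out.printf("256 threads: %.1fx faster%n", speedup(serial, threads256));
    System.out.printf("512 threads: %.1fx faster%n", speedup(serial, threads512));
  }
}
```

On these numbers the new command is on the order of 100 times faster than the serial one, while going from 128 to 512 threads improves the time by a much smaller factor.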

@AmplabJenkins

Merged build finished. Test PASSed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Alluxio-Pull-Request-Builder/5554/

@liuhongtong
Contributor Author

@apc999 @calvinjia updated. PTAL. Thanks.

@liuhongtong liuhongtong changed the title from "Optimize performance of distributedLoad" to "Optimize performance of distributedLoad #9791" Sep 17, 2019
@alluxio-bot
Contributor

Automated checks report:

  • AmplabJenkins build check: PASS
  • Commits associated with Github account: PASS
  • PR title follows the conventions: PASS

All checks passed!

@gpang gpang changed the title from "Optimize performance of distributedLoad #9791" to "Implement parallelism for the distributedLoad command" Sep 17, 2019
@calvinjia
Contributor

@liuhongtong Could you open a github issue for consolidating optionally multi-threaded CLIs to a general framework?

@liuhongtong
Contributor Author

@calvinjia OK. I would like to open a new issue and consolidate a general framework for multi-threaded CLIs.

Signed-off-by: liuhongtong <[email protected]>
@liuhongtong
Contributor Author

New issue: #9905

@AmplabJenkins

Merged build finished. Test PASSed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Alluxio-Pull-Request-Builder/5576/

Signed-off-by: liuhongtong <[email protected]>
@AmplabJenkins

Merged build finished. Test PASSed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Alluxio-Pull-Request-Builder/5577/

@AmplabJenkins

Merged build finished. Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Alluxio-Pull-Request-Builder/5582/

@AmplabJenkins

Merged build finished. Test PASSed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Alluxio-Pull-Request-Builder/5601/

@apc999 apc999 self-requested a review September 20, 2019 06:58
@apc999
Contributor

apc999 commented Sep 20, 2019

alluxio-bot, merge this please

@alluxio-bot alluxio-bot merged commit f5b70fd into Alluxio:master Sep 20, 2019
@liuhongtong
Contributor Author

@apc999 Thanks for merging this PR.

bf8086 pushed a commit to bf8086/alluxio that referenced this pull request Oct 1, 2019
Previously, distributedLoad traversed the specified path, submitted a
load job for a single file to the job master, and waited for that job
to complete before submitting the next one. This serial process could
make performance unacceptable.
Now distributedLoad traverses the specified path, submits a batch of
file load jobs to the job master, and submits a new job whenever one
completes, so the loads run concurrently.

Signed-off-by: liuhongtong <[email protected]>

Fix Alluxio#9791

pr-link: Alluxio#9895
change-id: cid-f7ece37fda7dd3bd2a6c38783b77edc410e22fab
apc999 pushed a commit to apc999/alluxio that referenced this pull request Oct 2, 2019
apc999 pushed a commit to apc999/alluxio that referenced this pull request Oct 2, 2019
apc999 pushed a commit to apc999/alluxio that referenced this pull request Oct 23, 2019
Development

Successfully merging this pull request may close these issues.

Low performance of distributedLoad
7 participants