From 20ab141bf8a5c526a82af1d5dfd7c03194dc83b6 Mon Sep 17 00:00:00 2001
From: Batuhan Taskaya
Date: Fri, 29 Jan 2021 11:17:55 +0300
Subject: [PATCH 1/5] get, get-url: add --jobs option

---
 content/docs/command-reference/get-url.md | 8 +++++++-
 content/docs/command-reference/get.md     | 9 ++++++++-
 2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/content/docs/command-reference/get-url.md b/content/docs/command-reference/get-url.md
index 3fc3fe901c..f7752a379b 100644
--- a/content/docs/command-reference/get-url.md
+++ b/content/docs/command-reference/get-url.md
@@ -9,7 +9,7 @@ Download a file or directory from a supported URL (for example `s3://`,
 ## Synopsis

 ```usage
-usage: dvc get-url [-h] [-q | -v] url [out]
+usage: dvc get-url [-h] [-q | -v] [-j <number>] url [out]

 positional arguments:
   url         (See supported URLs in the description.)
@@ -72,6 +72,12 @@ $ wget https://example.com/path/to/data.csv

 ## Options

+- `-j <number>`, `--jobs <number>` - number of threads to run simultaneously to
+  handle the downloading of files from the remote. The default value is
+  `4 * cpu_count()`. For SSH remotes, the default is just `4`. Using more jobs
+  may improve the total download speed if a combination of small and large files
+  are being fetched.
+
 - `-h`, `--help` - prints the usage/help message, and exit.

 - `-q`, `--quiet` - do not write anything to standard output. Exit with 0 if no
diff --git a/content/docs/command-reference/get.md b/content/docs/command-reference/get.md
index 86bd1ce846..24ceb216b9 100644
--- a/content/docs/command-reference/get.md
+++ b/content/docs/command-reference/get.md
@@ -8,7 +8,8 @@ directory.
 ## Synopsis

 ```usage
-usage: dvc get [-h] [-q | -v] [-o <path>] [--rev <commit>] url path
+usage: dvc get [-h] [-q | -v] [-o <path>] [--rev <commit>] [-j <number>]
+               url path

 positional arguments:
   url         Location of DVC or Git repository to download from
@@ -65,6 +66,12 @@ name.
   download the file or directory from. The latest commit in `master` (tip of the
   default branch) is used by default when this option is not specified.

+- `-j <number>`, `--jobs <number>` - number of threads to run simultaneously to
+  handle the downloading of files from the remote. The default value is
+  `4 * cpu_count()`. For SSH remotes, the default is just `4`. Using more jobs
+  may improve the total download speed if a combination of small and large files
+  are being fetched.
+
 - `--show-url` - instead of downloading the file or directory, just print the
   storage location (URL) of the target data. If `path` is a Git-tracked file,
   this option is ignored.

From 74e71aad6780da6dd6a45f61e6746602db724061 Mon Sep 17 00:00:00 2001
From: Jorge Orpinel
Date: Tue, 2 Feb 2021 03:39:31 -0600
Subject: [PATCH 2/5] cmd: consistent --jobs desc.

---
 content/docs/command-reference/fetch.md   |  4 ++--
 content/docs/command-reference/get-url.md | 10 ++++------
 content/docs/command-reference/get.md     |  9 ++++-----
 content/docs/command-reference/import.md  |  9 ++++-----
 content/docs/command-reference/pull.md    |  4 ++--
 content/docs/command-reference/push.md    |  3 +--
 content/docs/command-reference/status.md  | 12 ++++++------
 7 files changed, 23 insertions(+), 28 deletions(-)

diff --git a/content/docs/command-reference/fetch.md b/content/docs/command-reference/fetch.md
index 5a7a4575c7..4fb9ebed6b 100644
--- a/content/docs/command-reference/fetch.md
+++ b/content/docs/command-reference/fetch.md
@@ -86,8 +86,8 @@ specific one is given with `--remote`.
 - `-j <number>`, `--jobs <number>` - parallelism level for DVC to download data
   from remote storage. The default value is `4 * cpu_count()`. For SSH remotes,
   the default is `4`. Note that the default value can be set using the `jobs`
-  config option with `dvc remote modify`. Using more jobs may improve the
-  overall transfer speed.
+  config option with `dvc remote modify`. Using more jobs may speed up the
+  operation.

 - `-a`, `--all-branches` - fetch cache for all Git branches instead of just the
   current workspace. This means DVC may download files needed to reproduce
diff --git a/content/docs/command-reference/get-url.md b/content/docs/command-reference/get-url.md
index f7752a379b..4151dbe530 100644
--- a/content/docs/command-reference/get-url.md
+++ b/content/docs/command-reference/get-url.md
@@ -31,7 +31,7 @@ while `out` can be used to specify the directory and/or file name desired for
 the downloaded data. If an existing directory is specified, then the file or
 directory will be placed inside.

-DVC supports several types of (local or) remote locations (protocols):
+DVC supports several types of (local or) remote data sources (protocols):

 | Type      | Description                  | `url` format example                          |
 | --------- | ---------------------------- | --------------------------------------------- |
@@ -72,11 +72,9 @@ $ wget https://example.com/path/to/data.csv

 ## Options

-- `-j <number>`, `--jobs <number>` - number of threads to run simultaneously to
-  handle the downloading of files from the remote. The default value is
-  `4 * cpu_count()`. For SSH remotes, the default is just `4`. Using more jobs
-  may improve the total download speed if a combination of small and large files
-  are being fetched.
+- `-j <number>`, `--jobs <number>` - parallelism level for DVC to download data
+  from the source. The default value is `4 * cpu_count()`. For SSH remotes, the
+  default is `4`. Using more jobs may speed up the operation.

 - `-h`, `--help` - prints the usage/help message, and exit.

diff --git a/content/docs/command-reference/get.md b/content/docs/command-reference/get.md
index 24ceb216b9..ed260c8476 100644
--- a/content/docs/command-reference/get.md
+++ b/content/docs/command-reference/get.md
@@ -66,11 +66,10 @@ name.
   download the file or directory from. The latest commit in `master` (tip of the
   default branch) is used by default when this option is not specified.

-- `-j <number>`, `--jobs <number>` - number of threads to run simultaneously to
-  handle the downloading of files from the remote. The default value is
-  `4 * cpu_count()`. For SSH remotes, the default is just `4`. Using more jobs
-  may improve the total download speed if a combination of small and large files
-  are being fetched.
+- `-j <number>`, `--jobs <number>` - parallelism level for DVC to download data
+  from the remote. The default value is `4 * cpu_count()`. For SSH remotes, the
+  default is `4`. Note that the default value can be set using the `jobs` config
+  option with `dvc remote modify`. Using more jobs may speed up the operation.

 - `--show-url` - instead of downloading the file or directory, just print the
   storage location (URL) of the target data. If `path` is a Git-tracked file,
diff --git a/content/docs/command-reference/import.md b/content/docs/command-reference/import.md
index 44134b3f57..7ab23de945 100644
--- a/content/docs/command-reference/import.md
+++ b/content/docs/command-reference/import.md
@@ -106,11 +106,10 @@ repo at `url`) are not supported.
   data already exist locally and you want to "DVCfy" this state of the project
   (see also `dvc commit`).

-- `-j <number>`, `--jobs <number>` - number of threads to run simultaneously to
-  handle the downloading of files from the remote. The default value is
-  `4 * cpu_count()`. For SSH remotes, the default is just `4`. Using more jobs
-  may improve the total download speed if a combination of small and large files
-  are being fetched.
+- `-j <number>`, `--jobs <number>` - parallelism level for DVC to download data
+  from the remote. The default value is `4 * cpu_count()`. For SSH remotes, the
+  default is `4`. Note that the default value can be set using the `jobs` config
+  option with `dvc remote modify`. Using more jobs may speed up the operation.

 - `--desc <text>` - user description of the data (optional). This doesn't affect
   any DVC operations.
diff --git a/content/docs/command-reference/pull.md b/content/docs/command-reference/pull.md
index 42d46f905b..5f8254324b 100644
--- a/content/docs/command-reference/pull.md
+++ b/content/docs/command-reference/pull.md
@@ -118,8 +118,8 @@ used to see what files `dvc pull` would download.
 - `-j <number>`, `--jobs <number>` - parallelism level for DVC to download data
   from remote storage. The default value is `4 * cpu_count()`. For SSH remotes,
   the default is `4`. Note that the default value can be set using the `jobs`
-  config option with `dvc remote modify`. Using more jobs may improve the
-  overall transfer speed.
+  config option with `dvc remote modify`. Using more jobs may speed up the
+  operation.

 - `-h`, `--help` - prints the usage/help message, and exit.

diff --git a/content/docs/command-reference/push.md b/content/docs/command-reference/push.md
index 67a1c9dbf2..6f91710811 100644
--- a/content/docs/command-reference/push.md
+++ b/content/docs/command-reference/push.md
@@ -94,8 +94,7 @@ in the cache (compared to the default remote.) It can be used to see what files
 - `-j <number>`, `--jobs <number>` - parallelism level for DVC to upload data to
   remote storage. The default value is `4 * cpu_count()`. For SSH remotes, the
   default is `4`. Note that the default value can be set using the `jobs` config
-  option with `dvc remote modify`. Using more jobs may improve the overall
-  transfer speed.
+  option with `dvc remote modify`. Using more jobs may speed up the operation.

 - `--glob` - allows pushing files and directories that match the
   [pattern](https://docs.python.org/3/library/glob.html) specified in `targets`.
diff --git a/content/docs/command-reference/status.md b/content/docs/command-reference/status.md
index 3041052029..3ec5e4d09b 100644
--- a/content/docs/command-reference/status.md
+++ b/content/docs/command-reference/status.md
@@ -140,12 +140,12 @@ that.
 - `--show-json` - prints the command's output in easily parsable JSON format,
   instead of a human-readable table.

-- `-j <number>`, `--jobs <number>` - parallelism level for DVC to retrieve
-  information from remote storage. This only applies when the `--cloud` option
-  is used, or a `--remote` is given. The default value is `4 * cpu_count()`. For
-  SSH remotes, the default is `4`. Note that the default value can be set using
-  the `jobs` config option with `dvc remote modify`. Using more jobs may speed
-  up the operation.
+- `-j <number>`, `--jobs <number>` - parallelism level for DVC to access data
+  from remote storage. This only applies when the `--cloud` option is used, or a
+  `--remote` is given. The default value is `4 * cpu_count()`. For SSH remotes,
+  the default is `4`. Note that the default value can be set using the `jobs`
+  config option with `dvc remote modify`. Using more jobs may speed up the
+  operation.

 - `-h`, `--help` - prints the usage/help message, and exit.


From 619e36a9e30ccae6d2c7d6bc449710accc894120 Mon Sep 17 00:00:00 2001
From: Jorge Orpinel
Date: Tue, 2 Feb 2021 04:21:46 -0600
Subject: [PATCH 3/5] cmd: add import-url --jobs

per https://github.com/iterative/dvc/issues/5267#issuecomment-771757774
---
 content/docs/command-reference/import-url.md | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/content/docs/command-reference/import-url.md b/content/docs/command-reference/import-url.md
index 58564f7276..a6d6eb8894 100644
--- a/content/docs/command-reference/import-url.md
+++ b/content/docs/command-reference/import-url.md
@@ -10,8 +10,8 @@ import `.dvc` file is created).
 ## Synopsis

 ```usage
-usage: dvc import-url [-h] [-q | -v] [--file <filename>] [--no-exec]
-                      [--desc <text>]
+usage: dvc import-url [-h] [-q | -v] [-j <number>] [--file <filename>]
+                      [--no-exec] [--desc <text>]
                       url [out]

 positional arguments:
@@ -132,6 +132,10 @@ $ dvc run -n download_data \
   already exist locally and you want to "DVCfy" this state of the project (see
   also `dvc commit`).

+- `-j <number>`, `--jobs <number>` - parallelism level for DVC to download data
+  from the source. The default value is `4 * cpu_count()`. For SSH remotes, the
+  default is `4`. Using more jobs may speed up the operation.
+
 - `--desc <text>` - user description of the data (optional). This doesn't affect
   any DVC operations.


From 1bea5287da30100f7ef56c800c6923047f2813dc Mon Sep 17 00:00:00 2001
From: Jorge Orpinel
Date: Wed, 3 Feb 2021 04:09:00 -0600
Subject: [PATCH 4/5] cmd: roll back sync command changes

per https://github.com/iterative/dvc.org/pull/2122#pullrequestreview-581733135
---
 content/docs/command-reference/fetch.md  |  2 +-
 content/docs/command-reference/pull.md   |  2 +-
 content/docs/command-reference/push.md   |  2 +-
 content/docs/command-reference/status.md | 12 ++++++------
 4 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/content/docs/command-reference/fetch.md b/content/docs/command-reference/fetch.md
index 4fb9ebed6b..d430600f18 100644
--- a/content/docs/command-reference/fetch.md
+++ b/content/docs/command-reference/fetch.md
@@ -86,7 +86,7 @@ specific one is given with `--remote`.
 - `-j <number>`, `--jobs <number>` - parallelism level for DVC to download data
   from remote storage. The default value is `4 * cpu_count()`. For SSH remotes,
   the default is `4`. Note that the default value can be set using the `jobs`
-  config option with `dvc remote modify`. Using more jobs may speed up the
+  config option with `dvc remote modify`. Using more jobs may improve the
   operation.

 - `-a`, `--all-branches` - fetch cache for all Git branches instead of just the
diff --git a/content/docs/command-reference/pull.md b/content/docs/command-reference/pull.md
index 5f8254324b..013f511d87 100644
--- a/content/docs/command-reference/pull.md
+++ b/content/docs/command-reference/pull.md
@@ -118,7 +118,7 @@ used to see what files `dvc pull` would download.
 - `-j <number>`, `--jobs <number>` - parallelism level for DVC to download data
   from remote storage. The default value is `4 * cpu_count()`. For SSH remotes,
   the default is `4`. Note that the default value can be set using the `jobs`
-  config option with `dvc remote modify`. Using more jobs may speed up the
+  config option with `dvc remote modify`. Using more jobs may improve the
   operation.

 - `-h`, `--help` - prints the usage/help message, and exit.
diff --git a/content/docs/command-reference/push.md b/content/docs/command-reference/push.md
index 6f91710811..cc925f7ec4 100644
--- a/content/docs/command-reference/push.md
+++ b/content/docs/command-reference/push.md
@@ -94,7 +94,7 @@ in the cache (compared to the default remote.) It can be used to see what files
 - `-j <number>`, `--jobs <number>` - parallelism level for DVC to upload data to
   remote storage. The default value is `4 * cpu_count()`. For SSH remotes, the
   default is `4`. Note that the default value can be set using the `jobs` config
-  option with `dvc remote modify`. Using more jobs may speed up the operation.
+  option with `dvc remote modify`. Using more jobs may improve the operation.

 - `--glob` - allows pushing files and directories that match the
   [pattern](https://docs.python.org/3/library/glob.html) specified in `targets`.
diff --git a/content/docs/command-reference/status.md b/content/docs/command-reference/status.md
index 7de76d6cb4..b216a020bf 100644
--- a/content/docs/command-reference/status.md
+++ b/content/docs/command-reference/status.md
@@ -140,12 +140,12 @@ that.
 - `--show-json` - prints the command's output in easily parsable JSON format,
   instead of a human-readable table.

-- `-j <number>`, `--jobs <number>` - parallelism level for DVC to access data
-  from remote storage. This only applies when the `--cloud` option is used, or a
-  `--remote` is given. The default value is `4 * cpu_count()`. For SSH remotes,
-  the default is `4`. Note that the default value can be set using the `jobs`
-  config option with `dvc remote modify`. Using more jobs may speed up the
-  operation.
+- `-j <number>`, `--jobs <number>` - parallelism level for DVC to retrieve
+  information from remote storage. This only applies when the `--cloud` option
+  is used, or a `--remote` is given. The default value is `4 * cpu_count()`. For
+  SSH remotes, the default is `4`. Note that the default value can be set using
+  the `jobs` config option with `dvc remote modify`. Using more jobs may speed
+  up the operation.

 - `-h`, `--help` - prints the usage/help message, and exit.


From 7a0514a0869e7c7bd853d3400f0b73d7e6cbd0c2 Mon Sep 17 00:00:00 2001
From: Batuhan Taskaya
Date: Fri, 5 Feb 2021 13:40:57 +0300
Subject: [PATCH 5/5] -j config

---
 content/docs/command-reference/get-url.md    | 3 ++-
 content/docs/command-reference/get.md        | 5 +++--
 content/docs/command-reference/import-url.md | 3 ++-
 content/docs/command-reference/import.md     | 5 +++--
 4 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/content/docs/command-reference/get-url.md b/content/docs/command-reference/get-url.md
index 4151dbe530..a02f819018 100644
--- a/content/docs/command-reference/get-url.md
+++ b/content/docs/command-reference/get-url.md
@@ -74,7 +74,8 @@ $ wget https://example.com/path/to/data.csv

 - `-j <number>`, `--jobs <number>` - parallelism level for DVC to download data
   from the source. The default value is `4 * cpu_count()`. For SSH remotes, the
-  default is `4`. Using more jobs may speed up the operation.
+  default is `4`. Note that the default value can be set using the `jobs` config
+  option with `dvc remote modify`. Using more jobs may speed up the operation.

 - `-h`, `--help` - prints the usage/help message, and exit.

diff --git a/content/docs/command-reference/get.md b/content/docs/command-reference/get.md
index ed260c8476..05312cca53 100644
--- a/content/docs/command-reference/get.md
+++ b/content/docs/command-reference/get.md
@@ -68,8 +68,9 @@ name.

 - `-j <number>`, `--jobs <number>` - parallelism level for DVC to download data
   from the remote. The default value is `4 * cpu_count()`. For SSH remotes, the
-  default is `4`. Note that the default value can be set using the `jobs` config
-  option with `dvc remote modify`. Using more jobs may speed up the operation.
+  default is `4`. Using more jobs may speed up the operation. Note that the
+  default value can be set in the source repo using the `jobs` config option of
+  `dvc remote modify`.

 - `--show-url` - instead of downloading the file or directory, just print the
   storage location (URL) of the target data. If `path` is a Git-tracked file,
diff --git a/content/docs/command-reference/import-url.md b/content/docs/command-reference/import-url.md
index 749720b5a1..54b1a59c52 100644
--- a/content/docs/command-reference/import-url.md
+++ b/content/docs/command-reference/import-url.md
@@ -133,7 +133,8 @@ $ dvc run -n download_data \

 - `-j <number>`, `--jobs <number>` - parallelism level for DVC to download data
   from the source. The default value is `4 * cpu_count()`. For SSH remotes, the
-  default is `4`. Using more jobs may speed up the operation.
+  default is `4`. Note that the default value can be set using the `jobs` config
+  option with `dvc remote modify`. Using more jobs may speed up the operation.

 - `--desc <text>` - user description of the data (optional). This doesn't affect
   any DVC operations.
diff --git a/content/docs/command-reference/import.md b/content/docs/command-reference/import.md
index 9a00db8dba..b16a14b367 100644
--- a/content/docs/command-reference/import.md
+++ b/content/docs/command-reference/import.md
@@ -107,8 +107,9 @@ repo at `url`) are not supported.

 - `-j <number>`, `--jobs <number>` - parallelism level for DVC to download data
   from the remote. The default value is `4 * cpu_count()`. For SSH remotes, the
-  default is `4`. Note that the default value can be set using the `jobs` config
-  option with `dvc remote modify`. Using more jobs may speed up the operation.
+  default is `4`. Using more jobs may speed up the operation. Note that the
+  default value can be set in the source repo using the `jobs` config option of
+  `dvc remote modify`.

 - `--desc <text>` - user description of the data (optional). This doesn't affect
   any DVC operations.