get, get-url: add --jobs option #2122

Merged
merged 6 commits into from
Feb 5, 2021
Changes from 5 commits
8 changes: 6 additions & 2 deletions content/docs/command-reference/get-url.md
@@ -9,7 +9,7 @@ Download a file or directory from a supported URL (for example `s3://`,
## Synopsis

```usage
usage: dvc get-url [-h] [-q | -v] url [out]
usage: dvc get-url [-h] [-q | -v] [-j <number>] url [out]

positional arguments:
url (See supported URLs in the description.)
@@ -31,7 +31,7 @@ while `out` can be used to specify the directory and/or file name desired for
the downloaded data. If an existing directory is specified, then the file or
directory will be placed inside.

DVC supports several types of (local or) remote locations (protocols):
DVC supports several types of (local or) remote data sources (protocols):

| Type | Description | `url` format example |
| --------- | ---------------------------- | --------------------------------------------- |
@@ -72,6 +72,10 @@ $ wget https://example.com/path/to/data.csv

## Options

- `-j <number>`, `--jobs <number>` - parallelism level for DVC to download data
from the source. The default value is `4 * cpu_count()`. For SSH remotes, the
default is `4`. Using more jobs may speed up the operation.

- `-h`, `--help` - prints the usage/help message, and exit.

- `-q`, `--quiet` - do not write anything to standard output. Exit with 0 if no
8 changes: 7 additions & 1 deletion content/docs/command-reference/get.md
@@ -8,7 +8,8 @@ directory.
## Synopsis

```usage
usage: dvc get [-h] [-q | -v] [-o <path>] [--rev <commit>] url path
usage: dvc get [-h] [-q | -v] [-o <path>] [--rev <commit>] [-j <number>]
url path

positional arguments:
url Location of DVC or Git repository to download from
@@ -65,6 +66,11 @@ name.
download the file or directory from. The latest commit in `master` (tip of the
default branch) is used by default when this option is not specified.

- `-j <number>`, `--jobs <number>` - parallelism level for DVC to download data
from the remote. The default value is `4 * cpu_count()`. For SSH remotes, the
default is `4`. Note that the default value can be set using the `jobs` config
option with `dvc remote modify`. Using more jobs may speed up the operation.
Contributor @jorgeorpinel, Feb 2, 2021

> Note that the default value can be set using the `jobs` config option with `dvc remote modify`.

Does this apply to `get` (and `import`)? I.e. does it respect the `.dvc/config` settings of the source repo?

Contributor Author

> I.e. does it respect the `.dvc/config` settings of the source repo?

I believe it should, though I might be mistaken on this. cc: @efiop

Contributor

In the source, yes; in the host, no. That is, if you run `dvc get` inside some repo, that repo's configs won't be used. This is consistent with all of the related commands (`list`/`import`/`get`). There are a few separate tasks for that, and we are almost ready to get back to them, as most of the necessary pre-requisites have already been merged.

Member

@efiop what will be the behavior after those tasks are addressed? I'm trying to understand if we need some additional clarification here (what config is being used).

Member

@efiop ping ^^ I'm not sure I understood the comment "There are a few separate tasks for that, we are almost ready to get back to them, as most of the necessary pre-requisites have already been merged." Could you clarify it a bit, please?

Contributor

> what will be the behavior after those tasks are addressed?

I don't know right now. Maybe we'll try to use host configs too, or maybe we'll require an explicit `--config` flag or config flags.

I think the confusion here is that the `dvc remote modify` mentioned refers to the source repo, not the host repo.

Contributor

Yep, that's confusing. Maybe:

Suggested change
default is `4`. Note that the default value can be set using the `jobs` config
option with `dvc remote modify`. Using more jobs may speed up the operation.
default is `4`. Using more jobs may speed up the operation. Note that the
default value can be set in the source repo using the `jobs` config option of
`dvc remote modify`.

Or even make that last sentence an independent block quote (at the bullet indentation level).


- `--show-url` - instead of downloading the file or directory, just print the
storage location (URL) of the target data. If `path` is a Git-tracked file,
this option is ignored.
8 changes: 6 additions & 2 deletions content/docs/command-reference/import-url.md
@@ -10,8 +10,8 @@ import `.dvc` file is created).
## Synopsis

```usage
usage: dvc import-url [-h] [-q | -v] [--file <filename>] [--no-exec]
[--desc <text>]
usage: dvc import-url [-h] [-q | -v] [-j <number>] [--file <filename>]
[--no-exec] [--desc <text>]
url [out]

positional arguments:
@@ -131,6 +131,10 @@ $ dvc run -n download_data \
finish the operation(s)); or if the target data already exist locally and you
want to "DVCfy" this state of the project (see also `dvc commit`).

- `-j <number>`, `--jobs <number>` - parallelism level for DVC to download data
Member

@isidentical just to make sure: it's not affected by any configs at all, right? (I understand that this is not about remotes in its default mode, but we also have `core.jobs` or something, if I remember it right.)

Also, what happens if we use the `dvc import-url remote://name` syntax and the remote specifies `jobs` in its config? Is that a problem?

Contributor

If I understood #2122 (comment) correctly, the configuration of the source repo is used.

Contributor Author

For `import-url`, it goes with `--jobs` => `remote.jobs` => `core.jobs` => `BaseTree.JOBS`

Contributor

Also note that it is not `BaseTree.JOBS` but more like `Tree.JOBS`, as the defaults differ between clouds: for SSH it is `4`; for the rest it is `4 * NCPU`.
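
To make the lookup order described in this thread concrete, here is a minimal sketch of a "first value that is set wins" resolution. It is not DVC's actual code: the function names `backend_default_jobs` and `resolve_jobs`, their parameters, and the `scheme` strings are hypothetical and chosen only for illustration. The per-backend defaults (4 for SSH, `4 * NCPU` otherwise) are the values quoted above.

```python
import os


def backend_default_jobs(scheme):
    # Hypothetical stand-in for a per-backend default such as Tree.JOBS:
    # 4 for SSH, 4 * NCPU for everything else (values quoted in this thread).
    if scheme == "ssh":
        return 4
    return 4 * (os.cpu_count() or 1)


def resolve_jobs(cli_jobs=None, remote_jobs=None, core_jobs=None, scheme="s3"):
    # Walk the chain --jobs => remote.jobs => core.jobs => backend default
    # and return the first value that was explicitly provided.
    for value in (cli_jobs, remote_jobs, core_jobs):
        if value is not None:
            return value
    return backend_default_jobs(scheme)
```

For example, `resolve_jobs(remote_jobs=8, core_jobs=16)` returns `8` under these assumptions, because no command-line value is given and the remote-level setting comes before the core-level one.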

Member

does it make sense to mention this here as we do in the other parts of this PR?

Contributor @efiop, Feb 3, 2021

@shcheklein It is already mentioned here. Unless I'm misunderstanding your question.

Member

Hmm, I don't see it. (I don't mean defaults, but rather that it's affected or can be affected by config. We have something like this in the other places)

Contributor

I see. Sure, could mention here.

from the source. The default value is `4 * cpu_count()`. For SSH remotes, the
default is `4`. Using more jobs may speed up the operation.

- `--desc <text>` - user description of the data (optional). This doesn't
affect any DVC operations.

9 changes: 4 additions & 5 deletions content/docs/command-reference/import.md
@@ -105,11 +105,10 @@ repo at `url`) are not supported.
finish the operation(s)); or if the target data already exist locally and you
want to "DVCfy" this state of the project (see also `dvc commit`).

- `-j <number>`, `--jobs <number>` - number of threads to run simultaneously to
handle the downloading of files from the remote. The default value is
`4 * cpu_count()`. For SSH remotes, the default is just `4`. Using more jobs
may improve the total download speed if a combination of small and large files
are being fetched.
- `-j <number>`, `--jobs <number>` - parallelism level for DVC to download data
from the remote. The default value is `4 * cpu_count()`. For SSH remotes, the
default is `4`. Note that the default value can be set using the `jobs` config
option with `dvc remote modify`. Using more jobs may speed up the operation.

- `--desc <text>` - user description of the data (optional). This doesn't affect
any DVC operations.
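
As an illustration of what a "parallelism level" means for the `--jobs` option in the hunks above, the following sketch runs a batch of transfers on a thread pool whose size is the jobs value. It is an assumption-laden sketch, not DVC's implementation: `fake_download` and `download_all` are hypothetical names, no network access takes place, and the fallback of `4 * cpu_count()` simply mirrors the documented default.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def fake_download(name):
    # Stand-in for a real transfer; it just returns the name it was given.
    return name


def download_all(names, jobs=None):
    # Fall back to the documented default of 4 * cpu_count() when no
    # explicit jobs value is passed (an assumption for this sketch).
    if jobs is None:
        jobs = 4 * (os.cpu_count() or 1)
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # map() yields results in input order, whatever order tasks finish in.
        return list(pool.map(fake_download, names))
```

A larger `jobs` value lets more transfers overlap, which is why the option text says that using more jobs may speed up the operation.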