api: add config support for open/read #9611

Merged
merged 1 commit into from
Jun 15, 2023
29 changes: 24 additions & 5 deletions dvc/api/data.py
@@ -73,6 +73,7 @@ def open(  # noqa, pylint: disable=redefined-builtin
     remote: Optional[str] = None,
     mode: str = "r",
     encoding: Optional[str] = None,
+    config: Optional[Dict[str, Any]] = None,
Member

I think it'd be easier and simpler if it were a remote_config, otherwise users have to be aware of our whole config structure.

Member
@skshetry, Jun 16, 2023

Or, we could just name it config but we'll be passing:

{
  "core": {"remote": remote},
  "remote": {remote: config},
}
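
The wrapping described in this comment could look roughly like the sketch below. `wrap_remote_config` is a hypothetical helper name, not part of the DVC API; it only shows how a single remote's settings would be expanded into the `core`/`remote` structure shown above.

```python
from typing import Any, Dict


def wrap_remote_config(remote: str, remote_config: Dict[str, Any]) -> Dict[str, Any]:
    """Expand one remote's settings into a repo-level config dict.

    Hypothetical helper for illustration only.
    """
    return {
        # Select `remote` as the remote to use for this call.
        "core": {"remote": remote},
        # Attach the per-remote options under that remote's name.
        "remote": {remote: dict(remote_config)},
    }


full_config = wrap_remote_config("myremote", {"url": "s3://bucket/path"})
print(full_config)
```

With this shape, the caller only supplies the remote name and that remote's options; the nesting stays an internal detail.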

Collaborator

Agreed. I would simplify even more to the structure of a single remote's config, like open(..., remote_config={"url": ..}). If there is also a remote arg, we can merge it with the config for that remote. If not, we can merge it with the default remote.

Related: iterative/dvc.org#4628 (comment).
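
A minimal sketch of the merge rule proposed here, assuming the repository's existing config is available as a plain nested dict. `merge_remote_config` and its parameters are hypothetical names used for illustration; this is not how DVC implements it.

```python
import copy
from typing import Any, Dict, Optional


def merge_remote_config(
    repo_config: Dict[str, Any],
    remote_config: Dict[str, Any],
    remote: Optional[str] = None,
) -> Dict[str, Any]:
    """Merge `remote_config` into the named remote, or into the default one."""
    merged = copy.deepcopy(repo_config)  # never mutate the caller's config
    # Use the explicit remote if given, otherwise fall back to the default.
    name = remote or merged.get("core", {}).get("remote")
    if name is None:
        raise ValueError("no remote given and no default remote configured")
    # An explicit remote also becomes the one selected for this call.
    merged.setdefault("core", {})["remote"] = name
    merged.setdefault("remote", {}).setdefault(name, {}).update(remote_config)
    return merged


existing = {"core": {"remote": "origin"}, "remote": {"origin": {"url": "s3://bucket"}}}
print(merge_remote_config(existing, {"token": "secret"}))
```

Copying the input first keeps the override local to one call, which matches the intent of passing options per `open()`/`read()` invocation.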

Contributor Author

config is more powerful though and you could set a new default remote there in one place as well. Sure, we could also do remote_config but that's more niche and we will need to tell what remote we want that to apply to. Remember that one could have multiple remotes in the repository (plus also for remote notation), which remote_config won't handle. I chose config because it is the most complete solution, and we could add edge-case params later if there is demand.

Member
@skshetry, Jun 16, 2023

> config is more powerful though

It should be simple to use too.

> we will need to tell what remote we want that to apply to

That already exists; there's the remote kwarg for that.

> Remember that one could have multiple remotes in the repository

This is not applicable to the open/read API though, as they are about a single file, right?

Collaborator

> Ideally only remote_config would be enough

Agreed. I think it can be enough in most cases where you want to use the default remote. Take this example:

with dvc.api.open("data", remote_config={"token": ...}) as f:

This is enough to set additional config options for the default remote, which I think is the most likely use case. How hard is it to add?

Compare that to how it looks with config:

with dvc.api.open("data", config={"remote": {"myremote": {"token": ...}}}) as f:

config is not only longer, but users have to know the config structure and the name of the default remote.

> I would rather indeed keep config for now as the most powerful option that is great to have around even in the future.

Is there any way it will be used for anything besides remote config? I can't see any other config section that would make sense to override.

Contributor Author

> How hard is it to add?

For the default remote, not hard. But currently, if something is using a non-default one, it will work, and supporting that is the time-consuming part. I'm totally on the same page with you: remote_config is useful, I'm just not sure it is worth investing in right now since config is already there.

> Is there any way it will be used for anything besides remote config? I can't see any other config section that would make sense to override.

Remote config is the prime use case, but config is the most powerful mechanism that can allow one to get out of very sticky situations (e.g. if one drops git from the repo but still wants to use it, they can set no_scm through this; this seems useful during deployment somewhere).

Collaborator
@dberenbaum, Jun 20, 2023

What's the level of effort to also add remote_config? Sorry, just repeating question now 🤦

Collaborator

Let's discuss when we talk tomorrow

Member

> config is the most powerful mechanism that can allow one to get out of very sticky situations

@efiop, I think you are looking at it from a maintainer's point of view rather than a user's. We can provide a powerful mechanism in DvcFileSystem; read/open should be simpler. remote_config= solves most of the problem imo.

Creating a new remote and/or providing a way to configure the dvc repo is more of a niche use case. Also, I don't feel comfortable exposing the whole config schema to the users. It has a large surface area and can have unintended effects when not used correctly.

 ):
     """
     Opens a file tracked in a DVC project.
@@ -114,6 +115,8 @@ def open(  # noqa, pylint: disable=redefined-builtin
             Defaults to None.
             This should only be used in text mode.
             Mirrors the namesake parameter in builtin `open()`_.
+        config(dict, optional): config to be passed to the DVC repository.
+            Defaults to None.
 
     Returns:
         _OpenContextManager: A context manager that generates a corresponding
@@ -209,14 +212,24 @@ def open(  # noqa, pylint: disable=redefined-builtin
         "rev": rev,
         "mode": mode,
         "encoding": encoding,
+        "config": config,
     }
     return _OpenContextManager(_open, args, kwargs)
 
 
-def _open(path, repo=None, rev=None, remote=None, mode="r", encoding=None):
-    repo_kwargs: Dict[str, Any] = {"subrepos": True, "uninitialized": True}
+def _open(path, repo=None, rev=None, remote=None, mode="r", encoding=None, config=None):
     if remote:
-        repo_kwargs["config"] = {"core": {"remote": remote}}
+        if config is not None:
+            raise ValueError(
+                "can't specify both `remote` and `config` at the same time"
+            )
+        config = {"core": {"remote": remote}}
+
+    repo_kwargs: Dict[str, Any] = {
+        "subrepos": True,
+        "uninitialized": True,
+        "config": config,
+    }
 
     with Repo.open(repo, rev=rev, **repo_kwargs) as _repo:
         with _wrap_exceptions(_repo, path):
@@ -251,13 +264,19 @@ def _open(path, repo=None, rev=None, remote=None, mode="r", encoding=None):
                 raise DvcIsADirectoryError(f"'{path}' is a directory") from exc
 
 
-def read(path, repo=None, rev=None, remote=None, mode="r", encoding=None):
+def read(path, repo=None, rev=None, remote=None, mode="r", encoding=None, config=None):
     """
     Returns the contents of a tracked file (by DVC or Git). For Git repos, HEAD
     is used unless a rev argument is supplied. The default remote is tried
     unless a remote argument is supplied.
     """
     with open(
-        path, repo=repo, rev=rev, remote=remote, mode=mode, encoding=encoding
+        path,
+        repo=repo,
+        rev=rev,
+        remote=remote,
+        mode=mode,
+        encoding=encoding,
+        config=config,
     ) as fd:
         return fd.read()
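
The `remote`/`config` handling that this change adds to `_open` can be exercised in isolation. The sketch below copies that logic into a standalone function; `resolve_repo_kwargs` is a hypothetical name chosen for illustration, and the function does not open a repository.

```python
from typing import Any, Dict, Optional


def resolve_repo_kwargs(
    remote: Optional[str] = None,
    config: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Build the repo kwargs the way the patched `_open` does."""
    if remote:
        # `remote` is shorthand for a config override, so the two cannot mix.
        if config is not None:
            raise ValueError(
                "can't specify both `remote` and `config` at the same time"
            )
        config = {"core": {"remote": remote}}

    return {"subrepos": True, "uninitialized": True, "config": config}


print(resolve_repo_kwargs(remote="other"))
print(resolve_repo_kwargs(config={"core": {"remote": "other"}}))
```

Both calls produce the same kwargs, which is why the tests in this change expect `remote="other"` and `config={"core": {"remote": "other"}}` to read the same content.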
33 changes: 33 additions & 0 deletions tests/func/api/test_data.py
@@ -218,3 +218,36 @@ def test_open_from_remote(tmp_dir, erepo_dir, cloud, local_cloud):
         remote="other",
     ) as fd:
         assert fd.read() == "foo content"
+
+    with api.open(
+        os.path.join("dir", "foo"),
+        repo=f"file://{erepo_dir.as_posix()}",
+        config={"core": {"remote": "other"}},
+    ) as fd:
+        assert fd.read() == "foo content"
+
+
+def test_read_from_remote(tmp_dir, erepo_dir, cloud, local_cloud):
+    erepo_dir.add_remote(config=cloud.config, name="other")
+    erepo_dir.add_remote(config=local_cloud.config, default=True)
+    erepo_dir.dvc_gen({"dir": {"foo": "foo content"}}, commit="create file")
+    erepo_dir.dvc.push(remote="other")
+    remove(erepo_dir.dvc.cache.local.path)
+
+    assert (
+        api.read(
+            os.path.join("dir", "foo"),
+            repo=f"file://{erepo_dir.as_posix()}",
+            remote="other",
+        )
+        == "foo content"
+    )
+
+    assert (
+        api.read(
+            os.path.join("dir", "foo"),
+            repo=f"file://{erepo_dir.as_posix()}",
+            config={"core": {"remote": "other"}},
+        )
+        == "foo content"
+    )