api: create docs #908

Merged
merged 109 commits into from
Mar 8, 2020
109 commits
d56f2d2
api: create index and structure
jorgeorpinel Jan 8, 2020
d53833f
api ref: add summon page, improve format of all pages
jorgeorpinel Jan 8, 2020
aba4954
api ref: complete pages per initial desc.
jorgeorpinel Jan 8, 2020
0f5a3bb
api: better formatting, added links, and examples
jorgeorpinel Jan 9, 2020
701aeb7
api ref: reorder pages, refine descriptions and options, add examples
jorgeorpinel Jan 9, 2020
6a18d61
api: note that env mgr install is required
jorgeorpinel Jan 9, 2020
a3023be
api ref: typo in index
jorgeorpinel Jan 9, 2020
09c76c5
cmd ref: improve `path` description in get and import
jorgeorpinel Jan 9, 2020
8c14cbd
api: improve path/name and repo param descriptions
jorgeorpinel Jan 9, 2020
97c3972
api: improve `repo` param desc
jorgeorpinel Jan 9, 2020
6894e77
api: add assumed defaults to all params in all functions
jorgeorpinel Jan 9, 2020
711838d
api: oops, forgot about `repo` param's default :B
jorgeorpinel Jan 9, 2020
68a3dd0
api: comlpete get_url desc and real life example
jorgeorpinel Jan 9, 2020
95fa070
use-cases: link from data-registries to api ref
jorgeorpinel Jan 10, 2020
09db290
doc: add some basic links between API and cmd refs
jorgeorpinel Jan 10, 2020
bb05e9c
api: better desc. and example explanation in get_url
jorgeorpinel Jan 11, 2020
58e4bb5
api: improve index desc (again?)
jorgeorpinel Jan 11, 2020
08fe226
api: update get_url example to use iterative/dataset-registry repo
jorgeorpinel Jan 12, 2020
c207152
api: complete base examples for get_url, open, and read (they may nee…
jorgeorpinel Jan 12, 2020
4a50af9
install: add notes about installing as a Python lib
jorgeorpinel Jan 14, 2020
89734fd
api ref: add note about file existence in get_url, and related updates
jorgeorpinel Jan 15, 2020
c094233
api: copy edits
jorgeorpinel Jan 16, 2020
8b133ca
api: add notes about possible errors in function arguments
jorgeorpinel Jan 16, 2020
eec0848
api: add return types to first 3 functions
jorgeorpinel Jan 16, 2020
9b58c3f
api: add list of remotes you can strem from for open fn
jorgeorpinel Jan 16, 2020
af3ee37
Merge branch 'master' into api
jorgeorpinel Jan 16, 2020
eeb318c
Merge branch 'master' into api
jorgeorpinel Jan 19, 2020
3e0b909
api ref: remove `summon()` page
jorgeorpinel Jan 19, 2020
9cf939f
api ref: add full modue path to exceptions mentioned so far
jorgeorpinel Jan 21, 2020
59bb2f2
api: fix term "environment manager"
jorgeorpinel Jan 21, 2020
e340321
api ref: separate short and long desc, similar to cmd ref
jorgeorpinel Jan 21, 2020
cc4a7a8
api ref: small language refinements
jorgeorpinel Jan 21, 2020
24f2d67
api ref: add note about `.dir` cache files in get_url
jorgeorpinel Jan 23, 2020
e26ef15
Merge branch 'master' into api
jorgeorpinel Feb 7, 2020
026eaa2
api: correct exception path in get_url
jorgeorpinel Feb 7, 2020
3815a09
api: standard indentation and arg usage in all examples, and
jorgeorpinel Feb 7, 2020
382b195
api: improve last open() example
jorgeorpinel Feb 8, 2020
013f5df
api: std use of single vs double quotes and add mode='rb' in read() e…
jorgeorpinel Feb 8, 2020
e9340ce
api: update index and install section
jorgeorpinel Feb 8, 2020
0a725e8
cmd ref: refactor and simplify notes to emphasize those linking to ap…
jorgeorpinel Feb 8, 2020
d696063
api ref: rewrite get_url intro
jorgeorpinel Feb 8, 2020
f13b311
api ref: simplify note about get_url not checking for file/dir existence
jorgeorpinel Feb 8, 2020
73adff5
api ref: update note about directory JSON .dir files in get_url
jorgeorpinel Feb 9, 2020
6ebe371
api ref: std. param lang style
jorgeorpinel Feb 9, 2020
a921b40
api ref: simplify and improve basic param descs
jorgeorpinel Feb 9, 2020
a6f9eec
api ref: improvements to repo param and get_url example
jorgeorpinel Feb 9, 2020
c1eb598
api ref: further explain URL construction in get_url example
jorgeorpinel Feb 12, 2020
4f42b88
api ref: simplify api index
jorgeorpinel Feb 12, 2020
ffc0117
Merge branch 'master' into api
jorgeorpinel Feb 15, 2020
8208fd9
api: add link to dvcx repo
jorgeorpinel Feb 17, 2020
910d319
Merge branch 'master' into api
jorgeorpinel Feb 17, 2020
1d5d3a9
api: open() and read() support Git-tracked files
jorgeorpinel Feb 17, 2020
a11465e
links: fix link-check for api docs
jorgeorpinel Feb 17, 2020
f86afde
api: typo
jorgeorpinel Feb 17, 2020
979d70c
api: Signature -> definition section in all fns
jorgeorpinel Feb 18, 2020
099dc4e
api: copy edits and term artifact -> file or dir in get_url
jorgeorpinel Feb 18, 2020
9e21825
api: term artifact -> data since open() and read() don't support dirs
jorgeorpinel Feb 18, 2020
4eb85b3
api: typo
jorgeorpinel Feb 18, 2020
b6cd85e
api: improvements to fn params
jorgeorpinel Feb 18, 2020
11fb55e
api: updates to repo param
jorgeorpinel Feb 18, 2020
be2316f
install: reword link to api ref
jorgeorpinel Feb 19, 2020
c94610d
term: GitHub URLs -> hosted on GitHub
jorgeorpinel Feb 19, 2020
9c52cd8
api: add link to read() from open() desc.
jorgeorpinel Feb 19, 2020
4219494
Merge branch 'master' into api
jorgeorpinel Feb 25, 2020
942d81d
api: remove word "directly" fom exception lists
jorgeorpinel Feb 25, 2020
e76bd5f
api: add basic usage sections
jorgeorpinel Feb 25, 2020
2b8b4a7
api: improve model open() example
jorgeorpinel Feb 25, 2020
aa88cea
api: fix typos and remove lines between import stmts
jorgeorpinel Feb 25, 2020
dc93bb8
Merge branch 'api' of github.com:iterative/dvc.org into api
jorgeorpinel Feb 25, 2020
f66826c
api: fix closing parentheses in example
jorgeorpinel Feb 25, 2020
8b3929c
api: remove link to dvcx
jorgeorpinel Feb 25, 2020
3ec79a0
api: updates to open()
jorgeorpinel Feb 25, 2020
fcdf0c4
api: improve open() examples
jorgeorpinel Feb 27, 2020
e5b52ae
api: improve list of 3rd party lib examples in get_url
jorgeorpinel Feb 27, 2020
7f2981f
api ref: compact intro/signature before loner descs
jorgeorpinel Feb 27, 2020
03d4e72
api ref: improve dvc.api.open() desc similar to open() builtin
jorgeorpinel Feb 27, 2020
6ccfc80
api ref: updates to get_url
jorgeorpinel Feb 27, 2020
418651d
api ref: impros to open()
jorgeorpinel Feb 27, 2020
7161e8c
api ref: more impros to open()
jorgeorpinel Feb 27, 2020
a229803
api ref: remove term "source" from params
jorgeorpinel Feb 27, 2020
c43e704
api ref: better wording in path option
jorgeorpinel Feb 27, 2020
5acaffd
api ref: explain mode and encoding options (open/read())
jorgeorpinel Feb 27, 2020
8840338
api ref: typo project->cache
jorgeorpinel Feb 27, 2020
2e9dc3d
api ref: move default param behavior to fn descriptions
jorgeorpinel Feb 27, 2020
35674e3
api ref: add read() snippet in open() example
jorgeorpinel Feb 27, 2020
42e8563
api ref: add read() example explanation and fix link check
jorgeorpinel Feb 28, 2020
ed80616
api ref: name examples
jorgeorpinel Feb 28, 2020
c47b366
api ref: move the default arguments/behavior back to params
jorgeorpinel Feb 29, 2020
a1a6b34
api ref: use simple language in example titles
jorgeorpinel Feb 29, 2020
a251259
api ref: merge local open() example into --rev example
jorgeorpinel Feb 29, 2020
af551be
api ref: change motivation of `remote` arg example in open()
jorgeorpinel Feb 29, 2020
7277e11
api ref: unserialize -> deserialize
jorgeorpinel Feb 29, 2020
4f04a62
api ref: some last refinements 9on this feedback round)
jorgeorpinel Feb 29, 2020
925f520
api ref: rewrite intro blocks for simplicity, use type hints
jorgeorpinel Feb 29, 2020
7105df6
api ref: improve print output of code samples
jorgeorpinel Mar 1, 2020
a02b178
api ref: few text edits to match to core repo docstrings
jorgeorpinel Mar 2, 2020
466694c
api ref: correct docs about UrlNotDvcRepoError – it only exists in ge…
jorgeorpinel Mar 2, 2020
951742d
suggest some minor things to API
shcheklein Mar 3, 2020
410622f
Update public/static/docs/api-reference/get_url.md
shcheklein Mar 3, 2020
fcbdefd
Merge pull request #1032 from iterative/api-suggestions
jorgeorpinel Mar 4, 2020
8089423
api ref: address remaining feedback from PR #1032
jorgeorpinel Mar 4, 2020
e440284
typo
jorgeorpinel Mar 4, 2020
4ee9335
api ref: apply get_url improvements to open and read fns
jorgeorpinel Mar 4, 2020
6cd45b0
api ref: add info. about types returned/generated to open and read
jorgeorpinel Mar 5, 2020
ce5b42a
api ref: minor changes to open
jorgeorpinel Mar 5, 2020
f245cc8
Merge branch 'master' into api
jorgeorpinel Mar 8, 2020
6bd2740
api ref: address another round of feedback on open fn
jorgeorpinel Mar 8, 2020
90cb882
api ref: small change to reaf fn example title
jorgeorpinel Mar 8, 2020
7733fea
api ref: a few last improvements it seems
jorgeorpinel Mar 8, 2020
110 changes: 110 additions & 0 deletions public/static/docs/api-reference/get_url.md
@@ -0,0 +1,110 @@
# dvc.api.get_url()

Returns the URL to the storage location of a data file or directory tracked in a
<abbr>DVC project</abbr>.

```py
def get_url(path: str,
            repo: str = None,
            rev: str = None,
            remote: str = None) -> str
```

#### Usage:

```py
import dvc.api

resource_url = dvc.api.get_url(
    'get-started/data.xml',
    repo='https://github.com/iterative/dataset-registry')

# resource_url is now "https://remote.dvc.org/dataset-registry/a3/04afb96060aad90176268345e10355"
```

## Description

Returns the URL string of the storage location (in a
[DVC remote](/doc/command-reference/remote)) where a target file or directory,
specified by its `path` in a `repo` (<abbr>DVC project</abbr>), is stored.

The URL is formed by reading the project's
[remote configuration](/doc/command-reference/config#remote) and the
[DVC-file](/doc/user-guide/dvc-file-format) where the given `path` is an
<abbr>output</abbr>. The URL scheme returned depends on the
[type](/doc/command-reference/remote/add#supported-storage-types) of the
`remote` used (see the [Parameters](#parameters) section).

If the target is a directory, the returned URL will end in `.dir`. Refer to
[Structure of cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory)
and `dvc add` to learn more about how DVC handles data directories.

⚠️ This function does not check for the actual existence of the file or
directory in the remote storage.

💡 With the resource's URL, it should be possible to download it directly using
an appropriate library, such as
[`boto3`](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Object.download_fileobj)
or
[`paramiko`](https://docs.paramiko.org/en/stable/api/sftp.html#paramiko.sftp_client.SFTPClient.get).
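
As a rough sketch of that idea for HTTP(S) remotes only, the snippet below
streams a URL into a local file using just the standard library. `download` is a
hypothetical helper and not part of `dvc.api`; URLs for other storage types
(S3, SSH, etc.) would need a matching client library instead.

```py
import shutil
import urllib.request


def download(url: str, dest: str, chunk_size: int = 1024 * 1024) -> str:
    """Stream the resource at `url` into the local file `dest`.

    Hypothetical helper (not part of dvc.api). It relies on urllib, so it
    only handles URL schemes that urllib understands, such as http(s).
    """
    with urllib.request.urlopen(url) as response, open(dest, 'wb') as out:
        # Copy in chunks so the whole file never has to fit in memory.
        shutil.copyfileobj(response, out, chunk_size)
    return dest


# Usage (requires DVC and network access):
#   import dvc.api
#   url = dvc.api.get_url(
#       'get-started/data.xml',
#       repo='https://github.com/iterative/dataset-registry')
#   download(url, 'data.xml')
```

Since `get_url()` does not check that the resource exists, a real script would
probably also want to handle download errors such as `urllib.error.HTTPError`.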

## Parameters

- **`path`** - location and file name of the file or directory in `repo`,
relative to the project's root.

- `repo` - specifies the location of the DVC project. It can be a URL or a file
system path. Both HTTP and SSH protocols are supported for online Git repos
(e.g. `[user@]server:project.git`). _Default_: The current project is used
(the current working directory tree is walked up to find it).

- `rev` - Git commit (any [revision](https://git-scm.com/docs/revisions) such as
a branch or tag name, or a commit hash). If `repo` is not a Git repo, this
option is ignored. _Default_: `HEAD`.

- `remote` - name of the [DVC remote](/doc/command-reference/remote) to use to
form the returned URL string. _Default_: The
[default remote](/doc/command-reference/remote/default) of `repo` is used.

## Exceptions

- `dvc.api.UrlNotDvcRepoError` - `repo` is not a DVC project.

- `dvc.exceptions.NoRemoteError` - no `remote` is found.

## Example: Getting the URL to a DVC-tracked file

```py
import dvc.api

resource_url = dvc.api.get_url(
    'get-started/data.xml',
    repo='https://github.com/iterative/dataset-registry'
)

print(resource_url)
```

The script above prints

`https://remote.dvc.org/dataset-registry/a3/04afb96060aad90176268345e10355`

This URL represents the location where the data is stored, and is built by
reading the corresponding DVC-file
([`get-started/data.xml.dvc`](https://github.com/iterative/dataset-registry/blob/master/get-started/data.xml.dvc))
where the `md5` file hash is stored,

```yaml
outs:
  - md5: a304afb96060aad90176268345e10355
    path: get-started/data.xml
```

and the project configuration
([`.dvc/config`](https://github.com/iterative/dataset-registry/blob/master/.dvc/config))
where the remote URL is saved:

```ini
['remote "storage"']
url = https://remote.dvc.org/dataset-registry
```
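
To make the construction concrete, here is a minimal sketch that reproduces the
URL above from those two values. `cache_url` is a hypothetical helper written
only for illustration; it mirrors the layout visible in this example (the first
two hash characters become a directory), and is not necessarily how DVC builds
URLs internally or for other remote types.

```py
def cache_url(remote_url: str, md5: str) -> str:
    """Join a remote URL and an md5 hash following the example's layout:
    the first two hex characters form a directory, the rest the file name.

    Hypothetical helper, not part of dvc.api.
    """
    return f"{remote_url.rstrip('/')}/{md5[:2]}/{md5[2:]}"


url = cache_url('https://remote.dvc.org/dataset-registry',
                'a304afb96060aad90176268345e10355')
print(url)
```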
16 changes: 16 additions & 0 deletions public/static/docs/api-reference/index.md
@@ -0,0 +1,16 @@
# Python API

DVC can be used as a Python library: simply [install](/doc/install) it with
`pip` or `conda`. This reference provides the details about the functions in the
API module `dvc.api`, which can be imported in any regular way, for example:

```py
import dvc.api
```

The purpose of this API is to provide programmatic access to the data or models
[stored and versioned](/doc/use-cases/versioning-data-and-model-files) in
<abbr>DVC repositories</abbr> from Python apps.

Please choose a function from the navigation sidebar to the left, or click the
`Next` button below to jump into the first one ↘
189 changes: 189 additions & 0 deletions public/static/docs/api-reference/open.md
@@ -0,0 +1,189 @@
# dvc.api.open()

Opens a tracked file.

```py
def open(path: str,
         repo: str = None,
         rev: str = None,
         remote: str = None,
         mode: str = "r",
         encoding: str = None)
```

#### Usage:

```py
import dvc.api

with dvc.api.open(
    'get-started/data.xml',
    repo='https://github.com/iterative/dataset-registry'
) as fd:
    # fd is a file object that can be processed normally, for example:
    contents = fd.read()
```

## Description

Open a data or model file tracked in a <abbr>DVC project</abbr> and generate a
corresponding
[file object](https://docs.python.org/3/glossary.html#term-file-object). The
file can be tracked by DVC or by Git.

> The exact type of file object depends on the `mode` used. For more details,
> please refer to Python's
> [`open()`](https://docs.python.org/3/library/functions.html#open) built-in,
> which is used under the hood.

`dvc.api.open()` may only be used as a
[context manager](https://www.python.org/dev/peps/pep-0343/#context-managers-in-the-standard-library)
(using the `with` keyword, as shown in the examples).

> Use `dvc.api.read()` to get the complete file contents in a single function
> call – no _context manager_ involved.

This function makes a direct connection to the
[remote storage](/doc/command-reference/remote/add#supported-storage-types), so
the file contents can be streamed as they are read. This means it does not
require space on disk to save the file before making it accessible. The only
exception is when using Google Drive as the
[remote type](/doc/command-reference/remote/add#supported-storage-types).
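
As an illustration of what streaming allows, the sketch below averages a CSV
column one row at a time, so the full file never has to be held in memory.
`column_mean` is a hypothetical helper that works on any text file object, and
the `prices.csv` path in the usage comment is made up; an in-memory `StringIO`
stands in for the streamed file so the snippet runs offline.

```py
import csv
import io


def column_mean(fd, column: str) -> float:
    """Average a numeric CSV column while reading `fd` row by row.

    Hypothetical helper: `fd` can be any text file object, such as the one
    produced by dvc.api.open().
    """
    total = 0.0
    count = 0
    for row in csv.DictReader(fd):
        total += float(row[column])
        count += 1
    if count == 0:
        raise ValueError('no data rows found')
    return total / count


# Stand-in for a streamed file, so this runs without DVC or a network.
sample = io.StringIO('id,price\n1,10.0\n2,20.0\n3,30.0\n')
print(column_mean(sample, 'price'))  # 20.0

# With DVC (assumed path, for illustration only):
#   with dvc.api.open('prices.csv', repo='...') as fd:
#       print(column_mean(fd, 'price'))
```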

## Parameters

- **`path`** - location and file name of the file in `repo`, relative to the
project's root.

- `repo` - specifies the location of the DVC project. It can be a URL or a file
system path. Both HTTP and SSH protocols are supported for online Git repos
(e.g. `[user@]server:project.git`). _Default_: The current project is used
(the current working directory tree is walked up to find it).

- `rev` - Git commit (any [revision](https://git-scm.com/docs/revisions) such as
a branch or tag name, or a commit hash). If `repo` is not a Git repo, this
option is ignored. _Default_: `HEAD`.

- `remote` - name of the [DVC remote](/doc/command-reference/remote) to look for
the target data. _Default_: The
[default remote](/doc/command-reference/remote/default) of `repo` is used if a
`remote` argument is not given. For local projects, the <abbr>cache</abbr> is
tried before the default remote.

- `mode` - specifies the mode in which the file is opened. Defaults to `"r"`
(read). Mirrors the namesake parameter in builtin
[`open()`](https://docs.python.org/3/library/functions.html#open).

- `encoding` -
[codec](https://docs.python.org/3/library/codecs.html#standard-encodings) used
to decode the file contents to a string. This should only be used in text
mode. Defaults to `"utf-8"`. Mirrors the namesake parameter in builtin
`open()`.
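
Because both parameters mirror the built-in `open()`, their effect can be
sketched with a local temporary file. This assumes `dvc.api.open()` treats
`mode` and `encoding` the same way, as the descriptions above state; the file
name and contents are made up for the demonstration.

```py
import os
import tempfile

# Write a small KOI8-R encoded file to stand in for a tracked text file.
text = 'привет'
path = os.path.join(tempfile.mkdtemp(), 'words_ru.txt')
with open(path, 'wb') as f:
    f.write(text.encode('koi8_r'))

# Text mode: bytes are decoded with the given codec and a str is returned.
with open(path, mode='r', encoding='koi8_r') as fd:
    as_text = fd.read()

# Binary mode: no decoding happens and raw bytes are returned.
with open(path, mode='rb') as fd:
    as_bytes = fd.read()

print(type(as_text).__name__, type(as_bytes).__name__)  # str bytes
```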

## Exceptions

- `dvc.exceptions.FileMissingError` - file in `path` is missing from `repo`.

- `dvc.exceptions.PathMissingError` - `path` cannot be found in `repo`.

- `dvc.api.UrlNotDvcRepoError` - `repo` is not a DVC project.

- `dvc.exceptions.NoRemoteError` - no `remote` is found.

## Example: Use data or models from DVC repositories

Any <abbr>data artifact</abbr> can be employed directly in your Python app by
using this API. For example, an XML file tracked in a public DVC repo on GitHub
can be parsed with:

```py
from xml.dom.minidom import parse
import dvc.api

with dvc.api.open(
    'get-started/data.xml',
    repo='https://github.com/iterative/dataset-registry'
) as fd:
    xmldom = parse(fd)
    # ... Process DOM
```

Review thread on the `xmldom = parse(fd)` line:

- **Member:** A super relevant example would be to show a SAX or StAX parser
  instead of a DOM one; that's where it shines. Or we can make the CSV example
  the main one and show how we process it in a streaming fashion (e.g.
  calculating a sum or average). That would show the "streaming" aspect of
  `open()` much better.

- **jorgeorpinel** (Contributor, Author), Mar 8, 2020: Regarding "make CSV
  example the main one and show how we process it in a streaming fashion": I
  don't think we're talking about real-time data streaming (e.g. from a Kafka
  server), where continuously calculating a metric would be logical. Or maybe I
  missed the point?

- **jorgeorpinel** (Contributor, Author), Mar 8, 2020: I merged the PR, but I'm
  still thinking about this one. What's the advantage of streaming files in
  open/read? Probably just making a big file available quickly so you can start
  processing it before it's all downloaded. But again, I don't think you'll
  want to show the progress of such processing, or is that a major use case you
  guys see? Cc @Suor @shcheklein

- **Contributor, Author:** Moved this discussion to a new PR: #1037

> Notice that if you just need to load the complete file contents to memory, you
> can use `dvc.api.read()` instead:
>
> ```py
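> from xml.dom.minidom import parseString
>
> # read() returns the file contents, so parse them as a string (not a file).
> xmldata = dvc.api.read('get-started/data.xml',
>                        repo='https://github.com/iterative/dataset-registry')
> xmldom = parseString(xmldata)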
> ```

Now let's imagine you want to deserialize and use a binary model from a private
repo. For a case like this, we can use an SSH URL instead (assuming the
[credentials are configured](https://help.github.com/en/github/authenticating-to-github/connecting-to-github-with-ssh)
locally):

```py
import pickle
import dvc.api

with dvc.api.open(
    'model.pkl',
    repo='[email protected]:path/to/repo.git',
    mode='rb'  # pickle needs a binary file object
) as fd:
    model = pickle.load(fd)
    # ... Use instantiated model
```

## Example: Use different versions of data

The `rev` argument lets you specify any Git commit to look for an artifact. This
way any previous version, or alternative experiment can be accessed
programmatically. For example, let's say your DVC repo has tagged releases of a
CSV dataset:

```py
import csv
import dvc.api

with dvc.api.open(
    'clean.csv',
    rev='v1.1.0'
) as fd:
    reader = csv.reader(fd)
    # ... Read clean data from version 1.1.0
```

Also, notice that we didn't supply a `repo` argument in this example. DVC will
attempt to find a <abbr>DVC project</abbr> to use in the current working
directory tree, and look for the file contents of `clean.csv` in its local
<abbr>cache</abbr>; no download will happen if found. See the
[Parameters](#parameters) section for more info.
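
Building on this, several revisions can be compared in a loop. The sketch below
counts CSV rows per revision. `rows_per_revision` and `fake_open` are
hypothetical names, the `v1.0.0` tag and the sample data are invented, and the
opener is passed in as an argument so that the snippet runs offline; in a real
project you would pass `dvc.api.open` instead.

```py
import csv
import io
from contextlib import contextmanager


def rows_per_revision(path, revs, opener):
    """Return {rev: number of CSV rows} for each revision in `revs`.

    `opener` is any callable used as opener(path, rev=...) that returns a
    context manager yielding a text file object, such as dvc.api.open.
    """
    counts = {}
    for rev in revs:
        with opener(path, rev=rev) as fd:
            counts[rev] = sum(1 for _ in csv.reader(fd))
    return counts


@contextmanager
def fake_open(path, rev=None):
    """Stand-in for dvc.api.open with made-up data for two tags."""
    versions = {'v1.0.0': 'a,b\n1,2\n', 'v1.1.0': 'a,b\n1,2\n3,4\n'}
    yield io.StringIO(versions[rev])


# Counts include the header row of each version.
print(rows_per_revision('clean.csv', ['v1.0.0', 'v1.1.0'], fake_open))
```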

Note: to specify the file encoding of a text file, use:

```py
import dvc.api

with dvc.api.open(
    'data/nlp/words_ru.txt',
    encoding='koi8_r'
) as fd:
    contents = fd.read()  # text decoded from KOI8-R
```

## Example: Choose a specific remote as the data source

Sometimes we may want to choose the [remote](/doc/command-reference/remote) data
source, for example if the `repo` has no default remote set. This can be done by
providing a `remote` argument:

```py
import re
import dvc.api

with dvc.api.open(
    'activity.log',
    repo='location/of/dvc/project',
    remote='my-s3-bucket'
) as fd:
    for line in fd:
        match = re.search(r'user=(\w+)', line)
        # ...
```