Merge pull request #1037 from iterative/api
api ref: better explanation on disc and memory usage for read/open
jorgeorpinel authored Mar 9, 2020
2 parents 8bba8e8 + 7167c3f commit 08f7301
Showing 3 changed files with 52 additions and 42 deletions.
2 changes: 1 addition & 1 deletion public/static/docs/api-reference/index.md
@@ -10,7 +10,7 @@ import dvc.api

The purpose of this API is to provide programmatic access to the data or models
[stored and versioned](/doc/use-cases/versioning-data-and-model-files) in
<abbr>DVC repositories</abbr> from Python apps.
<abbr>DVC repositories</abbr> from Python code.

Please choose a function from the navigation sidebar to the left, or click the
`Next` button below to jump into the first one ↘
79 changes: 45 additions & 34 deletions public/static/docs/api-reference/open.md
@@ -39,15 +39,14 @@ file can be tracked by DVC or by Git.
[context manager](https://www.python.org/dev/peps/pep-0343/#context-managers-in-the-standard-library)
(using the `with` keyword, as shown in the examples).

> Use `dvc.api.read()` to get the complete file contents in a single function
> call – no _context manager_ involved.
This function makes a direct connection to the
[remote storage](/doc/command-reference/remote/add#supported-storage-types)
(except for Google Drive), so the file contents can be streamed as they are
read. This means it does not require space on the disk to save the file before
making it accessible. The only exception is when using Google Drive as
[remote type](/doc/command-reference/remote/add#supported-storage-types).
(except for Google Drive), so the file contents can be streamed. Your code can
process the data [buffer](https://docs.python.org/3/c-api/buffer.html) as it's
streamed, which optimizes memory usage.

> Use `dvc.api.read()` to load the complete file contents in a single function
> call – no _context manager_ involved. Neither function uses disk space.
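
The streaming behavior described above can be sketched with a plain file-like object. This is only an illustration using the standard library: `io.BytesIO` stands in for the file object that `dvc.api.open()` would provide, and the 1 MiB payload and 64 KiB chunk size are assumptions picked for the example.

```python
import hashlib
import io

# Stand-in for a streamed file object (assumption): 1 MiB of zero bytes.
fd = io.BytesIO(bytes(1024 * 1024))

digest = hashlib.sha256()
largest_chunk = 0
# Read fixed-size chunks until read() returns b'' at the end of the stream.
for chunk in iter(lambda: fd.read(64 * 1024), b''):
    digest.update(chunk)
    largest_chunk = max(largest_chunk, len(chunk))

print(digest.hexdigest(), largest_chunk)
```

Only one chunk is held in memory at a time, whereas a single `read()` call would hold the whole file.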
## Parameters

@@ -91,46 +90,56 @@ making it accessible. The only exception is when using Google Drive as

## Example: Use data or models from DVC repositories

Any <abbr>data artifact</abbr> can be employed directly in your Python app by
using this API. For example, an XML file tracked in a public DVC repo on GitHub
can be processed directly in your Python app with:
Any <abbr>data artifact</abbr> hosted online can be processed directly in your
Python code with this API. For example, an XML file tracked in a public DVC repo
on GitHub can be processed like this:

```py
from xml.dom.minidom import parse
from xml.sax import parse
import dvc.api
from mymodule import mySAXHandler

with dvc.api.open(
        'get-started/data.xml',
        repo='https://github.com/iterative/dataset-registry'
        ) as fd:
    xmldom = parse(fd)
    # ... Process DOM
    parse(fd, mySAXHandler)
```

> Notice that if you just need to load the complete file contents to memory, you
> can use `dvc.api.read()` instead:
Notice that we use a [SAX](http://www.saxproject.org/) XML parser here because
`dvc.api.open()` is able to stream the data from
[remote storage](/doc/command-reference/remote/add#supported-storage-types).
(The `mySAXHandler` object should handle the event-driven parsing of the
document in this case.) Parsing the stream as it arrives keeps memory usage low
and is typically faster than loading the whole file into memory first.
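
As a rough illustration of the event-driven approach, here is a minimal sketch of what a handler such as `mySAXHandler` might look like. The `RowCounter` class, the `row` element name, and the sample XML are assumptions made up for this example (the actual structure of `data.xml` is not shown here), and an in-memory `io.BytesIO` object stands in for the file object that `dvc.api.open()` would provide.

```python
import io
import xml.sax


class RowCounter(xml.sax.ContentHandler):
    """Hypothetical handler: counts <row> elements as they stream by."""

    def __init__(self):
        super().__init__()
        self.rows = 0

    def startElement(self, name, attrs):
        # Called once per opening tag; no DOM tree is built in memory.
        if name == 'row':
            self.rows += 1


# Stand-in for the `fd` object returned by dvc.api.open() (assumption).
fd = io.BytesIO(b'<posts><row Id="1"/><row Id="2"/><row Id="3"/></posts>')

handler = RowCounter()
xml.sax.parse(fd, handler)
print(handler.rows)
```

Note that `xml.sax.parse()` expects a handler instance, not the class itself.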

> If you just need to load the complete file contents into memory, you can use
> `dvc.api.read()` instead:
>
> ```py
> from xml.dom.minidom import parseString
> import dvc.api
>
> xmldata = dvc.api.read('get-started/data.xml',
>     repo='https://github.com/iterative/dataset-registry')
> # parseString() takes the document text; parse() expects a file name or file object
> xmldom = parseString(xmldata)
> ```
Now let's imagine you want to deserialize and use a binary model from a private
repo. For a case like this, we can use an SSH URL instead (assuming the
## Example: Accessing private repos
This is just a matter of using the right `repo` argument, for example an SSH URL
(requires that the
[credentials are configured](https://help.github.com/en/github/authenticating-to-github/connecting-to-github-with-ssh)
locally):
```py
import pickle
import dvc.api
with dvc.api.open(
        'model.pkl',
        'features.dat',
        repo='[email protected]:path/to/repo.git'
        ) as fd:
    model = pickle.load(fd)
    # ... Use instantiated model
    # ... Process 'features'
```
## Example: Use different versions of data
@@ -149,7 +158,7 @@ with dvc.api.open(
        rev='v1.1.0'
        ) as fd:
    reader = csv.reader(fd)
    # ... Read clean data from version 1.1.0
    # ... Process 'clean' data from version 1.1.0
```
Also, notice that we didn't supply a `repo` argument in this example. DVC will
@@ -158,17 +167,6 @@ directory tree, and look for the file contents of `clean.csv` in its local
<abbr>cache</abbr>; no download will happen if found. See the
[Parameters](#parameters) section for more info.
Note: to specify the file encoding of a text file, use:

```py
import dvc.api

with dvc.api.open(
        'data/nlp/words_ru.txt',
        encoding='koi8_r') as fd:
    # ...
```

## Example: Choose a specific remote as the data source
Sometimes we may want to choose the [remote](/doc/command-reference/remote) data
@@ -185,5 +183,18 @@ with open(
        ) as fd:
    for line in fd:
        match = re.search(r'user=(\w+)', line)
        # ...
        # ... Process users activity log
```
## Example: Specify the text encoding
To choose which codec to open a text file with, pass an `encoding` argument:
```py
import dvc.api
with dvc.api.open(
        'data/nlp/words_ru.txt',
        encoding='koi8_r') as fd:
    # ... Process Russian words
```
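
The `encoding` value is a standard Python codec name. The sketch below shows what the codec does, using only the standard library. The sample word is made up, and `io.TextIOWrapper` over an in-memory buffer stands in for a text-mode file object; this is an assumption for illustration, not a description of how DVC opens the file.

```python
import io

# Hypothetical sample: a Russian word encoded as KOI8-R bytes.
raw = 'слово\n'.encode('koi8_r')

# Stand-in for a text-mode file object opened with encoding='koi8_r'.
fd = io.TextIOWrapper(io.BytesIO(raw), encoding='koi8_r')
words = [line.strip() for line in fd]

# The same bytes are not valid UTF-8, so the wrong codec fails to decode them.
try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(words, utf8_ok)
```
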
13 changes: 6 additions & 7 deletions public/static/docs/api-reference/read.md
@@ -28,16 +28,17 @@ This function wraps [`dvc.api.open()`](/doc/api-reference/open), for a simple
way to return the complete contents of a file tracked in a <abbr>DVC
project</abbr>. The file can be tracked by DVC or by Git.

> This is similar to the `dvc get` command in our CLI.
The returned contents can be a
[string](https://docs.python.org/3/library/stdtypes.html#text-sequence-type-str)
or a [bytearray](https://docs.python.org/3/library/stdtypes.html#bytearray).
These are loaded to memory directly (without using any disk space).

> The type returned depends on the `mode` used. For more details, please refer
> to Python's [`open()`](https://docs.python.org/3/library/functions.html#open)
> built-in, which is used under the hood.
> This is similar to the `dvc get` command in our CLI.
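
To illustrate how `mode` determines the returned type, the following sketch uses Python's built-in `open()` on a temporary file; the file and its contents are made up for the example. With the built-in, text mode (`'r'`) reads a `str` and binary mode (`'rb'`) reads a `bytes` object. Since the text above says the built-in is used under the hood, `dvc.api.read()` presumably follows the same rule, though that is not verified here.

```python
import os
import tempfile

# Create a throwaway file to read back (hypothetical contents).
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('hello')
    path = tmp.name

with open(path, mode='r') as fd:
    text = fd.read()   # text mode: decoded characters

with open(path, mode='rb') as fd:
    data = fd.read()   # binary mode: raw bytes

os.remove(path)
print(type(text).__name__, type(data).__name__)
```
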
## Parameters

- **`path`** - location and file name of the file in `repo`, relative to the
@@ -80,11 +81,9 @@ or a [bytearray](https://docs.python.org/3/library/stdtypes.html#bytearray).

## Example: Load data from a DVC repository

Any <abbr>data artifact</abbr> can be employed directly in your Python app by
using this API.

For example, let's say that you want to deserialize and use a binary model from
an online repo:
Any <abbr>data artifact</abbr> hosted online can be loaded directly in your
Python code with this API. For example, let's say that you want to load and
deserialize a binary model from a repo on GitHub:

```py
import pickle