Merge pull request #1037 from iterative/api

api ref: better explanation on disc and memory usage for read/open
iterative · Mar 9, 2020 · 08f7301 · 08f7301
2 parents 8bba8e8 + 7167c3f
commit 08f7301
Show file tree

Hide file tree

Showing 3 changed files with 52 additions and 42 deletions.
diff --git a/public/static/docs/api-reference/index.md b/public/static/docs/api-reference/index.md
@@ -10,7 +10,7 @@ import dvc.api
 
 The purpose of this API is to provide programatic access to the data or models
 [stored and versioned](/doc/use-cases/versioning-data-and-model-files) in
-<abbr>DVC repositories</abbr> from Python apps.
+<abbr>DVC repositories</abbr> from Python code.
 
 Please choose a function from the navigation sidebar to the left, or click the
 `Next` button below to jump into the first one ↘
diff --git a/public/static/docs/api-reference/open.md b/public/static/docs/api-reference/open.md
@@ -39,15 +39,14 @@ file can be tracked by DVC or by Git.
 [context manager](https://www.python.org/dev/peps/pep-0343/#context-managers-in-the-standard-library)
 (using the `with` keyword, as shown in the examples).
 
-> Use `dvc.api.read()` to get the complete file contents in a single function
-> call – no _context manager_ involved.
-
 This function makes a direct connection to the
 [remote storage](/doc/command-reference/remote/add#supported-storage-types)
-(except for Google Drive), so the file contents can be streamed as they are
-read. This means it does not require space on the disc to save the file before
-making it accessible. The only exception is when using Google Drive as
-[remote type](/doc/command-reference/remote/add#supported-storage-types).
+(except for Google Drive), so the file contents can be streamed. Your code can
+process the data [buffer](https://docs.python.org/3/c-api/buffer.html) as it's
+streamed, which optimizes memory usage.
+
+> Use `dvc.api.read()` to load the complete file contents in a single function
+> call – no _context manager_ involved. Neither function utilizes disc space.
 
 ## Parameters
 
@@ -91,46 +90,56 @@ making it accessible. The only exception is when using Google Drive as
 
 ## Example: Use data or models from DVC repositories
 
-Any <abbr>data artifact</abbr> can be employed directly in your Python app by
-using this API. For example, an XML file tracked in a public DVC repo on Github
-can be processed directly in your Python app with:
+Any <abbr>data artifact</abbr> hosted online can be processed directly in your
+Python code with this API. For example, an XML file tracked in a public DVC repo
+on Github can be processed like this:
 
 ```py
-from xml.dom.minidom import parse
+from xml.sax import parse
 import dvc.api
+from mymodule import mySAXHandler
 
 with dvc.api.open(
         'get-started/data.xml',
         repo='https://github.com/iterative/dataset-registry'
         ) as fd:
-    xmldom = parse(fd)
-    # ... Process DOM
+    parse(fd, mySAXHandler)
 ```
 
-> Notice that if you just need to load the complete file contents to memory, you
-> can use `dvc.api.read()` instead:
+Notice that we use a [SAX](http://www.saxproject.org/) XML parser here because
+`dvc.api.open()` is able to stream the data from
+[remote storage](/doc/command-reference/remote/add#supported-storage-types).
+(The `mySAXHandler` object should handle the event-driven parsing of the
+document in this case.) This increases the performance of the code (minimizing
+memory usage), and is typically faster than loading the whole data into memory.
+
+> If you just needed to load the complete file contents into memory, you can use
+> `dvc.api.read()` instead:
 >
 > ```py
+> from xml.dom.minidom import parse
+> import dvc.api
+>
 > xmldata = dvc.api.read('get-started/data.xml',
 >     repo='https://github.com/iterative/dataset-registry')
 > xmldom = parse(xmldata)
 > ```
 
-Now let's imagine you want to deserialize and use a binary model from a private
-repo. For a case like this, we can use an SSH URL instead (assuming the
+## Example: Accessing private repos
+
+This is just a matter of using the right `repo` argument, for example an SSH URL
+(requires that the
 [credentials are configured](https://help.github.com/en/github/authenticating-to-github/connecting-to-github-with-ssh)
 locally):
 
 ```py
-import pickle
 import dvc.api
 
 with dvc.api.open(
-        'model.pkl',
+        'features.dat',
         repo='[email protected]:path/to/repo.git'
         ) as fd:
-    model = pickle.load(fd)
-    # ... Use instanciated model
+    # ... Process 'features'
 ```
 
 ## Example: Use different versions of data
@@ -149,7 +158,7 @@ with dvc.api.open(
         rev='v1.1.0'
         ) as fd:
     reader = csv.reader(fd)
-    # ... Read clean data from version 1.1.0
+    # ... Process 'clean' data from version 1.1.0
 ```
 
 Also, notice that we didn't supply a `repo` argument in this example. DVC will
@@ -158,17 +167,6 @@ directory tree, and look for the file contents of `clean.csv` in its local
 <abbr>cache</abbr>; no download will happen if found. See the
 [Parameters](#parameters) section for more info.
 
-Note: to specify the file encoding of a text file, use:
-
-```py
-import dvc.api
-
-with dvc.api.open(
-        'data/nlp/words_ru.txt',
-        encoding='koi8_r') as fd:
-    # ...
-```
-
 ## Example: Chose a specific remote as the data source
 
 Sometimes we may want to choose the [remote](/doc/command-reference/remote) data
@@ -185,5 +183,18 @@ with open(
         ) as fd:
     for line in fd:
         match = re.search(r'user=(\w+)', line)
-        # ...
+        # ... Process users activity log
+```
+
+## Example: Specify the text encoding
+
+To chose which codec to open a text file with, send an `encoding` argument:
+
+```py
+import dvc.api
+
+with dvc.api.open(
+        'data/nlp/words_ru.txt',
+        encoding='koi8_r') as fd:
+    # ... Process Russian words
 ```
diff --git a/public/static/docs/api-reference/read.md b/public/static/docs/api-reference/read.md
@@ -28,16 +28,17 @@ This function wraps [`dvc.api.open()`](/doc/api-reference/open), for a simple
 way to return the complete contents of a file tracked in a <abbr>DVC
 project</abbr>. The file can be tracked by DVC or by Git.
 
+> This is similar to the `dvc get` command in our CLI.
+
 The returned contents can be a
 [string](https://docs.python.org/3/library/stdtypes.html#text-sequence-type-str)
 or a [bytearray](https://docs.python.org/3/library/stdtypes.html#bytearray).
+These are loaded to memory directly (without using any disc space).
 
 > The type returned depends on the `mode` used. For more details, please refer
 > to Python's [`open()`](https://docs.python.org/3/library/functions.html#open)
 > built-in, which is used under the hood.
 
-> This is similar to the `dvc get` command in our CLI.
-
 ## Parameters
 
 - **`path`** - location and file name of the file in `repo`, relative to the
@@ -80,11 +81,9 @@ or a [bytearray](https://docs.python.org/3/library/stdtypes.html#bytearray).
 
 ## Example: Load data from a DVC repository
 
-Any <abbr>data artifact</abbr> can be employed directly in your Python app by
-using this API.
-
-For example, let's say that you want to unserialize and use a binary model from
-an online repo:
+Any <abbr>data artifact</abbr> hosted online can be loaded directly in your
+Python code with this API. For example, let's say that you want to load and
+unserialize a binary model from a repo on Github:
 
 ```py
 import pickle