From 4a8111ee698a17038ffbc3f43ceef128f027de81 Mon Sep 17 00:00:00 2001
From: Jorge Orpinel
Date: Sun, 8 Mar 2020 17:53:37 -0600
Subject: [PATCH 1/9] api ref: change XML example in open to use SAX per https://github.com/iterative/dvc.org/pull/908#discussion_r388043786
---
 public/static/docs/api-reference/open.md | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/public/static/docs/api-reference/open.md b/public/static/docs/api-reference/open.md
index 99602624c7..3a90062e40 100644
--- a/public/static/docs/api-reference/open.md
+++ b/public/static/docs/api-reference/open.md
@@ -96,21 +96,28 @@ using this API. For example, an XML file tracked in a public DVC repo on Github
 can be processed directly in your Python app with:

 ```py
-from xml.dom.minidom import parse
+from xml.sax import parse
 import dvc.api
+from mymodule import mySAXHandler

 with dvc.api.open(
     'get-started/data.xml',
     repo='https://github.com/iterative/dataset-registry'
 ) as fd:
-    xmldom = parse(fd)
-    # ... Process DOM
+    parse(fd, mySAXHandler)
 ```

-> Notice that if you just need to load the complete file contents to memory, you
-> can use `dvc.api.read()` instead:
+Notice that we want to use a [SAX](http://www.saxproject.org/) XML parser here
+because `dvc.api.open()` is able to stream the file, the `mySAXHandler` object
+must handle the event-driven parsing of the document in this case.
+
+> If you just need to load the complete file contents to memory, you can use
+> `dvc.api.read()` instead:
 >
 > ```py
+> from xml.dom.minidom import parse
+> import dvc.api
+>
 > xmldata = dvc.api.read('get-started/data.xml',
 >     repo='https://github.com/iterative/dataset-registry')
 > xmldom = parse(xmldata)

From 12ba7d1ec22ce9eae213a87758d3801cc168519e Mon Sep 17 00:00:00 2001
From: Jorge Orpinel
Date: Sun, 8 Mar 2020 18:01:05 -0600
Subject: [PATCH 2/9] typo
---
 public/static/docs/api-reference/open.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/public/static/docs/api-reference/open.md b/public/static/docs/api-reference/open.md
index 3a90062e40..3f29991428 100644
--- a/public/static/docs/api-reference/open.md
+++ b/public/static/docs/api-reference/open.md
@@ -108,8 +108,8 @@ with dvc.api.open(
 ```

 Notice that we want to use a [SAX](http://www.saxproject.org/) XML parser here
-because `dvc.api.open()` is able to stream the file, the `mySAXHandler` object
-must handle the event-driven parsing of the document in this case.
+because `dvc.api.open()` is able to stream the file. The `mySAXHandler` object
+should handle the event-driven parsing of the document in this case.

 > If you just need to load the complete file contents to memory, you can use
 > `dvc.api.read()` instead:

From 59646e8a800430caab2151e88f679ecf6e22d5da Mon Sep 17 00:00:00 2001
From: Jorge Orpinel
Date: Sun, 8 Mar 2020 18:16:09 -0600
Subject: [PATCH 3/9] api ref: emphasize that open/read avoid using local file storage
---
 public/static/docs/api-reference/open.md | 6 +++---
 public/static/docs/api-reference/read.md | 9 ++++-----
 2 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/public/static/docs/api-reference/open.md b/public/static/docs/api-reference/open.md
index 3f29991428..282789d2f3 100644
--- a/public/static/docs/api-reference/open.md
+++ b/public/static/docs/api-reference/open.md
@@ -91,9 +91,9 @@ making it accessible. 
The only exception is when using Google Drive as

 ## Example: Use data or models from DVC repositories

-Any data artifact can be employed directly in your Python app by
-using this API. For example, an XML file tracked in a public DVC repo on Github
-can be processed directly in your Python app with:
+Any data artifact hosted online can be employed directly in your
+Python app (without requiring local file storage) with this API. For example, an
+XML file tracked in a public DVC repo on Github can be processed like this:

 ```py
 from xml.sax import parse
diff --git a/public/static/docs/api-reference/read.md b/public/static/docs/api-reference/read.md
index 4fc6def66f..ee481bcfbd 100644
--- a/public/static/docs/api-reference/read.md
+++ b/public/static/docs/api-reference/read.md
@@ -80,11 +80,10 @@ or a [bytearray](https://docs.python.org/3/library/stdtypes.html#bytearray).

 ## Example: Load data from a DVC repository

-Any data artifact can be employed directly in your Python app by
-using this API.
-
-For example, let's say that you want to unserialize and use a binary model from
-an online repo:
+Any data artifact hosted online can be employed directly in your
+Python app (without requiring local file storage) with this API. 
For example,
+let's say that you want to load and unserialize a binary model from a repo on
+Github:

 ```py
 import pickle

From 5b6e21df4a51be851c6946f4e8198d62ef17e5b7 Mon Sep 17 00:00:00 2001
From: Jorge Orpinel
Date: Sun, 8 Mar 2020 18:42:41 -0600
Subject: [PATCH 4/9] api ref: more impros to the examples and explanations in open and read
---
 public/static/docs/api-reference/open.md | 28 ++++++++++++------------
 public/static/docs/api-reference/read.md | 5 ++---
 2 files changed, 16 insertions(+), 17 deletions(-)

diff --git a/public/static/docs/api-reference/open.md b/public/static/docs/api-reference/open.md
index 282789d2f3..9b8e08cdbd 100644
--- a/public/static/docs/api-reference/open.md
+++ b/public/static/docs/api-reference/open.md
@@ -45,8 +45,8 @@ file can be tracked by DVC or by Git.
 This function makes a direct connection to the
 [remote storage](/doc/command-reference/remote/add#supported-storage-types)
 (except for Google Drive), so the file contents can be streamed as they are
-read. This means it does not require space on the disc to save the file before
-making it accessible. The only exception is when using Google Drive as
+read. This means it does not require disc space to save the file before making
+it accessible. The only exception is when using Google Drive as
 [remote type](/doc/command-reference/remote/add#supported-storage-types).

 ## Parameters
@@ -92,8 +92,8 @@ making it accessible. The only exception is when using Google Drive as
 ## Example: Use data or models from DVC repositories

 Any data artifact hosted online can be employed directly in your
-Python app (without requiring local file storage) with this API. For example, an
-XML file tracked in a public DVC repo on Github can be processed like this:
+Python app (no disc space needed) with this API. 
For example, an XML file
+tracked in a public DVC repo on Github can be processed like this:

 ```py
 from xml.sax import parse
@@ -107,11 +107,11 @@ with dvc.api.open(
     parse(fd, mySAXHandler)
 ```

-Notice that we want to use a [SAX](http://www.saxproject.org/) XML parser here
-because `dvc.api.open()` is able to stream the file. The `mySAXHandler` object
+Notice that we use a [SAX](http://www.saxproject.org/) XML parser here because
+`dvc.api.open()` is able to stream the data download. The `mySAXHandler` object
 should handle the event-driven parsing of the document in this case.

-> If you just need to load the complete file contents to memory, you can use
+> If you just needed to load the complete file contents into memory, you can use
 > `dvc.api.read()` instead:
 >
 > ```py
@@ -123,21 +123,21 @@ should handle the event-driven parsing of the document in this case.
 > xmldom = parse(xmldata)
 > ```

-Now let's imagine you want to deserialize and use a binary model from a private
-repo. For a case like this, we can use an SSH URL instead (assuming the
+## Example: Accessing private repos
+
+This is just a matter of using the right `repo` argument, for example an SSH URL
+(requires that the
 [credentials are configured](https://help.github.com/en/github/authenticating-to-github/connecting-to-github-with-ssh)
 locally):

 ```py
-import pickle
 import dvc.api

 with dvc.api.open(
-    'model.pkl',
+    'features.dat',
     repo='git@server.com:path/to/repo.git'
 ) as fd:
-    model = pickle.load(fd)
-    # ... Use instanciated model
+    # ... Process 'features'
 ```

 ## Example: Use different versions of data
@@ -156,7 +156,7 @@ with dvc.api.open(
     rev='v1.1.0'
 ) as fd:
     reader = csv.reader(fd)
-    # ... Read clean data from version 1.1.0
+    # ... Process 'clean' data from version 1.1.0
 ```

 Also, notice that we didn't supply a `repo` argument in this example. 
DVC will
diff --git a/public/static/docs/api-reference/read.md b/public/static/docs/api-reference/read.md
index ee481bcfbd..1f80485758 100644
--- a/public/static/docs/api-reference/read.md
+++ b/public/static/docs/api-reference/read.md
@@ -81,9 +81,8 @@ or a [bytearray](https://docs.python.org/3/library/stdtypes.html#bytearray).
 ## Example: Load data from a DVC repository

 Any data artifact hosted online can be employed directly in your
-Python app (without requiring local file storage) with this API. For example,
-let's say that you want to load and unserialize a binary model from a repo on
-Github:
+Python app (no disc space needed) with this API. For example, let's say that you
+want to load and unserialize a binary model from a repo on Github:

 ```py
 import pickle

From 1767042a599303ded6f8da846793ff9c0e845947 Mon Sep 17 00:00:00 2001
From: Jorge Orpinel
Date: Sun, 8 Mar 2020 18:54:38 -0600
Subject: [PATCH 5/9] api ref: more impros to desc and examples of open and read... to emphasize about disc and memory usage in each one
---
 public/static/docs/api-reference/open.md | 37 +++++++++++++-----------
 public/static/docs/api-reference/read.md | 3 ++
 2 files changed, 23 insertions(+), 17 deletions(-)

diff --git a/public/static/docs/api-reference/open.md b/public/static/docs/api-reference/open.md
index 9b8e08cdbd..8faae29d12 100644
--- a/public/static/docs/api-reference/open.md
+++ b/public/static/docs/api-reference/open.md
@@ -45,9 +45,8 @@ file can be tracked by DVC or by Git.
 This function makes a direct connection to the
 [remote storage](/doc/command-reference/remote/add#supported-storage-types)
 (except for Google Drive), so the file contents can be streamed as they are
-read. This means it does not require disc space to save the file before making
-it accessible. The only exception is when using Google Drive as
-[remote type](/doc/command-reference/remote/add#supported-storage-types).
+downloaded. 
No disc space and very little memory are needed to save the file
+before making it accessible.

 ## Parameters

@@ -108,8 +107,10 @@ with dvc.api.open(
 ```

 Notice that we use a [SAX](http://www.saxproject.org/) XML parser here because
-`dvc.api.open()` is able to stream the data download. The `mySAXHandler` object
-should handle the event-driven parsing of the document in this case.
+`dvc.api.open()` is able to stream the data download. (The `mySAXHandler` object
+should handle the event-driven parsing of the document in this case.) This
+increases the performance of the code, since very little memory is needed, and
+is typically faster than loading the whole data into memory.

 > If you just needed to load the complete file contents into memory, you can use
 > `dvc.api.read()` instead:
@@ -165,17 +166,6 @@ directory tree, and look for the file contents of `clean.csv` in its local
 cache; no download will happen if found. See the [Parameters](#parameters)
 section for more info.

-Note: to specify the file encoding of a text file, use:
-
-```py
-import dvc.api
-
-with dvc.api.open(
-    'data/nlp/words_ru.txt',
-    encoding='koi8_r') as fd:
-    # ...
-```
-
 ## Example: Chose a specific remote as the data source

 Sometimes we may want to choose the [remote](/doc/command-reference/remote) data
@@ -192,5 +182,18 @@ with open(
 ) as fd:
     for line in fd:
         match = re.search(r'user=(\w+)', line)
-        # ...
+        # ... Process users activity log
+```
+
+## Example: Specify the text encoding
+
+To chose which codec to open a text file with, send an `encoding` argument:
+
+```py
+import dvc.api
+
+with dvc.api.open(
+    'data/nlp/words_ru.txt',
+    encoding='koi8_r') as fd:
+    # ... 
Process Russian words
 ```
diff --git a/public/static/docs/api-reference/read.md b/public/static/docs/api-reference/read.md
index 1f80485758..afd1a2ea85 100644
--- a/public/static/docs/api-reference/read.md
+++ b/public/static/docs/api-reference/read.md
@@ -38,6 +38,9 @@ or a [bytearray](https://docs.python.org/3/library/stdtypes.html#bytearray).

 > This is similar to the `dvc get` command in our CLI.

+No disc space is needed to save the file before loading it to memory in order to
+make the file accessible.
+
 ## Parameters

 - **`path`** - location and file name of the file in `repo`, relative to the

From 958939b724638a5373e0f6910ced2bee7942a925 Mon Sep 17 00:00:00 2001
From: Jorge Orpinel
Date: Sun, 8 Mar 2020 19:25:49 -0600
Subject: [PATCH 6/9] api ref: more impros to desc and examples of open and read
---
 public/static/docs/api-reference/open.md | 18 +++++++++---------
 public/static/docs/api-reference/read.md | 14 ++++++--------
 2 files changed, 15 insertions(+), 17 deletions(-)

diff --git a/public/static/docs/api-reference/open.md b/public/static/docs/api-reference/open.md
index 8faae29d12..afc07c724a 100644
--- a/public/static/docs/api-reference/open.md
+++ b/public/static/docs/api-reference/open.md
@@ -39,14 +39,14 @@ file can be tracked by DVC or by Git.
 [context manager](https://www.python.org/dev/peps/pep-0343/#context-managers-in-the-standard-library)
 (using the `with` keyword, as shown in the examples).

-> Use `dvc.api.read()` to get the complete file contents in a single function
-> call – no _context manager_ involved.
-
 This function makes a direct connection to the
 [remote storage](/doc/command-reference/remote/add#supported-storage-types)
-(except for Google Drive), so the file contents can be streamed as they are
-downloaded. No disc space and very little memory are needed to save the file
-before making it accessible.
+(except for Google Drive), so the file contents can be streamed. 
Your code can
+process the data [buffer](https://docs.python.org/3/c-api/buffer.html) as it's
+streamed, which optimizes memory usage.
+
+> Use `dvc.api.read()` to load the complete file contents in a single function
+> call – no _context manager_ involved. Neither function utilizes disc space.

 ## Parameters

@@ -90,9 +90,9 @@ before making it accessible.

 ## Example: Use data or models from DVC repositories

-Any data artifact hosted online can be employed directly in your
-Python app (no disc space needed) with this API. For example, an XML file
-tracked in a public DVC repo on Github can be processed like this:
+Any data artifact hosted online can be processed directly in your
+Python app with this API. For example, an XML file tracked in a public DVC repo
+on Github can be processed like this:

 ```py
 from xml.sax import parse
diff --git a/public/static/docs/api-reference/read.md b/public/static/docs/api-reference/read.md
index afd1a2ea85..7ba2fe25c8 100644
--- a/public/static/docs/api-reference/read.md
+++ b/public/static/docs/api-reference/read.md
@@ -28,19 +28,17 @@ This function wraps [`dvc.api.open()`](/doc/api-reference/open), for a simple
 way to return the complete contents of a file tracked in a DVC project. The
 file can be tracked by DVC or by Git.

+> This is similar to the `dvc get` command in our CLI.
+
 The returned contents can be a
 [string](https://docs.python.org/3/library/stdtypes.html#text-sequence-type-str)
 or a [bytearray](https://docs.python.org/3/library/stdtypes.html#bytearray).
+These are loaded to memory directly (without using any disc space).

 > The type returned depends on the `mode` used. For more details, please refer
 > to Python's [`open()`](https://docs.python.org/3/library/functions.html#open)
 > built-in, which is used under the hood.

-> This is similar to the `dvc get` command in our CLI.
-
-No disc space is needed to save the file before loading it to memory in order to
-make the file accessible. 
-
 ## Parameters

 - **`path`** - location and file name of the file in `repo`, relative to the
@@ -83,9 +81,9 @@ make the file accessible.

 ## Example: Load data from a DVC repository

-Any data artifact hosted online can be employed directly in your
-Python app (no disc space needed) with this API. For example, let's say that you
-want to load and unserialize a binary model from a repo on Github:
+Any data artifact hosted online can be loaded directly in your
+Python app with this API. For example, let's say that you want to load and
+unserialize a binary model from a repo on Github:

 ```py
 import pickle

From 6d95f7278068b7b8faeaabec3f3969eb36a31c2a Mon Sep 17 00:00:00 2001
From: Jorge Orpinel
Date: Sun, 8 Mar 2020 19:28:32 -0600
Subject: [PATCH 7/9] api ref: better wordking in open example
---
 public/static/docs/api-reference/open.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/public/static/docs/api-reference/open.md b/public/static/docs/api-reference/open.md
index afc07c724a..ecfea05c8c 100644
--- a/public/static/docs/api-reference/open.md
+++ b/public/static/docs/api-reference/open.md
@@ -109,8 +109,8 @@ with dvc.api.open(
 Notice that we use a [SAX](http://www.saxproject.org/) XML parser here because
 `dvc.api.open()` is able to stream the data download. (The `mySAXHandler` object
 should handle the event-driven parsing of the document in this case.) This
-increases the performance of the code, since very little memory is needed, and
-is typically faster than loading the whole data into memory.
+increases the performance of the code (minimizing memory usage), and is
+typically faster than loading the whole data into memory. 

 > If you just needed to load the complete file contents into memory, you can use
 > `dvc.api.read()` instead:

From fcafa7be72d4933c8c4c9d51e13861e554541301 Mon Sep 17 00:00:00 2001
From: Jorge Orpinel
Date: Sun, 8 Mar 2020 19:31:05 -0600
Subject: [PATCH 8/9] api ref: term "Python app" -> "Python code"
---
 public/static/docs/api-reference/index.md | 2 +-
 public/static/docs/api-reference/open.md | 2 +-
 public/static/docs/api-reference/read.md | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/public/static/docs/api-reference/index.md b/public/static/docs/api-reference/index.md
index fbeaf79a26..0298b185ab 100644
--- a/public/static/docs/api-reference/index.md
+++ b/public/static/docs/api-reference/index.md
@@ -10,7 +10,7 @@ import dvc.api

 The purpose of this API is to provide programatic access to the data or models
 [stored and versioned](/doc/use-cases/versioning-data-and-model-files) in
-DVC repositories from Python apps.
+DVC repositories from Python code.

 Please choose a function from the navigation sidebar to the left, or click the
 `Next` button below to jump into the first one ↘
diff --git a/public/static/docs/api-reference/open.md b/public/static/docs/api-reference/open.md
index ecfea05c8c..1558646ce9 100644
--- a/public/static/docs/api-reference/open.md
+++ b/public/static/docs/api-reference/open.md
@@ -91,7 +91,7 @@ streamed, which optimizes memory usage.
 ## Example: Use data or models from DVC repositories

 Any data artifact hosted online can be processed directly in your
-Python app with this API. For example, an XML file tracked in a public DVC repo
+Python code with this API. 
For example, an XML file tracked in a public DVC repo
 on Github can be processed like this:

 ```py
diff --git a/public/static/docs/api-reference/read.md b/public/static/docs/api-reference/read.md
index 7ba2fe25c8..e83ae063b0 100644
--- a/public/static/docs/api-reference/read.md
+++ b/public/static/docs/api-reference/read.md
@@ -82,7 +82,7 @@ These are loaded to memory directly (without using any disc space).
 ## Example: Load data from a DVC repository

 Any data artifact hosted online can be loaded directly in your
-Python app with this API. For example, let's say that you want to load and
+Python code with this API. For example, let's say that you want to load and
 unserialize a binary model from a repo on Github:

 ```py

From 7167c3fb8f63c6ee0fb20dd926edaa751c08a3c6 Mon Sep 17 00:00:00 2001
From: Jorge Orpinel
Date: Sun, 8 Mar 2020 19:33:49 -0600
Subject: [PATCH 9/9] api ref: phrase "data download" -> "data from remote storage"
---
 public/static/docs/api-reference/open.md | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/public/static/docs/api-reference/open.md b/public/static/docs/api-reference/open.md
index 1558646ce9..858090e0fb 100644
--- a/public/static/docs/api-reference/open.md
+++ b/public/static/docs/api-reference/open.md
@@ -107,10 +107,11 @@ with dvc.api.open(
 ```

 Notice that we use a [SAX](http://www.saxproject.org/) XML parser here because
-`dvc.api.open()` is able to stream the data download. (The `mySAXHandler` object
-should handle the event-driven parsing of the document in this case.) This
-increases the performance of the code (minimizing memory usage), and is
-typically faster than loading the whole data into memory.
+`dvc.api.open()` is able to stream the data from
+[remote storage](/doc/command-reference/remote/add#supported-storage-types).
+(The `mySAXHandler` object should handle the event-driven parsing of the
+document in this case.) 
This increases the performance of the code (minimizing
+memory usage), and is typically faster than loading the whole data into memory.

 > If you just needed to load the complete file contents into memory, you can use
 > `dvc.api.read()` instead:
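
Note on the `open.md` example touched by this series: `mySAXHandler` is imported from a placeholder `mymodule` and is never defined in these patches. As a minimal sketch only, assuming the handler is meant to be an instance of an `xml.sax.handler.ContentHandler` subclass, something like the following could fill that role. The `TagCounter` class, the `count_tags` helper, and the in-memory sample document are hypothetical stand-ins, and an `io.BytesIO` object replaces the stream that `dvc.api.open()` would yield so the snippet runs without DVC.

```python
import io
from collections import Counter
from xml.sax import parse
from xml.sax.handler import ContentHandler


class TagCounter(ContentHandler):
    """Hypothetical handler: counts element names as parse events arrive."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def startElement(self, name, attrs):
        # Called once per opening tag; no DOM tree is ever built.
        self.counts[name] += 1


def count_tags(stream):
    """Feed any file-like object to the SAX parser and return the tag counts."""
    handler = TagCounter()
    parse(stream, handler)
    return handler.counts


# Stand-in for the file object yielded by dvc.api.open() (assumption).
sample = io.BytesIO(b"<posts><row Id='1'/><row Id='2'/></posts>")
counts = count_tags(sample)
```

In the documented example, the `fd` object yielded by `dvc.api.open()` would presumably be passed where `sample` is used here, so that elements are handled as the data streams in instead of after the whole document has been loaded.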