api ref: better explanation on disc and memory usage for read/open #1037
jorgeorpinel
commented
Mar 8, 2020
For now I've implemented a SAX XML example in the doc, but let's continue the discussion on what we want to show in examples. What's the advantage of streaming files in open/read? Probably just making a big file available quickly so you can start processing it before it's all downloaded.
I don't think you'd want to show the progress of real-time processing, or is that a major use case you guys see?
@@ -45,8 +45,8 @@ file can be tracked by DVC or by Git.
This function makes a direct connection to the
[remote storage](/doc/command-reference/remote/add#supported-storage-types)
(except for Google Drive), so the file contents can be streamed as they are
read. This means it does not require space on the disc to save the file before
making it accessible. The only exception is when using Google Drive as
read. This means it does not require disc space to save the file before making
"are read" -> "is read"? (Similar to "data", it's strange to see "contents are".)
"file contents are" seems correct to me. I can change it to "file content is" though. Same meaning
Looks good!
to emphasize about disc and memory usage in each one
[remote type](/doc/command-reference/remote/add#supported-storage-types).
(except for Google Drive), so the file contents can be streamed. Your code can
process the data [buffer](https://docs.python.org/3/c-api/buffer.html) as it's
streamed, which optimizes memory usage.
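To make the memory claim in this hunk concrete, here is a minimal sketch of chunked processing over a stream. It is not part of the PR: `sha256_of_stream`, `sha256_of_tracked_file`, and `CHUNK_SIZE` are hypothetical names, and the DVC-backed function assumes `dvc` is installed and that `dvc.api.open()` is called with `mode="rb"` against a reachable remote. The runnable part uses an in-memory stand-in instead.

```python
import hashlib
import io

CHUNK_SIZE = 64 * 1024  # hypothetical chunk size; tune for your workload


def sha256_of_stream(stream, chunk_size=CHUNK_SIZE):
    """Hash a binary file-like object chunk by chunk.

    Only one chunk is held in memory at a time, regardless of file size.
    """
    digest = hashlib.sha256()
    # iter() with a sentinel keeps calling read() until it returns b"" (EOF).
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def sha256_of_tracked_file(path, repo=None):
    """Same idea against a DVC-tracked file (assumption: dvc is installed)."""
    import dvc.api  # imported lazily so the rest of the sketch runs without DVC

    with dvc.api.open(path, repo=repo, mode="rb") as fd:
        return sha256_of_stream(fd)


if __name__ == "__main__":
    # Local stand-in for a remote stream, so the sketch runs anywhere.
    data = b"x" * 200_000
    print(sha256_of_stream(io.BytesIO(data)) == hashlib.sha256(data).hexdigest())
```

The point of the sketch is that peak memory is bounded by the chunk size rather than the file size, which is the behavior the doc text is describing.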
👍
`dvc.api.open()` is able to stream the data download. (The `mySAXHandler` object
should handle the event-driven parsing of the document in this case.) This
increases the performance of the code (minimizing memory usage), and is
typically faster than loading the whole data into memory.
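For readers wondering what a `mySAXHandler` could look like, here is a rough sketch, not the handler from the doc. `TagCounter`, `count_tags`, and `count_tags_in_tracked_file` are hypothetical names chosen for illustration, and the DVC-backed function assumes `dvc` is installed and the path and repo point at a real tracked XML file. The runnable part parses an in-memory document instead.

```python
import io
import xml.sax


class TagCounter(xml.sax.ContentHandler):
    """Hypothetical handler: counts elements by tag name as parse events arrive."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        # Called once per opening tag; no document tree is ever built.
        self.counts[name] = self.counts.get(name, 0) + 1


def count_tags(stream):
    """Run the SAX parser over any file-like object and return the tag counts."""
    handler = TagCounter()
    xml.sax.parse(stream, handler)
    return handler.counts


def count_tags_in_tracked_file(path, repo=None):
    """Same parse against a DVC-tracked file (assumption: dvc is installed)."""
    import dvc.api  # imported lazily so the rest of the sketch runs without DVC

    with dvc.api.open(path, repo=repo) as fd:
        return count_tags(fd)


if __name__ == "__main__":
    # In-memory stand-in for the streamed download.
    sample = io.BytesIO(b"<rows><row/><row/><row/></rows>")
    print(count_tags(sample))
```

Because the handler only keeps running counters, memory stays flat while the parser consumes the stream, which is the advantage the paragraph above attributes to event-driven parsing.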
👍
Looks great!
Woot woot 🎉 Cc @Suor feel free to review this post-merge. Thanks!