Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

api ref: better explanation on disc and memory usage for read/open #1037

Merged
merged 9 commits into from
Mar 9, 2020

Conversation

jorgeorpinel
Copy link
Contributor

Extracted from #908 (review)

super relevant example would to show SAX or StAX parser instead of a DOM one - that's where it shines. Or we can make CSV example the main one and show how we process it in steam fashion (e.g. calculating sum or avg) - it would show the "streaming" aspect of the open() way better.

@shcheklein shcheklein temporarily deployed to dvc-landing-api-lc21uy0sjqxwec March 8, 2020 23:55 Inactive
@jorgeorpinel jorgeorpinel mentioned this pull request Mar 8, 2020
1 task
@jorgeorpinel jorgeorpinel requested a review from shcheklein March 8, 2020 23:56
@jorgeorpinel jorgeorpinel temporarily deployed to dvc-landing-api-lc21uy0sjqxwec March 9, 2020 00:01 Inactive
@jorgeorpinel
Copy link
Contributor Author

For now I've implemented a SAX XML example in the doc, but let's continue the discussion on what we want to show in examples.

What's the advantage of streaming files in open/read? Probably just making a big file available quickly so you can start processing it before it's all downloaded.

make CSV example the main one and show how we process it in steam fashion (e.g. calculating sum or avg

I don't think you'd want to show the progress of a real-time processing, or is that a major use case you guys see?

@jorgeorpinel jorgeorpinel temporarily deployed to dvc-landing-api-lc21uy0sjqxwec March 9, 2020 00:16 Inactive
@jorgeorpinel jorgeorpinel temporarily deployed to dvc-landing-api-lc21uy0sjqxwec March 9, 2020 00:43 Inactive
@@ -45,8 +45,8 @@ file can be tracked by DVC or by Git.
This function makes a direct connection to the
[remote storage](/doc/command-reference/remote/add#supported-storage-types)
(except for Google Drive), so the file contents can be streamed as they are
read. This means it does not require space on the disc to save the file before
making it accessible. The only exception is when using Google Drive as
read. This means it does not require disc space to save the file before making
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are read -> is read? (similar to data, it's strange to see contents are)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"file contents are" seems correct to me. I can change it to "file content is" though. Same meaning

Copy link
Member

@shcheklein shcheklein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

to emphasize about disc and memory usage in each one
@jorgeorpinel jorgeorpinel temporarily deployed to dvc-landing-api-lc21uy0sjqxwec March 9, 2020 00:55 Inactive
@jorgeorpinel jorgeorpinel changed the title dvc.api.open() streaming example api ref: better explanation on disc and memory usage for read/open Mar 9, 2020
@jorgeorpinel jorgeorpinel marked this pull request as ready for review March 9, 2020 00:56
@jorgeorpinel jorgeorpinel requested a review from shcheklein March 9, 2020 01:26
@jorgeorpinel jorgeorpinel temporarily deployed to dvc-landing-api-lc21uy0sjqxwec March 9, 2020 01:26 Inactive
@jorgeorpinel jorgeorpinel temporarily deployed to dvc-landing-api-lc21uy0sjqxwec March 9, 2020 01:29 Inactive
[remote type](/doc/command-reference/remote/add#supported-storage-types).
(except for Google Drive), so the file contents can be streamed. Your code can
process the data [buffer](https://docs.python.org/3/c-api/buffer.html) as it's
streamed, which optimizes memory usage.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

`dvc.api.open()` is able to stream the data download. (The `mySAXHandler` object
should handle the event-driven parsing of the document in this case.) This
increases the performance of the code (minimizing memory usage), and is
typically faster than loading the whole data into memory.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Member

@shcheklein shcheklein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!

@jorgeorpinel jorgeorpinel temporarily deployed to dvc-landing-api-lc21uy0sjqxwec March 9, 2020 01:34 Inactive
@jorgeorpinel jorgeorpinel merged commit 08f7301 into master Mar 9, 2020
@jorgeorpinel
Copy link
Contributor Author

Woot woot 🎉

Cc @Suor feel fee to review this port-merge. Thanks!

@shcheklein shcheklein deleted the api branch March 27, 2020 20:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants