-
Notifications
You must be signed in to change notification settings - Fork 394
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
blog: remote optimization post #1474
Conversation
From the original PR: Not sure if the initial draft is too in depth/technical. @andronovhopf I'd appreciate it if you can take a look at this and give some suggestions on how to make it more interesting/applicable for users from an ML perspective |
Really nice start! The technical aspects are clear and the approach is easy to understand. I'm putting some comments in the draft. |
Probably the biggest comment is to do with weaving together the sections of the blog. To me, it was hard on a first read to understand how core sections ("DVC cache/remote structure", "Optimizing remote status queries", and "Indexing versioned directories") were related. Maybe changing the titles to something that explains why we need this info would help. For example: "Understanding how DVC tracks files", "Remote status queries are a performance bottleneck", and "How DVC 1.0 avoids unneccessary status queries" - assuming I've understood these ideas correctly. Similarly, adding some more sentences to tie ideas together- "And here's where all those status queries really add up: when a user goes to |
@andronovhopf thanks for all the suggestions. I pushed a new revision that breaks things up into a few more sections and reorganizes things. I think (hopefully) that the different sections should be tied together better and flow more smoothly now. |
CI link check fails because one of the post images 404s, but it looks like it's checking the actual dvc.org url and not the PR deployment url? |
this is expected, #1000 should fix this, for now we can ignore this kind of errors. |
Hey everyone! Seems I forgot to change the GitHub icon to inherit color like the other SVGs. Just pushed a change to that now! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really great post. A lot of details and deep insights. Plots.
It has a chance to be interesting for a wide audience of people. However, to make it happened all DVC-specific stuff needs to be significantly reduced. We should not assume a reader knows what is DVC. Is it possible? :)
@dmpetrov @shcheklein would appreciate any feedback on the latest revision, I worked more on generalizing the problem and solutions from a higher level and tried to strip out the unnecessary dvc specific internal details where possible |
@pmrowla it's great overall!! But we need to do another iteration to make it less DVC/ML specific and simplify.
|
Pushed a revision with a simplified intro and removed ML specific references.
I'm not actually sure if we outperform rsync in 1.0 since we don't currently index local remotes, and the query method changes don't apply either (since both us and rsync use filesystem calls here). But I'm looking into it, and will update the benchmark graphic once I've finished with this edit: for some reason I forgot that rsync works over ssh, so we should outperform it in that case |
Benchmark graphic has been simplified and now has a short explanation section. Regarding rsync comparison, our performance in 1.0 is comparable to rsync's for SSH remotes, so I'm not sure if it's worth adding to the blog post at this point. It looks like whatever advantage we have when it comes to querying file existence on the remote machine is lost because we are slower than rsync when it comes to collecting the list of files in our local cache (this is probably an issue we need to look into) |
I've also moved the placeholder date to Friday 07.03 for now |
* cut intro * add more benchmark cases
14666c5
to
960afba
Compare
960afba
to
30b610f
Compare
@shcheklein This still needs a discuss.dvc.org thread (I don't have the permissions to create one in the blog category), but other than that this should be ready to publish. |
@pmrowla thanks, Peter! I'll create and update the comments links. @elleobrien probably publishing it on TG doesn't make much sense? What do you think? |
Sorry, what is TG?
… On Nov 25, 2020, at 9:25 PM, Ivan Shcheklein ***@***.***> wrote:
@pmrowla thanks, Peter! I'll create and update the comments links. @elleobrien probably publishing it on TG doesn't make much sense? What do you think?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or unsubscribe.
|
Ah. Thanksgiving. It’s fine to merge then and we can hype it on social media Monday.
… On Nov 25, 2020, at 9:30 PM, Elle O'Brien ***@***.***> wrote:
Sorry, what is TG?
>> On Nov 25, 2020, at 9:25 PM, Ivan Shcheklein ***@***.***> wrote:
>>
>
> @pmrowla thanks, Peter! I'll create and update the comments links. @elleobrien probably publishing it on TG doesn't make much sense? What do you think?
>
> —
> You are receiving this because you were mentioned.
> Reply to this email directly, view it on GitHub, or unsubscribe.
|
Co-authored-by: Jorge Orpinel <[email protected]>
moved #1451 to an upstream branch
❗ Please read the guidelines in the Contributing to the Documentation list if you make any substantial changes to the documentation or JS engine.
🐛 Please make sure to mention
Fix #issue
(if applicable) in the description of the PR. This causes GitHub to close it automatically when the PR is merged.Please choose to allow us to edit your branch when creating the PR.
Thank you for the contribution - we'll try to review it as soon as possible. 🙏
Initial draft for the remote optimization write up
TODO