blog: remote optimization post #1474

Merged: 27 commits, Nov 26, 2020

pmrowla
Contributor

@pmrowla pmrowla commented Jun 22, 2020

moved #1451 to an upstream branch

Initial draft for the remote optimization write up

TODO

  • improve introduction
  • needs conclusion
  • update placeholder image
  • update placeholder date

@pmrowla pmrowla self-assigned this Jun 22, 2020
@shcheklein shcheklein temporarily deployed to dvc-landing-blog-remote-rbug3z June 22, 2020 05:50 Inactive
@pmrowla
Contributor Author

pmrowla commented Jun 22, 2020

From the original PR:

Not sure if the initial draft is too in depth/technical.

@andronovhopf I'd appreciate it if you can take a look at this and give some suggestions on how to make it more interesting/applicable for users from an ML perspective

@pmrowla pmrowla changed the title blog: remote optimization post [WIP] blog: remote optimization post Jun 22, 2020
@shcheklein shcheklein temporarily deployed to dvc-landing-blog-remote-hvxbdz June 23, 2020 08:37 Inactive
@pmrowla pmrowla changed the title [WIP] blog: remote optimization post blog: remote optimization post Jun 23, 2020
@elleobrien
Contributor

Really nice start! The technical aspects are clear and the approach is easy to understand. I'm putting some comments in the draft.

@elleobrien
Contributor

Probably the biggest comment is to do with weaving together the sections of the blog. To me, it was hard on a first read to understand how the core sections ("DVC cache/remote structure", "Optimizing remote status queries", and "Indexing versioned directories") were related. Maybe changing the titles to something that explains why we need this info would help. For example: "Understanding how DVC tracks files", "Remote status queries are a performance bottleneck", and "How DVC 1.0 avoids unnecessary status queries" - assuming I've understood these ideas correctly.

Similarly, adding some more sentences to tie ideas together would help, for example: "And here's where all those status queries really add up: when a user goes to dvc add a new file to a big DVC-tracked directory. Before 1.0, DVC had to check that EVERY file in the directory existed..."

content/blog/2020-06-29-optimizing-dvc-remotes.md (outdated, resolved)
@shcheklein shcheklein temporarily deployed to dvc-landing-blog-remote-hvxbdz June 24, 2020 12:02 Inactive
@pmrowla
Contributor Author

pmrowla commented Jun 24, 2020

@andronovhopf thanks for all the suggestions. I pushed a new revision that breaks things up into a few more sections and reorganizes things. I think (hopefully) that the different sections should be tied together better and flow more smoothly now.

@pmrowla
Contributor Author

pmrowla commented Jun 24, 2020

CI link check fails because one of the post images 404s, but it looks like it's checking the actual dvc.org url and not the PR deployment url?

@shcheklein
Member

CI link check fails because one of the post images 404s, but it looks like it's checking the actual dvc.org url and not the PR deployment url?

This is expected; #1000 should fix it. For now we can ignore this kind of error.

@rogermparent rogermparent temporarily deployed to dvc-landing-blog-remote-hvxbdz June 25, 2020 00:20 Inactive
@rogermparent
Contributor

Hey everyone! Seems I forgot to change the GitHub icon to inherit color like the other SVGs. Just pushed a change to that now!

@shcheklein shcheklein temporarily deployed to dvc-landing-blog-remote-hvxbdz June 25, 2020 09:21 Inactive
Member

@dmpetrov dmpetrov left a comment

Really great post. A lot of details and deep insights. Plots.

It has a chance to be interesting for a wide audience. However, to make that happen, all the DVC-specific material needs to be significantly reduced. We should not assume the reader knows what DVC is. Is it possible? :)

content/authors/peter_rowlands.md (resolved)
content/blog/2020-06-29-optimizing-dvc-remotes.md (outdated, resolved; 9 review threads)
@shcheklein shcheklein temporarily deployed to dvc-landing-blog-remote-hvxbdz June 26, 2020 16:21 Inactive
@pmrowla
Contributor Author

pmrowla commented Jun 26, 2020

@dmpetrov @shcheklein I would appreciate any feedback on the latest revision. I worked more on generalizing the problem and solutions at a higher level, and tried to strip out unnecessary DVC-specific internal details where possible.

@shcheklein
Member

@pmrowla it's great overall!!

But we need to do another iteration to make it less DVC/ML specific and simplify.

  1. The introduction (everything that comes before "Why status queries are a performance bottleneck") - let's make it general: in ML (and not only there) we deal with a lot of files -> we usually store them in clouds (to share and back up) -> we need to sync them to/from a user's machine, a remote machine to run scripts, etc.
  2. If no ideas - let's start even with Why status queries are a performance bottleneck section and omit the whole intro.
  3. However, DVC is a specialized tool designed specifically for ML. - here and in similar cases, let's not emphasize that it is ML-specific. DVC data management is a general layer that can be used to track any large files, directories. Cache structure is not dictated by ML scenarios.
  4. Image - we compare dvc and rclone but we use dvc status -c as a scenario description - we should come up with some human-readable names and explain the scenarios better.
  5. does it make sense, should we compare with rsync?

@shcheklein shcheklein temporarily deployed to dvc-landing-blog-remote-hvxbdz June 30, 2020 09:51 Inactive
@pmrowla
Contributor Author

pmrowla commented Jun 30, 2020

Pushed a revision with a simplified intro and removed ML specific references.

does it make sense, should we compare with rsync?

I'm not actually sure if we outperform rsync in 1.0, since we don't currently index local remotes and the query method changes don't apply either (both we and rsync use filesystem calls here). I'm looking into it and will update the benchmark graphic once I've finished.

edit: for some reason I forgot that rsync works over ssh, so we should outperform it in that case

@shcheklein shcheklein temporarily deployed to dvc-landing-blog-remote-hvxbdz June 30, 2020 12:50 Inactive
@pmrowla
Contributor Author

pmrowla commented Jun 30, 2020

Benchmark graphic has been simplified and now has a short explanation section.

Regarding rsync comparison, our performance in 1.0 is comparable to rsync's for SSH remotes, so I'm not sure if it's worth adding to the blog post at this point.

It looks like whatever advantage we have in querying file existence on the remote machine is lost because we are slower than rsync at collecting the list of files in our local cache (this is probably an issue we need to look into).

@shcheklein

@shcheklein shcheklein temporarily deployed to dvc-landing-blog-remote-hvxbdz June 30, 2020 13:01 Inactive
@pmrowla
Contributor Author

pmrowla commented Jun 30, 2020

I've also moved the placeholder date to Friday 07.03 for now

@pmrowla pmrowla force-pushed the blog-remote-optimization branch from 14666c5 to 960afba Compare November 26, 2020 04:12
@shcheklein shcheklein temporarily deployed to dvc-landing-blog-remote-cvtwrp November 26, 2020 04:12 Inactive
@pmrowla pmrowla force-pushed the blog-remote-optimization branch from 960afba to 30b610f Compare November 26, 2020 04:13
@shcheklein shcheklein temporarily deployed to dvc-landing-blog-remote-cvtwrp November 26, 2020 04:13 Inactive
@shcheklein shcheklein temporarily deployed to dvc-landing-blog-remote-cvtwrp November 26, 2020 05:12 Inactive
@pmrowla
Contributor Author

pmrowla commented Nov 26, 2020

@shcheklein This still needs a discuss.dvc.org thread (I don't have the permissions to create one in the blog category), but other than that this should be ready to publish.

@shcheklein
Member

@pmrowla thanks, Peter! I'll create and update the comments links. @elleobrien probably publishing it on TG doesn't make much sense? What do you think?

@elleobrien
Contributor

elleobrien commented Nov 26, 2020 via email

@elleobrien
Contributor

elleobrien commented Nov 26, 2020 via email

Co-authored-by: Jorge Orpinel <[email protected]>
@shcheklein shcheklein temporarily deployed to dvc-landing-blog-remote-cvtwrp November 26, 2020 18:39 Inactive
@shcheklein shcheklein merged commit 89cd7dc into master Nov 26, 2020
@pmrowla pmrowla deleted the blog-remote-optimization branch November 27, 2020 00:02
8 participants