dev: do not store images in Git #1115
@shcheklein where should we store the images? Should we use an S3 bucket with public access and serve them from there?
@fabiosantoscode for blog.dvc.org specifically? Pick any remote storage to experiment with. We'll probably be storing on S3 when it's done.
@shcheklein I've looked deeply into Gatsby hooks and the source-filesystem plugin (which can generate files from Buffer objects as well as real files), and decided to simplify to the max and work up from there. So I used … Plus, you don't need to restart the dev server. All that's missing, I think, is running … Am I on the right track? I don't see too many problems with this approach, and it's pretty simple, so it shouldn't be too hard to maintain.

Edit: I'm not tackling image optimization itself, since Gatsby has plugins for that. I think it's OK to store large images on S3 and have them be optimized at build time.
An interesting problem we have now is third-party contributions. If a regular user were to create a post, they would have to give us the image out-of-band.
Once the PR is merged, that "data PR DVC file" can be turned into a regular import-url DVC file by an insider with S3/Google Drive permissions (or a GitHub Action!), who simply runs a command, which downloads it from iterative and places it in the desired remote. It might also help coworkers "branch off" their large data files and then have them imported back into master with ease. It's still possible for a team to do this by hand by making the files available on some personal HTTP server or public S3 bucket and adding …
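For reference, an import-url DVC file along the lines discussed above might look roughly like this (the URL, hashes, and file name below are invented for illustration; exact fields depend on the DVC version):

```yaml
# cover.png.dvc — records where the image came from and its checksums
md5: 4f8b7e0c0f0d2a1b3c4d5e6f7a8b9c0d
frozen: true
deps:
- etag: '"1a2b3c4d"'
  path: https://example.com/uploads/cover.png
outs:
- md5: 0d9c8b7a6f5e4d3c2b1a0f9e8d7c6b5a
  path: cover.png
```

Being a small text file, it can be committed and reviewed in the PR while the image itself lives in remote storage.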
Not sure I understand why you would need …
Yep, that's the most interesting part - how to automate and hook it nicely into the workflow. Write a plugin that detects that images are stored in DVC and resolves/pulls them programmatically? Also, how do we add more images ... can we make it part of the build (detect changes and save …)?
Agreed! Let's keep this in mind and think. But for now we can probably solve this for the internal team first.
@shcheklein I can create this plugin. If the image DVC files are stored next to the images, it's a lot easier (or if the DVC file represents the whole image subdirectory). I can detect changes in the DVC file(s) and run …

Automatic pull has a drawback though. If you change a DVC file after changing an image, you lose the changes you made. After reading through the docs a bit, I don't think there's a way to pull a DVC file that avoids overwriting novel data that might be in the working directory. It's probably safe to assume people will have an extra copy of the images they're adding, so this is probably fine to ignore for now.

As for writing images, I don't think it would be very good to track images automatically at build time (if you crop, resize or color-correct an image 10 times before committing while your dev server is running, we get 9 unused versions of that image sitting on S3). Instead, simply asking users to run …
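A minimal sketch of the change-detection idea above (the naive line-based parse and the function names are illustrative only, not a real Gatsby or DVC API; it assumes the plugin keeps its own record of the last-seen checksum per DVC file):

```python
def dvc_file_checksum(path):
    """Return the top-level md5 recorded in a .dvc file (naive line parse)."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line.startswith("md5:"):
                return line.split(":", 1)[1].strip()
    return None

def changed_dvc_files(dvc_files, seen):
    """Return the .dvc files whose recorded md5 differs from the last-seen
    value in `seen`, updating `seen` along the way. The caller would then
    run a pull for only those files."""
    changed = []
    for path in dvc_files:
        checksum = dvc_file_checksum(path)
        if seen.get(path) != checksum:
            changed.append(path)
            seen[path] = checksum
    return changed
```

A dev-server hook could call `changed_dvc_files` on each rebuild and trigger a pull only when something actually changed, avoiding redundant network round trips.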
Probably this should scale better; DVC is not very well suited to thousands and thousands of DVC files - it might become a bottleneck. Also, dvc pull can do "granular" partial pulls for a single file in a directory.

Not sure why. You don't change the DVC file. It's being updated by running …

This is fine, we can do …

Btw, it's a pretty regular thing in Gatsby that takes a lot of time. Can we use DVC pipelines there, to save prepared/processed images into the cache?

We can do any simple option first for the update experience.
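If DVC pipelines were used for the image-processing step, a stage along these lines could cache the processed output so it isn't regenerated on every build (the stage name, command, and paths here are hypothetical):

```yaml
# dvc.yaml — hypothetical image-optimization stage
stages:
  optimize_images:
    cmd: node scripts/optimize-images.js static/uploads static/optimized
    deps:
      - static/uploads
      - scripts/optimize-images.js
    outs:
      - static/optimized
```

The open question raised above still applies: the CI/CD environment would need access to the DVC cache for this to actually save build time.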
The case where you lose your image changes is mostly when you change branches, which changes the DVC file. It shouldn't be a big deal. Using DVC pipelines sounds like a good idea, but I think there are some Gatsby plugins for compressing images which use the Gatsby cache to keep things up to date. Either way, I don't want to tackle this before the rest of the experience is solid. Thanks for clearing up my questions, I'll get right on this.
@fabiosantoscode it won't allow you to do …
Yep, the question is - can Netlify and/or the Gatsby CI/CD be set up in a way that uses this cache easily?
Yep, that's right.
To fix this, I think we should cache the …
Hi @iterative/websites! I just realized this p1 issue has been waiting for a while. Should we plan it?

Option A - Use DVC (probably with an S3 bucket)
Cc @casperdcl didn't you propose a related solution for another repo? Thanks
Thanks @jorgeorpinel for bringing this up; we need more discussion on it. We have already discussed it a little on Slack and have also mentioned it in the #docs meeting, but haven't reached a conclusion overall.
For this we need to be okay with the caveats described in "why-you-shouldn't-use-git-lfs".
Possible option C: I mentioned using content management tools (a CMS for content as well as media management). If we choose to go with one of the above:
Option B:
Option C:
As I recall, I'd also suggested the option of using a submodule (e.g. https://github.com/iterative/static) for images, so that the base repo doesn't bloat in size. Cons: it would require people to shallow-clone/checkout/init submodules, but maybe it's the easiest option? From discussion with @shcheklein, I also recall the Gatsby build would have to check out the submodule in order to perform image optimisations.
Quite doable: you configure two remotes pointing to the same storage, a read-only one via HTTP in .dvc/config, and the one for writing in .dvc/config.local (the latter needs to be set up manually as part of the dev env setup, which should be documented in the README and/or here). Indeed, the LFS integration with GH makes this easier for devs, though.
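A sketch of that two-remote setup, with placeholder remote names and URLs (the bucket and domain below are made up):

```ini
# .dvc/config — committed to Git; default remote is read-only over HTTP
[core]
    remote = images-http
['remote "images-http"']
    url = https://images.example.com

# .dvc/config.local — git-ignored; write remote, set up per developer
['remote "images-s3"']
    url = s3://example-images-bucket/blog
```

With this split, anyone can pull without credentials, while pushes go through the locally configured S3 remote.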
Good question. It's not a common case though, so maybe not worth worrying about? In the rare cases, they'd have to upload the image to the PR during review and we'd process images appropriately once it's approved, before squash & merge. Maybe GH/LFS also makes this easier? If so, I'd be inclined toward LFS at this point.
We'd want to rewrite the whole repo history to remove image files with either option, I think. I don't think that's a problem for a web app -- we never deploy much older versions.
Interesting, but how does it integrate with the existing engine? Do we want to manage two web apps? Sorry, I can't open the Slack links (feel free to @ me directly on Slack though). BTW, we'd still want to rewrite history and remove all images from the repo, I believe.
Ugh, submodules. A reason not to use Git is that it's "not supposed" to manage binaries anyway, right? The submodule may still take a long time to clone, etc., and it complicates things for devs/contributors anyway.
💡 Idea to explore: (a bit hacky, but) we could try using our existing Discuss site to upload images and load them from there into the docs. We do something like that for all blog posts to create a forum thread for comments. I believe the images would be served by our CDN anyway (we should check).
Since we deploy our website on Heroku (I couldn't find any mention of support for git-lfs), there might be a problem at build time, so we might need some workaround on Heroku to support git-lfs. But that does come with a limitation. Shall we try with LFS and submit a PR to see how it goes?
Currently, the contents are kind of separate too. Gatsby has the flexibility to use them from different sources.
Depends. We can rely on the cloud services most of them provide.
Or we can use image/media management services like Cloudinary, which has easy integration with Gatsby too.
Even if there were no other option, I'd not use Git LFS for any solution. It's like using Mercedes trucks in a BMW factory. If we're going to use competitors' solutions, git-annex may be better, as it doesn't require server-side changes 😃 DVC can be used with multiple remote configurations, like we do in the example repositories. Something like reading from …
It's just an idea. Need to discuss its feasibility.
cc: @iterative/websites @casperdcl
If we convert … In the publish phase, a … In our PRs, we can upload the files ourselves. Outside contributors won't be able to push, though. They can include the images in PRs, and we can upload them. Once we have a workflow to use DVC in the repository, we can automate as much as possible.
Before this step I would have:
Also, the storage could be similar to https://assets.cml.dev :) /CC @iterative/cml
Sounds interesting. I'm also of the mind to try this manually first, and automate via hooks or actions once the workflow is clear.
It'd be ideal if the transformation could happen locally though, so that there's no need to cascade PRs, some of which rewrite the history of what was just merged in (sounds fragile). The question about outside contributors is a good one, but it's not very common for them to include images in PRs, so it's not a pressing one to answer.
Lowering to p2 due to lack of resources for now.
Clone takes forever; at some point we will hit the GH limits.
As an experiment, we can see what it would take to integrate DVC to store images and use GraphQL to fetch them. POC here - https://github.com/iterative/blog/pull/93
After that we can amend Git repo to remove all image objects.
Any other ideas?