exp push: fails for >50MB commits #6181

Closed
casperdcl opened this issue Jun 15, 2021 · 6 comments
Labels
A: experiments (Related to dvc exp)
p2-medium (Medium priority, should be done, but less important)

Comments

@casperdcl
Contributor

casperdcl commented Jun 15, 2021

dvc exp push can fail since GitHub rejects commits >50MB in size. Perhaps use DVC cache instead for such cases?

Part of iterative/cml#560

/CC @pmrowla

@casperdcl casperdcl added the product: VSCode Integration with VSCode extension label Jun 15, 2021
@casperdcl casperdcl changed the title exp push: fails from >50MB exp push: fails for >50MB Jun 15, 2021
@casperdcl casperdcl changed the title exp push: fails for >50MB exp push: fails for >50MB commits Jun 15, 2021
@efiop efiop added the A: experiments Related to dvc exp label Jun 15, 2021
@pmrowla
Contributor

pmrowla commented Jun 16, 2021

If users are already using GitHub for versioning their pipeline, what is happening in these particular experiment commits that makes them go over the GitHub size limit (when the user's "regular" commits are not over the size limit)?

@casperdcl
Contributor Author

The experiment could have intermediate per-checkpoint debug data (which the user doesn't want tracked by either DVC or Git).

@casperdcl casperdcl added the p2-medium Medium priority, should be done, but less important label Jun 16, 2021
@pmrowla
Contributor

pmrowla commented Jun 17, 2021

For reference, the issue here was discussed in yesterday's meeting:

When we generate a checkpoint commit, anything marked as a pipeline dependency or output that is also cache: false will be forcefully tracked in Git. This is needed so that DVC can properly preserve/restore the state of the workspace when resuming checkpoint runs.

The problem with the large commits is most likely occurring when users have large, intermediate cache: false outputs/deps that they don't want tracked at all (by DVC or Git). But DVC cannot tell the difference between an output that is cache: false "because the user will track it with Git" and one that is cache: false "because it should not be tracked at all", and we end up incorrectly tracking these files in the checkpoint Git commits.

One thing to note here is that DVC will not do the forced tracking for files which are both cache: false and .gitignored. This issue is potentially also happening because users don't bother gitignoring these intermediate files/dirs (since the user knows not to manually git add them). However, this doesn't work for an automated CI run like in CML.

If these intermediate files/dirs are properly gitignored, it would also stop DVC from generating these bloated checkpoint commits. I think we need to clearly document this behavior on both the DVC and CML sides.
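
To make the rule above concrete, here is a minimal Python sketch of the decision as described in this comment. It is not DVC's actual implementation: the `Output` class, the `force_tracked_in_checkpoint` function, and the example paths are all hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Output:
    """Hypothetical stand-in for a pipeline dependency or output (not a DVC class)."""
    path: str
    cache: bool       # False corresponds to `cache: false` in the pipeline definition
    gitignored: bool  # True if a .gitignore rule matches the path


def force_tracked_in_checkpoint(out: Output) -> bool:
    """Return True if, per the rule described above, `out` would end up
    in the checkpoint Git commit."""
    # Only entries that are cache: false AND not gitignored are force-tracked.
    return (not out.cache) and (not out.gitignored)


# Example paths are made up for illustration.
outputs = [
    Output("model.pt", cache=True, gitignored=True),            # tracked by DVC cache
    Output("metrics.json", cache=False, gitignored=False),      # meant to live in Git
    Output("debug_dump/", cache=False, gitignored=True),        # ignored, so skipped
    Output("debug_dump_unignored/", cache=False, gitignored=False),  # bloats the commit
]

tracked = [o.path for o in outputs if force_tracked_in_checkpoint(o)]
print(tracked)
```

In this sketch, the last entry is the problematic case: it is indistinguishable from `metrics.json` unless the user gitignores it.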

@dberenbaum @casperdcl

@karajan1001
Contributor

Does DVC or Git currently track every checkpoint commit? If so, a training run with a large number of epochs or iterations might generate a huge number of checkpoints, but in most cases we only need the latest ones.

@pmrowla
Contributor

pmrowla commented Jun 17, 2021

@karajan1001 yes, we track every iteration. Once the user decides they want to keep an experiment, they can choose to either keep all of the commits or just squash them all into a single commit (to only keep the final checkpoint)

@casperdcl
Contributor Author

"we track every iteration"

On a related note, we may want to add an option to only keep the last N checkpoints. This may save disk space, as well as sanity when doing exp show on 10 billion checkpoints.
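
As a rough illustration of the "keep the last N" idea, the selection logic might look like the Python sketch below. This is only an assumption about how such an option could behave; `split_keep_last` and the checkpoint names are hypothetical, and no such DVC option is implied to exist.

```python
from typing import List, Tuple


def split_keep_last(checkpoints: List[str], n: int) -> Tuple[List[str], List[str]]:
    """Split checkpoints (ordered oldest first) into (kept, pruned),
    keeping only the last `n` entries."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        # Guard: checkpoints[-0:] would return the whole list.
        return [], list(checkpoints)
    return list(checkpoints[-n:]), list(checkpoints[:-n])


# Hypothetical checkpoint identifiers, oldest first.
history = ["ckpt-1", "ckpt-2", "ckpt-3", "ckpt-4", "ckpt-5"]
kept, pruned = split_keep_last(history, 2)
print(kept)    # the two most recent checkpoints
print(pruned)  # everything older
```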

@daavoo daavoo closed this as not planned May 17, 2023
@daavoo daavoo removed the product: VSCode Integration with VSCode extension label May 17, 2023
5 participants