Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix very large git repository #77

Closed
Bisaloo opened this issue Jun 22, 2023 · 6 comments
Closed

Fix very large git repository #77

Bisaloo opened this issue Jun 22, 2023 · 6 comments
Assignees
Labels
bug Something isn't working

Comments

@Bisaloo
Copy link
Member

Bisaloo commented Jun 22, 2023

The serofoi git repository is currently extremely large, with a total size around 1.6 GB. This makes it difficult to clone, and contribute to.

A quick analysis reveals the problems come from these files in history:

  • RDS files in R/stanmodels/
  • RDS files in inst/extdata/stanmodels/
  • RDS files in tests/*/fitting_results/
  • a couple of other one-off occurences (docx in data/, etc.) but their contribution is smaller

Note that because git keeps the entire history, even though these files have been deleted, they keep contribute to the total repository size.

At this stage, I think it would be worthwhile to remove these files from history to be able to start fresh on a clean, lean, repo. However, this comes with some issues: deleting these files will rewrite the entire git history, which means that all contributors will have to delete their local clone and clone the repo again.

This is not completely unheard of. In fact, the carpentries have done something similar in all their lesson repos recently but it needs to be done with: 1. lots of caution, 2. the buy-in of all contributors

@ntorresd
Copy link
Member

ntorresd commented Jun 22, 2023

@Bisaloo you're right! This is a problem we should address soon for the repository to be easier to clone for the users.

We've been discussing with @GeraldineGomez and came to the conclusion that the best approach would be to follow the suggestions in this stackoverflow post, i.e. to use either git filter-repo or git filter-branch (I'd rather prefer the former) to remove from the history all files with the extension *.RDS (with some exceptions needed for testing purposes that I'll refer to in a moment). We'd like to know your opinion on this.

Bare in mind that this shouldn't have a significant impact neither in the general functioning of the package nor of that of previous versions. This files are files generated by Stan whenever a model is ran for the first time locally, which is why some time ago we created the function save_or_load_model, that can still be found in main in the modelling module, but will be replaced by the instantiation function stanmodels in the next update (see stanmodels.R in dev). In essence, this functions just check whether is necessary to compile the corresponding .RDS file depending on whether the user had or hadn't ran them for the first time already.

For the tests we've opted, for time efficiency reasons (the tests would take for ever if each test file were to run each or even one model everytime), to store some benchmark stan model objects as serialized .json files that can be imported during tests by means of jsonlite::fromJSON and jsonlite::unserializeJSON. This way the models don't have to be implemented each time a test is run, for instance, for the visualisation module. For a similar reason we decided to store the files tests/testthat/extdata/prev_expanded_*.RDS (the exceptions I mentioned above, which don't correspond to Stan model objects per se, but to sub-products of them). As you have already noted, these files don't take as much space as the old .RDS files although I'm open to suggestions on how to improve the way we approach testing right now because a similar problem might come to kick back later on with our current approach.

@Bisaloo
Copy link
Member Author

Bisaloo commented Jun 22, 2023

We've been discussing with @GeraldineGomez and came to the conclusion that the best approach would be to follow the suggestions in this stackoverflow post, i.e. to use either git filter-repo or git filter-branch (I'd rather prefer the former) to remove from the history all files with the extension *.RDS (with some exceptions needed for testing purposes that I'll refer to in a moment). We'd like to know your opinion on this.

Yes, I used this approach in the past and it worked well. As mentioned in the previous message, you just have to be very careful during the last step, git push -f, as you are overwriting the entire git history and you could lose everything if there was some error in the previous steps (we have backups so it's not totally unrecoverable though).

For the tests we've opted, for time efficiency reasons (the tests would take for ever if each test file were to run each or even one model everytime), to store some benchmark stan model objects as serialized .json files that can be imported during tests by means of jsonlite::fromJSON and jsonlite::unserializeJSON. This way the models don't have to be implemented each time a test is run, for instance, for the visualisation module. For a similar reason we decided to store the files tests/testthat/extdata/prev_expanded_*.RDS (the exceptions I mentioned above, which don't correspond to Stan model objects per se, but to sub-products of them). As you have already noted, these files don't take as much space as the old .RDS files although I'm open to suggestions on how to improve the way we approach testing right now because a similar problem might come to kick back later on with our current approach.

Yes, I understand your motivations here but as you guessed it, this is going to be a problem with CRAN since they flag any package larger than 5 MB.
Not that it solves completely the issue but it's usually more size-efficient to save these files as binary (RDS) rather than plain text (json). Is there any reason you preferred json here?

However, we might be focusing on the wrong issue. I'm not completely sure it's the best strategy to store so much information in your seromodel_object (it becomes a 'God object'). In particular, it contains quite a lot of data that looks like it can be recomputed easily and quickly on-the-fly (e.g., exposure_age, foi_post_s, ...). Should I open a separate issue to discuss this?

@ben18785
Copy link
Collaborator

I agree with @Bisaloo -- we should alter the Git history as the repo size is huge. I'm not concerned about these files disappearing from the git history as they're not important for the package functioning.

@Bisaloo
Copy link
Member Author

Bisaloo commented Jul 6, 2023

We get down from 1.6GB to 79MB by running:

git filter-repo --strip-blobs-bigger-than 4M

Note that this works well in this specific case but should not be applied in any repo. It works here because the only large files are files we want to delete. If we have a mix of large files we want to keep and large files we want to delete, we need to use a smarter solution

Files deleted from history:

  • tests/test_2022_09_21_19_07_11/fitting_results/COL-035-18.RDS
  • tests/tests_2022_11_18_14_00_10/fitting_results/COL-035-18.RDS
  • tests/tests_2022_11_18_14_50_25/fitting_results/COL-035-18.RDS
  • tests/tests_2022_11_18_16_29_31/fitting_results/COL-035-18.RDS
  • tests/tests_2022_11_18_15_25_38/fitting_results/COL-035-18.RDS
  • tests/tests_2022_11_18_16_42_08/fitting_results/COL-035-18.RDS
  • tests/tests_2022_11_18_14_55_08/fitting_results/COL-035-18.RDS
  • tests/tests_2022_11_03_16_20_30/fitting_results/COL-035-18.RDS
  • tests/tests_2022_11_03_16_13_02/fitting_results/COL-035-18.RDS
  • tests/test_2022_10_31_18_12_54/fitting_results/COL-035-18.RDS
  • test/test_2022_09_15_17_35_33/fitting_results/COL-035-18.RDS
  • test/test_2022_09_15_18_11_11/fitting_results/COL-035-18.RDS
  • test/test_2022_10_20_16_26_51/fitting_results/COL-035-18.RDS
  • tests/tests_2022_11_09_12_01_54/fitting_results/COL-035-18.RDS
  • tests/tests_2022_11_09_13_07_27/fitting_results/COL-035-18.RDS
  • tests/tests_2022_12_28_09_56_55/fitting_results/COL-035-18.RDS
  • tests/tests_2022_11_09_13_34_04/fitting_results/COL-035-18.RDS
  • test/test_2022_10_21_09_25_29/fitting_results/COL-035-18.RDS
  • tests/tests_2022_11_09_12_54_02/fitting_results/COL-035-18.RDS
  • test/test_2022_10_18_13_09_16/fitting_results/COL-035-18.RDS
  • tests/tests_2023_01_09_14_27_27/fitting_results/COL-035-18.RDS
  • test/test_2022_10_14_13_22_42/fitting_results/COL-035-18.RDS
  • tests/tests_2023_01_07_18_38_30/fitting_results/COL-035-18.RDS
  • test/test_2022_10_20_16_12_07/fitting_results/COL-035-18.RDS
  • tests/tests_2023_01_11_10_50_33/fitting_results/COL-035-18.RDS
  • tests/tests_2022_11_09_12_45_25/fitting_results/COL-035-18.RDS
  • test/test_2022_10_30_14_18_55/fitting_results/COL-035-18.RDS
  • tests/tests_2023_01_11_11_00_29/fitting_results/COL-035-18.RDS
  • tests/tests_2023_01_12_08_35_00/fitting_results/COL-035-18.RDS
  • tests/tests_2023_01_12_07_27_15/fitting_results/COL-035-18.RDS
  • data/chik-seroinference-simulations.docx

@ntorresd ntorresd self-assigned this Jul 31, 2023
@ntorresd ntorresd added the bug Something isn't working label Jul 31, 2023
@Bisaloo
Copy link
Member Author

Bisaloo commented Oct 10, 2023

@jamesmbaazam @davidsantiagoquevedo @ben18785 @zmcucunuba

Apologies if we closed a PR or deleted a branch in which you were involved. This is an unfortunate side effect of cleaning the git history. The branches don't have any common history with the new dev and were automatically closed by GitHub. We are working towards restoring the work you have done or were doing based on the current dev.

Thanks for your understanding!

@jamesmbaazam
Copy link
Member

Noted. Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants