Fix very large git repository #77

Bisaloo · 2023-06-22T16:41:45Z

The serofoi git repository is currently extremely large, with a total size around 1.6 GB. This makes it difficult to clone, and contribute to.

A quick analysis reveals the problems come from these files in history:

RDS files in R/stanmodels/
RDS files in inst/extdata/stanmodels/
RDS files in tests/*/fitting_results/
a couple of other one-off occurences (docx in data/, etc.) but their contribution is smaller

Note that because git keeps the entire history, even though these files have been deleted, they keep contribute to the total repository size.

At this stage, I think it would be worthwhile to remove these files from history to be able to start fresh on a clean, lean, repo. However, this comes with some issues: deleting these files will rewrite the entire git history, which means that all contributors will have to delete their local clone and clone the repo again.

This is not completely unheard of. In fact, the carpentries have done something similar in all their lesson repos recently but it needs to be done with: 1. lots of caution, 2. the buy-in of all contributors

The text was updated successfully, but these errors were encountered:

ntorresd · 2023-06-22T21:32:33Z

@Bisaloo you're right! This is a problem we should address soon for the repository to be easier to clone for the users.

We've been discussing with @GeraldineGomez and came to the conclusion that the best approach would be to follow the suggestions in this stackoverflow post, i.e. to use either git filter-repo or git filter-branch (I'd rather prefer the former) to remove from the history all files with the extension *.RDS (with some exceptions needed for testing purposes that I'll refer to in a moment). We'd like to know your opinion on this.

Bare in mind that this shouldn't have a significant impact neither in the general functioning of the package nor of that of previous versions. This files are files generated by Stan whenever a model is ran for the first time locally, which is why some time ago we created the function save_or_load_model, that can still be found in main in the modelling module, but will be replaced by the instantiation function stanmodels in the next update (see stanmodels.R in dev). In essence, this functions just check whether is necessary to compile the corresponding .RDS file depending on whether the user had or hadn't ran them for the first time already.

For the tests we've opted, for time efficiency reasons (the tests would take for ever if each test file were to run each or even one model everytime), to store some benchmark stan model objects as serialized .json files that can be imported during tests by means of jsonlite::fromJSON and jsonlite::unserializeJSON. This way the models don't have to be implemented each time a test is run, for instance, for the visualisation module. For a similar reason we decided to store the files tests/testthat/extdata/prev_expanded_*.RDS (the exceptions I mentioned above, which don't correspond to Stan model objects per se, but to sub-products of them). As you have already noted, these files don't take as much space as the old .RDS files although I'm open to suggestions on how to improve the way we approach testing right now because a similar problem might come to kick back later on with our current approach.

Bisaloo · 2023-06-22T23:18:36Z

We've been discussing with @GeraldineGomez and came to the conclusion that the best approach would be to follow the suggestions in this stackoverflow post, i.e. to use either git filter-repo or git filter-branch (I'd rather prefer the former) to remove from the history all files with the extension *.RDS (with some exceptions needed for testing purposes that I'll refer to in a moment). We'd like to know your opinion on this.

Yes, I used this approach in the past and it worked well. As mentioned in the previous message, you just have to be very careful during the last step, git push -f, as you are overwriting the entire git history and you could lose everything if there was some error in the previous steps (we have backups so it's not totally unrecoverable though).

For the tests we've opted, for time efficiency reasons (the tests would take for ever if each test file were to run each or even one model everytime), to store some benchmark stan model objects as serialized .json files that can be imported during tests by means of jsonlite::fromJSON and jsonlite::unserializeJSON. This way the models don't have to be implemented each time a test is run, for instance, for the visualisation module. For a similar reason we decided to store the files tests/testthat/extdata/prev_expanded_*.RDS (the exceptions I mentioned above, which don't correspond to Stan model objects per se, but to sub-products of them). As you have already noted, these files don't take as much space as the old .RDS files although I'm open to suggestions on how to improve the way we approach testing right now because a similar problem might come to kick back later on with our current approach.

Yes, I understand your motivations here but as you guessed it, this is going to be a problem with CRAN since they flag any package larger than 5 MB.
Not that it solves completely the issue but it's usually more size-efficient to save these files as binary (RDS) rather than plain text (json). Is there any reason you preferred json here?

However, we might be focusing on the wrong issue. I'm not completely sure it's the best strategy to store so much information in your seromodel_object (it becomes a 'God object'). In particular, it contains quite a lot of data that looks like it can be recomputed easily and quickly on-the-fly (e.g., exposure_age, foi_post_s, ...). Should I open a separate issue to discuss this?

ben18785 · 2023-06-23T10:40:24Z

I agree with @Bisaloo -- we should alter the Git history as the repo size is huge. I'm not concerned about these files disappearing from the git history as they're not important for the package functioning.

Bisaloo · 2023-07-06T14:49:48Z

We get down from 1.6GB to 79MB by running:

git filter-repo --strip-blobs-bigger-than 4M

Note that this works well in this specific case but should not be applied in any repo. It works here because the only large files are files we want to delete. If we have a mix of large files we want to keep and large files we want to delete, we need to use a smarter solution

Files deleted from history:

tests/test_2022_09_21_19_07_11/fitting_results/COL-035-18.RDS
tests/tests_2022_11_18_14_00_10/fitting_results/COL-035-18.RDS
tests/tests_2022_11_18_14_50_25/fitting_results/COL-035-18.RDS
tests/tests_2022_11_18_16_29_31/fitting_results/COL-035-18.RDS
tests/tests_2022_11_18_15_25_38/fitting_results/COL-035-18.RDS
tests/tests_2022_11_18_16_42_08/fitting_results/COL-035-18.RDS
tests/tests_2022_11_18_14_55_08/fitting_results/COL-035-18.RDS
tests/tests_2022_11_03_16_20_30/fitting_results/COL-035-18.RDS
tests/tests_2022_11_03_16_13_02/fitting_results/COL-035-18.RDS
tests/test_2022_10_31_18_12_54/fitting_results/COL-035-18.RDS
test/test_2022_09_15_17_35_33/fitting_results/COL-035-18.RDS
test/test_2022_09_15_18_11_11/fitting_results/COL-035-18.RDS
test/test_2022_10_20_16_26_51/fitting_results/COL-035-18.RDS
tests/tests_2022_11_09_12_01_54/fitting_results/COL-035-18.RDS
tests/tests_2022_11_09_13_07_27/fitting_results/COL-035-18.RDS
tests/tests_2022_12_28_09_56_55/fitting_results/COL-035-18.RDS
tests/tests_2022_11_09_13_34_04/fitting_results/COL-035-18.RDS
test/test_2022_10_21_09_25_29/fitting_results/COL-035-18.RDS
tests/tests_2022_11_09_12_54_02/fitting_results/COL-035-18.RDS
test/test_2022_10_18_13_09_16/fitting_results/COL-035-18.RDS
tests/tests_2023_01_09_14_27_27/fitting_results/COL-035-18.RDS
test/test_2022_10_14_13_22_42/fitting_results/COL-035-18.RDS
tests/tests_2023_01_07_18_38_30/fitting_results/COL-035-18.RDS
test/test_2022_10_20_16_12_07/fitting_results/COL-035-18.RDS
tests/tests_2023_01_11_10_50_33/fitting_results/COL-035-18.RDS
tests/tests_2022_11_09_12_45_25/fitting_results/COL-035-18.RDS
test/test_2022_10_30_14_18_55/fitting_results/COL-035-18.RDS
tests/tests_2023_01_11_11_00_29/fitting_results/COL-035-18.RDS
tests/tests_2023_01_12_08_35_00/fitting_results/COL-035-18.RDS
tests/tests_2023_01_12_07_27_15/fitting_results/COL-035-18.RDS
data/chik-seroinference-simulations.docx

Bisaloo · 2023-10-10T16:38:35Z

@jamesmbaazam @davidsantiagoquevedo @ben18785 @zmcucunuba

Apologies if we closed a PR or deleted a branch in which you were involved. This is an unfortunate side effect of cleaning the git history. The branches don't have any common history with the new dev and were automatically closed by GitHub. We are working towards restoring the work you have done or were doing based on the current dev.

Thanks for your understanding!

jamesmbaazam · 2023-10-10T17:03:22Z

Noted. Thanks.

Bisaloo mentioned this issue Jul 6, 2023

Simplify fit_seromodel() output #89

Closed

ntorresd self-assigned this Jul 31, 2023

ntorresd added the bug Something isn't working label Jul 31, 2023

ntorresd assigned Bisaloo Aug 4, 2023

ntorresd mentioned this issue Oct 10, 2023

Dev test models #124

Closed

ntorresd closed this as completed Oct 31, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix very large git repository #77

Fix very large git repository #77

Bisaloo commented Jun 22, 2023

ntorresd commented Jun 22, 2023 •

edited

Loading

Bisaloo commented Jun 22, 2023 •

edited

Loading

ben18785 commented Jun 23, 2023

Bisaloo commented Jul 6, 2023 •

edited

Loading

Bisaloo commented Oct 10, 2023

jamesmbaazam commented Oct 10, 2023

Fix very large git repository #77

Fix very large git repository #77

Comments

Bisaloo commented Jun 22, 2023

ntorresd commented Jun 22, 2023 • edited Loading

Bisaloo commented Jun 22, 2023 • edited Loading

ben18785 commented Jun 23, 2023

Bisaloo commented Jul 6, 2023 • edited Loading

Bisaloo commented Oct 10, 2023

jamesmbaazam commented Oct 10, 2023

ntorresd commented Jun 22, 2023 •

edited

Loading

Bisaloo commented Jun 22, 2023 •

edited

Loading

Bisaloo commented Jul 6, 2023 •

edited

Loading