
style: whether to preserve previously used curation scripts or not? #190

Closed
2 tasks done
haowang-bioinfo opened this issue Jul 20, 2020 · 13 comments

Comments

@haowang-bioinfo
Member

Description of the issue:

  • Many curation scripts have accumulated in the repo. These scripts were used only once to make specific changes to the model, and won't be used again. This issue is to open a discussion about whether or not to preserve these one-time scripts under the code folder.

Expected feature/value/output:

  • Find an approach that is easy to use for both users and curators

I hereby confirm that I have:

  • Done this analysis in the master branch of the repository
  • Checked that a similar issue does not exist already
@mihai-sysbio
Member

Based on the reasoning in #188, I think @JonathanRob summarized it well:

remove the older model curation scripts (the more recent ones can stay)

Together with this, it would be nice to have a guide on how to go back in time and walk through an example curation.
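Such a guide could be sketched along the following lines. Everything here is illustrative: the tag, file names, and commit messages are hypothetical, and a throwaway scratch repository stands in for Human-GEM.

```shell
#!/bin/sh
# Sketch: "go back in time" to rerun a one-time curation script that was
# later removed from the tip of history. All names are hypothetical.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name demo
git config user.email demo@example.com

# A release that still contained the curation script
echo 'echo "curating model"' > curate.sh
git add curate.sh
git commit -q -m "feat: add one-time curation script"
git tag v1.0.0

# Later, the script is removed from the tip of history
git rm -q curate.sh
git commit -q -m "chore: remove one-time curation script"

# To rerun the old curation, check out the tagged release (detached HEAD)...
git checkout -q v1.0.0
sh curate.sh                # prints "curating model"
# ...and return to the branch we came from
git checkout -q -
```

The key point for the guide would be that nothing is lost by deleting a script from the tip: any tagged release can be checked out and the script run in the exact directory layout it was written for.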

@JonathanRob
Collaborator

Yes, I like @mihai-sysbio's suggestion to provide a short guide on how to take advantage of the git framework. I guess adding such a guide to the README in the code/ directory would be best, as you suggested earlier, rather than the main repository README. Another option would be to create a wiki for the repository and add it there, but maybe that's an idea for the more distant future.

@haowang-bioinfo
Member Author

@mihai-sysbio @JonathanRob I don't have strong opinions about removing the old scripts, but I am concerned that this solution might be biased toward curators rather than users.

@JonathanRob
Collaborator

To add to this, I would also recommend removing a lot of the content in the data/ directory. Much of the data and scripts in that directory are no longer used or maintained, so they should be removed to avoid confusion/clutter.

@haowang-bioinfo
Member Author

Agreed; the outdated content under data and the unused scripts in code can be gradually moved to a root-level deprecated directory, in which no further curation would take place and which could be completely removed at some later point (e.g. a major release).

@mihai-sysbio
Member

The move to a ./deprecated folder might further impede the ability to run said code because of the broken folder structure. Moreover, the recovery mechanism is the same: going back in time to a previous release. I therefore find the move to such a folder unnecessary.

@haowang-bioinfo
Member Author

haowang-bioinfo commented Jan 15, 2021

  1. The move to a deprecated folder is meant to stop any further running and curation of the code/data;
  2. If we rely on the recovery mechanism alone, could all the content of data and code simply be deleted?
  3. Many data files and code appear to be unused and unnecessary, but is this 100% certain?
  4. Is it a big cost to keep such a folder of ~100 MB in size?

@JonathanRob
Collaborator

JonathanRob commented Jan 15, 2021

To respond to your points, @Hao-Chalmers:

  1. The move to a deprecated folder is meant to stop any further running and curation of the code/data;

Deleting the code/data will also prevent running and curating it further.

  2. If we rely on the recovery mechanism alone, could all the content of data and code simply be deleted?

I'm not sure what you mean by this, but if there is code or data that is not used or maintained and is not expected to be used or developed in the future, then it should not remain in the repository. A difficulty for new users is having to familiarize themselves with all the code/data in a repository to see whether what they need is already present or still needs to be developed. For example, I have written my own custom functions in RAVEN only to find out afterward that they already existed and I just didn't see them at first because there are so many (this is not a critique of RAVEN, but an inherent challenge of larger repositories). Keeping a lot of old and unused scripts and data makes this even more tedious and problematic.

  3. Many data files and code appear to be unused and unnecessary, but is this 100% certain?

I do not delete any code or files unless I'm quite certain that they are old and no longer necessary. Of course I make a lot of mistakes, but it's very easy to simply restore a wrongly deleted file if we later realize it was incorrectly removed.
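The "very easy to restore" step could be sketched as below. The file path and commit messages are made up, and a scratch repository simulates the accidental deletion.

```shell
#!/bin/sh
# Sketch: restore a wrongly deleted file from history. Paths and commit
# messages are illustrative; a scratch repository simulates the mistake.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name demo
git config user.email demo@example.com

mkdir data
echo "gene data" > data/genes.tsv
git add data
git commit -q -m "add data file"

git rm -q data/genes.tsv
git commit -q -m "remove data file (turns out to be a mistake)"

# Find the commit that deleted the file...
del=$(git log --format=%H --diff-filter=D -- data/genes.tsv | head -n 1)
# ...and restore the file from that commit's parent
git checkout -q "$del"^ -- data/genes.tsv
git commit -q -m "restore data/genes.tsv"
cat data/genes.tsv          # prints "gene data"
```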

  4. Is it a big cost to keep such a folder of ~100 MB in size?

For a git repository, I generally try to keep the size as small as reasonably possible. So if something is not necessary, then why keep it?

@haowang-bioinfo
Member Author

Whether or not it is necessary to keep the previous data and scripts used to modify the GEM is an important question. It should be openly discussed and weighed with input from multiple parties: administrators, community contributors, and especially the users, whose thoughts are currently unavailable.

Not long ago, modeling work suffered from the lack of previous code and data, which made reuse and further development a painful process. Git-based GEM repos have greatly mitigated this, even though the size of a GEM repo grows with time. In balancing "easy to trace and full transparency" against "reduce repo size", the former has a higher priority in my view.

@mihai-sysbio
Member

Once code and data are added to a repository, only actions that alter history (e.g. force-pushing) can make them go away. Having protected branches reduces some of that risk. In practice, code and data in Human-GEM will always be present in the repository when doing a full git clone or downloading an old release. This means that the total size of the repository will essentially never decrease.

Now, by deleting some of these files, they will disappear in future releases. As @JonathanRob described above, there are many advantages to keeping a clean and functional code base, free of dead code that is not directly runnable (say, because the file tree or file names have changed). At the same time, if code or data is not present in a current release, one has to dig a little deeper to find it.

I believe the ultimate purpose of this issue is to identify all code that cannot be run any longer and will not be fixed to run in the future, together with associated data, and other data that is not relevant any longer.

@hao:
Not long ago, modeling work suffered from the lack of previous code and data, which made reuse and further development a painful process. Git-based GEM repos have greatly mitigated this, even though the size of a GEM repo grows with time. In balancing "easy to trace and full transparency" against "reduce repo size", the former has a higher priority in my view.

I think we are all on the same page: any commit that makes changes to the model should be supported by code and data present in those commits. After many releases, I believe the time has come to ask ourselves how long we will keep around code and data that was meant to be used only once, while pretending it can still run.

@JonathanRob JonathanRob changed the title style: whether to preserver previously used curation scripts or not? style: whether to preserve previously used curation scripts or not? Jan 18, 2021
@JonathanRob
Collaborator

Thank you for your input @Hao-Chalmers and @mihai-sysbio. After some further consideration, I have changed my view on the matter.

In particular, I find it to be extremely useful to be able to use GitHub's search function to query the repository for any scripts, functions, log files, etc. that contain a given term (such as the name of a reaction that has been deleted). If the old data files and scripts are deleted, their contents will not appear in the search results, requiring a bit more digging into previous versions as @mihai-sysbio explained.
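As a local complement to the GitHub search (which only covers the current tree of the default branch), `git grep` can search every revision reachable from any ref. The sketch below uses a made-up reaction identifier and a scratch repository that simulates a deleted log file.

```shell
#!/bin/sh
# Sketch: search the full history for a term that no longer exists in the
# working tree. The identifier "HMR_0001" and the file name are made up;
# a scratch repository simulates a deleted change log.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name demo
git config user.email demo@example.com

echo "reaction HMR_0001 removed from model" > changes.log
git add changes.log
git commit -q -m "add change log"
git rm -q changes.log
git commit -q -m "delete old change log"

# The term is gone from the tip, but every reachable revision can be
# searched; output lists <commit>:<file> for each match
git grep -l "HMR_0001" $(git rev-list --all)
```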

I am therefore leaning toward @Hao-Chalmers's earlier suggestion to create a ./deprecated folder where we move old and unused scripts and data - except for binaries, which would be deleted. Although the scripts contained therein may break further due to a modified directory structure as pointed out by @mihai-sysbio, I don't think it is much of a loss because we anyway shouldn't expect anything in the deprecated directory to be functional as is - for that, one should instead revert to a previous repository version when the script was created. This approach would also address my concern of a cluttered repository by separating old/unused files from the current maintained content.

Once a file is moved to the deprecated folder, there isn't really a rush to delete it, as it will have no effect on the repository size as long as it's not a binary. So we could in theory just leave them there indefinitely, unless there is a good argument to fully delete at some point.

To summarize, here is my revised suggestion:

  1. Create a ./deprecated directory
  2. Move old/unused content from ./data to ./deprecated/data
  3. Move old/unused content from ./code to ./deprecated/code
  4. Delete old/unused binaries
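Steps 1–3 above could be carried out with `git mv`, which keeps the move visible to `git log --follow`. The folder names follow the suggestion in this thread, while the file names are hypothetical and a scratch repository stands in for the real one.

```shell
#!/bin/sh
# Sketch of the suggested cleanup: move old content into ./deprecated
# while keeping its history traceable. File names are hypothetical.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name demo
git config user.email demo@example.com

mkdir -p code data
echo "% one-time curation" > code/oldCuration.m
echo "old data" > data/oldData.tsv
git add .
git commit -q -m "existing layout"

# Steps 1-3: create ./deprecated and move the old content into it
mkdir -p deprecated/code deprecated/data
git mv code/oldCuration.m deprecated/code/
git mv data/oldData.tsv deprecated/data/
git commit -q -m "chore: move one-time curation files to ./deprecated"

# The pre-move history is still reachable through the rename
git log --follow --oneline -- deprecated/code/oldCuration.m
```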

@mihai-sysbio
Member

mihai-sysbio commented Jan 18, 2021

A way to build further on the above would be to do cleanups routinely, say prior to each major release. I could even imagine a different folder per cleanup, e.g. deprecated/v1.
Another approach would be to keep the code where it is on develop and remove it from master.

@haowang-bioinfo
Member Author

For the one-time curation scripts, one could actually keep them under the ./deprecated/code/modelCuration/ subfolder from the beginning. This would save the cleanup step.
