
style: whether to preserve previously used curation scripts or not? #190

Closed
2 tasks done
haowang-bioinfo opened this issue Jul 20, 2020 · 13 comments

Comments

@haowang-bioinfo
Member

Description of the issue:

  • Many curation scripts have accumulated in the repo. These scripts were used only once to make specific changes to the model, and won't be used again. This issue is to open a discussion about whether or not to preserve these one-time scripts under the code folder.

Expected feature/value/output:

  • Find an approach that is easy to use for both users and curators

I hereby confirm that I have:

  • Done this analysis in the master branch of the repository
  • Checked that a similar issue does not exist already
@mihai-sysbio
Member

Based on the reasoning in #188, I think @JonathanRob summarized it well:

remove the older model curation scripts (the more recent ones can stay)

Together with this, it would be nice to have a guide on how to go back in time and walk through an example curation.
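Such a guide could be sketched along the following lines. Everything here is illustrative: the tag, file names, and commit messages are hypothetical, and a throwaway scratch repository stands in for Human-GEM.

```shell
#!/bin/sh
# Sketch: "go back in time" to rerun a one-time curation script that was
# later removed from the tip of history. All names are hypothetical.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name demo
git config user.email demo@example.com

# A release that still contained the curation script
echo 'echo "curating model"' > curate.sh
git add curate.sh
git commit -q -m "feat: add one-time curation script"
git tag v1.0.0

# Later, the script is removed from the tip of history
git rm -q curate.sh
git commit -q -m "chore: remove one-time curation script"

# To rerun the old curation, check out the tagged release (detached HEAD)...
git checkout -q v1.0.0
sh curate.sh                # prints "curating model"
# ...and return to the branch we came from
git checkout -q -
```

The key point for the guide would be that nothing is lost by deleting a script from the tip: any tagged release can be checked out and the script run in the exact directory layout it was written for.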

@JonathanRob
Collaborator

Yes, I like @mihai-sysbio's suggestion to provide a short guide on how to take advantage of the git framework. I guess adding such a guide to the README in the code/ directory would be best, as you suggested earlier, rather than the main repository README. Another option would be to create a wiki for the repository and add it there, but maybe that's an idea for the more distant future.

@haowang-bioinfo
Member Author

@mihai-sysbio @JonathanRob I don't have strong opinions about removing the old scripts, but I am concerned that this solution might be biased toward curators rather than users.

@JonathanRob
Collaborator

To add to this, I would also recommend removing a lot of the content in the data/ directory. Much of the data and scripts in that directory are no longer used or maintained, so they should be removed to avoid confusion/clutter.

@haowang-bioinfo
Member Author

Agreed; the outdated content under data and the unused scripts in code can be gradually moved to a root-level deprecated directory, in which no further curation would take place and which could be completely removed at some later point (e.g. a major release).

@mihai-sysbio
Member

The move to a ./deprecated folder might further impede the ability to run said code because of the broken folder structure. Moreover, the recovery mechanism is the same: going back in time to a previous release. I therefore find the move to such a folder unnecessary.

@haowang-bioinfo
Member Author

haowang-bioinfo commented Jan 15, 2021

  1. The move to a deprecated folder is meant to stop any further running and curation of the code/data;
  2. If we rely on the recovery mechanism alone, could all the content of data and code simply be deleted?
  3. Many data files and code appear to be unused and unnecessary, but is this 100% certain?
  4. Is it a big cost to keep such a folder of ~100 MB in size?

@JonathanRob
Collaborator

JonathanRob commented Jan 15, 2021

To respond to your points, @Hao-Chalmers:

  1. The move to a deprecated folder is meant to stop any further running and curation of the code/data;

Deleting the code/data will also prevent running and curating it further.

  2. If we rely on the recovery mechanism alone, could all the content of data and code simply be deleted?

I'm not sure what you mean by this, but if there is code or data that is not used or maintained and is not expected to be used or developed in the future, then it should not remain in the repository. A difficulty for new users is having to familiarize themselves with all the code/data in a repository to see whether what they need is already present or still needs to be developed. For example, I have written my own custom functions in RAVEN only to find out afterward that they already existed and I just didn't see them at first because there are so many (this is not a critique of RAVEN, but an inherent challenge of larger repositories). Keeping a lot of old and unused scripts and data makes this even more tedious and problematic.

  3. Many data files and code appear to be unused and unnecessary, but is this 100% certain?

I do not delete any code or files unless I'm quite certain that they are old and no longer necessary. Of course I make a lot of mistakes, but it's very easy to simply restore a wrongly deleted file if we later realize it was incorrectly removed.
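The "very easy to restore" step could be sketched as below. The file path and commit messages are made up, and a scratch repository simulates the accidental deletion.

```shell
#!/bin/sh
# Sketch: restore a wrongly deleted file from history. Paths and commit
# messages are illustrative; a scratch repository simulates the mistake.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name demo
git config user.email demo@example.com

mkdir data
echo "gene data" > data/genes.tsv
git add data
git commit -q -m "add data file"

git rm -q data/genes.tsv
git commit -q -m "remove data file (turns out to be a mistake)"

# Find the commit that deleted the file...
del=$(git log --format=%H --diff-filter=D -- data/genes.tsv | head -n 1)
# ...and restore the file from that commit's parent
git checkout -q "$del"^ -- data/genes.tsv
git commit -q -m "restore data/genes.tsv"
cat data/genes.tsv          # prints "gene data"
```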

  4. Is it a big cost to keep such a folder of ~100 MB in size?

For a git repository, I generally try to keep the size as small as reasonably possible. So if something is not necessary, then why keep it?

@haowang-bioinfo
Member Author

Whether or not it is necessary to keep the previous data and scripts used to modify the GEM is an important question. It should be openly discussed and weighed with input from multiple parties: administrators, community contributors, and especially the users, whose thoughts are currently unavailable.

Not long ago, modeling work suffered from the lack of previous code and data, which made reuse and further development a painful process. Git-based GEM repos have greatly mitigated this, even though the size of a GEM repo grows with time. In balancing "easy to trace and full transparency" against "reduce repo size", the former has a higher priority in my view.

@mihai-sysbio
Member

Once code and data are added to a repository, only actions that alter history (e.g. force-pushing) can make them go away. Having protected branches reduces some of that risk. In practice, code and data in Human-GEM will always be present in the repository when doing a full git clone or downloading an old release. This means that the total size of the repository will essentially never decrease.

Now, by deleting some of these files, they will disappear in future releases. As @JonathanRob described above, there are many advantages to keeping a clean and functional code base, free of dead code that is not directly runnable (say, because the file tree or file names have changed). At the same time, if code or data is not present in a current release, one has to dig a little deeper to find it.

I believe the ultimate purpose of this issue is to identify all code that cannot be run any longer and will not be fixed to run in the future, together with associated data, and other data that is not relevant any longer.

@hao:
Not long ago, modeling work suffered from the lack of previous code and data, which made reuse and further development a painful process. Git-based GEM repos have greatly mitigated this, even though the size of a GEM repo grows with time. In balancing "easy to trace and full transparency" against "reduce repo size", the former has a higher priority in my view.

I think we are all on the same page: any commit that makes changes to the model should be supported by code and data present in those commits. After many releases, I believe the time has come to ask ourselves how long we will keep around code and data that was meant to be used only once, while pretending it can still run.

@JonathanRob JonathanRob changed the title style: whether to preserver previously used curation scripts or not? style: whether to preserve previously used curation scripts or not? Jan 18, 2021
@JonathanRob
Collaborator

Thank you for your input @Hao-Chalmers and @mihai-sysbio. After some further consideration, I have changed my view on the matter.

In particular, I find it to be extremely useful to be able to use GitHub's search function to query the repository for any scripts, functions, log files, etc. that contain a given term (such as the name of a reaction that has been deleted). If the old data files and scripts are deleted, their contents will not appear in the search results, requiring a bit more digging into previous versions as @mihai-sysbio explained.
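As a local complement to the GitHub search (which only covers the current tree of the default branch), `git grep` can search every revision reachable from any ref. The sketch below uses a made-up reaction identifier and a scratch repository that simulates a deleted log file.

```shell
#!/bin/sh
# Sketch: search the full history for a term that no longer exists in the
# working tree. The identifier "HMR_0001" and the file name are made up;
# a scratch repository simulates a deleted change log.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name demo
git config user.email demo@example.com

echo "reaction HMR_0001 removed from model" > changes.log
git add changes.log
git commit -q -m "add change log"
git rm -q changes.log
git commit -q -m "delete old change log"

# The term is gone from the tip, but every reachable revision can be
# searched; output lists <commit>:<file> for each match
git grep -l "HMR_0001" $(git rev-list --all)
```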

I am therefore leaning toward @Hao-Chalmers's earlier suggestion to create a ./deprecated folder where we move old and unused scripts and data - except for binaries, which would be deleted. Although the scripts contained therein may break further due to a modified directory structure as pointed out by @mihai-sysbio, I don't think it is much of a loss because we anyway shouldn't expect anything in the deprecated directory to be functional as is - for that, one should instead revert to a previous repository version when the script was created. This approach would also address my concern of a cluttered repository by separating old/unused files from the current maintained content.

Once a file is moved to the deprecated folder, there isn't really a rush to delete it, as it will have no effect on the repository size as long as it's not a binary. So we could in theory just leave them there indefinitely, unless there is a good argument to fully delete at some point.

To summarize, here is my revised suggestion:

  1. Create a ./deprecated directory
  2. Move old/unused content from ./data to ./deprecated/data
  3. Move old/unused content from ./code to ./deprecated/code
  4. Delete old/unused binaries
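Steps 1–3 above could be carried out with `git mv`, which keeps the move visible to `git log --follow`. The folder names follow the suggestion in this thread, while the file names are hypothetical and a scratch repository stands in for the real one.

```shell
#!/bin/sh
# Sketch of the suggested cleanup: move old content into ./deprecated
# while keeping its history traceable. File names are hypothetical.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name demo
git config user.email demo@example.com

mkdir -p code data
echo "% one-time curation" > code/oldCuration.m
echo "old data" > data/oldData.tsv
git add .
git commit -q -m "existing layout"

# Steps 1-3: create ./deprecated and move the old content into it
mkdir -p deprecated/code deprecated/data
git mv code/oldCuration.m deprecated/code/
git mv data/oldData.tsv deprecated/data/
git commit -q -m "chore: move one-time curation files to ./deprecated"

# The pre-move history is still reachable through the rename
git log --follow --oneline -- deprecated/code/oldCuration.m
```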

@mihai-sysbio
Member

mihai-sysbio commented Jan 18, 2021

A way to build further on the above would be to do cleanups routinely, say prior to each major release. I could even imagine a different folder per cleanup, e.g. deprecated/v1.
Another approach would be to keep the code where it is on develop and remove it from master.

@haowang-bioinfo
Member Author

For the one-time curation scripts, one could actually keep them under the ./deprecated/code/modelCuration/ subfolder from the beginning. This would save the cleanup step.
