genomeGit 3.0: A distributed version control system for fast and efficient updating of genome assembly data.
genomeGit 3.0 is a distributed version control system that uses Git for fast storage and management of genomic data. It enables quick 'lift-over' of genome-dependent files while remaining storage efficient. genomeGit 3.0 can currently handle the following datasets:
- Genome assemblies (FASTA)
- Variant Calling Files (VCF)
- Annotation files (GFF/GFF3)
- Alignment files (SAM/BAM)
genomeGit 3.0 is currently only compatible with Unix operating systems.
genomeGit 3.0 is now able to make use of BAM files. There is also a new hybrid alignment option available using MashMap2 and Nucmer4. In addition, genomeGit 3.0 is faster, more accurate, and can handle inversions as well as both splits and merges within genomic data.
The following dependencies are required:
genomeGit 3.0 also makes use of the following Python modules:
In order to be able to run the program anywhere on your system, run genomegit_install. This will create a symlink for genomegit in the /usr/bin directory. You may need to make the main script executable using chmod u+x <path_to_genomeGit>/genomegit.
You can temporarily append the genomeGit directory to the $PATH variable by executing PATH=$PATH:directory, where directory represents the location of the genomeGit directory. You may need to make the main script executable using chmod u+x <path_to_genomeGit>/genomegit.
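The temporary PATH approach could look like the sketch below. GENOMEGIT_DIR and its default value are hypothetical and not something genomeGit defines; point it at wherever your copy of genomeGit lives. The change lasts only for the current shell session.

```shell
# Hypothetical location of the genomeGit directory; adjust to your system.
GENOMEGIT_DIR="${GENOMEGIT_DIR:-$HOME/genomeGit}"

# Append the directory to PATH for the current shell session only.
PATH="$PATH:$GENOMEGIT_DIR"
export PATH

# Make the main script executable if it exists at that location.
if [ -f "$GENOMEGIT_DIR/genomegit" ]; then
    chmod u+x "$GENOMEGIT_DIR/genomegit"
fi

# Report whether the command can now be resolved.
if command -v genomegit >/dev/null 2>&1; then
    echo "genomegit found at: $(command -v genomegit)"
else
    echo "genomegit not found; check GENOMEGIT_DIR=$GENOMEGIT_DIR"
fi
```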
To uninstall genomeGit, remove the genomeGit directory from your PC and then remove the symlink, using rm -rf ./genomegit and then sudo rm /usr/bin/genomegit.
To display the genomeGit welcome message, execute genomegit.
Please note that genomeGit can make use of Git commands, executed as follows: genomegit <git_command>. See the Git documentation for further information on how to use Git.
To get the list of all available genomeGit commands, execute genomegit help.
The repository can be initialized by executing genomegit init, which creates a .gnmgit directory.
This repository will store all of your genomic data and the .git repository. You can clone an existing repository by executing genomegit clone <url>.
To add files into the repository, execute genomegit add <file> and genomegit commit -m <message>.
Additional arguments can be passed to genomegit add <file>, such as the number of threads (--t=<x> or --thread=<x>) and the aligner you wish to use (--a=<1 or 2> or --aligner=<1 or 2>, where 1 runs the hybrid aligner and 2 runs NUCmer4 only).
Flags specific to NUCmer4 (--c=<x> or --mincluster=<x>) can be used (see the NUCmer documentation for information regarding these flags).
Likewise, the flags --k=<x> or --kmer=<x>, --s=<x> or --segLength=<x>, and --pi=<x> can be used for MashMap2 (see the MashMap GitHub page for information regarding these flags).
For the hybrid alignment, the user can also set the flag --ms=<x> to either 1 or 2, where 2 will detect merges (but not splits). By default, it will detect splits (--ms=1).
Using only NUCmer4 will result in the detection of both splits and merges.
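As a rough illustration of how the thread and aligner options combine, the sketch below assembles a genomegit add command line and prints it without running it. build_add_cmd is a hypothetical helper written for this example, not part of genomeGit; only the --t and --a flags described above are used.

```shell
# build_add_cmd is a hypothetical helper: it only prints the command line,
# it does not run genomegit.
build_add_cmd() {
    threads=$1
    aligner=$2
    file=$3
    # The text above allows only 1 (hybrid) or 2 (NUCmer4 only).
    case "$aligner" in
        1|2) ;;
        *) echo "aligner must be 1 (hybrid) or 2 (NUCmer4 only)" >&2
           return 1 ;;
    esac
    printf 'genomegit add --t=%s --a=%s %s\n' "$threads" "$aligner" "$file"
}

# Prints: genomegit add --t=32 --a=2 NewAssembly.fa
build_add_cmd 32 2 NewAssembly.fa
```

Printing the command first makes it easy to check the flags before starting a long alignment.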
genomeGit 3.0 will automatically classify the input file and parse it into its respective Git-compatible sub-files within the Git repository. A summary of the characteristics of the data within the repository can be visualised using the command genomegit report.
When a user already has a genome assembly present within the repository and wishes to update it, genomeGit 3.0 will automatically migrate the coordinates of the stored dependent files. This is called lift-over. This process can be computationally demanding, so it is recommended to use the optional --t=<x> or --threads=<x> parameter to choose the number of threads used during lift-over.
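One way to pick a thread count for lift-over is to ask the operating system how many CPUs are online. The sketch below does that and prints the resulting command rather than running it. Using every available CPU is an assumption made here for illustration, not a genomeGit recommendation, and NewAssembly.fa is a placeholder file name.

```shell
# Ask the OS for the number of online CPUs; fall back to 1 if unavailable.
THREADS=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)

# Guard against an empty or non-numeric answer.
case "$THREADS" in
    ''|*[!0-9]*) THREADS=1 ;;
esac

# Print the command instead of executing it, so the sketch is safe to try.
echo "genomegit add --t=$THREADS NewAssembly.fa"
```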
genomeGit 3.0 enables users to perform updates within their local repository and push them to a central repository for all users to use. The central repository can be created with the command genomegit init --bare <remote_name>.
To access a remote repository, the remote repository address needs to be added. This can be done as follows: genomegit remote add <remote_name> <remote_location>, where <remote_location> is the absolute path to the repository of interest if it is located on the same machine the command is executed from.
To obtain a repository from a server, a username and server IP address are required, i.e. [email protected].
The <remote_name> parameter is the nickname of the repository that the user provides when pushing and pulling from a local repository.
To update the local repository, the user can fetch the remote repository's data and merge it into their local repository by typing genomegit pull <remote_name> <branch_name>. Any changes that were introduced into the local repository can be pushed to the remote by executing genomegit push <remote_name>.
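The remote commands above can be strung together as in the following dry-run sketch. The run helper and the remote location are hypothetical; origin and master are used only because they are the conventional Git names. Each command is printed instead of executed.

```shell
# Hypothetical values for illustration only.
REMOTE_NAME="origin"
REMOTE_LOCATION="/path/to/central_repository"
BRANCH_NAME="master"

# run prints each command instead of executing it.
# Replace the body with "$@" to execute the commands for real.
run() {
    echo "+ $*"
}

# Register the remote, bring in its changes, then publish local changes.
run genomegit remote add "$REMOTE_NAME" "$REMOTE_LOCATION"
run genomegit pull "$REMOTE_NAME" "$BRANCH_NAME"
run genomegit push "$REMOTE_NAME"
```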
4. Assembly version log, checking out an assembly version of interest, and listing all the files of a particular dataset
You can switch to any of your stored assembly versions by using genomegit checkout <commit_hash>, where <commit_hash> represents the SHA-1 commit hash. This can be obtained by executing genomegit log.
To reconstruct any Git-compatible files, such as the extracted VCF data, execute genomegit get --dataset --sequence --region --commit-hash --message <filename>, where <filename> and --dataset are required arguments.
The <filename> argument is the file of interest that the user wishes to reconstruct.
If a user wishes to reconstruct a Variants datatype, they can execute genomegit get --dataset=Variants. The optional parameters --sequence and --region can be used to extract regions of a sequence which are contained within the file of interest. The region must be specified as a range in the form of two integers separated by a dash ("-"), e.g. --region=1-5000. If the file of interest belongs to a previous repository version, then --commit-hash and --message can be used to specify the version's commit hash or its message.
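Since --region expects two integers separated by a dash, a small check before calling genomegit get can catch malformed input early. The sketch below is only an illustration: valid_region is a hypothetical helper, it checks the shape of the value only (not whether the start lies before the end), and the command is printed rather than run.

```shell
# valid_region is a hypothetical helper; it checks only the "<int>-<int>" shape.
valid_region() {
    printf '%s\n' "$1" | grep -Eq '^[0-9]+-[0-9]+$'
}

REGION="1-4000000"

if valid_region "$REGION"; then
    # Print the command instead of executing it.
    echo "genomegit get --dataset=Variants --sequence=Ch03 --region=$REGION"
else
    echo "invalid region: $REGION" >&2
fi
```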
The command genomegit list will obtain a list of all the file names present within the repository.
To revert back to the main branch, you can type genomegit checkout <branch name>.
To view the difference between two versions of the genomic data present within the repository, execute genomegit diff --message=<message> <hash1> <hash2>, where <hash1> <hash2> represent the hashes of the commits given to the user following lift-over.
The user can alternatively use --message=<message> if they used a commit message instead. The number of threads can be provided using genomegit diff --threads=<x> or --t=<x>, as previously described. This might prove useful when comparing non-consecutive versions, as comparisons of assembly versions may need to be performed. genomegit log will provide a list of commits with their hashes.
The user can also execute genomegit report to obtain a report output, GenomeGit_Report.txt, which contains all the information regarding the differences between assembly versions.
bash /path/to/directory/genomegit_install
genomegit init
genomegit clone [email protected]:/home/user/repository
genomegit init --bare <remote_name>
genomegit remote add origin [email protected]
genomegit pull origin master
genomegit push origin
genomegit add /path/to/directory/*
genomegit add --t=32 --a=2 --c=3000 NewAssembly.fa
genomegit commit -m "My_First_Commit"
genomegit get MyAssembly.fa
genomegit get MyAnnotation.gff --sequence=Ch02 --region=1-2000000
genomegit get --dataset=Variants --sequence=Ch03 --region=1-4000000 --message=My_First_Commit
genomegit report --commit-hash 352d69d2327dc95b58d6ec10130366ddc760bd3d
genomegit diff --message My_First_Commit My_Second_Commit