Merge pull request #1 from jamaliki/nucleotide

Nucleotide
3dem · May 12, 2023
2 parents ba75c8e + 03d2ddc
commit bfb4ca3
Showing 50 changed files with 4,077 additions and 2,691 deletions.
diff --git a/.gitignore b/.gitignore
@@ -130,3 +130,9 @@ dmypy.json
 
 # Pycharm
 .idea/
+*.aap
+*.csv
+*.cif
+*.pkl
+*.hmm
+*.prot
diff --git a/README.md b/README.md
@@ -3,12 +3,15 @@
 ModelAngelo is an automatic atomic model building program for cryo-EM maps.
 
 ## Compute requirements
+
 It is highly recommended to have access to GPUs with at least 8GB of memory. ModelAngelo performs well on NVIDIA GPUs such as 2080's and beyond.
 
 Please note that the weight files required by both ModelAngelo and the language model it uses combined are around 10 GB. So you need to have more disk space than that.
 
 ## Installation
-### Personal use
+<details>
+<summary>Personal use</summary>
+<br>
 
 (If you manage a computational cluster, please skip to the next section)
 
@@ -38,8 +41,13 @@ source install_script.sh
 You will now have a conda environment called `model_angelo` that is able to run the program. 
 You need to activate this conda environment with `conda activate model_angelo`. 
 Now, you can run `model_angelo build -h` to see if the installation worked!
+<br>
+</details>
+
+<details>
+<summary>Shared computational environment</summary>
+<br>
 
-### Installing for a shared computational environment
 If you manage a computational cluster with many users and would like to install ModelAngelo once to be used everywhere, 
 you should complete the above steps 1 and 2 for a public account.
 
@@ -61,8 +69,12 @@ Finally, you can make the following bash script available for all users to run:
 source `which activate` model_angelo
 model_angelo "$@"
 ```
+<br>
+</details>
 
-## Installation issues
+<details>
+<summary>Installation issues</summary>
+<br>
 
 **1. Binary activate not found**
 It appears that miniconda's activate binary is not added to `PATH` by default. You can either fix this by appending it yourself, like so:
@@ -71,19 +83,34 @@ export PATH="$PATH:/path/to/miniconda3/bin"
 ```
 or running `conda init` and restarting your shell.
 
+</details>
+
 ## Usage
-### Building a map with FASTA sequence
-This is the recommended use case, when you have access to a medium-high resolution cryo-EM map (resolutions exceeding 4 Å) as well as a FASTA file with all of your protein sequences.
+First, make sure to run `model_angelo build --help` or `model_angelo build_no_seq --help` to familiarize yourself with all of the options available.
 
-To familiarize yourself with the options available in `model_angelo build`, run `model_angelo build -h`.
+<details>
+<summary>Building a map with FASTA sequence</summary>
+<br>
 
-Let's say the map's name is `map.mrc` and the sequence file is `sequence.fasta`. To build your model in a directory named `output`, you run:
+This is the recommended use case, when you have access to a medium-high resolution cryo-EM map (resolutions exceeding 4 Å) as well as a FASTA files with all of your protein, RNA, and DNA sequences.
+
+Let's say the map's name is `map.mrc` and the (protein) sequence file is `prot.fasta`. To build your model in a directory named `output`, you run:
+```
+model_angelo build -v map.mrc -pf prot.fasta -o output
+```
+If you would like to build nucleotides as well, you need to provide the RNA and DNA portions of your sequences in different files like so
 ```
-model_angelo build -v map.mrc -f sequence.fasta -o output
+model_angelo build -v map.mrc -pf prot.fasta -df dna.fasta -rf rna.fasta -o output
 ```
+If you only have RNA or DNA, you can drop the other input.
+
 If the output of the program halts before the completion of `GNN model refinement, round 3 / 3`, there was a bug that you can see in `output/model_angelo.log`. Otherwise, you can find your model in `output/output.cif`. The name of the mmCIF file is based on the output folder name, so if you specify, for example, `-o testing/test/model_building`, the model will be in `testing/test/model_building/model_building.cif`.
+</details>
+
+<details>
+<summary>Building a map with no FASTA sequence</summary>
+<br>
 
-### Building a map with no FASTA sequence
 If you have a sample where you do not know all of the protein sequences that occur in the map, you can run `model_angelo build_no_seq` instead.
 This version of the program uses a network that was not trained with input sequences, nor does it do post-processing on the built map.
 
@@ -93,17 +120,27 @@ You run this command:
 ```
 model_angelo build_no_seq -v map.mrc -o output
 ```
-The model will be in `output/output.cif` as before. Now there are also HMM profiles for each chain in HHsearch's format here: `output/hmm_profiles`.
-To do a sequence search for chain A (for example), you should first install [HHblits](https://github.com/soedinglab/hh-suite) and download one of the [databases](https://github.com/soedinglab/hh-suite#available-databases). Then, you can run
+The model will be in `output/output.cif` as before. Now there are also HMM profiles for each chain in HMMER3 format here: `output/hmm_profiles`.
+To do a sequence search for chain A (for example), you should first download a database that will include your organism's proteins, such as the [human genome](https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_genomic.fna.gz). Then, you can run
 ```
-hhblits -i output/hmm_profiles/A.hhm -d PATH_TO_DB -o A.hhr -oa3m A.a3m -M first
+model_angelo hmm_search --i output --f PATH_TO_DB --o hmm_output
 ```
-You will have your result as a multiple sequence alignment here: `A.a3m`. 
+You will have your results as a series of HMM output files with the extension .hhr, for example: `hmm_output/A.hhr`. These are named by chain according to the model built by ModelAngelo in `output/output.cif`.
+</details>
+
+## FAQs
+
+1. **How do I change which GPU ModelAngelo runs on?** You can specify the device(s) ModelAngelo runs on by using the `--device` flag. So, for example, to use GPU with Id 0, you write `--device 0`. To use the first two GPUs of your computer, you can write `--device 0,1`.
+2. **Do I need to repeat the sequence of a dimer twice in the FASTA file?** No, each *unique* sequence only needs to show up once in the FASTA file. Duplicates are always removed.
+3. **How does ModelAngelo deal with glycosylation sites, non standard amino acids, etc?** It *doesn't*. These parts of the model should be checked manually.
+4. **How does ModelAngelo deal with cis prolines?** It *doesn't*. However, we find that a round of refinement (with REFMAC, for example) fixes this issue.
 
 ## Common issues
-1. ModelAngelo currently does not build nucleotides. It also may make mistakes if nucleotide sequences are in the sequence fasta file.
 
-2. If the result looks very bad, with many disconnected chains, take a look at the alpha helices. If these are made of short and disconnected chains, the map was probably in the wrong handedness. If you flip the map and run again, you should see much better results.
+1. If the result looks very bad, with many disconnected chains, take a look at the alpha helices. If these are made of short and disconnected chains, the map was probably in the wrong hand. If you flip the map and run again, you should see much better results.
+2. If the map is processed using deepEMhancer, we have noticed less than satisfactory results. Please try with a map post-processed with a conventional algorithm and try again.
+3. Always check your input sequence files to make sure that they correspond to a correct FASTA format. Please make sure that the sequences are all capital letters, as is the convention.
+4. If the output model is shifted with respect to your map, make sure that the map provided to ModelAngelo is cubic. Otherwise, it might get shifted when ModelAngelo internally makes the map cubic.
 
 ## Citation
 
@@ -119,4 +156,4 @@ booktitle={International Conference on Learning Representations},
 year={2023},
 url={https://openreview.net/forum?id=65XDF_nwI61}
 }
-```
+```
diff --git a/install_script.sh b/install_script.sh
@@ -23,7 +23,7 @@ fi
 
 is_conda_model_angelo_installed=$(conda info --envs | grep model_angelo -c)
 if [[ "${is_conda_model_angelo_installed}" == "0" ]];then
-  conda create -n model_angelo python=3.9 -y;
+  conda create -n model_angelo python=3.10 -y;
 fi
 
 torch_home_path="${TORCH_HOME}"
@@ -56,8 +56,8 @@ $python_exc setup.py install
 
 if [[ "${DOWNLOAD_WEIGHTS}" ]]; then
   echo "Writing weights to ${TORCH_HOME}"
-  $python_exc model_angelo/utils/setup_weights.py --bundle-name original
-  $python_exc model_angelo/utils/setup_weights.py --bundle-name original_no_seq
+  $python_exc model_angelo/utils/setup_weights.py --bundle-name nucleotides
+  $python_exc model_angelo/utils/setup_weights.py --bundle-name nucleotides_no_seq
 else
   echo "Did not download weights because the flag -w or --download-weights was not specified"
 fi
diff --git a/model_angelo/__init__.py b/model_angelo/__init__.py
@@ -5,4 +5,4 @@
 """
 
 
-__version__ = "0.2.4"
+__version__ = "1.0.0"
diff --git a/model_angelo/__main__.py b/model_angelo/__main__.py
@@ -11,8 +11,7 @@ def main():
     import model_angelo
 
     parser = argparse.ArgumentParser(
-        description=__doc__,
-        formatter_class=argparse.RawTextHelpFormatter,
+        description=__doc__, formatter_class=argparse.RawTextHelpFormatter,
     )
     parser.add_argument(
         "--version",
@@ -24,17 +23,19 @@ def main():
     import model_angelo.apps.build_no_seq
     import model_angelo.apps.evaluate
     import model_angelo.apps.eval_per_resid
+    import model_angelo.apps.hmm_search
+    import model_angelo.apps.refine
 
     modules = {
         "build": model_angelo.apps.build,
         "build_no_seq": model_angelo.apps.build_no_seq,
         "evaluate": model_angelo.apps.evaluate,
         "eval_per_resid": model_angelo.apps.eval_per_resid,
+        "hmm_search": model_angelo.apps.hmm_search,
+        "refine": model_angelo.apps.refine,
     }
 
-    subparsers = parser.add_subparsers(
-        title="Choose a module",
-    )
+    subparsers = parser.add_subparsers(title="Choose a module",)
     subparsers.required = "True"
 
     for key in modules:
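
Note on the hunk above: it registers `hmm_search` and `refine` in the `modules` dict, and the loop that follows (cut off in this view) presumably wires each entry up as a subcommand. As a rough, self-contained sketch of that dict-driven `argparse` pattern, and not the project's actual code, something like the following would work. The `add_*_args` and `run_*` helpers, the flag names, and the `dispatch` function are hypothetical stand-ins for illustration; the real app modules may expose a different interface.

```python
import argparse


def add_build_args(parser):
    # Hypothetical stand-in for an app module's argument setup.
    parser.add_argument("-v", "--volume", required=True)
    parser.add_argument("-o", "--output", default="output")
    return parser


def run_build(args):
    # Hypothetical stand-in for an app module's entry point.
    return f"build {args.volume} -> {args.output}"


def add_hmm_search_args(parser):
    # Flag names here are illustrative, not the real hmm_search options.
    parser.add_argument("-i", "--input", required=True)
    return parser


def run_hmm_search(args):
    return f"hmm_search {args.input}"


# Maps each subcommand name to its (argument setup, entry point) pair.
MODULES = {
    "build": (add_build_args, run_build),
    "hmm_search": (add_hmm_search_args, run_hmm_search),
}


def make_parser():
    parser = argparse.ArgumentParser(prog="demo")
    subparsers = parser.add_subparsers(title="Choose a module", dest="command")
    subparsers.required = True
    for name, (add_args, entry_point) in MODULES.items():
        sub = subparsers.add_parser(name)
        add_args(sub)
        # Remember which function to call once this subcommand is selected.
        sub.set_defaults(func=entry_point)
    return parser


def dispatch(argv):
    # Parse the argument list and call the selected subcommand's entry point.
    args = make_parser().parse_args(argv)
    return args.func(args)
```

With this layout, adding a subcommand only requires a new entry in the dict, which matches the shape of the two-line additions in the hunk.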