Merge pull request #207 from alexismhill3/update-install

Make Opfi available on Bioconda
wilkelab · Sep 30, 2021 · a02850d · a02850d
2 parents 0b8f11b + 4bd19e0
commit a02850d
Show file tree

Hide file tree

Showing 10 changed files with 185 additions and 172 deletions.
diff --git a/.github/workflows/feature.yml b/.github/workflows/feature.yml
@@ -27,7 +27,7 @@ jobs:
         sudo cp -r GenericRepeatFinder/bin /usr/local/bin/
     - name: Install dependencies
       run: |
-        cd lib/pilercr1.06 && sudo make install && cd ../..
+        sudo apt-get install -y pilercr
         sudo apt install -y ncbi-blast+
         sudo apt install -y mmseqs2
         sudo apt install -y diamond-aligner

diff --git a/.github/workflows/master.yml b/.github/workflows/master.yml
@@ -28,7 +28,7 @@ jobs:
         git clone https://github.com/bioinfolabmu/GenericRepeatFinder.git && sudo cp GenericRepeatFinder/bin/* /usr/local/bin/ && sudo chmod 755 /usr/local/bin/grf* /usr/local/bin/ltr_finder
     - name: Install dependencies
       run: |
-        cd lib/pilercr1.06 && sudo make install && cd ../..
+        sudo apt-get install -y pilercr
         sudo apt install -y ncbi-blast+
         sudo apt install -y mmseqs2
         sudo apt install -y diamond-aligner

diff --git a/README.md b/README.md
@@ -2,34 +2,41 @@
 
 [![Documentation Status](https://readthedocs.org/projects/opfi/badge/?version=latest)](https://opfi.readthedocs.io/en/latest/?badge=latest)
 [![PyPI](http://img.shields.io/pypi/v/opfi.svg)](https://pypi.python.org/pypi/opfi/)
+[![Anaconda-Server Badge](https://anaconda.org/bioconda/opfi/badges/installer/conda.svg)](https://conda.anaconda.org/bioconda)
 
 A python package for discovery, annotation, and analysis of gene clusters in genomics or metagenomics datasets.
 
-## Requirements
+## Installation
+
+The recommended way to install Opfi is with [Bioconda](https://bioconda.github.io/), which requires the [conda](https://docs.conda.io/en/latest/) package manager. This will install Opfi and all of its dependencies (which you can read more about [here](https://opfi.readthedocs.io/en/latest/installation.html)).
+
+Currently, Bioconda supports only 64-bit Linux and Mac OS. Windows users can still install Opfi with pip (see below); however, the complete installation procedure has not been fully tested on a Windows system. 
+
+### Install with conda (Linux and Mac OS only)
 
-At a minimum, the NCBI BLAST+ software suite should be installed and on the user's PATH. BLAST+ installation instruction can be found [here](https://www.ncbi.nlm.nih.gov/books/NBK279671/). For annotation of CRISPR arrays, Opfi uses PILER-CR, which can be downloaded from the software [home page](https://www.drive5.com/pilercr/). A modified version of PILER-CR that detects mini (two repeat) CRISPR arrays is also available, and can be built with GNU make after cloning or downloading Opfi:
+First, set up conda and Bioconda following the [quickstart](https://bioconda.github.io/user/install.html) guide. Once this is done, run:
 
 ```
-cd lib/pilercr1.06
-sudo make install
+conda install -c bioconda opfi
 ```
 
-## Installation
-
-You can install Opfi with Pip:
+And that's it! Note that this will install Opfi in the conda environment that is currently active. To create a fresh environment with Opfi installed, do:
 
 ```
-pip3 install opfi
+conda create --name opfi-env -c bioconda opfi
+conda activate opfi-env
 ```
 
-Alternatively, you can install the latest version on Github:
+### Install with pip 
+
+This method does not automatically install non-Python dependencies, so they will need to be installed separately, following their individual installation instructions. A complete list of required software is available [here](https://opfi.readthedocs.io/en/latest/installation.html#dependencies). Once this step is complete, install Opfi with pip by running:
 
 ```
-git clone https://github.com/wilkelab/Opfi.git
-cd Opfi
-pip3 install .
+pip install opfi
 ```
 
+For information about installing for development, check out the [documentation site](https://opfi.readthedocs.io/en/latest/installation.html).
+
 ## Gene Finder
 
 Gene Finder iteratively executes homology searches to identify gene clusters of interest. Below is an example script that sets up a search for putative CRISPR-Cas systems in the Rippkaea orientalis PCC 8802 (cyanobacteria) genome. Data inputs are provided in the Opfi tutorial (`tutorials/tutorial.ipynb`).

diff --git a/docs/examples.rst b/docs/examples.rst
@@ -10,7 +10,7 @@ In this example, we will annotate and visualize CRISPR-Cas systems in the cyanob
 
 You can download the complete assembled genome `here <https://www.ncbi.nlm.nih.gov/assembly/GCF_000024045.1/>`_; it is also available at `<https://github.com/wilkelab/Opfi>`_ under ``tutorials``, along with the other data files necessary to run these examples, and an interactive jupyter notebook version of this tutorial. 
 
-To run the code snippets here, Opfi must be installed, along with NCBI BLAST+ **and** PILER-CR. More detailed installation instructions can be found in the :ref:`installation` section. 
+This tutorial assumes the user has already installed Opfi and all dependencies (if installing with conda, this is done automatically). Some familiarity with BLAST and the basic homology search algorithm may also be helpful, but is not required. 
 
 1. Use the makeblastdb utility to convert a Cas protein database to BLAST format
 ################################################################################

diff --git a/docs/index.rst b/docs/index.rst
@@ -6,9 +6,9 @@
 Opfi
 ====
 
-Welcome to the Opfi documentation site! Opfi is a modular, rule-based framework for creating gene cassette identification pipelines, particularly for large genomics or metagenomics datasets. 
+Welcome to the Opfi documentation site! Opfi is a modular, rule-based framework for creating gene cluster identification pipelines, particularly for large genomics or metagenomics datasets. 
 
-Opfi is implemented entirely in Python, and can be downloaded from the Python package index. It consists of two major modules: Gene Finder, for discovery of novel gene cassettes, and Operon Analyzer, for rule-based filtering, deduplication, visualization, and re-annotation of systems identified by Gene Finder.
+Opfi is implemented entirely in Python, and can be downloaded with conda or the from the Python Package Index. It consists of two major modules: Gene Finder, for discovery of novel gene clusters, and Operon Analyzer, for rule-based filtering, deduplication, visualization, and re-annotation of systems identified by Gene Finder.
 
 Contents
 --------
@@ -20,4 +20,4 @@ Contents
     examples
     tips
     modules
-    contributing
+    contributing
diff --git a/docs/installation.rst b/docs/installation.rst
@@ -6,46 +6,98 @@ Getting Started
 Installation
 ------------
 
-You can install Opfi with Pip:
+The recommended way to install Opfi is with `Bioconda <https://bioconda.github.io/>`_, which requires the `conda <https://docs.conda.io/en/latest/>`_ package manager. This will install Opfi and all of its dependencies (which you can read more about below, see :ref:`dependencies`).
+
+Currently, Bioconda supports only 64-bit Linux and Mac OS. Windows users can still install Opfi with pip (see below); however, the complete installation procedure has not been fully tested on a Windows system. 
+
+.. _install-with-conda:
+
+Install with conda (Linux and Mac OS only)
+##########################################
+
+First, set up conda and Bioconda following the `quickstart <https://bioconda.github.io/user/install.html>`_ guide. Once this is done, run:
 
 .. code-block:: bash
 
-    pip3 install opfi
+    conda install -c bioconda opfi
 
-Alternatively, you can install the latest version on Github:
+And that's it! Note that this will install Opfi in the conda environment that is currently active. To create a fresh environment with Opfi installed, do:
 
 .. code-block:: bash
 
-    git clone https://github.com/alexismhill3/Opfi.git
+    conda create --name opfi-env -c bioconda opfi
+    conda activate opfi-env
+
+.. _install-with-pip:
+
+Install with pip
+################
+
+This method does not automatically install non-Python dependencies, so they will need to be installed separately, following their individual installation instructions. A complete list of required software is provided below, see :ref:`dependencies`. Once this step is complete, install Opfi with pip by running:
+
+.. code-block:: bash
+
+    pip install opfi
+
+Install from source
+###################
+
+Finally, the latest development build may be installed directly from Github. First, non-Python :ref:`dependencies` will need to be installed in the working environment. An easy way to do this is to first install Opfi with conda using the :ref:`install-with-conda` method (we'll re-install the development version of the Opfi package in the next step). Alternatively, dependencies can be installed individually.
+
+Once dependencies have been installed in the working environment, run the following code to download and install the development build:
+
+.. code-block:: bash
+
+    git clone https://github.com/wilkelab/Opfi.git
+    cd Opfi
+    pip install . # or pip install -e . for an editable version
+    pip install -r requirements # if conda was used, this can be skipped
+
+Testing the build
+#################
+
+Regardless of installation method, users can download and run Opfi's suite of unit tests to confirm that the build is working as expected. First download the tests from Github:
+
+.. code-block:: bash
+
+    git clone https://github.com/wilkelab/Opfi
     cd Opfi
-    pip3 install .
+
+And then run the test suite using pytest:
+
+.. code-block:: bash
+
+    pytest --runslow --runmmseqs --rundiamond
+
+This may take a minute or so to complete. 
+
+.. _dependencies:    
 
 Dependencies
 ------------
 
-Opfi makes use of several third-party softwares for finding and annotating genomic features. Depending on your use case, you may not need to install all of these; however, at a minimum users should have the NCBI BLAST+ application installed in their environment. The following table provides more details about required/optional dependencies, including links to application homepages.
-
-.. csv-table:: 
-   :header: "Application", "Required", "Description", "Anaconda distribution"
+Opfi uses the following bioinformatics software packages to find and annotate genomic features:
 
-   "`NCBI BLAST+ <https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastDocs>`_", "Yes", "Protein and nucleic acid homology search tool", https://anaconda.org/bioconda/blast
-   "`Diamond <https://github.com/bbuchfink/diamond>`_", "No", "Alternative to BLAST+ for fast protein homology searches", https://anaconda.org/bioconda/diamond
-   "`MMseqs2 <https://github.com/soedinglab/MMseqs2>`_", "No", "Alternative to BLAST+ for fast protein homology searches", https://anaconda.org/bioconda/mmseqs2
-   "`PILER-CR <https://www.drive5.com/pilercr/>`_", "No", "CRISPR repeat detection", https://anaconda.org/bioconda/piler-cr
-   "`GenericRepeatFinder <https://github.com/bioinfolabmu/GenericRepeatFinder>`_", "No", "Transposon-associated repeat detection", "NA"
+.. csv-table:: Software dependencies
+   :header: "Application", "Description"
 
-Testing your build
-------------------
+   "`NCBI BLAST+ <https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastDocs>`_", "Protein and nucleic acid homology search tool"
+   "`Diamond <https://github.com/bbuchfink/diamond>`_", "Alternative to BLAST+ for fast protein homology searches"
+   "`MMseqs2 <https://github.com/soedinglab/MMseqs2>`_", "Alternative to BLAST+ for fast protein homology searches"
+   "`PILER-CR <https://www.drive5.com/pilercr/>`_", "CRISPR repeat detection"
+   "`Generic Repeat Finder <https://github.com/bioinfolabmu/GenericRepeatFinder>`_", "Transposon-associated repeat detection"
 
-Users who opt to build Opfi from source can test their build by running ``pytest`` from the project root directory. The following flags will direct pytest to run specific sets of tests (in addition to the core suite):
+The first three (BLAST+, Diamond, and MMseqs2) are popular homology search applications, that is, programs that look for local similarities between input sequences (either protein or nucleic acid) and a target. These are used by Opfi in :class:`gene_finder.pipeline.Pipeline` for annotation of genes or non-coding regions of interest in the input genome/contig. The user specifies which homology search tool to use during pipeline setup (see :class:`gene_finder.pipeline.Pipeline` for details). Note that the BLAST+ distribution contains multiple programs for homology searching, three of which (blastp, blastn, and PSI-BLAST) are currently supported by Opfi. 
 
-* ``--runmmseqs``: Run tests that require MMseqs2.
-* ``--rundiamond``: Run tests that require Diamond.
-* ``--runslow``: Run integration/end-to-end tests.
-* ``--runprop``: Run very slow property tests.
+The following table summarizes the main difference between each homology search program. It may help users decide which application will best meet their needs. Note that performance tests are inherently hardware and context dependent, so this should be taken as a loose guide, rather than a definitive comparison. 
 
-For most users, running ``pytest --runslow`` is recommended. 
+.. csv-table:: Comparison of homology search programs supported by Opfi
+    :header: "Application", "Relative sensitivity", "Relative speed", "Requires a protein or nucleic acid sequence database?"
 
-.. note::
+    "Diamond", `+`, `++++`, "protein"
+    "MMseqs2", `++`, `+++`, "protein"
+    "blastp", `+++`, `++`, "protein"
+    "PSI-BLAST", `++++`, `+`, "protein"
+    "blastn", "NA", "NA", "nucleic acid"
 
-    Several tests in the core suite require :program:`BLAST`, :program:`PILER-CR`, and/or :program:`GenericRepeatFinder`. Running ``pytest`` without first installing these dependencies will cause these tests to fail. 
+The last two software dependencies, PILER-CR and Generic Repeat Finder (GRF), deal with annotation of repetive sequences in DNA. PILER-CR identifies CRISPR arrays, regions of alternatating ~30 bp direct repeat and variable sequences that play a role in prokaryotic immunity. GRF identifies repeats associated with transposable elements, such as terminal inverted repeats (TIRs) and long terminal repeats (LTRs).
diff --git a/docs/tips.rst b/docs/tips.rst
@@ -6,7 +6,7 @@ Inputs and Outputs
 Building sequence databases
 ---------------------------
 
-To search for gene cassettes with Opfi, users must compile representative protein (or nucleic acid) sequences for any genes expected in target casettes (or for any non-essential accessory genes of interest). These may be from a pre-existing, private collection of sequences (perhaps from a previous bioinformatics analysis). Alternatively, users may download sequences from a publically available database such as `Uniprot <https://www.uniprot.org/>`_ (maintained by the `European Bioinformatics Institute <https://www.ebi.ac.uk/>`_ ) or one of the `databases <https://www.ncbi.nlm.nih.gov/>`_ provided by the National Center for Biotechnology Information. 
+To search for gene clusters with Opfi, users must compile representative protein (or nucleic acid) sequences for any genes expected in target clusters (or for any non-essential accessory genes of interest). These may be from a pre-existing, private collection of sequences (perhaps from a previous bioinformatics analysis). Alternatively, users may download sequences from a publically available database such as `Uniprot <https://www.uniprot.org/>`_ (maintained by the `European Bioinformatics Institute <https://www.ebi.ac.uk/>`_ ) or one of the `databases <https://www.ncbi.nlm.nih.gov/>`_ provided by the National Center for Biotechnology Information. 
 
 Once target sequences have been compiled, they must be converted to an application-specific database format. Opfi currently supports :program:`BLAST+`, :program:`mmseqs2`, and :program:`diamond` for homology searching:
 
@@ -47,7 +47,7 @@ The sequence definition (defline) comes directly after the ``>`` character, and
 Annotating sequence databases
 #############################
 
-To take full advantage of the rule-based filtering methods in :mod:`operon_analyzer.rules`, users are encouraged to annotate reference sequences with a name/label that is easily searched. Labels can be as broad or as specific as is necessary to provide meaningful annotation of target gene cassettes.
+To take full advantage of the rule-based filtering methods in :mod:`operon_analyzer.rules`, users are encouraged to annotate reference sequences with a name/label that is easily searched. Labels can be as broad or as specific as is necessary to provide meaningful annotation of target gene clusters.
 
 Gene labels are parsed from sequence deflines; specifically, Opfi looks for the second word/token following the ``>`` character. For example, the following FASTA sequence has been annotated with the label "cas1":
 
@@ -147,7 +147,7 @@ Results from :class:`gene_finder.pipeline.Pipeline` searches are written to a si
     :file: csv/example_output.csv
     :header-rows: 0
 
-The first two columns contain the input genome/contig sequence ID (sometimes called an accession number) and the coordinates of the candidate gene cassette, respectively. Since an input file can have multiple genomic sequences, these two fields together uniquely specify a candidate gene cassette. Each row represents a single annotated feature in the candidate locus. Features from the same candidate are always grouped together in the CSV. 
+The first two columns contain the input genome/contig sequence ID (sometimes called an accession number) and the coordinates of the candidate gene cluster, respectively. Since an input file can have multiple genomic sequences, these two fields together uniquely specify a candidate gene cluster. Each row represents a single annotated feature in the candidate locus. Features from the same candidate are always grouped together in the CSV. 
 
 Descriptions of each output field are provided below. Alignment statistic naming conventions are from the BLAST documentation, see `BLAST+ appendices <https://www.ncbi.nlm.nih.gov/books/NBK279684/>`_ (specifically "outfmt" in table C1). This `glossary <https://www.ncbi.nlm.nih.gov/books/NBK62051/>`_ of common BLAST terms may also be useful in interpreting alignment statistic meaning. 
 

diff --git a/tests/integration/test_gene_finder.py b/tests/integration/test_gene_finder.py
@@ -63,13 +63,8 @@ def test_with_blast(temporary_directory):
     genomic_data = "tests/integration/integration_data/contigs/v_crass_J520_whole.fasta"
     conf_file = "tests/integration/configs/blast_integration_test.yaml"
     p = create_pipeline(conf_file)
-
     results = p.run(data=genomic_data, output_directory=temporary_directory.name)
-
     assert len(results) == 1
-    hits = results["NZ_CCKB01000071.1"]["Loc_78093-114093"]["Hits"]
-    assert len(hits) == 11
-    assert "Array_0" in hits
 
 
 @pytest.mark.slow
@@ -85,9 +80,7 @@ def test_multi_seq_fasta(temporary_directory):
     tmp, genomic_data = merge_data()
     conf_file = "tests/integration/configs/blast_integration_test.yaml"
     p = create_pipeline(conf_file)
-
     results = p.run(data=genomic_data, output_directory=temporary_directory.name)
-
     assert len(results) == 2
     tmp.cleanup()
 
@@ -103,10 +96,8 @@ def test_gzip_fasta(temporary_directory):
     conf_file = "tests/integration/configs/blast_integration_test.yaml"
     p = create_pipeline(conf_file)
     results = p.run(data=data, gzip=True, output_directory=temporary_directory.name)
-
     hits = results["KB405063.1"]["Loc_0-23815"]["Hits"]
     assert "Cas_all_hit-0" in hits
-    assert "Array_0" in hits
 
 
 @pytest.mark.slow
@@ -124,15 +115,11 @@ def test_record_all_hits_1(temporary_directory):
     genomic_data = "tests/integration/integration_data/contigs/record_all_hits_test_1"
     conf_file = "tests/integration/configs/blast_integration_test.yaml"
     p = create_pipeline(conf_file)
-
     p.run(data=genomic_data, record_all_hits=True, output_directory=temporary_directory.name)
-
     with open(os.path.join(temporary_directory.name, "gene_finder_hits.json"), "r") as f:
         hits = json.load(f)["KB405063.1"]["hits"]
-        assert len(hits) == 3
         assert "tnsAB" in hits
         assert "cas_all" in hits
-        assert "CRISPR" in hits
 
 
 @pytest.mark.slow

diff --git a/tests/integration/test_operon_analyze.py b/tests/integration/test_operon_analyze.py
@@ -193,7 +193,7 @@ def visualize(condition: str):
                 if op is None:
                     continue
                 good_operons.append(op)
-        plot_operons(good_operons, tempdir)
+        plot_operons(good_operons, tempdir, nucl_per_line=25000)
         files = os.listdir(tempdir)
         count = len([f for f in files if f.endswith(".png")])
     except Exception as e: