diff --git a/.github/workflows/feature.yml b/.github/workflows/feature.yml
index c838394..22d551d 100644
--- a/.github/workflows/feature.yml
+++ b/.github/workflows/feature.yml
@@ -27,7 +27,7 @@ jobs:
sudo cp -r GenericRepeatFinder/bin /usr/local/bin/
- name: Install dependencies
run: |
- cd lib/pilercr1.06 && sudo make install && cd ../..
+ sudo apt-get install -y pilercr
sudo apt install -y ncbi-blast+
sudo apt install -y mmseqs2
sudo apt install -y diamond-aligner
diff --git a/.github/workflows/master.yml b/.github/workflows/master.yml
index 720ad2d..4857709 100644
--- a/.github/workflows/master.yml
+++ b/.github/workflows/master.yml
@@ -28,7 +28,7 @@ jobs:
git clone https://github.com/bioinfolabmu/GenericRepeatFinder.git && sudo cp GenericRepeatFinder/bin/* /usr/local/bin/ && sudo chmod 755 /usr/local/bin/grf* /usr/local/bin/ltr_finder
- name: Install dependencies
run: |
- cd lib/pilercr1.06 && sudo make install && cd ../..
+ sudo apt-get install -y pilercr
sudo apt install -y ncbi-blast+
sudo apt install -y mmseqs2
sudo apt install -y diamond-aligner
diff --git a/README.md b/README.md
index a9b6b74..053da6f 100644
--- a/README.md
+++ b/README.md
@@ -2,34 +2,41 @@
[![Documentation Status](https://readthedocs.org/projects/opfi/badge/?version=latest)](https://opfi.readthedocs.io/en/latest/?badge=latest)
[![PyPI](http://img.shields.io/pypi/v/opfi.svg)](https://pypi.python.org/pypi/opfi/)
+[![Anaconda-Server Badge](https://anaconda.org/bioconda/opfi/badges/installer/conda.svg)](https://conda.anaconda.org/bioconda)
A python package for discovery, annotation, and analysis of gene clusters in genomics or metagenomics datasets.
-## Requirements
+## Installation
+
+The recommended way to install Opfi is with [Bioconda](https://bioconda.github.io/), which requires the [conda](https://docs.conda.io/en/latest/) package manager. This will install Opfi and all of its dependencies (which you can read more about [here](https://opfi.readthedocs.io/en/latest/installation.html)).
+
+Currently, Bioconda supports only 64-bit Linux and Mac OS. Windows users can still install Opfi with pip (see below); however, the complete installation procedure has not been fully tested on a Windows system.
+
+### Install with conda (Linux and Mac OS only)
-At a minimum, the NCBI BLAST+ software suite should be installed and on the user's PATH. BLAST+ installation instruction can be found [here](https://www.ncbi.nlm.nih.gov/books/NBK279671/). For annotation of CRISPR arrays, Opfi uses PILER-CR, which can be downloaded from the software [home page](https://www.drive5.com/pilercr/). A modified version of PILER-CR that detects mini (two repeat) CRISPR arrays is also available, and can be built with GNU make after cloning or downloading Opfi:
+First, set up conda and Bioconda following the [quickstart](https://bioconda.github.io/user/install.html) guide. Once this is done, run:
```
-cd lib/pilercr1.06
-sudo make install
+conda install -c bioconda opfi
```
-## Installation
-
-You can install Opfi with Pip:
+And that's it! Note that this will install Opfi in the conda environment that is currently active. To create a fresh environment with Opfi installed, do:
```
-pip3 install opfi
+conda create --name opfi-env -c bioconda opfi
+conda activate opfi-env
```
-Alternatively, you can install the latest version on Github:
+### Install with pip
+
+This method does not automatically install non-Python dependencies, so they will need to be installed separately, following their individual installation instructions. A complete list of required software is available [here](https://opfi.readthedocs.io/en/latest/installation.html#dependencies). Once this step is complete, install Opfi with pip by running:
```
-git clone https://github.com/wilkelab/Opfi.git
-cd Opfi
-pip3 install .
+pip install opfi
```
+For information about installing for development, check out the [documentation site](https://opfi.readthedocs.io/en/latest/installation.html).
+
## Gene Finder
Gene Finder iteratively executes homology searches to identify gene clusters of interest. Below is an example script that sets up a search for putative CRISPR-Cas systems in the Rippkaea orientalis PCC 8802 (cyanobacteria) genome. Data inputs are provided in the Opfi tutorial (`tutorials/tutorial.ipynb`).
diff --git a/docs/examples.rst b/docs/examples.rst
index f38a42c..f1064b3 100644
--- a/docs/examples.rst
+++ b/docs/examples.rst
@@ -10,7 +10,7 @@ In this example, we will annotate and visualize CRISPR-Cas systems in the cyanob
You can download the complete assembled genome `here `_; it is also available at ``_ under ``tutorials``, along with the other data files necessary to run these examples, and an interactive jupyter notebook version of this tutorial.
-To run the code snippets here, Opfi must be installed, along with NCBI BLAST+ **and** PILER-CR. More detailed installation instructions can be found in the :ref:`installation` section.
+This tutorial assumes the user has already installed Opfi and all dependencies (if installing with conda, this is done automatically). Some familiarity with BLAST and the basic homology search algorithm may also be helpful, but is not required.
1. Use the makeblastdb utility to convert a Cas protein database to BLAST format
################################################################################
diff --git a/docs/index.rst b/docs/index.rst
index 3a310c5..4c332f2 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -6,9 +6,9 @@
Opfi
====
-Welcome to the Opfi documentation site! Opfi is a modular, rule-based framework for creating gene cassette identification pipelines, particularly for large genomics or metagenomics datasets.
+Welcome to the Opfi documentation site! Opfi is a modular, rule-based framework for creating gene cluster identification pipelines, particularly for large genomics or metagenomics datasets.
-Opfi is implemented entirely in Python, and can be downloaded from the Python package index. It consists of two major modules: Gene Finder, for discovery of novel gene cassettes, and Operon Analyzer, for rule-based filtering, deduplication, visualization, and re-annotation of systems identified by Gene Finder.
+Opfi is implemented entirely in Python, and can be downloaded with conda or the from the Python Package Index. It consists of two major modules: Gene Finder, for discovery of novel gene clusters, and Operon Analyzer, for rule-based filtering, deduplication, visualization, and re-annotation of systems identified by Gene Finder.
Contents
--------
@@ -20,4 +20,4 @@ Contents
examples
tips
modules
- contributing
\ No newline at end of file
+ contributing
diff --git a/docs/installation.rst b/docs/installation.rst
index 101b46f..4156428 100644
--- a/docs/installation.rst
+++ b/docs/installation.rst
@@ -6,46 +6,98 @@ Getting Started
Installation
------------
-You can install Opfi with Pip:
+The recommended way to install Opfi is with `Bioconda `_, which requires the `conda `_ package manager. This will install Opfi and all of its dependencies (which you can read more about below, see :ref:`dependencies`).
+
+Currently, Bioconda supports only 64-bit Linux and Mac OS. Windows users can still install Opfi with pip (see below); however, the complete installation procedure has not been fully tested on a Windows system.
+
+.. _install-with-conda:
+
+Install with conda (Linux and Mac OS only)
+##########################################
+
+First, set up conda and Bioconda following the `quickstart `_ guide. Once this is done, run:
.. code-block:: bash
- pip3 install opfi
+ conda install -c bioconda opfi
-Alternatively, you can install the latest version on Github:
+And that's it! Note that this will install Opfi in the conda environment that is currently active. To create a fresh environment with Opfi installed, do:
.. code-block:: bash
- git clone https://github.com/alexismhill3/Opfi.git
+ conda create --name opfi-env -c bioconda opfi
+ conda activate opfi-env
+
+.. _install-with-pip:
+
+Install with pip
+################
+
+This method does not automatically install non-Python dependencies, so they will need to be installed separately, following their individual installation instructions. A complete list of required software is provided below, see :ref:`dependencies`. Once this step is complete, install Opfi with pip by running:
+
+.. code-block:: bash
+
+ pip install opfi
+
+Install from source
+###################
+
+Finally, the latest development build may be installed directly from Github. First, non-Python :ref:`dependencies` will need to be installed in the working environment. An easy way to do this is to first install Opfi with conda using the :ref:`install-with-conda` method (we'll re-install the development version of the Opfi package in the next step). Alternatively, dependencies can be installed individually.
+
+Once dependencies have been installed in the working environment, run the following code to download and install the development build:
+
+.. code-block:: bash
+
+ git clone https://github.com/wilkelab/Opfi.git
+ cd Opfi
+ pip install . # or pip install -e . for an editable version
+ pip install -r requirements # if conda was used, this can be skipped
+
+Testing the build
+#################
+
+Regardless of installation method, users can download and run Opfi's suite of unit tests to confirm that the build is working as expected. First download the tests from Github:
+
+.. code-block:: bash
+
+ git clone https://github.com/wilkelab/Opfi
cd Opfi
- pip3 install .
+
+And then run the test suite using pytest:
+
+.. code-block:: bash
+
+ pytest --runslow --runmmseqs --rundiamond
+
+This may take a minute or so to complete.
+
+.. _dependencies:
Dependencies
------------
-Opfi makes use of several third-party softwares for finding and annotating genomic features. Depending on your use case, you may not need to install all of these; however, at a minimum users should have the NCBI BLAST+ application installed in their environment. The following table provides more details about required/optional dependencies, including links to application homepages.
-
-.. csv-table::
- :header: "Application", "Required", "Description", "Anaconda distribution"
+Opfi uses the following bioinformatics software packages to find and annotate genomic features:
- "`NCBI BLAST+ `_", "Yes", "Protein and nucleic acid homology search tool", https://anaconda.org/bioconda/blast
- "`Diamond `_", "No", "Alternative to BLAST+ for fast protein homology searches", https://anaconda.org/bioconda/diamond
- "`MMseqs2 `_", "No", "Alternative to BLAST+ for fast protein homology searches", https://anaconda.org/bioconda/mmseqs2
- "`PILER-CR `_", "No", "CRISPR repeat detection", https://anaconda.org/bioconda/piler-cr
- "`GenericRepeatFinder `_", "No", "Transposon-associated repeat detection", "NA"
+.. csv-table:: Software dependencies
+ :header: "Application", "Description"
-Testing your build
-------------------
+ "`NCBI BLAST+ `_", "Protein and nucleic acid homology search tool"
+ "`Diamond `_", "Alternative to BLAST+ for fast protein homology searches"
+ "`MMseqs2 `_", "Alternative to BLAST+ for fast protein homology searches"
+ "`PILER-CR `_", "CRISPR repeat detection"
+ "`Generic Repeat Finder `_", "Transposon-associated repeat detection"
-Users who opt to build Opfi from source can test their build by running ``pytest`` from the project root directory. The following flags will direct pytest to run specific sets of tests (in addition to the core suite):
+The first three (BLAST+, Diamond, and MMseqs2) are popular homology search applications, that is, programs that look for local similarities between input sequences (either protein or nucleic acid) and a target. These are used by Opfi in :class:`gene_finder.pipeline.Pipeline` for annotation of genes or non-coding regions of interest in the input genome/contig. The user specifies which homology search tool to use during pipeline setup (see :class:`gene_finder.pipeline.Pipeline` for details). Note that the BLAST+ distribution contains multiple programs for homology searching, three of which (blastp, blastn, and PSI-BLAST) are currently supported by Opfi.
-* ``--runmmseqs``: Run tests that require MMseqs2.
-* ``--rundiamond``: Run tests that require Diamond.
-* ``--runslow``: Run integration/end-to-end tests.
-* ``--runprop``: Run very slow property tests.
+The following table summarizes the main difference between each homology search program. It may help users decide which application will best meet their needs. Note that performance tests are inherently hardware and context dependent, so this should be taken as a loose guide, rather than a definitive comparison.
-For most users, running ``pytest --runslow`` is recommended.
+.. csv-table:: Comparison of homology search programs supported by Opfi
+ :header: "Application", "Relative sensitivity", "Relative speed", "Requires a protein or nucleic acid sequence database?"
-.. note::
+ "Diamond", `+`, `++++`, "protein"
+ "MMseqs2", `++`, `+++`, "protein"
+ "blastp", `+++`, `++`, "protein"
+ "PSI-BLAST", `++++`, `+`, "protein"
+ "blastn", "NA", "NA", "nucleic acid"
- Several tests in the core suite require :program:`BLAST`, :program:`PILER-CR`, and/or :program:`GenericRepeatFinder`. Running ``pytest`` without first installing these dependencies will cause these tests to fail.
+The last two software dependencies, PILER-CR and Generic Repeat Finder (GRF), deal with annotation of repetive sequences in DNA. PILER-CR identifies CRISPR arrays, regions of alternatating ~30 bp direct repeat and variable sequences that play a role in prokaryotic immunity. GRF identifies repeats associated with transposable elements, such as terminal inverted repeats (TIRs) and long terminal repeats (LTRs).
diff --git a/docs/tips.rst b/docs/tips.rst
index 9c464ee..cbcff5e 100644
--- a/docs/tips.rst
+++ b/docs/tips.rst
@@ -6,7 +6,7 @@ Inputs and Outputs
Building sequence databases
---------------------------
-To search for gene cassettes with Opfi, users must compile representative protein (or nucleic acid) sequences for any genes expected in target casettes (or for any non-essential accessory genes of interest). These may be from a pre-existing, private collection of sequences (perhaps from a previous bioinformatics analysis). Alternatively, users may download sequences from a publically available database such as `Uniprot `_ (maintained by the `European Bioinformatics Institute `_ ) or one of the `databases `_ provided by the National Center for Biotechnology Information.
+To search for gene clusters with Opfi, users must compile representative protein (or nucleic acid) sequences for any genes expected in target clusters (or for any non-essential accessory genes of interest). These may be from a pre-existing, private collection of sequences (perhaps from a previous bioinformatics analysis). Alternatively, users may download sequences from a publically available database such as `Uniprot `_ (maintained by the `European Bioinformatics Institute `_ ) or one of the `databases `_ provided by the National Center for Biotechnology Information.
Once target sequences have been compiled, they must be converted to an application-specific database format. Opfi currently supports :program:`BLAST+`, :program:`mmseqs2`, and :program:`diamond` for homology searching:
@@ -47,7 +47,7 @@ The sequence definition (defline) comes directly after the ``>`` character, and
Annotating sequence databases
#############################
-To take full advantage of the rule-based filtering methods in :mod:`operon_analyzer.rules`, users are encouraged to annotate reference sequences with a name/label that is easily searched. Labels can be as broad or as specific as is necessary to provide meaningful annotation of target gene cassettes.
+To take full advantage of the rule-based filtering methods in :mod:`operon_analyzer.rules`, users are encouraged to annotate reference sequences with a name/label that is easily searched. Labels can be as broad or as specific as is necessary to provide meaningful annotation of target gene clusters.
Gene labels are parsed from sequence deflines; specifically, Opfi looks for the second word/token following the ``>`` character. For example, the following FASTA sequence has been annotated with the label "cas1":
@@ -147,7 +147,7 @@ Results from :class:`gene_finder.pipeline.Pipeline` searches are written to a si
:file: csv/example_output.csv
:header-rows: 0
-The first two columns contain the input genome/contig sequence ID (sometimes called an accession number) and the coordinates of the candidate gene cassette, respectively. Since an input file can have multiple genomic sequences, these two fields together uniquely specify a candidate gene cassette. Each row represents a single annotated feature in the candidate locus. Features from the same candidate are always grouped together in the CSV.
+The first two columns contain the input genome/contig sequence ID (sometimes called an accession number) and the coordinates of the candidate gene cluster, respectively. Since an input file can have multiple genomic sequences, these two fields together uniquely specify a candidate gene cluster. Each row represents a single annotated feature in the candidate locus. Features from the same candidate are always grouped together in the CSV.
Descriptions of each output field are provided below. Alignment statistic naming conventions are from the BLAST documentation, see `BLAST+ appendices `_ (specifically "outfmt" in table C1). This `glossary `_ of common BLAST terms may also be useful in interpreting alignment statistic meaning.
diff --git a/tests/integration/test_gene_finder.py b/tests/integration/test_gene_finder.py
index b8c3a42..c506505 100644
--- a/tests/integration/test_gene_finder.py
+++ b/tests/integration/test_gene_finder.py
@@ -63,13 +63,8 @@ def test_with_blast(temporary_directory):
genomic_data = "tests/integration/integration_data/contigs/v_crass_J520_whole.fasta"
conf_file = "tests/integration/configs/blast_integration_test.yaml"
p = create_pipeline(conf_file)
-
results = p.run(data=genomic_data, output_directory=temporary_directory.name)
-
assert len(results) == 1
- hits = results["NZ_CCKB01000071.1"]["Loc_78093-114093"]["Hits"]
- assert len(hits) == 11
- assert "Array_0" in hits
@pytest.mark.slow
@@ -85,9 +80,7 @@ def test_multi_seq_fasta(temporary_directory):
tmp, genomic_data = merge_data()
conf_file = "tests/integration/configs/blast_integration_test.yaml"
p = create_pipeline(conf_file)
-
results = p.run(data=genomic_data, output_directory=temporary_directory.name)
-
assert len(results) == 2
tmp.cleanup()
@@ -103,10 +96,8 @@ def test_gzip_fasta(temporary_directory):
conf_file = "tests/integration/configs/blast_integration_test.yaml"
p = create_pipeline(conf_file)
results = p.run(data=data, gzip=True, output_directory=temporary_directory.name)
-
hits = results["KB405063.1"]["Loc_0-23815"]["Hits"]
assert "Cas_all_hit-0" in hits
- assert "Array_0" in hits
@pytest.mark.slow
@@ -124,15 +115,11 @@ def test_record_all_hits_1(temporary_directory):
genomic_data = "tests/integration/integration_data/contigs/record_all_hits_test_1"
conf_file = "tests/integration/configs/blast_integration_test.yaml"
p = create_pipeline(conf_file)
-
p.run(data=genomic_data, record_all_hits=True, output_directory=temporary_directory.name)
-
with open(os.path.join(temporary_directory.name, "gene_finder_hits.json"), "r") as f:
hits = json.load(f)["KB405063.1"]["hits"]
- assert len(hits) == 3
assert "tnsAB" in hits
assert "cas_all" in hits
- assert "CRISPR" in hits
@pytest.mark.slow
diff --git a/tests/integration/test_operon_analyze.py b/tests/integration/test_operon_analyze.py
index 9991187..c6f3fd4 100644
--- a/tests/integration/test_operon_analyze.py
+++ b/tests/integration/test_operon_analyze.py
@@ -193,7 +193,7 @@ def visualize(condition: str):
if op is None:
continue
good_operons.append(op)
- plot_operons(good_operons, tempdir)
+ plot_operons(good_operons, tempdir, nucl_per_line=25000)
files = os.listdir(tempdir)
count = len([f for f in files if f.endswith(".png")])
except Exception as e:
diff --git a/tutorials/tutorial.ipynb b/tutorials/tutorial.ipynb
index c0aa9f6..62feb01 100644
--- a/tutorials/tutorial.ipynb
+++ b/tutorials/tutorial.ipynb
@@ -2,120 +2,105 @@
"cells": [
{
"cell_type": "markdown",
- "id": "fb1ee176",
- "metadata": {},
"source": [
"## Example 1: Finding CRISPR-Cas systems in a cyanobacteria genome"
- ]
+ ],
+ "metadata": {}
},
{
"cell_type": "markdown",
- "id": "e9e6310f",
- "metadata": {},
"source": [
"In this example, we will annotate and visualize CRISPR-Cas systems in the cyanobacteria species Rippkaea orientalis. CRISPR-Cas is a widespread bacterial defense system, found in at least 50% of all known prokaryotic species. This system is significant in that it can be leveraged as a precision gene editing tool, an advancement that was awarded the 2020 Nobel Prize in Chemistry. The genome of R. orientalis harbors two complete CRISPR-Cas loci (one chromosomal, and one extrachromosomal/plasmid).\n",
"\n",
- "You can download the complete assembled genome [here](https://www.ncbi.nlm.nih.gov/assembly/GCF_000024045.1/); it is also available at https://github.com/wilkelab/Opfi in the `tutorials` folder (`tutorials/GCF_000024045.1_ASM2404v1_genomic.fna.gz`). This tutorial assumes the user has already installed Opfi, NCBI BLAST+, and PILER-CR. Some familiarity with BLAST and the basic homology search algorithm may also be helpful, but is not required. \n",
+ "You can download the complete assembled genome [here](https://www.ncbi.nlm.nih.gov/assembly/GCF_000024045.1/); it is also available at https://github.com/wilkelab/Opfi in the `tutorials` folder (`tutorials/GCF_000024045.1_ASM2404v1_genomic.fna.gz`). This tutorial assumes the user has already installed Opfi and all dependencies (if installing with conda, this is done automatically). Some familiarity with BLAST and the basic homology search algorithm may also be helpful, but is not required. \n",
"\n",
"**Note:** \n",
"This tutorial uses several input data files, all of which are provided in the `tutorials` directory. If running this notebook in another working directory, be sure copy over all three data files as well."
- ]
+ ],
+ "metadata": {}
},
{
"cell_type": "markdown",
- "id": "649438f8",
- "metadata": {},
"source": [
"**1. Use the makeblastdb utility to convert a Cas protein database to BLAST format** \n",
"We start by converting a Cas sequence database to a format that BLAST can recognize, using the command line utility `makeblastdb`, which is part of the core NCBI BLAST+ distribution. A set of ~20,000 non-redundant Cas sequences, downloaded from [Uniprot](https://www.uniprot.org/uniref/) is available as a tar archive (`tutorials/cas_database.tar.gz`). We'll make a new directory, \"blastdb\", and extract sequences there:"
- ]
+ ],
+ "metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
- "id": "d5448cf5",
- "metadata": {},
- "outputs": [],
"source": [
"! mkdir blastdb\n",
"! cd blastdb && tar -xzf ../cas_database.tar.gz && cd .."
- ]
+ ],
+ "outputs": [],
+ "metadata": {}
},
{
"cell_type": "markdown",
- "id": "80ec3f7d",
- "metadata": {},
"source": [
"Next, create two BLAST databases for the sequence data: one containing Cas1 sequences only, and another that contains the remaining Cas sequences."
- ]
+ ],
+ "metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
- "id": "ae7e1f32",
- "metadata": {},
- "outputs": [],
"source": [
"! cd blastdb && cat cas1.fasta | makeblastdb -dbtype prot -title cas1 -hash_index -out cas1_db && cd ..\n",
"! cd blastdb && cat cas[2-9].fasta cas1[0-2].fasta casphi.fasta | makeblastdb -dbtype prot -title cas_all -hash_index -out cas_all_but_1_db"
- ]
+ ],
+ "outputs": [],
+ "metadata": {}
},
{
"cell_type": "markdown",
- "id": "91a4e8aa",
- "metadata": {},
"source": [
"`-dbtype prot` simply tells `makeblastdb` to expect amino acid sequences. We use `-title` and `-out` to name the database (required by BLAST) and to prefix the database files, respectively. `-hash_index` directs `makeblastdb` to generate a hash index of protein sequences, which can speed up computation time."
- ]
+ ],
+ "metadata": {}
},
{
"cell_type": "markdown",
- "id": "dba95a6d",
- "metadata": {},
"source": [
"**2. Use Opfi's Gene Finder module to search for CRISPR-Cas loci**"
- ]
+ ],
+ "metadata": {}
},
{
"cell_type": "markdown",
- "id": "fe803501",
- "metadata": {},
"source": [
"CRISPR-Cas systems are extremely diverse. The most recent [classification effort](https://www.nature.com/articles/s41579-019-0299-x) identifies 6 major types, and over 40 subtypes, of compositionally destinct systems. Although there is sufficent sequence similarity between subtypes to infer the existence of a common ancestor, the only protein family present in the majority of CRISPR-cas subtypes is the conserved endonuclease Cas1. For our search, we will define candidate CRISPR-cas loci as having, minimally, a cas1 gene."
- ]
+ ],
+ "metadata": {}
},
{
"cell_type": "markdown",
- "id": "ae536b0c",
- "metadata": {},
"source": [
"First, create another directory for output:"
- ]
+ ],
+ "metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
- "id": "27525059",
- "metadata": {},
- "outputs": [],
"source": [
"! mkdir example_1_output"
- ]
+ ],
+ "outputs": [],
+ "metadata": {}
},
{
"cell_type": "markdown",
- "id": "3096517c",
- "metadata": {},
"source": [
"The following bit of code uses Opfi's Gene Finder module to search for CRISPR-Cas systems:"
- ]
+ ],
+ "metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
- "id": "a2c52409",
- "metadata": {},
- "outputs": [],
"source": [
"from gene_finder.pipeline import Pipeline\n",
"import os\n",
@@ -132,12 +117,12 @@
"# results will be written to the file _results.csv\n",
"job_id = os.path.basename(genomic_data)\n",
"results = p.run(job_id=job_id, data=genomic_data, output_directory=output_directory, min_prot_len=90, span=10000, gzip=True)"
- ]
+ ],
+ "outputs": [],
+ "metadata": {}
},
{
"cell_type": "markdown",
- "id": "4b6a4d5e",
- "metadata": {},
"source": [
"First, we initialize a `Pipeline` object, which keeps track of all search parameters, as well as a running list of systems that meet search criteria. Next, we add three search steps to the pipeline:\n",
"\n",
@@ -146,30 +131,26 @@
"3. `add_crispr_step`: Remaining candidates are annotated for CRISPR repeat sequences using PILER-CR. \n",
"\n",
"Finally, we run the pipeline, executing steps in the order they we added. `min_prot_len` sets the minimum length (in amino acid residues) of hits to keep (really short hits are unlikely real protein encoding genes). `span` is the region directly up- and downstream of initial hits. So, each candidate system will be about 20 kbp in length. Results are written to a single CSV file. Final candidate loci contain at least one putative Cas1 gene and one additional Cas gene. As we will see, this relatively permissive criteria captures some non-CRISPR-Cas loci. Opfi has additional modules for reducing unlikely systems after the gene finding stage."
- ]
+ ],
+ "metadata": {}
},
{
"cell_type": "markdown",
- "id": "a096d534",
- "metadata": {},
"source": [
"**3. Visualize annotated CRISPR-Cas gene clusters using Opfi's Operon Analyzer module**"
- ]
+ ],
+ "metadata": {}
},
{
"cell_type": "markdown",
- "id": "2252e259",
- "metadata": {},
"source": [
"It is sometimes useful to visualize candidate systems, especially during the exploratory phase of a genomics survey. Opfi provides a few functions for visualizing candidate systems as gene diagrams. We'll use these to visualize the CRISPR-Cas gene clusters in R. orientalis:"
- ]
+ ],
+ "metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
- "id": "105e0379",
- "metadata": {},
- "outputs": [],
"source": [
"import csv\n",
"import sys\n",
@@ -196,98 +177,86 @@
"with open(\"example_1_output/GCF_000024045.1_ASM2404v1_genomic.fna.gz_results.csv\", \"r\") as operon_data:\n",
" operons = load.load_operons(operon_data)\n",
" visualize.plot_operons(operons=operons, output_directory=\"example_1_output\", feature_colors=feature_colors, nucl_per_line=25000)"
- ]
+ ],
+ "outputs": [],
+ "metadata": {}
},
{
"cell_type": "markdown",
- "id": "95126fd6",
- "metadata": {},
"source": [
"Looking at the gene diagrams, it is clear that we identified both CRISPR-Cas systems in this genome. We also see some systems that don't resemble functional CRISPR-Cas operons. Because we used a relatively permissive e-value threshhold of 0.001 when running BLAST, Opfi retained regions with very low sequence similarity to true CRISPR-Cas genes. In fact, these regions are likely not CRISPR-Cas loci at all. Using a lower e-value would likely eliminate these \"false positive\" systems, but Opfi also has additional functions for filtering out unlikely candidates _after_ the intial BLAST search. \n",
"\n",
"In general, we have found that using permissive BLAST parameters intially, and then filtering or eliminating candidates during the downstream analysis, is an effective way to search for gene clusters in large amounts of genomic/metagenomic data. In this toy example, we could re-run BLAST many times without significant cost. But on a more realistic dataset, needing to re-do the computationally expensive homology search could detrail a project. Since the optimal search parameters may not be known _a priori_, it can be better to do a permissive homology search initially, and then narrow down results later."
- ]
+ ],
+ "metadata": {}
},
{
"cell_type": "markdown",
- "id": "570ed630",
- "metadata": {},
"source": [
"Finally, clean up the temporary directories, if desired:"
- ]
+ ],
+ "metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
- "id": "e2b8702f",
- "metadata": {},
- "outputs": [],
"source": [
"! rm -r example_1_output blastdb"
- ]
+ ],
+ "outputs": [],
+ "metadata": {}
},
{
"cell_type": "markdown",
- "id": "2032ac20",
- "metadata": {},
"source": [
"## Example 2: Filter and classify CRISPR-Cas systems based on genomic composition"
- ]
+ ],
+ "metadata": {}
},
{
"cell_type": "markdown",
- "id": "7754185e",
- "metadata": {},
"source": [
"As mentioned in the previous example, known CRISPR-Cas systems fall into 6 broad categories, based on the presence of particular \"signature\" genes, as well as overall composition and genomic architecture. In this example, we will use Opfi to search for and classify CRISPR-Cas systems in ~300 strains of fusobacteria. \n",
"\n",
"This dataset was chosen because it is more representative (in magnitude) of what would be encountered in a real genomics study. Additionally, the fusobacteria phylum contains a variety of CRISPR-Cas subtypes. Given that the homology search portion of the analysis takes several hours (using a single core) to complete, we have pre-run Gene Finder using the same setup as the previous example. "
- ]
+ ],
+ "metadata": {}
},
{
"cell_type": "markdown",
- "id": "7c050221",
- "metadata": {},
"source": [
"**1. Make another temporary directory for output:**"
- ]
+ ],
+ "metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
- "id": "69560363",
- "metadata": {},
- "outputs": [],
"source": [
"! mkdir example_2_output"
- ]
+ ],
+ "outputs": [],
+ "metadata": {}
},
{
"cell_type": "markdown",
- "id": "0db9b606",
- "metadata": {},
"source": [
"**2. Filter Gene Finder output and extract high-confidence CRISPR-Cas systems**"
- ]
+ ],
+ "metadata": {}
},
{
"cell_type": "markdown",
- "id": "5c564558",
- "metadata": {},
"source": [
"The following code reads in unfiltered Gene Finder output and applies a set of conditions (\"rules\") to accomplish two things:\n",
"1. Select (and bin) systems according to type, and,\n",
"2. Eliminate candidates that likely do not represent true CRISPR-Cas systems"
- ]
+ ],
+ "metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
- "id": "d506cea8",
- "metadata": {
- "tags": []
- },
- "outputs": [],
"source": [
"from operon_analyzer import analyze, rules\n",
"\n",
@@ -312,38 +281,36 @@
" with open(\"refseq_fusobacteria.csv\", \"r\") as input_csv:\n",
" with open(f\"example_2_output/refseq_fuso_filtered_type{cas_type}.csv\", \"w\") as output_csv:\n",
" analyze.evaluate_rules_and_reserialize(input_csv, rs, fs, output_csv)"
- ]
+ ],
+ "outputs": [],
+ "metadata": {
+ "tags": []
+ }
},
{
"cell_type": "markdown",
- "id": "3447c2fa",
- "metadata": {},
"source": [
"The rule sets are informed by an established CRISPR-Cas classification system, which you can read more about [here](https://www.nature.com/articles/s41579-019-0299-x). The most recent system recognizes 6 major CRISPR-Cas types, but since fusobacteria doesn't contain type IV or VI systems that can be identified with our protein dataset, we didn't define the corresponding rule sets."
- ]
+ ],
+ "metadata": {}
},
{
"cell_type": "markdown",
- "id": "3874c11e",
- "metadata": {},
"source": [
"**3. Verify results with additional visualizations**"
- ]
+ ],
+ "metadata": {}
},
{
"cell_type": "markdown",
- "id": "ffa095f2",
- "metadata": {},
"source": [
"Altogther, this analysis will identify several hundred systems. We won't look at each system individually (but you are free to do so!). For the sake of confirming that the code ran as expected, we'll create gene diagrams for just the type V systems, since there are only two:"
- ]
+ ],
+ "metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
- "id": "27618cad",
- "metadata": {},
- "outputs": [],
"source": [
"import csv\n",
"import sys\n",
@@ -370,25 +337,25 @@
"with open(\"example_2_output/refseq_fuso_filtered_typeV.csv\", \"r\") as operon_data:\n",
" operons = load.load_operons(operon_data)\n",
" visualize.plot_operons(operons=operons, output_directory=\"example_2_output\", feature_colors=feature_colors, nucl_per_line=25000)"
- ]
+ ],
+ "outputs": [],
+ "metadata": {}
},
{
"cell_type": "markdown",
- "id": "9a4f175b",
- "metadata": {},
"source": [
"Finally, clean up the temporary output directory, if desired:"
- ]
+ ],
+ "metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
- "id": "63a204f8",
- "metadata": {},
- "outputs": [],
"source": [
"! rm -r example_2_output"
- ]
+ ],
+ "outputs": [],
+ "metadata": {}
}
],
"metadata": {
@@ -412,4 +379,4 @@
},
"nbformat": 4,
"nbformat_minor": 5
-}
+}
\ No newline at end of file