diff --git a/docs/source/about.rst b/docs/source/about.rst
index 85511f4..7d84261 100644
--- a/docs/source/about.rst
+++ b/docs/source/about.rst
@@ -9,7 +9,7 @@ A variety of curated protein databases are available to use with `EUKulele`, whi
 Functionality
 ====================================
-``EUKulele`` :cite:`eukulele` is an open-source ``Python``-based package designed to simplify the process of taxonomic identification of marine eukaryotes in meta-omic samples. User-provided metatranscriptomic or metagenomic samples are aligned against a database of the user's choosing, with an aligner of the user's choice (``BLAST`` :cite:`kent2002blat` or ``DIAMOND`` :cite:`buchfink2015fast`). The "blastx" utility is used by default if metatranscriptomic samples are only provided in nucleotide format, while the "blastp" utility is used for metagenomic samples and metatranscriptomic samples available as translated protein sequences. Optionally, the user may indicate a preference to translate nucleotide input sequences using the ``TransDecoder`` software :cite:`haastransdecoder`, with the output provided to "blastp". Any consistently-formatted database may be used, but three published microbial eukaryotic database options are provided by default: MMETSP :cite:`keeling2014marine;@caron2017probing`, PhyloDB :cite:`phylodb`, and EukProt :cite:`richter2020eukprot`. The package returns comma-separated files containing all of the contig matches from the metatranscriptome or metagenome, as well as the total number of transcripts that matched, at each taxonomic level, from supergroup to species. If a quantification tool has been used to estimate the number of counts associated with each transcript ID, counts may also be returned.
Additionally, the software returns barplots displaying the relative composition of each sample at each taxonomic level, according to the number of transcripts or number of estimated counts if provided from ``Salmon`` (an external transcript quantification tool :cite:`patro2017salmon`).
+``EUKulele`` :cite:`eukulele` is an open-source ``Python``-based package designed to simplify the process of taxonomic identification of marine eukaryotes in meta-omic samples. User-provided metatranscriptomic or metagenomic samples are aligned against a database of the user's choosing, with an aligner of the user's choice (``BLAST`` :cite:`kent2002blat` or ``DIAMOND`` :cite:`buchfink2015fast`). The "blastx" utility is used by default if metatranscriptomic samples are only provided in nucleotide format, while the "blastp" utility is used for metagenomic samples and metatranscriptomic samples available as translated protein sequences. Optionally, the user may indicate a preference to translate nucleotide input sequences using the ``TransDecoder`` software :cite:`haastransdecoder`, with the output provided to "blastp". Any consistently-formatted database may be used, but three published microbial eukaryotic database options are provided by default: MMETSP :cite:`keeling2014marine,caron2017probing`, PhyloDB :cite:`phylodb`, and EukProt :cite:`richter2020eukprot`. The package returns comma-separated files containing all of the contig matches from the metatranscriptome or metagenome, as well as the total number of transcripts that matched, at each taxonomic level, from supergroup to species. If a quantification tool has been used to estimate the number of counts associated with each transcript ID, counts may also be returned.
Additionally, the software returns barplots displaying the relative composition of each sample at each taxonomic level, according to the number of transcripts or number of estimated counts if provided from ``Salmon`` (an external transcript quantification tool :cite:`patro2017salmon`).
 ``EUKulele`` will assess the relative 'completeness' of a given taxonomic group by taking a user-inputted list of names at some taxonomic level to determine BUSCO completeness and redundancy :cite:`simao2015busco`. For example, if the user was interested whether there was a set of relatively complete contigs available for genus *Phaeocystis* within their metagenomic sample, they could pass *Phaeocystis*, along with its taxonomic level, "genus", to ``EUKulele``. By default, ``EUKulele`` will assess the BUSCO completeness of the most commonly encountered classifications at each taxonomic level.
@@ -25,4 +25,4 @@ The alignment output is compared to an accompanying phylogenetic reference speci
 Subsequently, ``BUSCO`` :cite:`simao2015busco` is used to identify the core eukaryotic genes present in each sample. Using the list of genes identified as "core", a secondary taxonomic estimation step (and consensus assignment step, for MAGs) is performed to compare the taxonomic assignment predicted using all of the genes in comparison to the assignment made using only the genes that would be expected to be found in most reference transcriptomes. This approach is particularly useful for MAGs, and offers a method for avoiding conflicting or spurious matches made due to strain-level inconsistencies. For metatranscriptome samples, BUSCO completeness can be used to estimate the completeness of taxonomic groups to better inform their downstream interpretation.
 ..
bibliography:: refs.bib
-   :cited:
\ No newline at end of file
+   :cited:
diff --git a/docs/source/databaseandconfig.rst b/docs/source/databaseandconfig.rst
index 82e1870..79ab42c 100644
--- a/docs/source/databaseandconfig.rst
+++ b/docs/source/databaseandconfig.rst
@@ -6,7 +6,7 @@ Installing Databases and Creating Configuration Files
 Default Databases
 -----------------
-Three databases can be downloaded and formatted automatically when invoking ``EUKulele``. Currently the supported databases are:
+Four databases can be downloaded and formatted automatically when invoking ``EUKulele``. Currently the supported databases are:
 - `PhyloDB `_
 - `EukProt `_
@@ -21,7 +21,7 @@ A database (for example ``phylodb``) can be setup prior to running by using::
     EUKulele setup --database phylodb
-If a database is not found automatically by ``EUKuele`` it will automatically download the database specified by the flag. If you downloaded a database previously you can specify the ``--reference_dir`` flag indicating the path to the previously downloaded database. If no reference database is specified with ```--reference_dir```, EUKulele will automatically download and use the MMETSP database. You can also (1) download the other databases and use the flag ```reference_dir``` to point EUKulele to the location of already downloaded databases or (2) use your own databases.
+If a database is not found automatically by ``EUKulele`` it will automatically download the database specified by the flag. If you downloaded a database previously you can specify the ``--reference_dir`` flag indicating the path to the previously downloaded database. If no reference database is specified with ``--reference_dir``, EUKulele will automatically download and use the MMETSP database. You can also (1) download the other databases and use the flag ``--reference_dir`` to point EUKulele to the location of already downloaded databases or (2) use your own databases.
 Composition of Default Databases
 --------------------------------
diff --git a/docs/source/outputstructure.rst b/docs/source/outputstructure.rst
index 67c12d9..369a87c 100644
--- a/docs/source/outputstructure.rst
+++ b/docs/source/outputstructure.rst
@@ -30,7 +30,7 @@ Below is what you should expect to see when you run ``EUKulele``. ``output-folde
 Taxonomy Estimation Folders
 ---------------------------
-Inside each of the taxonomy estimation folders (``core_taxonomy_estimation``, for exclusively transcripts annotated as core genes, and ``taxonomy_estimation``), there are files labeled *sample_name* ``-estimated-taxonomy.out``. Each of these files has the following columns:
+Inside each of the taxonomy estimation folders (``core_taxonomy_estimation``, for exclusively transcripts annotated as core genes, and ``taxonomy_estimation``), there are files labeled ``<sample_name>-estimated-taxonomy.out``. Each of these files has the following columns:
 - transcript_name
     - The name of the matched transcript/contig from this sample file
@@ -63,7 +63,7 @@ Inside each of the taxonomy counts folders (``core_taxonomy_counts`` and ``taxon
 - Sample
     - The original metagenomic/metatranscriptomic sample that this count is from (a separate row would be provided if the match is found in multiple samples)
-The taxonomic count files are named according to the convention *output-folder-name* ``_all_`` *taxonomic-level* ``_counts.csv``.
+The taxonomic count files are named according to the convention ``<output-folder-name>_all_<taxonomic-level>_counts.csv``.
 Taxonomy Visualization Folders
 ------------------------------
@@ -76,4 +76,4 @@ Inside each of the taxonomy visualization folders (``core_taxonomy_visualization
 - y-axis, right subplot (if using counts): relative number of counts
 - bars, right subplot (if using counts): each of the top represented taxonomic groups (must represent >= 5% of total counts)
-The right subplot is only generated if counts from a quantification tool (namely, ``Salmon``) are provided.
\ No newline at end of file
+The right subplot is only generated if counts from a quantification tool (namely, ``Salmon``) are provided.
diff --git a/docs/source/running-eukulele.rst b/docs/source/running-eukulele.rst
index 5365c9f..3595825 100644
--- a/docs/source/running-eukulele.rst
+++ b/docs/source/running-eukulele.rst
@@ -6,7 +6,7 @@ Using EUKulele
 Metatranscriptomes (METs)
 =========================
-In the first case, metatranscriptomes (shortened in ``EUKulele`` to ``mets``), are assumed to be contigs generated from shotgun-style sequencing and assembly of metatranscriptomic data (RNA) from a mixed community. These contigs can be provide to ``EUKulele`` as either nucleotide sequences (such as those output by `Trinity `_) or predicted protein sequences from these contigs (such as those output by `Transdecoder `_).
+In the first case, metatranscriptomes (shortened in ``EUKulele`` to ``mets``), are assumed to be contigs generated from shotgun-style sequencing and assembly of metatranscriptomic data (RNA) from a mixed community. These contigs can be provided to ``EUKulele`` as either nucleotide sequences (such as those output by `Trinity `_) or predicted protein sequences from these contigs (such as those output by `Transdecoder `_).
 The most basic running of ``EUKulele`` on metatranscriptome samples would be::