fixed small typos #26

Merged
merged 1 commit into from
Nov 17, 2020
4 changes: 2 additions & 2 deletions docs/source/about.rst
@@ -9,7 +9,7 @@ A variety of curated protein databases are available to use with `EUKulele`, whi
Functionality
====================================

``EUKulele`` :cite:`eukulele` is an open-source ``Python``-based package designed to simplify the process of taxonomic identification of marine eukaryotes in meta-omic samples. User-provided metatranscriptomic or metagenomic samples are aligned against a database of the user's choosing, with an aligner of the user's choice (``BLAST`` :cite:`kent2002blat` or ``DIAMOND`` :cite:`buchfink2015fast`). The "blastx" utility is used by default if metatranscriptomic samples are only provided in nucleotide format, while the "blastp" utility is used for metagenomic samples and metatranscriptomic samples available as translated protein sequences. Optionally, the user may indicate a preference to translate nucleotide input sequences using the ``TransDecoder`` software :cite:`haastransdecoder`, with the output provided to "blastp". Any consistently-formatted database may be used, but three published microbial eukaryotic database options are provided by default: MMETSP :cite:`keeling2014marine;@caron2017probing`, PhyloDB :cite:`phylodb`, and EukProt :cite:`richter2020eukprot`. The package returns comma-separated files containing all of the contig matches from the metatranscriptome or metagenome, as well as the total number of transcripts that matched, at each taxonomic level, from supergroup to species. If a quantification tool has been used to estimate the number of counts associated with each transcript ID, counts may also be returned. Additionally, the software returns barplots displaying the relative composition of each sample at each taxonomic level, according to the number of transcripts or number of estimated counts if provided from ``Salmon`` (an external transcript quantification tool :cite:`patro2017salmon`).
``EUKulele`` :cite:`eukulele` is an open-source ``Python``-based package designed to simplify the process of taxonomic identification of marine eukaryotes in meta-omic samples. User-provided metatranscriptomic or metagenomic samples are aligned against a database of the user's choosing, with an aligner of the user's choice (``BLAST`` :cite:`kent2002blat` or ``DIAMOND`` :cite:`buchfink2015fast`). The "blastx" utility is used by default if metatranscriptomic samples are only provided in nucleotide format, while the "blastp" utility is used for metagenomic samples and metatranscriptomic samples available as translated protein sequences. Optionally, the user may indicate a preference to translate nucleotide input sequences using the ``TransDecoder`` software :cite:`haastransdecoder`, with the output provided to "blastp". Any consistently-formatted database may be used, but three published microbial eukaryotic database options are provided by default: MMETSP :cite:`keeling2014marine,caron2017probing`, PhyloDB :cite:`phylodb`, and EukProt :cite:`richter2020eukprot`. The package returns comma-separated files containing all of the contig matches from the metatranscriptome or metagenome, as well as the total number of transcripts that matched, at each taxonomic level, from supergroup to species. If a quantification tool has been used to estimate the number of counts associated with each transcript ID, counts may also be returned. Additionally, the software returns barplots displaying the relative composition of each sample at each taxonomic level, according to the number of transcripts or number of estimated counts if provided from ``Salmon`` (an external transcript quantification tool :cite:`patro2017salmon`).

``EUKulele`` will assess the relative 'completeness' of a given taxonomic group by taking a user-inputted list of names at some taxonomic level to determine BUSCO completeness and redundancy :cite:`simao2015busco`. For example, if the user was interested whether there was a set of relatively complete contigs available for genus *Phaeocystis* within their metagenomic sample, they could pass *Phaeocystis*, along with its taxonomic level, "genus", to ``EUKulele``. By default, ``EUKulele`` will assess the BUSCO completeness of the most commonly encountered classifications at each taxonomic level.

@@ -25,4 +25,4 @@ The alignment output is compared to an accompanying phylogenetic reference speci
Subsequently, ``BUSCO`` :cite:`simao2015busco` is used to identify the core eukaryotic genes present in each sample. Using the list of genes identified as "core", a secondary taxonomic estimation step (and consensus assignment step, for MAGs) is performed to compare the taxonomic assignment predicted using all of the genes in comparison to the assignment made using only the genes that would be expected to be found in most reference transcriptomes. This approach is particularly useful for MAGs, and offers a method for avoiding conflicting or spurious matches made due to strain-level inconsistencies. For metatranscriptome samples, BUSCO completeness can be used to estimate the completeness of taxonomic groups to better inform their downstream interpretation.

.. bibliography:: refs.bib
:cited:
:cited:
4 changes: 2 additions & 2 deletions docs/source/databaseandconfig.rst
@@ -6,7 +6,7 @@ Installing Databases and Creating Configuration Files
Default Databases
-----------------

Three databases can be downloaded and formatted automatically when invoking ``EUKulele``. Currently the supported databases are:
Four databases can be downloaded and formatted automatically when invoking ``EUKulele``. Currently the supported databases are:

- `PhyloDB <https://drive.google.com/drive/u/0/folders/0B-BsLZUMHrDQfldGeDRIUHNZMEREY0g3ekpEZFhrTDlQSjQtbm5heC1QX2V6TUxBeFlOejQ>`_
- `EukProt <https://figshare.com/articles/EukProt_a_database_of_genome-scale_predicted_proteins_across_the_diversity_of_eukaryotic_life/12417881/2>`_
@@ -21,7 +21,7 @@ A database (for example ``phylodb``) can be setup prior to running by using::

EUKulele setup --database phylodb

If a database is not found automatically by ``EUKuele`` it will automatically download the database specified by the flag. If you downloaded a database previously you can specify the ``--reference_dir`` flag indicating the path to the previously downloaded database. If no reference database is specified with ```--reference_dir```, EUKulele will automatically download and use the MMETSP database. You can also (1) download the other databases and use the flag ```reference_dir``` to point EUKulele to the location of already downloaded databases or (2) use your own databases.
If a database is not found automatically by ``EUKuele`` it will automatically download the database specified by the flag. If you downloaded a database previously you can specify the ``--reference_dir`` flag indicating the path to the previously downloaded database. If no reference database is specified with ``--reference_dir``, EUKulele will automatically download and use the MMETSP database. You can also (1) download the other databases and use the flag ``reference_dir`` to point EUKulele to the location of already downloaded databases or (2) use your own databases.

Composition of Default Databases
--------------------------------
6 changes: 3 additions & 3 deletions docs/source/outputstructure.rst
@@ -30,7 +30,7 @@ Below is what you should expect to see when you run ``EUKulele``. ``output-folde
Taxonomy Estimation Folders
---------------------------

Inside each of the taxonomy estimation folders (``core_taxonomy_estimation``, for exclusively transcripts annotated as core genes, and ``taxonomy_estimation``), there are files labeled *sample_name* ``-estimated-taxonomy.out``. Each of these files has the following columns:
Inside each of the taxonomy estimation folders (``core_taxonomy_estimation``, for exclusively transcripts annotated as core genes, and ``taxonomy_estimation``), there are files labeled ``<sample_name>-estimated-taxonomy.out``. Each of these files has the following columns:

- transcript_name
- The name of the matched transcript/contig from this sample file
@@ -63,7 +63,7 @@ Inside each of the taxonomy counts folders (``core_taxonomy_counts`` and ``taxon
- Sample
- The original metagenomic/metatranscriptomic sample that this count is from (a separate row would be provided if the match is found in multiple samples)

The taxonomic count files are named according to the convention *output-folder-name* ``_all_`` *taxonomic-level* ``_counts.csv``.
The taxonomic count files are named according to the convention ``<output-folder-name>_all_<taxonomic-level>_counts.csv``.
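
The two naming conventions touched by this change can be sketched in ``Python`` as below. This is only an illustration of the documented patterns, not code from ``EUKulele``: the function names ``estimated_taxonomy_name`` and ``counts_name`` are hypothetical, and the exact spelling or capitalization of the taxonomic level in real output files is an assumption.

```python
def estimated_taxonomy_name(sample_name: str) -> str:
    """Build a file name following <sample_name>-estimated-taxonomy.out.

    Hypothetical helper; only the pattern comes from the documentation.
    """
    return f"{sample_name}-estimated-taxonomy.out"


def counts_name(output_folder_name: str, taxonomic_level: str) -> str:
    """Build a file name following <output-folder-name>_all_<taxonomic-level>_counts.csv.

    Hypothetical helper; the level string is used exactly as passed in.
    """
    return f"{output_folder_name}_all_{taxonomic_level}_counts.csv"


if __name__ == "__main__":
    # Illustrative values only; "sample_1" and "output" are made up.
    print(estimated_taxonomy_name("sample_1"))
    print(counts_name("output", "genus"))
```

Writing the placeholders in angle brackets inside one literal, as the updated lines do, makes it clearer that the whole string is a single file name with two substituted parts.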

Taxonomy Visualization Folders
------------------------------
@@ -76,4 +76,4 @@ Inside each of the taxonomy visualization folders (``core_taxonomy_visualization
- y-axis, right subplot (if using counts): relative number of counts
- bars, right subplot (if using counts): each of the top represented taxonomic groups (must represent >= 5% of total counts)

The right subplot is only generated if counts from a quantification tool (namely, ``Salmon``) are provided.
The right subplot is only generated if counts from a quantification tool (namely, ``Salmon``) are provided.
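
The ">= 5% of total counts" rule for the bars can be illustrated with a short ``Python`` sketch. It assumes the rule means each group's share of the summed counts within one sample; the function name ``top_groups`` and the example counts are hypothetical, and this is not taken from the ``EUKulele`` plotting code.

```python
def top_groups(counts: dict[str, float], threshold: float = 0.05) -> dict[str, float]:
    """Return the relative abundance of each group at or above the threshold.

    counts maps a taxonomic group name to its transcript or Salmon count.
    The 0.05 default mirrors the ">= 5% of total counts" wording in the docs.
    """
    total = sum(counts.values())
    if total == 0:
        # No counts at all: nothing can reach the threshold.
        return {}
    fractions = {group: value / total for group, value in counts.items()}
    return {group: frac for group, frac in fractions.items() if frac >= threshold}


if __name__ == "__main__":
    # Made-up counts for illustration only.
    example = {"Phaeocystis": 90, "GroupB": 6, "GroupC": 4}
    print(top_groups(example))
```

Groups that fall below the threshold are simply dropped here; how the real plots treat them (for example, whether they are pooled into an "other" bar) is not stated in the lines shown.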
2 changes: 1 addition & 1 deletion docs/source/running-eukulele.rst
@@ -6,7 +6,7 @@ Using EUKulele

Metatranscriptomes (METs)
=========================
In the first case, metatranscriptomes (shortened in ``EUKulele`` to ``mets``), are assumed to be contigs generated from shotgun-style sequencing and assembly of metatranscriptomic data (RNA) from a mixed community. These contigs can be provide to ``EUKulele`` as either nucleotide sequences (such as those output by `Trinity <https://github.com/trinityrnaseq/trinityrnaseq/wiki>`_) or predicted protein sequences from these contigs (such as those output by `Transdecoder <https://github.com/transdecoder>`_).
In the first case, metatranscriptomes (shortened in ``EUKulele`` to ``mets``), are assumed to be contigs generated from shotgun-style sequencing and assembly of metatranscriptomic data (RNA) from a mixed community. These contigs can be provided to ``EUKulele`` as either nucleotide sequences (such as those output by `Trinity <https://github.com/trinityrnaseq/trinityrnaseq/wiki>`_) or predicted protein sequences from these contigs (such as those output by `Transdecoder <https://github.com/transdecoder>`_).

The most basic running of ``EUKulele`` on metatranscriptome samples would be::
