fixup! fixup! Add positional substitution matrices
padix-key committed Sep 9, 2024
1 parent 04ab023 commit 13a8204
Showing 1 changed file with 31 additions and 8 deletions.
39 changes: 31 additions & 8 deletions doc/tutorial/sequence/profiles.rst
@@ -1,8 +1,7 @@
.. include:: /tutorial/preamble.rst

Profiles and position-specific matrices
=======================================
Often sequences are not viewed in isolation:
For example, if you investigate a protein family, you do not handle a single sequence,
but an arbitrarily large collection of highly similar sequences.
@@ -65,23 +64,46 @@ occurrences for each symbol.
        gap_penalty=-5,
    )
    profile = seq.SequenceProfile.from_alignment(alignment)
    count_matrix = profile.symbols
    print(profile)

Each row in the displayed count matrix
(accessible via :attr:`SequenceProfile.symbols`) refers to a single position, i.e. a
column in the input MSA, and each column refers to a symbol in the underlying alphabet
(accessible via :attr:`SequenceProfile.alphabet`).
For completeness it should be noted that :attr:`SequenceProfile.gaps` also tracks the
gaps for each position in the alignment, but we will not further use them in this
tutorial.
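
To make the layout of the count matrix concrete, the following minimal sketch builds
such a matrix by hand in plain Python, without Biotite.
The toy alignment, the alphabet and the helper name ``count_symbols`` are assumptions
made for illustration only; they are not part of the Biotite API.

.. code-block:: python

    # Hypothetical toy MSA: three gapless sequences of equal length
    msa = ["ACGT", "ACGA", "TCGA"]
    alphabet = ["A", "C", "G", "T"]

    def count_symbols(msa, alphabet):
        """Return one row per MSA column and one column per alphabet symbol."""
        symbol_index = {symbol: i for i, symbol in enumerate(alphabet)}
        counts = [[0] * len(alphabet) for _ in range(len(msa[0]))]
        for sequence in msa:
            for position, symbol in enumerate(sequence):
                counts[position][symbol_index[symbol]] += 1
        return counts

    count_matrix = count_symbols(msa, alphabet)
    for row in count_matrix:
        print(row)

Each printed row sums to the number of sequences, because every sequence contributes
exactly one symbol per position in this gapless example.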

Note that the information about the individual sequences is lost in the condensation
process: There is no way to reconstruct the original sequences from the profile.
However, we can still extract a consensus sequence from the profile, which is a
sequence that represents the most frequent symbol at each position.

.. jupyter-execute::

    print(profile.to_consensus())
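
The idea behind a consensus sequence can be sketched in plain Python as picking the
most frequent symbol in each row of a count matrix.
The matrix, the alphabet and the function name ``consensus`` below are hypothetical,
and the tie-breaking rule (first symbol in the alphabet wins) is an assumption that may
differ from what Biotite does.

.. code-block:: python

    alphabet = ["A", "C", "G", "T"]
    # Hypothetical count matrix: rows are positions, columns follow the alphabet
    count_matrix = [
        [2, 0, 0, 1],
        [0, 3, 0, 0],
        [0, 0, 3, 0],
        [1, 0, 0, 2],
    ]

    def consensus(count_matrix, alphabet):
        """Join the most frequent symbol of each position into a string."""
        # list.index() returns the first maximum, so ties favor the earlier symbol
        return "".join(alphabet[row.index(max(row))] for row in count_matrix)

    print(consensus(count_matrix, alphabet))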

Profile visualization as sequence logo
--------------------------------------

.. currentmodule:: biotite.sequence.align

A common way to visualize a sequence profile is a sequence logo.
It depicts each profile position as a stack of letters:
The degree of conservation (more precisely, the information content derived from the
`Shannon entropy <https://en.wikipedia.org/wiki/Entropy_(information_theory)>`_)
is the height of a stack, and each letter's height in the stack is proportional to its
frequency at the respective position.

.. jupyter-execute::

    import matplotlib.pyplot as plt
    from biotite.sequence.graphics import plot_sequence_logo

    fig, ax = plt.subplots(figsize=(8.0, 2.0), constrained_layout=True)
    plot_sequence_logo(ax, profile)
    ax.set_xlabel("Residue position")
    ax.set_ylabel("Bits")
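
How the Shannon entropy relates to the stack height can be sketched in plain Python.
The sketch assumes the common sequence logo convention that the stack height equals the
maximum possible entropy minus the observed entropy of a position, measured in bits.
The function name ``information_content`` and the example counts are hypothetical, and
whether ``plot_sequence_logo`` applies further corrections is not covered here.

.. code-block:: python

    import math

    def information_content(counts, alphabet_size):
        """Return log2(alphabet_size) minus the Shannon entropy of one position."""
        total = sum(counts)
        entropy = 0.0
        for count in counts:
            if count > 0:  # by convention, 0 * log2(0) contributes nothing
                frequency = count / total
                entropy -= frequency * math.log2(frequency)
        return math.log2(alphabet_size) - entropy

    # A fully conserved position reaches the maximum height for four symbols
    print(information_content([3, 0, 0, 0], alphabet_size=4))
    # A position where all symbols are equally frequent has no height
    print(information_content([1, 1, 1, 1], alphabet_size=4))

Under this convention, a single letter's height would be its frequency multiplied by
the information content of its position.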

Position-specific scoring matrices
----------------------------------
@@ -131,4 +153,5 @@ sought length.
print(alignment)

More on positional sequences
----------------------------
Sequence profiles are just one application of position-specific substitution matrices.
