feat(docs): transfer ambiguous symbols explanation #551
fengelniederhammer committed Jan 15, 2024
1 parent 5cea36d commit 066f5b6
Showing 10 changed files with 161 additions and 142 deletions.
4 changes: 4 additions & 0 deletions lapis2-docs/astro.config.mjs
Original file line number Diff line number Diff line change
Expand Up @@ -75,6 +75,10 @@ export default defineConfig({
label: 'Mutation filters',
link: '/concepts/mutation-filters/',
},
{
label: 'Ambiguous symbols',
link: '/concepts/ambiguous-symbols/',
},
{
label: 'Pango lineage query',
link: '/concepts/pango-lineage-query/',
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
---
import AminoAcidMutationExample from './AminoAcidMutationExample.astro';
---

<code>MAYBE(<AminoAcidMutationExample />)</code>
Original file line number Diff line number Diff line change
Expand Up @@ -5,45 +5,10 @@ description: Explanation of terms used in the context of LAPIS.

| Term | Definition |
| -------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| AA | amino acid |
| AA | short for amino acid |
| aligned | a nucleotide sequence is aligned, if it is arranged such that it has many similarities to a given reference genome. The aligned sequence has the same length as the reference genome. Gaps are marked in the aligned sequence. Insertions are stored separately. |
| Mutation | a divergence from the reference genome (see below). |
| Mutation | a divergence from the reference genome (see [mutation-filters](../concepts/mutation-filters)). |
| Organism | The organism that the genomic data was extracted from. Each LAPIS instance serves data for a single organism. |
| QC | quality control; in our case, it usually refers to the quality checks and metrics of the sequences, targeting how well the nucleotide sequence was determined from the sample. |
| Segment | The genome of an organism may consist of multiple nucleotide sequence pieces. We call those pieces "segments". |
| Variant | We follow a very open definition of variants. Every subset of sequences is considered a variant. A variant is specified by lineage/clade names and mutations. A variant does not need to be [monophyletic](https://en.wikipedia.org/wiki/Monophyly). |

## Mutations

Mutations can occur at the nucleotide level or at the amino acid level.
Every changed nucleotide symbol is a nucleotide mutation, but it is not necessarily an amino acid mutation,
because several codons encode the same amino acid
([see also](https://en.wikipedia.org/wiki/DNA_and_RNA_codon_tables)).

The following explains the notations for mutations.

### Amino Acid Mutations

The gene has to be provided for the AA mutation, since AAs only make sense within a gene.

**Example: ORF_1a\:G1234S**. This translates to

- in Gene: ORF_1a
- AA mutation from "G" to "S" at position 1234

The origin AA symbol can be omitted, since it is clear from the reference genome.
**Example: ORF_1a:1234S**

### Nucleotide Mutations

**Example: C1234T**. This translates to

- a nucleotide mutation from nucleotide "C"
- at position 1234 in the genome
- to nucleotide "T"

The origin nucleotide symbol can be omitted, since it is clear from the reference genome.
**Example: 1234T**

If the organism has multiple nucleotide sequence segments, the segment has to be provided.
**Example: segment_name\:C1234T**
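
The nucleotide notation above can be sketched as a small parser. This is only an illustration of the notation, not code from LAPIS: the names `NucleotideMutation` and `parseNucleotideMutation` are hypothetical, and accepting `-` as a target symbol is an assumption.

```typescript
// Hypothetical result shape for illustration; not part of LAPIS.
interface NucleotideMutation {
    segment?: string; // only given for organisms with multiple segments
    from?: string; // origin symbol, may be omitted in the notation
    position: number; // 1-based position in the genome
    to: string; // target symbol
}

// Optional "segment:" prefix, optional origin symbol, position, target symbol.
// Assumption: the target is a single letter or "-" (deletion).
const NUCLEOTIDE_MUTATION = /^(?:([^:]+):)?([A-Za-z])?(\d+)([A-Za-z-])$/;

function parseNucleotideMutation(input: string): NucleotideMutation | null {
    const match = NUCLEOTIDE_MUTATION.exec(input.trim());
    if (match === null) {
        return null;
    }
    const [, segment, from, position, to] = match;
    if (position === undefined || to === undefined) {
        return null;
    }
    return {
        segment,
        from: from?.toUpperCase(),
        position: Number(position),
        to: to.toUpperCase(),
    };
}

// The three forms used in the examples above.
const fullForm = parseNucleotideMutation('C1234T');
const shortForm = parseNucleotideMutation('1234T');
const segmentForm = parseNucleotideMutation('segment_name:C1234T');
```

Strings that do not follow the notation yield `null`, so a caller can distinguish a malformed filter from a valid one before sending a request.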
53 changes: 53 additions & 0 deletions lapis2-docs/src/content/docs/concepts/ambiguous-symbols.mdx
Original file line number Diff line number Diff line change
@@ -0,0 +1,53 @@
---
title: Ambiguous symbols
description: Explanation of how ambiguous reads are handled in the data
---

The underlying sequence files in `.FASTA` format can contain any of the following symbols:

| Symbol | Meaning |
| ------ | ----------------- |
| A | Adenine |
| C | Cytosine |
| G | Guanine |
| T | Thymine |
| - | Deletion |
| N | failed read / any |
| R | A or G |
| Y | C or T |
| S | C or G |
| W | A or T |
| K | G or T |
| M | A or C |
| B | not A |
| D | not C |
| H | not G |
| V | not T |

The ambiguous symbols arise from imperfect reads in the sequencer.

While one mostly queries for the symbols `A`, `C`, `G`, `T` and `-` to look for specific features and mutations of a sequence,
or `N` for quality control of the underlying data,
the ambiguous symbols `R` through `V` are often too cumbersome to consider in analyses.

LAPIS supports the flexible consideration of these ambiguous symbols
through an extension of the boolean logic syntax in the variant queries.

For this purpose, the query language provides the expression `MAYBE`. It also matches sequences that have an ambiguous symbol which **may** stand for the queried value.

#### Example

Consider the following sequences:

```
12345
AAACG
AARCG
AANCG
AAGCG
AAACG
```

A filter for the mutation `3G` returns only the sequence `AAGCG`, as it is the only sequence with the symbol `G` at position 3.
The filter `MAYBE(3G)` however also considers that the sequences `AARCG` and `AANCG` **may** have the symbol `G` at position 3,
because the symbols `R` and `N` can represent Guanine.
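
The difference between the two filters can be reproduced with a minimal sketch. It is not how LAPIS is implemented; the names `POSSIBLE_BASES`, `matchesExactly` and `maybeMatches` are hypothetical, and treating `N` as "any of A, C, G, T" (but not a deletion) is an assumption.

```typescript
// The symbol table from above, expanded to the bases each symbol may stand for.
// Assumption: "N" may be any of the four bases, but not a deletion.
const POSSIBLE_BASES: Record<string, string> = {
    A: 'A', C: 'C', G: 'G', T: 'T',
    R: 'AG', Y: 'CT', S: 'CG', W: 'AT', K: 'GT', M: 'AC',
    B: 'CGT', D: 'AGT', H: 'ACT', V: 'ACG',
    N: 'ACGT',
};

// Plain filter: the symbol at the 1-based position must be exactly the queried one.
function matchesExactly(sequence: string, position: number, symbol: string): boolean {
    return sequence[position - 1] === symbol;
}

// MAYBE filter: additionally accept ambiguous symbols that can represent the queried base.
function maybeMatches(sequence: string, position: number, base: string): boolean {
    const observed = sequence[position - 1];
    if (observed === undefined) {
        return false; // position lies outside the sequence
    }
    if (observed === base) {
        return true;
    }
    return (POSSIBLE_BASES[observed] ?? '').includes(base);
}

const sequences = ['AAACG', 'AARCG', 'AANCG', 'AAGCG', 'AAACG'];

const exact = sequences.filter((s) => matchesExactly(s, 3, 'G'));
const maybe = sequences.filter((s) => maybeMatches(s, 3, 'G'));

console.log(exact); // [ 'AAGCG' ]
console.log(maybe); // [ 'AARCG', 'AANCG', 'AAGCG' ]
```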
7 changes: 7 additions & 0 deletions lapis2-docs/src/content/docs/concepts/mutation-filters.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,7 @@ description: Mutation filters
---

import AminoAcidMutationExample from '../../../components/MutationFilters/AminoAcidMutationExample.astro';
import MaybeAminoAcidMutationExample from '../../../components/MutationFilters/MaybeAminoAcidMutationExample.astro';
import GeneNames from '../../../components/MutationFilters/GeneNames.astro';
import NucleotideMutations from '../../../components/MutationFilters/NucleotideMutations.astro';

Expand All @@ -19,3 +20,9 @@ It can also be `-` for deletion and `X` for unknown. **Example:** <AminoAcidMuta
The `<base>` can be omitted to filter for any mutation.
You can write a `.` for the `<base>` to filter for sequences for which it is confirmed that no mutation occurred,
i.e. has the same base as the reference genome at the specified position.

:::note
Both nucleotide and amino acid mutation filters also support `MAYBE` queries.
Read more in [ambiguous symbols](/concepts/ambiguous-symbols).
**Example:** <MaybeAminoAcidMutationExample/>.
:::
14 changes: 12 additions & 2 deletions lapis2-docs/src/content/docs/concepts/variant-query.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -16,8 +16,12 @@ query correctly (in JavaScript, this can be done with the
function)!

The formal specification of the query language is available
[here](https://github.com/cevo-public/LAPIS/blob/main/server/src/main/antlr/ch/ethz/lapis/api/parser/VariantQuery.g4) as
an ANTLR v4 grammar. In following, we provide an informal description and examples.
[here](https://github.com/GenSpectrum/LAPIS/blob/main/lapis2/src/main/antlr/org/genspectrum/lapis/model/variantqueryparser/VariantQuery.g4)
as an ANTLR v4 grammar.
In the following, we provide an informal description and examples.
The respective
[unit test](https://github.com/GenSpectrum/LAPIS/blob/main/lapis2/src/test/kotlin/org/genspectrum/lapis/model/VariantQueryFacadeTest.kt)
provides a full list of possible atomic queries.

The query language understands Boolean logic. Expressions can be connected with `&` (and), `|` (or) and `!` (not).
Parentheses `(` and `)` can be used to define the order of the operations. Further, there is a special syntax to match
Expand Down Expand Up @@ -51,4 +55,10 @@ or by Nextclade) and filter by Nextstrain clades:
BA.5* | nextcladePangoLineage:BA.5* | nextstrainClade:22B
```

LAPIS supports a ternary logic to query [ambiguous nucleotide symbols](../ambiguous-symbols/).

```
MAYBE(123W)
```
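
One common way to formalize such a ternary logic is Kleene's three-valued logic, sketched below. Whether LAPIS evaluates queries exactly like this is an assumption; the type `Ternary` and all function names are hypothetical and only illustrate how `&`, `|`, `!` and `MAYBE` could interact.

```typescript
// A position either matches, may match (ambiguous symbol), or does not match.
type Ternary = 'true' | 'maybe' | 'false';

function and(a: Ternary, b: Ternary): Ternary {
    if (a === 'false' || b === 'false') return 'false';
    if (a === 'maybe' || b === 'maybe') return 'maybe';
    return 'true';
}

function or(a: Ternary, b: Ternary): Ternary {
    if (a === 'true' || b === 'true') return 'true';
    if (a === 'maybe' || b === 'maybe') return 'maybe';
    return 'false';
}

function not(a: Ternary): Ternary {
    if (a === 'maybe') return 'maybe';
    return a === 'true' ? 'false' : 'true';
}

// A plain expression keeps only certain matches.
function matchesPlain(value: Ternary): boolean {
    return value === 'true';
}

// MAYBE(...) also keeps everything that cannot be ruled out.
function matchesMaybe(value: Ternary): boolean {
    return value !== 'false';
}

// Example: a sequence with "R" (A or G) where the query asks for "G",
// combined with a second condition that certainly holds.
const ambiguousHit: Ternary = 'maybe';
const certainHit: Ternary = 'true';
const combined = and(ambiguousHit, certainHit);

console.log(matchesPlain(combined)); // false
console.log(matchesMaybe(combined)); // true
```

Under this reading, negation keeps an ambiguous position ambiguous, so a sequence that only may match a condition is neither a certain match for it nor for its negation.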

</OnlyIf>
1 change: 1 addition & 0 deletions lapis2-docs/tests/docs.spec.ts
Original file line number Diff line number Diff line change
Expand Up @@ -16,6 +16,7 @@ const referencesPages = [
const conceptsPages = [
'Data versions',
'Mutation filters',
'Ambiguous symbols',
'Pango lineage query',
'Request methods: GET and POST',
'Response format',
Original file line number Diff line number Diff line change
Expand Up @@ -192,9 +192,13 @@ data class AminoAcidInsertionContains(val position: Int, val value: String, val

data object True : SiloFilterExpression("True")

data class And(val children: List<SiloFilterExpression>) : SiloFilterExpression("And")
data class And(val children: List<SiloFilterExpression>) : SiloFilterExpression("And") {
constructor(vararg children: SiloFilterExpression) : this(children.toList())
}

data class Or(val children: List<SiloFilterExpression>) : SiloFilterExpression("Or")
data class Or(val children: List<SiloFilterExpression>) : SiloFilterExpression("Or") {
constructor(vararg children: SiloFilterExpression) : this(children.toList())
}

data class Not(val child: SiloFilterExpression) : SiloFilterExpression("Not")
