feat(docs): transfer ambiguous symbols explanation #551
fengelniederhammer committed Jan 9, 2024
1 parent 03079d7 commit 502a722
Showing 8 changed files with 159 additions and 136 deletions.
4 changes: 4 additions & 0 deletions lapis2-docs/astro.config.mjs
@@ -63,6 +63,10 @@ export default defineConfig({
label: 'Mutation filters',
link: '/concepts/mutation-filters/',
},
{
label: 'Ambiguous symbols',
link: '/concepts/ambiguous-symbols/',
},
{
label: 'Pango lineage query',
link: '/concepts/pango-lineage-query/',
@@ -5,45 +5,10 @@ description: Explanation of terms used in the context of LAPIS.

| Term | Definition |
| -------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| AA | amino acid |
| AA | short for amino acid |
| aligned | a nucleotide sequence is aligned, if it is arranged such that it has many similarities to a given reference genome. The aligned sequence has the same length as the reference genome. Gaps are marked in the aligned sequence. Insertions are stored separately. |
| Mutation | a divergence from the reference genome (see below). |
| Mutation | a divergence from the reference genome (see [mutation-filters](../concepts/mutation-filters)). |
| Organism | The organism that the genomic data was extracted from. Each LAPIS instance serves data for a single organism. |
| QC | quality control; in our case, it usually refers to the quality checks and metrics of the sequences, targeting how well the nucleotide sequence was determined from the probe. |
| Segment | The genome of an organism may consist of multiple nucleotide sequence pieces. We call those pieces "segments". |
| Variant | We follow a very open definition of variants. Every subset of sequences is considered a variant. A variant is specified by lineage/clade names and mutations. A variant does not need to be [monophyletic](https://en.wikipedia.org/wiki/Monophyly). |

## Mutations

Mutations can occur either at the nucleotide level or at the amino acid level.
At the nucleotide level, every changed symbol is a mutation, whereas at the amino acid level,
some nucleotide mutations still produce the same amino acid
([see also](https://en.wikipedia.org/wiki/DNA_and_RNA_codon_tables)).

The following explains the notations for mutations.

### Amino Acid Mutations

The gene has to be provided for the AA mutation, since AAs only make sense within a gene.

**Example: ORF_1a\:G1234S**. This translates to

- in gene ORF_1a
- an AA mutation from "G" to "S" at position 1234

The origin AA symbol can be omitted, since it is clear from the reference genome.
**Example: ORF_1a:1234S**

### Nucleotide Mutations

**Example: C1234T**. This translates to

- a nucleotide mutation from nucleotide "C"
- at position 1234 in the genome
- to nucleotide "T"

The origin nucleotide symbol can be omitted, since it is clear from the reference genome.
**Example: 1234T**

If the organism has multiple nucleotide sequence segments, the segment has to be provided.
**Example: segment_name\:C1234T**
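
The notations above can be read mechanically. The following is a minimal sketch of such a reader, not code from LAPIS: the names `ParsedMutation` and `parseMutation` are hypothetical, and the accepted character sets (letters, digits and `_` for gene or segment names, letters or `-` for the target symbol) are assumptions made for illustration.

```typescript
// Hypothetical helper, not part of LAPIS: reads the mutation notations described above.
interface ParsedMutation {
    sequenceName?: string; // gene (amino acids) or segment (nucleotides), if given
    from?: string; // symbol in the reference genome; may be omitted
    position: number; // 1-based position
    to: string; // symbol in the queried sequence
}

function parseMutation(notation: string): ParsedMutation | null {
    // Optional "<name>:" prefix, optional origin symbol, position, target symbol.
    // Assumption: names consist of letters, digits and "_"; "-" as target marks a deletion.
    const match = /^(?:([A-Za-z0-9_]+):)?([A-Za-z])?(\d+)([A-Za-z-])$/.exec(notation);
    if (match === null) {
        return null;
    }
    const [, sequenceName, from, position, to] = match;
    return { sequenceName, from, position: Number(position), to };
}

console.log(parseMutation("ORF_1a:G1234S")); // gene, origin symbol, position, target symbol
console.log(parseMutation("1234T")); // origin symbol and name omitted
```

Without further context, the parser cannot tell whether the name before the colon is a gene or a segment, so it returns a single `sequenceName` field and leaves the interpretation to the caller.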
69 changes: 69 additions & 0 deletions lapis2-docs/src/content/docs/concepts/ambiguous-symbols.mdx
@@ -0,0 +1,69 @@
---
title: Ambiguous symbols
description: Explanation of how ambiguous reads are handled in the data
---

The underlying sequence files in `.FASTA` format can contain any of the following symbols:

| Symbol | Meaning |
| ------ | ----------------- |
| A | Adenine |
| C | Cytosine |
| G | Guanine |
| T | Thymine |
| - | Deletion |
| N | failed read / any |
| R | A or G |
| Y | C or T |
| S | C or G |
| W | A or T |
| K | G or T |
| M | A or C |
| B | not A |
| D | not C |
| H | not G |
| V | not T |

While one mostly queries for the symbols `A`, `C`, `G`, `T` and `-` to look for specific features and mutations of a sequence,
or `N` for quality control of the underlying data,
the ambiguous symbols `R` through `V` are often too cumbersome to consider in analyses.
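
The table above can also be written as a lookup from each symbol to the bases it can stand for. This is a minimal sketch for illustration only: `SYMBOL_TO_BASES` and `mayRepresent` are hypothetical names, and treating `-` as standing for no base is an assumption of this sketch.

```typescript
// Bases that each symbol from the table can stand for.
// "-" (deletion) is left out on purpose: in this sketch it represents no base.
const SYMBOL_TO_BASES: Record<string, string> = {
    A: "A",
    C: "C",
    G: "G",
    T: "T",
    N: "ACGT", // failed read / any
    R: "AG",
    Y: "CT",
    S: "CG",
    W: "AT",
    K: "GT",
    M: "AC",
    B: "CGT", // not A
    D: "AGT", // not C
    H: "ACT", // not G
    V: "ACG", // not T
};

// True if `symbol` is compatible with the single concrete base `base`.
function mayRepresent(symbol: string, base: string): boolean {
    if (base.length !== 1) {
        return false;
    }
    return (SYMBOL_TO_BASES[symbol] ?? "").indexOf(base) !== -1;
}

console.log(mayRepresent("R", "G")); // R stands for A or G
console.log(mayRepresent("B", "A")); // B stands for anything but A
```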

LAPIS supports the flexible consideration of these ambiguous symbols
through an extension of the boolean logic syntax in the variant queries.

Here we introduce two new expressions:

- Maybe (or UpperBound) also includes sequences that have an ambiguous code which **may** match the queried value.
- Exact (or LowerBound) is the complementary expression: it includes only sequences that are certain to match.

#### Example

Consider the following sequences:

```
12345
AAACG
AARCG
AANCG
AAGCG
AAACG
```

A filter for the mutation `3G` returns only the sequence `AAGCG`, as it is the only sequence with the symbol `G` at position 3.
The filter `Maybe(3G)` additionally returns the sequences `AARCG` and `AANCG`, which **may** have the symbol `G` at position 3, because the symbols `R` and `N` can represent guanine.

Conversely, the filter `Not(3A)` returns the sequences

```
AARCG
AANCG
AAGCG
```

If you want to restrict the result to sequences that also do not have an ambiguous code that can represent `A` at position 3, you can get the lower bound using the filter `Exact(Not(3A))` or, equivalently, `Not(Maybe(3A))`:

```
AAGCG
```
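
The behaviour in this example can be reproduced with a small evaluator. The following is a sketch under stated assumptions, not the LAPIS or SILO implementation: the `Expr` type, the three evaluation modes and the functions `matches` and `select` are hypothetical names chosen for illustration. A plain filter compares symbols literally, `Maybe` switches to an upper bound, `Exact` switches to a lower bound, and `Not` swaps the two bounds for its child.

```typescript
// Sketch only: one way to reproduce the example results, not the LAPIS or SILO implementation.
// Bases that each symbol can stand for ("-" is a deletion and stands for no base).
const BASES: Record<string, string> = {
    A: "A", C: "C", G: "G", T: "T", N: "ACGT",
    R: "AG", Y: "CT", S: "CG", W: "AT", K: "GT", M: "AC",
    B: "CGT", D: "AGT", H: "ACT", V: "ACG",
};

type Expr =
    | { kind: "has"; position: number; base: string } // e.g. 3G
    | { kind: "not"; child: Expr }
    | { kind: "maybe"; child: Expr }
    | { kind: "exact"; child: Expr };

// "plain" and "lower" compare symbols literally; "upper" also counts possible matches.
type Mode = "plain" | "upper" | "lower";

function matches(expr: Expr, sequence: string, mode: Mode = "plain"): boolean {
    switch (expr.kind) {
        case "has": {
            const symbol = sequence.charAt(expr.position - 1); // positions are 1-based
            if (mode === "upper") {
                return (BASES[symbol] ?? "").indexOf(expr.base) !== -1;
            }
            return symbol === expr.base;
        }
        case "not": {
            // The complement of an upper bound is a lower bound, and vice versa.
            const childMode: Mode = mode === "upper" ? "lower" : mode === "lower" ? "upper" : "plain";
            return !matches(expr.child, sequence, childMode);
        }
        case "maybe":
            return matches(expr.child, sequence, "upper");
        case "exact":
            return matches(expr.child, sequence, "lower");
    }
}

function select(expr: Expr, sequences: string[]): string[] {
    return sequences.filter((sequence) => matches(expr, sequence));
}

const sequences = ["AAACG", "AARCG", "AANCG", "AAGCG", "AAACG"];
const has3G: Expr = { kind: "has", position: 3, base: "G" };
const has3A: Expr = { kind: "has", position: 3, base: "A" };

console.log(select(has3G, sequences)); // 3G
console.log(select({ kind: "maybe", child: has3G }, sequences)); // Maybe(3G)
console.log(select({ kind: "not", child: has3A }, sequences)); // Not(3A)
console.log(select({ kind: "exact", child: { kind: "not", child: has3A } }, sequences)); // Exact(Not(3A))
```

Passing the bound down as a mode keeps `Exact(Not(x))` and `Not(Maybe(x))` equivalent by construction: both end up negating the upper bound of `x`.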
14 changes: 12 additions & 2 deletions lapis2-docs/src/content/docs/concepts/variant-query.mdx
@@ -16,8 +16,12 @@ query correctly (in JavaScript, this can be done with the
function)!

The formal specification of the query language is available
[here](https://github.com/cevo-public/LAPIS/blob/main/server/src/main/antlr/ch/ethz/lapis/api/parser/VariantQuery.g4) as
an ANTLR v4 grammar. In the following, we provide an informal description and examples.
[here](https://github.com/GenSpectrum/LAPIS/blob/main/lapis2/src/main/antlr/org/genspectrum/lapis/model/variantqueryparser/VariantQuery.g4)
as an ANTLR v4 grammar.
In the following, we provide an informal description and examples.
The corresponding
[unit test](https://github.com/GenSpectrum/LAPIS/blob/main/lapis2/src/test/kotlin/org/genspectrum/lapis/model/VariantQueryFacadeTest.kt)
provides a full list of possible atomic queries.

The query language understands Boolean logic. Expressions can be connected with `&` (and), `|` (or) and `!` (not).
Parentheses `(` and `)` can be used to define the order of the operations. Further, there is a special syntax to match
@@ -51,4 +55,10 @@ or by Nextclade) and filter by Nextstrain clades:
BA.5* | nextcladePangoLineage:BA.5* | nextstrainClade:22B
```

LAPIS supports ternary logic for querying [ambiguous nucleotide symbols](../ambiguous-symbols/), for example:

```
MAYBE(123W)
```

</OnlyIf>
1 change: 1 addition & 0 deletions lapis2-docs/tests/docs.spec.ts
@@ -10,6 +10,7 @@ const pages = [
'Open API / Swagger',
'Data versions',
'Mutation filters',
'Ambiguous symbols',
'Pango lineage query',
'Request methods: GET and POST',
'Response format',
@@ -192,9 +192,13 @@ data class AminoAcidInsertionContains(val position: Int, val value: String, val

data object True : SiloFilterExpression("True")

data class And(val children: List<SiloFilterExpression>) : SiloFilterExpression("And")
data class And(val children: List<SiloFilterExpression>) : SiloFilterExpression("And") {
constructor(vararg children: SiloFilterExpression) : this(children.toList())
}

data class Or(val children: List<SiloFilterExpression>) : SiloFilterExpression("Or")
data class Or(val children: List<SiloFilterExpression>) : SiloFilterExpression("Or") {
constructor(vararg children: SiloFilterExpression) : this(children.toList())
}

data class Not(val child: SiloFilterExpression) : SiloFilterExpression("Not")
