refactor: major edits to VRS mapping and supporting methods #7

Merged · 8 commits · Jan 21, 2024
20 changes: 16 additions & 4 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,10 @@
# dcd-mapping

## Usage
## Results

Mapped MaveDB Scoresets can be downloaded here: https://mavedb-mapping.s3.us-east-2.amazonaws.com/mappings.tar.gz

## Tool usage

Use the `dcd-map` command with a scoreset URN, e.g.

@@ -10,6 +14,16 @@
$ dcd-map urn:mavedb:00000083-c-1

Output is saved in the format `<URN>_mapping_results.json` in the directory specified by the environment variable `MAVEDB_STORAGE_DIR`, or `~/.local/share/dcd-mapping` by default.
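The output-path resolution described above can be sketched as follows (a hypothetical helper for illustration; the actual function and constant names in dcd-mapping may differ):

```python
import os
from pathlib import Path


def resolve_output_path(urn: str) -> Path:
    """Resolve where mapping results are written: MAVEDB_STORAGE_DIR if
    set, otherwise ~/.local/share/dcd-mapping (illustrative sketch)."""
    store = os.environ.get("MAVEDB_STORAGE_DIR")
    base = Path(store) if store else Path.home() / ".local" / "share" / "dcd-mapping"
    return base / f"{urn}_mapping_results.json"
```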

## Setup

Following installation instructions for [CoolSeqTool](https://coolseqtool.readthedocs.io/en/0.4.0-dev1/install.html) and [Gene Normalizer](https://gene-normalizer.readthedocs.io/en/latest/install.html) should take care of the external data dependencies.

Note that Gene Normalizer's `pg` dependency group must be installed to make use of the PostgreSQL-based backend:

```shell
python3 -m pip install 'gene-normalizer[pg]'
```

## Development

Clone the repo
@@ -38,6 +52,4 @@
Add pre-commit hooks
pre-commit install
```

## Data

Mapped MaveDB Scoresets can be downloaded here: https://mavedb-mapping.s3.us-east-2.amazonaws.com/mappings.tar.gz
When debugging, use flags `-c` and `-d` to enable caching of BLAT alignment and debug logging, respectively.
24 changes: 24 additions & 0 deletions TODO.md
@@ -0,0 +1,24 @@
# Remaining TODO

General:
* notes scattered in comments and docstrings (sorry!)
* Docker-ification. More or less needs to start from scratch (there should be existing images for SeqRepo and UTA, not sure how up-to-date they are).
* Add extra stuff that appears in mapping JSON objects.
* Currently using VRS 2.0a-based libraries. For lifting back to VRS 1.3, some basic post-processing should be fine (annoying, but shouldn't be too difficult)
* Without access to a production DynamoDB instance, Gene Normalizer will be quickest and easiest to set up via a PostgreSQL data backend. That, however, requires an extra dependency group (noted in README). We might want to make a `pg` dependency group here or just include it in core dependencies.
* On that note, I've only done minimal testing of how possible it would be to drop the gene normalizer dependency entirely, but it'd be nice to get there.
* Some of the singleton/factory stuff might be cleaner as `global`s
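As a sketch of the trade-off in the last bullet, the `__new__`-based singleton and a module-level global are interchangeable here (illustrative only; the real builders wrap CoolSeqTool and QueryHandler instances):

```python
class BuilderSketch:
    """__new__-based singleton, in the style used by lookup.py."""

    def __new__(cls) -> "BuilderSketch":
        if not hasattr(cls, "instance"):
            cls.instance = super().__new__(cls)
        return cls.instance


_INSTANCE = None  # module-level global alternative


def get_instance() -> object:
    """Lazily build and reuse one shared instance via a global."""
    global _INSTANCE
    if _INSTANCE is None:
        _INSTANCE = object()
    return _INSTANCE
```

Both give one shared instance; the global version is arguably easier to read and to reset in tests.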

Alignment:
* Pretty sure this is mostly done. Haven't tested exhaustively, though.
* Need to sufficiently mock/patch things in tests

Transcript selection:
* IndexError when calculating offset on many new (2023) scoresets.
* Tests will need some extensive mocking (or cassettes?) for reliance on UTA and other external dependencies

VRS mapping:
* In general, this stuff is still pretty rough. Tests aren't passing yet.
* Finish double-checking the SeqRepo storage workaround
* A fair amount of small questions about conditions written to handle specific scoresets/edge cases
* More testing. Can be ready for CI by manually patching the SequenceStore class (or using SeqRepoRESTDataProxy and cassettes).
5 changes: 3 additions & 2 deletions pyproject.toml
@@ -80,8 +80,9 @@
src = ["src"]
# pydocstyle (D)
# pep8-naming (N)
# isort (I)
select = ["E", "W", "F", "ANN", "D", "N", "I"]
fixable = ["I", "F401"]
select = ["E", "W", "F", "ANN", "D", "N", "I", "T201", "T100"]
fixable = ["I", "F401", "T201", "T100"]
include = ["pyproject.toml", "tests/**/*.py", "src/**/*.py"]

# ANN101 - missing-type-self
# ANN003 - missing-type-kwargs
2 changes: 1 addition & 1 deletion src/dcd_mapping/cli.py
@@ -41,7 +41,7 @@
def cli(urn: str, debug: bool, cache_align: bool) -> None:
""" # noqa: D301
logging.basicConfig(
filename="dcd-mapping.log",
format="%(asctime)s %(levelname)s:%(message)s",
format="%(asctime)s %(levelname)s:%(name)s:%(message)s",
level=logging.INFO,
force=True,
)
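The effect of adding `%(name)s` to the format string can be checked in isolation (a standalone sketch, not dcd-mapping code; the timestamp is omitted for brevity):

```python
import io
import logging

stream = io.StringIO()
handler = logging.StreamHandler(stream)
# Same shape as the new cli.py format, minus %(asctime)s
handler.setFormatter(logging.Formatter("%(levelname)s:%(name)s:%(message)s"))
logger = logging.getLogger("dcd_mapping.demo")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

logger.info("alignment complete")
# stream now holds "INFO:dcd_mapping.demo:alignment complete"
```

Including the logger name makes it clear which module emitted each log line when several libraries log to the same file.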
42 changes: 24 additions & 18 deletions src/dcd_mapping/lookup.py
@@ -12,6 +12,7 @@
from cool_seq_tool.schemas import TranscriptPriority
from ga4gh.core._internal.models import Extension, Gene
from ga4gh.vrs._internal.models import Allele, SequenceLocation
from ga4gh.vrs.dataproxy import SeqRepoDataProxy
from ga4gh.vrs.extras.translator import AlleleTranslator
from gene.database import create_db
from gene.query import QueryHandler
@@ -23,7 +24,6 @@
"CoolSeqToolBuilder",
"get_seqrepo",
"GeneNormalizerBuilder",
"VrsTranslatorBuilder",
"get_protein_accession",
"get_transcripts",
"get_gene_symbol",
@@ -33,7 +33,7 @@
"get_chromosome_identifier_from_vrs_id",
"get_sequence",
"store_sequence",
"hgvs_to_vrs",
"translate_hgvs_to_vrs",
"get_mane_transcripts",
"get_uniprot_sequence",
]
@@ -76,17 +76,20 @@
def __new__(cls) -> QueryHandler:
return cls.instance


class VrsTranslatorBuilder:
"""Singleton constructor for VRS-Python translator instance."""
class TranslatorBuilder:
"""Singleton constructor for VRS Translator instance."""

def __new__(cls) -> AlleleTranslator:
"""Provide VRS-Python translator. Construct if unavailable.
def __new__(cls, data_proxy: SeqRepoDataProxy) -> AlleleTranslator:
"""Provide translator instance. Constructs it if unavailable. Use a new
``data_proxy`` instance that contains a given score row's sequence/ID.

:return: singleton instances of Translator
:return: singleton instance of ``AlleleTranslator``
"""
if not hasattr(cls, "instance"):
cst = CoolSeqToolBuilder()
cls.instance = AlleleTranslator(cst.seqrepo_access, normalize=False)
tr = AlleleTranslator(data_proxy, normalize=False)
cls.instance = tr
else:
cls.instance.data_proxy = data_proxy
return cls.instance
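The construct-once, swap-the-proxy pattern above reduces to a minimal sketch (plain strings stand in for `SeqRepoDataProxy` instances):

```python
class TranslatorBuilderSketch:
    """Singleton that reuses one translator but swaps its data proxy."""

    def __new__(cls, data_proxy: object) -> "TranslatorBuilderSketch":
        if not hasattr(cls, "instance"):
            cls.instance = super().__new__(cls)
        # Always repoint the shared instance at the caller's proxy, so
        # each score row's custom sequences are visible to translation.
        cls.instance.data_proxy = data_proxy
        return cls.instance
```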


@@ -280,15 +283,17 @@
def get_gene_location(metadata: ScoresetMetadata) -> Optional[GeneLocation]:
def get_chromosome_identifier(chromosome: str) -> str:
"""Get latest NC_ accession identifier given a chromosome name.

:param chromosome: prefix-free chromosome name, e.g. ``"8"``, ``"X"``
:param chromosome: chromosome name, e.g. ``"8"``, ``"X"``
:return: latest ID if available
:raise KeyError: if unable to retrieve identifier
"""
if not chromosome.startswith("chr"):
chromosome = f"chr{chromosome}"
sr = CoolSeqToolBuilder().seqrepo_access
acs = []
for assembly in ["GRCh38", "GRCh37"]:
tmp_acs, _ = sr.translate_identifier(
f"{assembly}:chr{chromosome}", target_namespaces="refseq"
f"{assembly}:{chromosome}", target_namespaces="refseq"
)
for ac in tmp_acs:
acs.append(ac.split("refseq:")[-1])
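The prefix handling fixed in this hunk can be isolated as a small sketch (the real function then queries SeqRepo; here we only build the assembly-qualified names):

```python
def build_chromosome_queries(chromosome: str) -> list:
    """Add the 'chr' prefix only when missing, then qualify with each
    assembly. The previous code always prepended 'chr' inside the f-string,
    so prefixed input like 'chr8' became the invalid 'GRCh38:chrchr8'."""
    if not chromosome.startswith("chr"):
        chromosome = f"chr{chromosome}"
    return [f"{assembly}:{chromosome}" for assembly in ("GRCh38", "GRCh37")]
```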
@@ -378,18 +383,15 @@
def store_sequence(sequence: str, names: List[Dict]) -> None:
# -------------------------------- VRS-Python -------------------------------- #


def hgvs_to_vrs(hgvs: str, alias_map: Dict) -> Allele:
def translate_hgvs_to_vrs(hgvs: str, data_proxy: SeqRepoDataProxy) -> Allele:
"""Convert HGVS variation description to VRS object.

# TODO incorporate alias map

:param hgvs: MAVE-HGVS variation string
:param alias_map: lookup for custom sequence IDs
:param data_proxy:
:return: Corresponding VRS allele as a Pydantic class
"""
tr = VrsTranslatorBuilder()
vrs_allele = tr.translate_from(hgvs, "hgvs")
allele = Allele(**vrs_allele)
tr = TranslatorBuilder(data_proxy)
allele = tr.translate_from(hgvs, "hgvs")

if (
not isinstance(allele.location, SequenceLocation)
@@ -398,6 +400,10 @@
def hgvs_to_vrs(hgvs: str, alias_map: Dict) -> Allele:
):
raise ValueError

# TODO temporary, remove
if not isinstance(allele, Allele):
raise NotImplementedError

return allele


12 changes: 8 additions & 4 deletions src/dcd_mapping/main.py
@@ -1,6 +1,7 @@
"""Provide core MaveDB mapping methods."""
import json
import logging
from pathlib import Path
from typing import List

import click
@@ -21,20 +22,23 @@

def _save_results(
metadata: ScoresetMetadata, mapping_results: VrsMappingResult
) -> None:
) -> Path:
"""Save results to file.

Todo:
----
* Embed in original metadata JSON
* add ``mapped_reference_sequence`` and ``computed_reference_sequence`` properties
* Embed within original metadata JSON
* Option to save as VRS 1.x

:param metadata: scoreset metadata
:param mapping_results: mapped objects
:return: path to saved file
"""
outfile = LOCAL_STORE_PATH / f"{metadata.urn}_mapping_results.json"
with open(outfile, "w") as f:
json.dump(mapping_results.model_dump_json(indent=2), f)
json.dump(mapping_results.model_dump(exclude_none=True), f, indent=2)
return outfile
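The serialization change in `_save_results` avoids double-encoding: handing an already-serialized JSON string to `json.dump` writes a quoted string rather than an object. A stdlib-only illustration of the same bug and fix:

```python
import io
import json

record = {"urn": "urn:mavedb:00000083-c-1", "score": 1.5}

# Old pattern: model_dump_json() returns a str, which json.dump re-encodes
old_buf = io.StringIO()
json.dump(json.dumps(record), old_buf)

# New pattern: dump the dict itself, with indentation applied by json.dump
new_buf = io.StringIO()
json.dump(record, new_buf, indent=2)
```

Parsing `old_buf` yields a string containing JSON, while `new_buf` yields the original object.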


async def map_scoreset(
@@ -65,7 +69,7 @@
async def map_scoreset(
return None

try:
vrs_results = vrs_map(metadata, alignment_result, transcript, records)
vrs_results = vrs_map(metadata, alignment_result, transcript, records, silent)
except VrsMapError:
_logger.error(f"VRS mapping failed for scoreset {metadata.urn}")
return None
17 changes: 9 additions & 8 deletions src/dcd_mapping/schemas.py
@@ -4,7 +4,7 @@

from cool_seq_tool.schemas import Strand, TranscriptPriority
from ga4gh.vrs._internal.models import Allele, Haplotype
from pydantic import BaseModel, StrictBool, StrictInt
from pydantic import BaseModel, StrictBool, StrictFloat, StrictInt, StrictStr


class TargetSequenceType(StrEnum):
@@ -125,13 +125,14 @@
class TxSelectResult(BaseModel):


class VrsMapping(BaseModel):
"""Define pre-post mapping pair structure for VRS-structured variations.

Probably need to add score and accession to make json writing easier
"""

pre_mapping: Union[Allele, Haplotype]
mapped: Union[Allele, Haplotype]
"""Define pre-post mapping pair structure for VRS-structured variations."""

mavedb_id: StrictStr
pre_mapped: Union[Allele, Haplotype]
post_mapped: Union[Allele, Haplotype]
mapped_transcript: Optional[TranscriptDescription] = None
score: StrictFloat
# relation: Literal["SO:is_homologous_to"] = "SO:is_homologous_to"


class VrsMappingResult(BaseModel):