diff --git a/README.md b/README.md
index 4eab763f..a1d8c0dc 100755
--- a/README.md
+++ b/README.md
@@ -7,11 +7,11 @@
The Personal Cancer Genome Reporter (PCGR) is a stand-alone software package for translation of individual tumor genomes for precision cancer medicine.
-PCGR interprets primarily somatic SNVs/InDels and copy number aberrations, and also have support for interpretation of bulk RNA-seq expression data. The software produces interactive HTML reports intended for clinical interpretation. PCGR can perform multiple types of analyses, including:
+PCGR interprets primarily somatic SNVs/InDels and copy number aberrations, and has additional support for interpretation of bulk RNA-seq expression data. The software produces interactive HTML reports intended for clinical interpretation. PCGR can perform multiple types of analyses, including:
- Variant classification
- according to *oncogenicity*: evaluating the oncogenic potential of somatic DNA aberrations (VICC/CGC/ClinGen guidelines)
- - according to *actionability*: mapping the therapeutic and prognostic implications of somatic DNA aberrations (ACMG/AMP guidelines)
+ - according to *actionability*: mapping the therapeutic, diagnostic, and prognostic implications of somatic DNA aberrations (ACMG/AMP guidelines)
- Tumor mutational burden (TMB) estimation
- Tumor-only analysis (variant filtering)
- Mutational signature analysis
@@ -21,10 +21,16 @@ PCGR interprets primarily somatic SNVs/InDels and copy number aberrations, and a
If you want to interrogate germline variants and their relation to cancer predisposition, we recommend trying the accompanying tool [Cancer Predisposition Sequencing Reporter (CPSR)](https://github.com/sigven/cpsr).
-![PCGR overview](pcgrr/pkgdown/assets/img/pcgr_dashboard_views.png)
+![PCGR screenshot 1](pcgrr/pkgdown/assets/img/sc2.png)
+![PCGR screenshot 2](pcgrr/pkgdown/assets/img/sc1.png)
+![PCGR screenshot 3](pcgrr/pkgdown/assets/img/sc3.png)
### News
+- *May 2024*: **2.x.x release**
+ - Massive reference data bundle upgrade, new report layout, oncogenicity classification++
+ - Details at [CHANGELOG](http://sigven.github.io/pcgr/articles/CHANGELOG.html)
+
- *February 2023*: **1.3.0 release**
- Details at [CHANGELOG](http://sigven.github.io/pcgr/articles/CHANGELOG.html)
- proritize protein-coding BIOTYPE csq ([pr201](https://github.com/sigven/pcgr/pull/201))
@@ -82,7 +88,7 @@ If you want to interrogate germline variants and their relation to cancer predis
### Example reports
-[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.6275299.svg)](https://doi.org/10.5281/zenodo.6275299)
+[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.11401431.svg)](https://doi.org/10.5281/zenodo.11401431)
### Getting started
diff --git a/pcgr/pcgr_vars.py b/pcgr/pcgr_vars.py
index 1241cbc1..ff0911f8 100644
--- a/pcgr/pcgr_vars.py
+++ b/pcgr/pcgr_vars.py
@@ -3,7 +3,7 @@
from pcgr._version import __version__
PCGR_VERSION = __version__
-DB_VERSION = '20240527'
+DB_VERSION = '20240530'
## MISCELLANEOUS
NCBI_BUILD_MAF = 'GRCh38'
diff --git a/pcgrr/pkgdown/assets/img/sc1.png b/pcgrr/pkgdown/assets/img/sc1.png
new file mode 100644
index 00000000..7361be6a
Binary files /dev/null and b/pcgrr/pkgdown/assets/img/sc1.png differ
diff --git a/pcgrr/pkgdown/assets/img/sc2.png b/pcgrr/pkgdown/assets/img/sc2.png
new file mode 100644
index 00000000..bd077e3b
Binary files /dev/null and b/pcgrr/pkgdown/assets/img/sc2.png differ
diff --git a/pcgrr/pkgdown/assets/img/sc3.png b/pcgrr/pkgdown/assets/img/sc3.png
new file mode 100644
index 00000000..5e3542b7
Binary files /dev/null and b/pcgrr/pkgdown/assets/img/sc3.png differ
diff --git a/pcgrr/pkgdown/index.md b/pcgrr/pkgdown/index.md
index ac355b37..54d11dc4 100644
--- a/pcgrr/pkgdown/index.md
+++ b/pcgrr/pkgdown/index.md
@@ -9,27 +9,25 @@ editor_options:
-The Personal Cancer Genome Reporter (PCGR) is a stand-alone software package for functional annotation and translation of individual tumor genomes for precision cancer medicine. It interprets primarily somatic SNVs/InDels and copy number aberrations, and also have support for interpretation of bulk RNA-seq expression data. The software [classifies variants](articles/variant_classification.html) both with respect to _oncogenicity_, and _actionability_. Interactive HTML output reports allow the user to interrogate the clinical impact of the molecular findings in an individual tumor.
+The Personal Cancer Genome Reporter (PCGR) is a stand-alone software package for functional annotation and translation of individual tumor genomes for precision cancer medicine. It interprets primarily somatic SNVs/InDels and copy number aberrations, and has additional support for interpretation of bulk RNA-seq expression data. The software [classifies variants](articles/variant_classification.html) both with respect to _oncogenicity_, and _actionability_. Interactive HTML output reports allow the user to interrogate the clinical impact of the molecular findings in an individual tumor.
-Example views from the dashboard HTML output:
+Example screenshots from the quarto-generated cancer genome report by PCGR:
-![](img/pcgr_dashboard_views.png)
+![](img/sc2.png)
+![](img/sc1.png)
+![](img/sc3.png)
PCGR originates from the [Norwegian Cancer Genomics Consortium (NCGC)](http://cancergenomics.no), at the [Institute for Cancer Research, Oslo University Hospital, Norway](http://radium.no).
## Example reports
-- [Cervical cancer sample (tumor-control)](http://insilico.hpc.uio.no/pcgr/example_reports/latest/cervix_tumor_control.grch37.flexdb.html)
-- [Stomach cancer sample (tumor-control)](http://insilico.hpc.uio.no/pcgr/example_reports/latest/esophagus_stomach_tumor_control.grch37.flexdb.html)
-- [Breast cancer sample (tumor-only)](http://insilico.hpc.uio.no/pcgr/example_reports/latest/breast_tumor_only.grch37.flexdb.html)
-
-(to view the rmarkdown-based reports, simply remove *.flexdb.* in the file names for the flexdashboard reports)
+[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.11401431.svg)](https://doi.org/10.5281/zenodo.11401431)
## Why use PCGR?
The great complexity of acquired mutations in individual tumor genomes poses a severe challenge for clinical interpretation. PCGR aims to be a comprehensive reporting platform that can
-- systematically interrogate tumor-specific variants in the context of known therapeutic and prognostic biomarkers
+- systematically interrogate tumor-specific variants in the context of known therapeutic, diagnostic, and prognostic biomarkers
- highlight genomic aberrations with likely oncogenic potential
- provide a structured and concise summary of the most relevant findings
- present the results in a format accessible to clinical experts
diff --git a/pcgrr/vignettes/CHANGELOG.Rmd b/pcgrr/vignettes/CHANGELOG.Rmd
index 589d9d24..a42086ce 100644
--- a/pcgrr/vignettes/CHANGELOG.Rmd
+++ b/pcgrr/vignettes/CHANGELOG.Rmd
@@ -46,6 +46,61 @@ sigven <- user("sigven")
pdiakumis <- user("pdiakumis")
```
+## v2.0.0
+
+* Date: **2024-05-xx**
+* Major data updates
+ * ClinVar
+ * NCI Thesaurus
+ * Open Targets Platform
+ * CIViC
+ * GENCODE
+ * Cancer Gene Census
+ * CancerMine
+ * Pfam
+ * Disease Ontology/EFO
+ * UniProt KB
+* Major software updates
+ * VEP
+
+##### Added/changed
+
+- New report generation framework - [quarto](https://quarto.org)
+ - multiple options related to Rmarkdown output are now deprecated
+- Re-organized data bundle structure
+ - Users need to download an assembly-specific VEP cache separately from PCGR/CPSR, and provide its path to the new required argument `--vep_dir` in the `pcgr` command
+- Re-engineered data bundle generation pipeline
+- Improved data bundle documentation
+ - An HTML report with an overview of the contents of the data bundle is shipped with the reference data itself, also available [here (grch38)](https://rpubs.com/sigven/pcgr_refdata).
+- Moved more of the code base to initial Python workflow steps (biomarker matching, CNA segment annotation, RNA expression analysis, oncogenicity classification)
+- Variants are now classified with respect to both oncogenicity and actionability, and the previous global tier classification (tier 1-5) is thus deprecated
+- New copy number input format - allele-specific (chrom, start, end, n_major, n_minor)
+ - New argument `--n_copy_gain` - Minimum total copy number for segments considered as gains/amplifications (default: 6)
+- Bulk RNA expression input is now permitted in the `pcgr` command
+ - `--input_rna_expression` - accepts a TSV file with gene expression values
+ - `--expression_sim` - boolean flag to enable expression similarity analysis
+ - `--expression_sim_db` - Comma-separated string of databases used in RNA expression similarity analysis (default: tcga,depmap,treehouse)
+- TMB calculations can be adjusted using several parameters:
+ - `--tmb_display` - Type of TMB measure to show in report (coding_and_silent, coding_non_silent, missense_only)
+ - `--tmb_dp_min` - Minimum depth for a position to be considered for TMB calculation (default: 0) - requires allelic support information from VCF
+ - `--tmb_af_min` - Minimum allele frequency for a position to be considered for TMB calculation (default: 0) - requires allelic support information from VCF
+- A multi-sheet Excel workbook with analysis results is provided, suitable e.g. for aggregation of results across samples
+- Argument name changes to `pcgr`:
+ - `--pcgr_dir` is now named `--refdata_dir`
+ - `--clinvar_ignore_noncancer` is now named `--clinvar_report_noncancer` (meaning that variants found in ClinVar, yet attributed to _non-cancer related_ phenotypes, are now excluded from reporting by default)
+ - `--vep_gencode_all` is now named `--vep_gencode_basic` (meaning that gene variant annotation now uses _all_ GENCODE transcripts by default, not only the _basic_ set)
+ - `--preserved_info_tags` is now named `--retained_info_tags`
+ - `--basic` is now named `--no_reporting`
+ - `--target_size_mb` is now named `--effective_target_size_mb`
+
+##### Removed
+
+- Options for configuring Rmarkdown output, i.e. `--report_theme`, `--report_nonfloating_toc`
+- `--cpsr_report` and `--include_trials`, which can provide the report with associated pathogenic germline variants (from CPSR) and potential clinical trial opportunities, are currently on hold for a forthcoming release
+- `--no_vcf_validate` - VCF validation is simplified, not relying on _vcf-validator_ anymore
+- Options to filter tumor-only calls using 1000 Genomes Project database, i.e. `--maf_onekg_eur`, `--maf_onekg_amr`, `--maf_onekg_eas`, `--maf_onekg_afr`, `--maf_onekg_sas`, `--maf_onekg_global`
+- `--cell_line`
+- `--logr_gain` and `--logr_homdel`
## v1.5.0rc
diff --git a/pcgrr/vignettes/annotation_resources.Rmd b/pcgrr/vignettes/annotation_resources.Rmd
index 6bcd8a3b..a0e313e4 100644
--- a/pcgrr/vignettes/annotation_resources.Rmd
+++ b/pcgrr/vignettes/annotation_resources.Rmd
@@ -17,7 +17,7 @@ output: rmarkdown::html_document
### Variant databases of clinical utility
* [ClinVar](http://www.ncbi.nlm.nih.gov/clinvar/) - database of clinically related variants (May 2024)
- * [CIViC](http://civic.genome.wustl.edu) - clinical interpretations of variants in cancer (May 7th 2024)
+ * [CIViC](http://civic.genome.wustl.edu) - clinical interpretations of variants in cancer (May 23rd 2024)
* [CGI](http://www.cancergenomeinterpreter.org/biomarkers) - Cancer Genome Interpreter Cancer Biomarkers Database (CGI) (October 18th 2022)
### Protein domains/functional features
@@ -44,28 +44,9 @@ Genomic biomarkers included in PCGR are limited to the following:
* Markers reported at the gene level (e.g. __BRAF mutation__, __BRCA1 oncogenic mutation__)
* Markers reported at the variant level (e.g. __BRAF p.V600E__)
* Markers reported at the codon level (e.g. __KRAS p.G12__)
-* Markers reported at the exon level (e.g. __KIT exon 11 mutation__)
+* Markers reported at the exon/gene level (e.g. __KIT exon 11 mutation__, __BRCA1/2 oncogenic mutations__)
* Within the [Cancer bioMarkers database (CGI)](https://www.cancergenomeinterpreter.org/biomarkers), only markers collected from FDA/NCCN guidelines, scientific literature, and clinical trials are included (markers collected from conference abstracts etc. are not included)
* Copy number gains/losses
See also comment on a [closed GitHib issue](https://github.com/sigven/pcgr/issues/37#issuecomment-391966286)
-__Gene-disease associations__
-
-- Cancer phenotype associations retrieved from the [Open Targets Platform](https://www.targetvalidation.org/) are largely based on the [association score](https://docs.targetvalidation.org/getting-started/scoring) developed by the Open Targets Platform, with a couple of extra post-processing steps:
- - Phenotype associations in Open Targets Platform are assembled from [a variety of different data sources](https://docs.targetvalidation.org/data-sources/data-sources). Target-disease associations included in PCGR must be supported by **at least two distinct sources**
- - The weakest associations, here defined as those with an association score < 0.4 (scale from 0 to 1), are ommitted
- - As is done within the Open Targets Platform, association scores (for genes) are represented with varying shades of blue: the darker the blue, the stronger the association. Variant hits in tier 3/4 and the noncoding section are arranged according to this association score. If several disease subtypes are associated with a gene, the maximum association score is chosen.
-
-__Tumor suppressor genes/proto-oncogenes__
-
-- Status as oncogenes and/or tumor suppressors genes are done according to the following scheme in PCGR:
- - Five or more publications in the biomedical literature that suggests an oncogenic/tumor suppressor role for a given gene (as collected from the [CancerMine text-mining resource](http://bionlp.bcgsc.ca/cancermine/)), **OR**
- - At least two publications from CancerMine that suggests an oncogenic/tumor suppressor role for a given gene **AND** an existing record for the same gene as a tumor suppressor/oncogene in the [Network of Cancer Genes (NCG)](http://ncg.kcl.ac.uk/)
- - Status as oncogene is ignored if a given gene has three times as much (literature evidence) support for a role as a tumor suppressor gene (and vice versa)
- - Oncogenes/tumor suppressor candidates from CancerMine/NCG that are found in the [curated list of false positive cancer drivers compiled by Bailey et al. (Cell, 2018)](https://www.ncbi.nlm.nih.gov/pubmed/30096302) have been excluded
-
-
-__TCGA somatic calls__
-
-- TCGA employs four different variant callers for detection of somatic variants (SNVs/InDels): _mutect2, varscan2, somaticsniper and muse_. In the TCGA dataset bundled with PCGR, somatic SNVs are restricted to those that are detected by at least two independent callers (i.e. calls found by a single algorithm are considered low-confident and disregarded)
diff --git a/pcgrr/vignettes/faq.Rmd b/pcgrr/vignettes/faq.Rmd
index 9f60dbb5..7cb1f614 100644
--- a/pcgrr/vignettes/faq.Rmd
+++ b/pcgrr/vignettes/faq.Rmd
@@ -11,7 +11,7 @@ _Answer: VCF variant genotype data (i.e. AD/DP) is something that you as a user
__2. Is it possible to utilize PCGR for analysis of multiple samples?__
-_Answer: As the name of the tool implies, PCGR was developed for the detailed analysis of individual tumor samples. However, if you take advantage of the different outputs from PCGR, it can also be utilized for analysis of multiple samples. First, make sure your input files are organized per sample (i.e. one VCF file per sample, one CNA file per sample), so that they can be fed directly to PCGR. Now, once all samples have been processed with PCGR, note that all the tab-separated output files (i.e. annotated SNVs, gene copy numbers) contain the sample identifier, which enable them to be aggregated and suitable for a downstream multi-sample analysis. Also note the multi-sheet Excel workbook, which contains numerous outputs from PCGR._
+_Answer: As the name of the tool implies, PCGR was developed for the detailed analysis of individual tumor samples. However, if you take advantage of the different outputs from PCGR, it can also be utilized for analysis of multiple samples. First, make sure your input files are organized per sample (i.e. one VCF file per sample, one CNA file per sample), so that they can be fed directly to PCGR. Once all samples have been processed with PCGR, note that all the tab-separated output files (i.e. annotated SNVs, gene copy numbers) contain the sample identifier, which enables them to be aggregated for a downstream multi-sample analysis. Also note the multi-sheet Excel workbook, which contains numerous outputs from PCGR, and can be processed to aggregate findings across samples._
__3. I do not see the expected transcript-specific consequence for a particular variant. In what way is the primary variant consequence established?__
@@ -33,6 +33,10 @@ __7. Are there any plans to incorporate genomic biomarker evidence from__ [OncoK
_Answer: No. PCGR relies upon publicly available open-source resources, and further that the PCGR data bundle can be distributed freely to the user community. It is our understanding that_ [OncoKB's terms of use](https://www.oncokb.org/terms) _do not fit well with this strategy._
-__8. Is it possible for the users to update the data bundle to get the most recent versions of all underlying data sources?__
+__8. I have RNA fusion data that I want to include in the report. Is this possible?__
+
+_Answer: This is currently not supported as input for PCGR, but is something we are actively pursuing. The focus will be on whether detected RNA fusion events are previously known, and whether these are known as biomarkers for diagnosis or treatment._
+
+__9. Is it possible for the users to update the data bundle to get the most recent versions of all underlying data sources?__
_Answer: As of now, the data bundle is updated only with each release of PCGR. The data harmonization pipeline of knowledge databases in PCGR contain numerous and complex procedures, with several quality control and re-formatting steps, and and cannot be fully automated in its present form. The version of all databases and key software elements are outlined in each PCGR report._
diff --git a/pcgrr/vignettes/installation.Rmd b/pcgrr/vignettes/installation.Rmd
index 3b4bb473..4ca1e2af 100644
--- a/pcgrr/vignettes/installation.Rmd
+++ b/pcgrr/vignettes/installation.Rmd
@@ -51,6 +51,7 @@ __output directory__ to output the results to.
Here's an example scenario that will be used in the following sections:
+- VEP cache is downloaded in `/Users/you/dir0/vep`;
- data bundle downloaded in `/Users/you/dir1/data`;
- sample inputs at `/Users/you/dir2/pcgr_inputs`;
- output goes to `/Users/you/dir3/pcgr_outputs` (make sure this directory
@@ -63,17 +64,17 @@ Here's an example scenario that will be used in the following sections:
**A)** Download and unpack the assembly-specific reference data bundle needed for PCGR:
-- [grch37 data bundle - 20240527](http://insilico.hpc.uio.no/pcgr/pcgr_ref_data.20240527.grch37.tgz) (approx 5.6Gb)
-- [grch38 data bundle - 20240527](http://insilico.hpc.uio.no/pcgr/pcgr_ref_data.20240527.grch38.tgz) (approx 5.6Gb)
+- [grch37 data bundle - 20240530](https://insilico.hpc.uio.no/pcgr/pcgr_ref_data.20240530.grch37.tgz) (approx 5.6Gb)
+- [grch38 data bundle - 20240530](https://insilico.hpc.uio.no/pcgr/pcgr_ref_data.20240530.grch38.tgz) (approx 5.6Gb)
- Example:
```bash
GENOME="grch38" # or "grch37"
-BUNDLE_VERSION="20240527"
+BUNDLE_VERSION="20240530"
BUNDLE="pcgr_ref_data.${BUNDLE_VERSION}.${GENOME}.tgz"
-wget http://insilico.hpc.uio.no/pcgr/${BUNDLE}
+wget https://insilico.hpc.uio.no/pcgr/${BUNDLE}
gzip -dc ${BUNDLE} | tar xvf -
```
@@ -146,8 +147,8 @@ bash miniconda.sh
(base) $ conda install -c conda-forge mamba
(base) $ mamba --version
-mamba 0.19.1
-conda 4.11.0
+mamba 1.5.8
+conda 24.5.0
```
#### b) Create PCGR conda environments
@@ -250,13 +251,15 @@ Then your command would look something like this:
```bash
docker container run -it --rm \
- -v /Users/you/dir1/data:/root/pcgr_data \
+ -v /Users/you/dir0/vep:/root/vep \
+ -v /Users/you/dir1/data:/root/pcgr_refdata \
-v /Users/you/dir2/pcgr_inputs:/root/pcgr_inputs \
-v /Users/you/dir3/pcgr_outputs:/root/pcgr_outputs \
sigven/pcgr:1.4.1.9005 \
pcgr \
--input_vcf "/root/pcgr_inputs/tumor_sample.BRCA.vcf.gz" \
- --pcgr_dir "/root/pcgr_data" \
+ --vep_dir "/root/vep/.vep" \
+ --refdata_dir "/root/pcgr_refdata" \
--output_dir "/root/pcgr_outputs" \
--genome_assembly "grch38" \
--sample_id "SampleB" \
diff --git a/pcgrr/vignettes/running.Rmd b/pcgrr/vignettes/running.Rmd
index 3b6bc5ed..27f30c00 100644
--- a/pcgrr/vignettes/running.Rmd
+++ b/pcgrr/vignettes/running.Rmd
@@ -315,7 +315,7 @@ This command will run the Conda-based PCGR workflow and produce the following fi
| N | File | Description |
|---|------|-------------|
| 1 | __\.pcgr.grch37.html__ | An interactive HTML report for clinical interpretation (quarto-based) |
-| 2 | __\.pcgr.grch37.xlsx__ | An excel workbook with multiple sheets of annotations (Assay & sample info/SNVs & InDels/CNAs/Biomarkers/TMB/MSI), amenable for aggregation analysis across multiple samples |
+| 2 | __\.pcgr.grch37.xlsx__ | An Excel workbook with multiple sheets of annotations (Assay & sample info/SNVs & InDels/CNAs/Biomarkers/TMB/MSI), suitable for aggregation analysis across multiple samples |
| 3 | __\.pcgr.grch37.vcf.gz (.tbi)__ | Bgzipped VCF file with rich set of variant annotations to support interpretation |
| 4 | __\.pcgr.grch37.pass.vcf.gz (.tbi)__ | Bgzipped VCF file with rich set of variant annotations to support interpretation (PASS variants only) |
| 5 | __\.pcgr.grch37.pass.tsv.gz__ | Compressed vcf2tsv-converted file with rich set of variant annotations to support interpretation |