Running the figure generation script

Steps to refresh all publication-ready figures. All steps assume your current directory is the top of this repository.

1. Obtain the current dataset.

See these instructions to obtain the current data release. We recommend using the download script to obtain data because this will automatically create symlinks in data/ to the latest files.

2. Set up an up-to-date project Docker container.

See these instructions for setting up the project Docker container. Briefly, the latest version of the project Docker image, which is updated upon commit to master, can be obtained and run via:

docker pull ccdlopenpbta/open-pbta:latest
docker run \
  -e PASSWORD=<password> \
  -p 8787:8787 \
  -v $(pwd):/home/rstudio/kitematic \
  ccdlopenpbta/open-pbta:latest

You can use docker exec to interact with the container from there. If you prefer the RStudio interface, navigate to localhost:8787 and log in with the username rstudio and the password you set in the run command above.

3. Run the bash script that generates the figures (figures/generate-figures.sh).

This script runs all the intermediate steps needed to generate figures starting with the original data files.

bash figures/generate-figures.sh

Figures are saved to the figures/pngs folder and will be linked to the accompanying manuscript repository AlexsLemonade/OpenPBTA-manuscript.

Summary for each figure

Most figures have their own script stored in figures/scripts; the rest are produced directly by analysis module scripts, as noted in the table. All are called by the main bash script, figures/generate-figures.sh. For convenience, the table below lists the resources, intermediate steps, and PBTA data files required to generate each figure.

| Figure | Individual script | Notes on requirements | Linked analysis modules | PBTA data files consumed |
|---|---|---|---|---|
| Figure 1 | scripts/fig1-sample-distribution.R | No high RAM requirements | sample-distribution-analysis | pbta-histologies.tsv |
| Figure 2 | scripts/fig2-mutational-landscape.R | 256GB of RAM are needed due to the run_caller_consensus_analysis-pbta.sh handling of large MAF files | snv-callers<br>mutational-signatures | pbta-snv-lancet.vep.maf.gz<br>pbta-snv-mutect2.vep.maf.gz<br>pbta-snv-strelka2.vep.maf.gz<br>pbta-snv-vardict.vep.maf.gz<br>tcga-snv-lancet.vep.maf.gz<br>tcga-snv-mutect2.vep.maf.gz<br>tcga-snv-strelka2.vep.maf.gz |
| CN status heatmap | analyses/copy_number_consensus_call/run_consensus_call.sh and analyses/cnv-chrom-plot/cn_status_heatmap.Rmd | No high RAM requirements | cnv-chrom-plot | pbta-cnv-controlfreec.tsv.gz<br>pbta-sv-manta.tsv.gz<br>pbta-cnv-cnvkit.seg.gz |
| Figure 3 | No individual script<br>(analyses/focal-cn-file-preparation/run-prepare-cn.sh and analyses/oncoprint-landscape/run-oncoprint.sh scripts are used) | 24GB of RAM are needed due to the run-prepare-cn.sh handling of large copy number files | focal-cn-file-preparation<br>oncoprint-landscape | pbta-histologies.tsv<br>pbta-snv-consensus-mutation.maf.tsv.gz<br>pbta-fusion-putative-oncogenic.tsv<br>consensus_seg_annotated_cn_autosomes.tsv.gz<br>independent-specimens.wgs.primary-plus.tsv |
| Transcriptomic overview | scripts/transcriptomic-overview.R | Due to the GSVA steps, we recommend ~32 GB of RAM for generating this figure | transcriptomic-dimension-reduction<br>collapse-rnaseq<br>gene-set-enrichment-analysis<br>immune-deconv | pbta-histologies.tsv<br>pbta-gene-expression-rsem-fpkm.stranded.rds |
| Mutation co-occurrence | No individual script<br>(analyses/interaction-plots/01-create-interaction-plots.sh is used) | No high RAM requirements | interaction-plots | independent-specimens.wgs.primary-plus.tsv<br>pbta-snv-consensus-mutation.maf.tsv.gz |

Color Palette Usage

This project has a set of unified color palettes. There are 6 sets of hex color keys to be used for all final figures, stored as 6 TSV files in the figures/palettes folder. In each file, hex_codes contains the colors to be passed to your plotting code and color_names contains short descriptors of each color (e.g. gradient_1 or divergent_neutral). Each palette contains an na_color that is the same color in all palettes. This color should be used for all NA values. na_color is always the last value in the palette. If na_color is not needed or is supplied separately to a plotting function, you can use dplyr::filter(color_names != "na_color") to remove it. Biospecimens without a short_histology designation are coded as none and assigned the na_color in palettes/histology_color_palette.tsv.
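As a minimal sketch of the layout described above, the following base R snippet writes a toy palette to a temporary file, reads it back, and separates na_color from the remaining colors. The toy file stands in for a real palette in figures/palettes. The leading "#" on each hex code is an assumption about how the values are stored, and the variable names are illustrative only.

```r
# Toy palette written to a temporary file so the example is self-contained.
# Column names (color_names, hex_codes) follow the description above;
# the leading "#" on each hex code is an assumption about the file contents.
toy_palette <- data.frame(
  color_names = c("binary_1", "binary_2", "na_color"),
  hex_codes = c("#2166ac", "#b2182b", "#f1f1f1"),
  stringsAsFactors = FALSE
)
palette_file <- tempfile(fileext = ".tsv")
write.table(toy_palette, palette_file,
            sep = "\t", quote = FALSE, row.names = FALSE)

# Read the palette; comment.char = "" keeps "#" from being read as a comment
palette_df <- read.delim(palette_file, stringsAsFactors = FALSE,
                         comment.char = "")

# Pull out the NA color, then drop it from the palette
na_hex <- palette_df$hex_codes[palette_df$color_names == "na_color"]
palette_no_na <- palette_df[palette_df$color_names != "na_color", ]
```

With the project's tidyverse style, the last two lines correspond to dplyr::filter(color_names == "na_color") and dplyr::filter(color_names != "na_color").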

| Palette File Name | HEX color key | Color Notes | Variable application |
|---|---|---|---|
| histology_color_palette.tsv | Adenoma:f23d3d<br>ATRT:731d1d<br>Central neurocytoma:b38686<br>Chondrosarcoma:cc5c33<br>Chordoma:331c0d<br>Choroid plexus tumor:ffb380<br>CNS EFT-CIC:b25f00<br>CNS lymphoma:f2d6b6<br>CNS neuroblastoma:736556<br>CNS Rhabdomyosarcoma:ffaa00<br>CNS sarcoma:4c3d00<br>Craniopharyngioma:e2f200<br>DNET:919926<br>Dysplasia:d6f2b6<br>Embryonal Tumor:304d26<br>Ependymoma:00f241<br>ETMR:009929<br>Ganglioglioma:698c7c<br>Germinoma:39e6c3<br>Glial-neuronal tumor NOS:005359<br>Gliosis:263233<br>Hemangioblastoma:00c2f2<br>Hemangioma:40a6ff<br>HGAT:406280<br>Langerhans Cell histiocytosis:0044ff<br>LGAT:00144d<br>LGMT:acbbe6<br>Medulloblastoma:7373e6<br>Meningioma:3d0099<br>MPNST:c200f2<br>Neurofibroma:917399<br>na_color:f1f1f1<br>Oligodendroglioma:f279da<br>Other:cc0052<br>Pineoblastoma:994d6b<br>Schwannoma:4d2636<br>Teratoma:ffbfd9 | A named vector of the hex values that were assigned to each short_histology group | For color-coding by short_histology when it's more convenient to assign colors by short_histology category. |
| gradient_col_palette.tsv | gradient_0:f7f7f7<br>gradient_1:f7fcf5<br>gradient_2:e5f5e0<br>gradient_3:c7e9c0<br>gradient_4:a1d99b<br>gradient_5:74c476<br>gradient_6:41ab5d<br>gradient_7:238b45<br>gradient_8:006d2c<br>gradient_9:00441b<br>na_color:f1f1f1 | 10 hex_codes where gradient_0 is for an absolute 0 but may need to be removed from the palette depending on the application | For numeric data being plotted, e.g. tumor mutation burden |
| divergent_col_palette.tsv | divergent_low_5:053061<br>divergent_low_4:2166ac<br>divergent_low_3:4393c3<br>divergent_low_2:92c5de<br>divergent_low_1:d1e5f0<br>divergent_neutral:f7f7f7<br>divergent_high_1:fddbc7<br>divergent_high_2:f4a582<br>divergent_high_3:d6604d<br>divergent_high_4:b2182b<br>divergent_high_5:67001f<br>na_color:f1f1f1 | 12 hex codes where the numbers in the name indicate distance from divergent_neutral | For data that is bidirectional, e.g. Amplification/Deletion values like seg.mean |
| binary_col_palette.tsv | binary_1:2166ac<br>binary_2:b2182b<br>na_color:f1f1f1 | A vector of two hex codes | For binary variables, e.g. presence/absence or Amp/Del as statuses |
| oncoprint_color_palette.tsv | Missense_Mutation:35978f<br>Nonsense_Mutation:000000<br>Frame_Shift_Del:56B4E9<br>Frame_Shift_Ins:FFBBFF<br>Splice_Site:F0E442<br>Translation_Start_Site:191970<br>Nonstop_Mutation:545454<br>In_Frame_Del:CAE1FF<br>In_Frame_Ins:FFE4E1<br>Stop_Codon_Ins:CC79A7<br>Start_Codon_Del:56B4E9<br>Fusion:7B68EE<br>Multi_Hit:00F021<br>Hom_Deletion:313695<br>Hem_Deletion:abd9e9<br>amplification:c51b7d<br>loss:0072B2<br>gain:D55E00<br>High_Level_Gain:FF0000<br>Multi_Hit_Fusion:CD96CD | A named vector of hex codes assigned to each short_histology and to each CNV, SNV and Fusion category | For plotting an oncoprint figure, this vector provides hex codes for CNV, SNV, and Fusion categories |
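The divergent palette is intended for bidirectional values such as seg.mean. One way to keep divergent_neutral centered on zero is to bin the data symmetrically around zero, as in this base R sketch. The seg_mean values are hypothetical, the hex codes are copied from the table with an assumed leading "#", and binning with cut is just one option (a continuous color function such as circlize::colorRamp2 is another).

```r
# Divergent palette as a named vector; hex values copied from the table above,
# with a leading "#" added (an assumption about how they are used in plots).
divergent_palette <- c(
  divergent_low_5 = "#053061", divergent_low_4 = "#2166ac",
  divergent_low_3 = "#4393c3", divergent_low_2 = "#92c5de",
  divergent_low_1 = "#d1e5f0", divergent_neutral = "#f7f7f7",
  divergent_high_1 = "#fddbc7", divergent_high_2 = "#f4a582",
  divergent_high_3 = "#d6604d", divergent_high_4 = "#b2182b",
  divergent_high_5 = "#67001f"
)
na_color <- "#f1f1f1"

# Hypothetical bidirectional values, including a missing one
seg_mean <- c(-1.2, -0.3, 0, 0.4, 1.0, NA)

# Symmetric bin edges so that the middle bin straddles zero
max_abs <- max(abs(seg_mean), na.rm = TRUE)
bin_edges <- seq(-max_abs, max_abs,
                 length.out = length(divergent_palette) + 1)

# Assign each value to a bin, then look up that bin's color
bin_index <- cut(seg_mean, breaks = bin_edges,
                 include.lowest = TRUE, labels = FALSE)
seg_color <- unname(divergent_palette[bin_index])

# Missing values get the shared NA color
seg_color[is.na(seg_color)] <- na_color
```

Because the palette has an odd number of colors (11 without na_color), the symmetric edges place zero inside the bin that maps to divergent_neutral.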

Color coding examples in R

Example 1) Color coding by short_histology.

Step 1) Read in color palette and format as a named list

histology_col_palette <- readr::read_tsv(
  file.path("figures", "palettes", "histology_color_palette.tsv")
  ) %>%
  # We'll use deframe so we can use it as a recoding list
  tibble::deframe()

Step 2) For any data.frame with a short_histology column, recode NAs as "none".

metadata <- readr::read_tsv(file.path("data", "pbta-histologies.tsv")) %>%
  # Easier to deal with NA short histologies if they are labeled something different
  dplyr::mutate(short_histology = as.character(tidyr::replace_na(short_histology, "none")))

Step 3) Use dplyr::recode on short_histology column to make a new color column.

metadata <- metadata %>%
  # Tack on the sample color using the short_histology column and a recode
  dplyr::mutate(sample_color = dplyr::recode(short_histology,
                                             !!!histology_col_palette))

Step 4) Make your plot and use the sample_color column.

Using ggplot2::scale_fill_identity() or ggplot2::scale_color_identity() lets you supply a column of hex codes, here sample_color, to ggplot2 through the fill or color aesthetic, respectively. For base R plots, you should be able to supply the sample_color column as your col argument.

metadata %>%
  dplyr::group_by(short_histology, sample_color) %>%
  dplyr::summarize(count = dplyr::n()) %>%
  ggplot2::ggplot(ggplot2::aes(x = short_histology, y = count, fill = sample_color)) +
  ggplot2::geom_bar(stat = "identity") +
  ggplot2::scale_fill_identity()
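For the base R route mentioned in Step 4, a rough equivalent of the ggplot2 call might look like the sketch below. The metadata data frame here is a small hypothetical stand-in for the recoded table from Step 3 (hex values taken from the palette table, with an assumed leading "#"), and pdf(NULL) is used only so the example runs without opening a graphics window.

```r
# Hypothetical stand-in for the metadata produced in Steps 2 and 3
metadata <- data.frame(
  short_histology = c("Medulloblastoma", "Ependymoma", "Medulloblastoma", "none"),
  sample_color = c("#7373e6", "#00f241", "#7373e6", "#f1f1f1"),
  stringsAsFactors = FALSE
)

# Count samples per histology/color pair (base R analogue of group_by + summarize)
histology_counts <- aggregate(
  count ~ short_histology + sample_color,
  data = transform(metadata, count = 1),
  FUN = sum
)

# Draw to a null device so no file or window is created
pdf(NULL)
barplot(
  histology_counts$count,
  names.arg = histology_counts$short_histology,
  col = histology_counts$sample_color,  # hex codes passed directly as col
  ylab = "count"
)
invisible(dev.off())
```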

Example 2) Color coding by numeric data

Step 1) Import the palette.

You may want to remove the na_color at the end of the list depending on whether your data include NAs or if the plotting function you are using has the na_color supplied separately.

gradient_col_palette <- readr::read_tsv(
  file.path("figures", "palettes", "gradient_color_palette.tsv")
)

Some plotting functions, such as ComplexHeatmap::Heatmap, take the color for NA values as a separate argument. In that case, split na_color out of the palette:

na_color <- gradient_col_palette %>%
  dplyr::filter(color_names == "na_color")

gradient_col_palette <- gradient_col_palette %>%
  dplyr::filter(color_names != "na_color")

Step 2) Make a color function.

In this example, we are building a colorRamp2 function based on a regular interval between the minimum and maximum of our variable df$variable by using seq. However, depending on your data's distribution a regular interval based palette might not represent your data well on the plot. You can provide any numeric vector to color code a palette using circlize::colorRamp2 as long as that numeric vector is the same length as the palette itself.

gradient_col_val <- seq(from = min(df$variable), to = max(df$variable),
                        length.out = nrow(gradient_col_palette))

col_fun <- circlize::colorRamp2(gradient_col_val,
                                gradient_col_palette$hex_codes)
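If a regular interval does not suit the data's distribution, quantile-based breaks are one possible alternative. The sketch below is an illustration under stated assumptions, not the project's method: tmb is a hypothetical skewed variable, and the hex codes are gradient_1 through gradient_9 from the palette table with an assumed leading "#" (gradient_0 and na_color removed). The circlize call is guarded so the snippet still runs where that package is not installed.

```r
# gradient_1 .. gradient_9 from the palette table (gradient_0 and na_color removed)
gradient_hex <- c("#f7fcf5", "#e5f5e0", "#c7e9c0", "#a1d99b", "#74c476",
                  "#41ab5d", "#238b45", "#006d2c", "#00441b")

# Hypothetical right-skewed variable standing in for df$variable
set.seed(2020)
tmb <- rexp(200, rate = 1)

# One break per color, placed at evenly spaced quantiles instead of
# evenly spaced values, so each color covers a similar share of the data
quantile_breaks <- unname(quantile(
  tmb,
  probs = seq(0, 1, length.out = length(gradient_hex)),
  na.rm = TRUE
))

# Build the color function only if circlize is available
if (requireNamespace("circlize", quietly = TRUE)) {
  col_fun <- circlize::colorRamp2(quantile_breaks, gradient_hex)
}
```

The only requirement carried over from the text above is that the break vector and the palette have the same length.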

Step 3) Apply to numeric data, or supply to your plotting code.

This step depends on how your main plotting function expects the colors to be supplied. For example, ComplexHeatmap wants a function to be supplied to its col argument.

# Apply to variable directly and make a new column
df <- df %>%
  dplyr::mutate(color_key = col_fun(variable))

## OR ##

# Some plotting packages want a color function

ComplexHeatmap::Heatmap(
  df,
  col = col_fun, 
  na_col = na_color$hex_codes
)

Updating color palettes

The color palette TSV files are created by scripts/color_palettes.R, which can be run with Rscript scripts/color_palettes.R. Hex codes for the palettes are hard-coded in this script. The script can be called from anywhere in this repository (it looks for the .git file). After changing a palette, also update the hex codes table and its swatches in figures/README.md: run the swatches_table function at the end of the script, then copy and paste its output into the appropriate place in the table.