Running the figure generation script

Steps to refresh all publication-ready figures. All steps assume your current directory is the top of this repository.

1. Obtain the current dataset.

See these instructions to obtain the current data release. We recommend using the download script to obtain data because this will automatically create symlinks in data/ to the latest files.

2. Set up an up-to-date project Docker container.

See these instructions for setting up the project Docker container. Briefly, the latest version of the project Docker image, which is updated upon commit to master, can be obtained and run via:

docker pull ccdlopenpbta/open-pbta:latest
docker run \
  -e PASSWORD=<password> \
  -p 8787:8787 \
  -v $(pwd):/home/rstudio/kitematic \
  ccdlopenpbta/open-pbta:latest

From there, you can interact with the container using docker exec. If you'd prefer the RStudio interface, navigate to localhost:8787 and log in with the username rstudio and the password you set in the run command above.

3. Run the bash script that generates the figures (`figures/generate-figures.sh`).

This script runs all the intermediate steps needed to generate figures starting with the original data files.

bash figures/generate-figures.sh

Figures are saved to the figures/pngs folder and will be linked to the accompanying manuscript repository AlexsLemonade/OpenPBTA-manuscript.

Summary for each figure

Each figure has its own script stored in figures/scripts. All are called by the main bash script figures/generate-figures.sh. For convenience, the table below lists the resources, intermediate steps, and PBTA data files required to generate each figure.

Figure	Individual script	Notes on requirements	Linked analysis modules	PBTA data files consumed
Figure 1	`scripts/fig1-sample-distribution.R`	No high RAM requirements	`sample-distribution-analysis`	`pbta-histologies.tsv`
Figure 2	`scripts/fig2-mutational-landscape.R`	256 GB of RAM is needed due to the `run_caller_consensus_analysis-pbta.sh` handling of large MAF files	`snv-callers` `mutational-signatures`	`pbta-snv-lancet.vep.maf.gz` `pbta-snv-mutect2.vep.maf.gz` `pbta-snv-strelka2.vep.maf.gz` `pbta-snv-vardict.vep.maf.gz` `tcga-snv-lancet.vep.maf.gz` `tcga-snv-mutect2.vep.maf.gz` `tcga-snv-strelka2.vep.maf.gz`
CN status heatmap	`analyses/copy_number_consensus_call/run_consensus_call.sh` and `analyses/cnv-chrom-plot/cn_status_heatmap.Rmd`	No high RAM requirements	`cnv-chrom-plot`	`pbta-cnv-controlfreec.tsv.gz` `pbta-sv-manta.tsv.gz` `pbta-cnv-cnvkit.seg.gz`
Figure 3	No individual script (`analyses/focal-cn-file-preparation/run-prepare-cn.sh` and `analyses/oncoprint-landscape/run-oncoprint.sh` scripts are used)	24 GB of RAM is needed due to the `run-prepare-cn.sh` handling of large copy number files	`focal-cn-file-preparation` `oncoprint-landscape`	`pbta-histologies.tsv` `pbta-snv-consensus-mutation.maf.tsv.gz` `pbta-fusion-putative-oncogenic.tsv` `consensus_seg_annotated_cn_autosomes.tsv.gz` `independent-specimens.wgs.primary-plus.tsv`
Transcriptomic overview	`scripts/transcriptomic-overview.R`	Due to the GSVA steps, we recommend ~32 GB of RAM for generating this figure	`transcriptomic-dimension-reduction` `collapse-rnaseq` `gene-set-enrichment-analysis` `immune-deconv`	`pbta-histologies.tsv` `pbta-gene-expression-rsem-fpkm.stranded.rds`
Mutation co-occurrence	No individual script (`analyses/interaction-plots/01-create-interaction-plots.sh` is used)	No high RAM requirements	`interaction-plots`	`independent-specimens.wgs.primary-plus.tsv` `pbta-snv-consensus-mutation.maf.tsv.gz`

Color Palette Usage

This project has a set of unified color palettes. There are 6 sets of hex color keys to be used for all final figures, stored as 6 TSV files in the figures/palettes folder. In each file, the hex_codes column contains the colors to pass to your plotting code and the color_names column contains a short descriptor of each color (e.g. gradient_1 or divergent_neutral). Each palette contains an na_color that is the same in all palettes and should be used for all NA values. na_color is always the last value in the palette. If na_color is not needed, or is supplied separately to a plotting function, you can remove it with dplyr::filter(color_names != "na_color"). Biospecimens without a short_histology designation are coded as none and assigned the na_color in palettes/histology_color_palette.tsv.
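As a minimal sketch of separating na_color from the rest of a palette, the snippet below uses base R only. The inline TSV text and its hex codes are hypothetical stand-ins for a file in figures/palettes; only the color_names and hex_codes column names come from this README.

```r
# Hypothetical stand-in for a palette TSV; the real hex codes will differ.
palette_tsv <- paste(
  "color_names\thex_codes",
  "gradient_0\t#f7f7f7",
  "gradient_1\t#fa9fb5",
  "na_color\t#f1f1f1",
  sep = "\n"
)

# read.delim() reads tab-separated text and does not treat "#" as a comment
palette_df <- read.delim(text = palette_tsv, stringsAsFactors = FALSE)

# Keep the NA color on its own for functions that take it separately
na_color <- palette_df$hex_codes[palette_df$color_names == "na_color"]

# Base R equivalent of dplyr::filter(color_names != "na_color")
palette_no_na <- palette_df[palette_df$color_names != "na_color", ]

print(na_color)
print(palette_no_na$hex_codes)
```

Filtering on color_names rather than on position avoids relying on na_color being the last row.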

Palette File Name	HEX color key	Color Notes	Variable application
`histology_color_palette.tsv`	Adenoma: ATRT: Central neurocytoma: Chondrosarcoma: Chordoma: Choroid plexus tumor: CNS EFT-CIC: CNS lymphoma: CNS neuroblastoma: CNS Rhabdomyosarcoma: CNS sarcoma: Craniopharyngioma: DNET: Dysplasia: Embryonal Tumor: Ependymoma: ETMR: Ganglioglioma: Germinoma: Glial-neuronal tumor NOS: Gliosis: Hemangioblastoma: Hemangioma: HGAT: Langerhans Cell histiocytosis: LGAT: LGMT: Medulloblastoma: Meningioma: MPNST: Neurofibroma: na_color: Oligodendroglioma: Other: Pineoblastoma: Schwannoma: Teratoma:	A named vector of the hex values assigned to each `short_histology` group	For color-coding by `short_histology` when it's more convenient to assign colors by `short_histology` category.
`gradient_color_palette.tsv`	gradient_0: gradient_1: gradient_2: gradient_3: gradient_4: gradient_5: gradient_6: gradient_7: gradient_8: gradient_9: na_color:	10 hex codes where gradient_0 is for an absolute `0` but may need to be removed from the palette depending on the application	For numeric data being plotted, e.g. tumor mutation burden
`divergent_color_palette.tsv`	divergent_low_5: divergent_low_4: divergent_low_3: divergent_low_2: divergent_low_1: divergent_neutral: divergent_high_1: divergent_high_2: divergent_high_3: divergent_high_4: divergent_high_5: na_color:	12 hex codes where the numbers in the name indicate distance from `divergent_neutral`.	For data that is bidirectional, e.g. Amplification/Deletion values like `seg.mean`
`binary_color_palette.tsv`	binary_1: binary_2: na_color:	A vector of two hex codes	For binary variables, e.g. presence/absence or Amp/Del as statuses
`oncoprint_color_palette.tsv`	Missense_Mutation: Nonsense_Mutation: Frame_Shift_Del: Frame_Shift_Ins: Splice_Site: Translation_Start_Site: Nonstop_Mutation: In_Frame_Del: In_Frame_Ins: Stop_Codon_Ins: Start_Codon_Del: Fusion: Multi_Hit: Hom_Deletion: Hem_Deletion: amplification: loss: gain: High_Level_Gain: Multi_Hit_Fusion:	A named vector of hex codes assigned to each `short_histology` and to each `CNV`, `SNV` and `Fusion` category	For plotting an oncoprint figure, this vector provides hex codes for `CNV`, `SNV`, and `Fusion` categories

Color coding examples in R

Example 1) Color coding by `short_histology`.

Step 1) Read in the color palette and format it as a named vector.

# Provides the %>% pipe used throughout these examples
library(magrittr)

histology_col_palette <- readr::read_tsv(
  file.path("figures", "palettes", "histology_color_palette.tsv")
  ) %>%
  # We'll use deframe so we can use it as a recoding list
  tibble::deframe()

Step 2) For any data.frame with a short_histology column, recode NAs as "none".

metadata <- readr::read_tsv(file.path("data", "pbta-histologies.tsv")) %>%
  # Easier to deal with NA short histologies if they are labeled something different
  dplyr::mutate(short_histology = as.character(tidyr::replace_na(short_histology, "none")))

Step 3) Use dplyr::recode on short_histology column to make a new color column.

metadata <- metadata %>%
  # Tack on the sample color using the short_histology column and a recode
  dplyr::mutate(sample_color = dplyr::recode(short_histology,
                                             !!!histology_col_palette))

Step 4) Make your plot and use the sample_color column.

ggplot2::scale_fill_identity() and ggplot2::scale_color_identity() tell ggplot2 to use the hex codes in the sample_color column as-is when that column is mapped to the fill or color aesthetic, respectively. For base R plots, you should be able to supply the sample_color column as your col argument.

metadata %>%
  dplyr::group_by(short_histology, sample_color) %>%
  dplyr::summarize(count = dplyr::n()) %>%
  ggplot2::ggplot(ggplot2::aes(x = short_histology, y = count, fill = sample_color)) +
  ggplot2::geom_bar(stat = "identity") +
  ggplot2::scale_fill_identity()
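For base R plotting, the sample_color column can be passed directly as the col argument. The sketch below uses a small made-up data frame in place of the recoded metadata from Step 3; the histology labels, counts, and hex codes are illustrative only and are not taken from the project palettes.

```r
# Toy stand-in for the recoded metadata from Step 3 (hypothetical values)
metadata <- data.frame(
  short_histology = c("LGAT", "LGAT", "Medulloblastoma", "none"),
  sample_color = c("#1b9e77", "#1b9e77", "#d95f02", "#f1f1f1"),
  stringsAsFactors = FALSE
)

# One row per histology with its sample count and its color
counts <- aggregate(
  count ~ short_histology + sample_color,
  data = transform(metadata, count = 1),
  FUN = sum
)

# Supply the hex codes directly through the col argument
barplot(
  height = counts$count,
  names.arg = counts$short_histology,
  col = counts$sample_color,
  las = 2
)
```

Because the counts and colors come from the same summarized rows, each bar keeps the color assigned to its histology regardless of row order.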

Example 2) Color coding by numeric data

Step 1) Import the palette.

You may want to remove the na_color at the end of the list depending on whether your data include NAs or if the plotting function you are using has the na_color supplied separately.

gradient_col_palette <- readr::read_tsv(
  file.path("figures", "palettes", "gradient_color_palette.tsv")
)

If you need the NA color separately, for example for ComplexHeatmap, which has a separate argument for the color of NA values, split it out of the palette:

na_color <- gradient_col_palette %>%
  dplyr::filter(color_names == "na_color")

gradient_col_palette <- gradient_col_palette %>%
  dplyr::filter(color_names != "na_color")

Step 2) Make a color function.

In this example, we are building a colorRamp2 function based on a regular interval between the minimum and maximum of our variable df$variable by using seq. However, depending on your data's distribution a regular interval based palette might not represent your data well on the plot. You can provide any numeric vector to color code a palette using circlize::colorRamp2 as long as that numeric vector is the same length as the palette itself.

# na.rm = TRUE keeps NAs in the data from turning the breaks into NA
gradient_col_val <- seq(from = min(df$variable, na.rm = TRUE),
                        to = max(df$variable, na.rm = TRUE),
                        length.out = nrow(gradient_col_palette))

col_fun <- circlize::colorRamp2(gradient_col_val,
                                gradient_col_palette$hex_codes)
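When the variable is strongly skewed, one alternative to regularly spaced breaks is to place the breaks at quantiles of the data, so each color covers roughly the same number of observations. The following is a sketch under stated assumptions, not the project's method: the simulated df$variable and the five hex codes are hypothetical, and it assumes the quantiles are distinct. The circlize call is guarded so the break computation runs even where circlize is not installed.

```r
set.seed(2020)

# Hypothetical skewed variable standing in for df$variable
df <- data.frame(variable = rexp(200, rate = 1))

# Hypothetical 5-color gradient; use gradient_col_palette$hex_codes in practice
gradient_hex <- c("#f7f7f7", "#cccccc", "#969696", "#525252", "#000000")

# One break per color, placed at evenly spaced quantiles of the data
quantile_breaks <- unname(quantile(
  df$variable,
  probs = seq(0, 1, length.out = length(gradient_hex)),
  na.rm = TRUE
))

# colorRamp2 needs as many breaks as colors
if (requireNamespace("circlize", quietly = TRUE)) {
  col_fun <- circlize::colorRamp2(quantile_breaks, gradient_hex)
  print(head(col_fun(df$variable)))
}
```

Data with many tied values can produce repeated quantiles, in which case the breaks would need to be de-duplicated or chosen another way.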

Step 3) Apply to numeric data, or supply to your plotting code.

This step depends on how your main plotting function would like the data supplied. For example, ComplexHeatmap wants a function to be supplied to their col argument.

# Apply to variable directly and make a new column
df <- df %>%
  dplyr::mutate(color_key = col_fun(variable))

## OR ##

# Some plotting packages want a color function

ComplexHeatmap::Heatmap(
  df,
  col = col_fun, 
  na_col = na_color$hex_codes
)

Updating color palettes

The color palette TSV files are created by scripts/color_palettes.R, which can be run with Rscript scripts/color_palettes.R. Hex codes for the palettes are hard-coded in this script. The script can be called from anywhere in this repository because it locates the repository root by looking for the .git file. When the palettes change, the hex codes table in figures/README.md and its swatches should also be updated: use the swatches_table function at the end of the script, then copy and paste its output into the appropriate place in the table.
