Running the figure generation script

Steps to refresh all publication-ready figures. All steps assume your current directory is the top of this repository.

1. Obtain the current dataset.

See these instructions to obtain the current data release. We recommend using the download script to obtain data because this will automatically create symlinks in data/ to the latest files.

2. Set up an up-to-date project Docker container.

See these instructions for setting up the project Docker container. Briefly, the latest version of the project Docker image, which is updated upon commit to master, can be obtained and run via:

docker pull ccdlopenpbta/open-pbta:latest
docker run \
  -e PASSWORD=<password> \
  -p 8787:8787 \
  -v $(pwd):/home/rstudio/kitematic \
  ccdlopenpbta/open-pbta:latest

You may choose to use docker exec to interact with the container from there or if you'd prefer the RStudio interface, you can navigate to localhost:8787 and enter username rstudio and the password you set with the run command above.

3. Run the bash script that generates the figures (`scripts/run-figures.sh`).

This script runs all the intermediate steps needed to generate figures starting with the original data files.

bash figures/generate-figures.sh

Figures are saved to the figures/pngs folder and will be linked to the accompanying manuscript repository AlexsLemonade/OpenPBTA-manuscript.

Summary for each figure

Each figure has its own script stored in the figures/scripts. All are called by the main bash script figures/run-figures.sh. However, we list information about the resources, intermediate steps, and PBTA data files required for generating each figure below for convenience.

Figure	Individual script	Notes on requirements	Linked analysis modules	PBTA data files consumed
Figure 1	`scripts/fig1-sample-distribution.R`	No high RAM requirements	`sample-distribution-analysis`	`pbta-histologies.tsv`
Figure 2	`scripts/fig2-mutational-landscape.R`	256GB of RAM are needed due to the run_caller_consensus_analysis-pbta.sh handling of large MAF files	`snv-callers` `mutational-signatures`	`pbta-snv-lancet.vep.maf.gz` `pbta-snv-mutect2.vep.maf.gz` `pbta-snv-strelka2.vep.maf.gz` `pbta-snv-vardict.vep.maf.gz` `tcga-snv-lancet.vep.maf.gz` `tcga-snv-mutect2.vep.maf.gz` `tcga-snv-strelka2.vep.maf.gz`
CN status heatmap	`analyses/copy_number_consensus_call/run_consensus_call.sh` and `analyses/cnv-chrom-plot/cn_status_heatmap.Rmd`	No high RAM requirements	`cnv-chrom-plot`	`pbta-cnv-controlfreec.tsv.gz` `pbta-sv-manta.tsv.gz` `pbta-cnv-cnvkit.seg.gz`
Figure 3	No individual script (`analyses/focal-cn-file-preparation/run-prepare-cn.sh` and `analyses/oncoprint-landscape/run-oncoprint.sh` scripts are used)	24GB of RAM are needed due to the `run-prepare-cn.sh` handling of large copy number files	`focal-cn-file-preparation` `oncoprint-landscape`	`pbta-histologies.tsv` `pbta-snv-consensus-mutation.maf.tsv.gz` `pbta-fusion-putative-oncogenic.tsv` `consensus_seg_annotated_cn_autosomes.tsv.gz` `independent-specimens.wgs.primary-plus.tsv`
Transcriptomic overview	scripts/transcriptomic-overview.R	Due to the GSVA steps, we recommend ~32 GB of RAM for generating this figure	`transcriptomic-dimension-reduction` `collapse-rnaseq` `gene-set-enrichment-analysis` `immune-deconv`	`pbta-histologies.tsv` `pbta-gene-expression-rsem-fpkm.stranded.rds`
Mutation co-occurrence	No individual script (`analyses/interaction-plots/01-create-interaction-plots.sh` is used)	No high RAM requirements	`interaction-plots`	`independent-specimens.wgs.primary-plus.tsv` `pbta-snv-consensus-mutation.maf.tsv.gz`
Telomerase activities	scripts/TelomeraseActivitites.R	No high RAM requirements	`telomerase-activity-prediction`	`pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds` `pbta-gene-expression-rsem-fpkm-collapsed.polya.rds` `pbta-gene-counts-rsem-expected_count-collapsed.stranded.rds` `pbta-gene-counts-rsem-expected_count-collapsed.polya.rds` `pbta-histologies.tsv`

Color Palette Usage

This project has a set of unified color palettes. There are 6 sets of hex color keys to be used for all final figures, stored as 6 TSV files in the figures/palettes folder. hex_codes contains the colors to be passed to your plotting code and color_names contains short descriptors of each color (e.g. gradient_1, or divergent_neutral). Each palette contains an na_color that is the same color in all palettes. This color should be used for all NA values. na_color is always the last value in the palette. If na_color is not needed or is supplied separately to a plotting function, you can use a dplyr::filter(hex_code != "na_color") to remove na_color.

To see a summary of what colors are used for histology labeling, see mapping-histology-labels.nb.html

Palette File Name	HEX color key	Color Notes	Variable application
`gradient_col_palette.tsv`	gradient_0: gradient_1: gradient_2: gradient_3: gradient_4: gradient_5: gradient_6: gradient_7: gradient_8: gradient_9: na_color:	10 hex_codes where gradient_0 is for an absolute `0` but may need to be removed from the palette depending on the application	For numeric data being plotted e.g. tumor mutation burden
`divergent_col_palette.tsv`	divergent_low_5: divergent_low_4: divergent_low_3: divergent_low_2: divergent_low_1: divergent_neutral: divergent_high_1: divergent_high_2: divergent_high_3: divergent_high_4: divergent_high_5: na_color:	12 hex codes where the numbers in the name indicate distance from `divergent_neutral`.	For data has that is bidirectional e.g. Amplification/Deletion values like `seg.mean`
`binary_col_palette.tsv`	binary_1: binary_2: na_color:	A vector of two hex codes	For binary variables e.g. presence/absence or Amp/Del as statuses
`oncoprint_color_palette.tsv`	Missense_Mutation: Nonsense_Mutation: Frame_Shift_Del: Frame_Shift_Ins: Splice_Site: Translation_Start_Site: Nonstop_Mutation: In_Frame_Del: In_Frame_Ins: Stop_Codon_Ins: Start_Codon_Del: Fusion: Multi_Hit: Hom_Deletion: Hem_Deletion: amplification: loss: gain: High_Level_Gain: Multi_Hit_Fusion:	A named vector of hex codes assigned to each `short_histology` and to each `CNV`, `SNV` and `Fusion` category	For plotting an oncoprint figure, this vector provides hex codes for `CNV`, `SNV`, and `Fusion` categories

Color coding examples in R

Example 1) Color coding by histologies

Step 1) Read in color palette and select the pertinent columns

There's some extra columns in histology_label_color_table.tsv that you don't need for plotting per se but are more record-keeping purposes. With the code chunk below, we only import the four columns we need and then do a factor reorder to make sure the display_group is in the order declared by display_order.

# Import standard color palettes for project
histology_label_mapping <- readr::read_tsv(
  file.path(figures_dir, "palettes", "histology_label_color_table.tsv")
  ) %>% 
  # Select just the columns we will need for plotting
  dplyr::select(Kids_First_Biospecimen_ID, display_group, display_order, hex_codes) %>% 
  # Reorder display_group based on display_order
  dplyr::mutate(display_group = forcats::fct_reorder(display_group, display_order))

Step 2) Use dplyr::inner_join using Kids_First_Biospecimen_ID to join by so you can add on the hex_codes and display_group for each biospecimen. display_order specifies what order the display_groups should be displayed.

# Read in the metadata
metadata <- readr::read_tsv(metadata_file, guess_max = 10000) %>%
  dplyr::inner_join(histology_label_mapping, by = "Kids_First_Biospecimen_ID")

Step 4) Make your plot and use the hex_codes and display_group columns.

Using the ggplot2::scale_fill_identity() or ggplot2::scale_color_identity() allows you to supply the hex_codes column from a color palette to ggplot2 with a fill or color argument respectively. For base R plots, you should be able to supply the hex_codes column as your col argument. display_group should be used as the labels in the plot.

metadata %>%
  dplyr::group_by(display_group, hex_codes) %>%
  dplyr::summarize(count = dplyr::n()) %>%
  ggplot2::ggplot(ggplot2::aes(x = display_group, y = count, fill = hex_codes)) +
  ggplot2::geom_bar(stat = "identity") +
  ggplot2::scale_fill_identity()

Example 2) Color coding by numeric data

Step 1) Import the palette.

You may want to remove the na_color at the end of the list depending on whether your data include NAs or if the plotting function you are using has the na_color supplied separately.

gradient_col_palette <- readr::read_tsv(
  file.path(figures_dir, "palettes", "gradient_color_palette.tsv")
)

If we need the NA color separated, like for use with ComplexHeatmap which has a separate argument for the color for NA values.

na_color <- gradient_col_palette %>%
  dplyr::filter(color_names == "na_color")

gradient_col_palette <- gradient_col_palette %>%
  dplyr::filter(color_names != "na_color")

Step 2) Make a color function.

In this example, we are building a colorRamp2 function based on a regular interval between the minimum and maximum of our variable df$variable by using seq. However, depending on your data's distribution a regular interval based palette might not represent your data well on the plot. You can provide any numeric vector to color code a palette using circlize::colorRamp2 as long as that numeric vector is the same length as the palette itself.

gradient_col_val <- seq(from = min(df$variable), to = max(df$variable),
                        length.out = nrow(gradient_col_palette))

col_fun <- circlize::colorRamp2(gradient_col_val,
                                gradient_col_palette$hex_codes)

Step 3) Apply to numeric data, or supply to your plotting code.

This step depends on how your main plotting function would like the data supplied. For example, ComplexHeatmap wants a function to be supplied to their col argument.

# Apply to variable directly and make a new column
df <- df %>%
  dplyr::mutate(color_key = col_fun(variable))

## OR ##

# Some plotting packages want a color function

ComplexHeatmap::Heatmap(
  df,
  col = col_fun,
  na_col = na_color$hex_codes
)

Updating color palettes

The non-histologies color palette TSV files are created by running scripts/color_palettes.R, which can be called by Rscript scripts/color_palettes.R. Hex codes for the palettes are hard-coded in this script. The script can be called from anywhere in this repository (will look for the .git file). The hex codes table in figures/README.md and its swatches should also be updated by using the swatches_table function at the end of the script and copy and pasting this function's output to the appropriate place in the table.

The histology color palette file is created by running Rscript -e "rmarkdown::render('figures/mapping-histology-labels.Rmd', clean = TRUE)".

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Running the figure generation script

1. Obtain the current dataset.

2. Set up an up-to-date project Docker container.

3. Run the bash script that generates the figures (`scripts/run-figures.sh`).

Summary for each figure

Color Palette Usage

Color coding examples in R

Example 1) Color coding by histologies

Example 2) Color coding by numeric data

Updating color palettes

Files

README.md

Latest commit

History

README.md

File metadata and controls

Running the figure generation script

1. Obtain the current dataset.

2. Set up an up-to-date project Docker container.

3. Run the bash script that generates the figures (scripts/run-figures.sh).

Summary for each figure

Color Palette Usage

Color coding examples in R

Example 1) Color coding by histologies

Example 2) Color coding by numeric data

Updating color palettes

3. Run the bash script that generates the figures (`scripts/run-figures.sh`).