Example code for reading GATK4 GenomicsDBImport #12

Open
lima1 opened this issue Jun 13, 2020 · 3 comments

lima1 commented Jun 13, 2020

Hi Nalini,

I was playing around with this package. Thanks a lot for providing these R bindings! Much appreciated.

Do you by any chance have an example of iterating through a GATK4 GenomicsDBImport database?

I'm trying to build something similar to CreateSomaticPanelOfNormals, which fits beta-binomial distributions to all heterozygous variants, so I simply need to extract the AD field for all variants and all samples.
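For context, the kind of beta-binomial fit mentioned here can be sketched with a simple method-of-moments estimate on allele fractions. This is only an illustrative helper (`fit_beta_mom` is hypothetical, and GATK's CreateSomaticPanelOfNormals uses a different, likelihood-based fit):

```r
# Rough method-of-moments fit of a beta distribution to allele
# fractions f = ALT / DP (hypothetical helper; not GATK's actual
# estimator, just a sketch of the idea).
fit_beta_mom <- function(f) {
    m <- mean(f)
    v <- var(f)
    common <- m * (1 - m) / v - 1
    list(alpha = m * common, beta = (1 - m) * common)
}

# Example with a few heterozygous-looking allele fractions:
fit <- fit_beta_mom(c(0.43, 0.445, 0.433, 0.325))
```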

I'll figure it out if you don't have anything in place, but I thought I'd ask since it seems like a fairly standard task.

Thanks a lot,
Markus


lima1 commented Jul 4, 2020

The new query_variant_calls function is nice and does almost everything I need. This is the code I came up with:

library(genomicsdb)
library(jsonlite)
reference <- "references/hg38/Reference/sequence.fa"

workspace <- "gatk4_pon_db/"

db <- genomicsdb::connect(workspace = workspace,
    vid_mapping_file = file.path(workspace, "vidmap.json"),
    callset_mapping_file = file.path(workspace, "callset.json"),
    reference_genome = reference,
    attributes = c("DP", "AD", "AF"))

j <- jsonlite::read_json(file.path(workspace, "callset.json"))

query <- query_variant_calls(db,
    array = "1$3682058$243843487",
    column_ranges = list(c(3682058, 243843487)),
    row_ranges = list(range(sapply(j$callsets, function(x) x$row_idx))))
head(query)
  ROW     COL        SAMPLE CHROM     POS     END REF  AD    AF ALT   DP
1   1 3682335 LIB-020359pc4     1 3682336 3682336   G 617 0.430 [A] 1105
2   4 3682335 LIB-020362pc4     1 3682336 3682336   G 626 0.445 [A] 1166
3   8 3682335 LIB-020366pc4     1 3682336 3682336   G 475 0.433 [A]  891
4   9 3682335 LIB-020367pc4     1 3682336 3682336   G 766 0.325 [A] 1191

One issue I have is that Mutect2 generates a DP field in both FORMAT and INFO; one is the unfiltered depth and the other the filtered one. It looks like the unfiltered one is picked. That's fine, but AD is a vector with the filtered REF and ALT read counts, and only the REF reads make it into the data.frame. That's a bug, right? So I cannot get the filtered ALT count from the data.frame.

And a quick question: is there an easy way to cycle through the available arrays and columns, say in batches of 100,000 columns? My database is small, so reading by chromosome is fine, but I'm wondering if there is a cleaner way.

Best,
Markus

Update: iterating through all arrays:

   library(data.table)       # for data.table()
   library(futile.logger)    # for flog.info()
   library(GenomeInfoDb)     # for rankSeqlevels()

   workspace <- normalizePath(workspace, mustWork = TRUE)
   db <- genomicsdb::connect(workspace = workspace,
       vid_mapping_file = file.path(workspace, "vidmap.json"),
       callset_mapping_file = file.path(workspace, "callset.json"),
       reference_genome = reference,
       attributes = c("DP", "AD", "AF"))

   jcallset <- jsonlite::read_json(file.path(workspace, "callset.json"))
   jvidmap <- jsonlite::read_json(file.path(workspace, "vidmap.json"))

   # get all available arrays
   arrays <- sapply(dir(workspace, full.names=TRUE), file.path, "genomicsdb_meta_dir")
   arrays <- basename(names(arrays)[which(file.exists(arrays))])
   # get offsets and lengths
   contigs <- sapply(arrays, function(ary) strsplit(ary, "\\$")[[1]][1])
   contigs <- jvidmap$contigs[match(contigs, sapply(jvidmap$contigs, function(x) x$name))]
   idx <- order(rankSeqlevels(sapply(contigs, function(x) x$name)))

   all <- lapply(idx, function(i) {
       # avoid integer overflow
       c_offset <- as.numeric(contigs[[i]]$tiledb_column_offset)
       c_length <- as.numeric(contigs[[i]]$length)

       flog.info("Processing %s (offset %.0f, length %.0f)...",
           arrays[i], c_offset, c_length)
       query <- data.table(genomicsdb::query_variant_calls(db,
           array = arrays[i],
           column_ranges = list(c(c_offset, c_offset + c_length)),
           row_ranges = list(range(sapply(jcallset$callsets,
               function(x) x$row_idx)))))
   })
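As one possible answer to my own batching question above, a contig's column range could be split into fixed-size chunks before querying. The helper `column_batches` below is hypothetical; it assumes the same `tiledb_column_offset` / `length` fields from vidmap.json used in the snippet above:

```r
# Split a contig's TileDB column range into fixed-size batches
# (hypothetical helper; offset and len come from vidmap.json's
# tiledb_column_offset and length fields, as above).
column_batches <- function(offset, len, batch_size = 1e5) {
    starts <- seq(offset, offset + len - 1, by = batch_size)
    ends <- pmin(starts + batch_size - 1, offset + len - 1)
    Map(c, starts, ends)
}

# A contig of length 250,000 at offset 0 yields three batches:
# (0, 99999), (100000, 199999), (200000, 249999)
batches <- column_batches(0, 250000)

# Each batch could then be passed as a column range, e.g.:
# query_variant_calls(db, array = arrays[i],
#     column_ranges = list(batches[[k]]), row_ranges = ...)
```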


nalinigans commented Sep 11, 2020

@lima1, sorry, I was not watching this repository. Just wanted to let you know this is a work in progress. Please let me know if you have solved your issue; otherwise I will get back with answers soon.

> And quick question, is there an easy way to cycle through the available arrays + columns, say in batches of 100,000 columns? My database is small, so reading by chromosome is fine, but wondering if there is a cleaner way.

Are you open to using Spark or MPI or any other parallel processing paradigm?


lima1 commented Sep 11, 2020

Thanks @nalinigans. The code works well enough for my purpose, but I would be curious to hear your suggestions once the code has matured! Thanks for all your work, it's really nice.
