Example code for reading GATK4 GenomicsDBImport #12

Open
lima1 opened this issue Jun 13, 2020 · 3 comments

lima1 commented Jun 13, 2020

Hi Nalini,

I was playing around with this package. Thanks a lot for providing these R bindings! Much appreciated.

Do you by any chance have an example of iterating through a GATK4 GenomicsDBImport database?

I'm trying to build something similar to CreateSomaticPanelOfNormals, which fits beta-binomial distributions to all heterozygous variants, so I simply need to extract the AD field for all variants and all samples.
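For context, the kind of beta-binomial fit mentioned here can be sketched with a simple method-of-moments estimate on allele fractions. This is only an illustrative helper (`fit_beta_mom` is hypothetical, and GATK's CreateSomaticPanelOfNormals uses a different, likelihood-based fit):

```r
# Rough method-of-moments fit of a beta distribution to allele
# fractions f = ALT / DP (hypothetical helper; not GATK's actual
# estimator, just a sketch of the idea).
fit_beta_mom <- function(f) {
    m <- mean(f)
    v <- var(f)
    common <- m * (1 - m) / v - 1
    list(alpha = m * common, beta = (1 - m) * common)
}

# Example with a few heterozygous-looking allele fractions:
fit <- fit_beta_mom(c(0.43, 0.445, 0.433, 0.325))
```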

I'll figure it out if you don't have anything in place, but I thought I'd ask since it seems like a fairly standard task.

Thanks a lot,
Markus


lima1 commented Jul 4, 2020

The new query_variant_calls function is nice and does almost everything I need. This is the code I came up with:

library(genomicsdb)
library(jsonlite)
reference <- "references/hg38/Reference/sequence.fa"

workspace <- "gatk4_pon_db/"

db <- genomicsdb::connect(workspace = workspace,
    vid_mapping_file = file.path(workspace, "vidmap.json"),
    callset_mapping_file = file.path(workspace, "callset.json"),
    reference_genome = reference,
    attributes = c("DP", "AD", "AF"))

j <- jsonlite::read_json(file.path(workspace, "callset.json"))

query <- query_variant_calls(db,
    array = "1$3682058$243843487",
    column_ranges = list(c(3682058, 243843487)),
    row_ranges = list(range(sapply(j$callsets, function(x) x$row_idx))))
head(query)
  ROW     COL        SAMPLE CHROM     POS     END REF  AD    AF ALT   DP
1   1 3682335 LIB-020359pc4     1 3682336 3682336   G 617 0.430 [A] 1105
2   4 3682335 LIB-020362pc4     1 3682336 3682336   G 626 0.445 [A] 1166
3   8 3682335 LIB-020366pc4     1 3682336 3682336   G 475 0.433 [A]  891
4   9 3682335 LIB-020367pc4     1 3682336 3682336   G 766 0.325 [A] 1191

One issue I have is that Mutect2 generates a DP field in both FORMAT and INFO; one is the unfiltered depth and the other the filtered one. It looks like the unfiltered one is picked. That's fine, but AD is a vector with the filtered REF and ALT read counts, and only the REF reads make it into the data.frame. That's a bug, right? So I cannot get the filtered ALT count from the data.frame.

And a quick question: is there an easy way to cycle through the available arrays and columns, say in batches of 100,000 columns? My database is small, so reading by chromosome is fine, but I'm wondering if there is a cleaner way.

Best,
Markus

Update: iterating through all arrays:

   library(data.table)       # for data.table()
   library(futile.logger)    # for flog.info()
   library(GenomeInfoDb)     # for rankSeqlevels()

   workspace <- normalizePath(workspace, mustWork = TRUE)
   db <- genomicsdb::connect(workspace = workspace,
       vid_mapping_file = file.path(workspace, "vidmap.json"),
       callset_mapping_file = file.path(workspace, "callset.json"),
       reference_genome = reference,
       attributes = c("DP", "AD", "AF"))

   jcallset <- jsonlite::read_json(file.path(workspace, "callset.json"))
   jvidmap <- jsonlite::read_json(file.path(workspace, "vidmap.json"))

   # get all available arrays
   arrays <- sapply(dir(workspace, full.names=TRUE), file.path, "genomicsdb_meta_dir")
   arrays <- basename(names(arrays)[which(file.exists(arrays))])
   # get offsets and lengths
   contigs <- sapply(arrays, function(ary) strsplit(ary, "\\$")[[1]][1])
   contigs <- jvidmap$contigs[match(contigs, sapply(jvidmap$contigs, function(x) x$name))]
   idx <- order(rankSeqlevels(sapply(contigs, function(x) x$name)))

   all <- lapply(idx, function(i) {
       # avoid integer overflow
       c_offset <- as.numeric(contigs[[i]]$tiledb_column_offset)
       c_length <- as.numeric(contigs[[i]]$length)

       flog.info("Processing %s (offset %.0f, length %.0f)...",
           arrays[i], c_offset, c_length)
       query <- data.table(genomicsdb::query_variant_calls(db,
           array = arrays[i],
           column_ranges = list(c(c_offset, c_offset + c_length)),
           row_ranges = list(range(sapply(jcallset$callsets,
               function(x) x$row_idx)))))
   })
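As one possible answer to my own batching question above, a contig's column range could be split into fixed-size chunks before querying. The helper `column_batches` below is hypothetical; it assumes the same `tiledb_column_offset` / `length` fields from vidmap.json used in the snippet above:

```r
# Split a contig's TileDB column range into fixed-size batches
# (hypothetical helper; offset and len come from vidmap.json's
# tiledb_column_offset and length fields, as above).
column_batches <- function(offset, len, batch_size = 1e5) {
    starts <- seq(offset, offset + len - 1, by = batch_size)
    ends <- pmin(starts + batch_size - 1, offset + len - 1)
    Map(c, starts, ends)
}

# A contig of length 250,000 at offset 0 yields three batches:
# (0, 99999), (100000, 199999), (200000, 249999)
batches <- column_batches(0, 250000)

# Each batch could then be passed as a column range, e.g.:
# query_variant_calls(db, array = arrays[i],
#     column_ranges = list(batches[[k]]), row_ranges = ...)
```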


nalinigans commented Sep 11, 2020

@lima1, sorry, I was not watching this repository. Just wanted to let you know this is a work in progress. Please let me know if you have solved your issue; otherwise I will get back with answers soon.

> And quick question, is there an easy way to cycle through the available arrays + columns, say in batches of 100,000 columns? My database is small, so reading by chromosome is fine, but wondering if there is a cleaner way.

Are you open to using Spark or MPI or any other parallel processing paradigm?


lima1 commented Sep 11, 2020

Thanks @nalinigans. The code works well enough for my purpose, but I would be curious to hear your suggestions once the code has matured! Thanks for all your work, it's really nice.
