This repository has been archived by the owner on Jun 16, 2023. It is now read-only.

v7 consensus_wgs_plus_cnvkit_wxs_x_and_y.tsv.gz status== NA issue #168

Closed
kgaonkar6 opened this issue Aug 13, 2021 · 7 comments

Comments

@kgaonkar6
Contributor

What data file(s) does this issue pertain to?

consensus_wgs_plus_cnvkit_wxs_x_and_y.tsv.gz

What release are you using?

v7

Put your question or report your issue here.

The status column in consensus_wgs_plus_cnvkit_wxs_x_and_y.tsv.gz contains NAs because germline_sex_estimate, which is used to annotate the status for each sample, is NA:

focal_cn_xy %>%
  dplyr::select(-germline_sex_estimate) %>%
  left_join(histology, by = c("biospecimen_id" = "Kids_First_Biospecimen_ID")) %>%
  group_by(status, ploidy, germline_sex_estimate, experimental_strategy, cohort) %>%
  tally() %>%
  filter(is.na(status))
# A tibble: 8 × 6
# Groups:   status, ploidy, germline_sex_estimate, experimental_strategy [6]
  status ploidy germline_sex_estimate experimental_strategy cohort     n
  <chr>   <dbl> <chr>                 <chr>                 <chr>  <int>
1 NA          2 NA                    WGS                   GMKF   12983
2 NA          2 NA                    WXS                   PBTA    3738
3 NA          2 NA                    WXS                   TARGET 70983
4 NA          3 NA                    WGS                   GMKF    7167
5 NA          3 NA                    WXS                   PBTA    3611
6 NA          3 NA                    WXS                   TARGET 56289
7 NA          4 NA                    WGS                   GMKF       4
8 NA          4 NA                    WXS                   TARGET  4828

WXS samples will not have germline_sex_estimate, so we will have to add a filter specific to WXS samples.

The WGS samples from GMKF didn't have germline_sex_estimate in v7 thus status == NA. We will have it updated in v8.

@runjin326

This has been fixed in the v8 histologies file. Additionally, the v8 release of consensus_wgs_plus_cnvkit_wxs_x_and_y.tsv.gz already uses the germline sex estimate, and all remaining status==NA rows come from WXS samples (which is expected).

@kgaonkar6
Contributor Author

Thanks for the update @runjin326

However, I believe we need to remove all status==NA rows from the CNV files. For WGS, the GMKF germline_sex_estimate fixes the majority of the issue with WGS XY calls, but in general we remove the NAs here; status==NA in WGS denotes regions where CNVs were not called in WGS samples.

For WXS, since we only run the annotation, the status==NA values on the X and Y chromosomes come from missing germline_sex_estimate values. The status column is used while uploading to pedcbio, and NA is not an accepted value there.

cc @jharenza for input if I missed anything.

@runjin326

@kgaonkar6, thanks for clarifying; I just reopened this! So you mean we should add another line of code to remove all WXS samples with status==NA?
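
A minimal sketch of what such a line might look like, assuming the annotated table carries experimental_strategy and status columns as in the tally above. The toy data and the name focal_cn_xy_filtered are made up for illustration; this is not the module's actual code.

```r
library(dplyr)

# Toy stand-in for the annotated CNV table. Column names follow the thread;
# the values are invented for illustration only.
focal_cn_xy <- tibble(
  biospecimen_id = c("BS_A", "BS_A", "BS_B", "BS_C"),
  experimental_strategy = c("WXS", "WXS", "WGS", "WXS"),
  status = c(NA, "gain", NA, "loss")
)

# Drop only the WXS rows whose status is NA; WGS rows are left untouched.
focal_cn_xy_filtered <- focal_cn_xy %>%
  filter(!(experimental_strategy == "WXS" & is.na(status)))

print(focal_cn_xy_filtered)
```

Restricting the filter to WXS keeps the WGS rows with status==NA available for any downstream use that needs them.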

@runjin326 runjin326 reopened this Aug 19, 2021
@jharenza
Collaborator

"The status column is used while uploading to pedcbio, and NA is not an accepted value there."

@kgaonkar6 - were these left in for v7? I believe @migbro just set them to 0 for pedcbio to be aligned with GISTIC no-copy-change calls. I think we want to keep these in because we would want them to simulate a full, not partial, seg file.

Additionally, for chromothripsis analysis, we may need these regions available for recoding to infer CN oscillation events. So, if the only reason to remove is for pedcbio, I would say don't, and @migbro can handle that during his transform.

@kgaonkar6
Contributor Author

Hmm, so I believe the status==NA on the X and Y chromosomes in WXS samples arises because germline_sex_estimate is not available for WXS, and thus a status cannot be assigned using the logic here, not because there is no copy change in the region:

https://github.com/PediatricOpenTargets/OpenPedCan-analysis/blob/1956d0dc705fa6204bd672de24e895063d78bfbe/analyses/focal-cn-file-preparation/02-add-ploidy-consensus.Rmd#L105-L130

For example, here we see a copy number change relative to ploidy, but since germline_sex_estimate is NA the status is NA:

check %>%
  filter(is.na(status),
         biospecimen_id %in% histology_table_wxs$Kids_First_Biospecimen_ID)
# A tibble: 139,449 × 8
   biospecimen_id status copy_number ploidy ensembl         gene_symbol cytoband germline_sex_esti…
   <chr>          <chr>        <dbl>  <dbl> <chr>           <chr>       <chr>    <chr>             
 1 BS_1RTE2KEX    NA               1      3 ENSG00000129824 RPS4Y1      Yp11.2   NA                
 2 BS_1RTE2KEX    NA               1      3 ENSG00000067646 ZFY         Yp11.2   NA                
 3 BS_1RTE2KEX    NA               1      3 ENSG00000231535 LINC00278   Yp11.2   NA                
 4 BS_1RTE2KEX    NA               1      3 ENSG00000176679 TGIF2LY     Yp11.2   NA                
 5 BS_1RTE2KEX    NA               1      3 ENSG00000099715 PCDH11Y     Yp11.2   NA                
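
To illustrate why a missing germline_sex_estimate propagates to status, here is a rough sketch with dplyr::case_when. The function annotate_xy_status, the expected_cn column, and the thresholds are hypothetical and are not the logic in 02-add-ploidy-consensus.Rmd; the point is only that a sex-dependent rule with no fallback branch yields NA when the sex estimate is NA.

```r
library(dplyr)

# Hypothetical status rule for sex-chromosome genes. The thresholds are
# assumptions for illustration, not the module's actual logic.
annotate_xy_status <- function(df) {
  df %>%
    mutate(
      # Assumed expectation: full ploidy for Female, half ploidy for Male.
      # With no fallback branch, an NA sex estimate gives an NA expectation.
      expected_cn = case_when(
        germline_sex_estimate == "Female" ~ ploidy,
        germline_sex_estimate == "Male" ~ ploidy / 2
      ),
      # Comparisons against NA match no branch, so status is NA as well.
      status = case_when(
        copy_number > expected_cn ~ "gain",
        copy_number < expected_cn ~ "loss",
        copy_number == expected_cn ~ "neutral"
      )
    )
}

# Invented rows: same copy number, different sex estimates.
toy_calls <- tibble(
  biospecimen_id = c("BS_1", "BS_2", "BS_3"),
  copy_number = c(1, 1, 1),
  ploidy = c(3, 2, 3),
  germline_sex_estimate = c("Female", "Male", NA)
)

annotated <- annotate_xy_status(toy_calls)
print(annotated)
```

In this sketch the third row has a copy number that differs from ploidy, yet it still ends up with status NA purely because the sex estimate is missing.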

@jharenza
Collaborator

Ahh, I see - in this case, I think we probably ought to use the gender or germline_sex_estimate from the matching WGS files. We should create a new ticket for this and implement it in the module for v9. I believe @zhangb1 mentioned he had used the WGS germline_sex_estimate for these samples during some part of his analysis... @zhangb1, can you remind me if this is where you had implemented that, or was it upstream during CNV calling?
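
One possible way to borrow the WGS estimate, sketched on toy data. It assumes the histologies table links biospecimens from the same patient through a participant ID column (called Kids_First_Participant_ID here) and that WGS estimates agree within a participant; wgs_sex, wgs_sex_estimate, and histology_filled are made-up names.

```r
library(dplyr)

# Toy histologies table; IDs and values are invented for illustration.
histology <- tibble(
  Kids_First_Biospecimen_ID = c("BS_W1", "BS_X1", "BS_X2"),
  Kids_First_Participant_ID = c("PT_1", "PT_1", "PT_2"),
  experimental_strategy = c("WGS", "WXS", "WXS"),
  germline_sex_estimate = c("Male", NA, NA)
)

# Collect the WGS-derived estimate per participant.
# Assumption: estimates agree within a participant, so distinct() leaves
# one row each; conflicting estimates would duplicate rows in the join.
wgs_sex <- histology %>%
  filter(experimental_strategy == "WGS", !is.na(germline_sex_estimate)) %>%
  select(Kids_First_Participant_ID, wgs_sex_estimate = germline_sex_estimate) %>%
  distinct()

# Fill missing estimates from the participant's matching WGS sample.
histology_filled <- histology %>%
  left_join(wgs_sex, by = "Kids_First_Participant_ID") %>%
  mutate(germline_sex_estimate = coalesce(germline_sex_estimate, wgs_sex_estimate)) %>%
  select(-wgs_sex_estimate)

print(histology_filled)
```

WXS samples without a matching WGS sample (PT_2 in the toy data) keep an NA estimate, so they would still need separate handling.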

@runjin326

Closing with PR89 for now; the remaining issue will be addressed in ticket 177.
