v7 consensus_wgs_plus_cnvkit_wxs_x_and_y.tsv.gz `status== NA` issue #168

kgaonkar6 · 2021-08-13T23:57:59Z

What data file(s) does this issue pertain to?

consensus_wgs_plus_cnvkit_wxs_x_and_y.tsv.gz

What release are you using?

v7

Put your question or report your issue here.

The status column in consensus_wgs_plus_cnvkit_wxs_x_and_y.tsv.gz contains NAs because germline_sex_estimate, which is used to annotate the status for samples, is NA:

focal_cn_xy %>%
  dplyr::select(-germline_sex_estimate) %>%
  left_join(histology, by = c("biospecimen_id" = "Kids_First_Biospecimen_ID")) %>%
  group_by(status, ploidy, germline_sex_estimate, experimental_strategy, cohort) %>%
  tally() %>%
  filter(is.na(status))
# A tibble: 8 × 6
# Groups:   status, ploidy, germline_sex_estimate, experimental_strategy [6]
  status ploidy germline_sex_estimate experimental_strategy cohort     n
  <chr>   <dbl> <chr>                 <chr>                 <chr>  <int>
1 NA          2 NA                    WGS                   GMKF   12983
2 NA          2 NA                    WXS                   PBTA    3738
3 NA          2 NA                    WXS                   TARGET 70983
4 NA          3 NA                    WGS                   GMKF    7167
5 NA          3 NA                    WXS                   PBTA    3611
6 NA          3 NA                    WXS                   TARGET 56289
7 NA          4 NA                    WGS                   GMKF       4
8 NA          4 NA                    WXS                   TARGET  4828

WXS samples will not have germline_sex_estimate, so we will have to add a filter specific to WXS samples.

The WGS samples from GMKF didn't have germline_sex_estimate in v7, hence status == NA. We will have it updated in v8.
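As a rough sketch of what a WXS-specific check could look like, the snippet below splits the status == NA rows into those that are expected (WXS with no germline_sex_estimate) and those that are not. The toy table and the `expected` column are made up for illustration; only the column names come from the tally above.

```r
library(dplyr)

# Toy stand-in for focal_cn_xy joined to the histologies table (values invented).
focal_cn_xy_toy <- tibble::tribble(
  ~biospecimen_id, ~status, ~experimental_strategy, ~germline_sex_estimate,
  "BS_A",          "gain",  "WGS",                  "Male",
  "BS_B",          NA,      "WGS",                  NA,
  "BS_C",          NA,      "WXS",                  NA,
  "BS_D",          NA,      "WXS",                  NA
)

# Flag NA-status rows as expected only when they are WXS rows without a sex estimate.
na_summary <- focal_cn_xy_toy %>%
  filter(is.na(status)) %>%
  mutate(expected = experimental_strategy == "WXS" & is.na(germline_sex_estimate)) %>%
  count(experimental_strategy, expected, name = "n")

print(na_summary)
```

Any row with `expected == FALSE` (here the WGS row) would be the kind that still needs a fix in the histologies file.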


runjin326 · 2021-08-19T13:44:53Z

This has been fixed in the v8 histologies file. Additionally, the v8 release of consensus_wgs_plus_cnvkit_wxs_x_and_y.tsv.gz already incorporates the germline sex estimate, and all the remaining status==NA rows come from WXS samples (which is expected).

kgaonkar6 · 2021-08-19T14:15:34Z

Thanks for the update @runjin326

However, I believe we need to remove all status==NA rows in the CNV files. For WGS, the GMKF germline_sex_estimate fixes the majority of the issue with WGS XY calls, but in general we remove the NAs here. In WGS samples, status == NA denotes regions where CNVs were not called.

For WXS, since we only run the annotation, the status==NA values on the XY chromosomes come from missing germline_sex_estimate values. The status column is used while uploading to pedcbio, and NA is not an accepted value there.

cc @jharenza for input if I missed anything.

runjin326 · 2021-08-19T14:19:26Z

@kgaonkar6, thanks for clarifying this. I just reopened the issue! So do you mean we should add another line of code to remove all WXS samples with status == NA?

jharenza · 2021-08-19T14:24:19Z

The status column is used while uploading to pedcbio, and NA is not an accepted value there.

@kgaonkar6 - were these left in for v7? I believe @migbro just set them to 0 for pedcbio to be aligned with GISTIC no copy change calls. I think we want to keep these in because we would want these to simulate a full, not partial seg file.

Additionally, for chromothripsis analysis, we may need these regions available for recoding to infer CN oscillation events. So, if the only reason to remove is for pedcbio, I would say don't, and @migbro can handle that during his transform.

kgaonkar6 · 2021-08-19T14:33:22Z

Hmm so the status==NA in XY chromosomes in WXS samples is because the germline_sex_estimate is not available for WXS and thus a status cannot be assigned using the logic here, not because there is no copy change in the region I believe:

https://github.com/PediatricOpenTargets/OpenPedCan-analysis/blob/1956d0dc705fa6204bd672de24e895063d78bfbe/analyses/focal-cn-file-preparation/02-add-ploidy-consensus.Rmd#L105-L130

For example, here the copy number differs from the ploidy, but since germline_sex_estimate is NA, the status is NA:

check %>% filter(is.na(status), biospecimen_id %in% histology_table_wxs$Kids_First_Biospecimen_ID)
# A tibble: 139,449 × 8
   biospecimen_id status copy_number ploidy ensembl         gene_symbol cytoband germline_sex_esti…
   <chr>          <chr>        <dbl>  <dbl> <chr>           <chr>       <chr>    <chr>             
 1 BS_1RTE2KEX    NA               1      3 ENSG00000129824 RPS4Y1      Yp11.2   NA                
 2 BS_1RTE2KEX    NA               1      3 ENSG00000067646 ZFY         Yp11.2   NA                
 3 BS_1RTE2KEX    NA               1      3 ENSG00000231535 LINC00278   Yp11.2   NA                
 4 BS_1RTE2KEX    NA               1      3 ENSG00000176679 TGIF2LY     Yp11.2   NA                
 5 BS_1RTE2KEX    NA               1      3 ENSG00000099715 PCDH11Y     Yp11.2   NA
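To illustrate why a missing germline_sex_estimate leaves status as NA even when copy_number differs from ploidy, here is a minimal sketch. It is not the logic in 02-add-ploidy-consensus.Rmd: `annotate_xy_status` and its rule (expected copies are ploidy / 2 for "Male" and ploidy for "Female") are hypothetical, chosen only to show how `case_when()` falls through to NA when no branch matches.

```r
library(dplyr)

# Hypothetical helper, NOT the module's actual code.
annotate_xy_status <- function(df) {
  df %>%
    mutate(
      # Illustrative rule only: the expected copy number depends on the sex estimate.
      # When germline_sex_estimate is NA, neither branch matches and expected_cn is NA.
      expected_cn = case_when(
        germline_sex_estimate == "Male" ~ ploidy / 2,
        germline_sex_estimate == "Female" ~ ploidy
      ),
      # Comparisons against an NA expected_cn are NA, so status is NA as well.
      status = case_when(
        copy_number > expected_cn ~ "gain",
        copy_number < expected_cn ~ "loss",
        copy_number == expected_cn ~ "neutral"
      )
    )
}

# Same copy_number and ploidy as the rows above; only the sex estimate differs.
toy <- tibble::tribble(
  ~biospecimen_id, ~copy_number, ~ploidy, ~germline_sex_estimate,
  "BS_TOY_MALE",   1,            3,       "Male",
  "BS_TOY_NA",     1,            3,       NA
)

result <- annotate_xy_status(toy)
print(result)
```

In this toy rule the row with a sex estimate gets a status, while the otherwise identical row without one stays NA, which matches the pattern in the WXS rows.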

jharenza · 2021-08-19T18:05:56Z

Ahh, I see - in this case, I think we probably ought to use the gender or germline_sex_estimate from the matching WGS files. We should create a new ticket for this and implement in the module for v9. I believe @zhangb1 mentioned he had used the WGS germline_sex_estimate for these samples during some part of his analysis... @zhangb1 can you remind me if this is where you had implemented that, or was it upstream during cnv calling?
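One possible shape for that follow-up, sketched under assumptions: WXS rows borrow germline_sex_estimate from a WGS sample of the same participant. The participant column name (`Kids_First_Participant_ID`), the toy IDs, and the helper column `wgs_sex_estimate` are assumptions for illustration, and this is not what was implemented in the module.

```r
library(dplyr)

# Toy histologies table (values invented).
histology_toy <- tibble::tribble(
  ~Kids_First_Biospecimen_ID, ~Kids_First_Participant_ID, ~experimental_strategy, ~germline_sex_estimate,
  "BS_WGS1",                  "PT_1",                     "WGS",                  "Male",
  "BS_WXS1",                  "PT_1",                     "WXS",                  NA,
  "BS_WXS2",                  "PT_2",                     "WXS",                  NA
)

# One sex estimate per participant, taken from WGS samples only.
# Conflicting estimates for one participant would duplicate rows in the join below
# and would need to be resolved first.
wgs_sex <- histology_toy %>%
  filter(experimental_strategy == "WGS", !is.na(germline_sex_estimate)) %>%
  select(Kids_First_Participant_ID, wgs_sex_estimate = germline_sex_estimate) %>%
  distinct()

# Fill missing estimates from the matching WGS sample; keep existing values as they are.
histology_filled <- histology_toy %>%
  left_join(wgs_sex, by = "Kids_First_Participant_ID") %>%
  mutate(germline_sex_estimate = coalesce(germline_sex_estimate, wgs_sex_estimate)) %>%
  select(-wgs_sex_estimate)

print(histology_filled)
```

Participants with no WGS sample (PT_2 here) still end up with NA, so a fallback such as the reported gender would be needed for those.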

runjin326 · 2021-08-22T14:17:30Z

Close with PR89 for now - the remaining issue will be addressed in ticket 177

kgaonkar6 mentioned this issue Aug 14, 2021

v8 data release #151

Closed

10 tasks

kgaonkar6 added the next-data-release label Aug 14, 2021

runjin326 closed this as completed Aug 19, 2021

runjin326 reopened this Aug 19, 2021

jharenza mentioned this issue Aug 19, 2021

Updated analysis: focal cn file preparation to add sex/gender for WXS samples missing info #177

Closed

2 tasks

runjin326 closed this as completed Aug 22, 2021
