Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

java.lang.NullPointerException during Bubble Processing #23

Open
ariane-lozachmeur opened this issue Nov 29, 2018 · 9 comments
Open

java.lang.NullPointerException during Bubble Processing #23

ariane-lozachmeur opened this issue Nov 29, 2018 · 9 comments

Comments

@ariane-lozachmeur
Copy link

ariane-lozachmeur commented Nov 29, 2018

Hi,

I'm was running Kourami on a dozens of samples and for a few of them, I get this java error:

Bubble Processing and Path Assembly for:        A
Bubble Processing and Path Assembly for:        B
Bubble Processing and Path Assembly for:        C
Bubble Processing and Path Assembly for:        DQA1
Bubble Processing and Path Assembly for:        DQB1
java.lang.NullPointerException
        at Bubble.removeUnsupported(Bubble.java:842)
        at Bubble.<init>(Bubble.java:379)
        at Bubble.<init>(Bubble.java:384)
        at HLAGraph.countBubbles(HLAGraph.java:2050)
        at HLAGraph.countBubblesAndMerge(HLAGraph.java:900)
        at HLA.countBubblesAndMerge(HLA.java:357)
        at HLA.main(HLA.java:585)

In this case it happened during the Bubble Processing of DQB1 but it sometimes happens for other genes.

I should specify that I changed slightly the preprocessing scripts to work on the hg19 reference and include unmapped reads.
Any idea of what might cause that?
Thank you for your help and amazing work on Kourami!

@heewookl
Copy link
Contributor

heewookl commented Dec 5, 2018

Hi,

Could you please explain what you have changed in the preprocessing scripts? My guess you changed the loci information for extracting HLA alleles from hg19? If you can share what you exactly changed, that will be helpful.

Also, could you please share the bam file aligned to the kourami panel? Thank you!

@ariane-lozachmeur
Copy link
Author

You guessed correctly, I changed the loci information to extract reads from hg19. In addition, our testing showed better performances when we added back the unmapped reads to the extracted reads.

The resulting code is the following (for the extraction part of the preprocessing)

echo ">>>>>>>>>>>>>>>> extracting reads mapping to HLA loci and ALT contigs"
$samtools_bin view -b -o hla.bam $bam_path 6:29900000-29914000 6:31230000-31241000 6:31320000-31326000 \
                                6:33032000-33042000 6:33043000-33058000 6:32604000-32612000 \
                                6:32629000-32634500 6:32407550-32413000 \
                                6:32520000-32558000
$samtools_bin sort -n -@8 -o hla.sorted.bam hla.bam
$samtools_bin view -b -f4 -@8 -o unmapped.bam $bam_path
$samtools_bin sort -n -@8 -o unmapped.sorted.bam unmapped.bam
$samtools_bin merge -p -f all.bam hla.sorted.bam unmapped.sorted.bam
$samtools_bin sort --thread $num_processors -m $samtools_sort_memory_per_thread -O BAM -o $sampleid.extract.bam all.bam

The rest of the script is the same.

Unfortunately, I am not sure I will be able to send you the file. I am waiting for approval and I'll get back to you when I know more.
In the meantime, here is what I tried to do to diagnose the issue, maybe that will help.

To try and see where the NullPointerException was coming from I made a little experiment.
Say the Bubble Processing failed during assembly for DQA1, I isolated the reads mapped to DQA1 from the bam file aligned to the Kourami panel (using samtools view -o new.bam in.bam DQA1*01:01:01:01:0-100000 [...] DQA1*06:02:0-100000)
Then I downsampled this bam by 10% (samtools view -s0.9 -o new.0.9.bam new.bam)
Depending on the seed I used in samtools, those smaller bam file sometimes worked with Kourami and sometimes didn't.
At that point I figured that some specific reads were probably causing the issue. So I identified which reads were in the files that didn't work and not in the files that worked.
Removing those reads from the original on_KouramiPanel.bam file didn't solve the problem. Moreover, a bam file containing only those suspicious reads didn't cause the bug.
Now my guess is that it is probably a certain subset of reads that together create this edge case.

Lastly, the end of the .log file looks like that:

=========================
=  DQB1
=========================
[k] = 2113
trying new k: [k] = 2112
BASE:   [A,2113]
BASE:   [C,2113]
BASE:   [G,2113]
BASE:   [T,2113]
trying new k: [k] = 2111
BASE:   [A,2112]
BASE:   [C,2112]
BASE:   [G,2112]
[...]
trying new k: [k] = 2065
BASE:   [G,2066]
BASE:   [T,2066]
trying new k: [k] = 2064
BASE:   [G,2065]
Found the new start!
Setting Trimming length(header):        49

I'm sorry I can't give you the bam file to reproduce this yet. Please let me know if you have any ideas or things you would like me to test.
Thank you!

@heewookl
Copy link
Contributor

heewookl commented Dec 5, 2018

Thank you for providing me with much details.

I am guessing you updated those coordinates from hg19 annotation for HLA genes for the extraction part. Including unmapped reads can certainly improve the results especially when the initial mapping is done on hg19, however, sometimes spurious mappings can creep in which can lead to mistyping.

I have a few questions:

  1. Are all alignments performed by bwa?
  2. Is this WGS? and what are the coverages for these samples?

I understand that you can't send me the extracted bam. In the meantime, I will think of what test you can run on your end to get this problem sorted out. I shall be in touch soon. Thank you.

@ariane-lozachmeur
Copy link
Author

  1. The first alignment (to hg19) is performed by SNAP. But during the preprocessing for Kourami I only use bwa.
  2. This is exome sequencing on around 600 genes with a 100x coverage. By adding back the unmapped reads, I get around 500k reads mapped to the Kourami reference.

@heewookl
Copy link
Contributor

Hi, I would like to follow up on this issue and would it be possible for you to drop me an email to arrange a secure data transfer? Thanks.

@Rashesh7
Copy link

Hi,
Sorry for jumping in, but I am also interested in editing the preprocessing scripts to work on the hg19 reference.

Please do let me know if this was resolved and there is a consensus on what reads should it extract.

Thanks.

@NicolaLady
Copy link

NicolaLady commented Oct 7, 2019

Hi,

I am having the same issue as the original poster. Any updates on how to resolve this?

I am working with whole exome sequencing. Could the issue be due to possibly low coverage at DQB1?

Issue:

----------------REF GRAPH CONSTRUCTION--------------
---------------- READ LOADING --------------
---------------- GRAPH CLEANING --------------
Bubble Processing and Path Assembly for: A
Bubble Processing and Path Assembly for: B
Bubble Processing and Path Assembly for: C
Bubble Processing and Path Assembly for: DQA1
Bubble Processing and Path Assembly for: DQB1
java.lang.NullPointerException
at Bubble.removeUnsupported(Bubble.java:842)
at Bubble.(Bubble.java:379)
at Bubble.(Bubble.java:384)
at HLAGraph.countBubbles(HLAGraph.java:2040)
at HLAGraph.countBubblesAndMerge(HLAGraph.java:890)
at HLA.countBubblesAndMerge(HLA.java:357)
at HLA.main(HLA.java:585)

**The tail end of my log file **


DQA1

[k] = 4757
trying new k: [k] = 4756
BASE: [G,4757]
Found the new start!
Setting Trimming length(header): 1
NumBubbles: 36 found

<---------------------------------->
CHECKING INTER-SUPERBUBBLE PHASING:
<---------------------------------->

Printing 1 fractured super bubbles.
SUPER BUBBLE [0]
FractureEndIndex:[ 0 ]
IntersectionScore: 27.701600143119084 -8.23921481214836

candidate_0-0
CTGACCACGTTGCCTCTTGTGGTGTAAACTTGTACCAGTTTTACGGTCCCTCTGGCCAGTACACCCATGAATTTGATGGAGATGAGGAGTTCTACGTGGACC
TGGAGAGGAAGGAGACTGCCTGGCGGTGGCCTGAGTTCAGCAAATTTGGAGGTTTTGACCCGCAGGGTGCACTGAGAAACATGGCTGTGGCAAAACACAACT
TGAACATCATGATTAAACGCTACAACTCTACCGCTGCTACCAATG

FractureEndIndex:[ 0 ]
IntersectionScore: 8.298399856880916 -52.64326228050601

candidate_0-1
CTGACCACGTCGCCTCTTATGGTGTAAACTTGTACCAGTCTTACGGTCCCTCTGGCCAGTACACCCATGAATTTGATGGAGATGAGCAGTTCTACGTGGACC
TGGGGAGGAAGGAGACTGTCTGGTGTTTGCCTGTTCTCAGACAATTTAGATTTGACCCGCAATTTGCACTGACAAACATCGCTGTCCTAAAACATAACTTGA
ACAGTCTGATTAAACGCTCCAACTCTACCGCTGCTACCAATG

AllelePair [0:0] {PAIRSCORE:-65.01606974352967 E_SUM:63724.0 MAXFLOW:59.0}
AllelePair [0:1] {PAIRSCORE:-62.28610893381102 E_SUM:66484.0 MAXFLOW:65.0}
AllelePair [1:1] {PAIRSCORE:-153.82416468024502 E_SUM:54037.0 MAXFLOW:17.0}

BEST PAIR[0:1]: -62.28610893381102
[0 0.7914742898034024 -8.23921481214836]BEST MATCH: DQA101:01:01G 249 1.0 [MaxFl
ow:65.0] [MinDepth1:33.0] [MinDepth2:17.0]
[1 0.23709713876802618 -52.64326228050601]BEST MATCH: DQA1
05:01:01G 246 1.0 [MaxFl
ow:65.0] [MinDepth1:33.0] [MinDepth2:17.0]


DQB1

[k] = 2113
trying new k: [k] = 2112
BASE: [A,2113]
BASE: [G,2113]
BASE: [N,2113]
trying new k: [k] = 2111
BASE: [A,2112]
BASE: [N,2112]
trying new k: [k] = 2110
BASE: [C,2111]
BASE: [N,2111]
trying new k: [k] = 2109
BASE: [G,2110]
BASE: [N,2110]
trying new k: [k] = 2108
BASE: [C,2109]
BASE: [G,2109]
BASE: [T,2109]
trying new k: [k] = 2107
BASE: [A,2108]
BASE: [C,2108]
BASE: [T,2108]
trying new k: [k] = 2106
BASE: [C,2107]
BASE: [G,2107]
trying new k: [k] = 2105
BASE: [A,2106]
BASE: [C,2106]
BASE: [T,2106]
BASE: [N,2106]
trying new k: [k] = 2104
BASE: [G,2105]
BASE: [T,2105]
BASE: [N,2105]
trying new k: [k] = 2103
BASE: [A,2104]
BASE: [T,2104]
BASE: [N,2104]
trying new k: [k] = 2102
BASE: [A,2103]
BASE: [C,2103]
BASE: [T,2103]
trying new k: [k] = 2101
BASE: [G,2102]
BASE: [N,2102]
trying new k: [k] = 2100
BASE: [A,2101]
BASE: [C,2101]
BASE: [G,2101]
BASE: [T,2101]
BASE: [N,2101]
trying new k: [k] = 2099
BASE: [A,2100]
BASE: [G,2100]
BASE: [T,2100]
trying new k: [k] = 2098
BASE: [C,2099]
BASE: [G,2099]
BASE: [N,2099]
trying new k: [k] = 2097
BASE: [C,2098]
BASE: [G,2098]
BASE: [N,2098]
trying new k: [k] = 2096
BASE: [A,2097]
BASE: [C,2097]
BASE: [G,2097]
BASE: [N,2097]
trying new k: [k] = 2095
BASE: [C,2096]
BASE: [G,2096]
BASE: [T,2096]
BASE: [N,2096]
trying new k: [k] = 2094
BASE: [A,2095]
BASE: [G,2095]
BASE: [T,2095]
trying new k: [k] = 2093
BASE: [C,2094]
BASE: [T,2094]
trying new k: [k] = 2092
BASE: [C,2093]
Found the new start!
Setting Trimming length(header): 21

Kind regards,

Nicola

@nikolasthuesen
Copy link

nikolasthuesen commented Oct 21, 2020

I got the same error, when running Kourami on extracted WES data from the 1000 genomes samples NA19346 and NA19471. These were the only two giving this error out of the 829 samples I ran the workflow for. The errors were not caused by using hg19 reference - I was using GRCh38.
The error happened during the typing of DQB1 as well, meaning that Kourami had already typed A, B, C and DQA1. As the error happened, both samples had a missing ".result" file and the predictions, that it did manage to do before the error, could therefore not be found here. However, the typing results of A, B, C and DQA1 were still present in the ".log" file.
If anyone is still having this problem, and like me don't have time to wait for the fix, the predictions for A, B, C and DQA1 can still be retrieved.

@masen1991
Copy link

Can Kourami work on hg19 ref samples?
and by the way,how can i use Kourami on WES samples (maybe hg19 ref or hg38 ref)?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants