formatIMGT.sh NullPointerException #19

Open
zezzipa opened this issue Oct 15, 2018 · 10 comments

@zezzipa

zezzipa commented Oct 15, 2018

Hi again,

I have downloaded the Alignments_Rel_3330.zip from https://github.com/ANHIG/IMGTHLA and put the hla_nom_g.txt in that folder. At first formatIMGT.sh seems to work fine creating the new reference, but then it crashes. The last part of the log is as follows:

Processing [Y] <<<<<<<<<<
nucRefAl: Y*01:01 genRefAl: Y*01:01
refGeneName on nuc and gen are same
Wrting to : /proj/uppstore2018100/kourami-0.9.6/scripts/../custom_db/3.33.0/Y_gen.txt
Wrting to : /proj/uppstore2018100/kourami-0.9.6/scripts/../custom_db/3.33.0/Y_nuc.txt
REF SEQ names differs :
(nuc):Y*01:01
(gen):Y
java.lang.NullPointerException
at Sequence.processBlock(Sequence.java:554)
at Sequence.<init>(Sequence.java:498)
at MergeMSFs.mergeAndAdd(MergeMSFs.java:383)
at MergeMSFs.mergeAndAdd(MergeMSFs.java:372)
at MergeMSFs.merge(MergeMSFs.java:298)
at FormatIMGT.processGene(FormatIMGT.java:199)
at FormatIMGT.main(FormatIMGT.java:100)

Any idea what the problem might be?
Thank you in advance for the help!

@zezzipa
Author

zezzipa commented Oct 18, 2018

And as a follow-up on that question, is there a way to include MICA, MICB, TAP1 and TAP2 in the output? These genes are of interest to us in the disease we are studying, and it would be great if we could get information for all genes from the same software.

@heewookl
Contributor

I have updated the code to handle a few minor changes in IMGT alignment file format.

Please clone the latest commit in the repo and it should run fine.

Adding MICA/B and TAP1/2 can probably be done, but I am not sure when I can get to it.

@zezzipa
Author

zezzipa commented Oct 19, 2018

Thank you for looking into this. So irritating when they change format. I already had that problem another time this week.

I downloaded the new version, but now I have a problem with DRB5_gen.txt (with IMGT/HLA version 3.34.0).

Processing [DRB5] <<<<<<<<<<
nucRefAl: DRB1*01:01:01 genRefAl: DRB5*01:01:01
refGeneName on nuc and gen are NOT same
Reference sequence entry [DRB5*01:01:01] is NOT found in nuc alignments.
Check the alignment files.

I fixed it by changing the name of the allele in the DRB5_gen.txt file. Since I don't care about DRB5, that worked for me. But for someone who does care about it, this would be good to look into.
Now I have a new reference, thank you so much!
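
The mismatch reported in the log above can be checked before running formatIMGT.sh. The following is a minimal sketch, assuming the reference allele is the first allele row in each alignment file (an assumption about the IMGT layout, not something confirmed in this thread). `first_allele` and `check_ref_names` are hypothetical helpers, not part of Kourami.

```bash
# Hypothetical helpers (not part of Kourami).

# Print the first token that looks like an allele name, e.g. DRB5*01:01:01.
first_allele() {
    awk '$1 ~ /^[A-Za-z0-9]+\*[0-9]/ { print $1; exit }' "$1"
}

# Succeed only if the first allele of the gen file also occurs in the nuc file.
check_ref_names() {
    local gen_file="$1" nuc_file="$2" ref
    ref="$(first_allele "$gen_file")"
    if [ -z "$ref" ]; then
        echo "No allele rows found in $gen_file" >&2
        return 2
    fi
    if awk -v ref="$ref" '$1 == ref { found = 1; exit } END { exit !found }' "$nuc_file"; then
        echo "OK: $ref is present in $nuc_file"
    else
        echo "MISMATCH: $ref not found in $nuc_file" >&2
        return 1
    fi
}

# Usage:
# check_ref_names DRB5_gen.txt DRB5_nuc.txt
```

A non-zero exit status here would correspond to the "NOT found in nuc alignments" situation, so the check could be run for each gene before building the database.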

I understand, it is not possible to do everything at once.

@heewookl
Contributor

heewookl commented Oct 19, 2018

Hi,

I didn't bother checking what was new in the 3.34.0 release. It looks like an allele of DRB5 was added, and DRB5_gen.txt is newly included in that release. I understand you don't care about DRB5 sequences, but I suggest you use the 3.33.0 release for the time being rather than modifying allele names to get around the error. I should be able to get to this in early Nov, along with possibly supporting MICA/B and TAP1/2.

@mmaiers-nmdp

Actually, the DRB5_gen.txt that comes in the Alignments_Rel_3360.zip has the right name, so everything works if you just comment out the part in scripts/formatIMGT.sh where it overwrites the file with the one from the resources directory.

@davetang

davetang commented May 1, 2020

Thank you @mmaiers-nmdp. scripts/formatIMGT.sh works with release 3.40.0, if I comment out the following lines (or remove the code block).

if [ ! -e "$resource_dir/DRB5_gen.txt" ];then
    echo "Missing DRB5_gen.txt in the resource directory. Please git pull or git clone"
    exit 1
# else
# cp $resource_dir/DRB5_gen.txt $input_msa/.
fi
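
A less invasive variant of this workaround, sketched under assumptions: copy the bundled DRB5_gen.txt only when the downloaded release does not already contain one. `stage_drb5` is a hypothetical function, not code from formatIMGT.sh, and its two arguments stand in for the script's `$resource_dir` and `$input_msa`.

```bash
# Hypothetical helper (not part of formatIMGT.sh).
# Prefer the DRB5_gen.txt shipped with the release; fall back to the
# bundled copy only when the release does not provide one.
stage_drb5() {
    local resource_dir="$1" input_msa="$2"
    if [ -e "$input_msa/DRB5_gen.txt" ]; then
        echo "Keeping DRB5_gen.txt from the release"
    elif [ -e "$resource_dir/DRB5_gen.txt" ]; then
        cp "$resource_dir/DRB5_gen.txt" "$input_msa/"
        echo "Copied bundled DRB5_gen.txt"
    else
        echo "Missing DRB5_gen.txt in both directories" >&2
        return 1
    fi
}

# Usage:
# stage_drb5 "$resource_dir" "$input_msa" || exit 1
```

With this ordering, releases that ship a correctly named DRB5_gen.txt are left untouched, while older releases without the file still get the bundled copy.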

@freshfischer

This problem happens when Java cannot identify the allele "Y*01:01" correctly because of the '*' in the first base position in the file Y_gen.txt. The script works after deleting the first alignment position base.

@danilovkiri

@freshfischer hi, thank you for your comment. Could you please explain what you mean by "deleting the first alignment position base"? Have I understood it correctly below? Do I need to change the -1 gDNA position to 0 and remove the asterisks/G and the spaces that come before the pipe in the sequence lines starting with Y*...?

# file: Y_gen.txt
# date: 2020-10-15
# version: IPD-IMGT/HLA 3.42.0
# origin: http://hla.alleles.org/wmda/Y_gen.txt
# repository: https://raw.githubusercontent.com/ANHIG/IMGTHLA/Latest/alignments/Y_gen.t>
# author: Steven G. E. Marsh ([email protected])

 gDNA              -1
                    |
 Y*01:01           * | ATGGCGGTC GTGGCGCCCC GAACCCTCCT CCTGCTACTC TCGGGGGCCC TGGCCCTGAC>
 Y*02:01           G | --------- ---------- ---------- ---------- ---------- ---------->
 Y*03:01           * | --------- ---------- ---------- ---------- ---------- ---------->

@freshfischer

freshfischer commented Dec 21, 2020

As for this problem, I just re-edited Y_gen.txt like this so that Java can identify it:

> Nuc+Gen merged MSA for Kourami
> # file: Y_gen.txt
> # date: 2020-04-20
> # version: IPD-IMGT/HLA 3.40.0
> # origin: http://hla.alleles.org/wmda/Y_gen.txt
> # repository: https://raw.githubusercontent.com/ANHIG/IMGTHLA/Latest/alignments/Y_gen.txt
> # author: WHO, Steven G. E. Marsh (steven.marsh.ac.uk)
>                    
>  gDNA              0                                                                                   
>                    |                                                                                 
>  Y*01:01            ATGGCGGTC GTGGCGCCCC GAACCCTCCT CCTGCTACTC TCGGGGGCCC TGGCCCTGAC CCAGACCTGG GCGG 
>  Y*02:01            --------- ---------- ---------- ---------- ---------- ---------- ---------- ---- 
>  Y*03:01            --------- ---------- ---------- ---------- ---------- ---------- ---------- ---- 
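
The manual edit above could be scripted roughly as follows. This is a minimal sketch, not part of Kourami: `strip_leading_column` is a hypothetical helper, and it assumes the layout shown in this thread (a `-1` gDNA coordinate, a lone pipe marker line, and one character plus ` | ` before the sequence on each allele row of the first block). It has not been checked against Kourami's parser, so compare its output with a hand-edited file before relying on it.

```bash
# Hypothetical helper (not part of Kourami): rewrite the first alignment
# block so that no base sits before the first pipe. Later blocks are
# printed unchanged.
strip_leading_column() {
    awk '
        # First gDNA header with coordinate -1: renumber it to 0.
        !seen && /^ gDNA +-1 *$/ {
            seen = 1; in_first = 1
            sub(/-1 *$/, "0")
            print; next
        }
        # A blank line ends the first block.
        in_first && /^ *$/ { in_first = 0 }
        # Pipe marker line: shift the pipe one column to the left.
        in_first && /^ +\| *$/ { sub(/^ /, "") }
        # Allele row: replace the 4 characters "X | " with a single space.
        in_first && match($0, /^ [^ ]+\*[^ ]+ +[^ ] \| /) {
            $0 = substr($0, 1, RLENGTH - 4) " " substr($0, RLENGTH + 1)
        }
        { print }
    ' "$1"
}

# Usage (writes a new file and leaves the original untouched):
# strip_leading_column Y_gen.txt > Y_gen.fixed.txt
```

Restricting the rewrite to the first block is a design choice made here so that pipe markers and coordinates in later blocks are not shifted by accident.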

@davetang

davetang commented Mar 1, 2021

If you use the latest update of this repository (commit 545c770) instead of the latest release tag (v0.9.6), you don't get that particular error.
