Skip to content

Commit

Permalink
move liftOver dir into scripts dir
Browse files Browse the repository at this point in the history
  • Loading branch information
hxj5 committed Dec 9, 2020
1 parent 112be8d commit c30b845
Show file tree
Hide file tree
Showing 3 changed files with 0 additions and 0 deletions.
File renamed without changes.
File renamed without changes.
File renamed without changes.

5 comments on commit c30b845

@ruqianl
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @hxj5,

I assume these references files that contain the list of SNPs from 1kgenome were created using this liftOver wrapper
cellSNP

It doesn't look like the headers are reformed after lifting over. So the contig reference annotation in the *.hg38.vcf still refers to GRCh37. (which triggers errors in some GATK tools).

Have you got any suggestions on how to fix this?

Thanks,
Ruqian

@hxj5
Copy link
Collaborator Author

@hxj5 hxj5 commented on c30b845 Mar 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Ruqian,

Thanks for your feedback. A new vcf header file (header.txt, attached), which contains contigs of hg38, is extracted from cellranger-GRCh38-1.2.0/fasta/genome.fa.fai.

The new header could be used to replace the original header with

# tested with bcftools v1.10.2 (htslib v1.10.2)
bcftools reheader -h header.txt -o hg38.new.vcf.gz hg38.old.vcf.gz        

Xianjie

header.txt

@ruqianl
Copy link

@ruqianl ruqianl commented on c30b845 Mar 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, Xianjie.

The reheader command was okay. However, after reheading, I tried to sort the vcf.gz but encountered parsing errors.

bcftools sort genome1K.phase3.SNP_AF5e4.chr1toX.hg38.header.vcf.gz -o genome1K.phase3.SNP_AF5e4.chr1toX.hg38.header.sort.vcf.gz
Writing to /tmp/bcftools-sort.djaOf8
[W::vcf_parse] INFO 'OLD' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'A' is not defined in the header, assuming Type=String
Error encountered while parsing the input

Best,
Ruqian

@hxj5
Copy link
Collaborator Author

@hxj5 hxj5 commented on c30b845 Mar 30, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Ruqian,

Updated header.hg38.txt and a new script reheader.sh.txt were attached (seems github does not support
uploading .sh file now, so please rename reheader.sh.txt to reheader.sh after downloading it)

I was testing hg38.AF5e2.vcf.gz that does not include INFO 'OLD', which is different from the case of hg38.AF5e4.vcf.gz. Now annotation of INFO 'OLD' is added to the updated header file, which should work for both cases.

As to INFO 'A', it could be a bug of bcftools reheader. For now please use the attached reheader.sh whose cmdline is

# Example:   ./reheader.sh  header.hg38.txt  hg38.AF5e4.vcf.gz  hg38.AF5e4.header.vcf.gz
./reheader.sh  <header file>   <old vcf>  <new vcf>

A tiny advise is that -Oz option could be added when sorting vcf, otherwise the output vcf is in plain text instead of bgzip format.

Best
Xianjie

header.hg38.txt
reheader.sh.txt

@ruqianl
Copy link

@ruqianl ruqianl commented on c30b845 Apr 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot, Xianjie.

I finally got GATK HaplotypeCaller working for the list of SNPs in genome1K.phase3.SNP_AF5e4.chr1toX.hg38.fixed.header.sorted.vcf.gz

Best,
Ruqian

Please sign in to comment.