Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: adding the annotaTR tool #226

Merged
merged 80 commits into from
Aug 23, 2024
Merged

feat: adding the annotaTR tool #226

merged 80 commits into from
Aug 23, 2024

Conversation

gymreklab
Copy link
Contributor

@gymreklab gymreklab commented Aug 12, 2024

This PR introduces the annotaTR tool, which can be used to:

  • Add dosages to TR VCF files
  • Add Beagle annotations to imputed files (previously performed with trtools_prep_beagle_vcf.sh)
  • Output to either VCF or PGEN format

It required adding pgenlib as a dependency and updating the version of cyvcf2 since we need cyvcf2.VCF.num_variants.

This feature is related to and should resolve #209. Steps to solve issues there:

# Perform TR imputation using the Saini panel
java -Xmx10g -jar beagle.19Apr22.7c0.jar \
    gt=1kg_snps_NA12878.vcf.gz \
    ref=1kg.snp.str.chr18.vcf.gz impute=true ap=true \
    out=1kg_imputed_chr18
tabix -p vcf 1kg_imputed_chr18.vcf.gz

# Annotate dosages and output both vcf and pgen
annotaTR \
    --vcf 1kg_imputed_chr18.vcf.gz --vcftype hipstr \
    --ref-panel 1kg.snp.str.chr18.vcf.gz \
    --outtype vcf pgen \
    --out test \
    --dosages bestguess_norm 
bgzip test.vcf
tabix -p vcf test.vcf.gz

# Test the pgen output is valid
plink2 --pfile test --validate

# Test the VCF output can be used as input to TRTools
statSTR --vcf test.vcf.gz --vcftype hipstr --afreq --out test

Checklist

  • [x ] I've checked to ensure there aren't already other open pull requests for the same update/change
  • [x ] I've prefixed the title of my PR according to the conventional commits specification. If your PR fixes a bug, please prefix the PR with fix: . Otherwise, if it introduces a new feature, please prefix it with feat: . If it introduces a breaking change, please add an exclamation before the colon, like feat!: . If the scope of the PR changes because of a revision to it, please update the PR title, since the title will be used in our CHANGELOG.
  • [ x] At the top of the PR, I've listed any open issues that this PR will resolve. For example, "resolves #0" if this PR resolves issue #0
  • [ x] I've explained my changes in a manner that will make it possible for both users and maintainers of TRTools to understand them
  • [ x] I've added tests for any new functionality. Or, if this PR fixes a bug, I've added test(s) that replicate it
  • [ x] All directories with large test files are listed in the "exclude" section of our pyproject.toml so that they do not appear in our PyPI distribution. All new files are also smaller than 0.5 MB.
  • [ x] I've updated the relevant REAMDEs with any new usage information and checked that the newly built documentation is formatted properly
  • [ x] All functions, modules, classes etc. still conform to numpy docstring standards
  • [x ] (if applicable) I've updated the pyproject.toml file with any changes I've made to TRTools's dependencies, and I've run poetry lock --no-update to ensure the lock file stays up to date and that our dependencies are locked to their minimum versions
  • [ x] In the body of this PR, I've included a short address to the reviewer highlighting one or two items that might deserve their focus

Copy link
Contributor

⚠️ Possible file(s) that should be tracked in LFS detected ⚠️

    The following file(s) exceeds the file size limit: 500000 bytes, as set in the .yml configuration files:

    trtools/testsupport/sample_vcfs/associaTR/many_samples_biallelic_dosages.vcf.gz, trtools/testsupport/sample_vcfs/beagle/beagle_imputed_withap.vcf.gz, trtools/testsupport/sample_vcfs/beagle/beagle_refpanel.vcf.gz, trtools/testsupport/sample_vcfs/beagle/beagle_refpanel_duplocus.vcf.gz, trtools/testsupport/sample_vcfs/beagle/beagle_refpanel_missinginfoheader.vcf.gz, trtools/testsupport/sample_vcfs/beagle/beagle_refpanel_nostrs.vcf.gz

    Consider using git-lfs to manage large files.

@github-actions github-actions bot added the large-files Warning Label for use when LFS is detected in the commits of a Pull Request label Aug 12, 2024
Copy link
Contributor

⚠️ Possible file(s) that should be tracked in LFS detected ⚠️

    The following file(s) exceeds the file size limit: 500000 bytes, as set in the .yml configuration files:

    trtools/testsupport/sample_vcfs/associaTR/many_samples_biallelic_dosages.vcf.gz, trtools/testsupport/sample_vcfs/beagle/beagle_imputed_withap.vcf.gz, trtools/testsupport/sample_vcfs/beagle/beagle_refpanel.vcf.gz, trtools/testsupport/sample_vcfs/beagle/beagle_refpanel_duplocus.vcf.gz, trtools/testsupport/sample_vcfs/beagle/beagle_refpanel_missinginfoheader.vcf.gz, trtools/testsupport/sample_vcfs/beagle/beagle_refpanel_nostrs.vcf.gz

    Consider using git-lfs to manage large files.

Copy link
Contributor

⚠️ Possible file(s) that should be tracked in LFS detected ⚠️

    The following file(s) exceeds the file size limit: 500000 bytes, as set in the .yml configuration files:

    trtools/testsupport/sample_vcfs/associaTR/many_samples_biallelic_dosages.vcf.gz, trtools/testsupport/sample_vcfs/beagle/beagle_imputed_withap.vcf.gz, trtools/testsupport/sample_vcfs/beagle/beagle_refpanel.vcf.gz, trtools/testsupport/sample_vcfs/beagle/beagle_refpanel_duplocus.vcf.gz, trtools/testsupport/sample_vcfs/beagle/beagle_refpanel_missinginfoheader.vcf.gz, trtools/testsupport/sample_vcfs/beagle/beagle_refpanel_nostrs.vcf.gz

    Consider using git-lfs to manage large files.

Copy link
Contributor

⚠️ Possible file(s) that should be tracked in LFS detected ⚠️

    The following file(s) exceeds the file size limit: 500000 bytes, as set in the .yml configuration files:

    trtools/testsupport/sample_vcfs/associaTR/many_samples_biallelic_dosages.vcf.gz, trtools/testsupport/sample_vcfs/beagle/beagle_imputed_withap.vcf.gz

    Consider using git-lfs to manage large files.

Copy link
Contributor

⚠️ Possible file(s) that should be tracked in LFS detected ⚠️

    The following file(s) exceeds the file size limit: 500000 bytes, as set in the .yml configuration files:

    trtools/testsupport/sample_vcfs/associaTR/many_samples_biallelic_dosages.vcf.gz

    Consider using git-lfs to manage large files.

@github-actions github-actions bot added the large-files Warning Label for use when LFS is detected in the commits of a Pull Request label Aug 23, 2024
@github-actions github-actions bot removed the large-files Warning Label for use when LFS is detected in the commits of a Pull Request label Aug 23, 2024
@gymreklab
Copy link
Contributor Author

gymreklab commented Aug 23, 2024

Thanks @aryarm for the thorough review!

I have addressed the code suggestions and resolved all the conversations. See below regarding the other comments:

Code suggestions

  • Clipping/thresholding is to avoid rounding errors to ensure all the probabilities add up to 1
  • Thanks for the suggestion regarding PVAR files as VCFs! I made some minor changes to make them valid (added QUAL/FILTER columns and fileformat header). Also added DSLEN prefix and changed "None" ID's to "."

Testing suggestions

  • I added a command line case for the case of SNPs/TRs mixed in but no ref-panel. This case would have failed since it would attempt to load the SNP as a TR record but now it fails more gracefully: e.g. "Error converting chr21:14245507 to a TR record. If your file is a mix of SNPs/TRs (e.g. from Beagle) you must provide a reference panel."
  • Added command line tests for VCF files not being zipped or indexed

Tests for files remaining the same

I addded tests based on the mergeSTR examples to check if files are unchanged. This required some minor changes to trtools/testsupport/utils.py:

  • I added an option max_lines_to_compare to assert_same_vcf. Since I had to just keep the top lines of my example files otherwise they were too big. By default we should still read all lines of files to compare but for my test files I chopped them at 200 lines. (see trtools/testsupport/sample_vcfs/annotaTR_vcfs/create_test_files.sh)
  • I modified _make_info_dict to not crash if an INFO field does not have a value. This is possible for INFO fields with Number=0 like the IMP flag output by Beagle.

Copy link
Contributor

⚠️ Possible file(s) that should be tracked in LFS detected ⚠️

    The following file(s) exceeds the file size limit: 500000 bytes, as set in the .yml configuration files:

    example-files/CEU_subset_unzipped.vcf

    Consider using git-lfs to manage large files.

@github-actions github-actions bot added the large-files Warning Label for use when LFS is detected in the commits of a Pull Request label Aug 23, 2024
Copy link
Member

@aryarm aryarm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, sounds good! Thanks for implementing my suggestions

And chopping at 200 lines sounds very reasonable. We shouldn't need more than a few lines, anyway.

@github-actions github-actions bot removed the large-files Warning Label for use when LFS is detected in the commits of a Pull Request label Aug 23, 2024
@gymreklab gymreklab merged commit 59dd532 into master Aug 23, 2024
12 checks passed
@gymreklab gymreklab deleted the annotaTR branch August 23, 2024 22:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Converting Beagle imputed VCF for TRTools
2 participants