merge multiple GWAS files #70
Comments
Hi Jie, Yes, you can; see the example command. I usually use bcftools, but you can also use GATK/Picard. If you used gwas2vcf to prepare your VCF files, then the alleles and effect signs are automatically flipped so that the REF allele is the non-effect allele and the beta relates to the ALT allele. In that case all GWAS are comparable. Cheers
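The allele flipping described above can be pictured with a small Python sketch. This is not gwas2vcf's actual code; the function name `harmonise` and its arguments are hypothetical. The sketch assumes alleles are already on the same strand as the reference, and ignores strand flips and palindromic SNPs.

```python
def harmonise(ref, alt, effect_allele, other_allele, beta):
    """Return beta expressed relative to ALT, or None if the alleles do not match.

    Hypothetical helper for illustration only; no strand handling.
    """
    ref, alt = ref.upper(), alt.upper()
    effect_allele, other_allele = effect_allele.upper(), other_allele.upper()

    if effect_allele == alt and other_allele == ref:
        # Effect allele is already ALT: keep the sign.
        return beta
    if effect_allele == ref and other_allele == alt:
        # Effect allele is REF: flip the sign so beta refers to ALT.
        return -beta
    # Alleles do not correspond to this REF/ALT pair.
    return None


same = harmonise("A", "G", "G", "A", 0.12)      # effect allele is ALT
flipped = harmonise("A", "G", "A", "G", 0.12)   # effect allele is REF
mismatch = harmonise("A", "G", "C", "T", 0.12)  # alleles do not match
```

Once every study is expressed this way, a beta from one file can be compared directly with the beta for the same variant in another.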
Dear Matt: Thank you very much! Just curious, it says that "GWAS2VCF produces GWAS-VCF format files". But the GWAS-VCF format is simply the standard VCF format, correct? Of course, the VCF files for GWAS do not have genotype data, only summary statistics, just like the dbSNP summary files located at https://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh37p13/VCF/. I have been using GCTA and LDSC to run many GWAS. LDSC uses munge_sumstats.py to reformat GWAS files. GCTA does something similar: it requires certain columns with certain names. Recently, I also began to use PheWEB. The problem is that none of these "mainstream" software packages supports the VCF format, so I have to use bcftools to generate TXT files for them. I really wish the community would begin to support and adopt the VCF format, especially given that GWAS files now include millions of rows and need fast querying. What is your perspective on this? Best regards,
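The VCF-to-TXT conversion mentioned above can also be sketched in Python. This is a minimal illustration, assuming the summary statistics sit in the per-sample FORMAT fields under the keys ES (effect size), SE (standard error) and LP (-log10 p-value), as I understand the GWAS-VCF convention. The function name and the output column names are hypothetical and would need to match whatever the downstream tool expects.

```python
def vcf_line_to_row(line):
    """Convert one tab-delimited GWAS-VCF data line to a flat dict.

    Assumes a single study column and FORMAT keys ES, SE and LP.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, pos, rsid, ref, alt = fields[:5]
    # Column 9 is FORMAT, column 10 is the first (only) study column.
    stats = dict(zip(fields[8].split(":"), fields[9].split(":")))
    return {
        "SNP": rsid,
        "CHR": chrom,
        "BP": int(pos),
        "A1": alt,   # effect allele (ALT by convention)
        "A2": ref,   # non-effect allele (REF)
        "BETA": float(stats["ES"]),
        "SE": float(stats["SE"]),
        "P": 10 ** -float(stats["LP"]),  # LP is -log10(p)
    }


# Made-up example record, not real data.
example = "1\t12345\trs1\tA\tG\t.\tPASS\tAF=0.3\tES:SE:LP\t0.05:0.01:6.0\n"
row = vcf_line_to_row(example)
```

For whole files, a dedicated VCF reader or bcftools itself is likely a better fit than hand-rolled parsing.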
Hi Jie, Yes, GWAS-VCF is just a suggested standard for using VCF to store summary stats. You could prepare your own and even use different keys/columns, but we aim for consistency to allow inter-study comparisons. I recently uploaded a diagram of the gwas2vcf workflow which shows each step and might be of interest. With respect to compatibility with existing tools, my colleague is developing an R package, gwasglue, that automates analysis of summary stats in VCF using a range of tools. We also have a fork of LDSC which reads from VCF but only supports univariate analysis at the moment. These projects are under active development and we hope to provide integrations with other tools in the future. Thanks
Dear Matt: Thank you very much for letting me know about gwasglue. It is really a great idea, and I strongly feel that the human genomics research field needs something like this. I just posted at MRCIEU/gwasglue#27, suggesting a few powerful tools published this year. As you know, most researchers would like to use newly published tools, and usually one is enough for each category of analysis. For GWAS2VCF, I also have some minor suggestions:
Thank you for building the nice GWAS2VCF and pushing for a standard, which is dearly needed. Best regards,
Dear Matt: I found that it takes a very long time to run GWAS2VCF on my laptop. I had to kill it after it had run for a few hours. These days, some GWAS are in tabix-indexed, tab-delimited BED format, for example the UK Biobank biochemistry GWAS (a Nature Genetics paper), posted at https://doi.org/10.35092/yhjc.12355382. It has 35 GWAS in .GZ and .TBI format. I think this format is similar to VCF: much better than a regular TXT file and much faster to query. I don't know if GWAS2VCF has a fast way to work on tabix-indexed files like these. I think bcftools could work on these tabix-indexed files directly. BTW, there is a Python version of GWAS2VCF and also the gwasvcf R package. What is the main difference between these two? Best regards,
Hi, guys:
We can use bcftools to merge multiple regular VCF files with genotype data.
Now, I have multiple GWAS VCF files, each of which has rsID, REF, ALT, BETA, SE, P, etc.
Can I use bcftools to merge them so that I can then extract the BETA and SE from the merged file to run downstream analyses? Are there other tools for merging multiple GWAS files, which usually have millions of SNPs?
The key here is that most of these GWAS files only have A1 and A2 instead of REF and ALT.
I wish that GWAS VCF files were widely used, but these days many software packages such as LDSC, PheWEB and 2-sample-MR don't support the VCF format.
Best regards,
jie
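The merge asked about here amounts to an outer join on (chromosome, position, REF, ALT). Below is a minimal Python sketch of that idea, assuming every study has already been harmonised so that BETA refers to the ALT allele. The function name, the row keys and the toy values are all hypothetical, and for files with millions of SNPs an indexed tool such as bcftools is likely more practical.

```python
def merge_gwas(studies):
    """Outer-join studies on the variant key (CHR, BP, REF, ALT).

    studies: dict mapping a study name to a list of row dicts.
    Returns: dict mapping variant key to {study name: (BETA, SE)}.
    """
    merged = {}
    for name, rows in studies.items():
        for row in rows:
            key = (row["CHR"], row["BP"], row["REF"], row["ALT"])
            merged.setdefault(key, {})[name] = (row["BETA"], row["SE"])
    return merged


# Toy input: two made-up studies sharing one variant.
studies = {
    "trait1": [
        {"CHR": "1", "BP": 100, "REF": "A", "ALT": "G", "BETA": 0.10, "SE": 0.02},
        {"CHR": "1", "BP": 200, "REF": "C", "ALT": "T", "BETA": -0.05, "SE": 0.01},
    ],
    "trait2": [
        {"CHR": "1", "BP": 100, "REF": "A", "ALT": "G", "BETA": 0.30, "SE": 0.04},
    ],
}
merged = merge_gwas(studies)
```

Variants missing from a study simply have no entry for that study, which mirrors the missing values a multi-sample merge would produce.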