This repository implements a support vector machine (SVM)-based method for ancestry inference. It identifies the most likely ancestral group(s) for an individual by leveraging known ancestry in a reference dataset (e.g., the 1000 Genomes Project data).
We prepared three reference datasets:
- KGRef: cleaned 1000 Genomes data with five super-population groups (AFR, EUR, EAS, SAS and AMR).
- KGeurref: cleaned EUR samples from 1000 Genomes, in three groups: NEUR, SEUR and FIN.
- HGDP_AsianRef: cleaned Asian samples from HGDP, in three groups: Central_South_Asia, Est_Asia and Middle_Est.
Related files are available at https://www.dropbox.com/sh/fanfst7lyc1kn9u/AAAPyJhwiYdHc8H-31I-xbZua?dl=0
After opening the Dropbox link, click the Download button in the top right corner to download the files.
Use unzip to extract the Reference.zip file:
unzip Reference.zip
Then use unxz to decompress the xz-compressed files.
For example:
unxz KGeurref.bed.xz
PC file: a text file with a header line. It requires the following columns: FID, IID, AFF, PC1, PC2, PC3,...,PC10.
- FID: Family ID
- IID: Within-family ID
- AFF: Affection code ('1' = Reference Sample, '2' = Study Sample)
- PC1,...,PC10: Principal component (PC) values. Study samples are projected onto the reference PC space.
Example PC file from KING. The FA, MO and SEX columns are not required for the analysis.
FID IID FA MO SEX AFF PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
HG00096 HG00096 0 0 0 1 0.0110 -0.0271 0.0098 0.0198 -0.0017 -0.0097 -0.0003 0.0010 0.0031 -0.0152
HG00097 HG00097 0 0 0 1 0.0107 -0.0275 0.0090 0.0189 -0.0008 -0.0097 -0.0012 -0.0014 -0.0024 -0.0245
HG00099 HG00099 0 0 0 1 0.0111 -0.0276 0.0102 0.0183 -0.0025 -0.0151 0.0014 0.0079 -0.0090 -0.0121
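Before running the analysis, it can help to confirm that a PC file has the required columns. The following is a minimal sketch, assuming a POSIX shell with awk. The file names toy_pc.txt and toy_header_check.txt are hypothetical and used only for illustration; the rows are the three example rows above.

```shell
# Recreate the example PC file shown above under a hypothetical name.
cat > toy_pc.txt <<'EOF'
FID IID FA MO SEX AFF PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
HG00096 HG00096 0 0 0 1 0.0110 -0.0271 0.0098 0.0198 -0.0017 -0.0097 -0.0003 0.0010 0.0031 -0.0152
HG00097 HG00097 0 0 0 1 0.0107 -0.0275 0.0090 0.0189 -0.0008 -0.0097 -0.0012 -0.0014 -0.0024 -0.0245
HG00099 HG00099 0 0 0 1 0.0111 -0.0276 0.0102 0.0183 -0.0025 -0.0151 0.0014 0.0079 -0.0090 -0.0121
EOF

# Check that every required column name appears in the header line.
# toy_header_check.txt is a hypothetical output name for this sketch.
awk 'NR == 1 {
    for (i = 1; i <= NF; i++) seen[$i] = 1
    n = split("FID IID AFF PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10", req, " ")
    bad = 0
    for (j = 1; j <= n; j++)
        if (!(req[j] in seen)) { print "missing column: " req[j]; bad = 1 }
    if (!bad) print "header OK"
    exit bad
}' toy_pc.txt > toy_header_check.txt
cat toy_header_check.txt
```

The check looks columns up by name rather than by position, so extra columns such as FA, MO and SEX do not affect it.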
Popref file: a text file with a header line. It contains three columns: FID, IID and Population. Users need to create this file before the analysis.
FID IID Population
HG00096 HG00096 EUR
HG00097 HG00097 EUR
HG00099 HG00099 EUR
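One way to create a popref file is to start from a table of sample IDs and population labels. The sketch below assumes a hypothetical two-column input, toy_sample_pop.txt (sample ID, population), and assumes that FID equals IID, as in the example rows above. Neither the input file nor its layout comes from this repository.

```shell
# Hypothetical input: one line per reference sample, "sample_id population".
# The three rows mirror the popref example above.
cat > toy_sample_pop.txt <<'EOF'
HG00096 EUR
HG00097 EUR
HG00099 EUR
EOF

# Write the header, then FID, IID and Population for each sample.
# Assumption: FID equals IID, as in the example rows above.
awk 'BEGIN { print "FID IID Population" } { print $1, $1, $2 }' \
    toy_sample_pop.txt > toy_popref.txt
cat toy_popref.txt
```

If the family IDs in a reference dataset differ from the sample IDs, the FID column would need to come from that dataset's fam file instead.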
Download KING from https://www.kingrelatedness.com/Download.shtml
Get PCs from KING PCA projection. The affection status (6th column) in the study fam file needs to be 2. The reference's affection status should be 1 or missing. No change is needed if the reference is KGref.
king -b KGref,studydata --pca --projection --prefix example
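Before running the projection above, the sixth column of the study fam file can be set to 2 with awk. This is a minimal sketch on a toy fam file with made-up IDs; the file name toy_study.fam is hypothetical, and the original file is backed up first.

```shell
# Toy study .fam file (hypothetical IDs); column 6 holds the affection status.
cat > toy_study.fam <<'EOF'
FAM1 S1 0 0 1 -9
FAM1 S2 0 0 2 1
FAM2 S3 0 0 1 0
EOF

# Keep a backup, then rewrite column 6 as 2 for every study sample.
cp toy_study.fam toy_study.fam.bak
awk '{ $6 = 2; print }' toy_study.fam.bak > toy_study.fam
cat toy_study.fam
```

Assigning to a field makes awk rejoin the line with single spaces, which is fine for a whitespace-delimited fam file.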
Run the R code for ancestry inference. Three arguments are required: the PC file (examplepc.txt), the popref file (example_popref.txt) and the prefix (example). Package 'e1071' is required. Packages 'ggplot2' and 'doParallel' are optional.
Rscript Ancestry_Inference.R examplepc.txt example_popref.txt example
Alternatively, KING can run ancestry inference directly from the binary files with a single command:
king -b KGref,studydata --pca --projection --pngplot
Keep European samples only. Get PCs from KING PCA projection.
king -b KGeurref,EurStudy --pca --projection --prefix EurStudy
Run the R code to get the ancestry inference results. Three arguments are required, in this order: the PC file, the popref file and the prefix name.
Rscript Ancestry_Inference.R EurStudypc.txt KGeurref_popref.txt prefixname
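The "keep European samples only" step above needs a list of the samples inferred as EUR in the first pass. The sketch below builds such a list from an *_InferredAncestry.txt file. The input rows are made-up toy values, not real results, and both file names are hypothetical.

```shell
# Toy first-pass results (made-up IDs and values, not real output).
cat > toy_firstpass_InferredAncestry.txt <<'EOF'
FID IID PC1 PC2 Anc_1st Pr_1st Anc_2nd Pr_2nd Ancestry
F1 S1 0.0110 -0.0271 EUR 0.99 AMR 0.01 EUR
F2 S2 -0.0299 0.0012 AFR 0.99 AMR 0.01 AFR
F3 S3 0.0107 -0.0275 EUR 0.98 AMR 0.02 EUR
EOF

# Write FID and IID of samples whose Ancestry column (9th) is EUR.
awk 'NR > 1 && $9 == "EUR" { print $1, $2 }' \
    toy_firstpass_InferredAncestry.txt > toy_eur_keep.txt
cat toy_eur_keep.txt
```

A FID/IID list in this form can be passed to a genotype subsetting tool (for example, PLINK's --keep option) to produce the European-only binary files; that step is not shown here.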
example_InferredAncestry.txt
FID IID PC1 PC2 Anc_1st Pr_1st Anc_2nd Pr_2nd Ancestry
2427 NA19919 -0.0299 0.0012 AFR 0.9973 AMR 0.0016 AFR
2425 NA19902 -0.0295 0.0008 AFR 0.997 AMR 0.0017 AFR
2484 NA20335 -0.0239 -0.0029 AFR 0.9954 AMR 0.0016 AFR
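The output file is plain whitespace-delimited text, so it is easy to summarize from the command line. The sketch below counts samples per inferred ancestry and lists samples whose top probability (Pr_1st) falls below a cutoff. The cutoff of 0.9 and all file names are hypothetical choices for illustration; the rows are the three example rows above.

```shell
# Recreate the example output rows shown above under a hypothetical name.
cat > toy_InferredAncestry.txt <<'EOF'
FID IID PC1 PC2 Anc_1st Pr_1st Anc_2nd Pr_2nd Ancestry
2427 NA19919 -0.0299 0.0012 AFR 0.9973 AMR 0.0016 AFR
2425 NA19902 -0.0295 0.0008 AFR 0.997 AMR 0.0017 AFR
2484 NA20335 -0.0239 -0.0029 AFR 0.9954 AMR 0.0016 AFR
EOF

# Count samples per value of the Ancestry column (9th).
awk 'NR > 1 { n[$9]++ } END { for (a in n) print a, n[a] }' \
    toy_InferredAncestry.txt | sort > toy_ancestry_counts.txt
cat toy_ancestry_counts.txt

# Keep the header plus any sample whose Pr_1st (6th column) is below the cutoff.
# min=0.9 is a hypothetical threshold, not one defined by this repository.
awk -v min=0.9 'NR == 1 || ($6 + 0) < (min + 0)' \
    toy_InferredAncestry.txt > toy_low_confidence.txt
cat toy_low_confidence.txt
```

For the three example rows, every sample is counted under AFR and none falls below the cutoff, so the second file contains only the header line.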
Run the following code in R to get interactive plots. Packages 'shiny' and 'ggplot2' are required. Related R files are in the Rshiny folder.
Please upload a text file with ancestry information. Three columns are required: PC1, PC2 and Ancestry.
library(shiny)
runGitHub("AncestryInference_KING", "chenlab-uva", ref = "main", subdir = "Rshiny")
After the *InferredAncestry txt file is uploaded, the first plot shows the study samples' PC1 and PC2 values.
The second plot (the interactive plot) shows only samples from the chosen ancestry group.
Clicking a dot in the interactive plot lists detailed information for that sample.
You can also type a family ID of interest to see detailed information for its samples.
Manichaikul A, Mychaleckyj JC, Rich SS, Daly K, Sale M, Chen WM (2010) Robust relationship inference in genome-wide association studies. Bioinformatics 26(22):2867-2873