# Ancestry-Inference
This repository is for ancestry inference. Here we implement a support-vector-machine (SVM)-based method to identify the most likely ancestral group(s) for an individual by leveraging known ancestry in a reference dataset (e.g., the 1000 Genomes Project data).

## Reference
We prepared three reference datasets. <br/>
- KGref: cleaned 1000 Genomes data with five super population groups (AFR, EUR, EAS, SAS and AMR). <br/>
- KGeurref: cleaned EUR samples from 1000 Genomes, grouped as NEUR, SEUR and FIN. <br/>
- HGDP_AsianRef: cleaned Asian samples from HGDP, grouped as Central_South_Asia, Est_Asia and Middle_Est. <br/>

Related files are saved at https://www.dropbox.com/sh/fanfst7lyc1kn9u/AAAPyJhwiYdHc8H-31I-xbZua?dl=0 <br/>
After clicking the Dropbox link, please click the Download button at the top right corner to download the files. <br/>
Use unzip to extract the Reference.zip file.
```{bash}
unzip Reference.zip
```
Then use unxz to decompress the xz-compressed files. For example: <br/>
```{bash}
unxz KGeurref.bed.xz
```

## File format
PC file: a text file with a header line. It requires the following columns: FID, IID, AFF, PC1, PC2, PC3,...,PC10. <br/>
- FID: Family ID <br/>
- IID: Within-family ID <br/>
- AFF: sample code ('1' = Reference Sample, '2' = Study Sample) <br/>
- PC: principal component values. Study samples are projected onto the reference PC space. <br/>

Example PC file from KING. The FA, MO and SEX columns are not required for the analysis.
```{bash}
FID IID FA MO SEX AFF PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
HG00096 HG00096 0 0 0 1 0.0110 -0.0271 0.0098 0.0198 -0.0017 -0.0097 -0.0003 0.0010 0.0031 -0.0152
HG00097 HG00097 0 0 0 1 0.0107 -0.0275 0.0090 0.0189 -0.0008 -0.0097 -0.0012 -0.0014 -0.0024 -0.0245
HG00099 HG00099 0 0 0 1 0.0111 -0.0276 0.0102 0.0183 -0.0025 -0.0151 0.0014 0.0079 -0.0090 -0.0121
```

Popref file: a text file with a header line. It contains three columns: FID, IID and Population. Users need to create this file before the analysis.
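One way to create the popref file is to join the reference samples with a table of known population labels. The following is a minimal sketch and not part of this repository: the input names `ref.fam` and `sample_pop.txt` are hypothetical, and the two-column layout of `sample_pop.txt` (IID, then population label) is an assumption. Toy inputs are written first so that the snippet runs on its own.

```bash
# Toy inputs standing in for real files (both file names are hypothetical).
# ref.fam: PLINK-style .fam of the reference (FID IID FA MO SEX AFF).
printf 'HG00096 HG00096 0 0 0 1\nHG00097 HG00097 0 0 0 1\nHG00099 HG00099 0 0 0 1\n' > ref.fam
# sample_pop.txt: assumed layout, IID followed by its population label.
printf 'HG00096 EUR\nHG00097 EUR\nHG00099 EUR\n' > sample_pop.txt

# Read the labels keyed by IID from the first file, then print
# FID, IID and Population for every .fam sample that has a label.
# The header line is written before any input is read.
awk 'BEGIN { print "FID IID Population" }
     NR == FNR { pop[$1] = $2; next }
     ($2 in pop) { print $1, $2, pop[$2] }' sample_pop.txt ref.fam > example_popref.txt

cat example_popref.txt
```

With real data, replace the two `printf` lines with your own reference fam file and label table. Samples without a label are skipped, so it may be worth checking that the line count matches the number of reference samples. The resulting file should have the layout shown next.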
```{bash}
FID IID Population
HG00096 HG00096 EUR
HG00097 HG00097 EUR
HG00099 HG00099 EUR
```

## Quickstart
Download KING from https://www.kingrelatedness.com/Download.shtml

Get PCs from KING PCA projection. The affection status (6th column) in the study fam file needs to be 2. The reference's affection status should be 1 or missing. No change is needed if the reference is KGref.
```{bash}
king -b KGref,studydata --pca --projection --prefix example
```
Run the R code for ancestry inference. Three arguments are required: the PC file (examplepc.txt), the popref file (example_popref.txt) and the output prefix (example). Package 'e1071' is required. Packages 'ggplot2' and 'doParallel' are optional.
```{bash}
Rscript Ancestry_Inference.R examplepc.txt example_popref.txt example
```
Alternatively, ancestry inference can be run in KING directly from the binary files with a single command.
```{bash}
king -b KGref,studydata --pca --projection --pngplot
```

## European only inference
Keep European samples only. Get PCs from KING PCA projection.
```{bash}
king -b KGeurref,EurStudy --pca --projection --prefix EurStudy
```
Run the R code to get the ancestry inference results. Three arguments are required, in this order: PC file, popref file and prefix name.
```{bash}
Rscript Ancestry_Inference.R EurStudypc.txt KGeurref_popref.txt prefixname
```

## Output file
example_InferredAncestry.txt
```{bash}
FID IID PC1 PC2 Anc_1st Pr_1st Anc_2nd Pr_2nd Ancestry
2427 NA19919 -0.0299 0.0012 AFR 0.9973 AMR 0.0016 AFR
2425 NA19902 -0.0295 0.0008 AFR 0.997 AMR 0.0017 AFR
2484 NA20335 -0.0239 -0.0029 AFR 0.9954 AMR 0.0016 AFR
```
PNG file <br/>
<img src="https://github.com/chenlab-uva/AncestryInference_KING/blob/main/output/example_ancestryplot.png" width="854" height="480">

## Interactive plots for ancestry inference results
Run the following R code in R to get interactive plots. Packages 'shiny' and 'ggplot2' are required. The related R files are saved in the Rshiny folder. <br/>
Please upload a text file with ancestry information.
Three columns are required: PC1, PC2 and Ancestry.
```{r}
library(shiny)
runGitHub("AncestryInference_KING", "chenlab-uva", ref = "main", subdir = "Rshiny")
```
After the `*_InferredAncestry.txt` file is uploaded, the first plot shows the study samples' PC1 and PC2 values.

<img src="https://github.com/chenlab-uva/AncestryInference_KING/blob/main/output/viewAncestry_1.png" width="854" height="480">

The second plot (interactive plot) only shows samples from the chosen ancestry group.

<img src="https://github.com/chenlab-uva/AncestryInference_KING/blob/main/output/viewAncestry_2.png" width="854" height="480">

Clicking a dot in the interactive plot lists detailed information for that sample.

<img src="https://github.com/chenlab-uva/AncestryInference_KING/blob/main/output/viewAncestry_3.png" width="854" height="480">

We can also type a family ID of interest to see detailed information for its samples.

<img src="https://github.com/chenlab-uva/AncestryInference_KING/blob/main/output/viewAncestry_4.png" width="854" height="480">

## Reference
Manichaikul A, Mychaleckyj JC, Rich SS, Daly K, Sale M, Chen WM (2010) Robust relationship inference in genome-wide association studies. Bioinformatics 26(22):2867-2873