Ancestry-Inference

This repository provides tools for ancestry inference. We implement a support vector machine (SVM)-based method that identifies the most likely ancestral group(s) for an individual by leveraging known ancestry in a reference dataset (e.g., the 1000 Genomes Project data).

Reference

We prepared three reference datasets.

  • KGRef: cleaned 1000 Genomes data with five super-population groups (AFR, EUR, EAS, SAS and AMR).
  • KGeurref: cleaned EUR samples from 1000 Genomes, grouped as NEUR, SEUR and FIN.
  • HGDP_AsianRef: cleaned Asian samples from HGDP, grouped as Central_South_Asia, Est_Asia and Middle_Est.

Related files are saved at https://www.dropbox.com/sh/fanfst7lyc1kn9u/AAAPyJhwiYdHc8H-31I-xbZua?dl=0
After opening the Dropbox link, click the Download button in the top-right corner to download the files.
Use unzip to extract the Reference.zip file.

unzip Reference.zip

Then use unxz to decompress the xz-compressed files. For example:

unxz KGeurref.bed.xz

File format

PC file: a text file with a header line. It requires the following columns: FID, IID, AFF, PC1, PC2, PC3,...,PC10.

  • FID: Family ID
  • IID: Within-family ID
  • AFF: Affection code ('1' = reference sample, '2' = study sample)
  • PC: Principal component (PC) information. Study samples are projected onto the reference PC space.

Example PC file from KING. The FA, MO and SEX columns are not required for the analysis.

FID IID FA MO SEX AFF PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
HG00096 HG00096 0 0 0 1 0.0110 -0.0271 0.0098 0.0198 -0.0017 -0.0097 -0.0003 0.0010 0.0031 -0.0152
HG00097 HG00097 0 0 0 1 0.0107 -0.0275 0.0090 0.0189 -0.0008 -0.0097 -0.0012 -0.0014 -0.0024 -0.0245
HG00099 HG00099 0 0 0 1 0.0111 -0.0276 0.0102 0.0183 -0.0025 -0.0151 0.0014 0.0079 -0.0090 -0.0121
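A quick header check can catch a malformed PC file before the R script is run. The following is a minimal sketch, not part of this repository: the temporary directory and the `missing` variable are assumptions made for the illustration, and the sample rows are copied from the KING example above.

```shell
# Hypothetical pre-flight check for the PC file columns listed above.
workdir=$(mktemp -d)
pcfile="$workdir/examplepc.txt"

# Two rows copied from the KING example, written to a scratch file.
cat > "$pcfile" <<'EOF'
FID IID FA MO SEX AFF PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
HG00096 HG00096 0 0 0 1 0.0110 -0.0271 0.0098 0.0198 -0.0017 -0.0097 -0.0003 0.0010 0.0031 -0.0152
HG00097 HG00097 0 0 0 1 0.0107 -0.0275 0.0090 0.0189 -0.0008 -0.0097 -0.0012 -0.0014 -0.0024 -0.0245
EOF

# Read only the header line and list every required column that is absent.
missing=$(awk '
NR == 1 {
    for (i = 1; i <= NF; i++) seen[$i] = 1
    n = split("FID IID AFF PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10", need, " ")
    for (j = 1; j <= n; j++) if (!(need[j] in seen)) printf "%s ", need[j]
    exit
}' "$pcfile")

if [ -z "$missing" ]; then
    echo "all required columns present"
else
    echo "missing columns: $missing"
fi
```

The check looks columns up by name rather than by position, so extra columns such as FA, MO and SEX do not affect it.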

Popref file: a text file with a header line. It contains three columns: FID, IID and Population. Users need to create this file before the analysis.

FID IID Population
HG00096 HG00096 EUR
HG00097 HG00097 EUR
HG00099 HG00099 EUR
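Since the popref file has to be created by the user, one possible way to build it is from a sample table that already lists a population label per sample. The sketch below assumes a hypothetical whitespace-delimited file `samples.txt` with columns `sample`, `pop` and `super_pop`, and assumes that FID equals IID as in the example above; neither the file name nor its layout comes from this repository.

```shell
# Hypothetical input: one row per reference sample with its population labels.
workdir=$(mktemp -d)
cat > "$workdir/samples.txt" <<'EOF'
sample pop super_pop
HG00096 GBR EUR
HG00097 GBR EUR
HG00099 GBR EUR
EOF

# Replace the header, then emit FID, IID (assumed equal to FID) and Population.
awk 'NR == 1 { print "FID IID Population"; next }
     { print $1, $1, $3 }' "$workdir/samples.txt" > "$workdir/example_popref.txt"

cat "$workdir/example_popref.txt"
```

If family IDs differ from sample IDs in the reference fam file, the FID column would need to be taken from that file instead.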

Quickstart

Download KING from https://www.kingrelatedness.com/Download.shtml

Get PCs from KING PCA projection. The affection status (6th column) in the study fam file needs to be 2. The reference's affection status should be 1 or missing. No changes to the reference files are needed if the reference is KGref.

king -b KGref,studydata --pca --projection --prefix example
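Before running the projection, the affection status in the study fam file may need to be set to 2. A minimal sketch of one way to do this with awk follows; the two-sample fam file is made up for the illustration, and the backup file name is an assumption.

```shell
# Made-up study fam file: FID IID father mother sex affection-status.
workdir=$(mktemp -d)
cat > "$workdir/studydata.fam" <<'EOF'
FAM1 S1 0 0 1 -9
FAM2 S2 0 0 2 1
EOF

# Keep a copy of the original, then force the 6th column to 2 for every sample.
cp "$workdir/studydata.fam" "$workdir/studydata.fam.bak"
awk '{ $6 = 2; print }' "$workdir/studydata.fam.bak" > "$workdir/studydata.fam"

cat "$workdir/studydata.fam"
```

awk rewrites each line with single spaces between fields, which is fine for a whitespace-delimited fam file.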

Run the R code for ancestry inference. Three arguments are required: the PC file (examplepc.txt), the popref file (example_popref.txt) and the prefix (example). The package 'e1071' is required; the packages 'ggplot2' and 'doParallel' are optional.

Rscript Ancestry_Inference.R examplepc.txt example_popref.txt example

Alternatively, ancestry inference can be run directly in KING from the binary files with a single command.

king -b KGref,studydata --pca --projection --pngplot

European only inference

Keep European samples only. Get PCs from KING PCA projection.

king -b KGeurref,EurStudy --pca --projection --prefix EurStudy
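The EurStudy fileset used in the command above has to contain European samples only. One possible way to select them, assuming the first-stage output file labels such samples as `EUR` in its Ancestry column, is to write a keep list of FID and IID pairs. The rows, sample IDs and the `eur_keep.txt` file name below are made up for the illustration.

```shell
# Made-up rows in the layout of the InferredAncestry output (tab-delimited).
workdir=$(mktemp -d)
inferred="$workdir/example_InferredAncestry.txt"
printf 'FID\tIID\tPC1\tPC2\tAnc_1st\tPr_1st\tAnc_2nd\tPr_2nd\tAncestry\n' > "$inferred"
printf 'F1\tS1\t0.0110\t-0.0271\tEUR\t0.9900\tAMR\t0.0050\tEUR\n' >> "$inferred"
printf 'F2\tS2\t-0.0299\t0.0012\tAFR\t0.9973\tAMR\t0.0016\tAFR\n' >> "$inferred"
printf 'F3\tS3\t0.0107\t-0.0275\tEUR\t0.9800\tAMR\t0.0100\tEUR\n' >> "$inferred"

# Write FID and IID of every sample whose final Ancestry call (column 9) is EUR.
awk -F'\t' 'NR > 1 && $9 == "EUR" { print $1, $2 }' "$inferred" > "$workdir/eur_keep.txt"

cat "$workdir/eur_keep.txt"

# The keep list could then be passed to a subsetting tool, for example PLINK
# (an assumption, PLINK is not mentioned in this README):
#   plink --bfile studydata --keep eur_keep.txt --make-bed --out EurStudy
```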

Run the R code to get the ancestry inference results. Three arguments are required, in this order: the PC file, the popref file and the prefix name.

Rscript Ancestry_Inference.R EurStudypc.txt KGeurref_popref.txt prefixname

Output file

example_InferredAncestry.txt

FID	IID	PC1	PC2	Anc_1st	Pr_1st	Anc_2nd	Pr_2nd	Ancestry
2427	NA19919	-0.0299	0.0012	AFR	0.9973	AMR	0.0016	AFR
2425	NA19902	-0.0295	0.0008	AFR	0.997	AMR	0.0017	AFR
2484	NA20335	-0.0239	-0.0029	AFR	0.9954	AMR	0.0016	AFR
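A quick tally of the final ancestry calls is often the first thing to look at in this file. The sketch below recreates the three rows shown above in a scratch file and counts samples per value of the Ancestry column; the scratch path and the `counts` variable are assumptions made for the example.

```shell
# Recreate the three example rows above as a tab-delimited scratch file.
workdir=$(mktemp -d)
out="$workdir/example_InferredAncestry.txt"
printf 'FID\tIID\tPC1\tPC2\tAnc_1st\tPr_1st\tAnc_2nd\tPr_2nd\tAncestry\n' > "$out"
printf '2427\tNA19919\t-0.0299\t0.0012\tAFR\t0.9973\tAMR\t0.0016\tAFR\n' >> "$out"
printf '2425\tNA19902\t-0.0295\t0.0008\tAFR\t0.997\tAMR\t0.0017\tAFR\n' >> "$out"
printf '2484\tNA20335\t-0.0239\t-0.0029\tAFR\t0.9954\tAMR\t0.0016\tAFR\n' >> "$out"

# Locate the Ancestry column by name in the header, then count samples per label.
counts=$(awk -F'\t' '
NR == 1 { for (i = 1; i <= NF; i++) if ($i == "Ancestry") col = i; next }
{ n[$col]++ }
END { for (a in n) print a, n[a] }' "$out" | sort)

echo "$counts"
```

For the three rows above this prints a single line, `AFR 3`.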

PNG file

Interactive plots for ancestry inference results.

Run the following code in R to get interactive plots. The packages 'shiny' and 'ggplot2' are required. The related R files are in the Rshiny folder.
Please upload a text file with ancestry information. Three columns are required: PC1, PC2 and Ancestry.

library(shiny)
runGitHub("AncestryInference_KING", "chenlab-uva", ref = "main", subdir = "Rshiny")

After the *InferredAncestry txt file is uploaded, the first plot shows the study samples' PC1 and PC2 information.

The second plot (interactive plot) only shows samples from the chosen ancestry group.

Detailed information is listed when we click a dot in the interactive plot.

We can also type a family ID of interest to see detailed information for the samples in that family.
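The same family lookup can be done outside the Shiny app, directly on the output file. This is a minimal sketch and not part of the app: it recreates two of the example output rows in a scratch file, and the `fid` and `match` variables are names chosen for the illustration.

```shell
# Scratch copy of two example rows from the output file (tab-delimited).
workdir=$(mktemp -d)
out="$workdir/example_InferredAncestry.txt"
printf 'FID\tIID\tPC1\tPC2\tAnc_1st\tPr_1st\tAnc_2nd\tPr_2nd\tAncestry\n' > "$out"
printf '2427\tNA19919\t-0.0299\t0.0012\tAFR\t0.9973\tAMR\t0.0016\tAFR\n' >> "$out"
printf '2425\tNA19902\t-0.0295\t0.0008\tAFR\t0.997\tAMR\t0.0017\tAFR\n' >> "$out"

# Keep the header plus every row whose FID (column 1) equals the requested family ID.
fid=2427
match=$(awk -F'\t' -v fid="$fid" 'NR == 1 || $1 == fid' "$out")

echo "$match"
```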

Reference

Manichaikul A, Mychaleckyj JC, Rich SS, Daly K, Sale M, Chen WM (2010) Robust relationship inference in genome-wide association studies. Bioinformatics 26(22):2867-2873