bgeils/snip_warehouse

Warehousing dbSNP's JSON Data into PostgreSQL

Intro

NCBI hosts a large, openly available dataset of human SNPs (single-nucleotide polymorphisms), along with a good deal of auxiliary data related to each SNP. The data is hosted on an FTP server here:

ftp://ftp.ncbi.nlm.nih.gov/snp/.redesign/latest_release/JSON

and is split across 25 gzipped JSON files (chromosomes 1-22, X, Y, and mitochondrial DNA), with a total compressed size of ~100 GB (~2 TB uncompressed!).
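
Because the uncompressed data is far too large to load at once, the files are best read as a stream. The following is a minimal sketch, not the code used in this repository. It assumes each gzipped file is newline-delimited JSON (one record per line). The `iter_snp_records` helper and the `MT` label are hypothetical names introduced here for illustration.

```python
import gzip
import json
from typing import Iterator

# The 25 per-chromosome files described above: 1-22, X, Y, and mitochondrial DNA.
# "MT" is an illustrative label, not necessarily the naming used on the FTP server.
CHROMOSOMES = [str(n) for n in range(1, 23)] + ["X", "Y", "MT"]


def iter_snp_records(path: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a gzipped file.

    Assumption: the file is newline-delimited JSON, so it can be
    decompressed and parsed one line at a time without ever holding
    the whole file in memory.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

A caller would loop over `iter_snp_records(path)` for each downloaded file and process records one at a time, which keeps memory use roughly constant regardless of file size.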

Further Reading

More details can be found in this three-part series of blog posts, which walks through the development of this application step by step:

  1. Downloading JSON SNP Data & Initializing the Database
  2. Extracting ClinVar Disease & Frequency Study Data
  3. Efficiently Writing Data to PostgreSQL Database
