Weizhong Li edited this page Jun 21, 2017 · 4 revisions

Welcome to the CD-HIT Wiki - http://cd-hit.org

Program developed by Weizhong Li's lab at UCSD http://weizhongli-lab.org and JCVI http://jcvi.org
Contact: [email protected].

Introduction

CD-HIT is a widely used program for clustering biological sequences to reduce sequence redundancy and improve the performance of other sequence analyses.

CD-HIT was originally developed to cluster protein sequences to create reference databases with reduced redundancy (Li et al., 2001) and was then extended to support clustering nucleotide sequences and comparing two datasets (Li and Godzik, 2006). The CD-HIT web server, implemented in 2009, allows users to cluster or compare sequences without using command-line CD-HIT. The server provides an interactive interface and additional visualization tools.

Currently, the CD-HIT package has many programs: cd-hit, cd-hit-2d, cd-hit-est, cd-hit-est-2d, cd-hit-para, cd-hit-2d-para, psi-cd-hit, cd-hit-454, cd-hit-dup, cd-hit-lap, cd-hit-otu, etc. There are also many utility scripts, written in Perl, to help run and analyze CD-HIT jobs. Briefly:

  * cd-hit: Cluster peptide sequences
  * cd-hit-est: Cluster nucleotide sequences
  * cd-hit-2d: Compare 2 peptide databases
  * cd-hit-est-2d: Compare 2 nucleotide databases
  * psi-cd-hit: Cluster proteins at <40% cutoff
  * cd-hit-lap: Identify overlapping reads
  * cd-hit-dup: Identify duplicates from single or paired Illumina reads
  * cd-hit-454: Identify duplicates from 454 reads
  * cd-hit-otu: Cluster rRNA tags
  * cd-hit Web server: Cluster user-uploaded data
  * cd-hit-para: Cluster sequences in parallel on a computer cluster
  * scripts: Parse results and so on
  * h-cd-hit: Hierarchical clustering

Recent development of cd-hit, especially the multi-threaded version introduced in 2012 (Fu et al., 2012), enables clustering of very large NGS datasets. For example, cd-hit took less than a day on a 32-core computer to cluster a few hundred million protein sequences from a metagenomics study.
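As a rough illustration of how one of the clustering programs listed above might be invoked with multiple threads, here is a minimal Python sketch that assembles a command line. The helper name `build_cdhit_command` and the file names `proteins.fasta` and `proteins_nr90` are hypothetical and used only for illustration. The options used (`-i` input, `-o` output, `-c` identity threshold, `-T` threads, `-M` memory limit in MB) are the common ones, but check each program's own help output before relying on them.

```python
import os
import shlex
import shutil
import subprocess


def build_cdhit_command(program, fasta_in, out_prefix,
                        identity=0.9, threads=1, memory_mb=800):
    """Return an argument list for a CD-HIT clustering program.

    Hypothetical helper: it only assembles the arguments and does not
    check that the chosen program accepts this identity threshold.
    """
    if not 0.0 < identity <= 1.0:
        raise ValueError("identity must be in (0, 1]")
    return [
        program,
        "-i", fasta_in,         # input FASTA file
        "-o", out_prefix,       # output file name
        "-c", str(identity),    # sequence identity threshold
        "-T", str(threads),     # number of threads
        "-M", str(memory_mb),   # memory limit in MB
    ]


# Example: cluster proteins at 90% identity on a 32-core machine.
cmd = build_cdhit_command("cd-hit", "proteins.fasta", "proteins_nr90",
                          identity=0.9, threads=32, memory_mb=0)
print(shlex.join(cmd))

# Run only if the program is installed and the (hypothetical) input exists.
if shutil.which(cmd[0]) and os.path.isfile("proteins.fasta"):
    subprocess.run(cmd, check=True)
```

For nucleotide sequences, the same sketch could be pointed at `cd-hit-est` by changing the first argument. The identity threshold and thread count appropriate for a given dataset are left to the user.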
