CDHIT 

Cluster Database at High Identity with Tolerance 
http://bioinformatics.burnhaminst.org/cdhi 

================================================================================ 
This program is modified from CDHI; you may want to read algorithm.cdhi first.
================================================================================ 

The basic filter system of CDHI states: 

"If two proteins share certain sequence identity, they should have
at least a certain number of identical pentapeptides. For example,
two sequences having 85% identical residues over a 100-residue
window will have at least 25 identical pentapeptides."
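
One way to arrive at a figure like this is to assume that each mismatch
destroys at most five overlapping pentapeptides. The sketch below illustrates
that reasoning; it is not taken from the CDHI source. The function name
min_shared_kmers and the simple window - k * mismatches formula (which ignores
the edges of the window) are assumptions made here.

```python
def min_shared_kmers(window: int, identity_pct: int, k: int = 5) -> int:
    """Lower bound on identical k-mers in an ungapped window (hypothetical helper).

    Assumes each mismatch destroys at most k overlapping k-mers and
    ignores edge effects at the window boundaries.
    """
    # Largest number of mismatches compatible with the identity level.
    max_mismatches = window * (100 - identity_pct) // 100
    return max(0, window - k * max_mismatches)


print(min_shared_kmers(100, 85))  # 25, the figure in the quoted example
print(min_shared_kmers(100, 80))  # 0, no pentapeptide is guaranteed
```

Under this assumption the guarantee drops to zero at 80% identity, which is
why a strict filter cannot be used at lower thresholds.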

Theoretically, two sequences with 80% identity need not share a single
identical pentapeptide. They can differ at every fifth amino acid, like this:

MSHHWGYGKHNGPEMWHKDFPIAKGERQS.... 
MSHH GYGK NGPE WHKD PIAK ERQS.... 
MSHHcGYGKdNGPEhWHKDiPIAKtERQS.... 
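
This example can be checked directly. The following is a minimal sketch: the
helper name kmers is hypothetical, and only the visible part of the two
sequences is used (the trailing "...." is left out).

```python
def kmers(seq: str, k: int = 5) -> set:
    """Return the set of all overlapping k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


a = "MSHHWGYGKHNGPEMWHKDFPIAKGERQS"
# Lowercase letters mark the mismatches; uppercase them before comparing.
b = "MSHHcGYGKdNGPEhWHKDiPIAKtERQS".upper()

# Count identical residues position by position (ungapped alignment).
matches = sum(x == y for x, y in zip(a, b))
identity = matches / len(a)

# Pentapeptides that occur in both sequences, at any position.
shared = kmers(a) & kmers(b)

print(f"identity: {matches}/{len(a)} = {identity:.1%}")  # identity: 24/29 = 82.8%
print(f"shared pentapeptides: {len(shared)}")            # shared pentapeptides: 0
```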

But this is very rare in real-world alignments. Even when the alignment
identity is only 60%, there are generally still some identical pentapeptides.
This is the basis of CDHIT.

CDHIT is based on the statistical analysis of a large number of alignments.
This speeds up the program without losing much clustering quality.
