CDHIT 
Cluster Database at High Identity with Tolerance 
http://bioinformatics.burnhaminst.org/cdhi 
================================================================================ 
This program is a modification of CDHI; you may want to read algorithm.cdhi first.
================================================================================ 
The basic filter of CDHI states:
"If two proteins share a certain sequence identity, they should have
at least a certain number of identical pentapeptides. For example,
two sequences having 85% identical residues over a 100-residue
window will have at least 25 identical pentapeptides."
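The figure of 25 can be reproduced with a simple counting argument. The
sketch below is an illustration only: it assumes each mismatch removes at
most k identical k-mers and approximates the number of k-mers in a window by
the window length. The function name min_shared_kmers is hypothetical and
does not come from the CDHI source.

```python
def min_shared_kmers(window: int, identity_percent: int, k: int = 5) -> int:
    """Lower bound on identical k-mers in a gap-free window.

    Assumption (not taken from the CDHI source): each mismatch removes at
    most k k-mers, and the window is treated as holding `window` k-mers.
    """
    matches = (window * identity_percent) // 100
    mismatches = window - matches
    # The bound cannot go below zero.
    return max(0, window - k * mismatches)


if __name__ == "__main__":
    print(min_shared_kmers(100, 85))  # 85% identity over 100 residues
    print(min_shared_kmers(100, 80))  # 80% identity over 100 residues
```

Under these assumptions the bound drops to zero at 80% identity, which is
why a pentapeptide filter cannot give a guarantee at that level.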
Theoretically, two sequences with 80% identity need not share a single
identical pentapeptide. They can have a mismatch after every 4 identical
amino acids, like:
MSHHWGYGKHNGPEMWHKDFPIAKGERQS.... 
MSHH GYGK NGPE WHKD PIAK ERQS.... 
MSHHcGYGKdNGPEhWHKDiPIAKtERQS.... 
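The example above can be checked with a short script. This is a minimal
sketch, assuming a gap-free alignment and treating the lowercase letters as
ordinary residues that mark the mismatches; the helper names
aligned_identity and shared_aligned_kmers are hypothetical and are not part
of CDHIT.

```python
def aligned_identity(a: str, b: str) -> float:
    """Fraction of aligned positions with the same residue (case-insensitive)."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length (gap-free alignment)")
    a, b = a.upper(), b.upper()
    return sum(x == y for x, y in zip(a, b)) / len(a)


def shared_aligned_kmers(a: str, b: str, k: int = 5) -> int:
    """Count positions where both sequences carry the same k-mer."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length (gap-free alignment)")
    a, b = a.upper(), b.upper()
    return sum(a[i:i + k] == b[i:i + k] for i in range(len(a) - k + 1))


# The two sequences from the example (the visible part only).
seq1 = "MSHHWGYGKHNGPEMWHKDFPIAKGERQS"
seq2 = "MSHHcGYGKdNGPEhWHKDiPIAKtERQS"

if __name__ == "__main__":
    print(f"identity: {aligned_identity(seq1, seq2):.3f}")
    print(f"shared pentapeptides: {shared_aligned_kmers(seq1, seq2)}")
```

Over the 29 residues shown, 24 positions match, yet every window of five
residues contains one of the mismatches, so no pentapeptide is shared.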
But this is very rare in real alignments. Even when the alignment is at
60% identity, there are generally still some identical pentapeptides. This
is the basis of CDHIT.
CDHIT is based on the statistical analysis of a large number of alignments.
While it speeds up the program, it does not lose much clustering quality.
