Chemical Forums

General Forums => Generic Discussion => Topic started by: Donaldson Tan on July 08, 2006, 08:34:38 PM

Title: Building an open source thermophysical and chemical database program
Post by: Donaldson Tan on July 08, 2006, 08:34:38 PM
Any thoughts?

I intend to use InChI to form the basis of the catalog system.

However, with 200,000 entries in my database, I am not going to do a case-by-case comparison until I find the right substance. That would take too long. I wonder if there is a known mathematical function that can convert an InChI into a unique numerical identifier. I believe there is such a function because InChI is based on SMILES, which has extensive graph-theory backing.

Any mathematician wannabe here familiar with graph theory?
Title: Re: Building an open source thermophysical and chemical database program
Post by: Donaldson Tan on July 08, 2006, 09:27:14 PM
According to IUPAC, InChI only works on neutral and ionic organic structures at the moment.

What about the SMILES protocol? Are there any current limitations on it?
Title: Re: Building an open source thermophysical and chemical database program
Post by: Borek on July 09, 2006, 03:37:08 AM
InChI - any hashing function will do: CRC32, MD5 or something similar. Note there is also a short, slightly coded version of InChI. Besides, you don't have to compare every string with every other one - InChI strings are comparable, so you can store them in a binary search tree with logarithmic access time (sorry, no idea about proper English nomenclature here). Check out

http://en.wikipedia.org/wiki/AVL_tree
http://en.wikipedia.org/wiki/Balanced_tree
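
As a rough illustration of the sorted-tree idea in Java, `java.util.TreeMap` is a red-black tree (a balanced binary search tree), so lookups by InChI string take a logarithmic number of comparisons. The class name `InchiIndex` and the stored names are made up for this sketch, and the methane and water strings are just sample entries.

```java
import java.util.TreeMap;

// Sketch only: InchiIndex and its contents are hypothetical.
class InchiIndex {
    // TreeMap keeps keys sorted in a red-black tree: O(log n) put and get.
    private final TreeMap<String, String> byInchi = new TreeMap<>();

    void add(String inchi, String name) {
        byInchi.put(inchi, name);
    }

    // Returns the stored name, or null if the InChI is not indexed.
    String lookup(String inchi) {
        return byInchi.get(inchi);
    }

    public static void main(String[] args) {
        InchiIndex index = new InchiIndex();
        index.add("InChI=1/CH4/h1H4", "methane");
        index.add("InChI=1/H2O/h1H2", "water");
        index.add("InChI=1/ClH.Na/h1H;/q;+1/p-1", "sodium chloride");
        System.out.println(index.lookup("InChI=1/H2O/h1H2"));
    }
}
```

With 200000 entries a balanced tree needs roughly 18 string comparisons per lookup, instead of up to 200000 for a linear scan.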

SMILES - they can be used to record structures, but they are not directly comparable, as there is no single canonical version (search the forums for canonical SMILES). So CCC(=O)C and CC(=O)CC both describe the same molecule, which you can't tell by simply comparing the strings. InChI doesn't have this problem.
Title: Re: Building an open source thermophysical and chemical database program
Post by: Donaldson Tan on July 09, 2006, 07:17:07 AM
Borek, are you able to access the IUPAC site to download the official InChI converter?

The username/password (iudown/bun53n) does not work.

The reason I want to use InChI is the canonicalization problem associated with SMILES.
Title: Re: Building an open source thermophysical and chemical database program
Post by: Borek on July 09, 2006, 07:26:43 AM
Sorry, no time to try. I am using ChemSketch from acdlabs.com for InChI generation.
Title: Re: Building an open source thermophysical and chemical database program
Post by: Donaldson Tan on July 09, 2006, 07:50:04 AM
LOL. I thought everybody was using the official IUPAC-endorsed InChI converter.

Cool.. ChemSketch is freeware (http://www.acdlabs.com/download/chemsk.html)

I will try to access the IUPAC site again to download it, because I want to port the InChI C++ source code for SMILES to InChI conversion to Java. Btw, do you know how to implement an AVL tree or a B-tree?

I am trying my best to interpret the pseudo-code by the original authors of the DSW tree-rebalancing algorithm (http://www.eecs.umich.edu/~qstout/pap/CACM86.pdf). I don't really understand it.
Title: Re: Building an open source thermophysical and chemical database program
Post by: Borek on July 09, 2006, 09:09:05 AM
Don't bother with your own implementation; look for STL (Standard Template Library) or MFC collections, or something similar.
Title: Re: Building an open source thermophysical and chemical database program
Post by: Donaldson Tan on July 11, 2006, 02:51:24 AM
Basically, the substances will be ordered by the MD5 checksum of their SMILES code. Since an MD5 checksum is 128 bits long, it can be represented as a 32-digit hexadecimal integer. This gives an upper limit of 2^128, about 3.4E38, on the number of substances that can be stored in the database.

Java has its own internal MD5 implementation, so that saves a lot of trouble.
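
A minimal sketch of what that could look like, using the standard `java.security.MessageDigest` class. The class name `Md5Key` and the helper `md5Hex` are invented for this example; only the JDK calls are real.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch only: Md5Key and md5Hex are hypothetical names.
class Md5Key {
    // Returns the 128-bit MD5 digest of the identifier as 32 hex digits.
    static String md5Hex(String identifier) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(identifier.getBytes(StandardCharsets.UTF_8));
            // Treat the 16 bytes as an unsigned integer; %032x pads with
            // leading zeros so the key always has 32 digits.
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        // Two SMILES spellings of the same molecule give different keys.
        System.out.println(md5Hex("CCC(=O)C"));
        System.out.println(md5Hex("CC(=O)CC"));
    }
}
```

Note that the hash is taken over the text, not the molecule: CCC(=O)C and CC(=O)CC produce different keys even though they are the same substance, so the string has to be canonical before it is hashed.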

Things to implement:
1. OpenBabel (http://openbabel.sourceforge.net)
2. Correlation Calculator
3. Interpolation Calculator
4. Mixing function

What other useful features should cheminformatics software include?
Title: Re: Building an open source thermophysical and chemical database program
Post by: Borek on July 11, 2006, 03:21:37 AM
Generally speaking, hashing functions never work perfectly, so you must be prepared to handle the situation where two different substances have identical MD5 values. In the case of MD5 this is extremely rare, but still, for obvious reasons, you can use MD5 to enumerate "only" 2^128 substances.
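
One common way to be prepared for that, sketched here under some assumptions: keep a bucket of full identifier strings per hash value and compare the full string on lookup. `CollisionSafeIndex` is a made-up class, and `String.hashCode` stands in for MD5 only because it has well-known colliding inputs ("Aa" and "BB") that make the behaviour easy to see.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only: CollisionSafeIndex is hypothetical, and hash() is a
// stand-in for MD5 chosen because its collisions are easy to show.
class CollisionSafeIndex {
    // hash value -> every identifier that shares that hash value
    private final Map<Integer, List<String>> buckets = new HashMap<>();

    static int hash(String id) {
        return id.hashCode();
    }

    void add(String id) {
        List<String> bucket =
            buckets.computeIfAbsent(hash(id), k -> new ArrayList<>());
        if (!bucket.contains(id)) {
            bucket.add(id);
        }
    }

    // The hash only narrows the search; the full string decides.
    boolean contains(String id) {
        List<String> bucket = buckets.get(hash(id));
        return bucket != null && bucket.contains(id);
    }

    public static void main(String[] args) {
        CollisionSafeIndex index = new CollisionSafeIndex();
        index.add("Aa");
        System.out.println(hash("Aa") == hash("BB")); // same hash value
        System.out.println(index.contains("BB"));     // still not found
    }
}
```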

If you use InChI strings as indexes you don't have that problem, as they have been designed with uniqueness in mind.

But InChI is not a perfect solution either if you don't have full information about the molecule available.
Title: Re: Building an open source thermophysical and chemical database program
Post by: Donaldson Tan on July 11, 2006, 10:21:12 AM
How is InChI not a perfect solution if I do not have full information about the substance?

I am only using InChI as the basis to index different substances.

The only problem here is that InChI is currently limited to organic substances.

I wonder how InChI describes inorganic compounds.

I merely want to make a program that makes up for how hard it is to get essential thermophysical and chemical data. Correlations are very convenient functions that are often not available. It would be useful for people like me who want to model a chemical reactor.
Title: Re: Building an open source thermophysical and chemical database program
Post by: Borek on July 11, 2006, 10:33:20 AM
Quote
How is InChI not a perfect solution if I do not have full information about the substance?

Each layer contains additional information - for example if there is no conformation information, several substances will be described by the same InChI string with one layer only.

Quote
I wonder how InChI describes inorganic compounds.

No idea about the details, but ChemSketch generates InChI for salts (like InChI=1/ClH.Na/h1H;/q;+1/p-1 for NaCl). If you find something more precise, please post.
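
To make the layer structure visible, here is a small sketch that splits the NaCl string above on "/" and labels each layer by its leading letter. `InchiLayers` and `split` are invented names, and this is a simplification: it assumes each prefix letter occurs only once, which will not hold for every InChI string.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: InchiLayers is hypothetical and ignores repeated prefixes.
class InchiLayers {
    // Splits an InChI into version, formula and single-letter layers.
    static Map<String, String> split(String inchi) {
        if (!inchi.startsWith("InChI=")) {
            throw new IllegalArgumentException("not an InChI: " + inchi);
        }
        String[] parts = inchi.substring("InChI=".length()).split("/");
        Map<String, String> layers = new LinkedHashMap<>();
        layers.put("version", parts[0]);
        if (parts.length > 1) {
            layers.put("formula", parts[1]);
        }
        // Remaining layers start with a one-letter prefix (h, q, p, ...).
        for (int i = 2; i < parts.length; i++) {
            if (parts[i].isEmpty()) {
                continue;
            }
            layers.put(parts[i].substring(0, 1), parts[i].substring(1));
        }
        return layers;
    }

    public static void main(String[] args) {
        split("InChI=1/ClH.Na/h1H;/q;+1/p-1")
            .forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

Two strings that differ only in a layer missing from one of them would not compare equal, which is the practical side of the point about incomplete information.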