

Topic: Building an open source thermophysical and chemical database program  (Read 9464 times)


Offline Donaldson Tan

  • Editor, New Asia Republic
  • Retired Staff
  • Sr. Member
  • *
  • Posts: 3177
  • Mole Snacks: +261/-13
  • Gender: Male
    • New Asia Republic
Any thoughts?

I intend to use InChI to form the basis of the catalog system.

However, with 200,000 entries in my database, I am not going to do a case-by-case comparison until I find the right substance. That would take too long. I wonder if there is any known mathematical function that can convert an InChI into a unique numerical identifier. I believe there is such a function because InChI is based on SMILES, which has extensive graph-theory backing.

Any mathematician wannabes here familiar with graph theory?
« Last Edit: July 08, 2006, 08:52:12 PM by geodome »
"Say you're in a [chemical] plant and there's a snake on the floor. What are you going to do? Call a consultant? Get a meeting together to talk about which color is the snake? Employees should do one thing: walk over there and you step on the friggin' snake." - Jean-Pierre Garnier, CEO of GlaxoSmithKline, June 2006

Offline Donaldson Tan

Re: Building an open source thermophysical and chemical database program
« Reply #1 on: July 08, 2006, 09:27:14 PM »
According to IUPAC, InChI only works on neutral and ionic organic structures at the moment.

What about the SMILES protocol? Are there any current limitations on it?

Offline Borek

  • Mr. pH
  • Administrator
  • Deity Member
  • *
  • Posts: 27652
  • Mole Snacks: +1800/-410
  • Gender: Male
  • I am known to be occasionally wrong.
    • Chembuddy
Re: Building an open source thermophysical and chemical database program
« Reply #2 on: July 09, 2006, 03:37:08 AM »
InChI - any hashing function will do: CRC32, MD5 or something similar. Note there is also a short version of InChI, slightly coded. Besides, you don't have to compare every entry with every other one - InChI strings are comparable, so you can store them in some kind of balanced binary search tree with logarithmic access time (sorry, no idea about the proper English nomenclature here). Check out

http://en.wikipedia.org/wiki/AVL_tree
http://en.wikipedia.org/wiki/Balanced_tree
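
A minimal Java sketch of the sorted-tree lookup described above. It assumes Java's standard java.util.TreeMap (a red-black tree, which is one kind of balanced binary search tree) is an acceptable stand-in for an AVL tree. The class name InchiCatalog, the integer record ids, and the example InChI strings are hypothetical and only there for illustration.

```java
import java.util.TreeMap;

// Hypothetical catalog: maps an InChI string to a record id.
final class InchiCatalog {
    // TreeMap keeps keys sorted in a red-black tree, so put and get
    // need O(log n) string comparisons instead of a linear scan.
    private final TreeMap<String, Integer> index = new TreeMap<>();

    void add(String inchi, int recordId) {
        index.put(inchi, recordId);
    }

    // Returns the record id, or null if the substance is not catalogued.
    Integer find(String inchi) {
        return index.get(inchi);
    }

    public static void main(String[] args) {
        InchiCatalog catalog = new InchiCatalog();
        catalog.add("InChI=1/CH4/h1H4", 1); // methane (illustrative string)
        catalog.add("InChI=1/H2O/h1H2", 2); // water (illustrative string)
        System.out.println(catalog.find("InChI=1/H2O/h1H2"));
    }
}
```

With 200,000 entries a balanced tree needs on the order of 18 comparisons per lookup, since log2(200000) is about 17.6.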

SMILES - they can be used to store structures, but they are not directly comparable, as there is no single canonical version (search the forums for canonical SMILES). So CCC(=O)C and CC(=O)CC both describe the same molecule, which you can't tell by simply comparing them. InChI doesn't have this problem.
« Last Edit: July 09, 2006, 03:42:53 AM by Borek »
ChemBuddy chemical calculators - stoichiometry, pH, concentration, buffer preparation, titrations.info

Offline Donaldson Tan

Re: Building an open source thermophysical and chemical database program
« Reply #3 on: July 09, 2006, 07:17:07 AM »
Borek, are you able to access the IUPAC site to download the official InChI converter?

The username/password (iudown/bun53n) does not work.

The reason I want to use InChI is the canonicalization problem associated with SMILES.
« Last Edit: July 09, 2006, 07:22:40 AM by geodome »

Offline Borek

Re: Building an open source thermophysical and chemical database program
« Reply #4 on: July 09, 2006, 07:26:43 AM »
Sorry, no time to try. I am using ChemSketch from acdlabs.com for InChI generation.

Offline Donaldson Tan

Re: Building an open source thermophysical and chemical database program
« Reply #5 on: July 09, 2006, 07:50:04 AM »
LOL. I thought everybody was using the official IUPAC-endorsed InChI converter.

Cool. ChemSketch is freeware (http://www.acdlabs.com/download/chemsk.html)

I will try to access the IUPAC site again, because I want to port the InChI C++ source code for SMILES-to-InChI conversion to Java. Btw, do you know how to implement an AVL tree or a B-tree?

I am trying my best to interpret the pseudo-code by the authors of the DSW (Day-Stout-Warren) tree-balancing algorithm (http://www.eecs.umich.edu/~qstout/pap/CACM86.pdf). I don't really understand it.
« Last Edit: July 09, 2006, 07:58:29 AM by geodome »

Offline Borek

Re: Building an open source thermophysical and chemical database program
« Reply #6 on: July 09, 2006, 09:09:05 AM »
Don't bother with your own implementation; look for the STL (Standard Template Library), MFC collections, or something similar.

Offline Donaldson Tan

Re: Building an open source thermophysical and chemical database program
« Reply #7 on: July 11, 2006, 02:51:24 AM »
Basically, the substances will be ordered by the MD5 checksum of their SMILES code. Since an MD5 checksum is always 128 bits, it can be represented as a 32-digit hexadecimal integer. This gives an upper limit of 2^128, about 3.4E38, on the number of substances that can be stored in the database.

Java has its own internal MD5 implementation, so that saves a lot of trouble.
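
A minimal sketch of that step, assuming the built-in implementation meant here is java.security.MessageDigest. The class name Md5Key and its method names are hypothetical. Note that the hash is computed over the characters of the string, so two different SMILES spellings of the same molecule give different keys unless the string is made canonical first.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper that turns an identifier string into an MD5 key.
final class Md5Key {
    // Returns the 128-bit MD5 digest of the text as 32 lowercase hex digits.
    static String md5Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(32);
            for (byte b : digest) {
                // Mask to 0..255 so negative bytes print as two hex digits
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Thrown only if the runtime has no MD5 provider
            throw new IllegalStateException(e);
        }
    }

    // Number of distinct 128-bit values: 2^128, roughly 3.4E38.
    static BigInteger keySpaceSize() {
        return BigInteger.ONE.shiftLeft(128);
    }

    public static void main(String[] args) {
        System.out.println(md5Hex("CC(=O)CC"));
        System.out.println(keySpaceSize());
    }
}
```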

Things to implement:
1. OpenBabel (http://openbabel.sourceforge.net)
2. Correlation Calculator
3. Interpolation Calculator
4. Mixing function

What other useful features should cheminformatics software include?

Offline Borek

Re: Building an open source thermophysical and chemical database program
« Reply #8 on: July 11, 2006, 03:21:37 AM »
Generally speaking, hashing functions never work perfectly, so you must be prepared to handle the situation where two different substances have identical MD5 values. With MD5 that is extremely rare, but for obvious reasons you can still use MD5 to enumerate "only" 2^128 substances.

If you use InChI strings as indexes you don't have that problem, as they have been designed with uniqueness in mind.

But InChI is not a perfect solution either if you don't have full information about the molecule available.
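
One way to be prepared for identical hash values, sketched in Java under stated assumptions: each hash bucket keeps the full identifier strings, so a collision is resolved by comparing the strings themselves. The class HashedIndex is hypothetical, and the hash function is passed in so that the demo can use a deliberately weak one (string length) to force a collision; a real index would pass an MD5 function instead.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical index keyed by a hash of the identifier string.
final class HashedIndex {
    private final Function<String, String> hash;
    // hash value -> all identifiers that share that hash value
    private final Map<String, List<String>> buckets = new HashMap<>();

    HashedIndex(Function<String, String> hash) {
        this.hash = hash;
    }

    // Adds the identifier; returns false if it was already present.
    boolean add(String id) {
        List<String> bucket =
            buckets.computeIfAbsent(hash.apply(id), k -> new ArrayList<>());
        if (bucket.contains(id)) {
            return false;
        }
        bucket.add(id);
        return true;
    }

    // A hash match alone is not enough; the full string must match too.
    boolean contains(String id) {
        List<String> bucket = buckets.get(hash.apply(id));
        return bucket != null && bucket.contains(id);
    }

    public static void main(String[] args) {
        // Weak stand-in hash: all three strings below have the same length
        HashedIndex index = new HashedIndex(s -> Integer.toString(s.length()));
        index.add("InChI=1/CH4/h1H4");
        index.add("InChI=1/H2O/h1H2");
        System.out.println(index.contains("InChI=1/H2O/h1H2")); // was added
        System.out.println(index.contains("InChI=1/H3N/h1H3")); // never added
    }
}
```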

Offline Donaldson Tan

Re: Building an open source thermophysical and chemical database program
« Reply #9 on: July 11, 2006, 10:21:12 AM »
How is InChI not a perfect solution if I do not have full information of the substance?

I am only using InChI as the basis for indexing different substances.

The only problem here is that InChI is currently limited to organic substances.

I wonder how InChI describes inorganic compounds.

I merely want to make a program that makes up for how hard it is to get essential thermophysical and chemical data. Correlations are very convenient functions, but they are often not available. It would be useful for people like me who want to model a chemical reactor.

Offline Borek

Re: Building an open source thermophysical and chemical database program
« Reply #10 on: July 11, 2006, 10:33:20 AM »
Quote
How is InChI not a perfect solution if I do not have full information of the substance?

Each layer contains additional information - for example if there is no conformation information, several substances will be described by the same InChI string with one layer only.

Quote
I wonder how InChI describes inorganic compounds.

No idea about the details, but ChemSketch generates InChI for salts (like InChI=1/ClH.Na/h1H;/q;+1/p-1 for NaCl). If you find something more precise, please post.
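
To make the layer structure concrete, here is a small Java sketch that splits the NaCl string above on the "/" separator. It assumes only that layers are separated by "/" and that the first field is the version prefix. The class InchiLayers is hypothetical, and reading the single-letter prefixes as h for hydrogens, q for charge and p for protons is my understanding of the format, not something confirmed in this thread.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that lists the layers of an InChI string.
final class InchiLayers {
    // Splits on '/' and drops the leading version field ("InChI=1").
    static List<String> layers(String inchi) {
        String[] parts = inchi.split("/");
        List<String> result = new ArrayList<>();
        for (int i = 1; i < parts.length; i++) {
            result.add(parts[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        // NaCl string as generated by ChemSketch, quoted in the post above
        for (String layer : layers("InChI=1/ClH.Na/h1H;/q;+1/p-1")) {
            System.out.println(layer);
        }
    }
}
```

Two strings that agree on the first layer (the formula) but differ in later layers are different substances, which is why an entry stored with fewer layers can match several compounds.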
