Wednesday, December 7, 2011

Obtaining Contact Numbers using SMARTS

Because I'm on a roll with blogging right now and I am looking for excuses to not write the next paper, here is a post about how to utilize SMARTS[1] for something cheminformatically-vant such as the contact number[2]. We will base the new code on what I did in the last post I wrote. The contact number is basically "Select an $\alpha$-Carbon and count the number $\alpha$-Carbons within a sphere of a certain radius $R$ from it". This give rise to a 1D measurement (see below) of how buried residues are. If you massage your data enough (we will get to this in a later post) it can be used to validate whether a protein is folded (somewhat) correctly or not. There is much more to this story than I have room for on my blog, but I do recommend you follow the blog of Anders Christensen, a fellow student who folds more proteins than I.

Our work today shall start from Crambine (PDB: 1CRN) which I've prepared for your viewing pleasure in pymol

From last week, the boiler-plate code we shall start from is

import openbabel
filename = "1CRN.pdb"
pattern = "[$(CC(=O)[N,O])]"
obmol = openbabel.OBMol()
obpat = openbabel.OBSmartsPattern()
obconv = openbabel.OBConversion()
obconv.ReadFile(obmol, filename)
alpha_carbons = [m[0] for m in obpat.GetUMapList()]

I'm interested in $\alpha$-Carbons so the atomic primitive pattern we use to match with will be


The last part, [N,O] is a match of either a Nitrogen or an Oxygen. This Oxygen match is needed for the C-terminal. Because we are using SMARTS, this is the only thing that will change in our code once it is done. Should we be interested in locating the carboxyl groups of the backbone for instance, simply change one pattern. Of course, if you want to match $\alpha$-Carbons to carboxyl groups, this requires an extra pattern match.

Running this code will return a list of atoms. To calculate the contact number, the following function iterates over each match, forms a pair with all other matches and calculates the contact number for one residue at a time

def ContactNumbersForCutoff(matches, rcut):
  contact_numbers = []
  for i in matches:
    contact_number = 0 
    ai = obmol.GetAtom(i)
    for j in matches:
      if i == j: continue
      aj = obmol.GetAtom(j)
      if ai.GetDistance(aj) < rcut:
        contact_number +=1

  return contact_numbers

which we will invoke to generate the list of contact numbers

cn12 = ContactNumbersForCutoff(alpha_carbons, 12.0)

Since I am fond of the matplotlib[3] library for python when dealing with plots and I don't feel like doing the hard work myself anyways, I'll just plot[4] the list of numbers using the following code


which (after some additional touches - see the github page for the final code) results in

Here we see that residues 20, 38, 39 and 40 are clearly not buried when compared to residues 30 to 36. In a later posts, I will add a few lines of code to calculate the above information in a slightly more useful way. I'd also like to explore a quick implementation of the Half-Sphere Exposure[5] method instead as it provides even more clue to the local environment.

check out the source code on my github page.

Wednesday, November 30, 2011

Cut'n'Frag: Fragment smart using SMARTS

In our research group, part of the research we do is method development. To facilitate these new methods, tools are often required to setup various calculations or treat output files. Some of this research[1] is in fragment based methods[2] where a large system is divided into several smaller pieces and the total property of the whole system can be assembled from the individual fragments.

To help facilitate easier setup of fragment based calculations, we've made a tool (still under development) called FragIt which we use to fragment large molecules and prepare input files for GAMESS[3]. This post is about one key aspect of FragIt: How to use Open Babel[4] to figure out where to cut bonds in a protein.

For starters, one needs to load a molecule from a file, this is accomplished by the following lines of code[*]

import openbabel
obmol = openbabel.OBMol()
obconv = openbabel.OBConversion()
obconv.ReadFile(obmol, filename)

which basically tells OpenBabel that we like to open a file with a fileformat specified by the extension of the filename("pdb").

Open Babel includes functionality to do SMARTS[5] pattern matching[6]. Basically, you write a pattern to search for a substructure in a molecule. The way we want to use it is to identify an atom A on one side of a bond, and an atom B on the other side of that bond. Thus, using the following SMARTS pattern on a protein


we can search for A: 'find carbon connected to nitrogen on one side' and B: 'find carbon connected to an oxygen on the other side'. Results are only returned if the above pattern mathes on both sides of a bond at the same time. The code to do this is

matches = [m for m in obpat.GetUMapList()]

where the last line converts some horrible <openbabel.vectorvInt; proxy ... > SWIG code into a list of tuples with matching atom ID's, i.e.

print matches


[(2, 3), (11, 12), (32, 33), (44, 45), (58, 59), (73, 74), (87, 88), (94, 95), (108, 109), (132, 133)]

Each tuple has information about the found atoms as (A, B).

The real beauty of this is of course that you can make your own patterns and match whatever you want in whichever molecule you chose to play with.

You can find this example as well as others on my github page.

[*] We opted for the full fledged API of OpenBabel, thus it takes 5 lines of code to actually open a file. The more pythonic library pybel[7] which is also included in OpenBabel was not used.

update: Thanks to Anders Christensen for pointing me towards an easy syntax highlighter.

Monday, October 31, 2011

Using OpenBabel to Make Input Files for GAMESS

One thing that people tend to hate to do, is to make repetitive tasks. One such thing could be to prepare 100 different geometry optimizations using DFT[1] and a happy mixture of basis sets[2] and functionals[3] for GAMESS[4]. In this blog post, I'll show you how you can get around manual labor and prepare 10 files with just a little bit of coding. Extending it to a 100 is an easy task after that. Lets start.

An input geometry of ethanol in the very common XYZ file format might look something very similar to this, when made in Avogadro[5]

C  0.00004  0.00001  0.00002
C  0.00035 -0.00106  1.52091
H  0.02794  1.04324 -0.38022
H -0.92039 -0.49540 -0.37524
H  0.88619 -0.55034 -0.38151
H -0.89666  0.55109  1.87908
O  1.16332  0.62487  1.98684
H -0.03205 -1.05436  1.87773
H  1.11207  0.59582  2.97727

Using Open Babel[6] we can quickly make a GAMESS input file using the following command

babel -ixyz -ogamin ethanol.inp

resulting in

C 6.0  0.0000400000  0.0000100000  0.0000200000 
C 6.0  0.0003500000 -0.0010600000  1.5209100000 
H 1.0  0.0279400000  1.0432400000 -0.3802200000 
H 1.0 -0.9203900000 -0.4954000000 -0.3752400000 
H 1.0  0.8861900000 -0.5503400000 -0.3815100000 
H 1.0 -0.8966600000  0.5510900000  1.8790800000 
O 8.0  1.1633200000  0.6248700000  1.9868400000 
H 1.0 -0.0320500000 -1.0543600000  1.8777300000 
H 1.0  1.1120700000  0.5958200000  2.9772700000 

but we can't run this file right away since it is obviously lacking information about the basis set, the DFT-functional and even the runtype (GAMESS assumes runtyp=energy if nothing is specified).

Ideally what we want is to generate a header for the input file and then insert it into the input file, but this header should be generated automatically for us.

Given a set of parameters (functional and basis set), the following python script ( prints a correct header for GAMESS to the screen

import sys

basissets = {'sto3g' :'STO NGAUSS=3',
             'pc0' :'PC0',
             'pc1' :'PC1',
             'apc0' :'APC0',
             'apc1' :'APC1'}

functionals = {'blyp' :'BLYP',
               'b3lyp': 'B3LYP'}

basis = basissets[sys.argv[1]]
functional = functionals[sys.argv[2]]
print " $SYSTEM MWORDS=125 $END"
print " $CONTRL RUNTYP=OPTIMIZE DFTTYP=%s ISPHER=1 $END" % (functional)
print " $BASIS GBASIS=%s $END" % (basis)

Notice how we use a dictionary to store the basis set information so it can be easily recovered and printed correctly. The GAMESS basis group can be quite hideous with Pople basis sets[7].

All we require now is a small bash script, which makes loops over the various basis sets and functionals, invoke the python script we just made with the correct arguments and dump that printout information to a file (header.inp) and then use some Open Babel magic (to add a custom header with keywords, use the very secret -xf filename flag when converting to GAMESS input files) to make it work. The following script does exactly that

#!/usr/bin/env bash
for basis in sto3g pc0 pc1 apc0 apc1
  for functional in blyp b3lyp
    python $basis $functional > header.inp
    babel -xf header.inp -ixyz -ogamin $filename

For information on how to use GAMESS, I recommend Professor Jan Jensens blog ( as a great resource and for general usage notes of GAMESS and the Google user group ( is very helpful if you have any problems or questions related to running GAMESS.

NB! To discover what options the babel file conversion tool supports for your file format (gamin in this case), simply invoke the following command

babel -H gamin

Happy Conversion!

Thursday, September 1, 2011

First Post!


I've tried it numerous times before. This time, I'll try it again.

This blog will contain blog posts on using python in chemistry related fields such as:
I'll also divulge into areas related to the above

Hopefully, there will be more posts soon ...