Wednesday, August 8, 2012

Fun and a little disappointment with amino acids

So you have Open Babel, you have the Python bindings to Open Babel and you think: "I just want to have fun!".

I wrote a small script to generate the 3D structures of the amino acids because I wan't to investigate how well the MMFF94 force field can calculate the charge of them. I've done this before for entire proteins, but here I just want to do it one amino acid at a time. To challenge myself, I cannot rely on the excellent (o)babel tool of the Open Babel package when generating the structures. All code is on github, but have shared it on the blog too via select gists.

My SMILES for the amino acids are saved in a dictionary:

and to simply generate the structures I've updated my obutil library with the following utility functions

Finally, to generate the structures I simply run the following code

I have a separate script to calculate the charge of a molecule

The table below is the result of the above. The images are generated using the babel tool with the -opng option.


Amino AcidSMILESChargeDepiction
GLYNCC(=O)O0
ALANC(C)C(=O)O0
SERNC(CO)C(=O)O0
THRNC(C(C)O)C(=O)O0
CYSNC(CS)C(=O)O0
VALNC(C(C)C)C(=O)O0
LEUNC(CC(C)C)C(=O)O0
ILENC(C(C)CC)C(=O)O0
METNC(CCSC)C(=O)O0
PRON1C(C(=O)O)CCC10

PHENC(Cc1ccccc1)C(=O)O0
TYRNC(Cc1cc(O)ccc1)C(=O)O0
TRPNC(Cc1c2ccccc2nc1)C(=O)O0
ASPNC(CC(=O)O)C(=O)O0
GLUNC(CCC(=O)O)C(=O)O0
ASNNC(CC(=O)N)C(=O)O0
GLNNC(CCC(=O)N)C(=O)O0
HISNC(C(=O)O)Cc1cncn10
LYSNC(CCCCN)C(=O)O0
ARGNC(CCCNC(=N)N)C(=O)O0
In this case, all charges are zero on all amino acids. Notice that I did not make use of Open Babel trying to assign Hydrogen for a specific pH as that is not what I want to test. Instead, notice in the aminoacids.py file listed above there are specific patterns for obtaining (de)protonated states of a few select amino acids: ASP, GLU, HIS, LYS and ARG.

Running the (de)protonated molecules above gives the following table

Amino AcidSMILESChargeDepiction
ASP-NC(CC(=O)[O-])C(=O)O-1
GLU-NC(CCC(=O)[O-])C(=O)O-1
HIS+NC(C(=O)O)Cc1c[nH+]cn1+1
LYS+NC(CCCC[NH3+])C(=O)O+1
ARG+NC(CCCNC(=[NH2+])N)C(=O)O0

The disappointing thing here is that the protonated Arginine is incorrectly assigned a charge of 0 (zero). Edit: As noted by Jan, the depiction of the protonated ARG, ASP and GLU is also wrong.

Edit2: It appears the problem is also present in the SVG depicter.

Maybe it's time to submit a bug-report?

Thursday, June 28, 2012

A Threaded PyQt GUI and Particles

I wrote a post about adding a GUI to your command line interface program. Sometimes people just want to click stuff. If you tried it / implemented it yourself, you'd have seen that the program would have gone from responsive to non-responsive when you hit the run button. Why? Because the program can only do ONE think at a time and frankly, keeping the controls available for clicking at all times, updating the graphics and calculating particle movement do not get along nicely. This post is about making it behave like we expect.

I present to you threading. Specifically for PyQt there is the QThread class which is actually discussed rather nicely in this PyQtWiki and this page by Jo Plaete which I used for inspiration.

What complicates the whole scene is that we need to realize that we have two independant things we want to accomplish: 1) move particles and 2) update the graphics. Both independent of the GUI.

I started by making my own class called BaseThread which subclasses QThread. The code is

where I believe the magic for everything not just crashing and burning is the subtle use of the custom boolean flag exiting (which we will use later) and the use of self.wait() in the __del__ method. Again, for details you should read the blog posts mentioned above.

To come around many of the syncronization problems that arises with threads the Qt framework allows us to use custom signals to make everything talk together. Here is my version of a StepThread which defines the run method (you do NEVER call this explicitly!) and a convenience function called simulate to start a simulation for N steps. The run function emits a custom signal and sleeps for a little while before simulating again.

Very similarly we have the DrawThread which emits a signal so we can update the matplotlib surface.

Notice that it updates less frequently. I have yet to figure out something clever in this regard.

Finally, I had to make some tweaks to the Simulator class (nothing very fancy) to hook up our custom signals.

What I need now is a video of it and maybe to implement some different potentials instead of only the non-interacting particles. Stay tuned.

The entire code can be downloaded from the latest gist I made.

Sunday, June 24, 2012

Adding a PyQt GUI to your program

I am all in for command line interface (CLI) programs. I'll admit that. They work in terminals. You can run them without an X Server and usually option flags are easy to use and obvious*. But sometimes I'd like to just hit a button. Click and magic happens. No more 200 character command line statement to get the ball rolling. A good example of this is from my colleague Anders Christensen who made a graphical user interface (GUI) for the Phaistos program.

In this post, we'll take the first small steps towards actually making a GUI for a simulation to run the ideal gas in and not just watch some numbers printed in a terminal. I'll be using Qt4 and the PyQt4 bindings. If you have a fairly recent linux distribution, you can apt-get install yourself to success very fast. Remember to also install the Qt-Designer app so you visually can layout your controls.

We'll start out simple by constructing buttons, textedits and a frame. I use a regular QWidget for my main form. The end goal for a user is to push the 1) setup button to initialize particles and 2) push the run button and make the simulation run. The simulation (i.e. the particles) should be displayed in the frame via matplotlib. It looks like this



If you want the resulting simulator.ui code its listed as part the gist for this blog post.

To move on from here, there is a nifty little tool called pyuic4 which will convert your .ui file into a python class with the name Simulator_UI. I always just parse it down to a file called simulator_ui.py. The code is

$ pyuic4 simulator.ui > simulator_ui.py

The simulator_ui class contains the framework for the QWidget form we will be using. The most clever approach I could think of was to subclass the simulator_ui class to separate the "setting up the form" code with the "what happens when I push a button" code. The QWidget I'll use to represent the window the user will see and I call it Simulator. It imports the Simulator_UI class, subclasses it and sets up the GUI. The two buttons are hooked up to either initialize new particles or run for some steps


The bread and butter of this application is the ParticleCanvas. Its tailored to draw and move the particles around. Look at its code


You see that it subclasses the Canvas class which is my best bet on how one should make the code for fast blitting of a matplotlib canvas. A class which subclasses the Canvas class is only responsible for redrawing the actual plot (and do it fast). However,  I would gladly appreciate comments of you have a better way to implement it


Finally, the code that makes it all run is just my main executable which is defined like this


The final result looks like this

We now have a finished simulation GUI. However, the main problem is that the GUI and the simulation code is intimately hooked up so the GUI will freeze at some point. Fortunately, it leaves a new post on the horizon.

*I never understood anything of the find command, however.

Thursday, May 3, 2012

Programming the Ideal Gas

You've heard about it before. Many times. It's a model system used a lot in chemistry. The Ideal Gas law: A system where the particles can be treated as non-interacting. How do you program such a system? What do you need to describe it? Given 1) the basic introduction covered by Prof. Jan Jensen in another lecture, and 2) the following assignment (notice it is part 2), here is what the students are supposed to do once this exercise is complete:

  1. Take a box with dimensions $(-1<x<1$ and $-1<y<1)$.
  2. Make npart particles placed randomly in the right side of a box ($0<x<1$) but evenly from the top through the bottom
  3. Give the npart particles random velocities in both directions and make it so they can move in both the positive and negative direction.
  4. Move the particles around via the equation: $X_i^{n+1} = X_i^{n}+dt*dX_i^{n}$
  5. Keep them inside the box at all times.
  6. Plot the result.
The students start out with 1, 2 and 6 presented to them, both for inspiration and because they contain some subtle hints. The code is

One of the tricky parts is knowing "what is the i'th particle" because we need this information to update the system according to the equation above in point 4. Once you realize that you can access this information in the X, Y, dX and dY arrays as X[i] under a loop, then there is really no problem with that point either, for instance, to update all particles according to the equation in 4 in one go, we could write

Assigning a loop to iterate over the particles is not really troublesome, but understanding why is hard - even to explain it to yourself. A simple for-loop might not be the smartest thing to do, but I think it is the easiest one to go about and more importantly I often feel that many things can get lost in the whole numpy magic, especially for newcomers.

The next part is, that we should keep the particles inside the box at all times. The exercise has a little discussion about which version is the better, but in my opinion the particles should always be inside at all times. To keep them in, we ask whether a step (equation in point 4) in X and Y will result in the particle being outside the box and if that is the case, we reverse velocity that potentially brought the particle outside. We could use the following code to do that

Inserting the above two snippets of code, under a loop over the number of steps we want to take, we finally end up with the code that simulates non-interacting particles in a container.

Try and code it for yourself or use it in a course you teach. The students do find it interesting, both in the form of "what do I need to tell the computer to make it do stuff?" and "oh - the particles do move!".

Disclaimer: This code is part of a course I help teach at the University of Copenhagen - just imagine you never saw python before.

Friday, April 20, 2012

The Wonders of Numpy

I originally envisioned this post to be somewhat of a wall of code which would show just how ugly you can actually write code that works (see the public gist for that) but without the checks to see that it does what I want it to do. I've been parsing text lately - a weird and wonderful combination of taking the molecular orbital (MO) coefficients of a calculation and transforming them into a new basis set to calculate spin-spin coupling constants. This involves reading the input file to get the basis set exponents of the uncontracted basis set and matching them with the corresponding MO coefficients and generate a new segmented contracted basis set.

In the table below we have a segmented basis set where the primitive gaussian type orbitals (PGTO) are combined into contracted gaussian type orbitals (CGTO). In this case, we have a basis set with three degrees of freedom (three CGTOs) down from five degrees of freedom (five PGTOs). Dark fields in the table shows coefficients whose value are larger than zero. Bright fields are zero.

CGTO1 CGTO2 CGTO3
PGTO1
PGTO2
PGTO3
PGTO4
PGTO5

So the PGTOs are obtained from the .mol file (I'm using Dalton for this) using the read_mol.py script found in the gist since it is used anyways to generate the new basis set. The CGTOs are obtained from an RHF calculation (see the hf.dal file in the gist) using the read_log.py script.

The problem with the design of a new basis set is to choose how many of the PGTOs to use in a CGTO, for instance, why choose three PGTOs in CGTO1? Since the number of PGTOs is not too overwhelming, I've gone ahead and made a general solution so I have the possibility to test all cases. The main problem then boiled down to: given an N x N matrix with zeros (notice that there are many zeros in this matrix), how do I overwrite subarrays (3 x 1 in this case) in Numpy so I can insert a specific number of coefficients to make CGTO1 from PGTO1 to PGTO3? and how to I make CGTO2 and CGTO3 couple with PGTO4 and PGTO5 using identity matrix inserted at the correct spot?

The trick is to make use of Numpys excellent (and very difficult) features known as boolean masking where a matrix filled with True or False will extract the corresponding values and one is able to assign to them. In the end, we need to remove excessive columns and I've found my head can do that easily by doing a transpose of the matrix, remove the unneeded rows (using a simpler Numpy/python logic which I know by hand) and transpose it back.

This was one of the more difficult issues, but at least I can now generate a multitude of basis set calculations and examining my property of interest in a matter of seconds. Numpy rocks, but I think I could grow old trying to learn all the awesome features it has.

Friday, February 3, 2012

Great resources are often hidden (and forgotten)

You know the problem. You found a resource that you thought was awesome, but where did you find it? how did you find it? and most importantly - how do you find it again after you closed the browser window?

I think we're a product of Google. Open the browser, start typing and find the answer you are looking for. Heck, it does not even matter if the result you find 5 minutes later is slightly different from the one you found earlier, cause you just type, find, use and move on. This is great and works in many cases, but ...

... sometimes you find resources that a very valuable, but you cannot remember the steps you actually took to get there. This is where bookmarks come in, but I'll be the first to admit that I bookmark 50% of what I should and half of what I bookmark I intend to read later, but who has the time?

I was lucky enough to rediscover an old friend recently whilst fighting with the Open Babel API: This snippet of test code for the python bindings - it is really valuable when trying to figure out what you can and cannot with the API. The file is stored as a plain file on a server who could be gone tomorrow - where could I possibly store it? I see no good solution to this. Since there is no license to redistribute, I ventured to github and created a private gist* - now it is stored "forever" and at least I know where to look. On the plus side, this is clever because it is even version controlled and if I manage to grab something I am sure I can publicly display (like Noel'o'blawg does), I sure will. On the down side, this is just another tool in the box of tools.

I think it will increase forever ...

* From the github homepage:
Gist is a simple way to share snippets and pastes with others. All gists are git repositories, so they are automatically versioned,forkable and usable as a git repository.

Wednesday, January 18, 2012

Adding Hydrogen Atoms Is No Mean Feat

I've had a problem since Monday. Hydrogen atoms. You take an MD from say GROMACS or Tinker and make a 16 Angstrom extract of the entire system (this is Chorismate Mutase, PDB: 2CHT) in pymol via the command

cmd.select("sele", "(byres (sele expand 16))",enable=1)

so you can run QM/MM on it afterwards (see picture of the actual cut-out or look at my github page of the pdb-file with missing hydrogens).


However, doing the extract (which cuts the peptide-bond) improper chemistry appears around this bond since atoms are missing - on one side (where you remove atoms) it does not matter since you remove it all together, but on the side that remains, this is a big problem since the chemical valence is not correct. I want to add hydrogens where atoms are missing, but no standard solution was apparently available to add hydrogens to the bonds that were cut. Trying in pymol added hydrogens all over the place and the usual Open Babel tools also made funny business with the termini in its clever algorithms for proteins. In the end, it was time to revisit the good old trusty Open Babel and scour the API to actually see if something was usable. Again, I struck gold and here is what I did to actually make it work.


Chemically speaking, the local bonding environment is not satisfied at the points where cleavage took place. Open Babel has functionality to get the assumed valence of an atom as well as the actual valence (based on connectivity) - this we will use to our advantage and make sure it only applies to backbone carbons (not $C_{\alpha}$) and not to Oxygen.

For convenience I'm looping over residues and then atoms in residues and then seeing if the assumed and actual valencies are matching. This is not nescessary as I do not use any information about the residues (yet). If the valencies are not matching I add hydrogen(s) to the atom in question to satisfy the valency.

The following code does this

import sys
import openbabel
from obutil import *

if __name__ == "__main__":
  if( len(sys.argv) != 3 ):
    print "Usage: protonate in.pdb out.pdb"
    sys.exit()
  file_in = sys.argv[1]
  file_out = sys.argv[2]

  mol = OBMolFromFilename(file_in)
  for residue in openbabel.OBResidueIter(mol):
    rname = residue.GetName()
    chain = residue.GetChain()
    for atom in openbabel.OBResidueAtomIter(residue):
      if atom.IsCarbon() or atom.IsNitrogen():
        idx = atom.GetIdx()
        imval = atom.GetImplicitValence()
        reval = atom.GetValence()
        if imval != reval:
          print "%5i (%3s %s) %2i %2i" % (idx,rname,chain,imval,reval)
          mol.AddHydrogens(atom)

  OBMolToFilename(mol, file_out)

Notice it uses a new obutil.py module I've made with some convenience functions to easily access features of Open Babel. By using the code on the supplied model1.pdb file, one ends up with

    1 (VAL A)  3  2
    3 (VAL A)  3  2
   17 (ASP A)  3  2
  239 (LEU A)  3  2
  256 (PRO A)  3  2
  258 (PRO A)  3  2
  270 (VAL A)  3  2
  397 (THR A)  3  2
  409 (LEU A)  3  2
  808 (GLU A)  3  2
  821 (CYS A)  3  2
  910 (MET A)  3  2
  925 (GLY C)  3  2
 1008 (THR C)  3  2
 1020 (LEU C)  3  2
 1022 (LEU C)  3  2
 1039 (ILE C)  3  2
 1041 (ILE C)  3  2
 1058 (LEU C)  3  2
 1159 (LEU C)  3  2
 1176 (MET C)  3  2
 1450 (MET C)  3  2
 1465 (GLN C)  3  2
 1467 (GLN C)  3  2
 1482 (ILE C)  3  2

and there you have it: A protonated cutout from an MD simulation which does not have reprotonated atoms.

NB! There is one issue with this approach which concerns the actual output to PDB. There is no reordering of the atoms when you save the file, so if you depend on the order of the atoms, then this approach needs some more work. A colleague (and I) have yet to figure out a clever way to actually do this.