2. File Formats

Note

This is probably the most boring part of the course.

However, understanding how chemical structures are encoded in different formats is very useful: different software packages often have different ‘default’ file formats, and it helps to know what each format can (and cannot) store.

Many different ways to store molecular structures exist and all have varying strengths and weaknesses.

A general introduction to file formats in computational chemistry can be found here

https://en.wikipedia.org/wiki/Chemical_file_format

For the purposes of this protein-ligand docking course, a brief description of the most relevant formats is given below. To begin with we will stick to a very simple example molecule, \(\text{H}_2 \text{O}_2\), so that we can see how each file format encodes the same structure.

Each of the formats we will look at has a standard file extension that modelling software recognises, so it is wise to stick to these extensions in your work. For example, you might save the structure of benzene in several different formats and these could be called ‘benzene.xyz’, ‘benzene.mol2’, etc.

Fortunately, there are software packages that will write these files (and convert between them) for you but it is a good idea to understand what is going on inside the files.

2.1. XYZ format

Perhaps the simplest format, the Cartesian/XYZ formatted \(\text{H}_2 \text{O}_2\) molecule is shown below

4
hydrogen peroxide
O  0.0000  0.7375 -0.0528
O  0.0000 -0.7375 -0.0528
H  0.8190  0.8170  0.4220
H -0.8190 -0.8170  0.4220

The first line contains the number of atoms in the system. More specifically, it tells any software reading this file how many coordinate lines to read. If you changed this to 2, the software would read only the first two coordinate lines and would think that this file contained \(\text{O}_2\).

The second line is a comment line that you can use to store information about the contents of the file. It can be left blank but it must be there. The following file would not work:

4
O  0.0000  0.7375 -0.0528
O  0.0000 -0.7375 -0.0528
H  0.8190  0.8170  0.4220
H -0.8190 -0.8170  0.4220

as the software would treat the first coordinate line as the comment and would then find only three atoms, even though the first line states that there are four.

The remaining lines start with the element symbol and the next three columns contain the X, Y and Z coordinates of that atom.
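The layout described above can be read with a few lines of Python. This is only a minimal sketch, assuming whitespace-separated columns; the function name `parse_xyz` is made up for illustration and is not part of any package used in this course.

```python
def parse_xyz(text):
    """Parse XYZ-formatted text into (comment, atoms).

    atoms is a list of (symbol, x, y, z) tuples with coordinates as floats.
    """
    lines = text.splitlines()
    n_atoms = int(lines[0].split()[0])   # line 1: number of coordinate lines
    comment = lines[1]                   # line 2: comment (may be blank)
    atoms = []
    for line in lines[2:2 + n_atoms]:    # read exactly n_atoms lines
        symbol, x, y, z = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))
    if len(atoms) != n_atoms:
        raise ValueError(f"expected {n_atoms} atoms, found {len(atoms)}")
    return comment, atoms


H2O2_XYZ = """4
hydrogen peroxide
O  0.0000  0.7375 -0.0528
O  0.0000 -0.7375 -0.0528
H  0.8190  0.8170  0.4220
H -0.8190 -0.8170  0.4220
"""

comment, atoms = parse_xyz(H2O2_XYZ)
print(comment, len(atoms))
```

If the comment line is deleted from `H2O2_XYZ`, this sketch raises a `ValueError` because only three coordinate lines remain after the first two lines have been consumed.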

This format has the advantage of simplicity and you can change the order of the coordinate lines to suit yourself. However, it is not possible to store other information about the molecule that might be important such as atomic charges or how the atoms are connected.

This last point is particularly problematic for complicated molecules where software may ‘guess’ the bonding connections incorrectly leading to wrong results.
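To make the idea of ‘guessing’ concrete, here is a rough sketch of one common approach: two atoms are treated as bonded when they are closer than the sum of their covalent radii plus a tolerance. The radii and the tolerance below are approximate, assumed values chosen for illustration, and real packages use more elaborate rules.

```python
import math
from itertools import combinations

# Approximate covalent radii in Angstroms (assumed illustrative values).
COVALENT_RADII = {"H": 0.31, "O": 0.66}
TOLERANCE = 0.45  # assumed slack (Angstroms) added to the sum of the radii


def guess_bonds(atoms):
    """Return a list of (i, j) atom-number pairs that look bonded."""
    bonds = []
    for (i, a), (j, b) in combinations(enumerate(atoms, start=1), 2):
        distance = math.dist(a[1:], b[1:])  # Euclidean distance between atoms
        cutoff = COVALENT_RADII[a[0]] + COVALENT_RADII[b[0]] + TOLERANCE
        if distance <= cutoff:
            bonds.append((i, j))
    return bonds


# The hydrogen peroxide coordinates from the XYZ file above.
atoms = [
    ("O", 0.0000, 0.7375, -0.0528),
    ("O", 0.0000, -0.7375, -0.0528),
    ("H", 0.8190, 0.8170, 0.4220),
    ("H", -0.8190, -0.8170, 0.4220),
]

bonds = guess_bonds(atoms)
print(bonds)  # [(1, 2), (1, 3), (2, 4)]
```

For a small, well-behaved molecule like this one the guess is right, but the result depends entirely on the chosen radii and tolerance, which is why distorted or unusual geometries can be misinterpreted.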

2.2. Mol2 format

The Mol2 format, whilst appearing much more complex, is in fact fairly straightforward as you will see below. It was designed to enable the encoding of a large amount of chemical data about systems ranging from small molecules to polymers and biomolecules like proteins. It can even handle complex systems like non-covalently bound clusters of atoms and molecules.

One of the great advantages of this format is that it contains a description of the bonding in the system that is unique, avoiding potential problems as mentioned for the XYZ format, above.

Note that you may have to scroll side to side in order to see all of the file.

@<TRIPOS>MOLECULE
hydrogen peroxide
 4 3 0 0 0
SMALL
GASTEIGER

@<TRIPOS>ATOM
      1 O           0.0000    0.7375   -0.0528 O.3     1  HOOH       -0.2528
      2 O           0.0000   -0.7375   -0.0528 O.3     1  HOOH       -0.2528
      3 H           0.8190    0.8170    0.4220 H       1  HOOH        0.2528
      4 H          -0.8190   -0.8170    0.4220 H       1  HOOH        0.2528
@<TRIPOS>BOND
     1     1     2    1
     2     1     3    1
     3     2     4    1

This file contains the following sections separated by blank lines (these blank lines are optional but help to improve human readability):

  • @<TRIPOS>MOLECULE

    This block contains the general information about the molecule.
    Line 1 - some comment about the molecule, usually the name.
    Line 2 - the number of entries in the following sections of the file.
    These include atoms, bonds, substructures and other sets of atoms that might be of use in defining structural details of much larger molecules.
    Line 3 - refers to the class of molecule.
    Not all software uses this line but in some cases ‘SMALL’ informs the software to expect a discrete single molecule whereas ‘PROTEIN’ lets it know that the molecule is composed of different residues.
    Line 4 - tells us what type of atomic charges are included.
    In this case the charges come from the common Gasteiger charge calculation method [1] but they could also come from a force-field or even be user-specified (e.g. from the output of high-level quantum chemical calculations).
  • @<TRIPOS>ATOM

    This block contains the coordinates and other information about the constituent atoms in the molecule.
    Column 1 - number of each atom in the file.
    Column 2 - element symbol
    Columns 3-5 - X, Y and Z coordinates (in Angstroms).
    Column 6 - Mol2 atom type.
    These atom types augment the element symbols as they indicate the hybridisation and (optionally) formal charge of the atom. [2]
    Column 7 - residue number (only really important for proteins)
    Column 8 - residue name (only really important for proteins)
    Column 9 - partial charge on the atom
  • @<TRIPOS>BOND

    Column 1 - bond number
    Column 2 - 1st atom participating in bond
    Column 3 - 2nd atom participating in bond
    Column 4 - bond order
    This is most commonly 1, 2, 3 or ar (aromatic).
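The block structure described above can be read with a short Python sketch. It assumes whitespace-separated columns, atom and residue names without spaces, and that the charge column is present, all of which hold for the example file but not necessarily for every Mol2 file. The function name `parse_mol2` is hypothetical.

```python
def parse_mol2(text):
    """Split Mol2 text into its @<TRIPOS> blocks and decode ATOM and BOND."""
    sections = {}
    current = None
    for line in text.splitlines():
        if line.startswith("@<TRIPOS>"):
            current = line[len("@<TRIPOS>"):].strip()
            sections[current] = []
        elif current is not None and line.strip():
            sections[current].append(line)   # blank lines are skipped

    atoms = []
    for line in sections.get("ATOM", []):
        f = line.split()
        atoms.append({
            "id": int(f[0]),                                   # column 1
            "name": f[1],                                      # column 2
            "xyz": (float(f[2]), float(f[3]), float(f[4])),    # columns 3-5
            "type": f[5],                                      # column 6
            "charge": float(f[8]),                             # column 9
        })

    bonds = []
    for line in sections.get("BOND", []):
        f = line.split()
        bonds.append((int(f[1]), int(f[2]), f[3]))  # atom 1, atom 2, order

    return sections.get("MOLECULE", []), atoms, bonds


H2O2_MOL2 = """@<TRIPOS>MOLECULE
hydrogen peroxide
 4 3 0 0 0
SMALL
GASTEIGER

@<TRIPOS>ATOM
      1 O           0.0000    0.7375   -0.0528 O.3     1  HOOH       -0.2528
      2 O           0.0000   -0.7375   -0.0528 O.3     1  HOOH       -0.2528
      3 H           0.8190    0.8170    0.4220 H       1  HOOH        0.2528
      4 H          -0.8190   -0.8170    0.4220 H       1  HOOH        0.2528
@<TRIPOS>BOND
     1     1     2    1
     2     1     3    1
     3     2     4    1
"""

molecule, atoms, bonds = parse_mol2(H2O2_MOL2)
print(molecule[0], len(atoms), bonds)
```

The bond order is kept as text rather than converted to a number because values such as ar are not integers.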

2.3. PDB format

pdb files are the native format used by the Protein Data Bank for the representation of biomolecular crystallographic data. They can store a huge amount of data about crystal structures, and a description of the full specification is beyond the scope of the current course.

If you wish to find out more about the pdb format the official description can be found here.

Structure files in the pdb format are important for protein-ligand docking: you will almost always be using protein structures downloaded from the Protein Data Bank in your docking studies. However, the format itself is problematic. Different software packages read and write different versions of the pdb file, and some software even uses modified versions that are not completely compatible with the standard format.

In the following example we will look at the pdb representation of hydrogen peroxide.

COMPND    hydrogen peroxide 
REMARK    this is a comment line    
ATOM      1  O   UNK A   1       0.000   0.738  -0.053  1.00  0.00           O  
ATOM      2  O   UNK A   1       0.000  -0.738  -0.053  1.00  0.00           O  
ATOM      3  H   UNK A   1       0.819   0.817   0.422  1.00  0.00           H  
ATOM      4  H   UNK A   1      -0.819  -0.817   0.422  1.00  0.00           H  
CONECT    1    2    3                                                 
CONECT    2    1    4                                                 
CONECT    3    1                                                      
CONECT    4    2                                                      
MASTER        0    0    0    0    0    0    0    0    4    0    4    0
END

Although a number of other fields are permitted, the following give a good overview of the structure of the pdb file:

  • COMPND - the name of the molecule contained in the file.

  • REMARK - any lines starting with this are ignored by default and can contain any text relating to the contents of the file.

  • ATOM - these lines contain the atomic coordinates and other data (see below).

  • CONECT - in this version of the pdb file the connectivity (the bonds) is stored here. Note that no information on bond order is given.

  • MASTER - this line contains numerous fields that can hold information relevant to crystallographic structures. In this case, however, only the counts of “ATOM” and “CONECT” records are non-zero.

  • END - the end of the file.
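Because every bond appears twice in the CONECT records (once from each atom), software usually collapses them into a unique bond list. The sketch below shows one way to do this in Python, assuming the atom numbers sit in 5-character fields starting at column 7, as they do in the example file. The function name `bonds_from_conect` is made up for illustration.

```python
def bonds_from_conect(pdb_text):
    """Collect unique bonds (as sorted atom-number pairs) from CONECT lines."""
    bonds = set()
    for line in pdb_text.splitlines():
        if not line.startswith("CONECT"):
            continue
        line = line.rstrip()
        # Atom numbers occupy 5-character fields from column 7 onwards.
        serials = [int(line[i:i + 5]) for i in range(6, len(line), 5)]
        source, partners = serials[0], serials[1:]
        for partner in partners:
            bonds.add((min(source, partner), max(source, partner)))
    return sorted(bonds)


CONECT_LINES = """CONECT    1    2    3
CONECT    2    1    4
CONECT    3    1
CONECT    4    2
"""

bonds = bonds_from_conect(CONECT_LINES)
print(bonds)  # [(1, 2), (1, 3), (2, 4)]
```

The result lists which atoms are bonded but, as noted above, says nothing about bond order.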

The ATOM block has a complicated column structure and the contents must adhere strictly to the following positioning:

Columns    Contents
1 - 6      “ATOM”
7 - 11     Atom serial number
13 - 16    Atom name
17         Alternate location indicator
18 - 20    Residue name
22         Chain ID
23 - 26    Residue sequence number
27         Code for insertion of residues
31 - 38    X coordinate (Angstroms)
39 - 46    Y coordinate (Angstroms)
47 - 54    Z coordinate (Angstroms)
55 - 60    Occupancy
61 - 66    Temperature factor
77 - 78    Element symbol

As you can see, in the current example not all of these columns are filled, and some (like the occupancy and temperature factor) simply hold default values. This is because the majority of the fields in the pdb format are not used for small-molecule and/or non-crystallographic structures. To add to the confusion, many (but not all) of these column ranges must have their contents either right- or left-justified in order to be read correctly.
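Since the ATOM record is defined by character positions rather than by whitespace, it is read by slicing the line. The following Python sketch uses the column ranges from the table above; the dictionary and function names are hypothetical, and empty fields are returned as `None`.

```python
# (start, end) are 1-based, inclusive column numbers taken from the table.
ATOM_FIELDS = {
    "record": (1, 6),
    "serial": (7, 11),
    "name": (13, 16),
    "alt_loc": (17, 17),
    "res_name": (18, 20),
    "chain_id": (22, 22),
    "res_seq": (23, 26),
    "i_code": (27, 27),
    "x": (31, 38),
    "y": (39, 46),
    "z": (47, 54),
    "occupancy": (55, 60),
    "temp_factor": (61, 66),
    "element": (77, 78),
}

# Fields that should be converted from text to numbers.
NUMERIC = {
    "serial": int, "res_seq": int,
    "x": float, "y": float, "z": float,
    "occupancy": float, "temp_factor": float,
}


def parse_atom_line(line):
    """Slice one ATOM line into a dictionary of fields."""
    record = {}
    for field, (start, end) in ATOM_FIELDS.items():
        raw = line[start - 1:end].strip()   # convert to 0-based slicing
        convert = NUMERIC.get(field, str)
        record[field] = convert(raw) if raw else None
    return record


line = "ATOM      4  H   UNK A   1      -0.819  -0.817   0.422  1.00  0.00           H  "
atom = parse_atom_line(line)
print(atom["serial"], atom["name"], atom["x"], atom["y"], atom["z"], atom["element"])
```

Shifting any field by even one character (for example by deleting a space when editing by hand) makes the slices pick up the wrong text, which illustrates why hand-edited pdb files break so easily.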

It should be clear now that the pdb format, whilst undeniably one of the most important structure file formats, is far too complex for routine computational chemistry tasks and is prone to accumulating errors, particularly if edited by hand.

Note

Generally speaking, pdb files are best avoided in computational chemistry and formats like mol2 are more reliable for storing your structures. This is particularly true for small molecules such as the ligands that you will be working with in protein-ligand docking calculations.

One important reason for this is that mol2 files contain both the molecular connectivity and the bond type/order (the @<TRIPOS>BOND section in the example above), so the way that the atoms are bonded is defined explicitly and does not need to be guessed at by the software. This is not always the case for the different versions of pdb file that exist, which can lead to misinterpretation of the molecular structure by the different software packages you may be using.

2.4. PDBQT format

The pdbqt format is a development of the pdb one that was specifically designed to contain the information required by the protein-ligand docking software AutoDock. The pdbqt format was inherited by the software that we will be using in this course, vina, when it was created later as a development of AutoDock.

Note

The information required by vina in order to calculate the binding affinity between a protein and ligand is contained in the pdbqt file format. For this reason, both the ligand and the protein must be provided in pdbqt format so that the interactions between them can be calculated.

The HOOH molecule that we looked at in the previous examples is shown below in pdbqt format:

REMARK  Name = hydrogen peroxide
REMARK  1 active torsions:
REMARK  status: ('A' for Active; 'I' for Inactive)
REMARK    1  A    between atoms: O_1  and  O_2
REMARK                            x       y       z     vdW  Elec       q    Type
REMARK                         _______ _______ _______ _____ _____    ______ ____
ROOT
ATOM      1  O   UNK A   1       0.000   0.738  -0.053  0.00  0.00    -0.253 OA
ATOM      2  H   UNK A   1       0.819   0.817   0.422  0.00  0.00    +0.253 HD
ENDROOT
BRANCH   1   3
ATOM      3  O   UNK A   1       0.000  -0.738  -0.053  0.00  0.00    -0.253 OA
ATOM      4  H   UNK A   1      -0.819  -0.817   0.422  0.00  0.00    +0.253 HD
ENDBRANCH   1   3
TORSDOF 1

Some similarities with the pdb format can be seen

  • REMARK - as in the pdb file, lines beginning with this are ignored by the software making them useful for storing human-readable information.
    • in this case, the REMARK lines contain information on how many ‘active’ torsions there are in the molecule and tells you which atoms they connect (in larger molecules this will be a table of all the active torsions).

  • ATOM - again, like the pdb format these lines contain the atomic data (coordinates, residue that they belong to, etc). However, new fields have been added to the ATOM records:
    • The vdW and Elec fields are not used by vina but were included in the pdbqt format so that van-der-Waals radii and electronegativities could be incorporated.

    • The atomic partial charges, q, have values assigned here and are important if using AutoDock but again, vina does not directly use charges in calculating binding affinities.

    • Most important for vina is the Type field. This allows vina to evaluate docking contributions from atom-type pairs, e.g. a hydroxyl group hydrogen in the ligand and a backbone carbonyl oxygen in the protein.

The remaining lines are completely different from the pdb format but are vital for defining which parts of the ligand are rigid and which parts are to be treated flexibly during the docking calculation.

vina uses a ‘tree’ model of molecules to achieve this. In this model the rigid core of the molecule is the ‘root’ and the flexible portions are the ‘branches’ that emanate from this. This torsion tree is defined by the following lines

  • ROOT - defines the rigid core of the molecule, in this case the first H and O atoms.

  • ENDROOT - tells vina that the rigid core ends at the atom before this line.

  • BRANCH - specifies a flexible branch in the tree and gives the numbers of the atoms connecting the flexible torsion (oxygen atoms 1 and 3 in this case).

  • ENDBRANCH - defines where this particular flexible branch terminates. Note that because this line specifies the atom numbers of the branch that is ending, you could also have nested branches defining more complicated flexible sections of the molecule such as aliphatic chains.

The final line, TORSDOF, defines the total number of torsional degrees of freedom (rotatable bonds) in the molecule. This number does not include things like bonds that are part of rings, bonds to ‘leaf’ atoms, amide bonds, etc. This value is important because it is used in calculating the change in free energy due to the loss of torsional freedom that occurs when a ligand binds to its protein target.
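The torsion tree can be followed programmatically with a simple stack: each BRANCH pushes a new level and each ENDBRANCH pops it, so nested branches are handled naturally. The Python below is only a sketch of that idea, not how vina itself reads the file. It assumes the charge and atom type are the last two whitespace-separated items on each ATOM line, as in the example, and the name `parse_pdbqt` is hypothetical.

```python
def parse_pdbqt(text):
    """Return (atoms, branches, torsdof) from ligand pdbqt text."""
    atoms = []        # one dict per atom, tagged with the part it belongs to
    branches = []     # (atom, atom) pairs defining each rotatable bond
    stack = ["ROOT"]  # atoms outside any BRANCH belong to the rigid root
    torsdof = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        keyword = fields[0]
        if keyword == "BRANCH":
            bond = (int(fields[1]), int(fields[2]))
            branches.append(bond)
            stack.append(bond)       # enter this branch
        elif keyword == "ENDBRANCH":
            stack.pop()              # leave the innermost open branch
        elif keyword in ("ATOM", "HETATM"):
            atoms.append({
                "serial": int(fields[1]),
                "charge": float(fields[-2]),  # q column
                "type": fields[-1],           # Type column
                "part": stack[-1],
            })
        elif keyword == "TORSDOF":
            torsdof = int(fields[1])
    return atoms, branches, torsdof


H2O2_PDBQT = """REMARK  Name = hydrogen peroxide
REMARK  1 active torsions:
ROOT
ATOM      1  O   UNK A   1       0.000   0.738  -0.053  0.00  0.00    -0.253 OA
ATOM      2  H   UNK A   1       0.819   0.817   0.422  0.00  0.00    +0.253 HD
ENDROOT
BRANCH   1   3
ATOM      3  O   UNK A   1       0.000  -0.738  -0.053  0.00  0.00    -0.253 OA
ATOM      4  H   UNK A   1      -0.819  -0.817   0.422  0.00  0.00    +0.253 HD
ENDBRANCH   1   3
TORSDOF 1
"""

atoms, branches, torsdof = parse_pdbqt(H2O2_PDBQT)
for atom in atoms:
    print(atom["serial"], atom["type"], atom["part"])
print(branches, torsdof)
```

Here atoms 1 and 2 are reported as part of the root, and atoms 3 and 4 as part of the branch that rotates about the bond between atoms 1 and 3.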

2.5. Converting between file types

Whilst it is possible to convert molecule files by hand, it is considerably easier to have it done automatically. The OpenBabel program can be used for this purpose and uses fairly simple command line instructions. For example, to convert a PDB file (ligand.pdb) to Mol2 format we can simply use:

obabel -ipdb ligand.pdb -omol2 -O ligand.mol2

Here -ipdb ligand.pdb tells OpenBabel that the input file, ligand.pdb, is in pdb format and -omol2 lets it know that we want the output in Mol2 format. It is important to include the -O before the output file name otherwise the converted file’s contents will simply be printed to the terminal and not saved as a file.

OpenBabel can do many other things in addition to converting the file between formats. For example, if your molecule has no hydrogens (as can be the case with downloaded structures) you can ask OpenBabel to add them by adding -h to the command:

obabel -ipdb ligand.pdb -omol2 -O ligand.mol2 -h

If your molecule contains titratable groups that may change protonation state with pH, OpenBabel can make a good guess at whether they should have hydrogens added by adding the -p modifier and specifying the pH you want, e.g.:

obabel -ipdb ligand.pdb -omol2 -O ligand.mol2 -p 7.4

It can also sometimes be useful to center the molecule with the -c modifier:

obabel -ipdb ligand.pdb -omol2 -O ligand.mol2 -c

Many other things are possible with this software and these are listed on the OpenBabel site.
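If you need to convert many files, the commands above can be assembled from a script. The Python sketch below builds the same obabel command lines using only the options described in this section; the helper names `obabel_command` and `convert` are made up for illustration, and `convert` assumes that the obabel program is installed and on your path.

```python
import subprocess


def obabel_command(src, dst, in_fmt="pdb", out_fmt="mol2",
                   add_h=False, ph=None, center=False):
    """Build the obabel argument list for one conversion."""
    cmd = ["obabel", f"-i{in_fmt}", src, f"-o{out_fmt}", "-O", dst]
    if add_h:
        cmd.append("-h")           # add hydrogens
    if ph is not None:
        cmd += ["-p", str(ph)]     # add hydrogens appropriate for this pH
    if center:
        cmd.append("-c")           # center the molecule
    return cmd


def convert(src, dst, **options):
    """Run the conversion (requires obabel to be installed)."""
    subprocess.run(obabel_command(src, dst, **options), check=True)


# Build (but do not run) the pH 7.4 example from above.
cmd = obabel_command("ligand.pdb", "ligand.mol2", ph=7.4)
print(" ".join(cmd))  # obabel -ipdb ligand.pdb -omol2 -O ligand.mol2 -p 7.4
```

Passing the command as a list rather than a single string avoids problems with file names that contain spaces.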

Footnotes