Theseus

A program for maximum likelihood superpositioning and analysis of macromolecular structures

Theseus is a program that simultaneously superimposes multiple macromolecular structures. Instead of using the conventional least-squares criterion, Theseus finds the optimal solution to the superposition problem using the method of maximum likelihood (ML). The ML method downweights variable regions of the structures and corrects for correlations among atoms, producing much more accurate superpositions than least squares.
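
To make the idea of downweighting concrete, here is a minimal toy sketch in Python. It is not the Theseus algorithm: it works in 2D, superposes only two structures, takes the per-atom variances as given (they are assumed here, not estimated from the data), and ignores correlations among atoms. The function name superpose_2d and the example coordinates are hypothetical. It only shows how weighting each atom by the inverse of its variance keeps one floppy atom from dragging the fit away from the well-ordered core.

```python
import math

def superpose_2d(mobile, target, weights=None):
    """Fit `mobile` onto `target` with one rotation and one translation.

    Minimizes sum_i w_i * |R a_i + t - b_i|^2. With weights=None every atom
    counts equally (ordinary least squares).
    Returns (rotation angle in radians, fitted coordinates).
    """
    if weights is None:
        weights = [1.0] * len(mobile)
    wsum = sum(weights)
    # Weighted centroids of both coordinate sets.
    cmx = sum(w * x for w, (x, _) in zip(weights, mobile)) / wsum
    cmy = sum(w * y for w, (_, y) in zip(weights, mobile)) / wsum
    ctx = sum(w * x for w, (x, _) in zip(weights, target)) / wsum
    cty = sum(w * y for w, (_, y) in zip(weights, target)) / wsum
    # Weighted dot and cross terms of the centered coordinates.
    dot = cross = 0.0
    for w, (mx, my), (tx, ty) in zip(weights, mobile, target):
        ax, ay = mx - cmx, my - cmy
        bx, by = tx - ctx, ty - cty
        dot += w * (ax * bx + ay * by)
        cross += w * (ax * by - ay * bx)
    # In 2D the optimal rotation angle has a closed form.
    theta = math.atan2(cross, dot)
    c, s = math.cos(theta), math.sin(theta)
    fitted = [(c * (x - cmx) - s * (y - cmy) + ctx,
               s * (x - cmx) + c * (y - cmy) + cty) for x, y in mobile]
    return theta, fitted

def make_example():
    """Four rigid 'core' atoms plus one atom that moves independently."""
    true_theta = math.radians(30.0)
    c, s = math.cos(true_theta), math.sin(true_theta)
    mobile = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0), (2.0, 2.0)]
    target = [(c * x - s * y + 3.0, s * x + c * y - 1.0) for x, y in mobile]
    lx, ly = target[-1]
    target[-1] = (lx + 1.5, ly - 2.0)  # displace the last atom (floppy loop)
    return mobile, target, true_theta

mobile, target, true_theta = make_example()
variances = [1.0, 1.0, 1.0, 1.0, 1.0e6]   # assumed: last atom is highly variable
weights = [1.0 / v for v in variances]     # inverse-variance weighting

theta_ls, _ = superpose_2d(mobile, target)           # equal weights
theta_w, _ = superpose_2d(mobile, target, weights)   # variable atom downweighted
print("true angle (deg):    ", math.degrees(true_theta))
print("least squares (deg): ", math.degrees(theta_ls))
print("weighted (deg):      ", math.degrees(theta_w))
```

In this toy case the equal-weight fit is pulled far from the true rotation by the single displaced atom, while the inverse-variance fit recovers it closely. In a real ML superposition the variances are unknown and must be estimated together with the rotations and translations.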

When superposing macromolecules with different residue sequences, other programs and algorithms discard residues that are aligned with gaps. Theseus, however, uses a novel ML superposition algorithm that includes all of the data. To use Theseus to superpose homologous proteins with sequences of different lengths (e.g., when the protein sequences align with gaps and insertions), a sequence alignment must be provided. We supply a wrapper script, theseus_align (linked below), that calls Theseus, extracts the proper sequences from the PDB files, aligns them, and performs the superposition using that alignment. Future versions of Theseus will address the much harder structural alignment problem by simultaneously finding the best alignment and superposition using the method of maximum likelihood.
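
The role of the alignment can be illustrated with a short Python sketch. It is a hypothetical helper, not part of Theseus or theseus_align, and the two aligned sequences are made up. It maps each alignment column to the residue index in each structure, with None marking a gap, and then shows which columns a method that discards gapped positions would keep.

```python
def alignment_columns(aligned_seqs, gap="-"):
    """Map each alignment column to per-sequence residue indices.

    Each entry of the result is a tuple with one item per sequence: the
    0-based index of the residue in that sequence, or None where the
    sequence has a gap in that column.
    """
    if len({len(s) for s in aligned_seqs}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    next_index = [0] * len(aligned_seqs)   # next residue index per sequence
    columns = []
    for column in zip(*aligned_seqs):
        row = []
        for k, char in enumerate(column):
            if char == gap:
                row.append(None)           # missing data in this structure
            else:
                row.append(next_index[k])
                next_index[k] += 1
        columns.append(tuple(row))
    return columns

def ungapped_columns(columns):
    """Columns that a gap-discarding method would keep."""
    return [col for col in columns if None not in col]

# Made-up two-sequence alignment with one gap in each sequence.
cols = alignment_columns(["AC-GT", "A-TGT"])
print(cols)
print(ungapped_columns(cols))
```

Here two of the five columns contain a gap. A method restricted to ungapped columns would use only three residue pairs, whereas an approach that treats gaps as missing data can still use the coordinates of the residues that are present in those columns.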


LS vs. ML superposition of the Kunitz domain

A conventional least-squares superposition of the Kunitz domain from PDB ID 1adz is shown at left. A maximum likelihood superposition from Theseus is shown at center. At right is the first principal component of the superposition plotted on the family of models. The red loops at lower right are highly correlated with each other, whereas they are moderately anti-correlated with the light blue strands at left center.


Author

Douglas Theobald


Citations

"Optimal simultaneous superpositioning of multiple structures with missing data."
Theobald, Douglas L. & Steindel, Philip A. (2012) Bioinformatics 28 (15): 1972-1979 [Open Access]

"Accurate structural correlations from maximum likelihood superpositions."
Theobald, Douglas L. & Wuttke, Deborah S. (2008) PLOS Computational Biology 4(2):e43 [Open Access]

Latest Version — THESEUS 3.3.0 (2015 June 5)


Version 3.3.0: Optimizations, code cleaning.

Version 3.1.1: Minor tweaks to help page, updates to man page, etc.

Version 3.1.0: Differences from version 3.0.0 include (1) a slightly improved algorithm, (2) a corrected marginal likelihood calculation, (3) an average structure that is now in the same reference frame as the superposition, and (4) a fix for a bug in the PCA output (theseus_pc#_ave.pdb was correct, but the superposition pc files were not).

Version 3.0.0: Differences from version 2 include (1) an improved algorithm, (2) a slight change in the target criterion, which now maximizes a marginal likelihood (with the covariance matrix integrated out) instead of a joint likelihood and should improve stability in certain rare pathological cases, and (3) extensive code restructuring and streamlining.


UNIX C source code, licensed under the GPLv3 open source license.
Requires an ANSI C compiler (preferably GNU GCC) to compile and a working GSL library to link against.
Download source (22 Mb)


Macintosh OS X Universal binary.
Download


Linux generic x86 binary executable.
Download


'theseus_align' script
Download

This very useful wrapper script runs THESEUS on multiple PDB files when the proteins (or nucleic acids) are of different lengths. For example, you will probably want to use this script when superpositioning structurally similar homologous proteins (having different sequences). This script transparently extracts the proper sequences from the PDB files, aligns them, and then performs the ML superposition based on that alignment. Examples are given in the examples directory provided with the source code and binaries. In general, the command will look something like:
theseus_align -f protein1.pdb protein2.pdb
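
For batch work, the same command can be assembled and launched from Python. The sketch below is an illustration only: it uses just the command form shown above (theseus_align followed by -f and the PDB file names), assumes the script is on your PATH, and assumes that more than two file names may be passed the same way. The helper names are hypothetical.

```python
import shutil
import subprocess

def build_theseus_align_command(pdb_files):
    """Build the argument list for the command form shown above."""
    pdb_files = list(pdb_files)
    if len(pdb_files) < 2:
        raise ValueError("need at least two PDB files to superpose")
    return ["theseus_align", "-f"] + pdb_files

def run_theseus_align(pdb_files):
    """Run theseus_align if it is installed; return its exit code or None."""
    command = build_theseus_align_command(pdb_files)
    if shutil.which(command[0]) is None:
        # Script not found on PATH, so just report what would have been run.
        print("theseus_align not found; would run:", " ".join(command))
        return None
    return subprocess.run(command, check=False).returncode

if __name__ == "__main__":
    run_theseus_align(["protein1.pdb", "protein2.pdb"])
```

Passing the arguments as a list avoids shell quoting problems with unusual file names. Consult the examples directory shipped with the source code and binaries for the options that theseus_align actually supports.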