Abstract
Background
Molecular database search tools need statistical models to assess the significance for the resulting hits. In the classical approach one asks the question how probable a certain score is observed by pure chance. Asymptotic theories for such questions are available for two random i.i.d. sequences. Some effort had been made to include effects of finite sequence lengths and to account for specific compositions of the sequences. In many applications, such as a largescale database homology search for transmembrane proteins, these models are not the most appropriate ones. Search sensitivity and specificity benefit from positiondependent scoring schemes or use of Hidden Markov Models. Additional, one may wish to go beyond the assumption that the sequences are i.i.d. Despite their practical importance, the statistical properties of these settings have not been well investigated yet.
Results
In this paper, we discuss an efficient and general method to compute the score distribution to any desired accuracy. The general approach may be applied to different sequence models and and various similarity measures that satisfy a few weak assumptions. We have access to the lowprobability region ("tail") of the distribution where scores are larger than expected by pure chance and therefore relevant for practical applications. Our method uses recent ideas from rareevent simulations, combining Markov chain Monte Carlo simulations with importance sampling and generalized ensembles. We present results for the score statistics of fixed and random queries against random sequences. In a second step, we extend the approach to a model of transmembrane proteins, which can hardly be described as i.i.d. sequences. For this case, we compare the statistical properties of a fixed query model as well as a hidden Markov sequence model in connection with a position based scoring scheme against the classical approach.
Conclusions
The results illustrate that the sensitivity and specificity strongly depend on the underlying scoring and sequence model. A specific ROC analysis for the case of transmembrane proteins supports our observation.
Background
A large amount of molecular biological data is stored in form of sequences of symbols in huge data bases, e.g., DNA sequences or the primary structure of proteins. It is one main task of Bioinformatics [1] to develop algorithms [2] which allow to find, via sequence comparison, for a given "query" sequence the most similar "subject" sequences in a data base. All widelyused algorithms depend on many parameters, the socalled scoring schemes [2], which are usually suitably adopted to test sets of data. Parts of these scoring schemes involve also rules how to deal with small nonsimilar subsequences, the socalled gaps. The most popular sequencecomparison algorithms are the SmithWaterman algorithm [3] for pairwise local sequence alignment (which means that the most similar subsequences of two sequences are found) and the Viterbi algorithm for sequencetoHMM alignment. In the latter case, one specific sequence is compared to a full set of sequences which is specified via a Hidden Markov Model (HMM) [4]. For practical applications, often very fast heuristics are used, which do not find the exact bestmatching (sub)sequences but only approximations of the optimum. The BLAST algorithm [5] is used widely. Sequencecomparison algorithms return a raw similarity score, i.e., a number, that quantifies the similarity between the input objects. Unfortunately, this raw score is hard to interpret because one does not know its absolute scale.
An interpretation becomes possible when we specify a probabilistic null model for the input: Then the similarity score becomes a random variable S whose probabilities Prob(S = s) under the null model can be determined. Sometimes this can be done analytically, but usually one has to apply numerical simulation [6]. The pvalue assigned to an observed score s is defined as pval(s): = Prob(S ≥ s) in the null model, and log pval(s) is a measure of surprise (and hence a universally normalized score) for s. The key problem is, of course, to find Prob(S = s) for a given sequencecomparison algorithm, a given scoring scheme, and a given null model.
In this paper, we explain and extend an efficient and generally applicable numerical technique that solves this problem in many different sequence comparison settings, such as for a BLASTlike database search [5] with a fixed query, for positionspecific scoring and/or gapcost schemes (essentially HMMs), or for normalized alignment [7]. In each of those settings a variety of null models in addition to the i.i.d. model is possible.
Previous work
We start by introducing some necessary formal notations. For a full description, please refer to Ref. [2]. Let Σ be a fixed alphabet of symbols, denoting e.g. nucleotides (Σ = 4) or amino acids (Σ = 20).
Most of the existing statistical work for pairwise sequence comparison focuses on null models where both sequences are random and at each position a symbol σ ∈ Σ is chosen independently of the other positions ("i.i.d. model"), with a given frequency . f often reflects the average composition of proteins in the UniProt/SwissProt database [8]. An alignment of the two sequences is a set of pairs {(i_{k}, j_{k})}, which means that symbol of position i_{k }of the first sequence is aligned (or paired) to the symbol at position j_{k }of the second sequence. The pairs must not cross, i.e., if for two pairs (i_{k}, j_{k}) (i_{l}, j_{l}) the condition i_{k }< i_{l }holds, then also j_{k }< j_{l}. Positions which are not paired, i.e., which do not appear in the alignment, contribute to the above mentioned gaps. The length of a gap is the number of adjacent ungapped positions. In the following example, where the two sequences are shown such that paired symbols are atop on each other,
two gaps of lengths two and three appear.
Scores for individual pairs of symbols are given by a constant (positionindependent) symmetric Σ × Σ scoring matrix with negative expected score, such as BLOSUM62 [9]. The score of an alignment is given by the sum of the scores of the pairs, plus (negative) contributions for the gaps. For the conventional gapscoring schemes (called "affined"), each gap contributes a score which depends only linearly on its length (times a parameter called gapextension penalty) plus a constant (gapopen penalty). We shall refer to this model later as "random query  generalpurpose scoring" (RQGS).
For gapless pairwise local sequence alignment, the raw score distribution can be derived numerically by Markov chain analysis [10] and also asymptotically for infinite sequences (KarlinAltschul or DemboKarlin statistics [11]): It is an extremevalue distribution (EVD), also called Gumbel distribution [12]:
where the parameters λ > 0 and c > 0 depend on the score matrix, on the symbol frequencies f, and on the query and subject sequence lengths L_{Q }and L_{S}. Asymptotically we have c = KL_{Q}L_{S }for a lengthindependent K > 0.
For gapped pairwise local sequence alignment, which is the most relevant case in database queries there exist no universal analytic results, with the exception of few special cases [13]. Empirical evidence also indicates convergence towards the Gumbel form for long sequences; λ and K additionally depend on the gapcost function [14]. Several works have focused on efficient numerical estimation of these parameters [15]. The influence of varying lengths of the finite sequences [16] is treated in various ways, e.g. by adjusting the lengths of the sequences to "effective lengths" but still assuming a Gumbel form of the distribution. Nevertheless, for moderate sequence lengths, which are biologically most relevant, the true distribution differs strongly from a Gumbel form [17,18], which can be dealt with by including a correction term to the Gumbel form (giving rise to an additional parameter).
The (RQGS) model is convenient, because the problem of computing significance values reduces to the estimation of only two parameters, which can be precomputed for each scoring scheme. However, there are also several problems. For instance, even if one considers just the gapless case, it is in general not easy to extend the analytic asymptotic theory to more complex null models. Furthermore, for practical applications where finite sequence lengths are considered, of even more importance is: The pvalues reported by (the original) BLAST only depend on the raw score, the query and the subject length, and not on the actual query sequence. This leads to large distortions when the composition of the query sequence does not match the composition of the null model. For example, when we run a homology search for the human transmembrane protein rhodopsin (UniProt accession P08100) with BLAST (BLOSUM 62, gapinit 12, gapextend 1, no composition adjustment, no filtering), we find a possibly remote homolog Q8NH42 with an Evalue of 9 · 10^{8}. The Evalue for score s is the expected number of database hits with score at least s and can be easily computed from pval(s) and the database size. Hence, it appears unlikely to obtain such a homolog by pure chance, i.e., the homolog appears to be relevant. However, using a recent "compositionbased adjustment" option [19,20] leads to a very different Evalue of 0:001 for the same protein. This underlines the importance of queryspecific or at least compositionbased statistics, particularly for intermediate pvalues.
The statistics of positiondependent scoring and/or gapcost schemes, as used in PSIBLAST [21] or in hidden Markov model (HMM) frameworks, are less well explored. The central question here is, "given a query Q and a positionspecific scoring scheme, what is the score distribution when random nullmodel sequences of given length are scored against Q?". We refer to this model as "fixed query  positiondependent scoring" (FQPS). As a compromise between the general (RQGS) and the very specific (FQPS) models, one may neither use a completely free (i.i.d) nor use a fixed query but draw query sequences according to more specific models, e.g., HMMs for transmembrane proteins.
In all these cases EVDs Eq. (2) may still used heuristically by fitting the parameters of the EVD to the simulated data. This can be achieved by generating pairs of random sequences according to the given null model while recording the histogram of observed alignment scores. Using such a "simple sampling" approach, the largeprobability region of the score distribution can be investigated, e.g., for probabilities about > 10^{4 }when generating 10^{4 }sequence pairs. Such an approach is implemented, e.g. in the hmmcalibrate program from the HMMER package [22]. Nevertheless, this procedure may fail to describe distribution in the "rareeven tail", i.e., where the probability is small (say < 10^{4}), although this part of the distribution is most important for the estimation of the statistical significance.
Our motivation for a simulationbased method that makes no initial parametric assumption refers to the approach [23] to increase the sensitivity of detecting homologs of a given transmembrane (TM) protein in a database search: A bipartite scoring scheme with a (nonsymmetric) transmembrane helix specific scoring matrix (such as SLIM [23]) for the TM helices and a generalpurpose scoring matrix (such as BLOSUM [9]) for the remaining regions of the query protein were applied, see Figure 1. This results in higher search sensitivity and specificity. However, a statistical theory or efficient computational method to obtain the score probabilities in such a (FQPS) framework is missing so far.
Figure 1. Bipartite scoring scheme. Bipartite scoring scheme for the detection of homologous transmembrane proteins from Ref. [23]. The figure represents the SmithWaterman alignment matrix and indicates which scoring matrix is used for which query positions (rows): In transmembrane helices, a transmembranespecific scoring matrix is used. For pvalue computations, the query is assumed fixed or generated by the TMHMM and the subject is assumed a random i.i.d. sequence drawn from the distribution of aminoacid frequencies of the database.
Our contributions and paper outline
We present a general framework for efficient estimation of raw score distributions in sequence comparison problems. In particular the rareevent tail for large scores can be accessed. We only make the following assumptions:
1. We are able to sample pairs x, y of sequences according to the null model and to compute the null model probability of any given x, y.
2. We have an efficient algorithm that computes the score S(x, y), where x, y could be a pair of randomly drawn sequences (RQGS or HMM), or one fixed and one random sequence (FQPS).
3. The scores are rational numbers with a common denominator. Hence, without loss of generality, they can be assumed to be integers.
4. Optionally for the (HMM) approach, we have an efficient algorithm that predicts the most likely state sequence for a given sequence.
Our framework is readily applicable to the (RQGS), (FQPS) and (HMM) models, but also to more exotic settings, such as normalized alignment [7], where the score is not additive, but normalized by the alignment length, for which no statistical framework exists so far. Very recently Eddy [24] studied the distributions of Viterbi and Forward scores under probabilistic local alignment, for which a numerical analysis of the rareevent tail would be of interest as well.
In the current stage of the methodology, the computation of an accurate "on the fly" pvalue for each particular database query might be impracticable as each full calculation is not achieved within a few minutes.
We will illustrate the approach for the HMM for TM proteins (TMHMM [25,26]), which has been proven valuable in predicting TM helices. In this approach (and possible in other models as well) one is able to specify the score distribution in more detail. Each query may be classified to be a member of certain given subclasses . In this case, it could be meaningful to obtain an individual score distribution for each subclass. A natural classification of the TMHMM is the number of transmembrane regions. Since for a given sequence this number is usually not known exactly, one takes the most likely one.
The rest of the paper is organized as follows. The following section presents the mathematical background on importance sampling and Markov chain Monte Carlo methods which are fundamental to the methods used to obtain the score distribution, in particular in the rareevent tail, for different null models. Next, we present a description of the methodology. Section "Results" shows computational results on transmembrane protein similarity statistics in (RQGS), (FQPS) and (HMM). A discussion closes the paper.
Methods
Importance sampling
Importance sampling is a general technique to reduce the variance in the estimation of quantities that can be written as an expectation , where Z is a random object representing the null model and h is a realvalued function. We assume that we can draw within a computer simulation n random samples z_{1}, ..., z_{n }according the null model. The expectation is then approximated by the empirical mean .
In our setting, to estimate the score distribution (and then pvalues), we consider the state space , from which we generate n random pairs of sequences({x^{(1) }, y^{(1)}} ,..., {x^{(n) }, y^{(n)}}).
These pairs are then aligned by a given algorithm and the corresponding similarity scores S(x^{(i)}, y^{(i)}) are computed. To formally write a histogram as an expectation value, we consider the family of indicator functions for all s > 0, defined by h_{s}(x, y) := 1 if S(x, y) = s, and h_{s}(x, y) := 0 if S(x, y) ≠ s, hence
This means, we approximate the unknown exact probability Prob(S(X, Y)) by normalized score histograms over all sampled sequence pairs. If the probability to be estimated is small, say 10^{9}, when using simple sampling, we need about 10^{12 }samples to estimate it with reasonable precision. Thus, for very rare events, this sampling quickly becomes infeasible.
Importance sampling generates the "interesting" events more often by sampling from a different distribution and correcting for this bias afterward, which results in a more accurate estimate with a reasonable number of samples. Let p be the probability mass function (pmf) of (X, Y), and let q be another pmf satisfying q(x, y) > 0 whenever p(x, y) > 0. Then
where each pair (x'^{(i)}, y'^{(i)}) is sampled from the pmf q. Eq. (2) gives us the relationship between the expectation value w.r.t. the unknown distribution of interest, the target distribution p, and samples drawn from the actually used sampling distribution q. To successfully apply importance sampling, q has to fulfill three properties: First, it needs to put high probability on the region of interest; second, we need to be able to sample according to q; third, we need to be able to compute the correcting weight p(x, y)/q(x, y). Since directly sampling from q often is impossible, we shall use a general sampling method which we describe next.
MetropolisHastings sampling
If we need to generate samples from a discrete (or continuous) distribution q but have no simple direct method to do so, the MetropolisHastings method [27] provides a solution by constructing an ergodic Markov chain with stationary distribution q in the following way.
Extensive introductions to such socalled Monte Carlo simulations can be found, e.g., in Refs. [28,29]. Here we just give a concise introduction specifically tailored to the problem of sampling pairs of sequences. Let us call the elements (x, y) of the sample space configurations. The sampling will be performed by randomly "moving" in the space of configurations, such that for each step a configuration is altered only slightly. Configurations which can be visited and are connected by a move are called neighbors. Hence, each configuration (x, y) has a algorithmdependent set of potential neighbors (x', y') ∈ . The movement is performed in such a way that each neighbor exhibits a positive probability P_{(x,y), (x',y') }as being the next configuration. The proposal is accepted with probability
in which case (x', y') becomes the new current configuration. Otherwise (x', y') is discarded and (x, y) remains unchanged. Thus, a sampling algorithm is specified via the neighborhoods and the proposal probabilities P_{(x,y),(x', y')}
The acceptance criterion Equation (3) is quite general. By using a symmetric proposal probability matrix, P_{(x',y') (x, y) }= P _{(x,y), (x',y')}, the relationship simplifies to
Since the distribution q appears only in form of a ratio we need to be able to compute q(x, y) only up to a normalization constant. If q(x', y') > q(x, y) we accept the proposal and otherwise the proposal is accepted as new configuration with the finite probability q(x', y')/q(x, y). Hence, we need to choose the neighborhood of x, y such that that the ratio is not too far away from one. Otherwise, virtually only proposals that increase the probability are accepted and the sampling procedure gets stuck at local maxima.
Equation (4) and its generalization Equation (3) describe Markov chains in the configuration space with the transition matrix T_{(x, y), (x', y') }= P_{(x, y), (x', y')}·α((x, y) → (x', y')).
For an appropriate choice of the neighborhoods and of the proposal distribution P_{(x, y), (x', y')}, the soconstructed Markov chain is ergodic (each configuration can be reached from any starting configuration with finite probability). Furthermore, one can show that the detailed balance condition
which is fulfilled due to the choice of α(.) according to Eq. (3), implies that the chain converges towards the desired sampling distribution q.
We say that the chain has reached equilibrium when convergence has occurred up to numerically negligible error. Thus, if the configuration (x, y) is sampled after equilibration, it will behave like a sample from q. In practice, the exact speed of equilibration is unknown and convergence diagnostics are applied (see below). Several (almost) independent samples are obtained by running the chain further and taking a sample every kth step for sufficiently large k to allow time for "forgetting" (i.e., decorreclating from) the state of the last sampled configuration. This time is usually referred as mixing time.
Implementation
In this section we show how the sampling algorithm for pairs of sequences is actually designed, based on the background given in the two preceding sections, such that the tails of the probability distributions for the scores can be addressed. The crucial point of the MetropolisHastings update is the choice of an appropriate neighborhood (and the related proposal probabilities) and the computation of the probabilities of newly proposed states q(x', y'). The neighborhood should be chosen such that the acceptance rate Eq. (3) is between 0.3 and 0.7. We shall factorize the (unnormalized) pmf q in two contributions, firstly weights w: ℤ → ℝ^{+ }that assign each score value of interest a weight and secondly the null probability, i.e.
Note that we will leave w(·) undetermined for a moment, until Section "WangLandau Sampling". The importance reweighting equation Eq. (2) for h_{s }is then
with the normalization constant
For the (HMM) a single distribution of scores is not sufficient: Each query is a member of a certain subclass characterized by the number of transmembrane regions "# of TM helices" to be determined by the Viterbi algorithm (see below). Thus, each class has its own probability P_{n}(s) = Prob(S = s, # of TM helices = n_{TM}). In order to take this property into account, we deal with the joint probability Prob(S = s, # of TM helices = n_{TM}). Accordingly, the weights have a two dimensional domain, we write w(s, n). Also h_{s }in Eq. (7) is replaced by an indicator function h_{s,n }that depends on two parameters: h_{s,n}(x, y) equals 1 if S(x, y) = s and # of TM helices of x = n_{TM}. The sampling distribution is generalized to
and the the reweighting relationship reads as
Generally the occurrence of two sequences x = and is characterized by the null probability
This simple factorization allows us to draw proposals for the query and for the subject independently. Hence, for simplicity, a neighboring configuration will leave one of the two sequences unchanged. Thus, for selecting a neighboring configuration, first one of the two sequences is chosen at random with probability 1/2. In the case of (FQPS) the subject is always chosen. Then one sequence is chosen from the neighborhood of the selected sequence, as described next. Formally, this means for (RQGS) and (HMM) we use the factorized proposal densities P_{(x, y), (x', y') }= 0.5P_{x, x' }1_{y, y' }or P_{(x, y) (x', y') }= 1_{x, x' }P_{y, y' }(1_{y, y' }denotes the indicator function which is only one if y = y', 0 else. P_{x, x' }denotes the proposal of a single sequence) depending on the choice of sequence in the first step.
Proposal densities for (FQPS) and (RQGS)
In the simplest case either both sequences are i.i.d. or the query is fixed (to some sequence ) and the nullmodel probabilities of their occurrence factorize, i.e.
and of course in both cases
Due to the factorization that occurs in Eq. (10) it is possible to draw sequences from such that the detailed balance condition f^{iid}(x) P_{x,x' }= f^{iid}(x') P_{x', x }is fulfilled by the following set of Monte Carlo moves (see also Figure 2 and Table 1)
Figure 2. Monte Carlo moves used in the simulation. (a) substitution, (b) insertion with left shift, (c) insertion with right shift,(d) deletion with right shift and (e) deletion with left shift.
Table 1. Monte Carlo operations
a) substitution of a single symbol at position k,
b) insertion of a single new symbol at position k with left shift (deletion of the first symbol),
c) insertion of a single new symbol at position k with right shift (deletion of the last symbol),
d) deletion of a single symbol at position k with right shift and insertion of a single new symbol at the beginning,
e) deletion of a single symbol at position k with left shift and insertion of a single new symbol at the end.
Operation a) appears with probability 1/2 and the other ones with probability 1/2 · 1/4 each. This is one possible choice that guarantees detailed balance.
Note that all sequences in have the same length and each operation involves a replacement of an existing symbol with a newly drawn symbol, in case a) by a direct substitution and in the cases b)e) indirect via a shift operation. Each position of a sequence has the same probability of being chosen and the replaced symbol is chosen in all cases according to the frequencies f_{σ }(σ ∈ ∑).
With this construction the MetropolisHastings ratio Eq. (3) simplifies to the special case of the Metropolis algorithm, i.e.
The right part in the second line cancels, because all contributions to p(x, y) and p(x', y') where symbols from (x, x') and y, y) agree cancel directly and for the few remaining letters, the proposal probability in the nominator contains exactly the frequencies of the null probability in the denominator, and vice versa. Thus, the acceptance rate depends only on the score value of the current configuration S(x, y) and the one of the proposal S(x', y'). Furthermore, it is easy to prove that the detailed balance conditions in Eq. (5) is fulfilled for this chain which in turn implies that the chain converges towards the sampling distribution q.
Proposal densities for the (HMM)
In contrast to the approach presented in the previous section, the generalized method we use here also works for null models that do not allow for direct sampling from as in the case of i.i.d. sequences. This framework can be summarized by following algorithm:
METROPOLISHASTINGSUPDATE(x, y, z, p, s, n, w)
Input: Sequences x, y, a hidden state sequence z, the null probability p(x, y) = f^{query}(x). f^{subject}(y), the score s = S(x, y), the subclass n and weights w
Output: Possibly new values for x, y, z, p, s, n.
1: Draw (x', y') ∈ N(x, y)
2: compute z' := V(x) using and determine the corresponding class n';
3: compute p' := f^{query }(x') · f^{subject}(y');
4: compute s' := S(x', y') using .
▷ Designed such that
p(x', y') · P_{(x', y'), (x,y) }= p(x, y) · P_{(x,y),(x', y')}
6: With probability min {1, α}:
Let (x, y, z, p, s, n)← (x', y', z'. p', s', n')
7: return (x, y, z, p, s, n)
The algorithm is applicable to all models that allow for a rapid calculation of the null probabilities f(·). Sequence models based on HMMs fall into this class. In the following we brie y describe this framework. A detailed discussion can be found in the specialized literature on the topic [2,4].
In the probabilistic framework of HMMs one assumes a sequence of "observed" symbols (the protein sequences here) which is generated conditioned on a sequence of "hidden" states. For the case of TM proteins, the state corresponds to the physical region where the corresponding amino acid is located in, as detailed below. Within a modeling using HMMs, this state sequence, also called path, follows a simple Markov chain. The actually generated symbols are connected to the hidden states by conditional "emission" probabilities. More formally, a HMM consists of
• a finite set ∑ of (output) symbols (in our case the amino acid alphabet),
• a finite set Γ of (hidden) states,
• initial state probabilities π_{μ }for all μ ∈ Γ with ∑ _{μ∈Γ }π_{μ }= 1,
• emission probabilities in each state μ ∈ Γ Σ and for all σ ∈ ∑ with = 1 for all σ ∈ ∑,
• a stochastic transition probability matrix P = (p_{μ,τ})_{μ,τ ∈ Γ}, i.e. = 1 for all μ ∈ Γ.
Given these model parameters, the "most natural" application of a HMM is to generate a sequence of hidden states by a stochastic process and, in parallel, to generate a random sequence of symbols given the generated states. Hence, the stochastic process describes pairs of states and symbols. But also given a fixed state Z = Z_{1 }... Z_{L}, the symbol sequence x = x_{1 }... x_{L }is a stochastic process, furthermore the opposite case of a fixed sequence of output symbols, the state sequence is a stochastic process.
For the Monte Carlo sampling as needed here, it is not possible to simulate a HMM directly to generate symbol sequences, since importance sampling changes the underlying sequence probabilities. Nevertheless, one still needs to compute the probabilities f^{HMM}(x) for the Monte Carlo acceptance procedure, i.e. the probabilities that x is the observed symbol sequence generated by the HMM using any feasible state sequence. These probabilities can be computed in (L · Γ^{2}) time using the well known forward algorithm as described in the following. One introduces the auxiliary variables f_{μ}(i), which correspond to the probability that the subsequence x_{1 }... x_{i }is generated by the HMM given that the last state variable Z_{i }has the value μ, i.e. f_{μ }(i) = Prob(X_{1 }... X_{i }= x_{1 }... x_{i}Z_{i }= μ). The overall probability is then f^{HMM}(x) = ∑ _{μ∈Γ }f_{μ }(L). The probabilities f_{μ }(i) can be determined by the recursion
Within the same time complexity the Viterbi algorithm computes the most probable state path for a given sequence of observations, that is
For this purpose one uses a different set of auxiliary variables: Let v_{μ }(i) be the probability of the most probable path ending in state μ ∈ Γ with observed partial output sequence x_{1}, ..., x_{i}. These values can be computed recursively by
with boundary condition . Note that these probabilities are not normalized, in particular . The missing normalization is no problem, since we are interested only in the most probable path, which is reconstructed by backtracking [2].
For the approach discussed in this section, the subject sequences are drawn almost as above, see below. The HMM approach we use to sample transmembrane queries is the TMHMM developed by Sonnhammer et. al. [25]. In this setting, the states are (structural) domains. Some of them are "tied", which means that they share the same emission probabilities. They are classified into seven groups:
• Helix core,
• two different groups of caps on either side,
• loops on the cytoplasmic side,
• short and long loops on the noncytoplasmic side,
• globular domains.
The internal structure of the helix core and loop module allows modeling different lengths of the corresponding protein domain by assigning jump probabilities. The globular domains have a selflooping structure and hence may also have various lengths. The other modules have fixed lengths. The overall number of model parameters is 216. Figure 3 shows the actual layout of TMHMM. Each box represents a group of "tied" states. The states corresponding to "helix core" represent the transmembrane helices that connect states of the cytoplasmic side and the noncytoplasmic side of the membrane. The prediction of the positions of the "helix core" states determines the loci of the special purpose scoring matrix SLIM for position specific alignment (see Figure 1).
Figure 3. The layout of the HMM for transmembrane proteins. The layout of the HMM for transmembrane proteins according to Sonnhammer et.al. [25]. Each box corresponds to a group of states. For example the helixcore block consists of 25 internal states. Line type of boxes represent different emission probabilities. For more details we refer the reader to the original publication.
The following MetropolisHastings update consists of two steps: First, the proposal of a new configuration from the neighborhood is made by inserting/replacing symbols with equal weights for all σ ∈ ∑ using one of the five MonteCarlo moves described above. The acceptance ratio Eq. (3) in that case is given by
The current and new number of TM regions n_{TM }and n'_{TM }are determined by the Viterbi algorithm applied on the sequence x and x' respectively. The calculation of the query probabilities is based on the TMHMM. The subject sequence probability is simply calculated according to Eq. (11). Note that, in each step, one of the two probabilities cancels, because only one of the two sequences is changed within each step, as above.
This approach allows us to sample noni.i.d. sequences with appropriate weights and to predict transmembrane helical regions that can be used in the position specific alignment scheme (as described in [23]) even for random sequences.
WangLandau sampling
The idea of importance sampling is to choose the weights w(·), such that the drawn events in the region of interest have a high probability to occur in the simulation. Ideally, P(S) is already known and in that case one might choose w(S) ∝ 1/P (S) on the entire range of interest. Then all states are visited with equal probability, and hence a at score histogram is achieved in the limit of infinite sample size. Still, for practical applications with finite sample size, the distribution of scores can be sampled with high accuracy over a large range of its support. This idea refers back to statistical physics and it is known as "generalized ensemble" or " flat histogram" methods. In the following we will denote this weights by w^{flat}.
Of course the true P(S) is unknown and the method requires some guesses which approximate w^{flat }to a suitable accuracy. The achieved score histogram becomes only approximatively flat. The true (unknown) distribution can then be estimated by reweighting the histogram of visited states using the importance sampling formula Eq. (7) for h_{s}.
Many iterative sampling schemes to achieve initial guesses had been developed in the 1990ies, for example entropic sampling [30], multicanonical sampling [31] and later transition matrix Monte Carlo [3234], only to mention a few. Here we use the WangLandau algorithm [35,36] to approximate w^{flat }as input for MetropolisHastings sampling.
The WangLandau algorithm explicitly violates detailed balance by dynamically updated weights depending on the visited states in the following way: First, a score range of interest [S_{min}, S_{max}] is chosen. The algorithm basically employs a histogram H(S) and weights w(S) defined on the desired score range. For more complicated models such as the (TMHMM), these objects are twodimensional depending on the score S and the class n_{TM}, i.e. H(S, n_{TM}) and w(S, n_{TM}). Furthermore, real valued parameters ϕ_{i }> 1 are used in each iteration i. Initially, the histogram values H(S, n_{TM}) are set to 0 in the desired range and all weights w(S, n_{TM}) to a constant, say 1. For the first iteration, i = 0, ϕ_{i }can be as large as e^{1}. Then, a simulation is performed using acceptance ratio Eq. (12) or Eq. (14). After each step, corresponding to one step of a (biased) random walk in the configuration space, w(S, n_{TM}) is updated as w(S, n_{TM}) ← w(S, n_{TM}) × ϕ_{i}, where S is the current score value and n_{TM }the subclass of the current state. Also the histogram H is updated by one H(S, n_{TM}) ← H(S, n_{TM}) + 1. In the literature this is often continued until an "approximately flat histogram" is achieved. A possible flatness criterion might be for all S, n_{TM}. Once the histogram is " flat", ϕ is decreased by the rule and all entries of the histogram H are set to 0 again, while w is kept for the next iteration. Note that the application of a flatness criterion is not essential for the good performance of the algorithm. It is enough to guarantee that all values of S have been visited, for example by requiring that the random walker has cycled several times through the interval of interest [S_{min}, S_{max}].
To summarize, we have the following recipe:
WANGLANDAU(w, ϕ, ϕ_{final}, N)
Input: Initial guess w[s, n], initial and final modification factors ϕ, ϕ_{final}, number of samples for production run N
Output: Histogram of visited scores, H(s, n):= number of samples with score s and class n and weights used in the production run w(s, n) for all s and n
1: ▷ Initialize and estimate w[s, n]
2: Pick any x, y ∈ and compute its null probability p := f^{query}(x) · f^{subject}(y);
4: compute z := V (x) using and determine corresponding class n;
5: while ϕ > ϕ_{final }do
6: H[s', n'] ← 0 for all possible score values s' and classes n'
7: while H[s', n'] is not at do
8: (x, y, z, p, s, n) ←
METROPOLISHASTINGSUPDATE (x, y, z, p, s, n,w);
9: H[s, n] ← H[s, n] + 1; w[s, n] w[s, n]/ϕ;
10: end while
12: end while
13: ▷ Obtain N samples from q and their score counts/histogram
14: H[s', n'] ←0 for all possible score values s' and classes n'
15: for i = 1..N do
16: H[s, n] ← H[s, n] + 1
17: repeat
18: (x, y, z, p, s, n) ←
METROPOLISHASTINGSUPDATE (x, y, z, p, s, n,w);
19: until mixing has occurred
20: end for
21: return counts H, weights w.
Due to the decreasing rule , the modification factor ϕ converges towards 1. The simulation is stopped when ϕ reaches a chosen threshold value which is close to 1. It turned out that in our case the range from ϕ_{0 }= exp(0.1) ≈ 1.105 to ϕ_{final }= exp(0.0002) ≈ 1.0002 has been proven valuable.
Since detailed balance is violated explicitly, the convergence of the algorithm can not be proven. For this reason one should always use the WangLandau part as a precomputation step just to obtain weights suitable w(S). After this, one performs a simulation with ϕ = 1 for data production, which corresponds to the MetropolisHastings algorithm.
Improvements
Of course there is much room for improvement. For example, consider the time evolution of the histogram H(S) for (RQGS) with L_{Q }= L_{S }= 348 up to S_{max }= 500 with Prob(S = S_{max}) ≈ 10^{65 }in Figure 4a.
Figure 4. Dynamics of the WangLandau algorithm. Typical time evolution of the histogram of visited states when starting with different initial guesses. The model parameters are RQGS with L_{Q }= L_{S }= 348. The weights have been updated dynamically with modification factor ϕ = exp(0.1) ≈ 1.105. (a) w(s) = 1 for all s. The Markov chain converges relatively slowly. (b) w(s) ≈ 1/Prob(S = sL_{Q }= 348, L_{S }= 200) has been used as an initial guess. The histogram becomes flatter within remarkable less computational effort. Inset: a detailed balance simulation (ϕ = 1 during the simulation of 1, 048, 576 steps) with initial weights that are close to the inverse target distribution. Though the histograms are not "flat", each score value on the interval [23, 500] has been visited. The estimate from this data can be used in a longer production run.
When starting with an initial guess w(S) = 1 for all S ∈ [23, 500], the random walker needed about 5.8 × 10^{5 }MonteCarlo steps for a round trip, i.e. to move from the lowest score S_{min }= 23 to the highest one S_{max }= 600 and back. The duration of a round trip is a measure of the mixing time of the corresponding Markov chain. Hence, the shorter the round trip is time, the faster the chain convergences. During the first round trip, the weights have been improved such that the second round trip (and further round trips) needed only 13% of the computational effort of the first one. Once the random walker has performed its first round trip, the typical round trip time does not change significantly. This tight bottleneck in the very early stage of the algorithm can be overcome by suitable initial guesses of w. In Figure 4b the time evolution of the same parameter set (RQGS with L_{Q }= L_{S }= 348) is shown except for the choice of the weights, which have been chosen as w(s) ≈ 1 = Prob(S = sL_{Q }= 348; L_{S }= 200), i.e. from a previous simulation of a different but similar setup. One observes that the histogram becomes "at" within a much smaller amount of MonteCarlo steps. Furthermore, the first roundtrip time decreases to 1.3 × 10^{5 }(i.e. 22% of the value for the naive guess w(s) = 1). From a practical point of view, this allows us to save computing time for two distributions with close by parameters (e.g. close by sequence lengths). One can use the results of one distribution as input for the second one. With this approach we may also explore the parameter space successively. In some cases it is sufficient to run a short batch run with the weights of a close by distribution and ϕ = 1, i.e. a detailed balance simulation, and then apply importance reweighting and use the so obtained approximation of P(s) for a longer production run. This kind of procedure is shown in the inset of Figure 4b: The detailed balance simulations were performed with L_{Q }= L_{S }= 348, whereas the weights w(s) came from a simulation with L_{S }= 320 and L_{S }= 400, respectively. The result shows that the histograms are not " flat" at all, but the distributions were close enough to visit all score values on the range of interest. In this successive way of iterations a broad range of the parameter space is accessible.
Estimation of the statistical error
Statistical analysis of Markovchain MonteCarlo data requires a careful inspection of correlation effects because the events depend on the history of the chain. This correlations vanish within a typical timescale: Events that are separated by a sufficient number of steps can be assumed to be independent. However, since MonteCarlo methods are only approximative, an assignment of statistical errors are requisite. In this study we used Flyvbjerg and Peterson's [37] blocking method to estimate the error.
Results
To our knowledge we present the first highly accurate score statistics for alignments with positionspecific scoring schemes. The alignment scores were calculated with the standard SmithWaterman algorithm with the BLOSUM62 matrix for the (RQGS) and a bipartite version BLOSUM62/SLIM for (FQPS) and (HMM) (see Figure 1). For the a fine gap costs we have chosen the standard values with a gapopen penalty of 12 and a gapextension penalty of 1, and UniProt symbol frequencies for i.i.d. sequences.
We discuss four different transmembrane proteins as queries (see Table 2) in the (FQGS) scheme. The results are shown in Figure 5, where the distributions of (FQGS) and (RQGS) are compared against each other. The subject lengths are set to the query lengths. For the production run of one distribution in Figure 5 (L_{Q }= LS = 348) 16,777,216 MetropolisHastings updates have been performed. This took about 16 hours on an Intel Pentium 4 with 3.4 GHz. The performance of the corresponding HMM is weaker for three reasons: Firstly, we are interested in a joint distribution for that we need more samples. Secondly, more proposals are rejected from the sampler due to the HMMweights and finally the computation of the forwardprobabilities requires additional floating point operations. The computation of 16,777,216 MetropolisHastings updates for this model costs about 45 CPU hours. We use an 8 times larger sample size in order to account for the first drawback. Hence, we put an overall computational effort on this model, which is 23 times as large as for (FQGS) and (RQGS) (apart from the WangLandau iterations).
Table 2. A selection of transmembrane proteins
Figure 5. Score distributions for (RQGS) and (FQPS) models. Score distributions for (RQGS) (classical) and (FQPS) models where the subject length equals the query length. In order to compare the shape, the distributions have been shifted by the center s_{0}. (a): Linear view; all distributions from the (RQGS) agree outside the tails (only two lengths are shown). The shape of the (FQPS) distributions is more variable. (b): Logarithmic view; significant differences between the two models appear in the tail of the distribution. High scores are more probable for the (FQPS) alignment. Furthermore the curvature, i.e. the deviation from the Gumbel form, is much larger for (FQPS) than for the classical model.
Here we observe in Figure 5b that on the log scale the curvature of the tail of the distribution, i.e., the deviation from the exponential tail of the pure Gumbel form Eq. (2), is more pronounced in the (FQPS) model: Significant differences of shapes already show up in the high probability region, which is accessible by simple sampling (Figure 5a). The (RQGS) distributions for different lengths match almost perfectly (only two lengths are shown), whereas the shape of the (FQPS) distributions varies slightly with the sequence type. This supports the observation of Müller et.al. [23] that positionspecific scoring in connection with a fixed query sequence may better discriminate between different sequences than the standard approach where two random sequences are compared with positionindependent scoring matrices.
The asymptotic theory for i.i.d. sequences predicts an EVD of the form of Eq. (2). The parameters λ > 0 and c > 0 depend on the score matrix, on the symbol frequencies f, and on the query and subject sequence lengths L_{Q }and L_{S}. Altschul and Gish [16] pointed out that asymptotic results where c = KL_{Q}L_{S }(with K > 0 a constant) need to be corrected by using effective sequence lengths. An alignment may extend to the end of either sequence and the score will be distorted towards lower values and high scores become less probable. In the limit of infinite sequences this effect vanishes and the tail of the Gumbel distribution can be understood as an upper bound for finite sequences. Indeed, we clearly see that the curves in Figure 5b are not straight lines in the right tail, but have negative curvature.
A better t to the empirical distribution is obtained by determining parameters s_{0}, λ > 0, λ_{2 }> 0 for a "modified" Gumbel distribution with
where s_{0 }can be interpreted as the center of the distribution. This corresponds to a EVD multiplied with a Gaussian correction factor, given by the last term. The parameter λ_{2 }is generally small (and thus shows its effect only in the far right tail). It vanishes for sequences of equal length as the length tends to infinity. Previously, such a correction has been proposed for (RQGS) statistics and has been computed for different parameter sets of BLOSUM62 and PAM250 with a ne gap costs [17,18].
More pronounced differences are seen in the behavior of the tail (Figure 5b), which is only accessible via importance sampling approach. The difference between the probabilities spans several orders of magnitude; hence a wrong choice of the model would falsify the estimation of significance drastically. Most importantly, the pmf obtained using the positionspecific scoring is considerably curved. Thus, using EVDs from fits to data of the highprobability region is even more questionable here than in the (RQGS) model, where the pmf is almost a straight line. Note that for the (RQGS) model, previous simulations [18] have already shown that for the special case of L_{S }= L_{Q}, the pmf converges for large sequence length indeed to an EVD.
Note that the Gaussian correction for local alignment parameterized by λ_{2 }is purely heuristic. Looking at the data, the shape in Figure 5b looks similar to the one of the TracyWidom distribution [38]. Interestingly, Majumdar and Nechaev [39] as well as Priezzhev and Schütz [40] obtained analytically the TracyWidom distribution as the asymptotic distribution for the model of the longest common subsequences which is closely related to global alignment. Also, Sardiu, Alves and Yu [41] observed that the the statistics of the score fluctuations of global alignment in the large probability region is compatible with the TracyWidom distribution. There could be a connection to our results, because in the rareevent tail, alignment lengths are of the order of the sequence lengths, hence the alignment is effectively global. Nevertheless all our results are obtained for finite sequence lengths. In contrast, distributions like Gumbel or TracyWidom are obtained in the asymptotic limit of infinite sizes of the underlying systems. Since finitesize corrections are hard to obtain, or even unknown, we do not attempt to determine the "true" shape of the distribution and are satisfied by our heuristic formula.
Next, we discuss the usefulness of the (FQPS) statistics in terms of retrieval performance. For this purpose we considered the ASTRAL compendium [42] version 1.75 with less than 40% identity to each other. It contains a set of reference proteins classified hierarchically based on their tertiary structure. It is a subset of the SCOP database with removed redundancy. The main hierarchy levels in the SCOP classification scheme are class, fold, superfamily, family. Proteins in the same class share the same type of secondary structure, whereas the fold level describes more specific the arrangement of the secondary structure. As the position specific scoring scheme is designed to be more sensitive to discriminate transmembrane proteins against others, its performance can be measured by searching a collection of transmembrane proteins from the ASTRAL set against the complete set. From the 10569 sequences in the database we have chosen 63 sequences which are classified as Membrane and cell surface proteins and peptides. For this collection we predicted the membrane regions using TMHMM. Each of those queries were searched against the complete set and ranked according to the pvalue on the basis of the (FQPS) statistics. The pvalue threshold under which we regard a hit as significant controls the so called receiver operating characteristic (ROC), i.e. the relationship between sensitivity vs. specificity. A transmembrane protein that appears below an pvalue threshold is referred as a true positive observation. Accordingly, proteins of all other SCOP classes below that value are false positives. The ROC space can be explored by changing the pvalue threshold. A small threshold produces less false positives but we may miss some hits. A larger value leads to more false positives. The ROC is usually illustrated by plotting the true positive rate (TPR = true positives/positives), or the sensitivity, against the false positive rate (FPR = false positives/negatives), 1 specificity. The result for the search characteristics for (FQPS) is shown in Figure 6. We did the same experiment for a BLAST search. In this case all observations are located in left bottom corner in the ROC space. This can be explained by the fact that we have only considered the highest ranked results below the Evalue threshold of 10 and many positives were beyond that value. This can also be seen in the inset of Figure 6 where we show the average number of membrane proteins found so far as a function of the rank in the result set. The line of slope 1 for (FQPS) at the begin of the result list means that virtually all hits are found in the correct SCOP class. In contrast, BLAST ranks transmembrane proteins at a high positions only randomly. Hence, the (FQPS) clearly outperforms the (RQGS) statistics.
Figure 6. Retrieval performance for the (FQPS) statstics. ROC curves (true positive rate vs. false positive rate) when searching TM proteins from the ASTRAL reference set against the complete ASTRAL set. Different symbols indicate different pvalue thresholds being used. Inset: sensitivity for (FQPS) compared with BLAST search. The plot shows the averaged number of observed helical proteins as a function of the rank in the result set.
The ROC curve in Figure 6 show the usefulness of the (FQPS) statistics for retrieval performance, but the extreme small pvalues where the the modification factor λ_{2 }plays a role are not essential for this purpose. The modified Gumbel statistics however affect a possible ranking of database search results, especially for sequences of different lengths. To illustrate this, we used BLAST to retrieve homologs of our four example proteins from the current Swissprot database. The scores were recomputed via the position specific SmithWaterman algorithm for (FQPS). We computed the corresponding pvalues from our simulation data and ranked the result set by the pvalue based on
1. the Gumbel distribution (λ_{2 }= 0) and
2. the accurate distribution (λ_{2 }> 0).
For subject sequence lengths that are not directly governed by our simulation directly we used interpolated fit parameters. In Table 3 we illustrate that the there are subjects whose relative order in the result set is switched when using the more accurate (FQPS) score distribution in contrast to the BLAST Evalue. This result might be important for applications of protein classification where the specific ranking of high scoring proteins is particularly important. However, on a global level the order of the hits persists, signaled by Kendall's rank correlation [43]. When comparing the order obtained between the cases λ_{2 }= 0 and λ_{2 }≠ 0 we measured a rank correlation = 0.986 for the query P08100. A τ of exactly 1 means identical, 0 unrelated, and, 1 exactly the reverse ranking. The rank correlation between the original BLAST ranking and the one based on the accurate pvalues is = 0.737.
Table 3. Change of ranking when using the modified Gumbel distribution
To investigate the impact of dissimilar query and subject lengths L_{Q }and L_{S }on the parameters of the modified Gumbel distribution, we vary L_{S }and consider the parameters λ and λ_{2 }as functions of the ratio L_{S}/L_{Q }(see Figure 7). The large gap between the values of λ for the two different models reflects the qualitative difference of the shape in the high probability regime. We see that in the (RQGS) model, λ is virtually independent of the query and the sequence length. However, in the model (FQPS), λ varies with each individual query, as expected. For λ_{2 }one has to distinguish between L_{S }< L_{Q }and L_{S }> L_{Q}. In the first case, λ_{2 }decreases, which is not surprising, since the correction term describes a finitesize effect and should vanish for increasing sequence lengths.
Figure 7. Fit parameters for (RQGS) and (FQPS) models. Dependence of the modified Gumbel parameters on the subject/query length ratio L_{S}/L_{Q}. The vertical line corresponds to Figure 5, where L_{S }= L_{Q}. (a): λ describes the bulk of the distribution (see Figure 5a) left). For L_{S }>L_{Q}, λ varies only slightly in the subject length. (b): The parameter λ_{2 }characterizes the curvature of the pmf in the tail (see Figure 5b). Large differences between (RQGS) and (FQPS) show up in the case where L_{S }>L_{Q}. λ_{2 }becomes subjectlength independent for L_{S }>L_{Q}.
Once the subject length exceeds the query length, the search space is still growing, but the finite length of the query enforces subject size independent edge effects.
For the (HMM), we approximate the score distribution within each class (number of helices = n). The shape of the distributions clearly agrees with the curvature for (RQGS) and (FQPS), and the modified Gumbel distribution could be fitted (see Figure 8) when the number of helices was not too small. This is indicated by a large reduced χ^{2 }value for distributions with a small number of helices. Also a visual inspection of the fit to the data supports this argument.
Figure 8. Score distributions for different alignment models. Score distributions for different alignment models (i.i.d., fixed query and TMHMM) with L_{S }= L_{Q }= 348. The distributions for the (HMM) have been obtained from the joint distribution.
The rareevent tail shows clear differences between the different subclasses of the model over several orders of magnitude. In Figure 9 the dependency of the fit parameters on the respective subclass of the model (Figure 9a and Figure 9b) as well as the dependency on the ratio L_{S}/L_{Q }(Figure 9c and Figure 9d) is shown. Note that for distributions that are not well described via Eq. (15), we only fitted the data in the high probability region. Those data points are left out in the plot for λ_{2 }in Figure 9b and are connected by dotted lines in Figure 9a.
Figure 9. Fit parameters for different alignment models. Fit parameters for score distributions P (S# of helices) for the (HMM) with a fixed query length L_{Q }= 348 and various subject lengths L_{S}. Both shape parameters λ and λ_{2 }decrease with increasing number of helices. The dependency on the subject length is stronger for λ_{2 }than for λ. For L_{S }>L_{Q }the dependency of λ_{2 }on the subject length is only of marginal order. The bars show the distribution of the number of transmembrane helices obtained by direct simulations of the (HMM). (c),(d): The L_{S}/L_{Q }dependency of λ and λ_{2 }extracted from the same data as (a),(b). The lines are guide to the eyes only. Dashed lines show the corresponding scaling behavior for the (FQRS) and (RQGS) models. The result for n = 2, that has been obtained from the high probability regions (see text), is indicated by dotted lines.
In analogy to (RQGS) and (FQPS), the curvature remains constant when L_{S }> L_{Q}. Regarding the dependence on the number of helices, the curvature decays with increasing number of transmembrane regions and then approaches an approximate constant value. Numerical values are provided in the Appendix for reference.
Discussion and conclusions
We have presented a simple universal numerical method to accurately sample the far right tail of the score distribution of various sequence comparison algorithms. It appears to be the first method that is applicable to all classical local alignment statistics, queryspecific and positiondependent score statistics, HMM calibration, statistics of normalized alignments, and many more. To sample the distribution using computer simulations, we use Markovchain Monte Carlo simulations, in particular the WangLandau approach is connection with the MetropolisHaistings algorithm. Apriori, the WangLandau approach does not require any assumption on the shape of the distribution (for example the parameters of the Gumbel distribution). The parameters can be estimated a posteriori by fitting the simulated distribution to an appropriate parametric form like Eq. (15). Here, we observed that for the (FQPS) model, the Gumbel distribution should be replaced by a more negatively curved one.
The method has a disadvantage: Because of the high number of samples required for nonparametric estimation of the distribution, it can presently not be used in online database search web services, such as a BLAST server. For example, generating the 16,777,216 samples for Figure 5 (L_{Q }= L_{S }= 348) took approximately 16 hours on an Intel Pentium 4 with 3.4GHz.
This is not as bad as it seems, though: Both the implementation and the design of the Markov chain have much room for improvement, e.g. we can choose different neighborhoods N(x) and optimize the weights in the generalized ensemble [44,45].
While this still prohibits interactive use, we see a lot of potential for our method to provide an improved version of the hmmcalibrate tool [22] and to explore the statistics of normalized sequence alignment [7].
During the preparation of this manuscript we came aware of a new related importance sampling method which is suitable for efficient pvalue computations for alignment statistics [46]. It makes use of simultaneous backward sampling of alignments and sequences. So far this method was applied to i.i.d. sequences but it should be possible to extend it to more complex model as well. We have tested it for the (FQPS) model as well. For the joint distribution of score and number of helices one would have to sample simultaneous the alignments, sequences and the hidden state sequence of the TMHMM.
Authors' contributions
SW developed the simulation program for (FQPS) and HMM based on an earlier version [18], ran the simulations and performed data analysis. SR and AKH designed the project. IH and SW developed the details for the TMHMM. All authors contributed to the manuscript and approved the final manuscript.
Appendix: modified Gumbel parameters
Acknowledgements
SW was supported by the German VolkswagenStiftung (program "Nachwuchsgruppen an Universitäten"). IH was supported by the NRW Graduate School in Bioinformatics and Genome Research, Bielefeld. The simulations were performed at the "Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen" and at the cluster of the Center for Biotechnology at the university of Bielefeld. Allocation of computing time is gratefully acknowledged.
References

Lesk AM: An Introduction to Bioinformatics. Oxford: Oxford University Press; 2005.

Durbin R, Eddy S, Krogh A, Mitchison G: Biological Sequence Analysis. Cambridge: Cambridge University Press; 1998.

Smith TF, Waterman MS: Identification of Common Molecular Subsequences.
J mol Biol 1981, 147:195197. PubMed Abstract  Publisher Full Text

Rabiner LR: A tutorial on hidden Markov models and selected applications in speech recognition.
Proceedings of the IEEE 1989, 77(2):257286. Publisher Full Text

Altschul S, Gish W, Miller W, Myers E, Lipman D: Basic Local Alignment Search Tool.
J Mol Biol 1990, 215:403410. PubMed Abstract

Hartmann AK: Practical Guide to Computer Simulations. Singapore: World Scientific; 2009.

Arslan AN, Egecioglu O, Pevzner PA: A new approach to sequence comparison: normalized sequence alignment.
Bioinformatics 2001, 17(4):327337. PubMed Abstract  Publisher Full Text

Bairoch A, Apweiler R, Wu CH, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M, Martin MJ, Natale DA, O'Donovan C, Redaschi N, Yeh LSL: The Universal Protein Resource (UniProt). [http://dx.doi.org/10.1093/nar/gki070] webcite
Nucleic Acids Res 2005, (33 Database):D154D159. PubMed Abstract  PubMed Central Full Text

Heinko S, Heinko J: Amino acid substitution matrices from protein blocks.
Proc Natl Acad Sci USA 1992, 89:1091510919. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Mercier S, Daudin JJ: Exact distribution for the local score of one i.i.d. random sequence. [http://dx.doi.org/10.1089/106652701752236197] webcite
J Comput Biol 2001, 8(4):373380. PubMed Abstract  Publisher Full Text

Karlin S, Altschul S: Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes.
Proc Natl Acad Sci USA 1990, 87:2264. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Gumbel E: Statistics of Extremes. New York: Columbia University Press; 1958.

Grossmann S, Yakir B: Large Deviations for global maxima of independent superadditive processes with negative drift and an application to optimal sequence alignments.
Bernoulli 2004, 10(5):829845. Publisher Full Text

Waterman MS, Vingron M: Rapid and accurate estimates of statistical significance for sequence data base searches.
Proc Natl Acad Sci USA 1994, 91(11):46254628. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Altschul SF, Bundschuh R, Olsen R, Hwa T: The estimation of statistical parameters for local alignment score distributions.
Nucleic Acids Res 2001, 29(2):351361. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Altschul S, Gish W: Local Alignment Statistics.
Meth Enzym 1996, 266:460. PubMed Abstract  Publisher Full Text

Hartmann A: Sampling rare events: Statistics of local sequence alignments.
Phys Rev E 2002, 65:056102. Publisher Full Text

Wolfsheimer S, Burghardt B, Hartmann A: Local sequence alignments statistics: deviations from Gumbel statistics in the rareevent tail. [http://www.almob.org/content/2/1/9] webcite
Algor Mol Biol 2007, 2:9. BioMed Central Full Text

Yu YK, Wootton JC, Altschul SF: The compositional adjustment of amino acid substitution matrices. [http://dx.doi.org/10.1073/pnas.2533904100] webcite
Proc Natl Acad Sci USA 2003, 100(26):1568815693. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Yu YK, Altschul SF: The construction of amino acid substitution matrices for the comparison of proteins with nonstandard compositions. [http://dx.doi.org/10.1093/bioinformatics/bti070] webcite
Bioinformatics 2005, 21(7):902911. PubMed Abstract  Publisher Full Text

Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSIBLAST: a new generation of protein database search programs.
Nucleic Acids Res 1997, 25(17):33893402. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Eddy S: [ftp://selab.janelia.org/pub/software/hmmer/CURRENT/Userguide.pdf] webcite

Müller T, Rahmann S, Rehmsmeier M: Nonsymmetric score matrices and the detection of homologous transmembrane proteins. [http://bioinformatics.oxfordjournals.org/cgi/content/abstract/17/suppl_1/S182] webcite

Eddy SR: A Probabilistic Model of Local Sequence Alignment That Simplifies Statistical Significance Estimation. [http://dx.doi.org/10.1371%2Fjournal.pcbi.1000069] webcite
PLoS Comput Biol 2008, 4(5):s1000069. Publisher Full Text

Sonnhammer EL, von Heijne G, Krogh A: A hidden Markov model for predicting transmembrane helices in protein sequences. In Proc. Sixth Int. Conf. on Intelligent Systems for Molecular Biology. Edited by JG, et al. AAAI Press; 1998:175182.

Krogh A, Larsson B, von Heijne G, Sonnhammer ELL: Predicting transmembrane protein topology with a hidden markov model: application to complete genomes. [http:/ / www.sciencedirect.com/ science/ article/ B6WK7457D7V9K/ 2/ 0367078014042718f39416a2c3ddeeb3] webcite
J Mol Biol 2001, 305(3):567580. PubMed Abstract  Publisher Full Text

Hastings WK: Monte Carlo Sampling Methods Using Markov Chains and Their Applications.
Biometrika 1970, 57:97109. Publisher Full Text

Newman MEJ, Barkema GT: Monte Carlo Methods in Statistical Physics. Oxford: Clarendon Press; 1999.

Rubinstein RY, kroese DP: Simulation and the Monte Carlo Method. Hoboken, New Jersey: Wiley; 2008.

Lee J: New Monte Carlo algorithm: Entropic sampling.
Phys Rev Lett 1993, 71(2):211214. PubMed Abstract  Publisher Full Text

Berg BA, Neuhaus T: Multicanonical ensemble: A new approach to simulate firstorder phase transitions.
Phys Rev Lett 1992, 68:9. PubMed Abstract  Publisher Full Text

Wang JS, Tay TK, Swendsen RH: Transition Matrix Monte Carlo Reweighting and Dynamics.
Phys Rev Lett 1999, 82(3):476479. Publisher Full Text

Wang JS: Transition matrix Monte Carlo method. [http:/ / www.sciencedirect.com/ science/ article/ B6TJ53Y0HM2TT/ 2/ 3377e3546795e04c63dc23b6982b7459] webcite
Comput Phys Commun 1999, 121122:2225. Publisher Full Text

Wang JS, Lee LW: Monte Carlo algorithms based on the number of potential moves. [http:/ / www.sciencedirect.com/ science/ article/ B6TJ5404H3KNN/ 2/ e62d53facfd5d82de4b029380ea99a78] webcite
Comput Phys Commun 2000, 127:131136. Publisher Full Text

Wang FG, Landau DP: Efficient, multiplerange random walk algorithm to calculate the density of states.
Phys Rev Lett 2001, 86:2050. PubMed Abstract  Publisher Full Text

Wang FG, Landau DP: Determining the density of states for classical statistical models: A random walk algorithm to produce a flat histogram.
Phys Rev E 2001, 64:056101. Publisher Full Text

Flyvbjerg H, Petersen HG: Error estimates on averages of correlated data. [http://link.aip.org/link/?JCP/91/461/1] webcite
The Journal of Chemical Physics 1989, 91:461466. Publisher Full Text

Tracy CA, Widom H: On orthogonal and symplectic matrix ensembles. [http://dx.doi.org/10.1007/BF02099545] webcite
Communications in Mathematical Physics 1996, 177(3):727754. Publisher Full Text

Majumdar SN, Nechaev S: Exact asymptotic results for the Bernoulli matching model of sequence alignment.
Phys Rev E 2005, 72(2):020901. Publisher Full Text

Priezzhev VB, Schütz G: Exact solution of the Bernoulli matching model of sequence alignment. [http://iopscience.iop.org/17425468/2008/09/P09007/] webcite
Journal of Statistical Mechanics: Theory and Experiment 2008, 2008(09):P09007.
(11 pp)
Publisher Full Text 
Sardiu ME, Alves G, Yu Y: Score statistics of global sequence alignment from the energy distribution of a modified directed polymer and directed percolation problem.
Phys Rev E 2005, 72:061917. Publisher Full Text

Chandonia JM, Hon G, Walker NS, Lo Conte L, Koehl P, Levit t M, Brenner SE: The ASTRAL Compendium in 2004. [http://nar.oxfordjournals.org/cgi/content/abstract/32/suppl_1/D189] webcite
Nucl Acids Res 2004, 32(suppl_1):D189192. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Kendall M, Gibbons JD: Rank Correlation Methods. 5th edition. London: Edward Arnold, a div. of Hodder & Stoughton; 1976.

Dayal P, Trebst S, Wessel S, Würtz D, Troyer M, Sabhapandit S, Coppersmith SN: Performance Limitations of FlatHistogram Methods. [http://link.aps.org/abstract/PRL/v92/e097201] webcite
Phys Rev Lett 2004, 92(9):0972014. PubMed Abstract  Publisher Full Text

Trebst S, Huse DA, Troyer M: Optimizing the ensemble for equilibration in broadhistogram Monte Carlo simulations. [http://link.aps.org/abstract/PRE/v70/e046701] webcite
Phys Rev E 2004, 70(4):046701. Publisher Full Text

Newberg LA: Significance of Gapped Sequence Alignments. [http://www.liebertonline.com/doi/abs/10.1089/cmb.2008.0125] webcite
Journal of Computational Biology 2008, 15(9):11871194.
[PMID: 18973434]
PubMed Abstract  Publisher Full Text  PubMed Central Full Text