Accurate statistics for local sequence alignment with position-dependent scoring by rare-event sampling

Wolfsheimer, Stefan; Herms, Inke; Rahmann, Sven; Hartmann, Alexander K

doi:10.1186/1471-2105-12-47

Research article
Open access
Published: 03 February 2011

Accurate statistics for local sequence alignment with position-dependent scoring by rare-event sampling

Stefan Wolfsheimer¹,
Inke Herms²,
Sven Rahmann³ &
…
Alexander K Hartmann⁴

BMC Bioinformatics volume 12, Article number: 47 (2011) Cite this article

5300 Accesses
6 Citations
Metrics details

Abstract

Background

Molecular database search tools need statistical models to assess the significance for the resulting hits. In the classical approach one asks the question how probable a certain score is observed by pure chance. Asymptotic theories for such questions are available for two random i.i.d. sequences. Some effort had been made to include effects of finite sequence lengths and to account for specific compositions of the sequences. In many applications, such as a large-scale database homology search for transmembrane proteins, these models are not the most appropriate ones. Search sensitivity and specificity benefit from position-dependent scoring schemes or use of Hidden Markov Models. Additional, one may wish to go beyond the assumption that the sequences are i.i.d. Despite their practical importance, the statistical properties of these settings have not been well investigated yet.

Results

In this paper, we discuss an efficient and general method to compute the score distribution to any desired accuracy. The general approach may be applied to different sequence models and and various similarity measures that satisfy a few weak assumptions. We have access to the low-probability region ("tail") of the distribution where scores are larger than expected by pure chance and therefore relevant for practical applications. Our method uses recent ideas from rare-event simulations, combining Markov chain Monte Carlo simulations with importance sampling and generalized ensembles. We present results for the score statistics of fixed and random queries against random sequences. In a second step, we extend the approach to a model of transmembrane proteins, which can hardly be described as i.i.d. sequences. For this case, we compare the statistical properties of a fixed query model as well as a hidden Markov sequence model in connection with a position based scoring scheme against the classical approach.

Conclusions

The results illustrate that the sensitivity and specificity strongly depend on the underlying scoring and sequence model. A specific ROC analysis for the case of transmembrane proteins supports our observation.

Background

A large amount of molecular biological data is stored in form of sequences of symbols in huge data bases, e.g., DNA sequences or the primary structure of proteins. It is one main task of Bioinformatics [1] to develop algorithms [2] which allow to find, via sequence comparison, for a given "query" sequence the most similar "subject" sequences in a data base. All widely-used algorithms depend on many parameters, the so-called scoring schemes[2], which are usually suitably adopted to test sets of data. Parts of these scoring schemes involve also rules how to deal with small non-similar subsequences, the so-called gaps. The most popular sequence-comparison algorithms are the Smith-Waterman algorithm [3] for pairwise local sequence alignment (which means that the most similar subsequences of two sequences are found) and the Viterbi algorithm for sequence-to-HMM alignment. In the latter case, one specific sequence is compared to a full set of sequences which is specified via a Hidden Markov Model (HMM) [4]. For practical applications, often very fast heuristics are used, which do not find the exact best-matching (sub-)sequences but only approximations of the optimum. The BLAST algorithm [5] is used widely. Sequence-comparison algorithms return a raw similarity score, i.e., a number, that quantifies the similarity between the input objects. Unfortunately, this raw score is hard to interpret because one does not know its absolute scale.

An interpretation becomes possible when we specify a probabilistic null model for the input: Then the similarity score becomes a random variable S whose probabilities Prob(S = s) under the null model can be determined. Sometimes this can be done analytically, but usually one has to apply numerical simulation [6]. The p-value assigned to an observed score s is defined as pval(s): = Prob(S ≥ s) in the null model, and log pval(s) is a measure of surprise (and hence a universally normalized score) for s. The key problem is, of course, to find Prob(S = s) for a given sequence-comparison algorithm, a given scoring scheme, and a given null model.

In this paper, we explain and extend an efficient and generally applicable numerical technique that solves this problem in many different sequence comparison settings, such as for a BLAST-like database search [5] with a fixed query, for position-specific scoring and/or gap-cost schemes (essentially HMMs), or for normalized alignment [7]. In each of those settings a variety of null models in addition to the i.i.d. model is possible.

Previous work

We start by introducing some necessary formal notations. For a full description, please refer to Ref. [2]. Let Σ be a fixed alphabet of symbols, denoting e.g. nucleotides (|Σ| = 4) or amino acids (|Σ| = 20).

Most of the existing statistical work for pairwise sequence comparison focuses on null models where both sequences are random and at each position a symbol σ ∈ Σ is chosen independently of the other positions ("i.i.d. model"), with a given frequency $f_{σ} > 0 (Σ_{σ \in Σ} f_{σ} = 1)$ . f often reflects the average composition of proteins in the UniProt/SwissProt database [8]. An alignment of the two sequences is a set of pairs {(i_k , j_k )}, which means that symbol of position i_k of the first sequence is aligned (or paired) to the symbol at position j_k of the second sequence. The pairs must not cross, i.e., if for two pairs (i_k , j_k ) (i_l , j_l ) the condition i_k < i_l holds, then also j_k < j_l . Positions which are not paired, i.e., which do not appear in the alignment, contribute to the above mentioned gaps. The length of a gap is the number of adjacent ungapped positions. In the following example, where the two sequences are shown such that paired symbols are atop on each other,

\begin{array}{l} \begin{matrix} Q & G & E & G & G & D & A & - & - & - & W & C \\ Q & G & - & - & G & D & A & T & T & T & W & C & \Leftrightarrow \end{matrix} \\ {(1, 1), (2, 2), (5, 3), (6, 3), (7, 5) (8, 9), (9, 10)} \end{array}

two gaps of lengths two and three appear.

Scores for individual pairs of symbols are given by a constant (position-independent) symmetric Σ × Σ scoring matrix with negative expected score, such as BLOSUM62 [9]. The score of an alignment is given by the sum of the scores of the pairs, plus (negative) contributions for the gaps. For the conventional gap-scoring schemes (called "affined"), each gap contributes a score which depends only linearly on its length (times a parameter called gap-extension penalty) plus a constant (gap-open penalty). We shall refer to this model later as "random query - general-purpose scoring" (RQGS).

For gapless pairwise local sequence alignment, the raw score distribution can be derived numerically by Markov chain analysis [10] and also asymptotically for infinite sequences (Karlin-Altschul or Dembo-Karlin statistics [11]): It is an extreme-value distribution (EVD), also called Gumbel distribution [12]:

\begin{matrix} Prob (S > s) & = & 1 - \exp [- c \cdot \exp (- λ s)] \\ \Leftrightarrow \\ Prob (S > s) & = & \begin{array}{l} c \cdot λ \cdot \exp (- λ s) \times \\ \exp [- c \cdot \exp (- λ s)] \end{array} \end{matrix}

(1)

where the parameters λ > 0 and c > 0 depend on the score matrix, on the symbol frequencies f, and on the query and subject sequence lengths L _Q and L _S. Asymptotically we have c = KL _Q L _S for a length-independent K > 0.

For gapped pairwise local sequence alignment, which is the most relevant case in database queries there exist no universal analytic results, with the exception of few special cases [13]. Empirical evidence also indicates convergence towards the Gumbel form for long sequences; λ and K additionally depend on the gap-cost function [14]. Several works have focused on efficient numerical estimation of these parameters [15]. The influence of varying lengths of the finite sequences [16] is treated in various ways, e.g. by adjusting the lengths of the sequences to "effective lengths" but still assuming a Gumbel form of the distribution. Nevertheless, for moderate sequence lengths, which are biologically most relevant, the true distribution differs strongly from a Gumbel form [17, 18], which can be dealt with by including a correction term to the Gumbel form (giving rise to an additional parameter).

The (RQGS) model is convenient, because the problem of computing significance values reduces to the estimation of only two parameters, which can be precomputed for each scoring scheme. However, there are also several problems. For instance, even if one considers just the gapless case, it is in general not easy to extend the analytic asymptotic theory to more complex null models. Furthermore, for practical applications where finite sequence lengths are considered, of even more importance is: The p-values reported by (the original) BLAST only depend on the raw score, the query and the subject length, and not on the actual query sequence. This leads to large distortions when the composition of the query sequence does not match the composition of the null model. For example, when we run a homology search for the human transmembrane protein rhodopsin (UniProt accession P08100) with BLAST (BLOSUM 62, gap-init 12, gap-extend 1, no composition adjustment, no filtering), we find a possibly remote homolog Q8NH42 with an E-value of 9 · 10^-8. The E-value for score s is the expected number of database hits with score at least s and can be easily computed from pval(s) and the database size. Hence, it appears unlikely to obtain such a homolog by pure chance, i.e., the homolog appears to be relevant. However, using a recent "composition-based adjustment" option [19, 20] leads to a very different E-value of 0:001 for the same protein. This underlines the importance of query-specific or at least composition-based statistics, particularly for intermediate p-values.

The statistics of position-dependent scoring and/or gap-cost schemes, as used in PSI-BLAST [21] or in hidden Markov model (HMM) frameworks, are less well explored. The central question here is, "given a query Q and a position-specific scoring scheme, what is the score distribution when random null-model sequences of given length are scored against Q?". We refer to this model as "fixed query - position-dependent scoring" (FQPS). As a compromise between the general (RQGS) and the very specific (FQPS) models, one may neither use a completely free (i.i.d) nor use a fixed query but draw query sequences according to more specific models, e.g., HMMs for transmembrane proteins.

In all these cases EVDs Eq. (2) may still used heuristically by fitting the parameters of the EVD to the simulated data. This can be achieved by generating pairs of random sequences according to the given null model while recording the histogram of observed alignment scores. Using such a "simple sampling" approach, the large-probability region of the score distribution can be investigated, e.g., for probabilities about > 10^-4 when generating 10⁴ sequence pairs. Such an approach is implemented, e.g. in the hmmcalibrate program from the HMMER package [22]. Nevertheless, this procedure may fail to describe distribution in the "rare-even tail", i.e., where the probability is small (say < 10^-4), although this part of the distribution is most important for the estimation of the statistical significance.

Our motivation for a simulation-based method that makes no initial parametric assumption refers to the approach [23] to increase the sensitivity of detecting homologs of a given transmembrane (TM) protein in a database search: A bipartite scoring scheme with a (non-symmetric) transmembrane helix specific scoring matrix (such as SLIM [23]) for the TM helices and a general-purpose scoring matrix (such as BLOSUM [9]) for the remaining regions of the query protein were applied, see Figure 1. This results in higher search sensitivity and specificity. However, a statistical theory or efficient computational method to obtain the score probabilities in such a (FQPS) framework is missing so far.

Our contributions and paper outline

We present a general framework for efficient estimation of raw score distributions in sequence comparison problems. In particular the rare-event tail for large scores can be accessed. We only make the following assumptions:

1.
We are able to sample pairs x, y of sequences according to the null model and to compute the null model probability of any given x, y.
2.
We have an efficient algorithm $A$ that computes the score S(x, y), where x, y could be a pair of randomly drawn sequences (RQGS or HMM), or one fixed and one random sequence (FQPS).
3.
The scores are rational numbers with a common denominator. Hence, without loss of generality, they can be assumed to be integers.
4.
Optionally for the (HMM) approach, we have an efficient algorithm $V$ that predicts the most likely state sequence for a given sequence.

Our framework is readily applicable to the (RQGS), (FQPS) and (HMM) models, but also to more exotic settings, such as normalized alignment [7], where the score is not additive, but normalized by the alignment length, for which no statistical framework exists so far. Very recently Eddy [24] studied the distributions of Viterbi and Forward scores under probabilistic local alignment, for which a numerical analysis of the rare-event tail would be of interest as well.

In the current stage of the methodology, the computation of an accurate "on the fly" p-value for each particular database query might be impracticable as each full calculation is not achieved within a few minutes.

We will illustrate the approach for the HMM for TM proteins (TMHMM [25, 26]), which has been proven valuable in predicting TM helices. In this approach (and possible in other models as well) one is able to specify the score distribution in more detail. Each query may be classified to be a member of certain given sub-classes $C$ . In this case, it could be meaningful to obtain an individual score distribution for each subclass. A natural classification of the TMHMM is the number of transmembrane regions. Since for a given sequence this number is usually not known exactly, one takes the most likely one.

The rest of the paper is organized as follows. The following section presents the mathematical background on importance sampling and Markov chain Monte Carlo methods which are fundamental to the methods used to obtain the score distribution, in particular in the rare-event tail, for different null models. Next, we present a description of the methodology. Section "Results" shows computational results on transmembrane protein similarity statistics in (RQGS), (FQPS) and (HMM). A discussion closes the paper.

Methods

Importance sampling

Importance sampling is a general technique to reduce the variance in the estimation of quantities that can be written as an expectation $E [h (Z)]$ , where Z is a random object representing the null model and h is a real-valued function. We assume that we can draw within a computer simulation n random samples z₁, ..., z_n according the null model. The expectation is then approximated by the empirical mean $E [h (Z)] \approx 1 / n \cdot \sum_{i = 1}^{n} h (z_{i})$ .

In our setting, to estimate the score distribution (and then p-values), we consider the state space $Z = Σ^{L_{Q}} \times Σ^{L_{S}}$ , from which we generate n random pairs of sequences({x⁽¹⁾ , y⁽¹⁾} ,..., {x⁽ⁿ⁾, y⁽ⁿ⁾}).

These pairs are then aligned by a given algorithm $A$ and the corresponding similarity scores S(x⁽ⁱ⁾, y⁽ⁱ⁾) are computed. To formally write a histogram as an expectation value, we consider the family of indicator functions $h_{s} = Z \to {0, 1}$ for all s > 0, defined by h_s (x, y) := 1 if S(x, y) = s, and h_s (x, y) := 0 if S(x, y) ≠ s, hence

\begin{array}{l} Prob (S (X, Y) = s) & = & E [h_{s} (X, Y)] \\ \approx & | {i : S (x^{(i)}, y^{(i)}) = s} | / n \end{array}

This means, we approximate the unknown exact probability Prob(S(X, Y)) by normalized score histograms over all sampled sequence pairs. If the probability to be estimated is small, say 10^-9, when using simple sampling, we need about 10¹² samples to estimate it with reasonable precision. Thus, for very rare events, this sampling quickly becomes infeasible.

Importance sampling generates the "interesting" events more often by sampling from a different distribution and correcting for this bias afterward, which results in a more accurate estimate with a reasonable number of samples. Let p be the probability mass function (pmf) of (X, Y), and let q be another pmf satisfying q(x, y) > 0 whenever p(x, y) > 0. Then

\begin{matrix} E_{p} [h (X, Y)] = \sum_{x, y} h (x, y) \cdot p (x, y) \\ = \sum_{x, y} h (x, y) \cdot \frac{p (x, y)}{q (x, y)} \cdot q (x, y) \\ = E_{q} [h (X^{'}, Y^{'}) \frac{p (X^{'}, Y^{'})}{q (X^{'}, Y^{'})}] \\ \approx \frac{1}{n} \sum_{i = 1}^{n} h ({x^{'}}^{(i)}, {y^{'}}^{(i)}) \cdot \frac{p ({x^{'}}^{(i)}, {y^{'}}^{(i)})}{q ({x^{'}}^{(i)}, {y^{'}}^{(i)})}, \end{matrix}

(2)

where each pair (x'⁽ⁱ⁾, y'⁽ⁱ⁾) is sampled from the pmf q. Eq. (2) gives us the relationship between the expectation value w.r.t. the unknown distribution of interest, the target distribution p, and samples drawn from the actually used sampling distribution q. To successfully apply importance sampling, q has to fulfill three properties: First, it needs to put high probability on the region of interest; second, we need to be able to sample according to q; third, we need to be able to compute the correcting weight p(x, y)/q(x, y). Since directly sampling from q often is impossible, we shall use a general sampling method which we describe next.

Metropolis-Hastings sampling

If we need to generate samples from a discrete (or continuous) distribution q but have no simple direct method to do so, the Metropolis-Hastings method [27] provides a solution by constructing an ergodic Markov chain with stationary distribution q in the following way.

Extensive introductions to such so-called Monte Carlo simulations can be found, e.g., in Refs. [28, 29]. Here we just give a concise introduction specifically tailored to the problem of sampling pairs of sequences. Let us call the elements (x, y) of the sample space $Z$ configurations. The sampling will be performed by randomly "moving" in the space of configurations, such that for each step a configuration is altered only slightly. Configurations which can be visited and are connected by a move are called neighbors. Hence, each configuration (x, y) has a algorithm-dependent set $N (x, y)$ of potential neighbors (x', y') ∈ $N (x, y)$ . The movement is performed in such a way that each neighbor exhibits a positive probability P_{(x,y), (x',y')}as being the next configuration. The proposal is accepted with probability

\begin{array}{l} α ((x, y) \to (x^{'}, y^{'})) \\ = & \min {1, \frac{q (x^{'}, y^{'}) \cdot P_{(x^{'}, y^{'}), (x, y)}}{q (x, y) \cdot P_{(x, y), (x^{'}, y^{'})}}}, \end{array}

(3)

in which case (x', y') becomes the new current configuration. Otherwise (x', y') is discarded and (x, y) remains unchanged. Thus, a sampling algorithm is specified via the neighborhoods $N (x, y)$ and the proposal probabilities P_{(x,y),(x', y')}

The acceptance criterion Equation (3) is quite general. By using a symmetric proposal probability matrix, P_{(x',y') (x, y)}= P_{(x,y), (x',y')}, the relationship simplifies to

α ((x, y) \to (x^{'}, y^{'})) = \min {1, \frac{q (x^{'}, y^{'})}{q (x, y)}} .

(4)

Since the distribution q appears only in form of a ratio we need to be able to compute q(x, y) only up to a normalization constant. If q(x', y') > q(x, y) we accept the proposal and otherwise the proposal is accepted as new configuration with the finite probability q(x', y')/q(x, y). Hence, we need to choose the neighborhood of x, y such that that the ratio $\frac{q (x^{'}, y^{'})}{q (x, y)}$ is not too far away from one. Otherwise, virtually only proposals that increase the probability are accepted and the sampling procedure gets stuck at local maxima.

Equation (4) and its generalization Equation (3) describe Markov chains in the configuration space $Z = Σ^{L_{Q}} \times Σ^{L_{S}}$ with the transition matrix T_{(x, y), (x', y')}= P_{(x, y), (x', y')}·α((x, y) → (x', y')).

For an appropriate choice of the neighborhoods $N (x, y)$ and of the proposal distribution P_{(x, y), (x', y')}, the so-constructed Markov chain is ergodic (each configuration can be reached from any starting configuration with finite probability). Furthermore, one can show that the detailed balance condition

q (x, y) T_{(x, y) (x^{'}, y^{'})} = q (x^{'}, y^{'}) T_{(x^{'}, y^{'}) (x, y)},

(5)

which is fulfilled due to the choice of α(.) according to Eq. (3), implies that the chain converges towards the desired sampling distribution q.

We say that the chain has reached equilibrium when convergence has occurred up to numerically negligible error. Thus, if the configuration (x, y) is sampled after equilibration, it will behave like a sample from q. In practice, the exact speed of equilibration is unknown and convergence diagnostics are applied (see below). Several (almost) independent samples are obtained by running the chain further and taking a sample every k-th step for sufficiently large k to allow time for "forgetting" (i.e., decorreclating from) the state of the last sampled configuration. This time is usually referred as mixing time.

Implementation

In this section we show how the sampling algorithm for pairs of sequences is actually designed, based on the background given in the two preceding sections, such that the tails of the probability distributions for the scores can be addressed. The crucial point of the Metropolis-Hastings update is the choice of an appropriate neighborhood $N (x, y)$ (and the related proposal probabilities) and the computation of the probabilities of newly proposed states q(x', y'). The neighborhood should be chosen such that the acceptance rate Eq. (3) is between 0.3 and 0.7. We shall factorize the (un-normalized) pmf q in two contributions, firstly weights w: ℤ → ℝ⁺ that assign each score value of interest a weight and secondly the null probability, i.e.

q (x, y) = w (S (x, y)) \cdot p (x, y) .

(6)

Note that we will leave w(·) undetermined for a moment, until Section "Wang-Landau Sampling". The importance reweighting equation Eq. (2) for h_s is then

\begin{matrix} Prob(S = s) = E [h_{s} (X, Y)] \\ = \sum_{x, y} h_{s} (x, y) \cdot p (x, y) \\ \approx \frac{1}{Z} \sum_{i = 1}^{n} \frac{h_{s} ({x^{'}}^{(i)}, {y^{'}}^{(i)})}{w (s)} \end{matrix}

(7)

with the normalization constant

Z = \sum_{s} \sum_{i = 1}^{n} \frac{h_{s} ({x^{'}}^{(i)}, {y^{'}}^{(i)})}{w (s)}

For the (HMM) a single distribution of scores is not sufficient: Each query is a member of a certain sub-class characterized by the number of transmembrane regions "# of TM helices" to be determined by the Viterbi algorithm (see below). Thus, each class has its own probability P_n (s) = Prob(S = s, # of TM helices = n_TM). In order to take this property into account, we deal with the joint probability Prob(S = s, # of TM helices = n_TM). Accordingly, the weights have a two dimensional domain, we write w(s, n). Also h_s in Eq. (7) is replaced by an indicator function h_s,n that depends on two parameters: h_s,n (x, y) equals 1 if S(x, y) = s and # of TM helices of x = n_TM. The sampling distribution is generalized to

q (x, y, n_{TM}) = w (S (x, y), n_{TM}) \cdot p (x, y),

(8)

and the the reweighting relationship reads as

\begin{array}{l} Prob(S = s, N = n_{TM}) \\ = & E [h_{s, n_{TM}} (X, Y)] \\ = & \sum_{x, y} h_{s, n_{TM}} (x, y) \cdot p (x, y) \\ \approx & \frac{1}{Z} \sum_{i = 1}^{n} \frac{h_{s, n_{TM}} ({x^{'}}^{(i)}, {y^{'}}^{(i)})}{w (s, n_{TM})}, \end{array}

(9)

with, in this case, $Z = \sum_{s, n_{TM}} \sum_{i = 1}^{n} \frac{h_{s} ({x^{'}}^{(i)}, {y^{'}}^{(i)}, n_{TM})}{w (s, n_{TM})}$ .

Generally the occurrence of two sequences x = $x = x_{1} ... x_{L_{Q}}$ and $y = y_{1} ... y_{L_{S}}$ is characterized by the null probability

\begin{array}{l} p (x, y) = Prob(X = x, Y = y) \\ = & f^{query} (x_{1} ... x_{L_{Q}}) \cdot f^{subject} (y_{1} ... y_{L_{S}}) \\ = & f^{query} (x) \cdot f^{subject} (y) . \end{array}

This simple factorization allows us to draw proposals for the query and for the subject independently. Hence, for simplicity, a neighboring configuration will leave one of the two sequences unchanged. Thus, for selecting a neighboring configuration, first one of the two sequences is chosen at random with probability 1/2. In the case of (FQPS) the subject is always chosen. Then one sequence is chosen from the neighborhood of the selected sequence, as described next. Formally, this means for (RQGS) and (HMM) we use the factorized proposal densities P_{(x, y), (x', y')}= 0.5P_{x, x'}1_{y, y'}or P_{(x, y) (x', y')}= 1_{x, x'}P_{y, y'}(1_{y, y'}denotes the indicator function which is only one if y = y', 0 else. P_{x, x'}denotes the proposal of a single sequence) depending on the choice of sequence in the first step.

Proposal densities for (FQPS) and (RQGS)

In the simplest case either both sequences are i.i.d. or the query is fixed (to some sequence $\tilde{x}$ ) and the null-model probabilities of their occurrence factorize, i.e.

f^{query} (x) = {\begin{array}{l} f^{iid} (x) = \prod_{i = 1}^{L_{Q}} f_{x_{i}} & for (RQGS) and \\ 1_{{x = \tilde{x}}} & for (FQPS) \end{array}

(10)

and of course in both cases

f^{subject} (y) = f^{iid} (y) = \prod_{i = 1}^{L_{s}} f_{y_{i}}

(11)

Due to the factorization that occurs in Eq. (10) it is possible to draw sequences from $N (x)$ such that the detailed balance condition f^iid(x) P_x,x'= f^iid(x') P_{x', x}is fulfilled by the following set of Monte Carlo moves (see also Figure 2 and Table 1)

Table 1 Monte Carlo operations

Full size table

a)
substitution of a single symbol at position k,
b)
insertion of a single new symbol at position k with left shift (deletion of the first symbol),
c)
insertion of a single new symbol at position k with right shift (deletion of the last symbol),
d)
deletion of a single symbol at position k with right shift and insertion of a single new symbol at the beginning,
e)
deletion of a single symbol at position k with left shift and insertion of a single new symbol at the end.

Operation a) appears with probability 1/2 and the other ones with probability 1/2 · 1/4 each. This is one possible choice that guarantees detailed balance.

Note that all sequences in $N (x)$ have the same length and each operation involves a replacement of an existing symbol with a newly drawn symbol, in case a) by a direct substitution and in the cases b)-e) indirect via a shift operation. Each position of a sequence has the same probability of being chosen and the replaced symbol is chosen in all cases according to the frequencies f_σ (σ ∈ ∑).

With this construction the Metropolis-Hastings ratio Eq. (3) simplifies to the special case of the Metropolis algorithm, i.e.

\begin{array}{l} α ((x, y) \to (x^{'}, y^{'})) \\ = & \min {1, \frac{q (x^{'}, y^{'}) \cdot P_{(x^{'}, y^{'}), (x, y)}}{q (x, y) \cdot P_{(x, y), (x^{'}, y^{'})}}} \\ = & \min {1, \frac{w (S (x^{'}, y^{'})) p (x^{'}, y^{'}) \cdot P_{(x^{'}, y^{'}), (x, y)}}{w (S (x, y)) p (x, y) \cdot P_{(x, y), (x^{'}, y^{'})}}} \\ = & \min {1, \frac{w (S (x^{'}, y^{'}))}{w (S (x, y))}} . \end{array}

The right part in the second line cancels, because all contributions to p(x, y) and p(x', y') where symbols from (x, x') and y, y) agree cancel directly and for the few remaining letters, the proposal probability in the nominator contains exactly the frequencies of the null probability in the denominator, and vice versa. Thus, the acceptance rate depends only on the score value of the current configuration S(x, y) and the one of the proposal S(x', y'). Furthermore, it is easy to prove that the detailed balance conditions in Eq. (5) is fulfilled for this chain which in turn implies that the chain converges towards the sampling distribution q.

Proposal densities for the (HMM)

In contrast to the approach presented in the previous section, the generalized method we use here also works for null models that do not allow for direct sampling from $N (x)$ as in the case of i.i.d. sequences. This framework can be summarized by following algorithm:

METROPOLIS HASTINGS UPDATE(x, y, z, p, s, n, w)

Input: Sequences x, y, a hidden state sequence z, the null probability p(x, y) = f^query(x). f^subject(y), the score s = S(x, y), the sub-class n and weights w

Output: Possibly new values for x, y, z, p, s, n.

1: Draw (x', y') ∈ N(x, y)

2: compute z' := V(x) using $V$ and determine the corresponding class n';

3: compute p' := f^query (x') · f^subject(y');

4: compute s' := S(x', y') using $A$ .

5: Compute $α : = \frac{w [s^{'}, n^{'}] \cdot p^{'} \cdot P_{x^{'}, x}}{w [s, n] \cdot p \cdot P_{x, x^{'}}}$ .

▷ Designed such that

p(x', y') · P_{(x', y'), (x,y)}= p(x, y) · P_{(x,y),(x', y')}

6: With probability min {1, α}:

Let (x, y, z, p, s, n)← (x', y', z'. p', s', n')

7: return (x, y, z, p, s, n)

The algorithm is applicable to all models that allow for a rapid calculation of the null probabilities f(·). Sequence models based on HMMs fall into this class. In the following we brie y describe this framework. A detailed discussion can be found in the specialized literature on the topic [2, 4].

In the probabilistic framework of HMMs one assumes a sequence of "observed" symbols (the protein sequences here) which is generated conditioned on a sequence of "hidden" states. For the case of TM proteins, the state corresponds to the physical region where the corresponding amino acid is located in, as detailed below. Within a modeling using HMMs, this state sequence, also called path, follows a simple Markov chain. The actually generated symbols are connected to the hidden states by conditional "emission" probabilities. More formally, a HMM consists of

a finite set ∑ of (output) symbols (in our case the amino acid alphabet),
a finite set Γ of (hidden) states,
initial state probabilities π_μfor all μ ∈ Γ with ∑ _μ∈Γπ_μ= 1,
emission probabilities $p_{σ}^{μ}$ in each state μ ∈ Γ Σ and for all σ ∈ ∑ with $\sum_{σ \in \sum} p_{σ}^{μ}$ = 1 for all σ ∈ ∑,
a stochastic transition probability matrix P = (p_μ,τ)_{μ,τ ∈ Γ}, i.e. $\sum_{τ \in Γ} p_{μ, τ}$ = 1 for all μ ∈ Γ.

Given these model parameters, the "most natural" application of a HMM is to generate a sequence of hidden states by a stochastic process and, in parallel, to generate a random sequence of symbols given the generated states. Hence, the stochastic process describes pairs of states and symbols. But also given a fixed state Z = Z₁ ... Z_L , the symbol sequence x = x₁ ... x_L is a stochastic process, furthermore the opposite case of a fixed sequence of output symbols, the state sequence is a stochastic process.

For the Monte Carlo sampling as needed here, it is not possible to simulate a HMM directly to generate symbol sequences, since importance sampling changes the underlying sequence probabilities. Nevertheless, one still needs to compute the probabilities f^HMM(x) for the Monte Carlo acceptance procedure, i.e. the probabilities that x is the observed symbol sequence generated by the HMM using any feasible state sequence. These probabilities can be computed in $O$ (L · |Γ|²) time using the well known forward algorithm as described in the following. One introduces the auxiliary variables f_μ (i), which correspond to the probability that the subsequence x₁ ... x_i is generated by the HMM given that the last state variable Z_i has the value μ, i.e. f_μ (i) = Prob(X₁ ... X_i = x₁ ... x_i |Z_i = μ). The overall probability is then f^HMM(x) = ∑ _μ∈Γf_μ (L). The probabilities f_μ (i) can be determined by the recursion

f_{μ} (i) = p_{x_{i}}^{μ} \sum_{τ \in Γ} f_{τ} (i - 1) p_{τ, μ}

(12)

with initial conditions $f_{μ} (1) = π_{μ} p_{x_{1}}^{μ}$ .

Within the same time complexity the Viterbi algorithm $V$ computes the most probable state path for a given sequence of observations, that is

\begin{array}{l} z_{1} \dots z_{L} = V (x_{1} \dots x_{L}) \\ = & \underset{{\bar{z}}_{1} \dots {\bar{z}}_{L} \in Γ^{L}}{argmax} Prob (Z_{1} \dots Z_{L} = {\bar{z}}_{1} \dots {\bar{z}}_{L} | x_{1} \dots x_{L}) . \end{array}

For this purpose one uses a different set of auxiliary variables: Let v_μ (i) be the probability of the most probable path ending in state μ ∈ Γ with observed partial output sequence x₁, ..., x_i . These values can be computed recursively by

v_{μ} (i) = p_{x_{i}}^{μ} \max_{τ \in Γ} {v_{τ} (i - 1) p_{τ, μ}}

(13)

with boundary condition $v_{μ} (1) = π (μ) \cdot p_{x_{1}}^{μ}$ . Note that these probabilities are not normalized, in particular $\sum_{μ \in Γ} v_{μ} (i) \leq 1$ . The missing normalization is no problem, since we are interested only in the most probable path, which is reconstructed by back-tracking [2].

For the approach discussed in this section, the subject sequences are drawn almost as above, see below. The HMM approach we use to sample transmembrane queries is the TMHMM developed by Sonnhammer et. al. [25]. In this setting, the states are (structural) domains. Some of them are "tied", which means that they share the same emission probabilities. They are classified into seven groups:

Helix core,
two different groups of caps on either side,
loops on the cytoplasmic side,
short and long loops on the non-cytoplasmic side,
globular domains.

The internal structure of the helix core and loop module allows modeling different lengths of the corresponding protein domain by assigning jump probabilities. The globular domains have a self-looping structure and hence may also have various lengths. The other modules have fixed lengths. The overall number of model parameters is 216. Figure 3 shows the actual layout of TMHMM. Each box represents a group of "tied" states. The states corresponding to "helix core" represent the transmembrane helices that connect states of the cytoplasmic side and the non-cytoplasmic side of the membrane. The prediction of the positions of the "helix core" states determines the loci of the special purpose scoring matrix SLIM for position specific alignment (see Figure 1).

The following Metropolis-Hastings update consists of two steps: First, the proposal of a new configuration from the neighborhood $N (x)$ is made by inserting/replacing symbols with equal weights $f_{σ} = \frac{1}{| \sum |}$ for all σ ∈ ∑ using one of the five Monte-Carlo moves described above. The acceptance ratio Eq. (3) in that case is given by

\begin{array}{l} α ((x, y) \to (x^{'}, y^{'})) \\ = & \min {\begin{cases} 1 \\ \frac{w (S (x^{'}, y^{'}), {n^{'}}_{TM}) f^{query} (x^{'}) f^{subject} (y^{'})}{w (S (x, y) n_{TM}) f^{query} (x) f^{subject} (y)} \end{cases}} \end{array} .

(14)

The current and new number of TM regions n_TM and n'_TM are determined by the Viterbi algorithm applied on the sequence x and x' respectively. The calculation of the query probabilities is based on the TMHMM. The subject sequence probability is simply calculated according to Eq. (11). Note that, in each step, one of the two probabilities cancels, because only one of the two sequences is changed within each step, as above.

This approach allows us to sample noni.i.d. sequences with appropriate weights and to predict transmembrane helical regions that can be used in the position specific alignment scheme (as described in [23]) even for random sequences.

Wang-Landau sampling

The idea of importance sampling is to choose the weights w(·), such that the drawn events in the region of interest have a high probability to occur in the simulation. Ideally, P(S) is already known and in that case one might choose w(S) ∝ 1/P (S) on the entire range of interest. Then all states are visited with equal probability, and hence a at score histogram is achieved in the limit of infinite sample size. Still, for practical applications with finite sample size, the distribution of scores can be sampled with high accuracy over a large range of its support. This idea refers back to statistical physics and it is known as "generalized ensemble" or " flat histogram" methods. In the following we will denote this weights by w^flat.

Of course the true P(S) is unknown and the method requires some guesses which approximate w^flat to a suitable accuracy. The achieved score histogram becomes only approximatively flat. The true (unknown) distribution can then be estimated by reweighting the histogram of visited states using the importance sampling formula Eq. (7) for h_s .

Many iterative sampling schemes to achieve initial guesses had been developed in the 1990ies, for example entropic sampling [30], multicanonical sampling [31] and later transition matrix Monte Carlo [32–34], only to mention a few. Here we use the Wang-Landau algorithm [35, 36] to approximate w^flat as input for Metropolis-Hastings sampling.

The Wang-Landau algorithm explicitly violates detailed balance by dynamically updated weights depending on the visited states in the following way: First, a score range of interest [S_min, S_max] is chosen. The algorithm basically employs a histogram H(S) and weights w(S) defined on the desired score range. For more complicated models such as the (TMHMM), these objects are two-dimensional depending on the score S and the class n_TM, i.e. H(S, n_TM) and w(S, n_TM). Furthermore, real valued parameters ϕ_i > 1 are used in each iteration i. Initially, the histogram values H(S, n_TM) are set to 0 in the desired range and all weights w(S, n_TM) to a constant, say 1. For the first iteration, i = 0, ϕ_i can be as large as e¹. Then, a simulation is performed using acceptance ratio Eq. (12) or Eq. (14). After each step, corresponding to one step of a (biased) random walk in the configuration space, w(S, n_TM) is updated as w(S, n_TM) ← w(S, n_TM) × ϕ_i , where S is the current score value and n_TM the sub-class of the current state. Also the histogram H is updated by one H(S, n_TM) ← H(S, n_TM) + 1. In the literature this is often continued until an "approximately flat histogram" is achieved. A possible flatness criterion might be $H (S, n_{TM}) > 0.6 \cdot \frac{1}{S_{\max} - S_{\min} + 1} \sum_{S^{'} = S_{\min}}^{S_{\max}} H (S^{'}, n_{TM})$ for all S, n_TM. Once the histogram is " flat", ϕ is decreased by the rule $ϕ_{i + 1} \leftarrow \sqrt{ϕ_{i}}$ and all entries of the histogram H are set to 0 again, while w is kept for the next iteration. Note that the application of a flatness criterion is not essential for the good performance of the algorithm. It is enough to guarantee that all values of S have been visited, for example by requiring that the random walker has cycled several times through the interval of interest [S_min, S_max].

To summarize, we have the following recipe:

WANG LANDAU(w, ϕ, ϕ_final, N)

Input: Initial guess w[s, n], initial and final modification factors ϕ, ϕ_final, number of samples for production run N

Output: Histogram of visited scores, H(s, n):= number of samples with score s and class n and weights used in the production run w(s, n) for all s and n

1: ▷ Initialize and estimate w[s, n]

2: Pick any x, y ∈ $X$ and compute its null probability p := f^query(x) · f^subject(y);

3: compute s := S(x, y) using $A$

4: compute z := V (x) using $V$ and determine corresponding class n;

5: while ϕ > ϕ_finaldo

6: H[s', n'] ← 0 for all possible score values s' and classes n'

7: while H[s', n'] is not at do

8: (x, y, z, p, s, n) ←

METROPOLIS HASTINGS UPDATE (x, y, z, p, s, n,w);

9: H[s, n] ← H[s, n] + 1; w[s, n] w[s, n]/ϕ;

10: end while

11: $ϕ \leftarrow \sqrt{ϕ}$

12: end while

13: ▷ Obtain N samples from q and their score counts/histogram

14: H[s', n'] ←0 for all possible score values s' and classes n'

15: for i = 1..N do

16: H[s, n] ← H[s, n] + 1

17: repeat

18: (x, y, z, p, s, n) ←

METROPOLIS HASTINGS UPDATE (x, y, z, p, s, n,w);

19: until mixing has occurred

20: end for

21: return counts H, weights w.

Due to the decreasing rule $ϕ_{i + 1} \leftarrow \sqrt{ϕ_{i}}$ , the modification factor ϕ converges towards 1. The simulation is stopped when ϕ reaches a chosen threshold value which is close to 1. It turned out that in our case the range from ϕ₀ = exp(0.1) ≈ 1.105 to ϕ_final = exp(0.0002) ≈ 1.0002 has been proven valuable.

Since detailed balance is violated explicitly, the convergence of the algorithm can not be proven. For this reason one should always use the Wang-Landau part as a precomputation step just to obtain weights suitable w(S). After this, one performs a simulation with ϕ = 1 for data production, which corresponds to the Metropolis-Hastings algorithm.

Improvements

Of course there is much room for improvement. For example, consider the time evolution of the histogram H(S) for (RQGS) with L_Q = L_S = 348 up to S_max = 500 with Prob(S = S_max) ≈ 10^-65 in Figure 4a.

When starting with an initial guess w(S) = 1 for all S ∈ [23, 500], the random walker needed about 5.8 × 10⁵ Monte-Carlo steps for a round trip, i.e. to move from the lowest score S_min = 23 to the highest one S_max = 600 and back. The duration of a round trip is a measure of the mixing time of the corresponding Markov chain. Hence, the shorter the round trip is time, the faster the chain convergences. During the first round trip, the weights have been improved such that the second round trip (and further round trips) needed only 13% of the computational effort of the first one. Once the random walker has performed its first round trip, the typical round trip time does not change significantly. This tight bottleneck in the very early stage of the algorithm can be overcome by suitable initial guesses of w. In Figure 4b the time evolution of the same parameter set (RQGS with L_Q = L_S = 348) is shown except for the choice of the weights, which have been chosen as w(s) ≈ 1 = Prob(S = s|L_Q = 348; L_S = 200), i.e. from a previous simulation of a different but similar setup. One observes that the histogram becomes "at" within a much smaller amount of Monte-Carlo steps. Furthermore, the first round-trip time decreases to 1.3 × 10⁵ (i.e. 22% of the value for the naive guess w(s) = 1). From a practical point of view, this allows us to save computing time for two distributions with close by parameters (e.g. close by sequence lengths). One can use the results of one distribution as input for the second one. With this approach we may also explore the parameter space successively. In some cases it is sufficient to run a short batch run with the weights of a close by distribution and ϕ = 1, i.e. a detailed balance simulation, and then apply importance reweighting and use the so obtained approximation of P(s) for a longer production run. This kind of procedure is shown in the inset of Figure 4b: The detailed balance simulations were performed with L_Q = L_S = 348, whereas the weights w(s) came from a simulation with L _S= 320 and L _S= 400, respectively. The result shows that the histograms are not " flat" at all, but the distributions were close enough to visit all score values on the range of interest. In this successive way of iterations a broad range of the parameter space is accessible.

Estimation of the statistical error

Statistical analysis of Markov-chain Monte-Carlo data requires a careful inspection of correlation effects because the events depend on the history of the chain. This correlations vanish within a typical timescale: Events that are separated by a sufficient number of steps can be assumed to be independent. However, since Monte-Carlo methods are only approximative, an assignment of statistical errors are requisite. In this study we used Flyvbjerg and Peterson's [37] blocking method to estimate the error.

Results

To our knowledge we present the first highly accurate score statistics for alignments with position-specific scoring schemes. The alignment scores were calculated with the standard Smith-Waterman algorithm with the BLOSUM62 matrix for the (RQGS) and a bipartite version BLOSUM62/SLIM for (FQPS) and (HMM) (see Figure 1). For the a fine gap costs we have chosen the standard values with a gap-open penalty of 12 and a gap-extension penalty of 1, and UniProt symbol frequencies for i.i.d. sequences.

We discuss four different transmembrane proteins as queries (see Table 2) in the (FQGS) scheme. The results are shown in Figure 5, where the distributions of (FQGS) and (RQGS) are compared against each other. The subject lengths are set to the query lengths. For the production run of one distribution in Figure 5 (L _Q= LS = 348) 16,777,216 Metropolis-Hastings updates have been performed. This took about 16 hours on an Intel Pentium 4 with 3.4 GHz. The performance of the corresponding HMM is weaker for three reasons: Firstly, we are interested in a joint distribution for that we need more samples. Secondly, more proposals are rejected from the sampler due to the HMM-weights and finally the computation of the forward-probabilities requires additional floating point operations. The computation of 16,777,216 Metropolis-Hastings updates for this model costs about 45 CPU hours. We use an 8 times larger sample size in order to account for the first drawback. Hence, we put an overall computational effort on this model, which is 23 times as large as for (FQGS) and (RQGS) (apart from the Wang-Landau iterations).

Table 2 A selection of transmembrane proteins

Full size table

Here we observe in Figure 5b that on the log scale the curvature of the tail of the distribution, i.e., the deviation from the exponential tail of the pure Gumbel form Eq. (2), is more pronounced in the (FQPS) model: Significant differences of shapes already show up in the high probability region, which is accessible by simple sampling (Figure 5a). The (RQGS) distributions for different lengths match almost perfectly (only two lengths are shown), whereas the shape of the (FQPS) distributions varies slightly with the sequence type. This supports the observation of Müller et.al. [23] that position-specific scoring in connection with a fixed query sequence may better discriminate between different sequences than the standard approach where two random sequences are compared with position-independent scoring matrices.

The asymptotic theory for i.i.d. sequences predicts an EVD of the form of Eq. (2). The parameters λ > 0 and c > 0 depend on the score matrix, on the symbol frequencies f, and on the query and subject sequence lengths L _Q and L _S. Altschul and Gish [16] pointed out that asymptotic results where c = KL _Q L _S(with K > 0 a constant) need to be corrected by using effective sequence lengths. An alignment may extend to the end of either sequence and the score will be distorted towards lower values and high scores become less probable. In the limit of infinite sequences this effect vanishes and the tail of the Gumbel distribution can be understood as an upper bound for finite sequences. Indeed, we clearly see that the curves in Figure 5b are not straight lines in the right tail, but have negative curvature.

A better t to the empirical distribution is obtained by determining parameters s₀, λ > 0, λ₂ > 0 for a "modified" Gumbel distribution with

\log Prob (S = s) = \log (λ) - λ (s - s_{0}) - λ_{2} {(s - s_{0})}^{2},

(15)

where s₀ can be interpreted as the center of the distribution. This corresponds to a EVD multiplied with a Gaussian correction factor, given by the last term. The parameter λ₂ is generally small (and thus shows its effect only in the far right tail). It vanishes for sequences of equal length as the length tends to infinity. Previously, such a correction has been proposed for (RQGS) statistics and has been computed for different parameter sets of BLOSUM62 and PAM250 with a ne gap costs [17, 18].

More pronounced differences are seen in the behavior of the tail (Figure 5b), which is only accessible via importance sampling approach. The difference between the probabilities spans several orders of magnitude; hence a wrong choice of the model would falsify the estimation of significance drastically. Most importantly, the pmf obtained using the position-specific scoring is considerably curved. Thus, using EVDs from fits to data of the high-probability region is even more questionable here than in the (RQGS) model, where the pmf is almost a straight line. Note that for the (RQGS) model, previous simulations [18] have already shown that for the special case of L _S= L _Q, the pmf converges for large sequence length indeed to an EVD.

Note that the Gaussian correction for local alignment parameterized by λ₂ is purely heuristic. Looking at the data, the shape in Figure 5b looks similar to the one of the Tracy-Widom distribution [38]. Interestingly, Majumdar and Nechaev [39] as well as Priezzhev and Schütz [40] obtained analytically the Tracy-Widom distribution as the asymptotic distribution for the model of the longest common subsequences which is closely related to global alignment. Also, Sardiu, Alves and Yu [41] observed that the the statistics of the score fluctuations of global alignment in the large probability region is compatible with the Tracy-Widom distribution. There could be a connection to our results, because in the rare-event tail, alignment lengths are of the order of the sequence lengths, hence the alignment is effectively global. Nevertheless all our results are obtained for finite sequence lengths. In contrast, distributions like Gumbel or Tracy-Widom are obtained in the asymptotic limit of infinite sizes of the underlying systems. Since finite-size corrections are hard to obtain, or even unknown, we do not attempt to determine the "true" shape of the distribution and are satisfied by our heuristic formula.

Next, we discuss the usefulness of the (FQPS) statistics in terms of retrieval performance. For this purpose we considered the ASTRAL compendium [42] version 1.75 with less than 40% identity to each other. It contains a set of reference proteins classified hierarchically based on their tertiary structure. It is a subset of the SCOP database with removed redundancy. The main hierarchy levels in the SCOP classification scheme are class, fold, superfamily, family. Proteins in the same class share the same type of secondary structure, whereas the fold level describes more specific the arrangement of the secondary structure. As the position specific scoring scheme is designed to be more sensitive to discriminate transmembrane proteins against others, its performance can be measured by searching a collection of transmembrane proteins from the ASTRAL set against the complete set. From the 10569 sequences in the database we have chosen 63 sequences which are classified as Membrane and cell surface proteins and peptides. For this collection we predicted the membrane regions using TMHMM. Each of those queries were searched against the complete set and ranked according to the p-value on the basis of the (FQPS) statistics. The p-value threshold under which we regard a hit as significant controls the so called receiver operating characteristic (ROC), i.e. the relationship between sensitivity vs. specificity. A transmembrane protein that appears below an p-value threshold is referred as a true positive observation. Accordingly, proteins of all other SCOP classes below that value are false positives. The ROC space can be explored by changing the p-value threshold. A small threshold produces less false positives but we may miss some hits. A larger value leads to more false positives. The ROC is usually illustrated by plotting the true positive rate (TPR = true positives/positives), or the sensitivity, against the false positive rate (FPR = false positives/negatives), 1- specificity. The result for the search characteristics for (FQPS) is shown in Figure 6. We did the same experiment for a BLAST search. In this case all observations are located in left bottom corner in the ROC space. This can be explained by the fact that we have only considered the highest ranked results below the E-value threshold of 10 and many positives were beyond that value. This can also be seen in the inset of Figure 6 where we show the average number of membrane proteins found so far as a function of the rank in the result set. The line of slope 1 for (FQPS) at the begin of the result list means that virtually all hits are found in the correct SCOP class. In contrast, BLAST ranks transmembrane proteins at a high positions only randomly. Hence, the (FQPS) clearly outperforms the (RQGS) statistics.

The ROC curve in Figure 6 show the usefulness of the (FQPS) statistics for retrieval performance, but the extreme small p-values where the the modification factor λ₂ plays a role are not essential for this purpose. The modified Gumbel statistics however affect a possible ranking of database search results, especially for sequences of different lengths. To illustrate this, we used BLAST to retrieve homologs of our four example proteins from the current Swissprot database. The scores were recomputed via the position specific Smith-Waterman algorithm for (FQPS). We computed the corresponding p-values from our simulation data and ranked the result set by the p-value based on

1.
the Gumbel distribution (λ₂ = 0) and
2.
the accurate distribution (λ₂ > 0).

For subject sequence lengths that are not directly governed by our simulation directly we used interpolated fit parameters. In Table 3 we illustrate that the there are subjects whose relative order in the result set is switched when using the more accurate (FQPS) score distribution in contrast to the BLAST E-value. This result might be important for applications of protein classification where the specific ranking of high scoring proteins is particularly important. However, on a global level the order of the hits persists, signaled by Kendall's rank correlation [43]. When comparing the order obtained between the cases λ₂ = 0 and λ₂ ≠ 0 we measured a rank correlation $\hat{τ}$ = 0.986 for the query P08100. A τ of exactly 1 means identical, 0 unrelated, and, -1 exactly the reverse ranking. The rank correlation between the original BLAST ranking and the one based on the accurate p-values is $\hat{τ}$ = 0.737.

Table 3 Change of ranking when using the modified Gumbel distribution

Full size table

To investigate the impact of dissimilar query and subject lengths L _Q and L _S on the parameters of the modified Gumbel distribution, we vary L _S and consider the parameters λ and λ₂ as functions of the ratio L _S/L _Q(see Figure 7). The large gap between the values of λ for the two different models reflects the qualitative difference of the shape in the high probability regime. We see that in the (RQGS) model, λ is virtually independent of the query and the sequence length. However, in the model (FQPS), λ varies with each individual query, as expected. For λ₂ one has to distinguish between L _S< L _Q and L _S> L _Q. In the first case, λ₂ decreases, which is not surprising, since the correction term describes a finite-size effect and should vanish for increasing sequence lengths.

Once the subject length exceeds the query length, the search space is still growing, but the finite length of the query enforces subject size independent edge effects.

For the (HMM), we approximate the score distribution within each class (number of helices = n). The shape of the distributions clearly agrees with the curvature for (RQGS) and (FQPS), and the modified Gumbel distribution could be fitted (see Figure 8) when the number of helices was not too small. This is indicated by a large reduced χ² value for distributions with a small number of helices. Also a visual inspection of the fit to the data supports this argument.

The rare-event tail shows clear differences between the different sub-classes of the model over several orders of magnitude. In Figure 9 the dependency of the fit parameters on the respective subclass of the model (Figure 9a and Figure 9b) as well as the dependency on the ratio L _S/L _Q(Figure 9c and Figure 9d) is shown. Note that for distributions that are not well described via Eq. (15), we only fitted the data in the high probability region. Those data points are left out in the plot for λ₂ in Figure 9b and are connected by dotted lines in Figure 9a.

In analogy to (RQGS) and (FQPS), the curvature remains constant when L _S> L _Q. Regarding the dependence on the number of helices, the curvature decays with increasing number of transmembrane regions and then approaches an approximate constant value. Numerical values are provided in the Appendix for reference.

Discussion and conclusions

We have presented a simple universal numerical method to accurately sample the far right tail of the score distribution of various sequence comparison algorithms. It appears to be the first method that is applicable to all classical local alignment statistics, query-specific and position-dependent score statistics, HMM calibration, statistics of normalized alignments, and many more. To sample the distribution using computer simulations, we use Markov-chain Monte Carlo simulations, in particular the Wang-Landau approach is connection with the Metropolis-Haistings algorithm. Apriori, the Wang-Landau approach does not require any assumption on the shape of the distribution (for example the parameters of the Gumbel distribution). The parameters can be estimated a posteriori by fitting the simulated distribution to an appropriate parametric form like Eq. (15). Here, we observed that for the (FQPS) model, the Gumbel distribution should be replaced by a more negatively curved one.

The method has a disadvantage: Because of the high number of samples required for non-parametric estimation of the distribution, it can presently not be used in on-line database search web services, such as a BLAST server. For example, generating the 16,777,216 samples for Figure 5 (L _Q= L _S= 348) took approximately 16 hours on an Intel Pentium 4 with 3.4GHz.

This is not as bad as it seems, though: Both the implementation and the design of the Markov chain have much room for improvement, e.g. we can choose different neighborhoods N(x) and optimize the weights in the generalized ensemble [44, 45].

While this still prohibits interactive use, we see a lot of potential for our method to provide an improved version of the hmmcalibrate tool [22] and to explore the statistics of normalized sequence alignment [7].

During the preparation of this manuscript we came aware of a new related importance sampling method which is suitable for efficient p-value computations for alignment statistics [46]. It makes use of simultaneous backward sampling of alignments and sequences. So far this method was applied to i.i.d. sequences but it should be possible to extend it to more complex model as well. We have tested it for the (FQPS) model as well. For the joint distribution of score and number of helices one would have to sample simultaneous the alignments, sequences and the hidden state sequence of the TMHMM.

Appendix: modified Gumbel parameters

Table 4 and Table 5 show numerical values for the parameters λ, λ₂ and K of the modified Gumbel distribution Eq. (15). These are visualized in Figure 7 and 9 in the body of the paper.

Table 4 Fit parameters for (FQPS) and (RQGS)

Full size table

Table 5 Fit parameters for the HMM

Full size table

References

Lesk AM: An Introduction to Bioinformatics. Oxford: Oxford University Press; 2005.
Google Scholar
Durbin R, Eddy S, Krogh A, Mitchison G: Biological Sequence Analysis. Cambridge: Cambridge University Press; 1998.
Book Google Scholar
Smith TF, Waterman MS: Identification of Common Molecular Subsequences. J mol Biol 1981, 147: 195–197. 10.1016/0022-2836(81)90087-5
Article CAS PubMed Google Scholar
Rabiner LR: A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 1989, 77(2):257–286. 10.1109/5.18626
Article Google Scholar
Altschul S, Gish W, Miller W, Myers E, Lipman D: Basic Local Alignment Search Tool. J Mol Biol 1990, 215: 403–410.
Article CAS PubMed Google Scholar
Hartmann AK: Practical Guide to Computer Simulations. Singapore: World Scientific; 2009.
Book Google Scholar
Arslan AN, Egecioglu O, Pevzner PA: A new approach to sequence comparison: normalized sequence alignment. Bioinformatics 2001, 17(4):327–337. 10.1093/bioinformatics/17.4.327
Article CAS PubMed Google Scholar
Bairoch A, Apweiler R, Wu CH, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M, Martin MJ, Natale DA, O'Donovan C, Redaschi N, Yeh LSL: The Universal Protein Resource (UniProt). Nucleic Acids Res 2005, (33 Database):D154-D159. [http://dx.doi.org/10.1093/nar/gki070]
Heinko S, Heinko J: Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci USA 1992, 89: 10915–10919. 10.1073/pnas.89.22.10915
Article Google Scholar
Mercier S, Daudin JJ: Exact distribution for the local score of one i.i.d. random sequence. J Comput Biol 2001, 8(4):373–380. [http://dx.doi.org/10.1089/106652701752236197] 10.1089/106652701752236197
Article CAS PubMed Google Scholar
Karlin S, Altschul S: Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc Natl Acad Sci USA 1990, 87: 2264. 10.1073/pnas.87.6.2264
Article PubMed Central CAS PubMed Google Scholar
Gumbel E: Statistics of Extremes. New York: Columbia University Press; 1958.
Google Scholar
Grossmann S, Yakir B: Large Deviations for global maxima of independent superadditive processes with negative drift and an application to optimal sequence alignments. Bernoulli 2004, 10(5):829–845. 10.3150/bj/1099579157
Article Google Scholar
Waterman MS, Vingron M: Rapid and accurate estimates of statistical significance for sequence data base searches. Proc Natl Acad Sci USA 1994, 91(11):4625–4628. 10.1073/pnas.91.11.4625
Article PubMed Central CAS PubMed Google Scholar
Altschul SF, Bundschuh R, Olsen R, Hwa T: The estimation of statistical parameters for local alignment score distributions. Nucleic Acids Res 2001, 29(2):351–361. 10.1093/nar/29.2.351
Article PubMed Central CAS PubMed Google Scholar
Altschul S, Gish W: Local Alignment Statistics. Meth Enzym 1996, 266: 460. full_text
Article CAS PubMed Google Scholar
Hartmann A: Sampling rare events: Statistics of local sequence alignments. Phys Rev E 2002, 65: 056102. 10.1103/PhysRevE.65.056102
Article Google Scholar
Wolfsheimer S, Burghardt B, Hartmann A: Local sequence alignments statistics: deviations from Gumbel statistics in the rare-event tail. Algor Mol Biol 2007, 2: 9. [http://www.almob.org/content/2/1/9] 10.1186/1748-7188-2-9
Article Google Scholar
Yu YK, Wootton JC, Altschul SF: The compositional adjustment of amino acid substitution matrices. Proc Natl Acad Sci USA 2003, 100(26):15688–15693. [http://dx.doi.org/10.1073/pnas.2533904100] 10.1073/pnas.2533904100
Article PubMed Central CAS PubMed Google Scholar
Yu YK, Altschul SF: The construction of amino acid substitution matrices for the comparison of proteins with non-standard compositions. Bioinformatics 2005, 21(7):902–911. [http://dx.doi.org/10.1093/bioinformatics/bti070] 10.1093/bioinformatics/bti070
Article CAS PubMed Google Scholar
Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25(17):3389–3402. 10.1093/nar/25.17.3389
Article PubMed Central CAS PubMed Google Scholar
Eddy S:HMMER User's guide, version 2.3.2. 2003. [ftp://selab.janelia.org/pub/software/hmmer/CURRENT/Userguide.pdf]
Google Scholar
Müller T, Rahmann S, Rehmsmeier M: Non-symmetric score matrices and the detection of homologous transmembrane proteins. Bioinformatics 2001, 17: 182–189. [http://bioinformatics.oxfordjournals.org/cgi/content/abstract/17/suppl_1/S182]
Article Google Scholar
Eddy SR: A Probabilistic Model of Local Sequence Alignment That Simplifies Statistical Significance Estimation. PLoS Comput Biol 2008, 4(5):s1000069. [http://dx.doi.org/10.1371%2Fjournal.pcbi.1000069] 10.1371/journal.pcbi.1000069
Article Google Scholar
Sonnhammer EL, von Heijne G, Krogh A: A hidden Markov model for predicting transmembrane helices in protein sequences. In Proc. Sixth Int. Conf. on Intelligent Systems for Molecular Biology. Edited by: JG, et al. AAAI Press; 1998:175–182.
Google Scholar
Krogh A, Larsson B, von Heijne G, Sonnhammer ELL: Predicting transmembrane protein topology with a hidden markov model: application to complete genomes. J Mol Biol 2001, 305(3):567–580. [http://www.sciencedirect.com/science/article/B6WK7–457D7V9-K/2/0367078014042718f39416a2c3ddeeb3] 10.1006/jmbi.2000.4315
Article CAS PubMed Google Scholar
Hastings WK: Monte Carlo Sampling Methods Using Markov Chains and Their Applications. Biometrika 1970, 57: 97–109. 10.1093/biomet/57.1.97
Article Google Scholar
Newman MEJ, Barkema GT: Monte Carlo Methods in Statistical Physics. Oxford: Clarendon Press; 1999.
Google Scholar
Rubinstein RY, kroese DP: Simulation and the Monte Carlo Method. Hoboken, New Jersey: Wiley; 2008.
Google Scholar
Lee J: New Monte Carlo algorithm: Entropic sampling. Phys Rev Lett 1993, 71(2):211–214. 10.1103/PhysRevLett.71.211
Article CAS PubMed Google Scholar
Berg BA, Neuhaus T: Multicanonical ensemble: A new approach to simulate first-order phase transitions. Phys Rev Lett 1992, 68: 9. 10.1103/PhysRevLett.68.9
Article PubMed Google Scholar
Wang JS, Tay TK, Swendsen RH: Transition Matrix Monte Carlo Reweighting and Dynamics. Phys Rev Lett 1999, 82(3):476–479. 10.1103/PhysRevLett.82.476
Article CAS Google Scholar
Wang JS: Transition matrix Monte Carlo method. Comput Phys Commun 1999, 121–122: 22–25. [http://www.sciencedirect.com/science/article/B6TJ5–3Y0HM2T-T/2/3377e3546795e04c63dc23b6982b7459] 10.1016/S0010-4655(99)00270-2
Article CAS Google Scholar
Wang JS, Lee LW: Monte Carlo algorithms based on the number of potential moves. Comput Phys Commun 2000, 127: 131–136. [http://www.sciencedirect.com/science/article/B6TJ5–404H3KN-N/2/e62d53facfd5d82de4b029380ea99a78] 10.1016/S0010-4655(00)00016-3
Article CAS Google Scholar
Wang FG, Landau DP: Efficient, multiple-range random walk algorithm to calculate the density of states. Phys Rev Lett 2001, 86: 2050. 10.1103/PhysRevLett.86.2050
Article CAS PubMed Google Scholar
Wang FG, Landau DP: Determining the density of states for classical statistical models: A random walk algorithm to produce a flat histogram. Phys Rev E 2001, 64: 056101. 10.1103/PhysRevE.64.056101
Article CAS Google Scholar
Flyvbjerg H, Petersen HG: Error estimates on averages of correlated data. The Journal of Chemical Physics 1989, 91: 461–466. [http://link.aip.org/link/?JCP/91/461/1] 10.1063/1.457480
Article CAS Google Scholar
Tracy CA, Widom H: On orthogonal and symplectic matrix ensembles. Communications in Mathematical Physics 1996, 177(3):727–754. [http://dx.doi.org/10.1007/BF02099545] 10.1007/BF02099545
Article Google Scholar
Majumdar SN, Nechaev S: Exact asymptotic results for the Bernoulli matching model of sequence alignment. Phys Rev E 2005, 72(2):020901. 10.1103/PhysRevE.72.020901
Article Google Scholar
Priezzhev VB, Schütz G: Exact solution of the Bernoulli matching model of sequence alignment. Journal of Statistical Mechanics: Theory and Experiment 2008, 2008(09):P09007. (11 pp) [http://iopscience.iop.org/1742–5468/2008/09/P09007/] (11 pp) 10.1088/1742-5468/2008/09/P09007
Article Google Scholar
Sardiu ME, Alves G, Yu Y: Score statistics of global sequence alignment from the energy distribution of a modified directed polymer and directed percolation problem. Phys Rev E 2005, 72: 061917. 10.1103/PhysRevE.72.061917
Article Google Scholar
Chandonia JM, Hon G, Walker NS, Lo Conte L, Koehl P, Levit t M, Brenner SE: The ASTRAL Compendium in 2004. Nucl Acids Res 2004, 32(suppl_1):D189–192. [http://nar.oxfordjournals.org/cgi/content/abstract/32/suppl_1/D189] 10.1093/nar/gkh034
Article PubMed Central CAS PubMed Google Scholar
Kendall M, Gibbons JD: Rank Correlation Methods. 5th edition. London: Edward Arnold, a div. of Hodder & Stoughton; 1976.
Google Scholar
Dayal P, Trebst S, Wessel S, Würtz D, Troyer M, Sabhapandit S, Coppersmith SN: Performance Limitations of Flat-Histogram Methods. Phys Rev Lett 2004, 92(9):097201–4. [http://link.aps.org/abstract/PRL/v92/e097201] 10.1103/PhysRevLett.92.097201
Article CAS PubMed Google Scholar
Trebst S, Huse DA, Troyer M: Optimizing the ensemble for equilibration in broad-histogram Monte Carlo simulations. Phys Rev E 2004, 70(4):046701. [http://link.aps.org/abstract/PRE/v70/e046701] 10.1103/PhysRevE.70.046701
Article Google Scholar
Newberg LA: Significance of Gapped Sequence Alignments. Journal of Computational Biology 2008, 15(9):1187–1194. [PMID: 18973434] [http://www.liebertonline.com/doi/abs/10.1089/cmb.2008.0125] [PMID: 18973434] 10.1089/cmb.2008.0125
Article PubMed Central CAS PubMed Google Scholar

Download references

Acknowledgements

SW was supported by the German VolkswagenStiftung (program "Nachwuchsgruppen an Universitäten"). IH was supported by the NRW Graduate School in Bioinformatics and Genome Research, Bielefeld. The simulations were performed at the "Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen" and at the cluster of the Center for Biotechnology at the university of Bielefeld. Allocation of computing time is gratefully acknowledged.

Author information

Authors and Affiliations

Laboratoire MAP5 (UMR CNRS 8145), Université Paris Descartes, 75006, Paris, France
Stefan Wolfsheimer
Technische Fakultät, Universität Bielefeld, 33501, Bielefeld, Germany
Inke Herms
Department of Computer Science, TU Dortmund, Germany
Sven Rahmann
Institut für Physik, Universität Oldenburg, D-26111, Oldenburg, Germany
Alexander K Hartmann

Authors

Stefan Wolfsheimer
View author publications
You can also search for this author in PubMed Google Scholar
Inke Herms
View author publications
You can also search for this author in PubMed Google Scholar
Sven Rahmann
View author publications
You can also search for this author in PubMed Google Scholar
Alexander K Hartmann
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Stefan Wolfsheimer.

Additional information

Authors' contributions

SW developed the simulation program for (FQPS) and HMM based on an earlier version [18], ran the simulations and performed data analysis. SR and AKH designed the project. IH and SW developed the details for the TMHMM. All authors contributed to the manuscript and approved the final manuscript.

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Authors’ original file for figure 1

Authors’ original file for figure 2

Authors’ original file for figure 3

Authors’ original file for figure 4

Authors’ original file for figure 5

Authors’ original file for figure 6

Authors’ original file for figure 7

Authors’ original file for figure 8

Authors’ original file for figure 9

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Wolfsheimer, S., Herms, I., Rahmann, S. et al. Accurate statistics for local sequence alignment with position-dependent scoring by rare-event sampling. BMC Bioinformatics 12, 47 (2011). https://doi.org/10.1186/1471-2105-12-47

Download citation

Received: 26 February 2010
Accepted: 03 February 2011
Published: 03 February 2011
DOI: https://doi.org/10.1186/1471-2105-12-47

Accurate statistics for local sequence alignment with position-dependent scoring by rare-event sampling

Abstract

Background

Results

Conclusions

Background

Previous work

Our contributions and paper outline

Methods

Importance sampling

Metropolis-Hastings sampling

Implementation

Proposal densities for (FQPS) and (RQGS)

Proposal densities for the (HMM)

Wang-Landau sampling

Improvements

Estimation of the statistical error

Results

Discussion and conclusions

Appendix: modified Gumbel parameters

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Authors' contributions

Authors’ original submitted files for images

Rights and permissions

About this article

Cite this article

Share this article

Keywords

BMC Bioinformatics

Contact us