Laboratoire MAP5 (UMR CNRS 8145), Université Paris Descartes, 75006 Paris, France

Technische Fakultät, Universität Bielefeld, 33501 Bielefeld, Germany

Department of Computer Science, TU Dortmund, Germany

Institut für Physik, Universität Oldenburg, D-26111 Oldenburg, Germany

Abstract

Background

Molecular database search tools need statistical models to assess the significance of the resulting hits. In the classical approach, one asks how probable it is that a certain score is observed by pure chance. Asymptotic theories for such questions are available for two random i.i.d. sequences. Some effort has been made to include effects of finite sequence lengths and to account for specific compositions of the sequences. In many applications, such as a large-scale database homology search for transmembrane proteins, these models are not the most appropriate ones. Search sensitivity and specificity benefit from position-dependent scoring schemes or the use of hidden Markov models. Additionally, one may wish to go beyond the assumption that the sequences are i.i.d. Despite their practical importance, the statistical properties of these settings have not been well investigated yet.

Results

In this paper, we discuss an efficient and general method to compute the score distribution to any desired accuracy. The general approach may be applied to different sequence models and various similarity measures that satisfy a few weak assumptions. We have access to the low-probability region ("tail") of the distribution, where scores are larger than expected by pure chance and therefore relevant for practical applications. Our method uses recent ideas from rare-event simulations, combining Markov chain Monte Carlo simulations with importance sampling and generalized ensembles. We present results for the score statistics of fixed and random queries against random sequences. In a second step, we extend the approach to a model of transmembrane proteins, which can hardly be described as i.i.d. sequences. For this case, we compare the statistical properties of a fixed-query model as well as a hidden Markov sequence model in connection with a position-based scoring scheme against the classical approach.

Conclusions

The results illustrate that the sensitivity and specificity strongly depend on the underlying scoring and sequence model. A specific ROC analysis for the case of transmembrane proteins supports our observation.

Background

A large amount of molecular biological data is stored in the form of symbol sequences in huge databases, e.g., DNA sequences or the primary structures of proteins. It is one main task of Bioinformatics

An interpretation becomes possible when we specify a probabilistic null model for the input: Then the similarity score becomes a random variable

In this paper, we explain and extend an

Previous work

We start by introducing some necessary formal notations. For a full description, please refer to Ref.

Most of the existing statistical work for pairwise sequence comparison focuses on null models where both sequences are random and at each position a symbol σ ∈ Σ is chosen independently of the other positions ("i.i.d. model"), with given symbol frequencies.

two gaps of lengths two and three appear.

Scores for individual pairs of symbols are given by a constant (position-independent) symmetric Σ × Σ scoring matrix with negative expected score, such as BLOSUM62

For gapless pairwise local sequence alignment, the raw score distribution can be derived numerically by Markov chain analysis

where the parameters λ > 0 and _{Q }
_{S}
_{Q}
_{S }

For gapped pairwise local sequence alignment, which is the most relevant case in database queries, there exist no universal analytic results, with the exception of a few special cases

The (RQGS) model is convenient because the problem of computing significance values reduces to the estimation of only two parameters, which can be precomputed for each scoring scheme. However, there are also several problems. For instance, even if one considers just the gapless case, it is in general not easy to extend the analytic asymptotic theory to more complex null models. Furthermore, for practical applications with finite sequence lengths, an even more important issue is that the p-values reported by (the original) BLAST depend only on the raw score, the query length and the subject length, and not on the actual query sequence. This leads to large distortions when the composition of the query sequence does not match the composition of the null model. For example, when we run a homology search for the human transmembrane protein rhodopsin (UniProt accession P08100) with BLAST (BLOSUM62, gap-init 12, gap-extend 1, no composition adjustment, no filtering), we find a possibly remote homolog with an E-value on the order of 10^{-8}. The E-value for score

The statistics of position-dependent scoring and/or gap-cost schemes, as used in PSI-BLAST

In all these cases, EVDs of the form of Eq. (2) may still be used heuristically by fitting the parameters of the EVD to simulated data. This can be achieved by generating pairs of random sequences according to the given null model while recording the histogram of observed alignment scores. Using such a "simple sampling" approach, the large-probability region of the score distribution can be investigated, e.g., probabilities above about 10^{-4} when generating 10^{4} sequence pairs. Such an approach is implemented, e.g., in standard tools, but it cannot reach the rare-event tail (probabilities below 10^{-4}), although this part of the distribution is most important for the estimation of the statistical significance.
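To make the "simple sampling" baseline concrete, here is a minimal Python sketch. The four-letter alphabet, the toy ungapped scoring and all parameter values are illustrative assumptions, not the alphabets or scoring matrices used in this study:

```python
import random
from collections import Counter

ALPHABET = "ACDE"   # toy alphabet; the study uses the 20 amino acids

def ungapped_local_score(x, y, match=1, mismatch=-1):
    """Best ungapped local segment score (toy stand-in for the scoring
    schemes discussed in the text)."""
    best = 0
    prev = [0] * (len(y) + 1)
    for i in range(1, len(x) + 1):
        cur = [0] * (len(y) + 1)
        for j in range(1, len(y) + 1):
            sc = match if x[i - 1] == y[j - 1] else mismatch
            cur[j] = max(0, prev[j - 1] + sc)   # restart segment when negative
            best = max(best, cur[j])
        prev = cur
    return best

def simple_sampling_histogram(n_samples, length, rng):
    """Draw i.i.d. sequence pairs and record the relative score frequencies."""
    hist = Counter()
    for _ in range(n_samples):
        x = "".join(rng.choice(ALPHABET) for _ in range(length))
        y = "".join(rng.choice(ALPHABET) for _ in range(length))
        hist[ungapped_local_score(x, y)] += 1
    return {s: c / n_samples for s, c in hist.items()}

p = simple_sampling_histogram(2000, 30, random.Random(42))
```

Reading off probabilities much below 1/n_samples from such a histogram is impossible, which is exactly the limitation discussed above.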

Our motivation for a simulation-based method that makes no initial parametric assumption refers to the approach

Bipartite scoring scheme

**Bipartite scoring scheme**. Bipartite scoring scheme for the detection of homologous transmembrane proteins from Ref.

Our contributions and paper outline

We present a general framework for efficient estimation of raw score distributions in sequence comparison problems. In particular, the rare-event tail for large scores can be accessed. We only make the following assumptions:

1. We are able to sample pairs

2. We have an efficient algorithm

3. The scores are rational numbers with a common denominator. Hence, without loss of generality, they can be assumed to be integers.

4. Optionally for the (HMM) approach, we have an efficient algorithm

Our framework is readily applicable to the (RQGS), (FQPS) and (HMM) models, but also to more exotic settings, such as normalized alignment

In the current stage of the methodology, the computation of an accurate "on the fly" p-value for each particular database query may be impracticable, since a full calculation cannot be completed within a few minutes.

We will illustrate the approach for the HMM for TM proteins (TMHMM

The rest of the paper is organized as follows. The following section presents the mathematical background on importance sampling and Markov chain Monte Carlo methods, which are fundamental to the methods used to obtain the score distribution, in particular in the rare-event tail, for different null models. Next, we present a description of the methodology. The section "Results" shows computational results on transmembrane protein similarity statistics in the (RQGS), (FQPS) and (HMM) models. A discussion closes the paper.

Methods

Importance sampling

Importance sampling is a general technique to reduce the variance in the estimation of quantities that can be written as an expectation, estimated from samples X_{1}, ..., X_{n}

In our setting, to estimate the score distribution (and then p-values), we consider the state space of sequence pairs (x, y) and draw samples ((x^{(1)}, y^{(1)}), ..., (x^{(n)}, y^{(n)})).

These pairs are then aligned by a given algorithm and the scores S(x^{(i)}, y^{(i)}) are computed. To formally write a histogram as an expectation value, we consider the family of indicator functions 1_{s}, which equal one if the observed score equals s and zero otherwise.

This means we approximate the unknown exact probability Prob(S = s) by the relative frequency of score s among the samples. For a probability of order 10^{-9}, when using simple sampling, we need about 10^{12} samples to estimate it with reasonable precision. Thus, for very rare events, this sampling quickly becomes infeasible.

Importance sampling generates the "interesting" events more often by sampling from a different distribution and correcting for this bias afterward, which results in a more accurate estimate with a reasonable number of samples. Let

where each pair (x^{(i)}, y^{(i)}) is sampled from the pmf
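The reweighting idea can be illustrated on a toy example where the exact answer is known: estimating a small tail probability for a fair coin by sampling from a biased coin and multiplying each counted sample by the likelihood ratio P0/Q. The coin model and all numbers are illustrative stand-ins for sequence pairs:

```python
import math
import random

def tail_prob_importance(n_flips, threshold, n_samples, q_heads, rng):
    """Estimate Prob(#heads >= threshold) for a FAIR coin by sampling from a
    BIASED coin (heads prob. q_heads) and reweighting each counted sample
    by the likelihood ratio P0(x)/Q(x)."""
    est = 0.0
    for _ in range(n_samples):
        heads = sum(rng.random() < q_heads for _ in range(n_flips))
        if heads >= threshold:
            w = (0.5 ** n_flips) / (q_heads ** heads * (1 - q_heads) ** (n_flips - heads))
            est += w
    return est / n_samples

rng = random.Random(1)
p_hat = tail_prob_importance(n_flips=20, threshold=18, n_samples=20000, q_heads=0.9, rng=rng)
exact = sum(math.comb(20, k) for k in range(18, 21)) / 2 ** 20
```

With simple sampling, a probability of about 2 × 10^{-4} would require far more samples for the same relative accuracy; the biased distribution generates the "interesting" events frequently.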

Metropolis-Hastings sampling

If we need to generate samples from a discrete (or continuous) distribution

Extensive introductions to such so-called Monte Carlo simulations can be found, e.g., in Refs. Given the current configuration (x, y), a new configuration (x', y') is proposed with probability Q_{(x,y), (x',y')} as being the next configuration. The proposal is

in which case (_{(x,y),(x', y')}

The acceptance criterion Equation (3) is quite general. By using a symmetric proposal probability matrix, Q_{(x',y'), (x, y)} = Q_{(x,y), (x',y')}, the relationship simplifies to

Since the distribution

Equation (4) and its generalization Equation (3), together with the proposal, determine the total transition probability W_{(x, y), (x', y')} = Q_{(x, y), (x', y')} · A_{(x, y), (x', y')}.

For an appropriate choice of the neighborhoods and proposal probabilities Q_{(x, y), (x', y')}, the so-constructed Markov chain is ergodic (each configuration can be reached from any starting configuration with finite probability). Furthermore, one can show that the detailed balance condition

which is fulfilled due to the choice of

We say that the chain has reached
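As an illustration of the Metropolis special case Eq. (4), the following sketch samples from an unnormalized target weight on a small integer state space using a symmetric random-walk proposal; the target, state space and parameter values are toy assumptions:

```python
import random

def metropolis_sample(target_weight, n_steps, rng, x0=0, burn_in=1000):
    """Metropolis algorithm with a symmetric +/-1 random-walk proposal on
    the states {0, ..., 20}: accept x' with prob. min(1, w(x')/w(x))."""
    x = x0
    counts = {}
    for step in range(n_steps + burn_in):
        xp = x + rng.choice((-1, 1))             # symmetric proposal
        if 0 <= xp <= 20 and rng.random() < min(1.0, target_weight(xp) / target_weight(x)):
            x = xp                               # accept, otherwise stay
        if step >= burn_in:
            counts[x] = counts.get(x, 0) + 1
    return counts

# unnormalized toy target w(x) = 2^(-x): after equilibration, the visit
# frequencies of the chain converge to this weight (up to normalization)
counts = metropolis_sample(lambda x: 2.0 ** (-x), 200000, random.Random(7))
```

Note that only ratios of the target weights enter the acceptance step, so the normalization constant is never needed.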

Implementation

In this section we show how the sampling algorithm for pairs of sequences is actually designed, based on the background given in the two preceding sections, such that the tails of the probability distributions for the scores can be addressed. The crucial ingredients of the Metropolis-Hastings update are the choice of an appropriate neighborhood and of a sampling distribution, which is composed of, firstly, weights W(s) ∈ R^{+} that assign each score value of interest a weight and, secondly, the null probability, i.e.

Note that we will leave the weights W_{s}

with the normalization constant

For the (HMM) a single distribution of scores is not sufficient: each query is a member of a certain sub-class characterized by the number of transmembrane regions ("# of TM helices"), to be determined by the Viterbi algorithm (see below). Thus, each class has its own probability Prob(s | n_{TM}). In order to take this property into account, we deal with the joint probability Prob(s, n_{TM}). Accordingly, the weights have a two-dimensional domain: instead of W_{s} we write W_{s,n}, where the index n denotes the number of TM helices n_{TM}. The sampling distribution is generalized to

and the reweighting relationship reads as

with, in this case,

Generally the occurrence of two sequences

This simple factorization allows us to draw proposals for the query and for the subject independently. Hence, for simplicity, a neighboring configuration will leave one of the two sequences unchanged. Thus, for selecting a neighboring configuration, first one of the two sequences is chosen at random with probability 1/2. In the case of (FQPS) the subject is always chosen. Then one sequence is chosen from the neighborhood of the selected sequence, as described next. Formally, this means for (RQGS) and (HMM) we use the factorized proposal densities Q_{(x, y), (x', y')} = 0.5 q_{x, x'} **1**_{y, y'} or Q_{(x, y), (x', y')} = 0.5 **1**_{x, x'} q_{y, y'} (**1**_{y, y'} denotes the indicator function, which is one only if y = y', and q_{x, x'} denotes the proposal of a single sequence), depending on the choice of sequence in the first step.

Proposal densities for (FQPS) and (RQGS)

In the simplest case either both sequences are i.i.d. or the query is fixed (to some sequence

and of course in both cases

Due to the factorization that occurs in Eq. (10) it is possible to draw sequences from p^{iid}(x) directly. The detailed balance condition q_{x, x'} p^{iid}(x) = q_{x', x} p^{iid}(x') is fulfilled by the following set of Monte Carlo moves (see also Figure

Monte Carlo moves used in the simulation

**Monte Carlo moves used in the simulation**. (a) substitution, (b) insertion with left shift, (c) insertion with right shift, (d) deletion with right shift and (e) deletion with left shift.

Monte Carlo operations

**operation**

**resulting sequence**

substitution of D at position 5

insertion of D at position 5 with left shift

insertion of D at position 5 with right shift

deletion at position 5 with left shift

deletion at position 5 with right shift

Valid Monte Carlo operations for input sequence

a) substitution of a single symbol at position

b) insertion of a single new symbol at position

c) insertion of a single new symbol at position

d) deletion of a single symbol at position

e) deletion of a single symbol at position

Operation a) appears with probability 1/2 and the other ones with probability 1/2 · 1/4 each. This is one possible choice that guarantees detailed balance.
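A possible realization of these five moves in Python is sketched below. The symbols are drawn uniformly here (in general they would be drawn from the null-model frequencies), and the detail that a deletion is balanced by a freshly drawn symbol entering at the opposite end of the shifted part, so that the sequence length stays fixed, is our reading of the figure, i.e., an assumption:

```python
import random

AA = "ACDEFGHIKLMNPQRSTVWY"   # the 20-letter amino-acid alphabet

def propose(seq, rng):
    """Draw one neighbouring sequence: substitution with prob. 1/2, each of
    the four shift moves with prob. 1/2 * 1/4; the length stays fixed."""
    s = list(seq)
    L = len(s)
    i = rng.randrange(L)       # position of the move
    c = rng.choice(AA)         # freshly drawn symbol (uniform here)
    r = rng.random()
    if r < 0.5:                # (a) substitution at position i
        s[i] = c
    elif r < 0.625:            # (b) insertion at i with left shift:
        s[:i + 1] = s[1:i + 1] + [c]      # prefix shifts left, first symbol drops
    elif r < 0.75:             # (c) insertion at i with right shift:
        s[i:] = [c] + s[i:-1]             # suffix shifts right, last symbol drops
    elif r < 0.875:            # (d) deletion at i with right shift:
        s[:i + 1] = [c] + s[:i]           # prefix shifts right, new symbol in front
    else:                      # (e) deletion at i with left shift:
        s[i:] = s[i + 1:] + [c]           # suffix shifts left, new symbol at the end
    return s
```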

Note that all sequences in _{σ }

With this construction the Metropolis-Hastings ratio Eq. (3) simplifies to the special case of the Metropolis algorithm, i.e.

The right part in the second line cancels, because all contributions to

Proposal densities for the (HMM)

In contrast to the approach presented in the previous section, the generalized method we use here also works for null models that do not allow for direct sampling from

M

**Input: **Sequences x, y with null probabilities p^{query}(x), p^{subject}(y)

**Output: **Possibly new values for

1: Draw (

2: compute

3: compute ^{query }(^{subject}(

4: compute

5: Compute

▷

Q_{(x', y'), (x,y)} = Q_{(x,y), (x', y')}

6: With probability min {1,

Let (

7: **return **(

The algorithm is applicable to all models that allow for a rapid calculation of the null probabilities

In the probabilistic framework of HMMs, one assumes a sequence of "observed" symbols (here, the protein sequences) which is generated conditioned on a sequence of "hidden" states. For the case of TM proteins, the state corresponds to the physical region in which the corresponding amino acid is located, as detailed below. Within the HMM framework, this state sequence, also called

• a finite set ∑ of (output) symbols (in our case the amino acid alphabet),

• a finite set Γ of (hidden) states,

• initial state probabilities π_{μ} for all μ ∈ Γ with Σ_{μ∈Γ} π_{μ} = 1,

• emission probabilities

• a stochastic transition probability matrix (a_{μ,τ})_{μ,τ ∈ Γ}, i.e.

Given these model parameters, the "most natural" application of a HMM is to generate a sequence of hidden states by a stochastic process and, in parallel, to generate a random sequence of symbols given the generated states. Hence, the stochastic process describes pairs of states and symbols. But also for a fixed state sequence s_{1} ... s_{L}, the probability of emitting a given symbol sequence x_{1} ... x_{L} can be computed.

For the Monte Carlo sampling as needed here, it is not possible to simulate a HMM directly to generate symbol sequences, since importance sampling changes the underlying sequence probabilities. Nevertheless, one still needs to compute the probabilities p^{HMM}(x), which is possible in O(L|Γ|^{2}) time using the well-known forward algorithm: let f_{μ}(i) denote the probability that the prefix x_{1} ... x_{i} is emitted and the i-th state is μ. Then p^{HMM}(x) = Σ_{μ∈Γ} f_{μ}(L), and the variables f_{μ}(i) obey a simple recurrence

with initial conditions

Within the same time complexity the

For this purpose one uses a different set of auxiliary variables over the hidden state prefixes s_{1}, ..., s_{i}

with boundary condition

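The forward recursion can be sketched as follows; the two-state model and all probabilities below are purely illustrative, not TMHMM parameters:

```python
def hmm_sequence_probability(obs, init, trans, emit):
    """Forward algorithm: probability of the observed symbol sequence,
    summed over all hidden state paths, in O(L * |states|^2) time."""
    states = list(init)
    # f[mu] = Prob(prefix observed so far, current hidden state = mu)
    f = {mu: init[mu] * emit[mu][obs[0]] for mu in states}
    for sym in obs[1:]:
        f = {mu: emit[mu][sym] * sum(f[tau] * trans[tau][mu] for tau in states)
             for mu in states}
    return sum(f.values())

# tiny two-state toy model (all numbers illustrative, not TMHMM parameters)
init = {"H": 0.5, "L": 0.5}
trans = {"H": {"H": 0.9, "L": 0.1}, "L": {"H": 0.1, "L": 0.9}}
emit = {"H": {"A": 0.8, "C": 0.2}, "L": {"A": 0.2, "C": 0.8}}
p = hmm_sequence_probability("AAC", init, trans, emit)
```

Replacing the sum over predecessor states by a maximum (plus traceback) turns this recursion into the Viterbi algorithm used to determine the number of TM helices.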
For the approach discussed in this section, the subject sequences are drawn almost as above, see below. The HMM approach we use to sample transmembrane queries is the TMHMM developed by Sonnhammer et al.

• Helix core,

• two different groups of caps on either side,

• loops on the cytoplasmic side,

• short and long loops on the non-cytoplasmic side,

• globular domains.

The internal structure of the helix core and loop modules allows modeling different lengths of the corresponding protein domain by assigning jump probabilities. The globular domains have a self-looping structure and hence may also have various lengths. The other modules have fixed lengths. The overall number of model parameters is 216. Figure

The layout of the HMM for transmembrane proteins

**The layout of the HMM for transmembrane proteins**. The layout of the HMM for transmembrane proteins according to Sonnhammer et al.

The following Metropolis-Hastings update consists of two steps: First, the proposal of a new configuration from the neighborhood

The current and new numbers of TM regions n_{TM} and n'_{TM} are determined by the Viterbi algorithm applied on the sequence

This approach allows us to sample non-i.i.d. sequences with appropriate weights and to predict transmembrane helical regions that can be used in the position-specific alignment scheme (as described in

Wang-Landau sampling

The idea of importance sampling is to choose the weights such that each score of interest is visited about equally often, i.e., to aim at a flat score histogram Q^{flat}.

Of course, the weights leading to Q^{flat} are not known beforehand and can only be approximated to a suitable accuracy; the achieved score histogram becomes only approximately flat. The true (unknown) distribution can then be estimated by reweighting the histogram of visited states using the importance sampling formula Eq. (7) for the weights W_{s}

Many iterative sampling schemes to obtain initial guesses were developed in the 1990s, for example entropic sampling, which yields an approximation of Q^{flat} as input for Metropolis-Hastings sampling.

The Wang-Landau algorithm explicitly violates detailed balance by dynamically updating the weights depending on the visited states in the following way: First, a score range of interest [s_{min}, s_{max}] is chosen. The algorithm basically employs a histogram H over the scores (for the (HMM) additionally over the sub-classes n_{TM}, i.e. H(s, n_{TM}) and W(s, n_{TM})). Furthermore, real-valued modification factors m_{i} > 1 are used. Initially, all histogram entries H(s, n_{TM}) are set to 0 in the desired range and all weights W(s, n_{TM}) to a constant, say 1. For the first iteration, the factor m_{0} is used. Then, a simulation is performed using acceptance ratio Eq. (12) or Eq. (14). After each step, corresponding to one step of a (biased) random walk, the weight W(s, n_{TM}) is updated as W(s, n_{TM}) ← W(s, n_{TM}) × m_{i}, with s the current score and n_{TM} the sub-class of the current state. Also the histogram H is updated by one: H(s, n_{TM}) ← H(s, n_{TM}) + 1. In the literature this is often continued until an "approximately flat histogram" is achieved. A possible flatness criterion might be that no histogram entry deviates too strongly from the mean over all s and n_{TM}. Once the histogram is "flat", the modification factor is decreased, the histogram is reset, and the procedure is repeated on [s_{min}, s_{max}].

To summarize, we have the following recipe:

**Input: **Initial guess of the weights W, final modification factor m_{final}, number of samples for production run

**Output: **Histogram of visited scores,

1: ▷

2: Pick any ^{query}(^{subject}(

3: compute

4: compute

5: **while ** m > m_{final} **do**

6:

7: **while **
**do**

8: (

M

9:

10: **end while**

11:

12: **end while**

13: ▷

14:

15: **for **
**do**

16:

17: **repeat**

18: (

M

19: **until **mixing has occurred

20: **end for**

21: **return **counts

Due to the decreasing rule, a choice from m_{0} = exp(0.1) ≈ 1.105 down to m_{final} = exp(0.0002) ≈ 1.0002 has proven valuable.

Since detailed balance is violated explicitly, the convergence of the algorithm cannot be proven. For this reason one should always use the Wang-Landau part as a precomputation step just to obtain weights suitable
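The weight-learning stage can be illustrated on a toy system where the exact answer is known: bit strings with the number of ones as "score", for which the number of configurations per score is the binomial coefficient. The system, the move set and the schedule are illustrative assumptions:

```python
import math
import random

def wang_landau(n_bits, n_sweeps, rng):
    """Learn log-weights logW(s) with the Wang-Landau update so that a
    flat histogram over the score s = number of ones results."""
    logW = [0.0] * (n_bits + 1)     # log sampling weights, updated on the fly
    x = [0] * n_bits                # current configuration (bit string)
    s = 0                           # its score
    lnf = 0.1                       # ln of the modification factor, cf. m_0
    while lnf > 2e-4:               # stop near m_final ~ 1.0002
        hist = [0] * (n_bits + 1)   # histogram of visited scores
        for _ in range(n_sweeps * n_bits):
            i = rng.randrange(n_bits)            # propose flipping one bit
            sp = s + (1 - 2 * x[i])              # score after the flip
            # Metropolis step with the current (biased) weights
            if rng.random() < math.exp(min(0.0, logW[sp] - logW[s])):
                x[i] ^= 1
                s = sp
            logW[s] -= lnf          # make the visited score less attractive
            hist[s] += 1            # (a flatness check on hist could gate the
                                    # reduction; fixed sweeps are used here)
        lnf /= 2.0                  # decrease the modification factor
    return logW

# after convergence, logW[0] - logW[s] approximates the log of the number
# of configurations with score s, here ln C(n_bits, s)
logW = wang_landau(10, 500, random.Random(5))
```

In line with the caveat above, the learned logW would then be frozen and used in a detailed-balance Metropolis-Hastings production run.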

Improvements

Of course there is much room for improvement. For example, consider the time evolution of the histogram of visited states for given query length L_{Q} and subject length L_{S}, over the score range up to s_{max} = 500 with Prob(S = s_{max}) ≈ 10^{-65}, in Figure

Dynamics of the Wang-Landau algorithm

**Dynamics of the Wang-Landau algorithm**. Typical time evolution of the histogram of visited states when starting with different initial guesses. The model parameters are the query and subject lengths (L_{Q} = 348, L_{S} = 200); with a suitable initial guess, the histogram becomes flatter within remarkably less computational effort. Inset: a detailed balance simulation (

When starting with a naive initial guess, the random walker needed about 10^{5} Monte-Carlo steps for a round trip from the lowest score s_{min} = 23 to the highest one s_{max} = 600 and back. The duration of a round trip is a measure of the mixing time of the corresponding Markov chain. Hence, the shorter the round trip time is, the faster the chain converges. During the first round trip, the weights have been improved such that the second round trip (and further round trips) needed only 13% of the computational effort of the first one. Once the random walker has performed its first round trip, the typical round trip time does not change significantly. This tight bottleneck in the very early stage of the algorithm can be overcome by suitable initial guesses of the weights, e.g., weights obtained for smaller query and subject lengths L_{Q} and L_{S}; with such a guess the first round trip took only about 22% of the number of steps needed with the naive guess.

Estimation of the statistical error

Statistical analysis of Markov-chain Monte-Carlo data requires a careful inspection of correlation effects, because the events depend on the history of the chain. These correlations vanish within a typical timescale: events that are separated by a sufficient number of steps can be assumed to be independent. However, since Monte-Carlo methods are only approximative, an assignment of statistical errors is required. In this study we used Flyvbjerg and Petersen's
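The blocking ("binning") analysis can be sketched as follows; the AR(1) test data and all parameter values are illustrative:

```python
import math
import random

def blocking_error(data):
    """Flyvbjerg-Petersen blocking: repeatedly average neighbouring pairs;
    the naive standard error grows with the blocking level until the blocks
    are effectively independent, where it plateaus at the true error."""
    errors = []
    x = list(data)
    while len(x) >= 4:
        n = len(x)
        mean = sum(x) / n
        var = sum((v - mean) ** 2 for v in x) / (n - 1)
        errors.append(math.sqrt(var / n))        # naive error at this level
        x = [(x[i] + x[i + 1]) / 2 for i in range(0, n - n % 2, 2)]
    return errors

# correlated toy data from an AR(1) chain x_t = a*x_{t-1} + Gaussian noise;
# its autocorrelation makes the naive level-0 error estimate far too small
rng = random.Random(11)
a, x = 0.9, 0.0
series = []
for _ in range(1 << 14):
    x = a * x + rng.gauss(0.0, 1.0)
    series.append(x)
errs = blocking_error(series)
```

The plateau of `errs` over the blocking levels is read off as the statistical error; for the strongly correlated toy chain it is several times larger than the level-0 value.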

Results

To our knowledge, we present the first highly accurate score statistics for alignments with position-specific scoring schemes. The alignment scores were calculated with the standard Smith-Waterman algorithm, using the BLOSUM62 matrix for the (RQGS) model and a bipartite BLOSUM62/SLIM scheme for (FQPS) and (HMM) (see Figure

We discuss four different transmembrane proteins as queries (see Table _{Q }

A selection of transmembrane proteins

| **ID** | **AC** | **Description** | **Organism** | **Length** |
| --- | --- | --- | --- | --- |
| OPSD_HUMAN | P08100 | Rhodopsin | H. sapiens | 348 |
| AGTR2_HUMAN | P50052 | Type-2 angiotensin II receptor | H. sapiens | 363 |
| YXX5_CAEEL | Q18179 | Putative neuropeptide Y receptor | C. elegans | 455 |
| ADA1A_HUMAN | P35348 | Alpha-1A adrenergic receptor | H. sapiens | 466 |

A selection of transmembrane proteins. ID: UniProt identifier; AC: accession number.

Score distributions for (RQGS) and (FQPS) models

**Score distributions for (RQGS) and (FQPS) models**. Score distributions for (RQGS) (classical) and (FQPS) models where the subject length equals the query length. In order to compare the shapes, the distributions have been shifted by the center s_{0}. (a): Linear view; all distributions from the (RQGS) agree outside the tails (only two lengths are shown). The shape of the (FQPS) distributions is more variable. (b): Logarithmic view; significant differences between the two models appear in the tail of the distribution. High scores are more probable for the (FQPS) alignment. Furthermore the curvature, i.e. the deviation from the Gumbel form, is much larger for (FQPS) than for the classical model.

Here we observe in Figure

The asymptotic theory for i.i.d. sequences predicts an EVD of the form of Eq. (2), with parameters λ > 0 and K > 0 and the query and subject lengths L_{Q} and L_{S}.

A better fit to the empirical distribution is obtained by determining parameters s_{0}, λ > 0, λ_{2} > 0 for a "modified" Gumbel distribution with

where s_{0} can be interpreted as the center of the distribution. This corresponds to an EVD multiplied with a Gaussian correction factor, given by the last term. The parameter λ_{2} is generally small (and thus shows its effect only in the far right tail). It vanishes for sequences of equal length as the length tends to infinity. Previously, such a correction has been proposed for (RQGS) statistics and has been computed for different parameter sets of BLOSUM62 and PAM250 with affine gap costs
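For illustration, a p-value under such a modified Gumbel form can be evaluated numerically. The discretization, the normalization by summation over integer scores, and the parameter values below are our assumptions; only the functional form (a Gumbel density times a Gaussian correction factor) follows the text:

```python
import math

def modified_gumbel_pvalue(s, s0, lam, lam2, s_max=2000):
    """p-value Prob(S >= s) for integer scores under a Gumbel density
    multiplied by a Gaussian correction exp(-lam2*(s-s0)^2), normalized
    by summation over 0 <= t < s_max (hypothetical parameter values)."""
    def w(t):
        z = t - s0
        return math.exp(-lam * z - math.exp(-lam * z) - lam2 * z * z)
    total = sum(w(t) for t in range(s_max))
    return sum(w(t) for t in range(s, s_max)) / total

# illustrative parameters of the same order of magnitude as the fitted values
p = modified_gumbel_pvalue(s=120, s0=40.0, lam=0.27, lam2=1.0e-4)
```

Because the correction term suppresses the far right tail, a positive λ_{2} yields smaller p-values for high scores than the plain Gumbel form with λ_{2} = 0.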

More pronounced differences are seen in the behavior of the tail (Figure _{S }
_{Q}

Note that the Gaussian correction for local alignment parameterized by λ_{2 }is purely heuristic. Looking at the data, the shape in Figure

Next, we discuss the usefulness of the (FQPS) statistics in terms of retrieval performance. For this purpose we considered the ASTRAL compendium

Retrieval performance for the (FQPS) statistics

**Retrieval performance for the (FQPS) statistics**. ROC curves (true positive rate vs. false positive rate) when searching TM proteins from the ASTRAL reference set against the complete ASTRAL set. Different symbols indicate different p-value thresholds being used. Inset: sensitivity for (FQPS) compared with BLAST search. The plot shows the averaged number of observed helical proteins as a function of the rank in the result set.

The ROC curve in Figure shows that the regions where λ_{2} plays a role are not essential for this purpose. The modified Gumbel statistics, however, affect a possible ranking of database search results, especially for sequences of different lengths. To illustrate this, we used BLAST to retrieve homologs of our four example proteins from the current Swissprot database. The scores were recomputed via the position-specific Smith-Waterman algorithm for (FQPS). We computed the corresponding p-values from our simulation data and ranked the result set by the p-value based on

1. the Gumbel distribution (λ_{2 }= 0) and

2. the accurate distribution (λ_{2 }> 0).

For subject sequence lengths that are not directly covered by our simulations we used interpolated fit parameters. In Table , examples are shown. Between the rankings for λ_{2} = 0 and λ_{2} ≠ 0 we measured a rank correlation

Change of ranking when using the modified Gumbel distribution

Left half: FQPS with λ_{2} = 0; right half: FQPS with λ_{2} ≠ 0.

| **Query** | **rank** | **Subject** | **L_{S}** | **p-value** | **rank** | **Subject** | **L_{S}** | **p-value** |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| P08100 | 433 | Q90456 | 287 | 1.1 × 10^{-21} | 445 | Q8N6U8 | 529 | 7.2 × 10^{-29} |
| | 476 | Q8N6U8 | 529 | 2.1 × 10^{-21} | 483 | Q90456 | 287 | 2.1 × 10^{-28} |
| P50052 | 79 | P32250 | 308 | 1.1 × 10^{-37} | 64 | P34975 | 380 | 1.2 × 10^{-57} |
| | 100 | P34975 | 380 | 1.8 × 10^{-37} | 111 | P32250 | 308 | 1.3 × 10^{-56} |
| Q18179 | 772 | P18901 | 446 | 2.2 × 10^{-21} | 790 | P79291 | 228 | 9.2 × 10^{-27} |
| | 837 | P79291 | 228 | 1.1 × 10^{-20} | 794 | P18901 | 446 | 9.8 × 10^{-27} |
| P35348 | 825 | Q8HYN8 | 297 | 9.8 × 10^{-24} | 826 | O70432 | 167 | 5.2 × 10^{-30} |
| | 937 | O70432 | 167 | 1.3 × 10^{-21} | 847 | Q8HYN8 | 297 | 1.9 × 10^{-29} |

Examples of BLAST hits for the four proteins used for FQPS. The original result sets have been re-ranked according to the FQPS statistics. Left: Gumbel assumption (λ_{2} = 0). Right: modified Gumbel distribution (λ_{2} ≠ 0).

To investigate the impact of dissimilar query and subject lengths L_{Q} and L_{S}, we varied the subject length L_{S}. Figure shows λ and λ_{2} as functions of the ratio L_{S}/L_{Q}. For λ_{2} one has to distinguish between L_{S} < L_{Q} and L_{S} > L_{Q}: with growing sequence lengths λ_{2} decreases, which is not surprising, since the correction term describes a finite-size effect and should vanish for increasing sequence lengths.

Fit parameters for (RQGS) and (FQPS) models

**Fit parameters for (RQGS) and (FQPS) models**. Dependence of the modified Gumbel parameters on the subject/query length ratio L_{S}/L_{Q}. The parameter λ_{2} characterizes the curvature of the pmf in the tail (see Figure 5b). Large differences between (RQGS) and (FQPS) show up in the case L_{S} < L_{Q}; λ_{2} becomes subject-length independent for L_{S} > L_{Q}.

Once the subject length exceeds the query length, the search space is still growing, but the finite length of the query enforces subject size independent edge effects.

For the (HMM), we approximate the score distribution within each class (number of helices = n) by the modified Gumbel form; the fit yields a better χ^{2} value for distributions with a small number of helices. Also, a visual inspection of the fit to the data supports this argument.

Score distributions for different alignment models

**Score distributions for different alignment models**. Score distributions for different alignment models (i.i.d., fixed query and TMHMM) with _{S }_{Q }

The rare-event tail shows clear differences between the different sub-classes of the model over several orders of magnitude. In Figure _{S}
_{Q }
_{2 }in Figure

Fit parameters for different alignment models

**Fit parameters for different alignment models**. Fit parameters for score distributions with L_{Q} = 348 and various subject lengths L_{S}. Both shape parameters λ and λ_{2} decrease with increasing number of helices. The dependency on the subject length is stronger for λ_{2} than for λ. For L_{S} > L_{Q} the dependency of λ_{2} on the subject length is only of marginal order. The bars show the distribution of the number of transmembrane helices obtained by direct simulations of the (HMM). (c),(d): The L_{S}/L_{Q} dependency of λ and λ_{2} extracted from the same data as (a),(b). The lines are guides to the eye only. Dashed lines show the corresponding scaling behavior for the (FQPS) and (RQGS) models. The result for

In analogy to (RQGS) and (FQPS), the curvature remains constant when L_{S} exceeds L_{Q}.

Discussion and conclusions

We have presented a simple universal numerical method to accurately sample the far right tail of the score distribution of various sequence comparison algorithms. It appears to be the first method that is applicable to all classical local alignment statistics, query-specific and position-dependent score statistics, HMM calibration, statistics of normalized alignments, and many more. To sample the distribution using computer simulations, we use Markov-chain Monte Carlo simulations, in particular the Wang-Landau approach in connection with the Metropolis-Hastings algorithm. A priori, the Wang-Landau approach does not require any assumption on the shape of the distribution (for example the parameters of the Gumbel distribution). The parameters can be estimated a posteriori by fitting the simulated distribution to an appropriate parametric form like Eq. (15). Here, we observed that for the (FQPS) model, the Gumbel distribution should be replaced by a more negatively curved one.

The method has a disadvantage: because of the high number of samples required for non-parametric estimation of the distribution, it cannot presently be used in on-line database search web services, such as a BLAST server. For example, generating the 16,777,216 samples for Figure (query length L_{Q}, subject length L_{S})

This is not as bad as it seems, though: Both the implementation and the design of the Markov chain have much room for improvement, e.g. we can choose different neighborhoods

While this still prohibits interactive use, we see a lot of potential for our method to provide an improved version of the

During the preparation of this manuscript we became aware of a new related importance sampling method which is suitable for efficient p-value computations for alignment statistics

Authors' contributions

SW developed the simulation program for (FQPS) and HMM based on an earlier version

Appendix: modified Gumbel parameters

Table _{2 }and

Fit parameters for FQPS and the corresponding RQGS

| Query | L_Q | L_S | λ (FQPS) | 10^4 λ_2 (FQPS) | K (FQPS) | λ (RQGS) | 10^4 λ_2 (RQGS) | K (RQGS) |
|---|---|---|---|---|---|---|---|---|
| P08100 | 348 | 50 | 0.3016 ± 0.40% | 7.5741 ± 0.77% | 0.0654 ± 3.34% | – | – | – |
| | | 100 | 0.1747 ± 0.19% | 3.2202 ± 0.32% | 0.0132 ± 1.49% | 0.2829 ± 0.17% | 3.6884 ± 0.36% | 0.0463 ± 4.09% |
| | | 200 | 0.1617 ± 0.09% | 1.7968 ± 0.18% | 0.0100 ± 1.31% | 0.2685 ± 0.15% | 1.8498 ± 0.40% | 0.0315 ± 2.77% |
| | | 300 | 0.1478 ± 0.14% | 1.3962 ± 0.21% | 0.0059 ± 2.20% | 0.2664 ± 0.14% | 1.1900 ± 0.47% | 0.0292 ± 3.49% |
| | | 320 | 0.1466 ± 0.15% | 1.3775 ± 0.28% | 0.0056 ± 2.33% | 0.2674 ± 0.11% | 1.1059 ± 0.51% | 0.0295 ± 2.05% |
| | | 348 | 0.1432 ± 0.22% | 1.4131 ± 0.33% | 0.0051 ± 2.69% | 0.2681 ± 0.10% | 0.9909 ± 0.43% | 0.0307 ± 2.18% |
| | | 360 | 0.1426 ± 0.17% | 1.4322 ± 0.22% | 0.0047 ± 3.17% | 0.2678 ± 0.10% | 0.9883 ± 0.42% | 0.0302 ± 2.49% |
| | | 400 | 0.1418 ± 0.10% | 1.4201 ± 0.17% | 0.0047 ± 1.43% | 0.2648 ± 0.12% | 1.0238 ± 0.50% | 0.0248 ± 3.89% |
| | | 500 | 0.1399 ± 0.26% | 1.4517 ± 0.35% | 0.0043 ± 3.94% | 0.2638 ± 0.17% | 1.0248 ± 0.65% | 0.0255 ± 5.65% |
| | | 600 | 0.1405 ± 0.16% | 1.4392 ± 0.20% | 0.0047 ± 2.87% | 0.2650 ± 0.14% | 0.9917 ± 0.74% | 0.0245 ± 3.85% |
| P50052 | 363 | 50 | 0.3024 ± 0.85% | 7.4294 ± 1.70% | 0.0657 ± 6.19% | – | – | – |
| | | 100 | 0.1795 ± 0.16% | 3.1869 ± 0.26% | 0.0132 ± 1.42% | 0.2818 ± 0.25% | 3.6993 ± 0.55% | 0.0458 ± 3.44% |
| | | 200 | 0.1660 ± 0.18% | 1.8701 ± 0.30% | 0.0096 ± 1.98% | 0.2698 ± 0.21% | 1.8027 ± 0.58% | 0.0341 ± 4.60% |
| | | 300 | 0.1550 ± 0.22% | 1.3995 ± 0.36% | 0.0066 ± 2.97% | 0.2643 ± 0.14% | 1.2232 ± 0.42% | 0.0273 ± 3.55% |
| | | 330 | 0.1512 ± 0.12% | 1.4130 ± 0.23% | 0.0057 ± 1.30% | 0.2654 ± 0.18% | 1.0822 ± 0.68% | 0.0274 ± 5.32% |
| | | 363 | 0.1509 ± 0.18% | 1.3881 ± 0.27% | 0.0057 ± 3.53% | 0.2687 ± 0.24% | 0.9676 ± 1.00% | 0.0332 ± 7.75% |
| | | 380 | 0.1489 ± 0.12% | 1.4138 ± 0.19% | 0.0051 ± 1.17% | 0.2651 ± 0.30% | 0.9806 ± 1.28% | 0.0270 ± 11.76% |
| | | 400 | 0.1474 ± 0.20% | 1.4335 ± 0.32% | 0.0048 ± 3.27% | 0.2634 ± 0.15% | 0.9773 ± 0.75% | 0.0271 ± 11.41% |
| | | 500 | 0.1471 ± 0.08% | 1.4350 ± 0.16% | 0.0049 ± 1.13% | 0.2613 ± 0.21% | 0.9998 ± 1.05% | 0.0226 ± 7.60% |
| | | 600 | 0.1457 ± 0.28% | 1.4640 ± 0.54% | 0.0046 ± 3.24% | 0.2662 ± 0.15% | 0.9498 ± 0.79% | 0.0250 ± 7.76% |
| Q18179 | 455 | 50 | 0.3008 ± 0.70% | 7.6673 ± 1.23% | 0.0625 ± 5.34% | – | – | – |
| | | 100 | 0.1798 ± 0.33% | 3.7190 ± 0.59% | 0.0103 ± 2.84% | 0.2845 ± 0.16% | 3.5814 ± 0.35% | 0.0485 ± 2.86% |
| | | 200 | 0.1723 ± 0.16% | 1.9839 ± 0.32% | 0.0087 ± 1.50% | 0.2685 ± 0.14% | 1.8391 ± 0.49% | 0.0302 ± 3.81% |
| | | 300 | 0.1609 ± 0.25% | 1.4302 ± 0.40% | 0.0059 ± 4.49% | 0.2632 ± 0.16% | 1.2382 ± 0.53% | 0.0262 ± 4.69% |
| | | 420 | 0.1569 ± 0.27% | 1.3665 ± 0.52% | 0.0050 ± 2.90% | 0.2636 ± 0.17% | 0.8441 ± 0.59% | 0.0222 ± 9.17% |
| | | 450 | 0.1590 ± 0.25% | 1.3225 ± 0.61% | 0.0052 ± 2.86% | 0.2611 ± 0.13% | 0.8203 ± 0.43% | 0.0209 ± 4.93% |
| | | 455 | 0.1548 ± 0.26% | 1.4038 ± 0.52% | 0.0049 ± 2.76% | 0.2655 ± 0.12% | 0.7670 ± 0.49% | 0.0246 ± 8.35% |
| | | 480 | 0.1557 ± 0.38% | 1.3664 ± 0.67% | 0.0051 ± 7.10% | 0.2610 ± 0.10% | 0.7929 ± 0.41% | 0.0197 ± 6.70% |
| | | 500 | 0.1521 ± 0.45% | 1.4145 ± 0.77% | 0.0044 ± 5.30% | 0.2615 ± 0.17% | 0.7783 ± 0.62% | 0.0204 ± 5.09% |
| | | 600 | 0.1540 ± 0.25% | 1.3886 ± 0.43% | 0.0043 ± 3.72% | 0.2596 ± 0.14% | 0.7706 ± 0.60% | 0.0174 ± 5.71% |
| P35348 | 466 | 50 | 0.3046 ± 0.61% | 7.3443 ± 1.17% | 0.0668 ± 4.85% | – | – | – |
| | | 100 | 0.1809 ± 0.18% | 3.1996 ± 0.28% | 0.0135 ± 2.06% | 0.2839 ± 0.22% | 3.6314 ± 0.49% | 0.0465 ± 2.49% |
| | | 200 | 0.1625 ± 0.12% | 1.8687 ± 0.18% | 0.0079 ± 1.63% | 0.2696 ± 0.15% | 1.8030 ± 0.48% | 0.0315 ± 3.97% |
| | | 300 | 0.1643 ± 0.10% | 1.2089 ± 0.15% | 0.0086 ± 2.23% | 0.2620 ± 0.13% | 1.2472 ± 0.47% | 0.0241 ± 5.52% |
| | | 400 | 0.1510 ± 0.24% | 1.2641 ± 0.39% | 0.0051 ± 2.76% | – | – | – |
| | | 450 | 0.1521 ± 0.33% | 1.2357 ± 0.55% | 0.0050 ± 5.39% | 0.2647 ± 0.16% | 0.7874 ± 0.67% | 0.0246 ± 3.93% |
| | | 466 | 0.1485 ± 0.17% | 1.2982 ± 0.35% | 0.0046 ± 2.93% | – | – | – |
| | | 480 | 0.1517 ± 0.23% | 1.2359 ± 0.34% | 0.0056 ± 5.27% | 0.2609 ± 0.25% | 0.7981 ± 1.25% | 0.0207 ± 9.36% |
| | | 500 | 0.1492 ± 0.22% | 1.2845 ± 0.35% | 0.0048 ± 3.64% | 0.2668 ± 0.09% | 0.7124 ± 0.49% | 0.0265 ± 6.00% |
| | | 600 | 0.1509 ± 0.28% | 1.2383 ± 0.40% | 0.0050 ± 3.86% | – | – | – |

Fit parameters λ, λ_2 and K.

Fit parameters for the HMM

All rows of the following tables use query length L_Q = 348; entries marked "–" indicate that λ_2 could not be fitted (see note below).

**HMM n = 0 and n = 1**

| L_S | λ (n = 0) | 10^4 λ_2 | 10^3 K | λ (n = 1) | 10^4 λ_2 | 10^3 K |
|---|---|---|---|---|---|---|
| 150 | 0.2890 ± 0.85% | – | 49.4722 ± 7.27% | 0.2310 ± 9.32% | – | 21.4600 ± 66.56% |
| 200 | 0.2894 ± 2.84% | – | 50.0796 ± 24.47% | 0.2274 ± 1.74% | – | 20.1017 ± 13.25% |
| 300 | 0.2895 ± 2.69% | – | 53.3472 ± 24.00% | 0.2240 ± 4.86% | – | 17.8934 ± 37.22% |
| 348 | 0.2988 ± 3.24% | – | 72.2356 ± 30.15% | 0.2234 ± 2.39% | – | 16.8704 ± 18.79% |
| 360 | 0.2895 ± 1.79% | – | 51.9056 ± 16.04% | 0.2220 ± 2.14% | – | 16.3757 ± 16.52% |
| 400 | 0.2859 ± 3.49% | – | 48.4496 ± 31.10% | 0.2232 ± 2.40% | – | 17.5141 ± 18.94% |
| 500 | 0.2912 ± 6.63% | – | 54.0687 ± 61.22% | 0.2182 ± 2.39% | – | 14.7371 ± 19.10% |
| 600 | 0.2901 ± 3.38% | – | 51.9412 ± 31.74% | 0.2180 ± 2.59% | – | 14.2439 ± 20.86% |

**HMM n = 2 and n = 3**

| L_S | λ (n = 2) | 10^4 λ_2 | 10^3 K | λ (n = 3) | 10^4 λ_2 | 10^3 K |
|---|---|---|---|---|---|---|
| 150 | 0.1968 ± 0.70% | 2.9247 ± 1.37% | 12.0400 ± 6.48% | 0.1767 ± 0.44% | 2.6797 ± 1.01% | 7.4435 ± 3.72% |
| 200 | 0.1947 ± 2.12% | – | 9.8704 ± 14.29% | 0.1795 ± 0.46% | 2.3586 ± 0.92% | 8.5733 ± 3.87% |
| 300 | 0.1937 ± 3.60% | – | 9.9597 ± 25.32% | 0.1863 ± 0.41% | 2.0008 ± 0.94% | 11.7859 ± 5.63% |
| 348 | 0.1888 ± 3.19% | – | 8.1338 ± 22.42% | 0.1876 ± 0.32% | 1.9328 ± 0.89% | 12.1223 ± 3.83% |
| 360 | 0.1926 ± 3.17% | – | 9.7957 ± 22.82% | 0.1853 ± 0.27% | 1.9530 ± 0.65% | 10.8640 ± 2.65% |
| 400 | 0.1934 ± 1.05% | – | 9.9321 ± 8.22% | 0.1757 ± 1.64% | – | 7.1756 ± 11.58% |
| 500 | 0.1919 ± 1.61% | – | 9.3630 ± 12.32% | 0.1783 ± 0.98% | – | 7.7945 ± 7.18% |
| 600 | 0.1912 ± 1.70% | – | 9.3303 ± 13.25% | 0.1768 ± 1.01% | – | 7.4165 ± 8.19% |

**HMM n = 4 and n = 5**

| L_S | λ (n = 4) | 10^4 λ_2 | 10^3 K | λ (n = 5) | 10^4 λ_2 | 10^3 K |
|---|---|---|---|---|---|---|
| 150 | 0.1732 ± 0.47% | 2.2119 ± 1.14% | 7.4991 ± 6.08% | 0.1710 ± 0.38% | 2.0698 ± 0.92% | 8.1950 ± 3.70% |
| 200 | 0.1686 ± 0.28% | 2.1187 ± 0.72% | 6.4162 ± 3.14% | 0.1657 ± 0.39% | 1.8231 ± 1.14% | 6.9148 ± 3.82% |
| 300 | 0.1682 ± 0.36% | 1.9635 ± 0.79% | 6.5436 ± 4.22% | 0.1599 ± 0.37% | 1.7836 ± 0.79% | 5.4451 ± 3.85% |
| 348 | 0.1685 ± 0.35% | 1.9408 ± 0.74% | 7.3851 ± 3.34% | 0.1580 ± 0.28% | 1.7930 ± 0.68% | 5.3049 ± 2.61% |
| 360 | 0.1678 ± 0.42% | 1.9421 ± 0.92% | 6.5775 ± 4.07% | 0.1605 ± 0.23% | 1.7481 ± 0.50% | 5.7512 ± 2.89% |
| 400 | 0.1662 ± 0.18% | 1.9782 ± 0.40% | 6.4164 ± 2.32% | 0.1587 ± 0.28% | 1.7828 ± 0.73% | 5.4513 ± 2.57% |
| 500 | 0.1693 ± 0.24% | 1.9047 ± 0.51% | 7.0735 ± 2.11% | 0.1587 ± 0.16% | 1.7957 ± 0.40% | 5.4770 ± 2.31% |
| 600 | 0.1693 ± 0.17% | 1.8994 ± 0.39% | 7.1112 ± 2.06% | 0.1575 ± 0.29% | 1.8330 ± 0.58% | 5.2125 ± 2.68% |

**HMM n = 6 and n = 7**

| L_S | λ (n = 6) | 10^4 λ_2 | 10^3 K | λ (n = 7) | 10^4 λ_2 | 10^3 K |
|---|---|---|---|---|---|---|
| 150 | 0.1663 ± 0.49% | 2.1403 ± 1.04% | 7.9392 ± 5.83% | 0.1646 ± 0.30% | 2.1396 ± 0.65% | 8.7088 ± 4.21% |
| 200 | 0.1614 ± 0.25% | 1.7767 ± 0.65% | 6.7568 ± 2.30% | 0.1574 ± 0.41% | 1.7687 ± 1.17% | 6.5219 ± 3.81% |
| 300 | 0.1551 ± 0.28% | 1.5986 ± 0.80% | 5.2551 ± 3.18% | 0.1514 ± 0.26% | 1.4638 ± 0.62% | 5.0238 ± 4.34% |
| 348 | 0.1531 ± 0.20% | 1.5993 ± 0.55% | 4.9132 ± 2.71% | 0.1482 ± 0.33% | 1.4755 ± 0.77% | 4.4535 ± 4.13% |
| 360 | 0.1536 ± 0.34% | 1.6036 ± 1.02% | 4.9160 ± 3.41% | 0.1490 ± 0.39% | 1.4479 ± 0.93% | 4.6858 ± 3.28% |
| 400 | 0.1537 ± 0.27% | 1.5713 ± 0.62% | 4.9524 ± 3.05% | 0.1494 ± 0.24% | 1.4328 ± 0.70% | 4.6867 ± 2.08% |
| 500 | 0.1519 ± 0.23% | 1.6229 ± 0.67% | 4.6812 ± 2.14% | 0.1472 ± 0.29% | 1.4706 ± 0.63% | 4.2881 ± 2.50% |
| 600 | 0.1489 ± 0.15% | 1.7148 ± 0.33% | 4.2283 ± 2.16% | 0.1460 ± 0.18% | 1.5193 ± 0.49% | 4.2679 ± 1.74% |

**HMM n = 8 and n = 9**

| L_S | λ (n = 8) | 10^4 λ_2 | 10^3 K | λ (n = 9) | 10^4 λ_2 | 10^3 K |
|---|---|---|---|---|---|---|
| 150 | 0.1595 ± 0.47% | 2.2162 ± 1.01% | 7.5355 ± 4.01% | 0.1603 ± 0.23% | 2.1517 ± 0.48% | 8.0273 ± 2.17% |
| 200 | 0.1534 ± 0.55% | 1.8019 ± 1.46% | 5.9224 ± 5.25% | 0.1508 ± 0.14% | 1.7854 ± 0.28% | 6.3535 ± 1.89% |
| 300 | 0.1473 ± 0.47% | 1.3916 ± 1.24% | 4.8483 ± 4.01% | 0.1413 ± 0.12% | 1.4118 ± 0.35% | 4.2141 ± 1.43% |
| 348 | 0.1458 ± 0.32% | 1.3409 ± 0.85% | 4.6141 ± 3.69% | 0.1398 ± 0.10% | 1.3281 ± 0.33% | 3.9661 ± 1.44% |
| 360 | 0.1469 ± 0.34% | 1.2868 ± 0.90% | 4.9271 ± 2.73% | 0.1400 ± 0.16% | 1.2888 ± 0.43% | 4.0126 ± 1.79% |
| 400 | 0.1440 ± 0.34% | 1.3591 ± 1.05% | 4.0064 ± 3.48% | 0.1382 ± 0.25% | 1.2954 ± 0.67% | 3.7257 ± 2.14% |
| 500 | 0.1433 ± 0.29% | 1.3382 ± 0.85% | 3.9952 ± 2.70% | 0.1352 ± 0.14% | 1.3472 ± 0.42% | 3.1780 ± 1.68% |
| 600 | 0.1416 ± 0.33% | 1.3760 ± 0.94% | 3.7782 ± 3.14% | 0.1359 ± 0.13% | 1.3399 ± 0.38% | 3.3536 ± 1.49% |

**HMM n = 10 and n = 11**

| L_S | λ (n = 10) | 10^4 λ_2 | 10^3 K | λ (n = 11) | 10^4 λ_2 | 10^3 K |
|---|---|---|---|---|---|---|
| 150 | 0.1552 ± 0.14% | 2.2225 ± 0.30% | 6.7936 ± 2.08% | 0.1455 ± 0.14% | 2.3813 ± 0.15% | 4.9660 ± 3.82% |
| 200 | 0.1459 ± 0.22% | 1.8336 ± 0.37% | 5.7585 ± 3.30% | 0.1417 ± 0.17% | 1.8428 ± 0.35% | 5.1264 ± 2.07% |
| 300 | 0.1370 ± 0.22% | 1.4024 ± 0.56% | 3.8087 ± 1.79% | 0.1324 ± 0.27% | 1.3842 ± 0.68% | 3.2129 ± 2.79% |
| 348 | 0.1353 ± 0.15% | 1.2962 ± 0.38% | 3.5507 ± 1.68% | 0.1316 ± 0.22% | 1.2518 ± 0.69% | 3.1546 ± 1.94% |
| 360 | 0.1343 ± 0.13% | 1.2830 ± 0.36% | 3.4674 ± 1.39% | 0.1297 ± 0.25% | 1.2737 ± 0.52% | 2.9445 ± 2.81% |
| 400 | 0.1334 ± 0.16% | 1.2602 ± 0.38% | 3.2164 ± 1.71% | 0.1302 ± 0.20% | 1.2160 ± 0.56% | 2.9704 ± 1.59% |
| 500 | 0.1307 ± 0.16% | 1.3013 ± 0.46% | 2.8331 ± 1.22% | 0.1280 ± 0.30% | 1.2426 ± 0.86% | 2.7433 ± 2.73% |
| 600 | 0.1305 ± 0.23% | 1.3097 ± 0.56% | 2.8239 ± 1.82% | 0.1257 ± 0.22% | 1.2908 ± 0.55% | 2.4921 ± 1.79% |

The table shows the fit parameters of the score distribution for query length L_Q and subject length L_S. Where λ_2 is left out, a suitable fit (with a small reduced χ^2 value) to the modified Gumbel distribution Eq. (15) was not possible, and only the Gumbel parameters of the high-probability region are shown.
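The tail fit behind such table entries can be sketched on synthetic data. This is a minimal illustration, assuming the modified Gumbel tail takes a quadratic-correction form Prob(S ≥ s) ≈ K exp(−λs − λ_2 s²); the exact Eq. (15) is defined in the main text, not in this excerpt, and the parameter values below are merely chosen in the range of the table.

```python
import numpy as np

# Assumed tail form (illustration only): Prob(S >= s) ~ K * exp(-lam*s - lam2*s^2)
rng = np.random.default_rng(0)
lam_true, lam2_true, K_true = 0.27, 1.0e-4, 0.03   # hypothetical values

s = np.linspace(40.0, 120.0, 30)                   # score grid in the tail
prob = K_true * np.exp(-lam_true * s - lam2_true * s**2)
prob *= rng.normal(1.0, 0.02, s.size)              # 2% relative noise

# log Prob(S >= s) = log K - lam*s - lam2*s^2 is a quadratic in s, so an
# ordinary polynomial least-squares fit in log space recovers all three
# parameters while giving the rare-event tail proper weight.
c2, c1, c0 = np.polyfit(s, np.log(prob), 2)
lam2_fit, lam_fit, K_fit = -c2, -c1, np.exp(c0)

print(lam_fit, lam2_fit, K_fit)
```

Relative uncertainties like those quoted in the tables can then be read off the covariance of the fit (e.g. via `np.polyfit(..., cov=True)`).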

Acknowledgements

SW was supported by the German