Department of Mathematics and Computer Science, Freie Universität Berlin, Germany

Present address: Prundsbergstr. 23a, D-82064 Strasslach, Germany

Department of Bioinformatics, Biozentrum, Universität Würzburg, Am Hubland, D-97074 Würzburg, Germany

Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Ihnestr. 73, D-14195 Berlin, Germany

Abstract

Background

Profile Hidden Markov Models (pHMMs) are a widely used tool for protein family research. Up to now, however, there exists no method to visualize all of their central aspects graphically in an intuitively understandable way.

Results

We present a visualization method that incorporates both emission and transition probabilities of the pHMM, thus extending sequence logos introduced by Schneider and Stephens. For each emitting state of the pHMM, we display a stack of letters. The stack height is determined by the deviation of the position's letter emission frequencies from the background frequencies. The stack width visualizes both the probability of reaching the state (the hitting probability) and the expected number of letters the state emits during a pass through the model (the state's expected contribution).

A web interface offering online creation of HMM Logos and the corresponding source code can be found at the Logos web server of the Max Planck Institute for Molecular Genetics.

Conclusions

We demonstrate that HMM Logos can be a useful tool for the biologist: We use them to highlight differences between two homologous subfamilies of GTPases, Rab and Ras, and we show that they are able to indicate structural elements of Ras.

Background

Introduction

Many existing gene or protein sequences in different organisms are related through evolution and can be grouped into families. One way of representing such a family is through a

For

If we ignore the position-specific insertion and deletion probabilities of a pHMM, we can treat it as a PSSM and visualize it with a sequence logo (the makelogo tool of the SAM software package

Profiles and sequence logos

Let Σ be an alphabet and |Σ| its cardinality. For DNA, |Σ| = 4, and the letters of the alphabet are the four nucleotides A, C, G, and T. For proteins, |Σ| = 20, and the letters are the twenty amino acids.

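A profile of length L is a stochastic L × |Σ| matrix P = (P_{ij}) (P_{ij} ≥ 0 for all i and j, and Σ_{j ∈ Σ} P_{ij} = 1 for all i). Row i of P is thus a probability distribution P_{i} on Σ.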

A multiple sequence alignment of N sequences induces a profile as follows. Let N_{ij} be the number of occurrences of letter j in column i, and let N_{i} ≔ Σ_{j ∈ Σ} N_{ij} ≤ N be the number of non-gap characters in that column. The maximum likelihood (ML) estimate of the profile is P_{ij} ≔ N_{ij}/N_{i}. When the multiple alignment contains only a few sequences, ML estimation results in many "impossibilities" (zero probabilities) in the profile and hence in over-fitting the model to the small sample. To counteract this problem, the profile is regularized, either by using Dirichlet mixture priors

The entropy H(P_{i}) of the distribution P_{i} at the i-th position is H(P_{i}) = -Σ_{j ∈ Σ} P_{ij} log_{2} P_{ij}. The entropy H(P_{i}) is always nonnegative. It vanishes if and only if P_{i} is a Dirac distribution, i.e., if the whole mass is accumulated at a single letter. The entropy takes its maximal value of log_{2} |Σ| bits (2 bits for DNA, approximately 4.32 bits for proteins) when P_{i} is the uniform distribution, i.e., when P_{ij} = 1/|Σ| for all j. When we use the binary logarithm log_{2}, the unit of the entropy is called a "bit". When we use the natural logarithm, it is called a "nat", and for log_{10}, it is called a "dit".

We may define the information content I(P_{i}) of position i as I(P_{i}) ≔ log_{2} |Σ| - H(P_{i}).

The information content is a number between 0 and log_{2 }|Σ| bits and measures the conservation of a position in a profile.
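The entropy and information content of a single profile position can be checked numerically. The following is a minimal sketch, not the authors' implementation; the function names and the three example distributions are made up for illustration, and each distribution is assumed to list every letter of the alphabet (so that its length equals |Σ|).

```python
import math


def entropy_bits(dist):
    """Shannon entropy in bits; letters with probability 0 contribute 0."""
    return sum(-p * math.log2(p) for p in dist.values() if p > 0.0)


def information_content_bits(dist):
    """log2 of the alphabet size minus the entropy.

    Assumes dist has one entry per letter of the alphabet.
    """
    return math.log2(len(dist)) - entropy_bits(dist)


# Illustrative DNA positions (made-up numbers).
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
conserved = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}
skewed = {"A": 0.5, "C": 0.25, "G": 0.25, "T": 0.0}

for name, dist in [("uniform", uniform), ("conserved", conserved), ("skewed", skewed)]:
    print(name, entropy_bits(dist), information_content_bits(dist))
```

The uniform position has maximal entropy and zero information content, the fully conserved position has zero entropy and the maximal information content of 2 bits, and the skewed position lies in between.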

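Since conserved positions in sequence families are considered to be functionally or structurally important, they should stand out when the profile is visualized. Schneider and Stephens introduced sequence logos for this purpose: each position i is drawn as a stack of letters whose total height is the information content I(P_{i}).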

While this method works well on DNA alignments, additional considerations must be made for protein sequences. Amino acids naturally occur with different "background" frequencies. For example, tryptophan (W) occurs much less frequently than leucine (L). The background frequencies might be computed by counting amino acid occurrences in all known proteins, or only in the proteins of the superfamily under consideration. Assume that the background frequency of amino acid j is π_{j} > 0. Then the important positions are those whose distribution differs from π. Therefore it has become common practice to consider the relative entropy between P_{i} and π,

H(P_{i} || π) ≔ Σ_{j ∈ Σ} P_{ij} log_{2}(P_{ij}/π_{j}),

where 0·log_{2}(0/π_{j}) ≔ 0 by continuity as long as π_{j }> 0.

Note that for the uniform distribution π_{j} = 1/|Σ|, we have H(P_{i} || π) = I(P_{i}). Thus the information content of P_{i}, as defined above, is its relative entropy distance from the uniform distribution.

In a classical Sequence Logo, the stack height at position i is H(P_{i} || π), the height of letter j is P_{ij} · H(P_{i} || π), the letters are stacked in sorted order, the largest letter being on top of the stack, and colors may be used to highlight different properties of different letters. The HMM Logo inherits all of these characteristics, but also has additional ones to represent the additional information contained in a pHMM.
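The stack and letter heights of a classical logo position follow directly from the relative entropy. The sketch below is an illustration under stated assumptions, not code from the HMM Logo package: the function names are hypothetical, the example distribution is made up, and the background is assumed to be strictly positive for every letter that has positive probability.

```python
import math


def relative_entropy_bits(dist, background):
    """Relative entropy of dist with respect to background, in bits.

    Terms with probability 0 are skipped (0 * log(0 / pi) is taken as 0).
    """
    return sum(p * math.log2(p / background[a]) for a, p in dist.items() if p > 0.0)


def logo_stack(dist, background):
    """Return the stack height and the letters sorted bottom to top.

    Each letter's height is its probability times the stack height, so the
    largest letter ends up last, i.e. on top of the stack.
    """
    height = relative_entropy_bits(dist, background)
    letters = sorted(
        ((a, p * height) for a, p in dist.items() if p > 0.0),
        key=lambda item: item[1],
    )
    return height, letters


# Made-up DNA position compared against a uniform background.
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
position = {"A": 0.5, "C": 0.25, "G": 0.25, "T": 0.0}

stack_height, stacked_letters = logo_stack(position, background)
print(stack_height, stacked_letters)
```

With a uniform background the stack height coincides with the information content of the position; with a non-uniform background, rare letters that are enriched at a position increase the stack height more than common ones.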

Profile HMMs

An HMM is a discrete-time Markov chain that emits a letter from the alphabet Σ whenever a state is visited. The central idea is that only the emitted letters can be observed, but that the state sequence is hidden. A comprehensive review on the topic of HMMs can be found in the literature.

The accompanying figure shows the structure of a profile HMM. For each position i, a "match" state M_{i} models the distribution P_{i} of emitted letters at that position; it corresponds exactly to the profile distribution P_{i}. An "insert" state I_{i} allows for insertion of one or more letters between positions i and i+1. A "delete" state D_{i} is non-emitting and allows a path to bypass the corresponding match state M_{i}, resulting in a deletion at the i-th alignment position. The part consisting of the M_{i}, I_{i}, and D_{i} states, flanked by the

A model of a profile HMM of length 3. Transitions marked by solid arrows constitute the Plan7 model used by HMMER

A path through the main model starts in the silent (non-emitting) begin state B and ends in the silent end state E. We write T_{s,t} for the transition probability from state s to state t; thus T_{s,t} ≥ 0, Σ_{t} T_{s,t} = 1 for all s, and T_{s,t} = 0 whenever no arrow leads from s to t. The possible transitions are M_{i} → I_{i}, M_{i} → M_{i+1}, M_{i} → D_{i+1}; D_{i} → M_{i+1}, D_{i} → D_{i+1}; I_{i} → M_{i+1}, and the self-loop I_{i} → I_{i}.

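There are two major pHMM software packages, HMMER and SAM. HMMER uses the Plan7 model with the seven transitions listed above. In SAM, additionally the transitions D_{i} → I_{i} and I_{i} → D_{i+1} are possible.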

Results

HMM Logo concepts

The relevant information contained in a pHMM of length L consists of

• letter background frequencies π = (π_{j}),

• emission probabilities P = (P_{ij}) for match states (M_{i}),

• emission probabilities for insert states (I_{i}),

• state transition probabilities T = (T_{s,t}).

Sequence Logos can already take care of visualizing the emission probabilities in comparison to the background frequencies. We shall use the remaining dimension of a stack, its width, to visualize information derived from the transition probabilities.

Each path from the begin state to the end state passes, at each position i, through either M_{i} or D_{i}, but not both. When a path hits an insert state I_{i}, several letters may be emitted before it moves on to M_{i+1}. This leads to the following definitions.

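**Definition 1 (Hitting probability).** Let s be a state of the pHMM. The hitting probability h(s) is the probability that a path through the model visits s.

**Definition 2 (Contribution).** Let s be an emitting state of the pHMM. The contribution c(s) is the number of letters that s emits during a pass through the model; its expectation E[c(s)] is the expected contribution of s.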

Computation of hitting probabilities

The hitting probability of a state equals the sum of probabilities of all paths through the model that visit that state.

**Proposition 1.** Define μ_{i} ≔ T_{I_{i-1},M_{i}} / (1 - T_{I_{i-1},I_{i-1}}) as the probability that the insert state I_{i-1} exits into M_{i}. Then 1 - μ_{i} is the probability of exiting into D_{i}. For the Plan7 model disallowing the I_{i-1} → D_{i} transition we have μ_{i} = 1. For the general SAM-type pHMM model allowing all 9 transitions, the hitting probabilities are

• at the first position given by

• at the following positions

Proof. The values of h(M_{1}) and h(D_{1}) are obvious from the figure of the model. We have h(D_{i}) = 1 - h(M_{i}) because each path passes through exactly one of M_{i} and D_{i} (and likewise through exactly one of M_{i-1} and D_{i-1}).

For h(M_{i}), consider the three ways of entering M_{i}. The first term accounts for paths that come directly from M_{i-1}, the second term similarly accounts for direct entries from D_{i-1}, and the last term accounts for paths that enter via I_{i-1}. A similar argument applies to the insert state hitting probabilities, for which there are only two ways of entry. All probabilities can be expressed solely in terms of h(M_{i-1}) as shown. □
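The recursion described in the proof can be sketched in a few lines. This is an illustrative reconstruction from the verbal argument, not the authors' code: it assumes that the begin state B enters only M_1 or D_1, that the insert state I_{i-1} can be entered only from M_{i-1} or D_{i-1}, and that M_i can be entered from M_{i-1}, from D_{i-1}, or via I_{i-1}. The state encoding as tuples such as ("M", 2), the function name, and all transition probabilities in the toy model are made up.

```python
def hitting_probabilities(length, trans):
    """Hitting probabilities of all M, I, and D states of a pHMM.

    trans maps (source, target) state pairs to transition probabilities;
    pairs that are missing are treated as probability 0 (e.g. the two
    SAM-only transitions in a Plan7 model).
    """
    def t(src, dst):
        return trans.get((src, dst), 0.0)

    hit = {("M", 1): t(("B", 0), ("M", 1))}
    hit[("D", 1)] = 1.0 - hit[("M", 1)]
    for i in range(2, length + 1):
        m_prev, d_prev, i_prev = ("M", i - 1), ("D", i - 1), ("I", i - 1)
        m_cur, d_cur = ("M", i), ("D", i)
        # Two ways of entering the insert state between positions i-1 and i.
        hit[i_prev] = hit[m_prev] * t(m_prev, i_prev) + hit[d_prev] * t(d_prev, i_prev)
        # Probability that the insert state eventually exits into M_i.
        leave = 1.0 - t(i_prev, i_prev)
        mu = t(i_prev, m_cur) / leave if leave > 0.0 else 0.0
        # Three ways of entering M_i: from M_{i-1}, from D_{i-1}, via I_{i-1}.
        hit[m_cur] = (hit[m_prev] * t(m_prev, m_cur)
                      + hit[d_prev] * t(d_prev, m_cur)
                      + hit[i_prev] * mu)
        hit[d_cur] = 1.0 - hit[m_cur]
    return hit


# Toy Plan7-style model of length 3 (made-up probabilities).
toy = {
    (("B", 0), ("M", 1)): 0.9, (("B", 0), ("D", 1)): 0.1,
    (("M", 1), ("M", 2)): 0.8, (("M", 1), ("I", 1)): 0.1, (("M", 1), ("D", 2)): 0.1,
    (("D", 1), ("M", 2)): 0.5, (("D", 1), ("D", 2)): 0.5,
    (("I", 1), ("I", 1)): 0.5, (("I", 1), ("M", 2)): 0.5,
    (("M", 2), ("M", 3)): 0.9, (("M", 2), ("I", 2)): 0.05, (("M", 2), ("D", 3)): 0.05,
    (("D", 2), ("M", 3)): 0.6, (("D", 2), ("D", 3)): 0.4,
    (("I", 2), ("I", 2)): 0.2, (("I", 2), ("M", 3)): 0.8,
}
hit = hitting_probabilities(3, toy)
print(hit)
```

In the toy model the insert states cannot exit into a delete state, so μ_i = 1 as in the Plan7 case; adding D → I and I → D entries to the table exercises the general SAM-type case without changing the function.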

Computation of expected contributions

The expected contribution of each state is easily derived from its hitting probability. Since delete states are non-emitting, their contribution is zero.

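**Proposition 2 (Expected contribution).** We have

• E[c(M_{i})] = h(M_{i}),

• E[c(I_{i})] = h(I_{i}) / (1 - T_{I_{i},I_{i}}).

Proof. If a match state M_{i} is hit, it contributes exactly one letter; otherwise it contributes none. Hence E[c(M_{i})] = 1·h(M_{i}) + 0·(1 - h(M_{i})) = h(M_{i}).

If an insert state I_{i} is hit, its contribution has a geometric distribution with "success parameter" (probability of leaving the state) 1 - T_{I_{i},I_{i}}, so that its expected contribution, given that it is hit, is 1/(1 - T_{I_{i},I_{i}}). Multiplying by h(I_{i}) gives the claim. □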

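**Proposition 3 (Expected number of emitted letters).** The expected number of emitted letters during a walk from the begin state to the end state equals the sum of the expected contributions of all match and insert states.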

We find it logical to set the width of the stack of an emitting state proportional to its expected contribution.
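Given the hitting probabilities, the expected contributions follow from one line per state type. The sketch below is an illustration only: it assumes that a match state emits exactly one letter when hit, that the number of letters emitted by an insert state once hit is geometrically distributed with leaving probability 1 minus its self-loop probability (assumed to be less than 1), and that delete states emit nothing. The function names, the tuple encoding of states, and all numbers are made up.

```python
def expected_contributions(hit, self_loop):
    """Expected number of letters emitted by each state.

    hit maps states such as ("M", 1) to hitting probabilities;
    self_loop maps insert states to their self-transition probability.
    """
    contrib = {}
    for state, h in hit.items():
        kind = state[0]
        if kind == "M":
            contrib[state] = h  # one letter if hit, none otherwise
        elif kind == "I":
            # Mean of a geometric distribution on 1, 2, ... times P(hit).
            contrib[state] = h / (1.0 - self_loop[state])
        else:
            contrib[state] = 0.0  # delete states are silent
    return contrib


def expected_sequence_length(contrib):
    """Sum of all expected contributions (linearity of expectation)."""
    return sum(contrib.values())


# Made-up hitting probabilities for a model of length 3.
hit = {
    ("M", 1): 0.9, ("D", 1): 0.1, ("I", 1): 0.09,
    ("M", 2): 0.86, ("D", 2): 0.14, ("I", 2): 0.043,
    ("M", 3): 0.901, ("D", 3): 0.099,
}
self_loop = {("I", 1): 0.5, ("I", 2): 0.2}

contrib = expected_contributions(hit, self_loop)
total_length = expected_sequence_length(contrib)
print(contrib, total_length)
```

An insert state's expected contribution is never smaller than its hitting probability, which is why the two quantities can share one stack width, with the hitting probability occupying a part of it.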

HMM Logo layout

The final definition of an HMM logo is as follows; see Figure

Partial logo (positions 172–209) of the Pfam pkinase model. Positions with narrow match state stacks are likely to be deleted in typical family members. The total width of a red-shaded (dark+light) stack visualizes the expected number of inserted letters. The left dark-shaded part of the stack's width represents the probability that at least one letter is inserted. The difference is illustrated by comparing I_{173 }with I_{176}: Both states have approximately the same expected contribution, but the hitting probability of I_{176 }is higher. The insertion stack height is zero for all shown examples because the emission probabilities correspond to the background frequencies.

• HMM Logos consist of alternating stacks for match and insert states for all positions 1,...,L, in the order M_{1}, I_{1}, M_{2}, I_{2},...,I_{L-1}, M_{L}.

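• The total height of a stack is the relative entropy between the state's emission distribution and the background distribution π.

• The relative height of letter j within a stack is proportional to its emission probability in that state.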

• The letters are stacked in sorted order, the largest letter being on top of the stack.

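• The total width of a stack is proportional to the expected contribution of its state.

• The background of an insert state's stack is shaded in two different colors for a total width proportional to the state's expected contribution: the left, dark-shaded part has a width proportional to the hitting probability, and the light-shaded part covers the remainder.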

• The upper left corner of the logo shows a horizontal bar representing a contribution of 1 letter.

• Insert state stacks are always displayed with a width of at least one pixel, thus making consecutive positions easier to distinguish.

• Letters are drawn in different colors. The color scheme depends on the alphabet; amino acids are colored to represent structural or functional similarity.

• The position number is displayed on the _{i }{_{i}|| π),
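The layout rules above can be summarized as a small geometry computation per stack. The following sketch is not taken from the published PERL package; the function name, the returned dictionary keys, and the pixel scales px_per_letter and px_per_bit are hypothetical choices made for illustration, and the background is assumed to be strictly positive for every emitted letter.

```python
import math


def stack_geometry(emissions, background, hit_prob, exp_contrib,
                   is_insert, px_per_letter=40.0, px_per_bit=30.0):
    """Width, height, and letter heights (in pixels) of one logo stack."""
    # Stack height: relative entropy of the emissions against the background.
    rel_entropy = sum(p * math.log2(p / background[a])
                      for a, p in emissions.items() if p > 0.0)
    # Stack width: expected contribution; insert stacks are at least 1 pixel.
    width = exp_contrib * px_per_letter
    if is_insert:
        width = max(width, 1.0)
    # Dark-shaded part of an insert stack: hitting probability.
    dark_width = min(hit_prob * px_per_letter, width) if is_insert else None
    # Letters sorted bottom to top, largest letter last (on top).
    letters = sorted(((a, p * rel_entropy * px_per_bit)
                      for a, p in emissions.items() if p > 0.0),
                     key=lambda item: item[1])
    return {"width": width, "dark_width": dark_width,
            "height": rel_entropy * px_per_bit, "letters": letters}


background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

# Insert state emitting exactly the background: zero stack height.
insert_stack = stack_geometry(background, background, hit_prob=0.09,
                              exp_contrib=0.18, is_insert=True)

# Rarely used insert state: width is clamped to one pixel.
rare_insert = stack_geometry(background, background, hit_prob=0.0005,
                             exp_contrib=0.001, is_insert=True)

# Match state with a skewed emission distribution.
match_stack = stack_geometry({"A": 0.5, "C": 0.25, "G": 0.25, "T": 0.0},
                             background, hit_prob=0.86, exp_contrib=0.86,
                             is_insert=False)
print(insert_stack, rare_insert, match_stack)
```

The first example reproduces the situation described for HMMER-built models, where insert states emit the background distribution and therefore have a stack height of zero while still occupying horizontal space.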

Visualization of subfamily-specific sites

Since profile HMMs are predominantly used for protein family and domain modeling, we present examples that illustrate the utility of HMM Logos in this area.

While building an HMM for a domain, one usually tries to cover all homologous sequences. But, with ongoing experimental characterization, it often becomes clear that a single domain family consists of multiple, functionally divergent subfamilies.

Identifying these subfamilies and characterizing their determinants is an important step in protein function prediction. Creation of subfamily-specific profile HMMs is a first step in this direction performed by domain databases like SMART

Combining sequence and structure analysis, Pereira-Leal and Seabra identified five regions which distinguish the Rab proteins from Ras like members

Comparison of the HMM Logos of the small GTPases Ras and Rab from SMART

Highlighting of loop regions

An important feature distinguishing HMM Logos from standard Sequence Logos is their ability to visualize regions with long expected inserts. These insertions usually do not happen within conserved structural elements, that is alpha helices or beta sheets, as this would influence and possibly break the structure of the whole domain. Instead, insertions are more likely to occur within loop regions.

Therefore the presence of frequent insertions at a given site can indicate that the site itself and its neighbors lie within a loop region. Figure

Mapping of structural elements to a region of the Ras family HMM Logo. The mapping was obtained by aligning the sequence of p21 ras, the structure of which has been solved, to the Ras family pHMM. Below the logo, insert regions are highlighted by vertical arrows, and the secondary structure of p21 ras is indicated (alpha helices: barrels; beta sheets: horizontal arrows).

Discussion

The examples in the previous section illustrate the potential utility of HMM Logos, but they also point out a particularity of the HMMER software: In all Pfam and SMART pHMMs we looked at, the stack height in all insertion states is zero. This seems to be a consequence of HMMER's hmmbuild program: Insert states receive a very high emission prior that is equal to the background. This makes sense to allow the insertion of variable sequence parts of varying lengths at a position, i.e., in an insert state with high expected contribution. In order to change the emission probabilities away from the background, one would have to observe a consistent insertion that is common to several family members at the same position. Then however, hmmbuild would model this conserved "insertion" as a match state and model the sequences skipping this position via the delete state, even if this is the majority of the family members. This is immediately obvious from the numerous narrow match states shown in Figure

While we hope that HMM Logos can help to compare families visually, the RAS-RAB example (Figure

Conclusion

We have developed a method to visualize profile HMM specific information and demonstrated its utility for the biologist who wants to

A PERL package for parsing and visualizing HMMER pHMMs is available under the GNU General Public License from the authors and can be downloaded from the Logos server of the Max Planck Institute for Molecular Genetics

This will display a logo of the Pfam entry "ATPase family associated with various cellular activities" (AAA), using the default settings. Finally, the logos can be directly accessed from the Pfam website by pressing the "View HMM Logo" button on each domain's or family's overview page.

Authors' contributions

SR had the initial idea to use the stack width to visualize the insertion and deletion probabilities. BSB implemented the software and the web server and invented the two-colored scheme for visualizing both hitting probability and expected contribution of an insert state. This work is part of his Bachelor's degree at the Free University of Berlin. JS examined the small GTPases with HMM Logos. All authors read and approved the final manuscript.

Acknowledgements

We would like to express our gratitude to the PERL community; in particular to the creators of the PERL Data Language (PDL) and to the authors of the modules HMMERViewer (Robin Dowell), Imager (Arnar M. Hrafnkelsson and Tony Cook), and TFBS (Boris Lenhard). We thank Andrea Weiße, Niels Köhler and Victoria Ornelas for valuable comments and discussions, David Studholme from the Sanger Center for information about direct Pfam access, Wilhelm Rüsing for help with the web server, and Martin Vingron for supervising the thesis of BSB. One of the anonymous referees provided valuable comments about applying the method to SAM-style HMMs.