CNRS UMR 6214 – INSERM 1083, Faculté de Médecine, 3 rue Haute de Reculée, Angers, 49045, France

The University of Texas at Dallas, School of Behavioral and Brain Sciences, 800 West Campbell Road, Richardson, TX, 75080-3021, USA

Abstract

Background

The distance matrix computed from multiple alignments of homologous sequences is widely used by distance-based phylogenetic methods to provide information on the evolution of protein families. This matrix can also be visualized in a low dimensional space by metric multidimensional scaling (MDS). Applied to protein families, MDS provides information complementary to the information derived from tree-based methods. Moreover, MDS gives a unique opportunity to compare orthologous sequence sets because it can add supplementary elements to a reference space.

Results

The R package

Conclusions

The

Background

The multiple alignment of homologous sequences provides important information on the evolution and the sequence-function relationships of protein families. Two types of methods, tree-based or space-based methods, can be used to compare sequences (reviewed in

The completion of genome sequencing for a wide variety of organisms has paved the way for comparing protein families across species. A particularly useful property of MDS is the ability to project supplementary elements onto a reference or “active” space. The positions of the supplementary elements (a.k.a. “out of sample” elements) are obtained from their distances to the active elements
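The projection of supplementary elements can be illustrated with a small numerical sketch. The Python/NumPy code below is not the package's implementation: the function names `mds_fit` and `project_supplementary` are hypothetical, all active elements are assumed to have equal masses, and the retained eigenvalues are assumed to be strictly positive. A compact MDS step is included only so that the block runs on its own.

```python
import numpy as np

def squared_distances(a, b):
    """Squared Euclidean distances between the rows of a and the rows of b."""
    return ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)

def mds_fit(sq_dist, n_components=2):
    """Metric MDS of the active elements (equal masses assumed)."""
    d2 = np.asarray(sq_dist, dtype=float)
    n = d2.shape[0]
    centering = np.eye(n) - 1.0 / n               # I - (1/n) 1 1^T
    eigvals, eigvecs = np.linalg.eigh(-0.5 * centering @ d2 @ centering)
    top = np.argsort(eigvals)[::-1][:n_components]
    eigvals, eigvecs = eigvals[top], eigvecs[:, top]
    scores = eigvecs * np.sqrt(eigvals)           # assumes eigvals > 0
    return scores, eigvals

def project_supplementary(sq_dist_active, sq_dist_sup, scores, eigvals):
    """Coordinates of supplementary elements in the active space.

    sq_dist_sup has one row per supplementary element and one column
    per active element (squared distances).
    """
    d2 = np.asarray(sq_dist_active, dtype=float)
    d2_sup = np.atleast_2d(np.asarray(sq_dist_sup, dtype=float))
    n = d2.shape[0]
    centering = np.eye(n) - 1.0 / n
    # Cross-products between supplementary and active elements.
    cross = -0.5 * (d2_sup - d2.mean(axis=1)) @ centering
    # Project onto the active axes and rescale by the eigenvalues.
    return cross @ scores / eigvals

# Toy data: five active points and one supplementary point in a plane.
active = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0], [1.0, 3.0]])
sup = np.array([[1.0, 1.0]])
d2_active = squared_distances(active, active)
d2_sup = squared_distances(sup, active)

scores, eigvals = mds_fit(d2_active, n_components=2)
sup_scores = project_supplementary(d2_active, d2_sup, scores, eigvals)
```

In this toy case the active points are exactly two-dimensional, so the projected point lies at the same distances from the active points as in the input. With real sequence data the projection is only as faithful as the retained components allow.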

MDS is based on the eigen-decomposition (i.e., principal component analysis) of a cross-product matrix derived from the distance matrix

Thus, we have developed the R package

Implementation

Main features

The

While it is possible to use available packages such as

Functionalities

Here, we present the main functionalities of

Data import

Multidimensional scaling relies on distance matrices. The user can provide these distance matrices or compute them from multiple sequence alignments in the FASTA or MSF formats. Sequence alignments are read in with the
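As an illustration of how a distance matrix can be derived from an alignment, the sketch below parses aligned sequences in FASTA format and computes pairwise difference scores. It is written in plain Python for illustration and does not reproduce the package's R functions. The function names are hypothetical, the difference score is assumed to be the fraction of compared positions at which two sequences differ, and positions where either sequence has a gap are assumed to be skipped.

```python
def parse_fasta(text):
    """Return {name: aligned sequence} from FASTA-formatted text."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]      # first token of the header
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}

def difference_score(seq_a, seq_b, gap="-"):
    """Fraction of compared positions that differ (gapped positions skipped)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue                        # assumption: ignore gapped positions
        compared += 1
        if a != b:
            mismatches += 1
    return mismatches / compared if compared else 0.0

def difference_matrix(alignment):
    """Names and the symmetric matrix of pairwise difference scores."""
    names = list(alignment)
    matrix = [[difference_score(alignment[p], alignment[q]) for q in names]
              for p in names]
    return names, matrix

toy_alignment = """>s1
ACDE
>s2
ACDF
>s3
A-DF
"""
names, matrix = difference_matrix(parse_fasta(toy_alignment))
```

The resulting matrix, or its element-wise square root, can then serve as the distance matrix for MDS.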

MDS computation

Briefly, given a matrix of (squared) distances between elements, MDS transforms this matrix of squared distances into a cross-product matrix whose eigen-decomposition provides the factor score matrix giving the coordinates of the elements on the principal components
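The transformation described above can be sketched in a few lines. The Python/NumPy code below is an illustrative sketch under stated assumptions, not the package's implementation: all elements are given equal masses, the function name `classical_mds` is hypothetical, and negative eigenvalues (which arise when the distances are not Euclidean) are simply clipped to zero when computing the coordinates.

```python
import numpy as np

def classical_mds(sq_dist, n_components=2):
    """Metric MDS from a matrix of squared distances.

    Returns the factor scores (one row per element, one column per
    component) and all eigenvalues sorted in decreasing order.
    """
    d2 = np.asarray(sq_dist, dtype=float)
    n = d2.shape[0]
    # Double centering turns squared distances into cross-products.
    centering = np.eye(n) - 1.0 / n
    cross_product = -0.5 * centering @ d2 @ centering
    # Eigen-decomposition of the symmetric cross-product matrix.
    eigvals, eigvecs = np.linalg.eigh(cross_product)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Factor scores: eigenvectors scaled by the square roots of the eigenvalues.
    kept = np.clip(eigvals[:n_components], 0.0, None)
    scores = eigvecs[:, :n_components] * np.sqrt(kept)
    return scores, eigvals

# Toy data: four points in three dimensions.
points = np.array([[0.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [0.0, 2.0, 0.0],
                   [0.0, 0.0, 3.0]])
sq_dist = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
scores, eigvals = classical_mds(sq_dist, n_components=3)
```

Because the toy distances are Euclidean, three components reproduce them exactly; the coordinates match the original points only up to a translation, rotation or reflection.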

Clustering

The MDS representation of the sequence space can be analyzed by

Visualization

The package

Results and discussion

In this section, we show and discuss the results obtained by typical MDS analyses. The input consists of non-redundant sets of non-olfactory class A G-protein-coupled receptors (GPCRs) from different species

Analysis of paralogous sequences

The human set includes 283 aligned sequences of GPCRs

**3D representation of the GPCR sequence space.** A typical multiple sequence alignment of 283 GPCRs from

Different distance matrices can be computed from a multiple sequence alignment. These matrices are based either on a difference score or on a dissimilarity score obtained with an amino acid substitution matrix. In MDS, the distance matrix should be Euclidean or close to a Euclidean matrix. Distances equal to the square roots of the difference scores are Euclidean

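| Scoring method | Distance or matrix | % negative components |
|---|---|---|
| Difference | Square root | 0 |
| Difference | Difference | 3.2 |
| Dissimilarity | BLOSUM30 | 4.1 |
| Dissimilarity | BLOSUM45 | 3.5 |
| Dissimilarity | BLOSUM62 | 3.6 |
| Dissimilarity | BLOSUM80 | 3.5 |
| Dissimilarity | PAM40 | 4.3 |
| Dissimilarity | PAM80 | 4.8 |
| Dissimilarity | PAM120 | 5.1 |
| Dissimilarity | PAM160 | 5.6 |
| Dissimilarity | PAM250 | 6.3 |
| Dissimilarity | GONNET | 3.5 |
| Dissimilarity | JTT | 6.5 |
| Dissimilarity | JTT_TM | 6.7 |
| Dissimilarity | PHAT | 4.1 |

The percentage of negative components represents the weight of the negative components in the variance of the data.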
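One way to quantify how far a distance matrix departs from a Euclidean one is to measure the share of negative eigenvalues of the cross-product matrix. The Python/NumPy sketch below assumes that this share is the sum of the absolute values of the negative eigenvalues divided by the sum of the absolute values of all eigenvalues; this exact formula is an assumption made for illustration, and the function name `percent_negative_components` is hypothetical.

```python
import numpy as np

def percent_negative_components(dist):
    """Weight (in %) of the negative eigenvalues of the cross-product matrix.

    dist is a symmetric matrix of (non-squared) distances.
    """
    d2 = np.asarray(dist, dtype=float) ** 2
    n = d2.shape[0]
    centering = np.eye(n) - 1.0 / n
    eigvals = np.linalg.eigvalsh(-0.5 * centering @ d2 @ centering)
    total = np.abs(eigvals).sum()
    if total == 0.0:
        return 0.0
    negative = -eigvals[eigvals < 0].sum()
    return 100.0 * negative / total

# Distances between three points on a line (Euclidean).
euclidean = [[0.0, 1.0, 2.0],
             [1.0, 0.0, 1.0],
             [2.0, 1.0, 0.0]]
# Distances that violate the triangle inequality (not Euclidean).
non_euclidean = [[0.0, 1.0, 3.0],
                 [1.0, 0.0, 1.0],
                 [3.0, 1.0, 0.0]]

pct_euclidean = percent_negative_components(euclidean)
pct_non_euclidean = percent_negative_components(non_euclidean)
```

For a Euclidean matrix the result is zero up to floating-point round-off, whereas the second toy matrix yields a clearly positive value.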

The sequence spaces of human GPCRs obtained with the different distance matrices do not reveal dramatic differences and the overall patterns are maintained (Figure

**Comparison of scoring methods.** The 2D sequence space of human GPCRs, defined by the first two components of the MDS analysis, was obtained with distances equal to the square roots of the difference scores (a), to the difference scores (b), to the dissimilarity scores calculated from the BLOSUM45 matrix (c) or from the JTT_TM matrix (d). The color code refers to the different sub-families of human GPCRs, with unclassified receptors colored in black. Plots drawn with the

The “noise” of the data can be estimated from the MDS analysis of a random sequence alignment (Figure

**Sequence space of random sequences.** The 2D sequence space of random sequences corresponds to the first two components of the MDS analysis of a random sequence alignment with the same properties as the human GPCR set, obtained with the

Comparison of orthologous sequence sets

In the example shown in Figure

**Projection of supplementary elements.** Multiple sequence alignments of GPCRs from

Figures

The usefulness of the projection technique is illustrated by the example of the somatostatin/opioid receptor sub-family (SO). The input consists of two sets of aligned sequences: the human set that includes 14 SO receptors, and a set of receptors from

**Evolutionary drift.** The active sequence space (open circles) was obtained by the MDS analysis of the human GPCR set. Human receptors from the SO and PEP sub-families are indicated by red and green circles, respectively. The supplementary sequences correspond to SO receptors from

Conclusions

The R package

Availability and requirements

Project name: bios2mds

Project home page:

Operating systems: Platform independent

Programming language: R 2.12

Other requirements: requires the

License: GNU General Public License

Any restrictions to use by non-academics: None

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

JP and JMB contributed equally to this work. JP and MC conceived the package. HA provided the projection method and support to implement it in R. JP and JMB wrote the software, which JP, MC and JMB tested and debugged. MC wrote the first draft of the manuscript, which HA revised and all authors approved.

Acknowledgements

We thank NEC Computers Services SARL (Angers, France) for the kind provision of a multiprocessor server. We thank the Conseil Général de Maine-et-Loire for JP’s fellowship and the Centre Hospitalier Universitaire of Angers and the CNRS for JMB’s studentship. We thank Dr P. Guardiola (Angers, France) for stimulating discussion and advice.