Proteins with internal repeat structures present particular challenges to methods of classification. Major repeat patterns are straightforward to identify and tend to dominate the annotation of sequences conforming to them. However, it may be difficult to find sub-levels within such patterns that can be correlated with specific functions. Leucine-rich repeat (LRR) proteins provide a typical example. Their canonical repeat pattern is well established, but it remains difficult to establish specific markers for subcategories. Protein databases (SMART, InterPro, PRINTS, Pfam, etc.) usually define the canonical leucine-rich repeat and, in addition, describe different subtypes of repeats to account for specific characteristics: bacterial type, cysteine-rich type, ribonuclease inhibitor type, etc. [1,2]. Many LRR proteins contain characteristic Cys-rich capping motifs conserved across species and lineages, and the most common N-terminal and C-terminal LRR-capping motifs have been described in different databases. Recently we determined the crystal structure of decorin, which is the archetypal representative of the extracellular LRR subfamily of small leucine-rich repeat proteins and proteoglycans (SLRPs). The decorin structure shows a unique C-terminal capping motif that does not conform to the most commonly observed type. We have been able to define a consensus pattern that correctly and uniquely identifies all known sequences containing such a capping motif, which we propose is the defining characteristic of the entire SLRP subfamily. The collection of sequences allows us to trace the evolutionary path of SLRPs across the vertebrate lineage (Figure 1). This pattern will be useful in the automatic sequence annotation of LRR proteins belonging to the SLRP subfamily.
Figure 1. Unrooted tree of LRR proteins containing the SLRP Cys-capping motif.
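To illustrate how a consensus pattern of this kind could be applied in automatic annotation, the following is a minimal sketch that translates a PROSITE-style pattern into a regular expression and scans a sequence with it. The function names `prosite_to_regex` and `scan` are hypothetical, and the pattern and sequence used here are toy placeholders chosen only to exercise the syntax; they are not the SLRP capping-motif consensus, which is not reproduced in this section.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Supported syntax: '-' separators, 'x' (any residue), [ABC] (allowed
    residues), {ABC} (forbidden residues), (n) or (n,m) repeat counts,
    and the '<' / '>' terminal anchors.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchored at the C-terminus
            suffix, element = "$", element[:-1]

        # Separate the residue specification from an optional repeat count.
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        if m is None:
            raise ValueError(f"cannot parse pattern element: {element!r}")
        core, low, high = m.groups()

        if core == "x":
            core = "."                   # any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"   # forbidden residues

        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + suffix)
    return "".join(parts)


def scan(pattern: str, sequence: str) -> list[tuple[int, str]]:
    """Return (1-based start position, matched segment) for each hit.

    Hits are non-overlapping, as returned by re.finditer.
    """
    regex = re.compile(prosite_to_regex(pattern))
    return [(hit.start() + 1, hit.group()) for hit in regex.finditer(sequence)]


# Toy pattern and sequence, for illustration only (not the SLRP motif).
TOY_PATTERN = "C-x(2)-[LIV]-{P}-C"
TOY_SEQUENCE = "MKCAALGCKK"

if __name__ == "__main__":
    print(prosite_to_regex(TOY_PATTERN))
    print(scan(TOY_PATTERN, TOY_SEQUENCE))
```

In practice, a pattern intended for database annotation would also need to be checked against known positive and negative sequence sets before use, since a short regular expression can match unrelated proteins by chance.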