RSpred, a set of Hidden Markov Models to detect and classify the RIFIN and STEVOR proteins of Plasmodium falciparum

Joannin, Nicolas; Kallberg, Yvonne; Wahlgren, Mats; Persson, Bengt

doi:10.1186/1471-2164-12-119

Methodology article
Open access
Published: 18 February 2011

BMC Genomics volume 12, Article number: 119 (2011)

Abstract

Background

Many parasites use multicopy protein families to avoid their host's immune system through a strategy called antigenic variation. RIFIN and STEVOR proteins are variable surface antigens uniquely found in the malaria parasites Plasmodium falciparum and P. reichenowi. Although these two protein families are different, they have more similarity to each other than to any other proteins described to date. As a result, they have been grouped together in one Pfam domain. However, a recent study has described the sub-division of the RIFIN protein family into several functionally distinct groups. These sub-groups require phylogenetic analysis to sort out, which is not practical for large-scale projects, such as the sequencing of patient isolates and meta-genomic analysis.

Results

We have manually curated the rif and stevor gene repertoires of two Plasmodium falciparum genomes, isolates DD2 and HB3. We found that about 25% of the rif and stevor genes were mis-annotated and that ~30 were missing from the automatic annotations. Using these data sets, as well as sequences from the well curated reference genome (isolate 3D7) and field isolate data from Uniprot, we have developed a tool named RSpred. The tool, based on a set of hidden Markov models and an evaluation program, automatically identifies STEVOR and RIFIN sequences as well as the sub-groups: A-RIFIN, B-RIFIN, B1-RIFIN and B2-RIFIN. In addition to these groups, we distinguish a small subset of STEVOR proteins that we named STEVOR-like, as they either differ remarkably from typical STEVOR proteins or are too fragmented to reach a high enough score. When compared to Pfam and TIGRFAMs, RSpred proves to be a more robust and more sensitive method. We have applied RSpred to the proteomes of several P. falciparum strains, P. reichenowi, P. vivax, P. knowlesi and the rodent malaria species. All groups were found in the P. falciparum strains, and also in the P. reichenowi parasite, whereas none were predicted in the other species.

Conclusions

We have generated a tool for the sorting of RIFIN and STEVOR proteins, large antigenic variant protein groups, into homogeneous sub-families. Assigning functions to such protein families requires their subdivision into meaningful groups such as we have shown for the RIFIN protein family. RSpred removes the need for complicated and time consuming phylogenetic analysis methods. It will benefit both research groups sequencing whole genomes as well as others working with field isolates. RSpred is freely accessible via http://www.ifm.liu.se/bioinfo/.

Background

Many pathogens have evolved strategies to survive within the hosts they infect. One strategy consists of varying the antigens the pathogen exposes to its host's immune system, usually resulting in the proliferation of multicopy protein families, commonly named Variable Surface Antigens (VSA) [1]. In the case of the malaria parasite Plasmodium falciparum, there are three major VSA that allow the parasite to avoid the host's immune system and establish chronic infections: the Plasmodium falciparum Erythrocyte Membrane Protein 1, RIFIN and STEVOR proteins (reviewed in [2, 3]).

The RIFIN and STEVOR families are groups of VSA proteins that are unique to the Plasmodium falciparum and P. reichenowi parasites [4–9]. They are only present in two species, but they number more than 200 copies per genome. Although the genome of Plasmodium falciparum has been fully sequenced [6], the information obtained for the reference strain does not represent the full knowledge of these antigenic variant protein families. Field isolates investigated for their repertoire of rif and stevor genes show an extensive variability [10, 11]. This hypervariability makes these proteins difficult to study and their primary function(s) remain to be discovered. A recent analysis of the whole rif gene repertoire, which encodes the RIFIN proteins, from the reference genome has concluded that this family can be sub-divided into functionally distinct groups [12]. One of these sub-groups, A-RIFIN, as well as the STEVOR proteins are predominantly exposed to the host's immune system at the surface of the infected red blood cell (RBC) [4, 7, 8].

Sequestration of infected RBCs is a virulence factor that allows the parasite to avoid passage through the spleen, therefore increasing its chances of survival. A recent analysis of gene expression of VSA of a P. falciparum strain isolated from a splenectomized patient showed that A-rif and stevor genes were not expressed [13], whereas, in isolates from normal patients, these genes are expressed [4, 7, 10, 11]. The authors relate this loss of expression to the loss of the sequestration phenotype. Conversely, B-rif genes are expressed regardless of the absence of this virulent phenotype [13]. These differences in phenotype as well as in the localization of these proteins [4, 11, 14, 15] and the predicted sub-functionalization of RIFIN proteins [12] demonstrate the importance of distinguishing each of these sub-groups.

Figure 1 shows a schematic representation of A-RIFIN, B-RIFIN and STEVOR proteins, including the potential signal peptide (SP?), variable regions (V1 and V2), Plasmodium export element motif (PEXEL) [16, 17], conserved regions (C1 and C2) and finally the two predicted transmembrane regions, first a questionable one (TM?) and second a highly probable one (TM).

Currently, the RIFIN and STEVOR protein families are represented by the Pfam domain PF02009 [18]. However, this hidden Markov model (HMM) fails to distinguish RIFIN from STEVOR proteins. There are TIGRFAMS HMMs [19] that do separate RIFIN and STEVOR proteins, but they fail to classify the RIFIN or STEVOR proteins into sub-groups. Although STEVOR, A-RIFIN and the different B-RIFIN groups are identifiable by experts, they require cumbersome phylogenetic methods to be divided into their respective sub-groups [12]. In this study we report the development of a tool, consisting of a set of HMMs and an evaluation program, to automatically sort RIFIN and STEVOR proteins according to their sub-groups. We have named the tool RSpred for RIFIN and STEVOR predictor.

Results

Curation of the RIFIN and STEVOR repertoires of the Plasmodium falciparum DD2 and HB3 genomes

We have carried out manual curation of the RIFIN and STEVOR repertoires in the DD2 and HB3 draft genomes. We used BLAST to detect the DD2 and HB3 sequences, using the entire 3D7 rif and stevor gene repertoire as query and the DD2 and HB3 supercontigs as databases. This allowed us to detect all potential rif and stevor genes.

We compared these BLAST hits with the automatically generated annotations provided by the Broad Institute. Although most of our manually curated genes correspond to automatic annotations, we have revised the exon-intron boundaries for more than 25% of them (three examples shown in Figure 2A). In addition to these modifications, we have found some odd predictions: four of our manually curated genes had automatic predictions as two genes, interrupted by a frame shift or stop codon, and one had been predicted as a shorter hypothetical gene on the opposite strand (data not shown). Finally, we have detected 30 genes that had no automatic predictions at all (example shown in Figure 2B). The naming system of the DD2 and HB3 predicted genes uses the format PFDG_XXXXX and PFHG_XXXXX, where XXXXX is a number. Currently, there are 5380 and 5623 predicted genes for DD2 and HB3, respectively. We have decided to annotate the new genes using incremental numbering from 5381 for DD2 and 5624 for HB3, i.e. PFDG_05381 and PFHG_05624. Additionally, we have appended all the RIFIN and STEVOR genes, manually curated for this study from DD2 and HB3, with "-NJ" in order to distinguish them from the original and future annotations. All curated genes from the DD2 and HB3 draft genomes (193 and 178, respectively) are deposited in the antigenic variation database varDB [20].

Sub-grouping, a new take on the matter

We needed curated data sets of sequences belonging to each group in order to train the HMMs. STEVOR and RIFIN proteins share little similarity, which makes them easy to distinguish from one another after completion of multiple sequence alignment with known STEVOR and RIFIN sequences. Full-length A-RIFIN and B-RIFIN proteins are easily recognized, upon visual inspection of multiple sequence alignments, based on the presence (A-RIFIN) or absence (B-RIFIN) of a fairly conserved 25 amino acid residue indel in the conserved region (Figure 1). However, the sub-groups within the B-RIFIN cluster are not so easily sorted without the help of phylogenetic analysis.

Previous research, based on the RIFIN repertoire of the reference genome, describes three sub-groups in the B-RIFIN cluster: B1-, B2- and B3-RIFIN [12]. Our present analysis confirms the integrity of the B1- and B2-RIFIN sub-groups. However, we find that there is too little coherence (less than 50% average pairwise identity in the reference strain, and low confidence bootstrap scores in the phylogenetic trees) within the B3-RIFIN cluster to make it form a defined sub-group. We propose to redefine these sequences simply as B-RIFIN.
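The coherence criterion mentioned above (average pairwise identity within a cluster) can be illustrated with a minimal sketch. The text does not specify how identity was computed, so the choice made here (identical residues divided by the number of columns in which neither sequence has a gap) is an assumption, and the function names are hypothetical.

```python
from itertools import combinations


def pairwise_identity(a: str, b: str) -> float:
    """Fraction of identical residues over columns where neither row has a gap.

    Assumption: gaps are written as '-' and both rows come from one alignment.
    """
    if len(a) != len(b):
        raise ValueError("sequences must come from the same alignment")
    compared = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not compared:
        return 0.0
    return sum(x == y for x, y in compared) / len(compared)


def average_pairwise_identity(aligned: list[str]) -> float:
    """Mean identity over all unordered pairs of aligned sequences."""
    pairs = list(combinations(aligned, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(pairwise_identity(a, b) for a, b in pairs) / len(pairs)


# Toy alignment (invented sequences, for illustration only).
toy_alignment = ["MKV-L", "MKVAL", "MRVAI"]
toy_average = average_pairwise_identity(toy_alignment)
```

A cluster whose average falls below a chosen threshold (50% in the text) would, under this reading, be considered too heterogeneous to form a defined sub-group.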

We also investigated the homogeneity of the STEVOR family. In phylogenetic trees derived from multiple sequence alignments of STEVOR proteins from the three P. falciparum genomes (3D7, HB3 and DD2), the majority of STEVOR proteins form a cluster. However, a small group of proteins, which we call STEVOR-like, clusters separately from the main STEVOR group (Figure 3). These sequences differ from typical STEVOR proteins in their amino acid composition from the signal sequence through the majority of the conserved domain. Also, the variable domain's length is less consistent than in most STEVOR proteins. Regardless of these differences, STEVOR-like proteins share short amino acid motifs throughout the protein, as well as the entirety of the very typical C-terminus, with STEVOR proteins.

Sorting out the results and limits of detection

A program was created to evaluate the results obtained when the five HMMs were used in database searches. This program uses cut-offs to determine the proper call for each sequence (Figure 4). Since there are several cut-offs, our method includes several limits of detection (LOD).

The first LOD is the detection of sequences as True or False: whether they are RIFIN or STEVOR sequences or neither. Any score <20 is considered False, i.e. not a RIFIN or a STEVOR. Of all the curated sequences in our dataset, three have scores <20: PFDG_05381, PFDG_04771 and PFDG_04350. The first protein, PFDG_05381, is an extremely short protein derived from a gene at the end of the supercontig 1.45. The sequencing coverage and assembly of contig ends are often questionable, generating erroneous sequences; therefore it is not surprising that this protein is not detected with the STEVOR HMM. The second protein, PFDG_04771, is one of the three sequences of the rifA2 group described by Wang et al. [21]. The two other rifA2 sequences, PFD0070c and PFHG_03700, are among the proteins with the lowest of all the positive A-RIFIN HMM scores (60.9 and 63.8 respectively). These three sequences are extremely similar to each other with the exception of a short variable region preceding the C-terminal transmembrane domain. In the case of PFDG_04771, this region is a low-complexity repeat of the SSGGS motif. Additionally, this sequence is missing its N-terminal end. We assume that these circumstances, as well as the divergence of the rifA2 proteins from the basic RIFIN type, reduced its score below the detection limit. Although these sequences are full length (with the exception of PFDG_04771), all other low scoring (but higher than any rifA2) A-RIFIN sequences are fragments, again stressing the atypical properties of rifA2. The third protein below the first LOD, PFDG_04350, is a partial sequence (119 residues) covering only the C-terminal part of the protein. It is most similar to PFL2585c, a protein with very atypical N- and C-terminal ends, although the majority of the protein is typical of A-RIFIN proteins. The limited length and odd sequence of PFDG_04350 prevent its recognition as a RIFIN protein. Thus the three proteins failing to reach the first LOD have too little sequence similarity to be identified as RIFIN or STEVOR sequences.

The second LOD is specific to STEVOR proteins: if the score against the STEVOR HMM is higher than the True/False cut-off, but <120, then the sequence is reliably related to STEVOR proteins, but either differs from typical STEVOR sequences or is too fragmented to reach a high enough score. We refer to these potential STEVOR sequences as STEVOR-like proteins. The protein fragment PFHG_05644 is an example of a low-confidence sequence (score <120) that we assign as STEVOR-like, although it probably is a valid STEVOR fragment. Among the sequences that score <120 with the STEVOR HMM are two identical sequences, PFC0045w and PFDG_03056, found in the 3D7 and DD2 strains, respectively. The PlasmoDB version 7.1 annotation for the PFC0045w protein is "RIFIN". However, although they are distinct from STEVOR proteins, our phylogenetic analysis clearly shows that these sequences are not RIFIN proteins, as they tend to cluster separately from the RIFIN and closer to STEVOR proteins. Until we can accumulate more sequences of this type, RSpred will predict these proteins to be similar to STEVOR and will assign them the STEVOR-like tag.

The third LOD is specific to RIFIN proteins: if the score against either the A-RIFIN or the B-RIFIN HMM is higher than the score against the STEVOR HMM, but <300, then the sequence is reliably a RIFIN protein, but it is not possible to identify its sub-group. Typical examples are fragments of proteins, e.g. PFDG_04007, PFHG_05281 and A1KQT0 (from DD2, HB3 and Uniprot respectively). In several cases, the short length of the sequence and the absence of determining properties (e.g. the 25-amino-acid-residue indel) result in these sequences having low scores against both the A-RIFIN and the B-RIFIN HMMs. Some rare proteins include enough of the conserved C1 region to identify them as A- or B-RIFIN, but nevertheless score <300 and are thus sorted into the RIFIN group. These sequences are most often truncated or have a very odd amino acid composition, e.g. PFDG_02116 and PFHG_03477, respectively, possibly caused by low sequencing coverage or genome assembly problems.

Finally, the fourth limit of detection concerns B1- and B2-RIFIN proteins: if the score against the B-RIFIN HMM is >300, but the B1- and B2-RIFIN HMMs do not reach the cut-offs, then the protein will be evaluated as B-RIFIN instead of its proper sub-group. Among all the sequences from our curated dataset, we have not detected any false negative B1- or B2-RIFIN sequences.
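The four limits of detection described above can be summarised as a decision procedure. The following is a minimal sketch, not the authors' evaluation program: the cut-offs 20, 120 and 300 come from the text, whereas the B1- and B2-RIFIN cut-offs are given only in Figure 4 and are therefore left as required parameters. The order of the tests, the handling of scores exactly equal to a cut-off, and all names are assumptions.

```python
from dataclasses import dataclass

TRUE_FALSE_CUTOFF = 20.0       # first LOD (from the text)
STEVOR_CUTOFF = 120.0          # second LOD (from the text)
RIFIN_SUBGROUP_CUTOFF = 300.0  # third and fourth LOD (from the text)


@dataclass
class Scores:
    """HMM scores of one sequence against the five models."""
    stevor: float
    a_rifin: float
    b_rifin: float
    b1_rifin: float
    b2_rifin: float


def classify(s: Scores, b1_cutoff: float, b2_cutoff: float) -> str:
    """Return a group label; b1_cutoff and b2_cutoff are NOT stated in the text."""
    best_rifin = max(s.a_rifin, s.b_rifin)
    # First LOD: neither RIFIN nor STEVOR.
    if max(s.stevor, best_rifin) < TRUE_FALSE_CUTOFF:
        return "False"
    # Second LOD: STEVOR versus STEVOR-like.
    if s.stevor >= best_rifin:
        return "STEVOR" if s.stevor >= STEVOR_CUTOFF else "STEVOR-like"
    # Third LOD: a RIFIN whose sub-group cannot be identified.
    if best_rifin < RIFIN_SUBGROUP_CUTOFF:
        return "RIFIN"
    if s.a_rifin >= s.b_rifin:
        return "A-RIFIN"
    # Fourth LOD: B-RIFIN unless a B1 or B2 model reaches its cut-off.
    passing = [
        (score, label)
        for score, label, cutoff in (
            (s.b1_rifin, "B1-RIFIN", b1_cutoff),
            (s.b2_rifin, "B2-RIFIN", b2_cutoff),
        )
        if score >= cutoff
    ]
    return max(passing)[1] if passing else "B-RIFIN"


# Placeholder sub-group cut-offs, chosen only to make the example run.
example_call = classify(Scores(15, 150, 90, 0, 0), b1_cutoff=300, b2_cutoff=300)
```

The sketch does not attempt to flag the hybrid A/B sequences discussed later in the article; it returns a single label per sequence.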

Automatic detection of RIFIN and STEVOR sub-groups in draft genomes

We applied our HMMs to all coding sequences (CDS) equal to or longer than 100 amino acids from 15 draft genomes (downloaded from the Broad Institute of Harvard and MIT [22] and the Wellcome Trust Sanger Institute [23]) that do not have available annotations. The screening of these CDS gave variable results, depending on the genome, from 76 to 286 RIFIN and STEVOR sequences detected (see Table 1 for the distribution per sub-group). Although most of these genomes have been sequenced to a very low coverage (1.25×), each sub-group was detected in almost all genomes. The only exceptions are the 7G8 genome in which B1-RIFIN proteins were not found and FCC-2_hainan in which B2-RIFIN proteins were not detected. Interestingly, the Plasmodium reichenowi genome had the highest number of hits.

Table 1 Prediction of RIFIN and STEVOR proteins in 15 draft genome datasets

Negative datasets

Currently, RIFIN and STEVOR proteins have only been found in Plasmodium falciparum and the related P. reichenowi. Neither Pfam nor TIGRFAMs detect these proteins in any other known species. Additionally, orthology prediction tools and databases do not yield any RIFIN or STEVOR homologues in any other species [24–26]. Finally, investigations of other Plasmodium multigene families have not detected any RIFIN or STEVOR homologous proteins [27, 28]. Hence, we decided to use other Plasmodium species as negative controls. No RIFIN or STEVOR sequences were predicted in P. vivax, P. yoelii, P. berghei, P. knowlesi or P. chabaudi. RSpred was also run against the entire Uniprot database, but there were no RIFIN or STEVOR sequences predicted, except for those belonging to P. falciparum.

Comparison with Pfam and TIGRFAMs

Other prediction methods exist for the RIFIN and STEVOR protein families, although each one has its limitations. Pfam [18] only predicts if a sequence is a RIFIN/STEVOR (PF02009) or not, while TIGRFAMs [19] only separates RIFIN (TIGR01477) from STEVOR (TIGR01478) proteins. Additionally, the TIGRFAMs were trained as global models and therefore do not detect sequence fragments. Neither of the two predicts RIFIN sub-groups, as RSpred does.

In order to test the sensitivity of the three methods, we applied them to the set of RIFIN and STEVOR sequences that were not used for the training of RSpred. Out of 339 RIFIN/STEVOR sequences, RSpred identified 338 (99.7%) of them, whereas Pfam detected 332 (97.9%) and TIGRFAMs only detected 297 (87.6%). Both TIGRFAMs and Pfam fail to identify low scoring STEVOR, and the former also fails to identify fragments. The sorting of RIFIN and STEVOR proteins into sub-groups makes RSpred more specific than the other models. In addition, RSpred detects more sequences than Pfam and TIGRFAMs; it is therefore also the most sensitive of the three methods.
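The percentages quoted above follow directly from the counts. As a short worked check, using only the numbers in the text (the variable and function names are hypothetical):

```python
TOTAL_SEQUENCES = 339  # RIFIN/STEVOR sequences not used for training

# Number of sequences detected by each method, as reported in the text.
DETECTED = {"RSpred": 338, "Pfam": 332, "TIGRFAMs": 297}


def sensitivity_percent(found: int, total: int) -> float:
    """Detected fraction expressed as a percentage, rounded to one decimal."""
    return round(100 * found / total, 1)


for method, found in DETECTED.items():
    print(f"{method}: {found}/{TOTAL_SEQUENCES} = "
          f"{sensitivity_percent(found, TOTAL_SEQUENCES)}%")
```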

Discussion

Redefining the RIFIN and STEVOR sub-groups

Previous studies describe RIFIN and STEVOR sequences as a large group of related proteins unique to P. falciparum. Subsequent analysis of the RIFIN protein family, based on the reference genome, showed that the RIFIN family can be further sub-grouped into A- and B-RIFIN sequences and the latter divided into B1-, B2- and B3-RIFIN [12].

Our current analysis, which includes many more sequences, confirms the sub-division of RIFIN sequences into A-, B1- and B2-RIFIN groups, which all have defined characteristics. However, it would be an overstatement to create a defined group for the remaining B-RIFIN sequences. These sequences represent a heterogeneous cluster (10 genes in the 3D7 reference strain) of sequences that are defined by the fact that they are not A-RIFIN sequences and have relatively little similarity to B1- and B2-RIFIN proteins. We have therefore decided to demote the B3-RIFIN sequences to the rank of B-RIFIN.

A recent study has defined potential sub-groups within the A-RIFIN sequences, rifA1 and rifA3. These groupings rely on sequence similarity of 71% and 84% and, for a large majority, their genomic location in a head-to-head orientation with group A var genes [21]. We have not trained HMMs to recognize these groups because of the low number of sequences available from the curated datasets. Also, we find that there are several other such sub-group candidates, but the small number of sequences within a single genome makes it difficult to distinguish between bona fide sub-groups and recently expanded genes.

These authors also defined a sub-group, rifA2, which is composed of one divergent RIFIN sequence that is present, with 78% conservation, in all genomes investigated [21]. Single-copy genes that are highly conserved between genomes are possibly better classified as conserved genes rather than as sub-groups. Also, we have noted that the proteins that compose the rifA2 group score the lowest of all RIFIN sequences, with one of them predicted as "false". The fact that partial A-RIFIN protein sequences score higher than the full length rifA2 sequences, together with the divergence of these sequences from typical RIFIN proteins, strongly suggests that they are related to RIFIN proteins but have a different function that does not require multiple copies for the survival of the parasite.

In this study, we have only focused on the three genomes (3D7, HB3 and DD2) for which annotations are available as well as the Uniprot database that contains data from field studies. We confirm the finding by Wang et al. [21] that several RIFIN sequences are relatively conserved across strains; however, it is difficult to evaluate whether this represents a measure of the divergence of parasite populations or whether they have been evolutionarily selected for specific functions.

Also, we have chosen to adopt a conservative approach to the STEVOR designation. All sequences that are clearly related to STEVOR sequences, but that do not score high enough will be tagged STEVOR-like by the RSpred program.

Ambiguous sequences

Four sequences predicted to be A-RIFIN proteins also had relatively high scores (>300) with either the B1- or the B2-RIFIN HMM. Upon closer inspection of these sequences, applying phylogenetic analysis to alignments of each half of these proteins, it appears that their N-terminal half corresponds well with A-RIFIN sequences whereas their C-terminal half is characteristic of B1- or B2-RIFIN proteins (data not shown). These sequences are hybrids between A- and B1/2-RIFIN proteins and confirm previous reports of recombination as a means for the diversification of these VSA gene families [29].

Advantages, limits and utility of RSpred

We have named our set of HMMs and the evaluation program RSpred, for RIFIN and STEVOR predictor. We have shown that it efficiently detects RIFIN and STEVOR proteins and classifies them according to their sub-group. Although there are no false positive detections, RSpred is conservative with truncated and remotely related sequences. However, most of these sequences are at least recognized and predicted as RIFIN or STEVOR proteins. Finally, RSpred proves to be more sensitive than the existing Pfam and TIGRFAMs HMMs [18, 19], which are also limited in the scope of their classification, as they do not recognize RIFIN or STEVOR sub-groups.

We have applied RSpred to whole proteomes extracted from novel genome assemblies. Although these genomes are mostly sequenced to a very low coverage (1.25×), we were able to detect all sub-groups within these genomes. This resource will be increasingly useful as more genomes are being sequenced: in particular, there is a large Plasmodium genome sequencing project [30] that is scheduled to sequence over 100 Plasmodium parasite genomes, which will allow for meta-genomic analysis of the RIFIN and STEVOR protein families.

Conclusions

The analysis of proteins that are members of large families is often overwhelming because of the difficulty of assigning a proper classification. The RIFIN and STEVOR families are such groups of proteins: complications are in part due to their large diversity within each parasite's genome, but even more so to the extreme diversity between parasite populations [4, 5, 10, 11, 31]. Our prediction tool, RSpred, is designed to simplify the classification of these proteins into previously identified sub-groups [6, 12] with the following benefits:

It eliminates the need to manually retrieve reference sequences and perform multiple sequence alignments;
It eliminates the need for any prior knowledge of these protein families in order to sort them properly;
It outperforms existing tools;
It identifies and sorts RIFIN proteins into RIFIN, A-RIFIN, B-RIFIN, B1-RIFIN and B2-RIFIN.

Although these groups probably have diverged in function [12], the sequence conservation between these proteins suggests that their respective functions are still closely related. Greater knowledge of the smaller sub-groups B1- and B2-RIFIN proteins will improve our understanding of the larger A-RIFIN and STEVOR groups that play a more preponderant role at the surface of the infected host cell [4, 13].

Methods

Data sets, retrieval and curation

We obtained sequence information from several sources, including PlasmoDB [32], Uniprot [33], the Wellcome Trust Sanger Institute [23] and the Broad Institute of Harvard and MIT [22].

3D7 sequences

We used the search functionality of PlasmoDB v6.3 to retrieve all proteins annotated as RIFIN and STEVOR (221 sequences), excluding MAL7P1.208, which is annotated as RIFIN-like but is more similar to Rhoptry Associated Membrane Antigen (RAMA) proteins.

DD2 & HB3 retrieval and curation

We downloaded all data files pertaining to the DD2 and HB3 genomes (version 1) from the Broad Institute website [22].

The supercontigs of both DD2 and HB3 were searched against the 3D7 repertoire of rif and stevor genes using BLASTn [34]. The BLAST results were visualized using Artemis and ACT (Artemis Comparison Tool) [35, 36]. Each hit in the draft genomes was manually checked for the presence of a Broad Institute annotation (BIA). Generally, one of three scenarios occurred:

1. Either there was an annotated gene corresponding to the manually curated rif or stevor gene. In this case, the gene would take the BIA gene name.
2. Or there was an annotated gene that did not quite overlap with the manual curation. In this case, the manually curated gene would take the BIA gene name.
3. Or there was no annotated gene at or near those coordinates. In this case, a new gene would be annotated with a new name.
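The naming rule behind these three scenarios can be sketched as follows. This is an illustration under stated assumptions, not the procedure actually used, which was manual: genes are reduced to coordinate intervals on one supercontig, strand is ignored, "at or near" is modelled by a hypothetical `slack` parameter, and the annotation name in the example is invented. The incremental numbering and the "-NJ" suffix follow the Results section.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    start: int
    end: int


def name_curated_gene(start: int, end: int, annotations: list[Gene],
                      prefix: str, next_number: int,
                      slack: int = 0) -> tuple[str, int]:
    """Return (gene name, next unused number) for one curated gene.

    Scenarios 1 and 2: an automatic annotation overlaps the curated gene
    (fully or partially), so the curated gene takes that name.
    Scenario 3: nothing overlaps, so a new incremental name is created.
    """
    best, best_shared = None, 0
    for gene in annotations:
        # Length of the shared interval, widened by `slack` on each side.
        shared = min(end, gene.end + slack) - max(start, gene.start - slack) + 1
        if shared > best_shared:
            best, best_shared = gene, shared
    if best is not None:
        return f"{best.name}-NJ", next_number
    return f"{prefix}_{next_number:05d}-NJ", next_number + 1


# Invented annotation, for illustration only.
existing = [Gene("PFDG_00010", 100, 900)]
reused = name_curated_gene(150, 950, existing, "PFDG", 5381)
created = name_curated_gene(5000, 6000, existing, "PFDG", 5381)
```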

We detected 193 and 179 RIFIN and STEVOR sequences from DD2 and HB3, respectively.

Field isolate data

We retrieved all RIFIN and STEVOR protein sequences from the Uniprot Knowledgebase [33] (446 sequences). We then removed all sequences from the 3D7 reference genome (215 sequences after filtering).

Additional draft genomes

Finally, we retrieved additional draft genome sequences from the Broad Institute and Wellcome Trust Sanger Institute websites [22, 23]. The additional genomes downloaded from the Broad Institute were Plasmodium falciparum supercontigs files of 7G8 nucleus, D10 nucleus, D6 nucleus, Fcc-2/Hainan nucleus, RO-33 nucleus, Santa Lucia (SL) nucleus, K1 nucleus, Senegal_V34.04 nucleus, VS/1 nucleus, IGH-CR14 nucleus, RAJ116 nucleus http://www.broadinstitute.org/annotation/genome/plasmodium_falciparum_spp/MultiDownloads.html and from the Wellcome Trust Sanger Institute were the Plasmodium falciparum Ghanaian Isolate contigs version 20080302 ftp://ftp.sanger.ac.uk/pub/pathogens/Plasmodium/falciparum/Ghanaian_Isolate/ and IT strain supercontigs version 2007114.phusion ftp://ftp.sanger.ac.uk/pub/pathogens/Plasmodium/falciparum/IT_strain/Archive/, as well as the Plasmodium reichenowi contigs version 031104 ftp://ftp.sanger.ac.uk/pub/pathogens/Plasmodium/reichenowi/.

These sequence data were produced by the Broad Institute and Wellcome Trust Sanger Institute, respectively.

At the time of writing, these genomes have no official annotations; therefore, using Artemis, we extracted from them all coding sequences (CDS) equal to or greater than 100 amino acids long, regardless of the presence of a start codon (see Table 1).
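Extracting every open reading frame of at least 100 amino acids, regardless of a start codon, amounts to translating all six frames and keeping the stretches between stop codons. The sketch below illustrates that idea; it is not Artemis's implementation, it assumes the standard genetic code, and the function names are hypothetical.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna: str) -> str:
    """Translate complete codons; codons with ambiguous bases become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def open_frames(dna: str, min_aa: int = 100) -> list[str]:
    """Return stop-to-stop translations of at least min_aa residues.

    All six reading frames are scanned; no start codon is required.
    """
    dna = dna.upper()
    reverse_complement = dna.translate(COMPLEMENT)[::-1]
    found = []
    for strand in (dna, reverse_complement):
        for offset in range(3):
            for segment in translate(strand[offset:]).split("*"):
                if len(segment) >= min_aa:
                    found.append(segment)
    return found


# Toy input: five alanine codons, a stop codon, then two glycine codons.
toy_frames = open_frames("GCT" * 5 + "TAA" + "GGT" * 2, min_aa=5)
```

Because a start codon is not required, fragments of genes that run off the end of a contig are retained, which matters for low-coverage assemblies.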

Sequence analysis for sub-group determination

All alignments were carried out using MAFFT or Kalign 2, with default parameters [37, 38]. We used Jalview and Bioedit for alignment visualization and editing [39, 40]. Phylogenetic analysis was carried out with Molecular Evolutionary Genetic Analysis 4 (MEGA 4) [41]. All phylogenetic trees were built with the Neighbor-Joining method, considering gaps and missing data as pairwise deletions and using the Amino: Poisson correction model. Phylogenetic trees were tested with 500 bootstrap replicates.
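For readers unfamiliar with the distance model named above, the Poisson correction converts the proportion p of differing sites into a distance d = -ln(1 - p), and pairwise deletion means that only columns in which both sequences have a residue are compared. The sketch below shows this calculation for one pair of aligned sequences; it is not MEGA's code, and treating '-' and '?' as gap or missing-data symbols is an assumption.

```python
import math

MISSING = "-?"  # assumed symbols for gaps and missing data


def poisson_distance(a: str, b: str) -> float:
    """Poisson-corrected amino acid distance with pairwise deletion."""
    if len(a) != len(b):
        raise ValueError("sequences must come from the same alignment")
    # Pairwise deletion: drop columns where either sequence lacks a residue.
    sites = [(x, y) for x, y in zip(a, b)
             if x not in MISSING and y not in MISSING]
    if not sites:
        raise ValueError("no comparable sites")
    p = sum(x != y for x, y in sites) / len(sites)
    if p >= 1.0:
        return math.inf  # the correction is undefined when every site differs
    return -math.log(1.0 - p)


# Toy pair: four comparable columns, two of which differ (p = 0.5).
toy_distance = poisson_distance("MKV-L", "MRVAI")
```

A Neighbor-Joining tree would then be built from the matrix of such distances over all sequence pairs.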

We first aligned all sequences together in order to distinguish STEVOR and RIFIN proteins from each other. During this process, we detected a small subset of sequences that are related to STEVOR proteins but do not have a high enough HMM score. These sequences will be tagged as STEVOR-like until the availability of more sequences will allow for better categorization.

The RIFIN sequences were subsequently sub-divided according to the classification described in Joannin et al. [12]. A first approximation of the sub-grouping relies on the presence or absence of the characteristic 25 amino acid sequence that is present in A-RIFIN but absent in B-RIFIN proteins [6, 12, 42]. Sequences that were either truncated or contained large indels, and were therefore not identifiable as A- or B-RIFIN according to this criterion, were gathered into an "Unknown RIFIN" group. The remaining RIFIN sequences (A- and B-RIFIN) were aligned and sorted into groups according to the resulting phylogenetic tree. Sequences were grouped into A-RIFIN, B-RIFIN, B1-RIFIN, B2-RIFIN, modified from Joannin et al. [12] with the B3-RIFIN sub-group here renamed as B-RIFIN (see Results), as well as an "Ambiguous" subgroup. The Ambiguous group gathered all sequences that were identifiable as A- or B-RIFIN sequences but were not resolved in the phylogenetic trees.

HMM training, testing and evaluation program

The HMMs for the five different groups of RIFIN and STEVOR sequences were built using HMMER2 [43]. Both global and local build options were tried and the local option (hmmbuild -f) was found to perform best with this type of data, containing full length as well as truncated and fragmented sequences.

For the purpose of HMM training, all alignments were created using Mafft-linsi [37]. A number of protein sequences were either truncated compared to typical sequences or contained indels. We decided that sequences should be complete and typical from the PEXEL motif (Plasmodium Export Element motif) [16, 17] to the C-terminal transmembrane domain; the alignments were constrained to start at this motif as well. The five training sets were made non-redundant using FASTA [44], so that the final sets contained no sequence with more than 80% identity to any other. Outliers were removed using a jack-knifing test. During this test each sequence in the training set was excluded, one at a time, an alignment created and a new HMM built. The removed sequence was scored against this new HMM, together with every sequence from the other training sets (i.e. a negative dataset). If the excluded sequence did not score higher than every sequence from the negative dataset it was removed from the final training set. The final training sets consisted of 259 A-RIFIN, 96 B-RIFIN, 26 B1-RIFIN, 9 B2-RIFIN and 51 STEVOR sequences.
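The jack-knifing test described above can be sketched generically. In the sketch, `build_model` and `score` are hypothetical stand-ins for the real steps (alignment, HMM building and HMM scoring), and the k-mer model is a toy used only to make the example runnable. The sketch also assumes a single pass in which every sequence is tested against a model built from all the others in the original set; the text does not say whether removal was iterative.

```python
from typing import Callable, Sequence, TypeVar

Model = TypeVar("Model")


def jackknife_filter(
    training: Sequence[str],
    negatives: Sequence[str],
    build_model: Callable[[list[str]], Model],
    score: Callable[[Model, str], float],
) -> list[str]:
    """Keep sequences that outscore every negative when held out."""
    training = list(training)
    kept = []
    for i, held_out in enumerate(training):
        # Rebuild the model without the held-out sequence.
        model = build_model(training[:i] + training[i + 1:])
        best_negative = max(score(model, seq) for seq in negatives)
        if score(model, held_out) > best_negative:
            kept.append(held_out)
    return kept


# Toy stand-ins: the "model" is the set of 2-mers seen in the training data.
def kmers(seq: str, k: int = 2) -> set[str]:
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_toy_model(seqs: list[str]) -> set[str]:
    return set().union(*(kmers(s) for s in seqs))


def toy_score(model: set[str], seq: str) -> float:
    return float(len(kmers(seq) & model))


toy_kept = jackknife_filter(
    ["ACDEF", "ACDEG", "WWWWW"],  # "WWWWW" plays the outlier
    ["WWWYY", "KLMNP"],           # sequences from the other training sets
    build_toy_model,
    toy_score,
)
```

In the toy run the outlier is dropped because, once it is held out, nothing in the remaining training set resembles it and it no longer scores above the negatives.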

A program, written in C, was created to manage the results obtained when the five HMMs were used in database searches. Figure 4 displays the decision process and the cut-offs. The cut-offs were set using the manually curated dataset as 'truth', including the odd sequences (with respect to the amino acid composition or sequence length) removed from the final training set.

Control data sets

In order to test our HMMs for false positives, we retrieved the proteomes of several other Plasmodium species. All Plasmodium-specific datasets were downloaded from PlasmoDB version 7.1 [32]. We downloaded the protein coding sequences of Plasmodium falciparum 3D7 (5418, version: 2010-06-01) [6], Plasmodium vivax Sal-1 (5393, version: 2007-06-13) [45], Plasmodium chabaudi chabaudi (5123, version: 2010-06-01), P. knowlesi strain H (5194, version: 2010-06-01) [46], P. yoelii yoelii strain 17XNL (7724, version: 2005-09-01) [47] and P. berghei strain ANKA (4857, version: 2010-06-01) [48]. Additionally, we used the original Broad Institute annotated protein sequences from the DD2 (5380, version: 2007-04-13) and HB3 (5623, version: 2007-03-16) genomes [22].

References

1. Deitsch KW, Lukehart SA, Stringer JR: Common strategies for antigenic variation by bacterial, fungal and protozoan pathogens. Nat Rev Microbiol. 2009, 7 (7): 493-503. 10.1038/nrmicro2145.
2. Deitsch KW, Hviid L: Variant surface antigens, virulence genes and the pathogenesis of malaria. Trends Parasitol. 2004, 20 (12): 562-566. 10.1016/j.pt.2004.09.002.
3. Rasti N, Wahlgren M, Chen Q: Molecular aspects of malaria pathogenesis. FEMS Immunol Med Microbiol. 2004, 41 (1): 9-26. 10.1016/j.femsim.2004.01.010.
4. Niang M, Yan Yam X, Preiser PR: The Plasmodium falciparum STEVOR Multigene Family Mediates Antigenic Variation of the Infected Erythrocyte. PLoS Pathog. 2009, 5 (2): e1000307. 10.1371/journal.ppat.1000307.
5. Jeffares DC, Pain A, Berry A, Cox AV, Stalker J, Ingle CE, Thomas A, Quail MA, Siebenthall K, Uhlemann A-C, et al: Genome variation and evolution of the malaria parasite Plasmodium falciparum. Nat Genet. 2007, 39 (1): 120-125. 10.1038/ng1931.
6. Gardner MJ, Hall N, Fung E, White O, Berriman M, Hyman RW, Carlton JM, Pain A, Nelson KE, Bowman S, et al: Genome sequence of the human malaria parasite Plasmodium falciparum. Nature. 2002, 419 (6906): 498-511. 10.1038/nature01097.
7. Fernandez V, Hommel M, Chen Q, Hagblom P, Wahlgren M: Small, clonally variant antigens expressed on the surface of the Plasmodium falciparum-infected erythrocyte are encoded by the rif gene family and are the target of human immune responses. J Exp Med. 1999, 190 (10): 1393-1404. 10.1084/jem.190.10.1393.
8. Kyes SA, Rowe JA, Kriek N, Newbold CI: Rifins: a second family of clonally variant proteins expressed on the surface of red cells infected with Plasmodium falciparum. Proc Natl Acad Sci USA. 1999, 96 (16): 9333-9338. 10.1073/pnas.96.16.9333.
9. Helmby H, Cavelier L, Pettersson U, Wahlgren M: Rosetting Plasmodium falciparum-infected erythrocytes express unique strain-specific antigens on their surface. Infect Immun. 1993, 61 (1): 284-288.
10. Albrecht L, Merino EF, Hoffmann EHE, Ferreira MU, de Mattos Ferreira RG, Osakabe AL, Dalla Martha RC, Ramharter M, Durham AM, Ferreira JE, et al: Extense variant gene family repertoire overlap in Western Amazon Plasmodium falciparum isolates. Mol Biochem Parasitol. 2006, 150 (2): 157-165. 10.1016/j.molbiopara.2006.07.007.
11. Blythe JE, Yam XY, Kuss C, Bozdech Z, Holder AA, Marsh K, Langhorne J, Preiser PR: Plasmodium falciparum STEVOR proteins are highly expressed in patient isolates and located in the surface membranes of infected red blood cells and the apical tips of merozoites. Infect Immun. 2008, 76 (7): 3329-3336. 10.1128/IAI.01460-07.
12. Joannin N, Abhiman S, Sonnhammer E, Wahlgren M: Sub-grouping and sub-functionalization of the RIFIN multi-copy protein family. BMC Genomics. 2008, 9 (1): 19. 10.1186/1471-2164-9-19.
13. Bachmann A, Esser C, Petter M, Predehl S, von Kalckreuth V, Schmiedel S, Bruchhaus I, Tannich E: Absence of erythrocyte sequestration and lack of multicopy gene family expression in Plasmodium falciparum from a splenectomized malaria patient. PLoS ONE. 2009, 4 (10): e7459. 10.1371/journal.pone.0007459.
14. Petter M, Bonow I, Klinkert M: Diverse Expression Patterns of Subgroups of the rif Multigene Family during Plasmodium falciparum Gametocytogenesis. PLoS ONE. 2008, 3 (11): e3779. 10.1371/journal.pone.0003779.
15. Petter M, Haeggström M, Khattab A, Fernandez V, Klinkert M-Q, Wahlgren M: Variant proteins of the Plasmodium falciparum RIFIN family show distinct subcellular localization and developmental expression patterns. Mol Biochem Parasitol. 2007, 156 (1): 51-61. 10.1016/j.molbiopara.2007.07.011.
16. Marti M, Good RT, Rug M, Knuepfer E, Cowman AF: Targeting malaria virulence and remodeling proteins to the host erythrocyte. Science. 2004, 306 (5703): 1930-1933. 10.1126/science.1102452.
17. Hiller NL, Bhattacharjee S, van Ooij C, Liolios K, Harrison T, Lopez-Estraño C, Haldar K: A host-targeting signal in virulence proteins reveals a secretome in malarial infection. Science. 2004, 306 (5703): 1934-1937. 10.1126/science.1102737.
18. Finn RD, Mistry J, Tate J, Coggill P, Heger A, Pollington JE, Gavin OL, Gunasekaran P, Ceric G, Forslund K: The Pfam protein families database. Nucleic Acids Res. 2010, 38 (Database): D211-222. 10.1093/nar/gkp985.
19. Haft DH, Selengut JD, White O: The TIGRFAMs database of protein families. Nucleic Acids Res. 2003, 31 (1): 371-373. 10.1093/nar/gkg128.
20. Hayes C, Diez D, Joannin N, Honda W, Kanehisa M, Wahlgren M, Wheelock C, Goto S: varDB: a pathogen-specific sequence database of protein families involved in antigenic variation. Bioinformatics. 2008
21. Wang C, Magistrado P, Nielsen M, Theander T, Lavstsen T: Preferential transcription of conserved rif genes in two phenotypically distinct Plasmodium falciparum parasite lines. Int J Parasitol. 2008
22. The Broad Institute of Harvard and MIT - Plasmodium falciparum download page. [http://www.broadinstitute.org/annotation/genome/plasmodium_falciparum_spp/MultiHome.html]
23. The Wellcome Trust Sanger Institute - Protozoan genomes. [http://www.sanger.ac.uk/resources/downloads/protozoa/]
24. Datta RS, Meacham C, Samad B, Neyer C, Sjölander K: Berkeley PHOG: PhyloFacts orthology group prediction web server. Nucleic Acids Res. 2009, 37 (Web Server): W84-89. 10.1093/nar/gkp373.
25. Chen F, Mackey AJ, Stoeckert CJ, Roos DS: OrthoMCL-DB: querying a comprehensive multi-species collection of ortholog groups. Nucleic Acids Res. 2006, 34 (Database): D363-368. 10.1093/nar/gkj123.
26. Ostlund G, Schmitt T, Forslund K, Köstler T, Messina DN, Roopra S, Frings O, Sonnhammer ELL: InParanoid 7: new algorithms and tools for eukaryotic orthology analysis. Nucleic Acids Res. 2010, 38 (Database): D196-203. 10.1093/nar/gkp931.
27. Janssen CS, Barrett MP, Lawson D, Quail MA, Harris D, Bowman S, Phillips RS, Turner CM: Gene discovery in Plasmodium chabaudi by genome survey sequencing. Mol Biochem Parasitol. 2001, 113 (2): 251-260. 10.1016/S0166-6851(01)00224-9.
28. Cunningham D, Lawton J, Jarra W, Preiser P, Langhorne J: The pir multigene family of Plasmodium: antigenic variation and beyond. Mol Biochem Parasitol. 2010, 170 (2): 65-73. 10.1016/j.molbiopara.2009.12.010.
29. Freitas-Junior LH, Bottius E, Pirrit LA, Deitsch KW, Scheidig C, Guinet F, Nehrbass U, Wellems TE, Scherf A: Frequent ectopic recombination of virulence factor genes in telomeric chromosome clusters of P. falciparum. Nature. 2000, 407 (6807): 1018-1022. 10.1038/35039531.
30. Group TPW: Plasmodium White Paper V8.
31. Volkman SK, Sabeti PC, DeCaprio D, Neafsey DE, Schaffner SF, Milner DA, Daily JP, Sarr O, Ndiaye D, Ndir O, et al: A genome-wide map of diversity in Plasmodium falciparum. Nat Genet. 2007, 39 (1): 113-119. 10.1038/ng1930.
32. Aurrecoechea C, Brestelli J, Brunk B, Dommer J, Fischer S, Gajria B, Gao X, Gingle A, Grant G, Harb O, et al: PlasmoDB: a functional genomic database for malaria parasites. Nucleic Acids Res. 2009, 37 (suppl 1): D539-D543. 10.1093/nar/gkn814.
33. Consortium U: The Universal Protein Resource (UniProt) in 2010. Nucleic Acids Res. 2010, 38 (Database): D142-148. 10.1093/nar/gkp846.
34. McGinnis S, Madden TL: BLAST: at the core of a powerful and diverse set of sequence analysis tools. Nucleic Acids Res. 2004, 32 (Web Server): W20-25. 10.1093/nar/gkh435.
35. Carver TJ, Rutherford KM, Berriman M, Rajandream M-A, Barrell BG, Parkhill J: ACT: the Artemis Comparison Tool. Bioinformatics. 2005, 21 (16): 3422-3423. 10.1093/bioinformatics/bti553.
36. Rutherford K, Parkhill J, Crook J, Horsnell T, Rice P, Rajandream MA, Barrell BG: Artemis: sequence visualization and annotation. Bioinformatics. 2000, 16 (10): 944-945. 10.1093/bioinformatics/16.10.944.
37. Katoh K, Toh H: Recent developments in the MAFFT multiple sequence alignment program. Brief Bioinformatics. 2008, 9 (4): 286-298. 10.1093/bib/bbn013.
38. Lassmann T, Frings O, Sonnhammer E: Kalign2: high-performance multiple alignment of protein and nucleotide sequences allowing external features. Nucleic Acids Res. 2009, 37 (3): 858-865. 10.1093/nar/gkn1006.
39. Clamp M, Cuff J, Searle SM, Barton GJ: The Jalview Java alignment editor. Bioinformatics. 2004, 20 (3): 426-427. 10.1093/bioinformatics/btg430.
40. Hall T: BioEdit: a user-friendly biological sequence alignment editor and analysis program for Windows 95/98/NT. Nucleic acids symposium series. 1999, 41: 95-98.
41. Tamura K, Dudley J, Nei M, Kumar S: MEGA4: Molecular Evolutionary Genetics Analysis (MEGA) software version 4.0. Mol Biol Evol. 2007, 24 (8): 1596-1599. 10.1093/molbev/msm092.
42. Bultrini E, Brick K, Mukherjee S, Zhang Y, Silvestrini F, Alano P, Pizzi E: Revisiting the Plasmodium falciparum RIFIN family: from comparative genomics to 3D-model prediction. BMC Genomics. 2009, 10: 445. 10.1186/1471-2164-10-445.
43. Eddy SR: Profile hidden Markov models. Bioinformatics. 1998, 14 (9): 755-763. 10.1093/bioinformatics/14.9.755.
44. Pearson WR, Lipman DJ: Improved tools for biological sequence comparison. Proc Natl Acad Sci USA. 1988, 85 (8): 2444-2448. 10.1073/pnas.85.8.2444.
45. Carlton JM, Adams JH, Silva JC, Bidwell SL, Lorenzi H, Caler E, Crabtree J, Angiuoli SV, Merino EF, Amedeo P, et al: Comparative genomics of the neglected human malaria parasite Plasmodium vivax. Nature. 2008, 455 (7214): 757-763. 10.1038/nature07327.
46. Pain A, Böhme U, Berry AE, Mungall K, Finn RD, Jackson AP, Mourier T, Mistry J, Pasini EM, Aslett MA, et al: The genome of the simian and human malaria parasite Plasmodium knowlesi. Nature. 2008, 455 (7214): 799-803. 10.1038/nature07306.
47. Carlton J, Silva J, Hall N: The genome of model malaria parasites, and comparative genomics. Current issues in molecular biology. 2005, 7 (1): 23-37.
48. Hall N, Karras M, Raine JD, Carlton JM, Kooij TWA, Berriman M, Florens L, Janssen CS, Pain A, Christophides GK, et al: A comprehensive survey of the Plasmodium life cycle by genomic, transcriptomic, and proteomic analyses. Science. 2005, 307 (5706): 82-86. 10.1126/science.1103717.

Acknowledgements

This study was supported by PREGVAX (FP7-Health-2007-A-201588), the Kungl. Vetenskapsakademin, T. och R. Söderbergs Professur, the Karolinska Institutet (Distinguished Professor Award), Linköping University and the Swedish Research Council. Some of the sequence data used in this study were generated by the Wellcome Trust Sanger Institute and the Broad Institute of Harvard and MIT (see text for details). Finally, we would like to thank the three anonymous reviewers, who helped us improve the clarity of this article.

Author information

Authors and Affiliations

Department of Microbiology, Cell and Tumor biology (MTC), Karolinska Institutet, SE-17177, Stockholm, Sweden
Nicolas Joannin & Mats Wahlgren
Department of Cell and Molecular Biology (CMB), Karolinska Institutet, SE-17177, Stockholm, Sweden
Yvonne Kallberg & Bengt Persson
IFM Bioinformatics and Swedish e-Science Research Centre (SeRC), Linköping University, SE-58183, Linköping, Sweden
Yvonne Kallberg & Bengt Persson

Authors

Nicolas Joannin
Yvonne Kallberg
Mats Wahlgren
Bengt Persson

Corresponding authors

Correspondence to Nicolas Joannin or Mats Wahlgren.

Additional information

Authors' contributions

NJ participated in the conception and design of the study; he performed the data collection and curation, the phylogenetic analysis and analyzed all results; he drafted and revised the manuscript. YK participated in the design of the study; she trained the HMMs and made the evaluation program as well as analyzed all results; she revised the manuscript. MW revised the manuscript. BP participated in the design of the study and revision of the manuscript. All the authors have read and approved of the final manuscript.

Nicolas Joannin and Yvonne Kallberg contributed equally to this work.

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Joannin, N., Kallberg, Y., Wahlgren, M. et al. RSpred, a set of Hidden Markov Models to detect and classify the RIFIN and STEVOR proteins of Plasmodium falciparum. BMC Genomics 12, 119 (2011). https://doi.org/10.1186/1471-2164-12-119

Received: 17 October 2010
Accepted: 18 February 2011
Published: 18 February 2011
DOI: https://doi.org/10.1186/1471-2164-12-119
