We propose a phenotype-driven analysis of encrypted exome data to facilitate the widespread implementation of exome sequencing as a clinical genetic screening test.
Twenty test-patients with varied syndromes were selected from the literature. For each patient, the mutation, phenotypic data, and genetic diagnosis were available. Next, control exome-files, each modified to include one of these twenty mutations, were assigned to the corresponding test-patients. These data were used by a geneticist blinded to the diagnoses to test the efficiency of our software, PhenoVar. The score assigned by PhenoVar to any genetic diagnosis listed in OMIM (Online Mendelian Inheritance in Man) took into consideration both the patient’s phenotype and all variations present in the corresponding exome. The physician did not have access to the individual mutations. PhenoVar filtered the search using a cut-off phenotypic match threshold to prevent undesired discovery of incidental findings and ranked the OMIM entries according to diagnostic score.
When assigning the same weight to all variants in the exome, PhenoVar predicted the correct diagnosis in 10/20 patients, while in 15/20 the correct diagnosis was among the 4 highest ranked diagnoses. When assigning a higher weight to variants known, or bioinformatically predicted, to cause disease, PhenoVar’s yield increased to 14/20 (18/20 in top 4). No incidental findings were identified using our cut-off phenotypic threshold.
The phenotype-driven approach described could render widespread use of ES more practical, ethical and clinically useful. The implications about novel disease identification, advancement of complex diseases and personalized medicine are discussed.
Keywords:Genome; Exome; Encrypted; Sequencing; Clinic; PhenoVar; I-MPOS; I-MPOSE
Exome Sequencing (ES) allows simultaneous screening for variants in the coding portion of all genes present in a patient’s genome. Over the last few years, ES has aided in the elucidation of the genetic basis of multiple genetic syndromes (for a review of some examples see Ku et al. ). The relatively low cost of ES and its’ high diagnostic yield have stimulated discussion about its promising role in clinic [2-4]. However, despite the unprecedented success of ES as a research tool, its utilization as a genetic screening test in clinic remains largely prohibitive due to challenges associated with consent, incidental findings, and the management of the massive amounts of data generated (see “Challenges of integrating ES in clinic” subsection). Furthermore, in many families there is a single affected individual available, which adds further complexity to the analysis of the results , unless the genetic variant responsible for the disease is not present in the parents.
Challenges of integrating ES in clinic
Adapted from Trakadis , published with the permission of author.
1. Meaningful patient informed-consent may not be feasible
• Possibility of incidental findings,
• Multiple findings of uncertain clinical significance,
• Multiple issues to discuss leading to prohibitive requirements in time & resources
2. Potential emotional distress over disease risk even among healthy individuals
3. Genomic information is a powerful personal identifier
• Raising concerns about privacy, confidentiality, genetic discrimination
4. Very large amounts of genetic information generated
• Limited number of clinical geneticists for data interpretation and clinical care
• Substantial time and cost for data analysis and genetic counselling
• Dynamic/evolving nature of the interpretation as new knowledge is gained
• Duty to re-contact patients as knowledge changes over time
To address these challenges, variant prioritization using bioinformatic tools (e.g. Berg et al. ; Berg et al. ) and practice guidelines/recommendations (e.g. Christenhusz et al. ; ACMG Policy statement on Genomic Sequencing, May 2012 [10,11]) have been suggested. These approaches, however, do not adequately address all the challenges summarized in the “Challenges of integrating ES in clinic” subsection (e.g. incidental findings, findings of uncertain clinical significance, risk for genetic discrimination, requirements in time & resources). Moreover, they are limited by the efficiency of the bioinformatic tools to accurately predict the clinical impact of different variants [12,13]. At present, different tools often lead to opposite predictions about the functional impact of the same variant . Nonetheless, the ability of ES to facilitate diagnosis and inform therapy will likely lead to its premature introduction in clinic using an approach similar to the one followed for chromosomal microarray [15-20].
In the light of rapid developments in genomic technologies, medical genetics is shifting from the present “phenotype-first” medical model to a “data-first” model, which leads to multiple complexities. An alternative phenotype-driven approach was recently put forward . This approach, namely Individualized Mutation-weighed Phenotype On-line Search (I-MPOS), aims to address the above mentioned issues and facilitate widespread clinical utilization of ES. We hereby present PhenoVar, a software consistent with this phenotype-driven approach, and provide preliminary evidence of its potential benefits.
PhenoVar and phenotype-driven analysis of exome data
Figure 1 summarizes the overall workflow of PhenoVar. In brief, PhenoVar automatically prioritizes diagnoses for validation based on both the phenotypic and genomic information of a proband. It calculates a patient-specific diagnostic score for each OMIM entry (Online Mendelian Inheritance in Man; http://www.ncbi.nlm.nih.gov/omim webcite) with known molecular basis. The diagnostic score assigned to a given syndrome is the sum of its phenotypic and genotypic weight.
Figure 1. Workflow of PhenoVar. PhenoVar automatically prioritizes diagnoses for validation based on both the phenotypic and genomic information of a proband. It calculates a patient-specific diagnostic score for each OMIM entry with known molecular basis. The diagnostic score assigned to a given syndrome is the sum of its phenotypic and genotypic weight. For each syndrome listed in the HPO database the phenotypic weight is determined by calculating the similarity between the proband and the different patients available in a local database (Phenobase). Phenobase includes simulated patients using HPO and real patients (here denoted as “local patients”). The genotypic weight for each syndrome corresponds to the (predicted) pathogenicity of any variants present in the proband’s exome specifically in the gene(s) causing the respective syndrome. When no variation is found in these genes, the genotypic weight for that syndrome is automatically set to null value. Otherwise, the variants are sorted into known disease-causing variants (DC var) versus possibly pathogenic variants (other var) and assigned a different score. The genotypic weight and phenotypic weight described above are summed to obtain the diagnostic score for each syndrome. The different syndromes are then ranked according to their diagnostic score.
Calculation of phenotypic weight
For each syndrome listed in the Human Phenotype Ontology (HPO; http://www.human-phenotype-ontology.org webcite) the phenotypic weight is determined by calculating the similarity between the proband and the different (simulated) patients available in a local database, as described below.
In order to compare the phenotype of a patient with an unknown diagnosis to phenotypes corresponding to known genetic syndromes, we simulated a large number of sample patients, hereafter referred to as simulated patients, using HPO and OMIM databases. For each syndrome listed in the HPO database, twenty to twenty-five simulated patients were randomly generated using the phenotypic traits corresponding to that diagnosis and the information was stored in a local database (Phenobase). On average, a total of 5 traits corresponding to the respective disease were assigned to each simulated patient. The probability of each trait to be present in the phenotype of a given simulated patient was chosen to be proportional to the prevalence of that trait in the respective disease, as available in the HPO database. Only simulated patients corresponding to a syndrome with a known molecular basis, according to the OMIM database, are considered in the subsequent steps of the analysis.
Each trait entered by the user for a given proband is analyzed by PhenoVar using the HPO ontology, a directed acyclic graph representing the relationships between the traits. For each node (trait) in the graph, previous (parent) nodes are more generic traits and forward (child) nodes are more accurate descriptions of the trait. This allows more flexibility in the terms used by the clinician, while still enabling recognition of similarity between the proband and each (simulated) patient in Phenobase. For a given syndrome, accurate and general hits are limited to one child or parent node, in relation with traits listed in HPO. The PhenoVar algorithm first compares the proband to all patients contained in Phenobase (currently including mostly simulated but also some real patients with known diagnoses) and calculates a phenotypic similarity weight for each patient in PhenoVar relative to the proband. For every syndrome (represented by different patients in Phenobase) the phenotypic similarity weights are summed and then averaged to obtain the final phenotypic weight for that syndrome. A higher weight correlates with a higher likelihood that the proband is affected by this syndrome, based on phenotype only.
The following formula summarizes the details of this process:
Where nbPatient is the number of simulated patients with the same syndrome; nbHiti is the number of traits shared between the ith simulated patient and the patient; nbAccurateHiti is the number of traits in the patient that correspond to more accurate versions of a trait in the ith simulated patient. nbGeneralHiti is the number of traits in the patient that correspond to more general versions of a trait in the ith simulated patient; nbMissi is the number of patient traits not matched with the ith simulated patient; KH, KA, KG and KP are constant parameters which were determined based on an independent cohort of test patients (KH, KA and KG =2, KP = 1).
Calculation of genotypic weight
The genotypic weight for each syndrome corresponds to the (predicted) pathogenicity of any variants present in the proband’s exome specifically in the gene(s) causing the respective syndrome. Hence, the genotypic weight is generated in parallel for each syndrome using the proband’s exome VCF (Variant Call Format) data file. When no variation is found in these genes, the genotypic weight for that syndrome is automatically set to null value. Predetermined values (weight) are assigned to each variation in the VCF file, according to SNPEff annotation or known disease-causing status (ClinVar, HGMD).
Two different models are used in the present paper. In the first model all filtered variations are assigned by PhenoVar the same weight (arbitrary value of 1000), irrespective of zygosity. The final genotypic weight for each syndrome corresponds to the greatest variation weight across all genes causing the disorder. When no variation is found in these genes, the weight is automatically set to null value.
The only difference in the second model is that the variants are filtered and sorted in two groups: known disease-causing variants (group 1) versus possibly pathogenic variants (group 2). The disease-causing variants listed in ClinVar and HGMD, as described above, are classified in group 1, while the genetic variations with moderate or high functional impact based on SNPEff predictions (i.e. frameshift, missense, non-sense and splice sites) are classified in group 2. In this model, variants in groups 1 and 2 are arbitrary assigned a weight of 1000 and 100, respectively, and the remaining variants a null value.
Phenotype and genotype score integration, ranking of possible diagnoses and filter for incidental findings
The genotypic weight and phenotypic weight described above are summed to obtain the diagnostic score for each syndrome. The different syndromes are then ranked according to their diagnostic score. The syndrome with the highest diagnostic score represents PhenoVar’s prediction of the most likely diagnosis. An option to filter the ranked syndromes based on the suspected mode of inheritance is also available. Finally, using an empirically determined phenotypic threshold, disorders unrelated to the proband’s phenotype were filtered out.
To determine this phenotypic threshold (cut-off: 0.9705), simulated patients whose exome VCF files were modified to include a pathogenic variant corresponding to their diagnoses but also an incidental finding were used.
Test-patients selection and preparation of variants files
Ten test-patients with different polymalformative genetic syndromes were randomly selected from previously published case reports (patients 1a to 10a in Table 1). For each patient selected, the mutation, phenotypic data, and genetic diagnosis were available in the published manuscript. Next, the VCF files of 10 control exomes were obtained from the National Institute of Environmental Health Sciences (NIEHS) Environmental Genome Project (EGP) (http://evs.gs.washington.edu webcite). These control exome VCF files were modified so that each of them subsequently included the disease causing variant corresponding to one of the test patients (1a to 10a) in Table 1.
Table 1. Characteristics of the test patients selected from the literature
Next, ten genetic syndromes whose phenotypic features had a documented prevalence in HPO were identified. Each of these syndromes was then searched in OMIM for previously published case reports and one patient representing each syndrome was selected from the literature (1b to 10b in Table 1). The ten original (unmodified) control exome-files were now modified so that each of them subsequently included the disease causing variant corresponding to one of the test-patients 1b to 10b in Table 1.
Each test patient was thus assigned a specific exome VCF file modified to include his/her disease causing variant. The resulting files were first annotated using SNPEff (version 2.0.5) for variation functional impact, and then for known disease-causing variants as classified in ClinVar database (clinical significance = “4”, probable pathogenic or “5”, pathogenic) and in the professional version of Human Gene Mutation Database (HGMD) (Disease mutation –“DM” variants). The files were further filtered to exclude non-disease-causing intronic or synonymous variants, as well as variations with >5% frequency listed in dbSNP (build 135).
A medical geneticist, “blinded” to the diagnoses of the test-patients, was provided with the clinical description of the twenty patients. Based on our previous experience PhenoVar performs best when three or more traits are used. Using terms in HPO, the geneticist selected for each case three traits that he perceived as significant and more specific. Subsequently, for each patient, he introduced the selected terms along with the respective modified exome VCF file in the web-based interface of PhenoVar. The results obtained were analyzed for the position of the correct diagnosis, by the members of the team aware of the diagnoses, to test the efficiency of PhenoVar.
Next, the medical geneticist, while still blinded to the diagnoses, was asked to select different keywords and ensure that 2–3 of the keywords selected were present in Phenobase and the analysis was repeated. Of note, the number of matching traits from the traits entered is evident after each analysis with the software.
For each test-patient, we reviewed genes known to be responsible for mendelian disorders which harboured previously reported diseases-causing variants (ClinVar, HGMD) or variants predicted to be likely pathogenic (non-sense, frameshift, consensus splice site). To report incidental findings, we focused mostly on the ACMG minimal list .
Moreover, the modified control exome VCF files mentioned above, which were further modified by introducing a known BRCA1 pathogenic variant (incidental finding) in each case, were analyzed. Entering this VCF file in PhenoVar along with corresponding set of 3 phenotypic traits summarized for each case in Table 1 allowed for testing the phenotype filter for incidental findings.
Analysis of real patients with unknown genetic disorders
To illustrate that PhenoVar can be used with data from real patients, four patients with multiples congenital anomalies, previously diagnosed via exome sequencing to have a known mendelian disorder, were used. With regards to exome sequencing, DNA libraries were prepared for each patient (TruSeq, Illumina), followed by target enrichment (Agilent SureSelect All Exon kit v4) and sequencing on a HiSeq2000 (Illumina) with 3 exomes per lane, giving an average coverage of ~100X. Analysis of sequencing data was done with the GATK v2 package as per the recommendations of the Broad Institute . SnpEff 2.0.5 (GRCh37/hg19) was used for variant annotation and ClinVar database for identification of known pathogenic variants. The variants identified were filtered using a cut off of 1% for minor allele frequency (dbSNP database built 137 and local control exomes). From the remaining variants, the ones predicted to alter amino acid sequences or consensus splice sites junctions, which were determined to not be tolerated by Polyphen2 or SIFT software, were manually reviewed under the supervision of a medical geneticist (SL). After confirming the diagnosis, the data analysis was repeated for these patients using PhenoVar, as illustrated in Table 2.
Table 2. Four real patients analyzed by PhenoVar
Table 1 summarizes the information of the patients selected from the literature, including only the phenotypic traits selected by the medical geneticist blinded to their diagnoses. The genetic syndromes represented in Table 1 include both autosomal recessive and dominant conditions. Table 3 lists the number of variants identified in the exome corresponding to each test patient and the ranking of the correct diagnosis by PhenoVar. On average 3942 variants were obtained per filtered exome, of which 53 and 23 were included as disease-causing in HGMD and ClinVar, respectively. When PhenoVar ranked the possible diagnoses based only on the phenotypic traits entered, its efficiency appeared to be similar to that of Phenomizer or OMIM search engines (data not shown). When ranking the possible diagnoses solely based on phenotypic weight, PhenoVar predicted the correct diagnoses in three patients but did not rank the correct diagnosis as part of the top 20 possible diagnoses in 9 out of 20 patients (Table 3, Column 3).
Table 3. Diagnosis prediction for test-patients using PhenoVar
By including both the patient’s phenotype and exome data, PhenoVar’s efficiency improved significantly. When assigning the same weight to all variants in the exome (Table 3, Column 4), PhenoVar predicted the correct diagnosis in 10 out of 20 patients, while in 15 out of 20 the correct diagnosis was among the 4 highest ranked possible diagnoses. The correct diagnosis was not ranked in the top 20 diagnoses in only two patients.
When using PhenoVar’s option to automatically assign a higher weight to variants known, or bioinformatically predicted, to cause disease, PhenoVar’s diagnostic yield increased to 14/20, with the correct diagnosis ranking in the top 4 highest ranked diagnoses in 18/20 patients. More specifically, in the second set of patients (1b-10b) the diagnosis was successfully predicted by PhenoVar in 8/10 patients and ranked within the top 4 diagnoses in the other two patients (Table 3, Column 5). Optimization of the selected keywords to ensure that two or three traits from the ones used were present in Phenobase further improved the diagnostic yield: PhenoVar successfully identified the correct diagnosis in 17 out of 20 patients, while 20/20 were ranked in the top 3 (data not shown).
With regards to incidental findings, two findings unrelated to the presenting complaint were identified when manually analyzing the VCF files of the test-patients. One of the variants was previously reported to cause Lynch syndrome (MLH3) and the other to cause Renal cell carcinoma/MODY type 3 (HNF1A). These incidental findings were not identified when using PhenoVar’s cut-off phenotypic threshold. Furthermore, when repeating the Phenovar analysis for all patients using the VCF files which had been further modified to include a variant known to cause the BRCA1 cancer syndrome, this incidental finding was not identified using our cut-off phenotypic threshold.
Table 2 demonstrates that PhenoVar can also be used with real patients’ data. All four patients’ diagnoses (previously identified by standard bioinformatics analysis of exome sequencing results) were ranked highly by PhenoVar. Moreover, the two incidental findings known to be present in the real patients (specifically, a disease-causing variant for pigmented nodular adrenocortical disease, OMIM 610475, and another one in BRCA2, found in patients B and D, respectively) were not found using our cut-off phenotypic threshold.
With conventional approaches multiple genetic tests are typically required before a molecular diagnosis is reached. This leads to increased cost and time delay. Widespread use of clinical ES could accelerate genetic diagnosis to an unprecedented scale at low cost.
PhenoVar prioritizes diagnoses (mendelian disorders whose molecular bases are known) for validation based on both the phenotypic and genomic information of a proband. It was shown to perform very well with a limited number of phenotypic traits being used (three traits). Our data demonstrate that, by taking into consideration both the patient’s phenotype and encrypted exome data, the correct diagnosis for patients with different clinical presentations was prioritized more efficiently than relying solely on the patient’s phenotype (as seen when comparing columns 3 and 5 in Table 3). This was true when using different variants databases (ClinVar or HGMD, data not shown) or, to a lesser but still significant degree, when assigning the same weight to all variants present in an exome. Selecting PhenoVar’s option to assign the same weight for all variations present in a patient’s exome, rather than prioritizing the known pathogenic variants, has an important advantage: it minimizes the impact that erroneous variant classification, as benign or pathogenic, has on the efficiency of the software.
Interestingly, PhenoVar performed equally well for the cases where the phenotypic keywords selected were not specific for the correct diagnosis. For instance, in the case of patient 3b, although the phenotype-based ranking was very poor (Table 3, Column 3), the final ranking by PhenoVar was not compromised (Table 3, Columns 4 and 5). This suggests that PhenoVar will be particularly helpful in the unfortunate occasions where an important clinical trait is missed during the genetic evaluation or when dealing with atypical presentations of known genetic syndromes. In turn, diagnosing more patients with atypical presentations of known genetic syndromes would potentially help to better define the spectrum of clinical characteristics of these conditions. Additional studies using real patient data, in collaboration with individuals who will be using this software, need to be prospectively performed to achieve this goal but also to further validate this analysis tool. Table 2 summarizes four examples of real patients whose diagnosis was facilitated using PhenoVar. This table illustrates that PhenoVar can be used with real patient data. However, a follow-up study using a large cohort of patients will be needed to evaluate how effective PhenoVar is in prioritizing the correct diagnosis in a real clinical setting.
One of PhenoVar’s major advantages it that it optimizes prioritization of possible diagnoses taking into consideration the patient’s exome data without requiring an increase in the bioinformatics human resources available in the clinical setting. This could potentially allow for a widespread use of ES in clinical practice, as a screening test for known mendelian conditions. Moreover, through the optional use of a cut-off phenotypic weight threshold, the clinician can focus the analysis on the genetic causes which can potentially explain the specific phenotype/medical-issue at hand, thus preventing the undesired discovery of incidental findings. As a result, the approach described simplifies pre-test counselling and informed consent for exome sequencing as a clinical screening tool. It does not contradict but rather complements the bining approach previously put forward [7,10]. Widespread usage of ES in clinic will help evaluate the significance of different variants, including their penetrance/expressivity. It will thus aid in the identification of appropriate genes to target for screening [42,43], as well as, help improve the interpretation of incidental findings of interest to the patient.
As illustrated by cases 5a, 6a, 8a (Table 3) correct gene annotation (e.g. correct exon/intro borders) is crucial for the software to run properly, albeit, this limitation is not specific to PhenoVar, as it affects similarly the currently standard ES data analysis. Moreover, to optimize the efficiency of the software, the HPO database and Phenobase need to be properly curated. For instance, including in HPO the prevalence of the phenotypic features for different genetic syndromes has a significant impact on the efficiency of the software, as illustrated by PhenoVar’s efficiency in patients 1b-10b (Table 3). Including more real patients in Phenobase will, in time, overcome this problem. Also, an option to explore the presence or absence of a given variant in affected/unaffected family members (based on simultaneous comparison of their encrypted genomes) could allow for adjusting the weight assigned to different variants and further improve PhenoVar’s efficiency. Finally, the VCF files in our study were filtered to include variations with >5% frequency listed in dbSNP (build 135). The filter was intentionally set higher than usual to illustrate the efficiency of PhenoVar. However, since most genetic conditions are rare (low carrier frequency), one could opt to use a filter with a lower threshold (e.g. 1%). This would filter out more benign variants and thus improve the efficiency of PhenoVar.
The proposed approach follows the existing “phenotype-first” medical model and allows for better prioritization of the genes to be tested in a clinical lab. It is particularly useful in phenotypes caused by multiple different genes (e.g. evaluation of global developmental delay). Recent studies have provided evidence for the high diagnostic yield of exome sequencing [44-46]. Using ES as a screening test can increase the diagnostic yield of a clinical evaluation in a cost-effective fashion and decrease the time to diagnosis [2-4,47,48]. If used properly, PhenoVar can help address many of the challenges associated with integrating genomic technologies into clinical practice (see “Challenges of integrating ES in clinic” subsection). It remains the responsibility of the physician to seek confirmatory clinical diagnostic test targeting the suspected diagnosis and, for the unresolved cases, to clinically prioritize testing using the whole spectrum of clinical genetic testing modalities available.
In the future, Phenobase can be expanded to incorporate special databases containing phenotypic and genomic data of real patients [49,50], thus enabling encrypted on-line data sharing from consenting individuals. By including both patients with known and not-yet identified diagnoses in such interconnected encrypted databases, the method described above may have numerous benefits. First, when a large cohort of real patient data is made available in Phenobase, PhenoVar’s diagnostic yield for known syndromes would likely improve. This would be particularly true when atypical features are present. Secondly, continuously upgrading the database could allow the software to evolve and contribute in the identification of new syndromes as illustrated in Figure 2. In brief, a patient evaluated in clinic (target patient) could be automatically matched by the software with patients in the database based on phenotypic similarity. The software could then perform an automatic comparison of the exomes of the matched patients and that of the target patient. When the target patient’s exome file contains a variant shared by the phenotypically matched patients in the database whose diagnosis is known, his/her diagnosis could be inferred. In cases where the diagnosis of the phenotypically matched patients in the database is not yet known, an identical variant shared by these patients and the target patient could point to a candidate gene explaining their shared phenotype, a potentially newly described syndrome. This can be achieved while maintaining the genomic information of all participating patients in the database encrypted at all times.
Figure 2. Future directions about Phenovar or similar software using the I-MPOS approach and data from real patients. A database containing phenotypic and encrypted genomic information of real patients with known or not-yet identified diagnoses can be made available (1). A patient with an unknown diagnosis presents in clinic. His encrypted ES data are obtained and his phenotype is assessed (2 α, β). The software automatically searches the “Encrypted Patients Database” using target patient’s assessed phenotype (3) thereby providing a first ranking of possible genetic conditions based on “phenotype weight” (4), (5). For all patients in the database meeting a specific phenotype-similarity threshold in relation to the proband, the software will compare the changes present in their genomes against the ones present in the genome of the patient seen in clinic (6). Matching the proband with the phenotypically similar subjects in the database based on similarity of their genetic changes (“mutation weight”) forms the basis of adjusting the first ranking to calculate the second ranking (6), (7). Subjects sharing adequate phenotypic characteristics who also share a genetic variant cluster together. As a result, a given match is indicative of the possibility that the target subject shares the same genetic condition with the matched other subject(s). After the second ranking, the information about the shared phenotype and genotype of the patients clustering together is accessible and can aid in reaching the diagnosis. It should be noted that “phenotype” (steps 2–5) is not limited to clinical traits but also refers to other levels of phenotype, such as a metabolomic profile. Also, the word “mutation” (steps 6, 7) can refer to variants in more than one genetic loci which are simultaneously present in all matched patients allowing one to explore the possibility of gene-gene interaction.
Such an approach would also have implications about the delineation of heterogeneous complex genetic diseases with high heritability (e.g. schizophrenia [51-54]) into more homogenous endophenotypes based on subgroups of patients present in the database. Finally, the software could ultimately evolve to use, besides clinical traits, other levels of phenotypic information (e.g. metabolomic, transcriptomic, miRNomic data) when matching the target patient with patients in the database. The metabolome and/or other levels of phenotype, which can be accurately quantified and followed over time, constitute the downstream effect of unknown gene-gene or gene-environment interactions. If the metabolomic profiles of the patients in the database and the target patient were made available, such a software could integrate this information during the phenotype matching step. This approach would thus indirectly factor in differences in the genomic backgrounds and environmental exposures (both potentially influencing the pathogenic role of a specific shared variant). Hence, the affected patients with the shared variant would be prioritized, accounting for variations in penetrance and/or expressivity of different genetic conditions. As a result, this approach could facilitate screening, even in the newborn period, for genetic diseases whose biochemical phenotype (e.g. metabolomic profile) precedes the clinical presentation. Similarly, in time, such a tool could potentially be used at regular intervals in a patient’s lifetime through routine visits to a general clinic and facilitate the transition towards a more personalized practice of medicine .
PhenoVar follows the existing “phenotype-first” medical model and facilitates the diagnostic approach by taking into consideration both the patient’s phenotype and all variations present in his exome, when ranking possible diagnoses (see Additional file 1). It is particularly useful in phenotypes caused by multiple different genes (e.g. evaluation of global developmental delay). Besides addressing many of the challenges associated with integrating genomic technologies into clinical practice, it can potentially provide in the future the infrastructure needed to further advance these tools safely and effectively.
Availability and requirements
Project name: PhenoVar project.
Operating system(s): Platform independent.
Programming language: python.
Other requirements: No other requirement for the web bases version.
License: GNU GPL.
Any restrictions to use by non-academics: licence needed.
ES: Exome Sequencing; VCF: Variant Call Format; HGMD: Human Gene Mutation Database; HPO: Human Phenotype Ontology; I-MPOS: Individualized Mutation-weighed Phenotype On-line Search; OMIM: Online Mendelian Inheritance in Man.
Patent application for the approach used is in process (by Y.J. Trakadis): to ensure that the I-MPOS paradigm can be properly implemented in a timely fashion. As long as the necessary academic collaborations for the programme are available, non-commercial applications of this method will remain free of charge.
Caroline Buote, Jean-François Therriault, Hugo Larochelle, Pierre- Étienne Jacques, and Sébastien Lévesque have no competing interests to declare.
Phenovar was designed by S. L., H. L. and Y. T. The experimental design of the study, interpretation of the results and the preparation of the first draft of the manuscript were performed by Y. T. and S. L. J-F T was responsible for programming; C. B. and P.E. J. for the bioinformatic analysis of the exome sequencing data of the real patients. C. B. also analysed the test-patients data. All authors read and approved the final manuscript.
This study was supported by a grant from the Fondation des Étoiles for paediatrics research, and institutional founding from Université de Sherbrooke and Centre de Recherche Clinique Étienne Le Bel. CB owns a master studentship award from the Faculty of Medicine and Health Sciences of Université de Sherbrooke. No writing assistance was provided. The authors wish to acknowledge the contribution of Sebastien Chenier regarding the clinical investigations performed for the four real patients prior to exome sequencing and the personnel of the McGill University and Genome Quebec Innovation Center (Quebec, Canada) for whole exome sequencing of the patients’ samples. Finally, we would like to thank Calcul Quebec, and in particular David Morais, Carol Gauthier, Maxime Lévesque and Alain Veilleux from the MP2 supercomputer of Université de Sherbrooke for providing the informatics infrastructure and their expert assistance.
Dixon-Salazar TJ, Silhavy JL, Udpa N, Schroth J, Bielas S, Schaffer AE, Olvera J, Bafna V, Zaki MS, Abdel-Salam GH, Mansour LA, Selim L, Abdel-Hadi S, Marzouki N, Ben-Omran T, Al-Saana NA, Sonmez FM, Celep F, Azam M, Hill KJ, Collazo A, Fenstermaker AG, Novarino G, Akizu N, Garimella KV, Sougnez C, Russ C, Gabriel SB, Gleeson JG: Exome sequencing can improve diagnosis and alter patient management.
Bettencourt C, López-Sendón J, García-Caldentey J, Rizzu P, Bakker I, Shomroni O, Quintáns B, Dávila J, Bevova M, Sobrido MJ, Heutink P, de Yébenes J: Exome sequencing is a useful diagnostic tool for complicated forms of hereditary spastic paraplegia.
Clin Genet 2013.
Feb 25. doi: 10.1111/cge.12133. [Epub ahead of print]. PMID: 23438842
Christenhusz GM, Devriendt K, Vermeesch J, Dierickx K: Why genomics shouldn’t get too personal: in favor of filters: re: invited comment by Holly K. Tabor et al. in American Journal of Medical Genetics part a volume 155.
Am J Med Genet A 2012, 158A(10):2641-2642.
author reply 2643–2644PubMed Abstract | Publisher Full Text
ACMG Policy statement on Genomic Sequencing. 2012.
Green RC, Berg JS, Grody WW, Kalia SS, Korf BR, Martin CL, McGuire AL, Nussbaum RL, O’Daniel JM, Ormond KE, Rehm HL, Watson MS, Williams MS, Biesecker LG: ACMG recommendations for reporting of incidental findings in clinical exome and genome sequencing.
Dorfman R, Nalpathamkalam T, Taylor C, Gonska T, Keenan K, Yuan XW, Corey M, Tsui LC, Zielenski J, Durie P: Do common in silico tools predict the clinical consequences of amino-acid substitutions in the CFTR gene?
Reiff M, Ross K, Mulchandani S, Propert KJ, Pyeritz RE, Spinner NB, Bernhardt BA: Physicians’ perspectives on the uncertainties and implications of chromosomal microarray testing of children and families.
Worthey EA, Mayer AN, Syverson GD, Helbling D, Bonacci BB, Decker B, Serpe JM, Dasu T, Tschannen MR, Veith RL, Basehore MJ, Broeckel U, Tomita-Mitchell A, Arca MJ, Casper JT, Margolis DA, Bick DP, Hessner MJ, Routes JM, Verbsky JW, Jacob HJ, Dimmock DP: Making a definitive diagnosis: successful clinical application of whole exome sequencing in a child with intractable inflammatory bowel disease.
Wallis DE, Roessler E, Hehr U, Nanni L, Wiltshire T, Richieri-Costa A, Gillessen-Kaesbach G, Zackai EH, Rommens J, Muenke M: Mutations in the homeodomain of the human SIX3 gene cause holoprosencephaly.
Thiel C, Kessler K, Giessl A, Dimmler A, Shalev SA, von der Haar S, Zenker M, Zahnleiter D, Stöss H, Beinder E, Abou Jamra R, Ekici AB, Schröder-Kress N, Aigner T, Kirchner T, Reis A, Brandstätter JH, Rauch A: NEK1 mutations cause short-rib polydactyly syndrome type majewski.
Mäkitie O, Savarirayan R, Bonafé L, Robertson S, Susic M, Superti-Furga A, Cole WG: Autosomal recessive multiple epiphyseal dysplasia with homozygosity for C653S in the DTDST gene: double-layer patella as a reliable sign.
Piao X, Hill RS, Bodell A, Chang BS, Basel-Vanagaite L, Straussberg R, Dobyns WB, Qasrawi B, Winter RM, Innes AM, Voit T, Ross ME, Michaud JL, Déscarie JC, Barkovich AJ, Walsh CA: G protein-coupled receptor-dependent development of human frontal cortex.
Deardorff MA, Wilde JJ, Albrecht M, Dickinson E, Tennstedt S, Braunholz D, Mönnich M, Yan Y, Xu W, Gil-Rodríguez MC, Clark D, Hakonarson H, Halbach S, Michelis LD, Rampuria A, Rossier E, Spranger S, Van Maldergem L, Lynch SA, Gillessen-Kaesbach G, Lüdecke HJ, Ramsay RG, McKay MJ, Krantz ID, Xu H, Horsfield JA, Kaiser FJ: RAD21 mutations cause a human cohesinopathy.
Nishimura G, Nakashima E, Mabuchi A, Shimamoto K, Shimamoto T, Shimao Y, Nagai T, Yamaguchi T, Kosaki R, Ohashi H, Makita Y, Ikegawa S: Identification of COL2A1 mutations in platyspondylic skeletal dysplasia torrance type.
Kleefstra T, Brunner HG, Amiel J, Oudakker AR, Nillesen WM, Magee A, Geneviève D, Cormier-Daire V, van Esch H, Fryns JP, Hamel BC, Sistermans EA, de Vries BB, van Bokhoven H: Loss-of-function mutations in euchromatin histone methyl transferase 1 (EHMT1) cause the 9q34 subtelomeric deletion syndrome.
Pasutto F, Sticht H, Hammersen G, Gillessen-Kaesbach G, Fitzpatrick DR, Nürnberg G, Brasch F, Schirmer-Zimmermann H, Tolmie JL, Chitayat D, Houge G, Fernández-Martínez L, Keating S, Mortier G, Hennekam RC, von der Wense A, Slavotinek A, Meinecke P, Bitoun P, Becker C, Nürnberg P, Reis A, Rauch A: Mutations in STRA6 cause a broad spectrum of malformations including anophthalmia, congenital heart defects, diaphragmatic hernia, alveolar capillary dysplasia, lung hypoplasia, and mental retardation.
Johnston JJ, Teer JK, Cherukuri PF, Hansen NF, Loftus SK, Chong K, Mullikin JC, Biesecker LG, NIH Intramural Sequencing Center (NISC): Massively parallel sequencing of exons on the X chromosome identifies RBM10 as the gene that causes a syndromic form of cleft palate.
Amiel J, Rio M, de Pontual L, Redon R, Malan V, Boddaert N, Plouin P, Carter NP, Lyonnet S, Munnich A, Colleaux L: Mutations in TCF4, encoding a class I basic helix-loop-helix transcription factor, are responsible for Pitt-Hopkins syndrome, a severe epileptic encephalopathy associated with autonomic dysfunction.
Woods CG, Stricker S, Seemann P, Stern R, Cox J, Sherridan E, Roberts E, Springell K, Scott S, Karbani G, Sharif SM, Toomes C, Bond J, Kumar D, Al-Gazali L, Mundlos S: Mutations in WNT7A cause a range of limb malformations, including Fuhrmann syndrome and Al-Awadi/Raas-Rothschild/Schinzel phocomelia syndrome.
Mandel H, Shemer R, Borochowitz ZU, Okopnik M, Knopf C, Indelman M, Drugan A, Tiosano D, Gershoni-Baruch R, Choder M, Sprecher E: SERKAL syndrome: an autosomal-recessive disorder caused by a loss-of-function mutation in WNT4.
Brunetti-Pierri N, Corso G, Rossi M, Ferrari P, Balli F, Rivasi F, Annunziata I, Ballabio A, Russo AD, Andria G, Parenti G: Lathosterolosis, a novel multiple-malformation/mental retardation syndrome due to deficiency of 3beta-hydroxysteroid-delta5-desaturase.
Jenkins D, Seelow D, Jehee FS, Perlyn CA, Alonso LG, Bueno DF, Donnai D, Josifova D, Mathijssen IM, Morton JE, Orstavik KH, Sweeney E, Wall SA, Marsh JL, Nurnberg P, Passos-Bueno MR, Wilkie AO: RAB23 mutations in carpenter syndrome imply an unexpected role for hedgehog signaling in cranial-suture development and obesity.
Waterham HR, Koster J, Romeijn GJ, Hennekam RC, Vreken P, Andersson HC, FitzPatrick DR, Kelley RI, Wanders RJ: Mutations in the 3beta-hydroxysterol Delta24-reductase gene cause desmosterolosis, an autosomal recessive disorder of cholesterol biosynthesis.
Kurotaki N, Imaizumi K, Harada N, Masuno M, Kondoh T, Nagai T, Ohashi H, Naritomi K, Tsukahara M, Makita Y, Sugimoto T, Sonoda T, Hasegawa T, Chinen Y, Tomita Ha HA, Kinoshita A, Mizuguchi T, Yoshiura Ki K, Ohta T, Kishino T, Fukushima Y, Niikawa N, Matsumoto N: Haploinsufficiency of NSD1 causes Sotos syndrome.
Hellemans J, Coucke PJ, Giedion A, De Paepe A, Kramer P, Beemer F, Mortier GR: Homozygous mutations in IHH cause acrocapitofemoral dysplasia, an autosomal recessive disorder with cone-shaped epiphyses in hands and hips.
DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, Philippakis AA, del Angel G, Rivas MA, Hanna M, McKenna A, Fennell TJ, Kernytsky AM, Sivachenko AY, Cibulskis K, Gabriel SB, Altshuler D, Daly MJ: A framework for variation discovery and genotyping using next-generation DNA sequencing data.
Bol Oficina Sanit Panam 1968, 65(4):281-393. PubMed Abstract
Yang Y, Muzny DM, Reid JG, Bainbridge MN, Willis A, Ward PA, Braxton A, Beuten J, Xia F, Niu Z, Hardison M, Person R, Bekheirnia MR, Leduc MS, Kirby A, Pham P, Scull J, Wang M, Ding Y, Plon SE, Lupski JR, Beaudet AL, Gibbs RA, Eng CM: Clinical whole-exome sequencing for the diagnosis of mendelian disorders.
Gahl WA, Markello TC, Toro C, Fajardo KF, Sincan M, Gill F, Carlson-Donohoe H, Gropman A, Pierson TM, Golas G, Wolfe L, Groden C, Godfrey R, Nehrebecky M, Wahl C, Landis DM, Yang S, Madeo A, Mullikin JC, Boerkoel CF, Tifft CJ, Adams D: The National Institutes of Health Undiagnosed Diseases Program: insights into rare diseases.
Jacob HJ, Abrams K, Bick DP, Brodie K, Dimmock DP, Farrell M, Geurts J, Harris J, Helbling D, Joers BJ, Kliegman R, Kowalski G, Lazar J, Margolis DA, North P, Northup J, Roquemore-Goins A, Scharer G, Shimoyama M, Strong K, Taylor B, Tsaih SW, Tschannen MR, Veith RL, Wendt-Andrae J, Wilk B, Worthey EA: Genomics in clinical practice: lessons from the front lines.
Jimenez-Escrig A, Gobernado I, Garcia-Villanueva M, Sanchez-Herranz A: Autosomal recessive Emery-Dreifuss muscular dystrophy caused by a novel mutation (R225Q) in the lamin A/C gene identified by exome sequencing.
Ariani F, Mari F, Amitrano S, Di Marco C, Artuso R, Scala E, Meloni I, Della Volpe R, Rossi A, van Bokhoven H, Renieri A: Exome sequencing overrides formal genetics: ASPM mutations in a case study of apparent X-linked microcephalic intellectual deficit.
Girdea M, Dumitriu S, Fiume M, Bowdin S, Boycott KM, Chénier S, Chitayat D, Faghfoury H, Meyn MS, Ray PN, So J, Stavropoulos DJ, Brudno M: Phenotips: patient phenotyping software for clinical and research use.
The pre-publication history for this paper can be accessed here: