Clinical phenotype-based gene prioritization: an initial study using semantic similarity and the human phenotype ontology
1 Center for Biomedical Informatics, The Children’s Hospital of Philadelphia, Philadelphia, PA, USA
2 Department of Pediatrics, The Children’s Hospital of Philadelphia, Philadelphia, PA, USA
3 Department of Pathology and Laboratory Medicine, The Children’s Hospital of Philadelphia, Philadelphia, PA, USA
4 Department of Pathology and Laboratory Medicine, University of Pennsylvania, Philadelphia, PA, USA
5 Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
6 Institute for Medical Genetics and Human Genetics, Charité-Universitätsmedizin Berlin, Augustenburger Platz 1, 13353 Berlin, Germany
7 Berlin-Brandenburg Center for Regenerative Therapies, Charité-Universitätsmedizin Berlin, Augustenburger Platz 1, 13353 Berlin, Germany
8 Max Planck Institute for Molecular Genetics, Ihnestrasse 73, 14195 Berlin, Germany
9 Department of Pediatrics, Cincinnati Children’s Hospital and Medical Center, Cincinnati, OH, USA
10 Department of Biomedical Informatics, University of Cincinnati College of Medicine, Cincinnati, OH, USA
BMC Bioinformatics 2014, 15:248 doi:10.1186/1471-2105-15-248
Published: 21 July 2014
Exome sequencing is a promising method for diagnosing patients with a complex phenotype. However, interpreting variants relative to the patient phenotype can be challenging, particularly in the clinical assessment of rare, complex phenotypes. Each patient’s sequence reveals many possibly damaging variants, each of which must be individually assessed to establish a clear association with the patient phenotype. To assist interpretation, we implemented an algorithm that ranks a given set of genes relative to the patient phenotype. The algorithm orders genes by the semantic similarity computed between the phenotypic descriptors associated with each gene and those describing the patient. Phenotypic descriptor terms are taken from the Human Phenotype Ontology (HPO), and semantic similarity is derived from each term’s information content.
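The ranking idea can be illustrated with a minimal sketch. It assumes a Resnik-style similarity (the information content of the most informative common ancestor) and a best-match average over patient terms; the abstract does not specify the exact measure or aggregation, so both are assumptions. The ontology, term names, and gene names below are hypothetical toy data, not real HPO content.

```python
import math

# Hypothetical toy ontology: term -> set of direct parents (not real HPO IDs).
PARENTS = {
    "root": set(),
    "ear_abnormality": {"root"},
    "hearing_impairment": {"ear_abnormality"},
    "sensorineural_hearing_impairment": {"hearing_impairment"},
    "eye_abnormality": {"root"},
    "cataract": {"eye_abnormality"},
}

# Hypothetical gene-to-phenotype annotations.
GENE_TERMS = {
    "GENE_A": {"sensorineural_hearing_impairment"},
    "GENE_B": {"cataract"},
    "GENE_C": {"hearing_impairment", "cataract"},
}

def ancestors(term):
    """Return the term together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def information_content():
    """IC(t) = -log(fraction of genes annotated with t or a descendant of t)."""
    n = len(GENE_TERMS)
    counts = {}
    for terms in GENE_TERMS.values():
        # A gene annotated with a term is implicitly annotated with its ancestors.
        covered = set().union(*(ancestors(t) for t in terms))
        for t in covered:
            counts[t] = counts.get(t, 0) + 1
    return {t: -math.log(c / n) for t, c in counts.items()}

def resnik(a, b, ic):
    """Similarity of two terms: IC of their most informative common ancestor."""
    common = ancestors(a) & ancestors(b)
    return max((ic.get(t, 0.0) for t in common), default=0.0)

def score(patient_terms, gene_terms, ic):
    """Assumed aggregation: mean over patient terms of the best gene-term match."""
    best = [max(resnik(p, g, ic) for g in gene_terms) for p in patient_terms]
    return sum(best) / len(best)

def rank_genes(patient_terms):
    """Order genes from most to least similar to the patient's phenotype."""
    ic = information_content()
    scores = {g: score(patient_terms, t, ic) for g, t in GENE_TERMS.items()}
    return sorted(scores, key=scores.get, reverse=True)

print(rank_genes({"sensorineural_hearing_impairment"}))
```

In this toy setting, a gene annotated with the patient's exact, rare term scores higher than one annotated only with a more general ancestor, which is the behavior that information content is meant to capture.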
Model validation was performed via simulation and with clinical data. We simulated 33 Mendelian diseases with 100 patients per disease. We modeled clinical conditions by adding noise and imprecision, i.e., phenotypic terms unrelated to the disease and terms less specific than the actual disease terms. We ranked the causative gene against all 2,488 HPO-annotated genes. The median causative gene rank was 1 for the optimal and noise cases, 12 for the imprecision case, and 60 for the imprecision-with-noise case. Additionally, we examined a clinical cohort of subjects with hearing impairment. The median rank of the disease gene was 22. However, when the patient’s exome data were also considered, with non-exomic and common variants filtered out, the median rank improved to 3.
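The two perturbations used in the simulation can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the study: imprecision is modeled here by replacing each term with a direct parent, and noise by adding terms that are neither disease terms nor their ancestors or descendants. How far up the ontology terms were moved and how many noise terms were added are not stated in the abstract. The ontology and term names are hypothetical.

```python
import random

# Hypothetical toy ontology: term -> list of direct parents (not real HPO IDs).
PARENTS = {
    "root": [],
    "ear_abnormality": ["root"],
    "hearing_impairment": ["ear_abnormality"],
    "sensorineural_hearing_impairment": ["hearing_impairment"],
    "eye_abnormality": ["root"],
    "cataract": ["eye_abnormality"],
    "skeletal_abnormality": ["root"],
    "scoliosis": ["skeletal_abnormality"],
}

def strict_ancestors(term):
    """Return all ancestors of a term, excluding the term itself."""
    found = set()
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found

def add_imprecision(terms, rng):
    """Assumed model: replace each term with one of its direct parents."""
    return {rng.choice(PARENTS[t]) if PARENTS[t] else t for t in sorted(terms)}

def add_noise(terms, n_noise, rng):
    """Add up to n_noise terms unrelated to the disease terms."""
    terms = set(terms)
    related = terms.union(*(strict_ancestors(t) for t in terms))
    candidates = [
        t for t in sorted(PARENTS)
        if t not in related and not (strict_ancestors(t) & terms)
    ]
    picked = rng.sample(candidates, min(n_noise, len(candidates)))
    return terms | set(picked)

rng = random.Random(0)  # fixed seed so the simulated patient is reproducible
disease = {"sensorineural_hearing_impairment", "cataract"}
print(sorted(add_imprecision(disease, rng)))
print(sorted(add_noise(disease, 1, rng)))
```

A simulated patient for the imprecision-with-noise case would apply both functions in sequence before the patient's terms are scored against each gene.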
Semantic similarity can rank a causative gene highly within a gene list relative to patient phenotype characteristics, provided that imprecision is mitigated. The clinical case results suggest that phenotype rank combined with variant analysis provides significant improvement over the individual approaches. We expect that this combined prioritization approach may increase accuracy and decrease effort for clinical genetic diagnosis.