Research article (Open Access)

Cluster analysis for identifying sub-groups and selecting potential discriminatory variables in human encephalitis

Jemila S Hamid1,2, Christopher Meaney3, Natasha S Crowcroft1,2, Julia Granerod4, Joseph Beyene5*, on behalf of the UK Etiology of Encephalitis Study Group

Author Affiliations

1 Surveillance and Epidemiology, Ontario Agency for Health Protection and Promotion, Toronto, Canada

2 Dalla Lana School of Public Health, University of Toronto, Toronto, Canada

3 Department of Family and Community Medicine, University of Toronto, Toronto, Canada

4 Health Protection Agency, Centre for Infections, London, UK

5 Population Genomics Program, Department of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton, Canada


BMC Infectious Diseases 2010, 10:364  doi:10.1186/1471-2334-10-364

Published: 31 December 2010



Encephalitis is an acute clinical syndrome of the central nervous system (CNS), often associated with a fatal outcome or permanent damage, including cognitive and behavioural impairment, affective disorders and epileptic seizures. Infection of the CNS is considered a major cause of encephalitis, and more than 100 different pathogens have been recognized as causative agents. However, the etiology remains unknown in a large proportion of cases.


We perform hierarchical cluster analysis on a multicenter encephalitis data set from England with the aim of identifying sub-groups in human encephalitis. We use the simple matching similarity measure, which is appropriate for binary data, and perform variable selection using cluster heatmaps. We also use heatmaps to visually assess underlying patterns in the data, to identify the main clinical and laboratory features, and to identify potential risk factors associated with encephalitis.
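To make the similarity measure concrete, the following is a minimal sketch of the simple matching coefficient (the share of binary variables on which two records agree) used as the basis for a naive agglomerative clustering. It is an illustration only, not the study's implementation: the average-linkage rule, the function names and the four toy records are assumptions made here, since the linkage method and the actual variables are not stated in this text.

```python
from itertools import combinations


def simple_matching(a, b):
    """Simple matching coefficient: fraction of binary variables on which a and b agree."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of variables")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def average_linkage_clusters(records, n_clusters):
    """Naive agglomerative clustering on the dissimilarity 1 - SMC.

    Average linkage is an assumption chosen for illustration.
    """
    # Pairwise dissimilarities between all records, keyed by (i, j) with i < j.
    dist = {
        (i, j): 1.0 - simple_matching(records[i], records[j])
        for i, j in combinations(range(len(records)), 2)
    }
    clusters = [[i] for i in range(len(records))]

    def linkage(c1, c2):
        # Mean dissimilarity over all cross-cluster pairs.
        pairs = [(min(i, j), max(i, j)) for i in c1 for j in c2]
        return sum(dist[p] for p in pairs) / len(pairs)

    # Repeatedly merge the two closest clusters until n_clusters remain.
    while len(clusters) > n_clusters:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: linkage(clusters[ab[0]], clusters[ab[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return [sorted(c) for c in clusters]


# Hypothetical binary records (1 = present, 0 = absent) for four made-up
# variables: fever, headache, lethargy, abnormal brain scan.
patients = [
    (1, 1, 1, 0),
    (1, 1, 0, 0),
    (0, 0, 0, 1),
    (0, 0, 1, 1),
]
groups = average_linkage_clusters(patients, n_clusters=2)
print(groups)
```

Because the simple matching coefficient counts joint absences as agreement, two patients who both lack a symptom are treated as similar on that variable, just as two patients who both have it would be.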


Our results identify fever, personality and behavioural change, headache and lethargy as the main characteristics of encephalitis. Diagnostic variables such as brain scan results and cerebrospinal fluid measurements are also identified as main indicators of encephalitis. Our analysis reveals six major clusters in the England encephalitis data set. However, marked within-cluster heterogeneity is observed in some of the larger clusters, indicating possible sub-groups. Overall, the results show that patients cluster according to symptom and diagnostic variables rather than causal agents. Exposure variables such as recent infection, contact with a sick person and contact with animals are identified as potential risk factors.


It is generally assumed, and is common practice, that encephalitis cases should be grouped according to disease etiology. However, our results indicate that patients cluster mainly with respect to symptom and diagnostic variables rather than causal agents. These similarities and differences in symptom and diagnostic measurements might be attributable to host factors. The idea that characteristics of the host may be more important than the pathogen is also consistent with the observation that, for some causes such as herpes simplex virus (HSV), encephalitis is a rare outcome of a common infection.