Centre Tecnològic de Telecomunicacions de Catalunya (CTTC), Av. Carl Friedrich Gauss 7, 08860 Castelldefels, Barcelona, Spain

Institut Català de la Salut (ICS), Sistema d’Informació dels Serveis d’Atenció Primària (SISAP), Gran Via de les Corts Catalanes, 587-589, 08007 Barcelona, Spain

Abstract

Background

Influenza is a well-known and common human respiratory infection, causing significant morbidity and mortality every year. Despite influenza variability, fast and reliable outbreak detection is required for health resource planning. Clinical health records, as published by the Diagnosticat database in Catalonia, host useful data for probabilistic detection of influenza outbreaks.

Methods

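This paper proposes a statistical method to detect influenza epidemic activity. Non-epidemic incidence rates are modeled against the exponential distribution, and the maximum likelihood estimate for the decaying factor $\lambda$ is computed and sequentially updated. Weekly epidemic activity is then assessed with a Kolmogorov-Smirnov test.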

Results

The main advantage with respect to other approaches is the adoption of a statistically meaningful test, which provides an indicator of epidemic activity with an associated probability. The detection algorithm was initiated with parameter $\lambda_0 = 3.8617$, estimated from the training sequence (corresponding to non-epidemic incidence rates of the 2008-2009 influenza season), and sequentially updated. The Kolmogorov-Smirnov test detected the following weeks as epidemic for each influenza season: weeks 50−10 (2008-2009 season), weeks 38−50 (2009-2010 season), weeks 50−9 (2010-2011 season) and weeks 3−12 for the current 2011-2012 season.

Conclusions

Real medical data was used to assess the validity of the approach, as well as to construct a realistic statistical model of weekly influenza incidence rates in non-epidemic periods. For the tested data, the results confirmed the ability of the algorithm to detect the start and the end of epidemic periods. In general, the proposed test could be applied to other data sets to quickly detect influenza outbreaks. The sequential structure of the test makes it suitable for implementation on many platforms at a low computational cost, without the need to store large data sets.

Background

Influenza is a well-known and common human respiratory infection. It is responsible for significant morbidity and mortality every year. The World Health Organization (WHO) estimates that annual epidemics result in about 3 to 5 million cases of severe illness and about 250,000 to 500,000 deaths worldwide.

Sentinel networks covering less than 2% of the population have been the traditional surveillance system. More recently, electronic health records have been widely implemented in some regions, making a significant amount of health-related data available. In Catalonia, primary care doctors have been routinely registering their activity in eCAP (an electronic health recording system) since 2006. This accounts for over 3,500 physicians collecting data on nearly 6 million people (80% of the population).

Among the large variety of tracked diseases, we focus on influenza. Influenza data in Diagnosticat have been shown to be valid when compared with the sentinel network data. Their main advantage is that they are available faster.

This paper presents a statistical surveillance system which provides an automated detection of influenza epidemics. Health resource planning is an application which could benefit from this tool. The proposed method is able to operate on-line^{a} and it is based on the statistical characterization of non-epidemic influenza incidence rates. A major advantage of this approach is its statistical meaningfulness, and thus detection is not only a binary result but the confidence in its outcome can be assessed in terms of probabilities resorting to hypothesis testing theory. Non-epidemic data is modeled with an exponential distribution, in the vein of

The remainder of the paper is organized as follows. Diagnosticat is introduced in the following section. Then we provide insights on the statistical distribution of influenza incidence rates, as well as on how the relevant parameters can be estimated from the observations. The general statistical detector is then proposed, and the results for the Catalan case study are presented.

Methods

Diagnosticat: an open epidemiological database

Diagnosticat is an open-access database which contains reports of many diseases occurring in Catalonia, such as influenza, papilloma or chickenpox. The information available in the Diagnosticat database includes all clinical influenza diagnosis codes (ICD-10 codes) and is obtained weekly from eCAP through an automated process. The website is updated a few minutes after the end of every epidemiological week (EW). After the extraction, a computer algorithm automatically creates the different tables with the information that is used in the website. No identifiable or personal information on patients is used, maintained or transferred through this system.

Currently, the Diagnosticat database is composed of data from 4 influenza seasons. Information is available from 2008 and has been updated weekly since then. Data is presented as incidence rates per 10^{5} population, a unit that allows comparison of diagnoses over different territories independently of the number of inhabitants. In this work, the EW is a group of seven days that begins on a Sunday and ends on a Saturday.

Although communicable diseases are in general represented by calendar year, influenza is represented by seasons due to the characteristics of the influenza epidemic. The epidemic usually starts at the end of the year and ends in the middle of the following year, peaking in December and January. For this reason, and to be consistent with the influenza epidemic, Diagnosticat plots data by seasons beginning at EW 23 and ending at EW 22 of the following year.
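The season convention described above can be sketched in a few lines of Python. This is only an illustration: the function name `influenza_season` and the `"YYYY-YYYY"` label format are assumptions made here, not part of Diagnosticat.

```python
def influenza_season(year: int, week: int) -> str:
    """Map a calendar year and epidemiological week (EW) to a season label.

    Assumes the convention stated in the text: a season runs from EW 23
    of one year to EW 22 of the following year.
    """
    if not 1 <= week <= 53:
        raise ValueError("week must be in 1..53")
    # Weeks 23 and later open a new season; earlier weeks close the previous one.
    start_year = year if week >= 23 else year - 1
    return f"{start_year}-{start_year + 1}"


print(influenza_season(2008, 50))  # a week late in 2008
print(influenza_season(2009, 10))  # a week early in 2009, same season
```

Under this convention, both example weeks fall in the same season, which is why an epidemic spanning the new year is reported as a single event.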

Statistical data analysis

In this section we aim at obtaining a statistical model for the recorded data of influenza cases. In particular, we find that an exponential distribution might be an appropriate way of modeling non-epidemic data. Let us consider that the rate of influenza cases in non-epidemic periods is a random variable $X$, and define $\mathcal{X}_t = \{x_1, \ldots, x_t\}$ as the set containing chronologically-ordered observations up to time $t$. We model $X$ as exponentially distributed, with PDF

$$f_X(x) = \lambda e^{-\lambda x}, \quad x \geq 0, \tag{1}$$

and 0 otherwise. Mean and variance are expressed in terms of the parameter $\lambda$ as $1/\lambda$ and $1/\lambda^{2}$, respectively. Notice that only a subset of $\mathcal{X}_t$ follows (1), that is, those observations taken in non-epidemic weeks. We define this subset as $\tilde{\mathcal{X}}_t \subset \mathcal{X}_t$, and its elements $\tilde{x}_i$ are the observations in $\mathcal{X}_t$ which were detected (using the herein proposed method for instance) as in non-epidemic periods. We use $n_t$ to denote the total number of non-epidemic weeks up to $t$.

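Such statistical characterization means that, if not in an epidemic scenario, the rate of cases would most likely be close to zero, with a probability density that decreases according to the exponential factor $\lambda$. Assuming that observations are independent, the likelihood function of the non-epidemic subset $\tilde{\mathcal{X}}_t$, with elements $\tilde{x}_i$, $i = 1, \ldots, n_t$, is

$$p(\tilde{\mathcal{X}}_t \mid \lambda) = \prod_{i=1}^{n_t} \lambda e^{-\lambda \tilde{x}_i} = \lambda^{n_t} \exp\left(-\lambda \sum_{i=1}^{n_t} \tilde{x}_i\right), \tag{2}$$

whose optimization is equivalent to maximization of the log-likelihood. The derivative of the latter with respect to $\lambda$ is

$$\frac{\partial \ln p(\tilde{\mathcal{X}}_t \mid \lambda)}{\partial \lambda} = \frac{n_t}{\lambda} - \sum_{i=1}^{n_t} \tilde{x}_i, \tag{3}$$

and equating to zero we obtain the ML estimate of the exponential factor as

$$\hat{\lambda}_t = \frac{n_t}{\sum_{i=1}^{n_t} \tilde{x}_i}, \tag{4}$$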

which is asymptotically unbiased and consistent, with asymptotic variance $\lambda^{2}/n_t$, where $n_t$ is the number of observations in the non-epidemic period up to the current week $t$.
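As a minimal sketch of the ML fit in (4), the Python snippet below estimates the exponential factor as the number of non-epidemic observations divided by their sum. The function name, the synthetic data and the value `true_rate = 0.25` are assumptions chosen only for illustration; they are not taken from Diagnosticat.

```python
import random


def ml_exponential_rate(samples):
    """ML estimate of the exponential factor: sample count over sample sum."""
    samples = list(samples)
    if not samples or any(x < 0 for x in samples):
        raise ValueError("need at least one non-negative sample")
    total = sum(samples)
    if total == 0:
        raise ValueError("the sum of the samples must be positive")
    return len(samples) / total


# Synthetic non-epidemic rates drawn from a known exponential distribution.
rng = random.Random(0)
true_rate = 0.25  # hypothetical value, for illustration only
rates = [rng.expovariate(true_rate) for _ in range(5000)]
lam_hat = ml_exponential_rate(rates)
print(f"estimated exponential factor: {lam_hat:.4f}")
```

With a few thousand samples the estimate should land close to the generating value, in line with the consistency of the estimator.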

We study the goodness-of-fit of the statistical model in our case study, i.e. the database in Diagnosticat. The figure below compares the sample histogram with the estimated exponential PDF. We also ran a $\chi^{2}$ goodness-of-fit test to validate the exponential assumption.

**Comparison of sample histogram and the estimated exponential PDF used to model the rate.**
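To illustrate the kind of $\chi^{2}$ goodness-of-fit check mentioned above, the following Python sketch computes a Pearson chi-square statistic against a fitted exponential model. The binning into equal-probability cells, the number of bins, the function name and the synthetic data are all assumptions made here; the text does not specify how the authors binned their data.

```python
import math
import random


def chi_square_exponential(samples, rate, bins=10):
    """Pearson chi-square statistic of `samples` against Exp(rate).

    Bin edges are quantiles of the model, so every bin has the same
    expected count, len(samples) / bins.
    """
    n = len(samples)
    edges = [-math.log(1.0 - k / bins) / rate for k in range(1, bins)]
    observed = [0] * bins
    for x in samples:
        # The bin index is the number of edges at or below x.
        observed[sum(1 for e in edges if x >= e)] += 1
    expected = n / bins
    return sum((o - expected) ** 2 / expected for o in observed)


rng = random.Random(1)
synthetic = [rng.expovariate(0.25) for _ in range(500)]  # hypothetical rates
rate_hat = len(synthetic) / sum(synthetic)               # ML estimate of the factor
statistic = chi_square_exponential(synthetic, rate_hat)
print(f"rate_hat={rate_hat:.3f}, chi2={statistic:.2f}")
```

With 10 bins and one estimated parameter, the statistic would be compared against a chi-square distribution with 8 degrees of freedom, whose 0.95 quantile is about 15.51.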

Sequential influenza detector

The objective is to build an autonomous detection algorithm that is able to determine whether influenza is active or not based on the records of influenza cases. In the case of Diagnosticat, these observations are weekly received, and thus a week-by-week detection is provided by the method. As mentioned in

The detector is based on the idea that we have a statistical characterization of the influenza incidence rates when not in the epidemic phase. As discussed earlier, these observations follow an exponential distribution with parameter $\lambda$.

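For a given instant $t$, the null hypothesis $\mathcal{H}_0$ is that the observation $x_t$ comes from a distribution as specified in (1) with $\lambda = \hat{\lambda}_{t-1}$, the most recent estimate of the exponential factor; the alternative hypothesis $\mathcal{H}_1$ is that it does not.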

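We notice that the test has to be conducted using a single observation, $x_t$, from the random variable and that the baseline distribution is continuous and completely specified. Although other alternatives might apply, in this type of decision problem we could obtain enhanced performance by resorting to Empirical Density Function (EDF) statistics to assess the null hypothesis $\mathcal{H}_0$. For $n$ observations $x_1, \ldots, x_n$, the EDF is

$$F_n(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}_{\{x_i \leq x\}},$$

where $\mathbb{1}_{\{A\}}$ is the indicator function of an event $A$. The EDF is compared with the CDF of the baseline distribution, which is $F_X(x) = 1 - e^{-\lambda x}$ for the exponential distribution.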

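In general, the one-sample KS statistic is defined as

$$D_n = \sup_x \left| F_n(x) - F_X(x) \right|,$$

although in our case only the observation $x_t$ is used to construct the EDF, and thus the EDF can only take values 0 or 1:

$$D_t = \sup_x \left| \mathbb{1}_{\{x_t \leq x\}} - F_X(x) \right| = \max\left\{ 1 - e^{-\lambda x_t},\; e^{-\lambda x_t} \right\},$$

where we used that $|0 - F_X(x)| = 1 - e^{-\lambda x}$ and $|1 - F_X(x)| = e^{-\lambda x}$ when $x \geq 0$.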
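The single-observation statistic can be evaluated directly from the exponential CDF. The Python sketch below is one way to do it; the function name and the example values (an exponential factor of 0.25 and three incidence rates) are hypothetical.

```python
import math


def ks_statistic_single(x_t, rate):
    """One-sample KS statistic for a single observation against Exp(rate).

    With one observation the EDF is a unit step at x_t, so the largest gap
    to the CDF is either F_X(x_t) or 1 - F_X(x_t).
    """
    if x_t < 0 or rate <= 0:
        raise ValueError("x_t must be non-negative and rate positive")
    cdf = 1.0 - math.exp(-rate * x_t)  # F_X(x_t)
    return max(cdf, 1.0 - cdf)


rate = 0.25  # hypothetical exponential factor
for x_t in (2.0, 4.0, 40.0):
    print(x_t, round(ks_statistic_single(x_t, rate), 4))
```

The statistic is smallest (0.5) when the observation sits at the median of the fitted distribution and approaches 1 as the observation moves far into the tail.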

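With this detector, hypothesis $\mathcal{H}_0$ is rejected at significance level $\alpha$ whenever $D_t > d_{\alpha}$, where the threshold satisfies $P(D_t > d_{\alpha} \mid \mathcal{H}_0) = \alpha$ and is obtained from the distribution of $D_t$ when $x_t$ is drawn from $F_X(x)$.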
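In this single-observation setting the threshold admits a closed form. The derivation below is a sketch, writing $D_t$ for the single-observation KS statistic and $d_\alpha$ for the threshold, and assuming that $\lambda$ is treated as known under the null hypothesis (in practice it is an estimate, so the level is only approximate). The auxiliary variable $U$ is introduced here for illustration.

```latex
\begin{align*}
U &= F_X(x_t) \sim \mathcal{U}(0,1) \quad \text{under } \mathcal{H}_0, \\
D_t &= \max\{U,\, 1-U\} \in [1/2,\, 1], \\
P(D_t \le d \mid \mathcal{H}_0) &= P(1-d \le U \le d) = 2d - 1, \quad 1/2 \le d \le 1, \\
P(D_t > d_\alpha \mid \mathcal{H}_0) &= 2(1 - d_\alpha) = \alpha
\;\Longrightarrow\; d_\alpha = 1 - \alpha/2 .
\end{align*}
```

For instance, an illustrative significance level of $\alpha = 0.05$ would give $d_\alpha = 0.975$.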

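The result of the hypothesis test is binary. Another useful way of reporting the result of the test is through $p$-values, that is, the probability under $\mathcal{H}_0$ of obtaining a statistic at least as large as the observed $D_t$; a small $p$-value indicates that $x_t$ is unlikely to have been generated from $F_X(x)$.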
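For a single observation, and assuming the exponential factor is treated as known under the null hypothesis, the KS statistic is uniformly distributed on [0.5, 1], so the $p$-value has a simple closed form. The Python sketch below relies on that assumption; the function name is hypothetical.

```python
def p_value_from_statistic(d_t):
    """p-value of the single-observation KS statistic.

    Under the null hypothesis the statistic is uniform on [0.5, 1], hence
    P(D > d_t) = 2 * (1 - d_t).
    """
    if not 0.5 <= d_t <= 1.0:
        raise ValueError("the single-observation statistic lies in [0.5, 1]")
    return 2.0 * (1.0 - d_t)


for d_t in (0.5, 0.9, 0.975, 0.999):
    print(d_t, round(p_value_from_statistic(d_t), 4))
```

Rejecting when the $p$-value falls below a chosen significance level is then equivalent to comparing the statistic with the corresponding threshold.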

To complete the sequential detection algorithm, we need to estimate the exponential factor $\lambda$ from the $n_t$ non-epidemic weeks each time a new observation is recorded. The proposed procedure is applied after the current observation $x_t$ is processed by the KS-based detector. If, based on $x_t$, the method rejects $\mathcal{H}_0$, the estimate is left unchanged, that is, $\hat{\lambda}_t = \hat{\lambda}_{t-1}$ and $n_t = n_{t-1}$. Otherwise, if $x_t$ shows evidence that at the $t$-th week there is no epidemic, we set $n_t = n_{t-1} + 1$ and use the following recursive expression

$$\hat{\lambda}_t = \frac{n_t \, \hat{\lambda}_{t-1}}{n_{t-1} + \hat{\lambda}_{t-1} x_t}, \tag{8}$$

which is algebraically equivalent to (4). Using (8) instead of (4) has the advantage that new data is processed upon arrival, and thus there is no need to store the complete dataset nor reprocess all data for each new measurement. The exponential factor has to be initialized with a value $\hat{\lambda}_0$, and $n_t$ counts the number of times the null hypothesis was assessed valid up to $t$.
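The equivalence between the recursive update and the batch ML estimate can be checked numerically. The Python sketch below assumes the recursion takes the form: new estimate = (n + 1) · previous estimate / (n + previous estimate · new observation), where n is the number of non-epidemic weeks already used. The function name and the three sample values are hypothetical.

```python
def update_rate(rate_prev, n_prev, x_t):
    """Fold one new non-epidemic observation into the running estimate.

    Equivalent to recomputing (number of samples) / (sum of samples) from
    scratch, but only the previous estimate and count need to be stored.
    """
    n_t = n_prev + 1
    rate_t = n_t * rate_prev / (n_prev + rate_prev * x_t)
    return rate_t, n_t


samples = [2.0, 4.0, 6.0]       # hypothetical non-epidemic rates
rate, n = 1.0 / samples[0], 1   # estimate after the first sample
for x in samples[1:]:
    rate, n = update_rate(rate, n, x)

batch = len(samples) / sum(samples)
print(rate, batch)
```

Both numbers agree up to floating-point rounding, which is the property that lets the detector run without keeping past observations.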

The pseudo-code of the sequential influenza detector is given in the Algorithm below, which also includes the recursive exponential fitting. Each time a new observation $x_t$ is recorded, it is tested against the exponential distribution and a new value of the statistic $D_t$ is computed.

**Schematic representation of the sequential method.**

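Algorithm Sequential detection of influenza epidemics

1: Initialization: $\hat{\lambda}_0$ and $n_0 = 1$

2: At time $t$, record the observation $x_t$

3: Compute statistic $D_t = \max\{1 - e^{-\hat{\lambda}_{t-1} x_t},\; e^{-\hat{\lambda}_{t-1} x_t}\}$

4: **if** $D_t > d_{\alpha}$ **then**

5: Reject the null hypothesis ⇒ Flu detected at the $t$-th week

6: Keep $\hat{\lambda}_t = \hat{\lambda}_{t-1}$ and $n_t = n_{t-1}$

7: **else if** $D_t \leq d_{\alpha}$ **then**

8: Accept the null hypothesis ⇒ Flu not detected at the $t$-th week

9: Set $n_t = n_{t-1} + 1$, the number of observations used to calculate $\hat{\lambda}_t$

10: Update $\hat{\lambda}_t$ using (8)

11: **end if**

12: Set $t \leftarrow t + 1$ and go to step 2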
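The steps of the Algorithm can be sketched in Python as follows. This is an illustrative reading of the pseudo-code, not the authors' implementation: the closed-form threshold 1 − α/2 assumes the exponential factor is treated as known under the null hypothesis, and the function name, the default α = 0.05, the starting values and the sample incidence rates are all hypothetical.

```python
import math


def sequential_detector(observations, rate0, n0=1, alpha=0.05):
    """Flag each observation as epidemic (True) or non-epidemic (False).

    Returns the list of flags plus the final estimate of the exponential
    factor and the number of non-epidemic weeks used to compute it.
    """
    threshold = 1.0 - alpha / 2.0   # assumed single-observation KS threshold
    rate, n = rate0, n0
    flags = []
    for x_t in observations:
        cdf = 1.0 - math.exp(-rate * x_t)
        d_t = max(cdf, 1.0 - cdf)   # single-observation KS statistic
        if d_t > threshold:
            flags.append(True)      # reject the null: flu detected, keep estimate
        else:
            flags.append(False)     # accept the null: update estimate recursively
            rate = (n + 1) * rate / (n + rate * x_t)
            n += 1
    return flags, rate, n


# Hypothetical weekly incidence rates per 10^5 population.
weekly = [3.0, 5.0, 2.0, 60.0, 150.0, 80.0, 4.0]
flags, rate, n = sequential_detector(weekly, rate0=0.25)
print(flags)
print(rate, n)
```

Only the current estimate and the count of non-epidemic weeks are carried from one week to the next, which reflects the low storage cost stressed in the text.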

Results and discussion

We used the open database described earlier to test the detector. In this experiment we also validated the proposed ML data fitting, which characterizes non-epidemic cases as exponentially distributed. At the time of writing, data were available from the 2008-2009 to the 2011-2012 seasons.

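The parameter $\lambda_0$ was estimated from a training sequence taken from the 2008-2009 season, in which weekly incidence rates per 10^{5} population above 20 were not considered. If not otherwise stated, we used 20/10^{5} to train the method, although later in this section we provide a sensitivity analysis with respect to this threshold. The resulting exponential factor was $\lambda_0 = 3.8617$.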

We tested the observation corresponding to each week with the proposed detector, using the threshold of 20/10^{5} to estimate $\lambda_0$.

**Weekly influenza-detection results for 4 influenza seasons along with recorded data.** Incidence rates per 10^{5} inhabitants (solid blue) and the detected influenza outbreaks output (red crosses).

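Sensitivity to a number of thresholds on the first season to estimate $\lambda_0$. Entries are the epidemiological weeks in which each detected epidemic starts and ends.

| **Threshold** | **2008-2009** Start | **2008-2009** End | **2009-2010** Start | **2009-2010** End | **2010-2011** Start | **2010-2011** End | **2011-2012** Start | **2011-2012** End |
|---|---|---|---|---|---|---|---|---|
| 10/10^{5} | 50 | 11 | 38 | 50 | 50 | 9 | 3 | 12 |
| 20/10^{5} | 50 | 10 | 38 | 50 | 50 | 9 | 3 | 12 |
| 30/10^{5} | 51 | 8 | 38 | 50 | 50 | 9 | 3 | 12 |
| 40/10^{5} | 52 | 7 | 38 | 50 | 50 | 9 | 3 | 12 |
| 50/10^{5} | 52 | 7 | 38 | 50 | 50 | 9 | 3 | 12 |
| **Sentinel network** | 51 | 8 | 41 | 51 | 51 | 9 | 4 | 12 |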
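The start weeks in the table above allow a quick check of how early the detector signals each epidemic relative to the sentinel network. The Python sketch below uses the 20/10^{5} row and the sentinel row; the variable names are chosen here for illustration, and a plain subtraction is enough because none of these start-week pairs straddles the turn of the year.

```python
# Epidemic start weeks from the table: detector trained with the 20/10^5
# threshold versus the sentinel network.
detector_start = {"2008-2009": 50, "2009-2010": 38, "2010-2011": 50, "2011-2012": 3}
sentinel_start = {"2008-2009": 51, "2009-2010": 41, "2010-2011": 51, "2011-2012": 4}

# Lead of the detector over the sentinel network, in weeks, per season.
lead = {s: sentinel_start[s] - detector_start[s] for s in detector_start}

# Average lead excluding the 2009-2010 pandemic season.
non_pandemic = [weeks for season, weeks in lead.items() if season != "2009-2010"]
average_lead = sum(non_pandemic) / len(non_pandemic)

print(lead)
print(average_lead)
```

Excluding the pandemic season, the detector leads by one week in each of the three remaining seasons; in the pandemic season the lead is three weeks.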

It is important to notice that in the 2009-2010 influenza season the A(H1N1) influenza virus pandemic occurred during autumn in Catalonia, a fact that caused a different temporal pattern of influenza epidemics.

**Estimation of the exponential factor λ.**

In order to provide more insight into the behavior of the detector, the figure below shows the sequence of $p$-values obtained by the test.

**Sequence of p-values.**

Indeed, the proposed method can be used in general for the detection of other infectious diseases whose statistical characterization is available. If non-epidemic periods can be characterized by an exponential distribution, the usage of the method is straightforward. Otherwise, if another distribution fits the data better, the method needs slight modifications in order to compare that distribution with the EDF and to update the required parameters of the PDF. A limitation of the method is related to the data gathering process, which has to record observations continuously. Since the method relies on non-epidemic data to estimate the distributional parameters and to assess whether the disease is active or not, some systems like sentinels cannot straightforwardly benefit from this tool. Recall, for instance, that sentinels are likely to stop recording data in typically inactive periods of the disease, and these gaps have an impact on the quality of the estimated distribution of non-epidemic data.

Conclusions

In this paper we proposed an automated method to detect influenza outbreaks from periodically recorded incidence rates. In contrast to setting yearly predefined thresholds to determine influenza outbreaks by data inspection, we presented a detector based on the statistical properties of non-epidemic data. The method can be useful to complement traditional surveillance methods. The algorithm provides a binary signal indicating epidemic activity as well as a quantitative measure (i.e., the $p$-value) of the confidence in that outcome.

Timeliness is generally defined as the difference between the time an event occurs and the time the reference standard for that event occurs. Diagnosticat data is available the instant the epidemiological week has ended, and is thus published four days ahead of the data from the sentinel-network-based system in Catalonia. Additionally, the proposed sequential detection method signals the event 1 week earlier on average for the tested data (excluding the 2009-2010 A(H1N1) pandemic season).

Endnote

^{a} Note that in this work we use

Abbreviations

CDF: Cumulative Density Function; CLT: Central Limit Theorem; CTTC: Centre Tecnològic de Telecomunicacions de Catalunya; eCAP: Electronic health record in Catalonia; EDF: Empirical Density Function; EW: Epidemiological week; ICS: Catalan Institute of Health; KS: Kolmogorov-Smirnov; ML: Maximum Likelihood; PDF: Probability Density Function; SISAP: Information Systems for Primary Care Services; WHO: World Health Organization.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

PC designed the algorithm, performed the analysis and drafted the manuscript. EC and LM provided the data and helped draft the manuscript. All authors have read and approved the final manuscript.

Acknowledgements

PC has been partially supported by the European Commission in the COST Action IC0803 (RFCSET).

Pre-publication history

The pre-publication history for this paper can be accessed here: