Excess count detection for epidemiological time series
Benedikt Zacher
2025-06-02
excode.RmdExcess count detection (excode) in epidemiological time series is an important part of public health surveillance systems. A variety of algorithms has been developed to identify events such as disease outbreaks or excess mortality. To this end, time series are analysed to detect unusually large (case) counts, that exceed what is normally expected in a certain time period and geographical region. The normal expectancy of cases in a current time period is usually calculated based on historic data. Depending on the time series of interest, the following features need to be taken into account by a model:
- Seasonal patterns: Many epidemiological times series that are of public health interest show periodic changes in cases depending on seasons or other calendar periods.
- Long-term time trends: The time series may show a long-term increase or decrease in case counts.
- Historic events: Events such as disease outbreaks may have caused an excess of case counts in historic data that is used for model estimation. This needs to be considered to avoid overestimation of the normal expectancy for the current time period.
The excode package provides a flexible framework that implements well established approaches to control for seasonality, long-term trends and historic events, but also allows the use of customized models. The user can choose between the Poisson and the Negative Binomial distribution, which are the most commonly used probability distributions for modeling count data. By combining hidden Markov models and generalized linear models, excode explicitly models normally expected case counts and expected excess case counts, i.e. each time point in a time series is labeled either as a normal state or as an excess state. A short overview of the statistical model used can be found at the end of this document.
Overwiew of available models in excode
There are five different model classes implemented in excode:
- Mean: This model does not account for seasonal or cyclical patterns. The expectancy of normal and excess case counts is modeled as the mean of the historic data.
- FarringtonNoufaily: This is a popular algorithm, which models seasonality by segmenting a year into a predefined number of shorter time periods, where each has a different expectancy of observed cases. It also allows integration of a long-term time trend.
- Harmonic: The ‘Harmonic’ model uses sine and cosine function to account for seasonal or cyclical patterns in the data. It also allows integration of long-term time trends.
- Custom: The ‘Custom’ model allows the user define their own model by providing a data frame containing covariate data.
All of the above mentioned models use two states - a ‘normal’ and an ‘excess’ state - to detect excess counts. However in some cases it might be necessary to allow multiple states, such as ‘normal’, ‘low excess’, ‘medium excess’, ‘high excess’, to account for a wide range of excess counts and detect all periods showing an excess in a time series.
- MultiState: The ‘MultiState’ model allows the use of multiple states. The number of states can be specified by the user. Like in the ‘Custom’ model, the user must provide covariate data to control for possible seasonal or long-term time trends. The use of this model is more advanced than the other models and requires some additional steps before fitting.
Figure 1 shows examples for the ‘Mean’, ‘Harmonic’ and ‘FarringtonNoufaily’ model.

Figure 1: Examples for the ‘Mean’, ‘Harmonic’ and ‘FarringtonNoufaily’ model.
Using excode for excess count detection
library(excode)The input data is a timeseries containing a date and an observed
number of cases. It can be a data.frame, an
sts object from the surveillance
R package or a list of sts objects.
The input dataset needs to contain the following variables:
-
date: The date of aggregation.
- observed: The observed number of cases.
- id: The name or id of the time series.
Optional variables can be:
- offset (optional): This could be e.g. the susceptible population.
- state (optional): If there are events, such as disease outbreaks, that caused excess counts or if there are time periods that are known to have normal case counts, these can be indicated here (e.g. 0=normal, 1=excess, NA=unknown).
The example data shadar_df in this section contains the
number of weekly cases of Salmonella hadar from 2001 until 2006
in Germany (taken from the surveillance
package). This dataset does not contain an offset
and state variable as we do not adjust for population
size and the true disease outbreak labels are not included.
## Load data an show first 6 rows
data("shadar_df")
kable(as_tibble(shadar_df[1:6, ])) %>%
kable_styling(font_size = 11)| date | observed | id |
|---|---|---|
| 2001-01-08 | 1 | S. Hadar Germany |
| 2001-01-15 | 6 | S. Hadar Germany |
| 2001-01-22 | 3 | S. Hadar Germany |
| 2001-01-29 | 3 | S. Hadar Germany |
| 2001-02-05 | 15 | S. Hadar Germany |
| 2001-02-12 | 5 | S. Hadar Germany |
Figure 2 shows that there is a cyclical pattern of case counts, with more cases during the summer and fewer cases cases during the winter There is also a decrease of reported cases over the years, except for the year 2006, where an outbreak lead to a surge in case numbers compared to previous years.
# Plot the time series
ggplot(shadar_df, aes(x = date, y = observed)) +
geom_col(color = "grey", fill = "lightgrey") +
theme_minimal() +
ylab("No. of cases") +
xlab("Week of notification") +
scale_x_date(date_labels = "%Y", date_breaks = "year")
Figure 2: Weekly Salmonella Newport cases.
Before running the excess count detection, the user needs to decide
which probability distribution and which model should be used. This is
done using the excodeFamily and excodeFormula
objects of the excode package. Both are then combined into an
excodeModel object, which stores all necessary parameters
and specifications for model fitting. In this example, the ‘Poisson’
distribution and - due to the cyclical and long-term time trend - the
‘Harmonic’ model is chosen.
# Use a 'Poisson' distribution in excodeFamily
excode_family_pois <- excodeFamily("Poisson")
# Define the 'Harmonic' model in excodeFormula
excode_formula_har <- excodeFormula("Harmonic")
# Combine both in the final excodeModel object
excode_har_pois <- excodeModel(
excode_family_pois,
excode_formula_har
)Printing excode_har_pois gives some basic information
about the model specifcations and shows that the model is an inital
model, which stores results for 0 time points:
excode_har_pois## Inital excodeModel
## excodeFamily: Poisson
## excodeFormula: Harmonic
## No. of timepoints: 0
The run_excode() function then performs the excess count
detection. This function accepts different data types for the
surveillance timeseries (sts, data.frame, list of sts objects). The
output is a fitted excodeModel object -
result_shadar_har - that stores results for 87 time
points.
# Run excode on time points 209-295
result_shadar_har <- run_excode(
shadar_df,
excode_har_pois,
209:295
)
result_shadar_har## Fitted excodeModel
## excodeFamily: Poisson
## excodeFormula: Harmonic
## No. of timepoints: 87
Results can be extracted using the summary()
function:
summary_shadar_har <- summary(result_shadar_har)
kable(as_tibble(summary_shadar_har[82:87, ])) %>%
kable_styling(font_size = 11)| posterior | pval | date | timepoint | observed | mu0 | mu1 | id | BIC | AIC | posterior_ub | pval_ub |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.993 | 0.002 | 2006-07-24 | 290 | 10 | 3.27 | 11.1 | S. Hadar Germany | 1095 | 1012 | 7 | 8 |
| 1.000 | 0.000 | 2006-07-31 | 291 | 16 | 3.32 | 11.5 | S. Hadar Germany | 1099 | 1016 | 7 | 8 |
| 1.000 | 0.000 | 2006-08-07 | 292 | 21 | 3.39 | 12.1 | S. Hadar Germany | 1106 | 1023 | 7 | 8 |
| 0.929 | 0.019 | 2006-08-14 | 293 | 8 | 3.27 | 11.6 | S. Hadar Germany | 1107 | 1024 | 7 | 8 |
| 1.000 | 0.000 | 2006-08-21 | 294 | 12 | 2.63 | 12.7 | S. Hadar Germany | 1103 | 1020 | 7 | 7 |
| 0.691 | 0.049 | 2006-08-28 | 295 | 6 | 2.60 | 11.8 | S. Hadar Germany | 1107 | 1023 | 7 | 7 |
The result is a data.frame with the following
variables:
- posterior: Probability that there is an excess of case counts at the current time point.
- pval: P-value for the null hypothesis that there is no excess at the current timepoint.
- date: The event date (e.g. reporting date or date of disease onset).
- timepoint: Timepoint (as integer) in the time series.
- observed: Number of cases at the current time point.
- mu0: Number of expected cases under normal conditions.
- mu1: Number of expected cases under excess conditions.
- id: Name of the time series.
- BIC: Bayesian information criterion.
- AIC: Aikake information criterion.
-
posterior_ub: A count threshold based on the
Posterior probability of the model. If the observed counts exceed this
value, the time point is considered to have excess counts. The default
probability to calculate this threshold is 0.5 and can be changed via
the parameter
prob_thresholdinsummary(). . -
pval_ub: A count threshold based on the
background/normal model. If the observed counts exceed this value, the
time point is considered to have excess counts. The default p-value to
calculate this threshold is 0.01 and can be changed via the parameter
pval_thresholdinsummary().
Figure 3 shows the input data with the estimated alarm thresholds (posterior_ub and pval_ub):
ggplot(
shadar_df %>% filter(year(date) >= 2003),
aes(x = date, y = observed)
) +
geom_col(color = "grey", fill = "lightgrey") +
theme_minimal() +
ylab("No. of cases") +
xlab("Week of notification") +
scale_x_date(date_labels = "%Y", date_breaks = "year") +
geom_step(aes(x = date, y = posterior_ub, color = "upper bound (posterior)"),
data = summary_shadar_har, linewidth = 1
) +
geom_step(aes(x = date, y = pval_ub, color = "upper bound (pval)"),
data = summary_shadar_har, linewidth = 1
) +
geom_point(aes(x = date, y = ifelse(observed > pval_ub, observed, NA), color = "Alarm (pval)"),
data = summary_shadar_har, shape = 3
) +
geom_point(aes(x = date, y = ifelse(observed > posterior_ub, observed, NA), color = "Alarm (posterior)"),
data = summary_shadar_har, shape = 4
) +
scale_color_manual(values = c("black", "black", "gold", "tomato")) +
theme(legend.position = "top", legend.title = element_blank()) +
guides(color = guide_legend(ncol = 2))
Figure 3: Weekly Salmonella Hadar cases with alarm thresholds (upper bounds).
Models ‘Mean’ and ‘FarringtonNoufaily’ can be applied likewise.
The ‘Custom’ model
The ‘Custom’ model can be used to fit a model with covariate data
selected by the user. To give an example, data of SARS-CoV-2
infections in Berlin-Neukölln (Germany) from March-July 2020 will be
used (downloaded on 2024-10-31). Figure 4 shows that after an inital
peak of infections in late March/early April, the number of infections
dropped, showing only few cases in May, followed by a rise in June which
could be attributed to several local
outbreaks. The daily number of reported SARS-CoV-2 differs over the
course of a week, showing higher number of cases reported during
weekdays and lower number of cases during weekends. In the following a
‘Custom’ excodeModel is used, that models the mean number
of reported cases during the past 8 weeks on workdays and weekends, to
detect the observed local outbreaks.
To define the necessary covariate data for the model, a
data.frame - weekday - is created, indicating
whether a current day is a workday or weekend:
data(sarscov2_df)
num_wday <- wday(sarscov2_df$date, week_start = 1)
weekday <- data.frame(day = factor(ifelse(num_wday <= 5, "workday", "weekend"),
levels = c("workday", "weekend")
))This is then used to create the ‘Custom’ excodeModel
object. Note that timepoints_per_unit = 7 indicates that
the repeating time periods are weeks (by default this is 52, i.e. yearly
periodic data).
excode_formula_custom <- excodeFormula("Custom", data = weekday, timepoints_per_unit = 7)
excode_custom_pois <- excodeModel(
excode_family_pois,
excode_formula_custom
)After this, the model is fitted, taking into account 8 weeks of
historic data (time_units_back = 8) and excluding the past
two days for fitting
(past_timepoints_not_included = 2).
result_sarscov2_custom <- run_excode(sarscov2_df,
excode_custom_pois,
93:154,
time_units_back = 8,
past_timepoints_not_included = 2
)
summary_sarscov2_custom <- summary(result_sarscov2_custom)The results can be plotted as follows:
ggplot(
sarscov2_df %>%
filter(date <= max(summary_sarscov2_custom$date)),
aes(x = date, y = observed)
) +
geom_col(color = "grey", fill = "lightgrey") +
theme_minimal() +
ylab("No. of cases") +
xlab("Day of reporting") +
scale_x_date(date_labels = "%Y-%m", date_breaks = "month") +
geom_step(aes(x = date, y = posterior_ub, color = "upper bound (posterior)"),
data = summary_sarscov2_custom, linewidth = 1
) +
geom_step(aes(x = date, y = pval_ub, color = "upper bound (pval)"),
data = summary_sarscov2_custom, linewidth = 1
) +
geom_point(aes(x = date, y = ifelse(observed > pval_ub, observed, NA), color = "Alarm (pval)"),
data = summary_sarscov2_custom, shape = 3
) +
geom_point(aes(x = date, y = ifelse(observed > posterior_ub, observed, NA), color = "Alarm (posterior)"),
data = summary_sarscov2_custom, shape = 4
) +
scale_color_manual(values = c("black", "black", "gold", "tomato")) +
theme(legend.position = "top", legend.title = element_blank()) +
guides(color = guide_legend(ncol = 2))
Figure 4: Daily reported SARS-CoV-2 infections in Berlin-Neukölln with alarm thresholds (upper bounds).
The ‘MultiState’ model
The ‘MultiState’ model allows multiple states, which can be necessary if the use of only one excess state is not sufficient to model the data. This can be the case when the range of excess counts is very large. To illustrate application of the ‘MultiState’ model, weekly all-cause mortality data reported to the German Federal Statistical Office is used (Figure 5). The ‘MultiState’ model with four states is applied to detect weeks with excess mortality in the time series.
data("mort_df_germany")
ggplot(mort_df_germany) +
geom_line(aes(x = date, y = observed), linewidth = 1) +
theme_minimal() +
ylab("Number of deaths") +
xlab("Week of death")
Figure 5: Weekly all-cause mortaility in Germany.
Initialization of the ‘MultiState’ model
For models with only two states, excode automatically
initializes and fits the model. However, when using multiple states,
initialization of the GLM part of the model needs to be done manually.
In order to do so, initial estimates for the expected number of cases
for each state need to be provided. First the baseline number of deaths
is estimated, which roughly follows the EuroMOMO protocol.
The baseline fit uses sine and cosine functions to control for
seasonality as well as a linear time trend. To avoid overestimation of
the expected background mortality, only weeks 15 to 26 (spring) and 36
to 45 (autumn) are included to estimate the background model, since
these are time periods where no excess mortality is expected. All
necessary variables are stored in mort_df_germany:
kable(as_tibble(mort_df_germany[1:6, ])) %>%
kable_styling(font_size = 11)| date | observed | id | timepoint | sin52 | cos52 | bckg_week |
|---|---|---|---|---|---|---|
| 2016-01-03 | 17222 | All cause mortality Germany | 0 | 0.000 | 1.000 | FALSE |
| 2016-01-10 | 18469 | All cause mortality Germany | 1 | 0.121 | 0.993 | FALSE |
| 2016-01-17 | 18440 | All cause mortality Germany | 2 | 0.239 | 0.971 | FALSE |
| 2016-01-24 | 18627 | All cause mortality Germany | 3 | 0.355 | 0.935 | FALSE |
| 2016-01-31 | 18707 | All cause mortality Germany | 4 | 0.465 | 0.885 | FALSE |
| 2016-02-07 | 18493 | All cause mortality Germany | 5 | 0.568 | 0.823 | FALSE |
Using a negative binomial model, the baseline mortality is estimated as follows:
# Fit background model
formula_bckg <- "observed ~ sin52 + cos52 + timepoint"
mort_glm_bckg <- glm.nb(formula_bckg, mort_df_germany %>%
filter(bckg_week))
# Extract prediction for baseline mortality
mort_bckg_mu <- predict(mort_glm_bckg,
newdata = mort_df_germany,
type = "response"
)
# store size parameter of ngeative binomial distribution
theta <- mort_glm_bckg$thetaexcode models the mean of the excess states as multiplicative increase compared to the mean of the normal/background state. Initial estimates for the mean of each excess state are chosen as the 75%-, 85%-, and 95%-quantile of the relative increase of the observed number of deaths compared to the estimated baseline, which corresponds to a 6%, 10% and 21% increase:
# Relative increase of observed mortality compared to background
p_increase <- mort_df_germany$observed / mort_bckg_mu
# Initial values for multipicative increase of model states
qcut <- c(0.75, 0.85, 0.95)
state_increase <- c(1, quantile(p_increase, prob = qcut))
names(state_increase) <- c("mu0", "mu1", "mu2", "mu3")
state_increase## mu0 mu1 mu2 mu3
## 1.00 1.06 1.10 1.21
The initial estimates of the mean are calculated as follows:
# Initial estimates of expected mortality of all states
initial_mu <- mort_bckg_mu %*% t(state_increase)
kable(as_tibble(initial_mu[1:6, ])) %>%
kable_styling(font_size = 11)| mu0 | mu1 | mu2 | mu3 |
|---|---|---|---|
| 18059.8 | 19089.8 | 19881.9 | 21897.1 |
| 18162.4 | 19198.2 | 19994.9 | 22021.5 |
| 18244.6 | 19285.0 | 20085.3 | 22121.1 |
| 18304.8 | 19348.7 | 20151.6 | 22194.2 |
| 18342.2 | 19388.2 | 20192.8 | 22239.5 |
| 18356.1 | 19402.9 | 20208.0 | 22256.3 |
Now the model can be defined using excodeFormula(). In
this example, the estimated baseline (mu0) is kept fixed during model
fitting. To achieve this, it is included as offset
(offset = TRUE) in the estimation and the model is
estimated without intercept (intercept = FALSE):
nStates <- 4
excode_formula_multistate <- excodeFormula("MultiState",
nStates = nStates,
intercept = FALSE,
offset = TRUE
)
excode_family_nb <- excodeFamily("NegBinom", nb_size = theta)
excode_multistate_nb <- excodeModel(excode_family_nb,
excode_formula_multistate,
initial_mu = initial_mu
)The last step before fitting the excode model is to define
the input data.frame containg all necessary variables. Note
that the offset variable is set to the estimated baseline mortality
(mort_bckg_mu) and timepoints where no excess is assumed
are included as 0s in the state variable (NA for other
weeks).
Estimation and visualization of the ‘MultiState’ model
The ‘MultiState’ model can now be fitted using
run_excode() as usual:
res_excode_multistate_nb <- run_excode(mort_df_germany_input,
excode_multistate_nb,
nrow(mort_df_germany_input),
return_full_model = TRUE,
time_units_back = 10
)
res_mort_germany <- summary(res_excode_multistate_nb)The results and weekly excess probablities can be plottad as follows:
p1 <- ggplot(res_mort_germany %>% mutate(
obs_excess = ifelse(1 - posterior0 > 0.5, observed, NA),
group = "Data and model fit"
)) +
geom_line(aes(x = date, y = observed, color = "Week with excess probabilty<=0.5"),
linewidth = 1
) +
geom_line(aes(x = date, y = mu0, color = "mu0 (normal)"), linewidth = 0.75) +
geom_line(aes(x = date, y = mu1, color = "mu1 (low excess)"),
linewidth = 0.75, alpha = 0.75,
linetype = "dashed"
) +
geom_line(aes(x = date, y = mu2, color = "mu2 (medium excess)"),
linewidth = 0.75, alpha = 0.75,
linetype = "dashed"
) +
geom_line(aes(x = date, y = mu3, color = "mu3 (high excess)"),
linewidth = 0.75, alpha = 0.75,
linetype = "dashed"
) +
geom_point(aes(x = date, y = obs_excess, color = "Week with excess probabilty>0.5")) +
geom_line(aes(x = date, y = obs_excess, color = "Week with excess probabilty>0.5"), linewidth = 1) +
scale_color_manual(values = c("lightgreen", "gold", "orange", "tomato", "black", "cornflowerblue")) +
facet_wrap(~group, strip.position = "right") +
xlab("") +
ylab("Number of deaths") +
theme_bw() +
theme(
axis.text.y = element_text(angle = 90, vjust = 1, hjust = 0.5), legend.position = "top",
legend.title = element_blank(), legend.key.width = unit(1, "cm"),
plot.margin = unit(c(0, 0.5, 0, 0.5), "cm")
) +
guides(color = guide_legend(override.aes = list(linewidth = 1.2)))
p2 <- ggplot(res_mort_germany %>% mutate(excess_prob = 1 - posterior0, group = "Excess probability")) +
geom_ribbon(aes(x = date, ymin = 0, ymax = excess_prob),
color = "cornflowerblue", fill = "cornflowerblue",
alpha = 0.75
) +
geom_hline(yintercept = 0.5, linetype = "dotted", alpha = 0.5) +
facet_wrap(~group, strip.position = "right") +
xlab("Week of death") +
ylab("Excess probability") +
theme_bw() +
theme(axis.text.y = element_text(angle = 90, vjust = 1, hjust = 0.5)) +
scale_y_continuous(breaks = c(0, 0.5, 1)) +
theme(plot.margin = unit(c(0, 0.5, 0.5, 0.5), "cm"))
layout <- matrix(c(1, 1, 2), ncol = 1)
grid.arrange(grobs = list(p1, p2), nrow = 2, layout_matrix = layout)
Figure 6: Top: Weekly all-cause mortaility in Germany and expected mortality of the normal and three excess states are shown. Weeks with excess probabilty>0.5 are shown in blue. Bottom: Weekly excess probability.
A short overview of the statistical model
The excode package combines Generalized Linear Models (GLMs) with Hidden Markov Models (HMMs). The GLM part of the model allows to use regression models to model seasonal patterns and general trends in the time series. Besides seasonal patterns and general time trends, an algorithm for excess count detection needs to account for excess counts in the historic data used for model estimation. The HMM part of the statistical model used in excode allows to explicitly consider past disease outbreaks in the model. The HMM separates the time series in two (or more) states: the normal state and one (or multiple) excess state. In the normal state, observed case counts lie within the range of what is normally expected in a certain time period. In the excess state, case counts are higher than what is normally expected, i.e. there is a potential outbreak or excess mortality. This is in contrast most of other algorithms used for excess count detection, which either do not account for previous excess case counts or use some heuristic for down weighting past instances of excess counts.
har_model_pois <- run_excode(shadar_df, excode_har_pois, 285,
return_full_model = TRUE
)
plot(har_model_pois)
Figure 7: Components of an excode Harmonic model using a Poisson distribution.
The components of an example two state model are shown in Figure 7. Each position \(t\) in the time series of length \(T\) is in a state \(s_t \in \left\{ N,E\right\}, t\in \left[1;T\right]\) , where \(N\) represents the normal and \(E\) the excess state. The sequence of normal and excess states is modeled by the HMM using initial state and transition probabilites (left boxes in Figure 7):
- Initial state probabilities \(\Pr(s_{1}=N\bigr)\) and \(\Pr(s_{1}=E\bigr)\) give the probability that the first time point of the time series is in the respective state.
- Transition probabilities give the probability that time point \(t\) is in a specific state, given the state of the previous time point. For instance, \(\Pr\left(s_{t}=N\left|s_{t-1}=N\right.\right)\), \(t\in\left[1;T\right]\) gives the probability that the time series is in the normal state at time point \(t\), given that position \(t-1\) is also in the normal state. \(\Pr\left(s_{t}=E\left|s_{t-1}=N\right.\right)\) gives the probability to transition to the excess state at time point \(t\), given that \(t-1\) is in the normal state.
In the example shown in Figure 7, the case counts \(c_{t}, t\in\left[1;T\right]\) are assumed to follow a Poisson distribution: \(c_{t} \sim Pois \left(\mu_{t}\right)\). The expected case counts at position \(t\), \(\mu_t\) are modeled using a GLM using sine and cosine functions to control for seasonal patterns (which is one of the models available in excode):
\[ \log\mu_{t}=\beta_{0}+\beta_{1}t+ \beta_3 \cos(\frac{2\pi}{52}t) + \beta_4\sin(\frac{2\pi}{52}t) + \beta_{5}I\left(s_{t}=E\right) \]
Here, \(\beta_{0}\) is the intercept, \(\beta_{1}t\) a linear time trend, \(\beta_3\) and \(\beta_4\) model the seasonal patterns using \(\sin\) and \(\cos\) functions with weekly data (52 time points per year). The increase in cases is incorporated by \(\beta_{5}\).
Based on this model, excode calculates a probability that there is an excess of case counts at the current time point - the posterior probability (bottom middle box in Figure 7). Moreover, given the normally expected case counts in the model, a p-value for the null hypothesis, that there is no excess at the current timepoint is calculated (bottom right box in Figure 7).