Skip to contents

Excess count detection (excode) in epidemiological time series is an important part of public health surveillance systems. A variety of algorithms has been developed to identify events such as disease outbreaks or excess mortality. To this end, time series are analysed to detect unusually large (case) counts, that exceed what is normally expected in a certain time period and geographical region. The normal expectancy of cases in a current time period is usually calculated based on historic data. Depending on the time series of interest, the following features need to be taken into account by a model:

  • Seasonal patterns: Many epidemiological times series that are of public health interest show periodic changes in cases depending on seasons or other calendar periods.
  • Long-term time trends: The time series may show a long-term increase or decrease in case counts.
  • Historic events: Events such as disease outbreaks may have caused an excess of case counts in historic data that is used for model estimation. This needs to be considered to avoid overestimation of the normal expectancy for the current time period.

The excode package provides a flexible framework that implements well established approaches to control for seasonality, long-term trends and historic events, but also allows the use of customized models. The user can choose between the Poisson and the Negative Binomial distribution, which are the most commonly used probability distributions for modeling count data. By combining hidden Markov models and generalized linear models, excode explicitly models normally expected case counts and expected excess case counts, i.e. each time point in a time series is labeled either as a normal state or as an excess state. A short overview of the statistical model used can be found at the end of this document.

Overwiew of available models in excode

There are five different model classes implemented in excode:

  • Mean: This model does not account for seasonal or cyclical patterns. The expectancy of normal and excess case counts is modeled as the mean of the historic data.
  • FarringtonNoufaily: This is a popular algorithm, which models seasonality by segmenting a year into a predefined number of shorter time periods, where each has a different expectancy of observed cases. It also allows integration of a long-term time trend.
  • Harmonic: The ‘Harmonic’ model uses sine and cosine function to account for seasonal or cyclical patterns in the data. It also allows integration of long-term time trends.
  • Custom: The ‘Custom’ model allows the user define their own model by providing a data frame containing covariate data.

All of the above mentioned models use two states - a ‘normal’ and an ‘excess’ state - to detect excess counts. However in some cases it might be necessary to allow multiple states, such as ‘normal’, ‘low excess’, ‘medium excess’, ‘high excess’, to account for a wide range of excess counts and detect all periods showing an excess in a time series.

  • MultiState: The ‘MultiState’ model allows the use of multiple states. The number of states can be specified by the user. Like in the ‘Custom’ model, the user must provide covariate data to control for possible seasonal or long-term time trends. The use of this model is more advanced than the other models and requires some additional steps before fitting.

Figure 1 shows examples for the ‘Mean’, ‘Harmonic’ and ‘FarringtonNoufaily’ model.

**Figure 1:** Examples for the 'Mean', 'Harmonic' and 'FarringtonNoufaily' model.

Figure 1: Examples for the ‘Mean’, ‘Harmonic’ and ‘FarringtonNoufaily’ model.

Using excode for excess count detection

library(excode)

The input data is a timeseries containing a date and an observed number of cases. It can be a data.frame, an sts object from the surveillance R package or a list of sts objects.

The input dataset needs to contain the following variables:

  • date: The date of aggregation.
  • observed: The observed number of cases.
  • id: The name or id of the time series.

Optional variables can be:

  • offset (optional): This could be e.g. the susceptible population.
  • state (optional): If there are events, such as disease outbreaks, that caused excess counts or if there are time periods that are known to have normal case counts, these can be indicated here (e.g. 0=normal, 1=excess, NA=unknown).

The example data shadar_df in this section contains the number of weekly cases of Salmonella hadar from 2001 until 2006 in Germany (taken from the surveillance package). This dataset does not contain an offset and state variable as we do not adjust for population size and the true disease outbreak labels are not included.

## Load data an show first 6 rows
data("shadar_df")
kable(as_tibble(shadar_df[1:6, ])) %>%
  kable_styling(font_size = 11)
date observed id
2001-01-08 1 S. Hadar Germany
2001-01-15 6 S. Hadar Germany
2001-01-22 3 S. Hadar Germany
2001-01-29 3 S. Hadar Germany
2001-02-05 15 S. Hadar Germany
2001-02-12 5 S. Hadar Germany

Figure 2 shows that there is a cyclical pattern of case counts, with more cases during the summer and fewer cases cases during the winter There is also a decrease of reported cases over the years, except for the year 2006, where an outbreak lead to a surge in case numbers compared to previous years.

# Plot the time series
ggplot(shadar_df, aes(x = date, y = observed)) +
  geom_col(color = "grey", fill = "lightgrey") +
  theme_minimal() +
  ylab("No. of cases") +
  xlab("Week of notification") +
  scale_x_date(date_labels = "%Y", date_breaks = "year")
**Figure 2:** Weekly *Salmonella Newport* cases.

Figure 2: Weekly Salmonella Newport cases.

Before running the excess count detection, the user needs to decide which probability distribution and which model should be used. This is done using the excodeFamily and excodeFormula objects of the excode package. Both are then combined into an excodeModel object, which stores all necessary parameters and specifications for model fitting. In this example, the ‘Poisson’ distribution and - due to the cyclical and long-term time trend - the ‘Harmonic’ model is chosen.

# Use a 'Poisson' distribution in excodeFamily
excode_family_pois <- excodeFamily("Poisson")
# Define the 'Harmonic' model in excodeFormula
excode_formula_har <- excodeFormula("Harmonic")
# Combine both in the final excodeModel object
excode_har_pois <- excodeModel(
  excode_family_pois,
  excode_formula_har
)

Printing excode_har_pois gives some basic information about the model specifcations and shows that the model is an inital model, which stores results for 0 time points:

excode_har_pois
## Inital excodeModel
##  excodeFamily: Poisson
##  excodeFormula: Harmonic
## No. of timepoints: 0

The run_excode() function then performs the excess count detection. This function accepts different data types for the surveillance timeseries (sts, data.frame, list of sts objects). The output is a fitted excodeModel object - result_shadar_har - that stores results for 87 time points.

# Run excode on time points 209-295
result_shadar_har <- run_excode(
  shadar_df,
  excode_har_pois,
  209:295
)

result_shadar_har
## Fitted excodeModel
##  excodeFamily: Poisson
##  excodeFormula: Harmonic
## No. of timepoints: 87

Results can be extracted using the summary() function:

summary_shadar_har <- summary(result_shadar_har)
kable(as_tibble(summary_shadar_har[82:87, ])) %>%
  kable_styling(font_size = 11)
posterior pval date timepoint observed mu0 mu1 id BIC AIC posterior_ub pval_ub
0.993 0.002 2006-07-24 290 10 3.27 11.1 S. Hadar Germany 1095 1012 7 8
1.000 0.000 2006-07-31 291 16 3.32 11.5 S. Hadar Germany 1099 1016 7 8
1.000 0.000 2006-08-07 292 21 3.39 12.1 S. Hadar Germany 1106 1023 7 8
0.929 0.019 2006-08-14 293 8 3.27 11.6 S. Hadar Germany 1107 1024 7 8
1.000 0.000 2006-08-21 294 12 2.63 12.7 S. Hadar Germany 1103 1020 7 7
0.691 0.049 2006-08-28 295 6 2.60 11.8 S. Hadar Germany 1107 1023 7 7

The result is a data.frame with the following variables:

  • posterior: Probability that there is an excess of case counts at the current time point.
  • pval: P-value for the null hypothesis that there is no excess at the current timepoint.
  • date: The event date (e.g. reporting date or date of disease onset).
  • timepoint: Timepoint (as integer) in the time series.
  • observed: Number of cases at the current time point.
  • mu0: Number of expected cases under normal conditions.
  • mu1: Number of expected cases under excess conditions.
  • id: Name of the time series.
  • BIC: Bayesian information criterion.
  • AIC: Aikake information criterion.
  • posterior_ub: A count threshold based on the Posterior probability of the model. If the observed counts exceed this value, the time point is considered to have excess counts. The default probability to calculate this threshold is 0.5 and can be changed via the parameter prob_threshold in summary(). .
  • pval_ub: A count threshold based on the background/normal model. If the observed counts exceed this value, the time point is considered to have excess counts. The default p-value to calculate this threshold is 0.01 and can be changed via the parameter pval_threshold in summary().

Figure 3 shows the input data with the estimated alarm thresholds (posterior_ub and pval_ub):

ggplot(
  shadar_df %>% filter(year(date) >= 2003),
  aes(x = date, y = observed)
) +
  geom_col(color = "grey", fill = "lightgrey") +
  theme_minimal() +
  ylab("No. of cases") +
  xlab("Week of notification") +
  scale_x_date(date_labels = "%Y", date_breaks = "year") +
  geom_step(aes(x = date, y = posterior_ub, color = "upper bound (posterior)"),
    data = summary_shadar_har, linewidth = 1
  ) +
  geom_step(aes(x = date, y = pval_ub, color = "upper bound (pval)"),
    data = summary_shadar_har, linewidth = 1
  ) +
  geom_point(aes(x = date, y = ifelse(observed > pval_ub, observed, NA), color = "Alarm (pval)"),
    data = summary_shadar_har, shape = 3
  ) +
  geom_point(aes(x = date, y = ifelse(observed > posterior_ub, observed, NA), color = "Alarm (posterior)"),
    data = summary_shadar_har, shape = 4
  ) +
  scale_color_manual(values = c("black", "black", "gold", "tomato")) +
  theme(legend.position = "top", legend.title = element_blank()) +
  guides(color = guide_legend(ncol = 2))
**Figure 3:** Weekly *Salmonella Hadar* cases with alarm thresholds (upper bounds).

Figure 3: Weekly Salmonella Hadar cases with alarm thresholds (upper bounds).

Models ‘Mean’ and ‘FarringtonNoufaily’ can be applied likewise.

The ‘Custom’ model

The ‘Custom’ model can be used to fit a model with covariate data selected by the user. To give an example, data of SARS-CoV-2 infections in Berlin-Neukölln (Germany) from March-July 2020 will be used (downloaded on 2024-10-31). Figure 4 shows that after an inital peak of infections in late March/early April, the number of infections dropped, showing only few cases in May, followed by a rise in June which could be attributed to several local outbreaks. The daily number of reported SARS-CoV-2 differs over the course of a week, showing higher number of cases reported during weekdays and lower number of cases during weekends. In the following a ‘Custom’ excodeModel is used, that models the mean number of reported cases during the past 8 weeks on workdays and weekends, to detect the observed local outbreaks.

To define the necessary covariate data for the model, a data.frame - weekday - is created, indicating whether a current day is a workday or weekend:

data(sarscov2_df)
num_wday <- wday(sarscov2_df$date, week_start = 1)
weekday <- data.frame(day = factor(ifelse(num_wday <= 5, "workday", "weekend"),
  levels = c("workday", "weekend")
))

This is then used to create the ‘Custom’ excodeModel object. Note that timepoints_per_unit = 7 indicates that the repeating time periods are weeks (by default this is 52, i.e. yearly periodic data).

excode_formula_custom <- excodeFormula("Custom", data = weekday, timepoints_per_unit = 7)
excode_custom_pois <- excodeModel(
  excode_family_pois,
  excode_formula_custom
)

After this, the model is fitted, taking into account 8 weeks of historic data (time_units_back = 8) and excluding the past two days for fitting (past_timepoints_not_included = 2).

result_sarscov2_custom <- run_excode(sarscov2_df,
  excode_custom_pois,
  93:154,
  time_units_back = 8,
  past_timepoints_not_included = 2
)
summary_sarscov2_custom <- summary(result_sarscov2_custom)

The results can be plotted as follows:

ggplot(
  sarscov2_df %>%
    filter(date <= max(summary_sarscov2_custom$date)),
  aes(x = date, y = observed)
) +
  geom_col(color = "grey", fill = "lightgrey") +
  theme_minimal() +
  ylab("No. of cases") +
  xlab("Day of reporting") +
  scale_x_date(date_labels = "%Y-%m", date_breaks = "month") +
  geom_step(aes(x = date, y = posterior_ub, color = "upper bound (posterior)"),
    data = summary_sarscov2_custom, linewidth = 1
  ) +
  geom_step(aes(x = date, y = pval_ub, color = "upper bound (pval)"),
    data = summary_sarscov2_custom, linewidth = 1
  ) +
  geom_point(aes(x = date, y = ifelse(observed > pval_ub, observed, NA), color = "Alarm (pval)"),
    data = summary_sarscov2_custom, shape = 3
  ) +
  geom_point(aes(x = date, y = ifelse(observed > posterior_ub, observed, NA), color = "Alarm (posterior)"),
    data = summary_sarscov2_custom, shape = 4
  ) +
  scale_color_manual(values = c("black", "black", "gold", "tomato")) +
  theme(legend.position = "top", legend.title = element_blank()) +
  guides(color = guide_legend(ncol = 2))
**Figure 4:** Daily reported SARS-CoV-2 infections in Berlin-Neukölln with alarm thresholds (upper bounds).

Figure 4: Daily reported SARS-CoV-2 infections in Berlin-Neukölln with alarm thresholds (upper bounds).

The ‘MultiState’ model

The ‘MultiState’ model allows multiple states, which can be necessary if the use of only one excess state is not sufficient to model the data. This can be the case when the range of excess counts is very large. To illustrate application of the ‘MultiState’ model, weekly all-cause mortality data reported to the German Federal Statistical Office is used (Figure 5). The ‘MultiState’ model with four states is applied to detect weeks with excess mortality in the time series.

data("mort_df_germany")

ggplot(mort_df_germany) +
  geom_line(aes(x = date, y = observed), linewidth = 1) +
  theme_minimal() +
  ylab("Number of deaths") +
  xlab("Week of death")
**Figure 5:** Weekly all-cause mortaility in Germany.

Figure 5: Weekly all-cause mortaility in Germany.

Initialization of the ‘MultiState’ model

For models with only two states, excode automatically initializes and fits the model. However, when using multiple states, initialization of the GLM part of the model needs to be done manually. In order to do so, initial estimates for the expected number of cases for each state need to be provided. First the baseline number of deaths is estimated, which roughly follows the EuroMOMO protocol. The baseline fit uses sine and cosine functions to control for seasonality as well as a linear time trend. To avoid overestimation of the expected background mortality, only weeks 15 to 26 (spring) and 36 to 45 (autumn) are included to estimate the background model, since these are time periods where no excess mortality is expected. All necessary variables are stored in mort_df_germany:

kable(as_tibble(mort_df_germany[1:6, ])) %>%
  kable_styling(font_size = 11)
date observed id timepoint sin52 cos52 bckg_week
2016-01-03 17222 All cause mortality Germany 0 0.000 1.000 FALSE
2016-01-10 18469 All cause mortality Germany 1 0.121 0.993 FALSE
2016-01-17 18440 All cause mortality Germany 2 0.239 0.971 FALSE
2016-01-24 18627 All cause mortality Germany 3 0.355 0.935 FALSE
2016-01-31 18707 All cause mortality Germany 4 0.465 0.885 FALSE
2016-02-07 18493 All cause mortality Germany 5 0.568 0.823 FALSE

Using a negative binomial model, the baseline mortality is estimated as follows:

# Fit background model
formula_bckg <- "observed ~ sin52 + cos52 + timepoint"
mort_glm_bckg <- glm.nb(formula_bckg, mort_df_germany %>%
  filter(bckg_week))
# Extract prediction for baseline mortality
mort_bckg_mu <- predict(mort_glm_bckg,
  newdata = mort_df_germany,
  type = "response"
)
# store size parameter of ngeative binomial distribution
theta <- mort_glm_bckg$theta

excode models the mean of the excess states as multiplicative increase compared to the mean of the normal/background state. Initial estimates for the mean of each excess state are chosen as the 75%-, 85%-, and 95%-quantile of the relative increase of the observed number of deaths compared to the estimated baseline, which corresponds to a 6%, 10% and 21% increase:

# Relative increase of observed mortality compared to background
p_increase <- mort_df_germany$observed / mort_bckg_mu

# Initial values for multipicative increase of model states
qcut <- c(0.75, 0.85, 0.95)
state_increase <- c(1, quantile(p_increase, prob = qcut))
names(state_increase) <- c("mu0", "mu1", "mu2", "mu3")
state_increase
##  mu0  mu1  mu2  mu3 
## 1.00 1.06 1.10 1.21

The initial estimates of the mean are calculated as follows:

# Initial estimates of expected mortality of all states
initial_mu <- mort_bckg_mu %*% t(state_increase)
kable(as_tibble(initial_mu[1:6, ])) %>%
  kable_styling(font_size = 11)
mu0 mu1 mu2 mu3
18059.8 19089.8 19881.9 21897.1
18162.4 19198.2 19994.9 22021.5
18244.6 19285.0 20085.3 22121.1
18304.8 19348.7 20151.6 22194.2
18342.2 19388.2 20192.8 22239.5
18356.1 19402.9 20208.0 22256.3

Now the model can be defined using excodeFormula(). In this example, the estimated baseline (mu0) is kept fixed during model fitting. To achieve this, it is included as offset (offset = TRUE) in the estimation and the model is estimated without intercept (intercept = FALSE):

nStates <- 4
excode_formula_multistate <- excodeFormula("MultiState",
  nStates = nStates,
  intercept = FALSE,
  offset = TRUE
)
excode_family_nb <- excodeFamily("NegBinom", nb_size = theta)
excode_multistate_nb <- excodeModel(excode_family_nb,
  excode_formula_multistate,
  initial_mu = initial_mu
)

The last step before fitting the excode model is to define the input data.frame containg all necessary variables. Note that the offset variable is set to the estimated baseline mortality (mort_bckg_mu) and timepoints where no excess is assumed are included as 0s in the state variable (NA for other weeks).

# Create input data.frame
mort_df_germany_input <- mort_df_germany %>%
  mutate(
    state = ifelse(bckg_week, 0, NA),
    offset = mort_bckg_mu
  ) %>%
  dplyr::select(date, observed, offset, id, state)

Estimation and visualization of the ‘MultiState’ model

The ‘MultiState’ model can now be fitted using run_excode() as usual:

res_excode_multistate_nb <- run_excode(mort_df_germany_input,
  excode_multistate_nb,
  nrow(mort_df_germany_input),
  return_full_model = TRUE,
  time_units_back = 10
)
res_mort_germany <- summary(res_excode_multistate_nb)

The results and weekly excess probablities can be plottad as follows:

p1 <- ggplot(res_mort_germany %>% mutate(
  obs_excess = ifelse(1 - posterior0 > 0.5, observed, NA),
  group = "Data and model fit"
)) +
  geom_line(aes(x = date, y = observed, color = "Week with excess probabilty<=0.5"),
    linewidth = 1
  ) +
  geom_line(aes(x = date, y = mu0, color = "mu0 (normal)"), linewidth = 0.75) +
  geom_line(aes(x = date, y = mu1, color = "mu1 (low excess)"),
    linewidth = 0.75, alpha = 0.75,
    linetype = "dashed"
  ) +
  geom_line(aes(x = date, y = mu2, color = "mu2 (medium excess)"),
    linewidth = 0.75, alpha = 0.75,
    linetype = "dashed"
  ) +
  geom_line(aes(x = date, y = mu3, color = "mu3 (high excess)"),
    linewidth = 0.75, alpha = 0.75,
    linetype = "dashed"
  ) +
  geom_point(aes(x = date, y = obs_excess, color = "Week with excess probabilty>0.5")) +
  geom_line(aes(x = date, y = obs_excess, color = "Week with excess probabilty>0.5"), linewidth = 1) +
  scale_color_manual(values = c("lightgreen", "gold", "orange", "tomato", "black", "cornflowerblue")) +
  facet_wrap(~group, strip.position = "right") +
  xlab("") +
  ylab("Number of deaths") +
  theme_bw() +
  theme(
    axis.text.y = element_text(angle = 90, vjust = 1, hjust = 0.5), legend.position = "top",
    legend.title = element_blank(), legend.key.width = unit(1, "cm"),
    plot.margin = unit(c(0, 0.5, 0, 0.5), "cm")
  ) +
  guides(color = guide_legend(override.aes = list(linewidth = 1.2)))

p2 <- ggplot(res_mort_germany %>% mutate(excess_prob = 1 - posterior0, group = "Excess probability")) +
  geom_ribbon(aes(x = date, ymin = 0, ymax = excess_prob),
    color = "cornflowerblue", fill = "cornflowerblue",
    alpha = 0.75
  ) +
  geom_hline(yintercept = 0.5, linetype = "dotted", alpha = 0.5) +
  facet_wrap(~group, strip.position = "right") +
  xlab("Week of death") +
  ylab("Excess probability") +
  theme_bw() +
  theme(axis.text.y = element_text(angle = 90, vjust = 1, hjust = 0.5)) +
  scale_y_continuous(breaks = c(0, 0.5, 1)) +
  theme(plot.margin = unit(c(0, 0.5, 0.5, 0.5), "cm"))

layout <- matrix(c(1, 1, 2), ncol = 1)
grid.arrange(grobs = list(p1, p2), nrow = 2, layout_matrix = layout)
**Figure 6:** *Top*: Weekly all-cause mortaility in Germany and expected mortality of the normal and three excess states are shown. Weeks with excess probabilty>0.5 are shown in blue. *Bottom*: Weekly excess probability.

Figure 6: Top: Weekly all-cause mortaility in Germany and expected mortality of the normal and three excess states are shown. Weeks with excess probabilty>0.5 are shown in blue. Bottom: Weekly excess probability.

A short overview of the statistical model

The excode package combines Generalized Linear Models (GLMs) with Hidden Markov Models (HMMs). The GLM part of the model allows to use regression models to model seasonal patterns and general trends in the time series. Besides seasonal patterns and general time trends, an algorithm for excess count detection needs to account for excess counts in the historic data used for model estimation. The HMM part of the statistical model used in excode allows to explicitly consider past disease outbreaks in the model. The HMM separates the time series in two (or more) states: the normal state and one (or multiple) excess state. In the normal state, observed case counts lie within the range of what is normally expected in a certain time period. In the excess state, case counts are higher than what is normally expected, i.e. there is a potential outbreak or excess mortality. This is in contrast most of other algorithms used for excess count detection, which either do not account for previous excess case counts or use some heuristic for down weighting past instances of excess counts.

har_model_pois <- run_excode(shadar_df, excode_har_pois, 285,
  return_full_model = TRUE
)
plot(har_model_pois)
**Figure 7:** Components of an *excode* Harmonic model using a Poisson distribution.

Figure 7: Components of an excode Harmonic model using a Poisson distribution.

The components of an example two state model are shown in Figure 7. Each position \(t\) in the time series of length \(T\) is in a state \(s_t \in \left\{ N,E\right\}, t\in \left[1;T\right]\) , where \(N\) represents the normal and \(E\) the excess state. The sequence of normal and excess states is modeled by the HMM using initial state and transition probabilites (left boxes in Figure 7):

  • Initial state probabilities \(\Pr(s_{1}=N\bigr)\) and \(\Pr(s_{1}=E\bigr)\) give the probability that the first time point of the time series is in the respective state.
  • Transition probabilities give the probability that time point \(t\) is in a specific state, given the state of the previous time point. For instance, \(\Pr\left(s_{t}=N\left|s_{t-1}=N\right.\right)\), \(t\in\left[1;T\right]\) gives the probability that the time series is in the normal state at time point \(t\), given that position \(t-1\) is also in the normal state. \(\Pr\left(s_{t}=E\left|s_{t-1}=N\right.\right)\) gives the probability to transition to the excess state at time point \(t\), given that \(t-1\) is in the normal state.

In the example shown in Figure 7, the case counts \(c_{t}, t\in\left[1;T\right]\) are assumed to follow a Poisson distribution: \(c_{t} \sim Pois \left(\mu_{t}\right)\). The expected case counts at position \(t\), \(\mu_t\) are modeled using a GLM using sine and cosine functions to control for seasonal patterns (which is one of the models available in excode):

\[ \log\mu_{t}=\beta_{0}+\beta_{1}t+ \beta_3 \cos(\frac{2\pi}{52}t) + \beta_4\sin(\frac{2\pi}{52}t) + \beta_{5}I\left(s_{t}=E\right) \]

Here, \(\beta_{0}\) is the intercept, \(\beta_{1}t\) a linear time trend, \(\beta_3\) and \(\beta_4\) model the seasonal patterns using \(\sin\) and \(\cos\) functions with weekly data (52 time points per year). The increase in cases is incorporated by \(\beta_{5}\).

Based on this model, excode calculates a probability that there is an excess of case counts at the current time point - the posterior probability (bottom middle box in Figure 7). Moreover, given the normally expected case counts in the model, a p-value for the null hypothesis, that there is no excess at the current timepoint is calculated (bottom right box in Figure 7).