These are examples of using Bayesian statistics to infer the impact of an intervention whose effect follows one of several patterns mentioned in 9.2 Intervention Analysis:

The calculations are done on PAWS, a JupyterHub instance hosted on Wikimedia Cloud Services.

**Update (29 Oct 2018)**: I've published the report/data analysis which motivated me to put together these notes.

For any ARMA model of observed series $y_t$ with change-due-to-intervention $z_t$ and error series $\epsilon_t$:

$$ y_t = z_t + \frac{\Theta(B)}{\Phi(B)} \epsilon_t $$

where $\Theta(B)$ is the moving average (MA) polynomial and $\Phi(B)$ is the autoregressive (AR) polynomial.

In all of these, $z_t$ will be some function of an indicator variable $I_t$ and intervention effect $\delta_0$.

To keep these toy examples simple and easy to follow (at least with respect to simulation of data and performing the inference), we will use a stationary ARMA(1,1) model:

$$ y_t = z_t + \phi_1 y_{t-1} + \theta_1 \epsilon_{t-1} + \epsilon_t $$

If you're seeing this time series stuff for the first time and are interested in learning more, I recommend the following resources:

- Forecasting: Principles and Practice by Rob J Hyndman and George Athanasopoulos (available online in its entirety)
- Time Series: Modelling, Computation and Inference by Raquel Prado and Mike West
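By the way, the ARMA(1,1) component above (without any intervention) can also be simulated directly with base R's `stats::arima.sim`, which is a handy way to get a feel for how the AR and MA coefficients shape a series. A quick sketch using the coefficient values we'll use throughout these notes:

```r
# Simulate a plain ARMA(1,1) series with no intervention, using the same
# coefficients as the rest of these notes: phi1 = -0.7 (AR), theta1 = 0.6 (MA),
# and error standard deviation sigma = 2:
set.seed(42)
y_plain <- arima.sim(model = list(ar = -0.7, ma = 0.6), n = 200, sd = 2)
length(y_plain) # 200 observations, hovering around 0 since there is no intervention
```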

We are primarily interested in inference on the effect of the intervention $\delta_0$. We will use the Stan probabilistic programming language to accomplish this, but others like JAGS, PyMC3, Edward2 / TensorFlow Probability, greta, Pyro, and Turing.jl can be used as well.

If you're seeing this Bayesian stuff for the first time and are interested in learning more, I recommend the following resources:

- Statistical Rethinking by Richard McElreath
- Bayesian Data Analysis by Andrew Gelman, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin
- Bayesian Methods for Hackers by Cam Davidson-Pilon (mostly available online as Jupyter notebooks)
- Bayesian Basics by Michael Clark

We're using RStan to interface with Stan in R but all of this can also be done in Python with PyStan or in Julia with Stan.jl.

In [1]:

```
library(magrittr) # piping
library(zeallot) # multi-assignment
library(tictoc) # timing
library(latex2exp) # TeX to R expressions
suppressPackageStartupMessages(library(rstan))
```

In [2]:

```
rstan_options(auto_write = TRUE)
options(
  mc.cores = parallel::detectCores(),
  repr.plot.width = 8, repr.plot.height = 4,
  digits = 3
)
```

We will simulate $N = 200$ days of observations with the intervention occurring at $T = 100$. The intervention effect is $\delta_0 = 10$ in all cases. In cases where there is a gradual increase/decrease, $\omega_1 = 0.9$.

The errors are i.i.d. from a Normal distribution ($\mu = 0, \sigma = 2$), and $\theta_1 = 0.6$ and $\phi_1 = -0.7$ as the MA(1) and AR(1) coefficients, respectively:

In [3]:

```
c(N, T, theta1, phi1, delta0, sigma, omega1) %<-% c(200, 100, 0.6, -0.7, 10, 2, 0.9)

# Helper function to simulate an ARMA(1,1) series which includes changes from intervention:
arma11_sim <- function(z) {
  # Setup:
  y <- numeric(0)
  set.seed(0)
  y[1] <- rnorm(1, 0, 5)
  # Stochastic component:
  errors <- c(0, rnorm(N - 1, 0, sd = sigma))
  # Deterministic component:
  for (t in 2:N) y[t] <- z[t] + phi1 * y[t - 1] + theta1 * errors[t - 1] + errors[t]
  # Output:
  return(y)
}

# Helper functions to plot time series and visualize the change from intervention:
plot_change <- function(z, highlight_idx) {
  plot(z, type = "l", ylab = expression(z[t]), xlab = expression(t), main = "Change due to intervention")
  highlight_idx <- c(highlight_idx[1] - 1, highlight_idx, highlight_idx[length(highlight_idx)] + 1)
  lines(highlight_idx, z[highlight_idx], col = "red")
}
plot_ts <- function(y, z, highlight_idx) {
  par(mfrow = c(1, 2))
  plot(y, type = "l", ylab = expression(y[t]), xlab = expression(t), main = "Observed (simulated) series")
  plot_change(z, highlight_idx)
}

# Helper function to compare point estimates & credible intervals from posterior to the truth:
compare <- function(stan_fit, actual, ...) {
  fit_summary <- stan_fit %>%
    broom::tidyMCMC(estimate.method = "median", conf.int = TRUE, conf.method = "HPDinterval", ...) %>%
    dplyr::mutate(conf.95 = sprintf("(%0.2f, %0.2f)", conf.low, conf.high)) %>%
    dplyr::select(-c(conf.low, conf.high))
  return(cbind(fit_summary, actual = actual))
}
```

In [4]:

```
I_1 <- (1:N) >= T
z_1 <- delta0 * I_1
y_1 <- arma11_sim(z_1)
plot_ts(y_1, z_1, which(I_1))
```

The Stan manual authors recommend constraining the coefficients to enforce the estimation of a *stationary* ARMA process, so that's what we're going to do here. They caution against doing this in practice, however, in case the data is not stationary; in that situation they suggest encouraging stationary parameter estimates with a prior favoring coefficient values near zero.
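As a quick aside (a sketch, not part of the original analysis): the simulated ARMA(1,1) here is stationary and invertible because both coefficients lie inside the unit interval, which is equivalent to the roots of the AR and MA polynomials lying outside the unit circle:

```r
phi1 <- -0.7   # AR(1) coefficient used in these notes
theta1 <- 0.6  # MA(1) coefficient
# For ARMA(1,1), Phi(B) = 1 - phi1 * B and Theta(B) = 1 + theta1 * B:
ar_root <- 1 / phi1    # root of Phi(B): |B| = 1/0.7 ~ 1.43
ma_root <- -1 / theta1 # root of Theta(B): |B| = 1/0.6 ~ 1.67
abs(ar_root) > 1 # TRUE: the process is stationary
abs(ma_root) > 1 # TRUE: the process is invertible
```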

In [5]:

```
model_1 <- stan_model(model_name = "pattern_1", model_code = "
data {
  int<lower = 1> T; // known time of intervention
  int<lower = 1> N; // number observations in series
  real y[N];        // observations
}
parameters {
  real<lower = -1, upper = 1> phi1;
  real<lower = -1, upper = 1> theta1;
  real delta0;
  real<lower = 0> sigma;
}
model {
  vector[N] nu;      // prediction at time t
  vector[N] z;       // effect of intervention at t
  vector[N] epsilon; // diff between pred & obs at t
  // priors:
  phi1 ~ normal(0, 1);
  theta1 ~ normal(0, 1);
  delta0 ~ normal(0, 5);
  sigma ~ cauchy(0, 5);
  // likelihood:
  z[1] = 0;
  epsilon[1] = 0;
  for (t in 2:N) {
    z[t] = delta0 * (t >= T ? 1 : 0);
    nu[t] = (phi1 * y[t - 1]) + (theta1 * epsilon[t - 1]) + z[t];
    epsilon[t] = y[t] - nu[t];
  }
  epsilon ~ normal(0, sigma);
}")
```

Once we have a model we can do Bayesian inference in one of two ways:

- full inference with MCMC sampling via NUTS/HMC, which can be very slow but converges to the actual posterior distribution
- approximate inference with variational inference via ADVI, which is so fast you can use it to train deep probabilistic neural networks, but only yields an approximation to the posterior

Let's see how the results differ and how long they take to get:

In [6]:

```
tic()
fit_1a <- vb(model_1, data = list(T = T, N = N, y = y_1))
toc()
compare(fit_1a, c(phi1, theta1, delta0, sigma), pars = c("phi1", "theta1", "delta0", "sigma"))
```

In [7]:

```
tic()
fit_1b <- sampling(model_1, data = list(T = T, N = N, y = y_1))
toc()
compare(fit_1b, c(phi1, theta1, delta0, sigma), pars = c("phi1", "theta1", "delta0", "sigma"))
```

Pretty good results in way less time!

The original draft of this notebook only had MCMC; it wasn't until right before I was ready to post it that I thought "let's just try VI and see what happens". Since it's cheap to run in addition to the full inference, I've included it in the other patterns too to show its similarities, differences, advantages, and shortcomings. Approximate inference with (AD)VI doesn't seem to be as well known among Bayesian statisticians and data scientists as MCMC-powered full inference, so hopefully this can serve as a friendly introduction to it.

When the change from the intervention lasts $d$ days, $I_t = 1$ for $T \leq t \leq T + d$ and $I_t = 0$ for all other $t$.

$$ z_t = \delta_0 I_t $$

In [8]:

```
d <- 18
I_2 <- ((1:N) >= T) & ((1:N) <= (T + d))
z_2 <- delta0 * I_2
y_2 <- arma11_sim(z_2)
plot_ts(y_2, z_2, which(I_2))
```

In [9]:

```
model_2 <- stan_model(model_name = "pattern_2", model_code = "
data {
  int<lower = 1> T; // known time of intervention
  int<lower = 1> d; // days the change from intervention lasts
  int<lower = 1> N; // number observations in series
  real y[N];        // observations
}
parameters {
  real<lower = -1, upper = 1> phi1;
  real<lower = -1, upper = 1> theta1;
  real delta0;
  real<lower = 0> sigma;
}
model {
  vector[N] nu;      // prediction at time t
  vector[N] z;       // effect of intervention at t
  vector[N] epsilon; // diff between pred & obs at t
  // priors:
  phi1 ~ normal(0, 1);
  theta1 ~ normal(0, 1);
  delta0 ~ normal(0, 5);
  sigma ~ cauchy(0, 5);
  // likelihood:
  z[1] = 0;
  epsilon[1] = 0;
  for (t in 2:N) {
    z[t] = delta0 * (((t >= T) && (t <= (T + d))) ? 1 : 0);
    nu[t] = (phi1 * y[t - 1]) + (theta1 * epsilon[t - 1]) + z[t];
    epsilon[t] = y[t] - nu[t];
  }
  epsilon ~ normal(0, sigma);
}")
```

In [10]:

```
tic()
fit_2a <- vb(model_2, data = list(T = T, N = N, y = y_2, d = d))
toc()
compare(fit_2a, c(phi1, theta1, delta0, sigma), pars = c("phi1", "theta1", "delta0", "sigma"))
```

In [11]:

```
tic()
fit_2b <- sampling(model_2, data = list(T = T, N = N, y = y_2, d = d), control = list(adapt_delta = 0.9))
toc()
compare(fit_2b, c(phi1, theta1, delta0, sigma), pars = c("phi1", "theta1", "delta0", "sigma"))
```

If $T$ is the time of intervention, then $I_t = 0$ when $t < T$ and $I_t = 1$ when $t \geq T$.

$$ z_t = \frac{\delta_0}{1 - \omega_1 B} I_t $$

assuming $|\omega_1| < 1$. The above is equivalent to $z_t = \omega_1 z_{t-1} + \delta_0 I_t$. Furthermore, since $|\omega_1| < 1$, $z_t$ (for $t \geq T$) can be expressed in closed form just in terms of $\omega_1$ and $\delta_0$:

$$ z_t = \frac{\delta_0(1 - \omega_1^{t - T + 1})}{1 - \omega_1} $$

In [12]:

```
z_3 <- numeric(N)
for (t in 2:N) z_3[t] <- omega1 * z_3[t - 1] + delta0 * (t >= T)
y_3 <- arma11_sim(z_3)
plot_ts(y_3, z_3, which((1:N) >= T))
```
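It's easy to verify numerically that the recursive definition and the closed form agree. A quick sketch, reusing the values of `N`, `T`, `delta0`, and `omega1` from earlier:

```r
N <- 200; T <- 100; delta0 <- 10; omega1 <- 0.9
# Recursive definition: z[t] = omega1 * z[t - 1] + delta0 * I[t]
z_rec <- numeric(N)
for (t in 2:N) z_rec[t] <- omega1 * z_rec[t - 1] + delta0 * (t >= T)
# Closed form for t >= T:
t_post <- T:N
z_closed <- delta0 * (1 - omega1^(t_post - T + 1)) / (1 - omega1)
max(abs(z_rec[t_post] - z_closed)) # effectively 0 (floating-point noise)
```

Note also that the closed form makes the eventual plateau obvious: as $t \to \infty$, $z_t \to \delta_0 / (1 - \omega_1) = 100$ here.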

Knowing how the value of $\omega_1$ shapes the effect can help us come up with an informative prior for it:

In [13]:

```
zs <- numeric(N)
omegas <- c(0.1, 0.5, 0.7, 0.8, 0.9)
plot(1, type = "n", xlim = c(95, 160), ylim = c(0, 10), yaxt = "n",
     xlab = "t", ylab = expression(z[t]),
     main = expression(paste("How ", omega[1], " influences effect")))
for (o in seq_along(omegas)) {
  for (t in 2:N) zs[t] <- omegas[o] * zs[t - 1] + 1 * (t >= T)
  lines(95:155, zs[95:155], type = "l")
  text(155, zs[160], omegas[o], pos = 4)
}; rm(zs, omegas, o, t)
```

Looking at the observed $y_t$, it appears that it takes about 50 days before the effect stops increasing. We don't need to know $\delta_0$ to see how $z_t$ behaves under different values of $\omega_1$. Looking at the figure above, we could be confident in saying that $\omega_1$ is *definitely* bigger than 0.1 and *probably* bigger than 0.5. Instead of a Uniform(0, 1) prior, it looks like we would be wise to use a Beta(8, 2) prior, which gives values 0.7-0.9 much higher probability:

(Made with my in-development RStudio addin tinydensR)
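As a quick numeric sanity check on that choice (a sketch using base R's Beta distribution functions), Beta(8, 2) has mean $8/(8+2) = 0.8$ and puts nearly all of its mass above 0.5:

```r
8 / (8 + 2)                  # prior mean: 0.8
1 - pbeta(0.5, 8, 2)         # P(omega1 > 0.5) is about 0.98
qbeta(c(0.025, 0.975), 8, 2) # central 95% prior interval
```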

In [14]:

```
model_3 <- stan_model(model_name = "pattern_3", model_code = "
data {
  int<lower = 1> T; // known time of intervention
  int<lower = 1> N; // number observations in series
  real y[N];        // observations
}
parameters {
  real<lower = -1, upper = 1> phi1;
  real<lower = -1, upper = 1> theta1;
  real<lower = 0> delta0;
  real<lower = 0, upper = 1> omega1;
  real<lower = 0> sigma;
}
model {
  vector[N] nu;      // prediction at time t
  vector[N] z;       // effect of intervention at t
  vector[N] epsilon; // diff between pred & obs at t
  // priors:
  phi1 ~ normal(0, 1);
  theta1 ~ normal(0, 1);
  delta0 ~ normal(5, 5);
  sigma ~ cauchy(0, 5);
  omega1 ~ beta(8, 2);
  // likelihood:
  z[1] = 0;
  epsilon[1] = 0;
  for (t in 2:N) {
    if (t < T) {
      z[t] = 0;
    } else {
      z[t] = delta0 * (1 - (omega1 ^ (t - T + 1))) / (1 - omega1);
    }
    nu[t] = (phi1 * y[t - 1]) + (theta1 * epsilon[t - 1]) + z[t];
    epsilon[t] = y[t] - nu[t];
  }
  epsilon ~ normal(0, sigma);
}")
```

In [15]:

```
tic()
fit_3a <- vb(model_3, data = list(T = T, N = N, y = y_3))
toc()
compare(fit_3a, c(phi1, theta1, delta0, sigma, omega1), pars = c("phi1", "theta1", "delta0", "sigma", "omega1"))
```

In [16]:

```
tic()
fit_3b <- sampling(model_3, data = list(T = T, N = N, y = y_3),
                   iter = 40000, control = list(max_treedepth = 15, adapt_delta = 0.999))
toc()
compare(fit_3b, c(phi1, theta1, delta0, sigma, omega1), pars = c("phi1", "theta1", "delta0", "sigma", "omega1"))
```

The formula and model specification are the same as the pattern above except $I_t = 1$ when $t = T$ and $I_t = 0$ otherwise.

In [17]:

```
z_4 <- numeric(N)
for (t in 2:N) z_4[t] <- omega1 * z_4[t - 1] + delta0 * (t == T)
y_4 <- arma11_sim(z_4)
plot_ts(y_4, z_4, which((1:N) >= T))
```
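Since $I_t$ is now a single pulse at $t = T$, the recursion implies the effect decays geometrically after the intervention: $z_{T+k} = \delta_0 \omega_1^k$. A quick sketch (reusing the values defined earlier) confirms this:

```r
N <- 200; T <- 100; delta0 <- 10; omega1 <- 0.9
# Recursion with a pulse indicator at t = T:
z <- numeric(N)
for (t in 2:N) z[t] <- omega1 * z[t - 1] + delta0 * (t == T)
# Geometric decay after the intervention:
k <- 0:20
max(abs(z[T + k] - delta0 * omega1^k)) # effectively 0
```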

In [18]:

```
model_4 <- stan_model(model_name = "pattern_4", model_code = "
data {
  int<lower = 1> T; // known time of intervention
  int<lower = 1> N; // number observations in series
  real y[N];        // observations
}
parameters {
  real<lower = -1, upper = 1> phi1;
  real<lower = -1, upper = 1> theta1;
  real<lower = 0> delta0;
  real<lower = 0, upper = 1> omega1;
  real<lower = 0> sigma;
}
model {
  vector[N] nu;      // prediction at time t
  vector[N] z;       // effect of intervention at t
  vector[N] epsilon; // diff between pred & obs at t
  // priors:
  phi1 ~ normal(0, 1);
  theta1 ~ normal(0, 1);
  delta0 ~ normal(0, 5);
  sigma ~ cauchy(0, 5);
  omega1 ~ beta(6, 2);
  // likelihood:
  z[1] = 0;
  epsilon[1] = 0;
  for (t in 2:N) {
    z[t] = omega1 * z[t - 1] + delta0 * (t == T ? 1 : 0);
    nu[t] = (phi1 * y[t - 1]) + (theta1 * epsilon[t - 1]) + z[t];
    epsilon[t] = y[t] - nu[t];
  }
  epsilon ~ normal(0, sigma);
}")
```

In [19]:

```
tic()
fit_4a <- vb(model_4, data = list(T = T, N = N, y = y_4))
toc()
compare(fit_4a, c(phi1, theta1, delta0, sigma, omega1), pars = c("phi1", "theta1", "delta0", "sigma", "omega1"))
```

In [20]:

```
tic()
fit_4b <- sampling(model_4, data = list(T = T, N = N, y = y_4),
                   iter = 10000, control = list(max_treedepth = 20, adapt_delta = 0.99))
toc()
compare(fit_4b, c(phi1, theta1, delta0, sigma, omega1), pars = c("phi1", "theta1", "delta0", "sigma", "omega1"))
```

If there's a slow ramp-up and then the change slowly levels-off in an "S" shape, we can represent it with the Gompertz function:

$$ f(t) = a\mathrm{e}^{-b\mathrm{e}^{-ct}} $$

where

- $a$ is an asymptote (e.g. $\delta_0$), since $\lim_{t \to \infty} a\mathrm{e}^{-b\mathrm{e}^{-ct}} = a\mathrm{e}^0 = a$
- $b > 0$ sets the displacement along the x-axis (translates the graph to the left or right)
- $c > 0$ sets the growth rate (we'll call it $\lambda$)
- $\mathrm{e}$ is Euler's number

In [21]:

```
gompertz <- function(t, a, b, c) {
  return(a * exp(-b * exp(-c * t)))
}
```
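We can check the asymptote property numerically: for large $t$, `gompertz(t, a, b, c)` approaches $a$, while near $t = 0$ the displacement $b$ keeps it close to 0. A quick sketch with arbitrary illustrative values:

```r
# Same Gompertz function as above:
gompertz <- function(t, a, b, c) a * exp(-b * exp(-c * t))
gompertz(0, 10, 18, 0.5)   # ~0: b = 18 displaces the curve to the right
gompertz(100, 10, 18, 0.5) # ~10 = a, the asymptote
```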

In [22]:

```
zs <- numeric(N)
lambdas <- c(0.25, 0.5, 1, 1.5)
plot(1, type = "n", xlim = c(95, 130), ylim = c(0, delta0), # yaxt = "n",
     xlab = "t", ylab = expression(z[t]),
     main = TeX("How $\\lambda$ influences effect"))
for (l in seq_along(lambdas)) {
  for (t in 2:N) {
    if (t >= T) {
      zs[t] <- gompertz(t - T, delta0, d, lambdas[l])
    } else {
      zs[t] <- 0
    }
  }
  lines(95:130, zs[95:130], type = "l", col = length(lambdas) + 1 - l, lty = l, lwd = 1.5)
}; rm(zs, l, t)
legend("bottomright", legend = lambdas,
       lwd = 1.5, lty = 1:length(lambdas),
       col = length(lambdas) + 1 - (1:length(lambdas)),
       y.intersp = 3, bty = "n")
```

In [23]:

```
I_5 <- (1:N) >= T
lambda <- 0.1
z_5 <- gompertz((1:N) - T, delta0, d, lambda) * I_5
y_5 <- arma11_sim(z_5)
plot_ts(y_5, z_5, which(I_5))
```

In [24]:

```
model_5 <- stan_model(model_name = "pattern_5", model_code = "
functions {
  real gompertz(real t, real a, real b, real c) {
    return (a * exp(-b * exp(-c * t)));
  }
}
data {
  int<lower = 1> T; // known time of intervention
  int<lower = 1> N; // number observations in series
  real y[N];        // observations
  int<lower = 1> d; // midpoint/displacement along t
}
parameters {
  real<lower = -1, upper = 1> phi1;
  real<lower = -1, upper = 1> theta1;
  real<lower = 0> delta0;
  real<lower = 0, upper = 1> omega1; // note: does not enter the likelihood below
  real<lower = 0> sigma;
  real<lower = 0> lambda; // Gompertz growth rate
}
model {
  vector[N] nu;      // prediction at time t
  vector[N] z;       // effect of intervention at t
  vector[N] epsilon; // diff between pred & obs at t
  // priors:
  phi1 ~ normal(0, 1);
  theta1 ~ normal(0, 1);
  delta0 ~ normal(0, 5);
  sigma ~ cauchy(0, 5);
  omega1 ~ beta(6, 2);
  lambda ~ cauchy(0, 0.5);
  // likelihood:
  z[1] = 0;
  epsilon[1] = 0;
  for (t in 2:N) {
    if (t >= T) {
      z[t] = gompertz(t - T, delta0, d, lambda);
    } else {
      z[t] = 0;
    }
    nu[t] = (phi1 * y[t - 1]) + (theta1 * epsilon[t - 1]) + z[t];
    epsilon[t] = y[t] - nu[t];
  }
  epsilon ~ normal(0, sigma);
}")
```

In [25]:

```
tic()
fit_5a <- vb(model_5, data = list(T = T, N = N, y = y_5, d = d))
toc()
compare(fit_5a, c(phi1, theta1, delta0, sigma, lambda), pars = c("phi1", "theta1", "delta0", "sigma", "lambda"))
```

In [26]:

```
tic()
fit_5b <- sampling(model_5, data = list(T = T, N = N, y = y_5, d = d),
                   iter = 4000, control = list(adapt_delta = 0.99))
toc()
compare(fit_5b, c(phi1, theta1, delta0, sigma, lambda), pars = c("phi1", "theta1", "delta0", "sigma", "lambda"))
```

In [27]:

```
# Image at the beginning of this notebook:
png("intervention-patterns.png", width = 15, height = 3, units = "in", res = 150)
par(mfrow = c(1, 5))
plot(y_1, type = "l", main = "Pattern 1", lwd = 0.5, xlab = "t", ylab = expression(y[t]))
lines(1:N, theta1 * z_1, col = "red", lwd = 2)
plot(y_2, type = "l", main = "Pattern 2", lwd = 0.5, xlab = "t", ylab = expression(y[t]))
lines(1:N, theta1 * z_2, col = "red", lwd = 2)
plot(y_3, type = "l", main = "Pattern 3", lwd = 0.5, xlab = "t", ylab = expression(y[t]))
lines(1:N, theta1 * z_3, col = "red", lwd = 2)
plot(y_4, type = "l", main = "Pattern 4", lwd = 0.5, xlab = "t", ylab = expression(y[t]))
lines(1:N, theta1 * z_4, col = "red", lwd = 2)
plot(y_5, type = "l", main = "Pattern 5", lwd = 0.5, xlab = "t", ylab = expression(y[t]))
lines(1:N, theta1 * z_5, col = "red", lwd = 2)
dev.off()
```

In pattern 2 we had an intervention whose effect lasted $d = 18$ days. What if we didn't know that? What if we just knew when the intervention happened but were unsure about how long the effect lasted? Our suspicion is that the effect of the intervention lasted between 1 and 3 weeks and is *probably* around a fortnight, although we're not particularly sure. We can specify $d$ as a parameter of the model to infer from the data, and we can formalize our prior beliefs as:

$$ d \sim \mathrm{Normal}(14, 2) $$

With the Normal distribution there's something called the 68-95-99.7 rule, so one way to think about this is that we're:

- 68% sure that $d$ is between $14-2=12$ and $14+2=16$
- 95% sure that $d$ is between $14-4=10$ and $14+4=18$
- 99.7% sure that $d$ is between $14-6=8$ and $14+6=20$
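Those intervals are just $\pm 1$, $\pm 2$, and $\pm 3$ standard deviations around the mean of 14 (with $\sigma = 2$), which we can verify with base R's `pnorm`:

```r
# Probability mass of Normal(14, 2) within 1, 2, and 3 standard deviations:
pnorm(16, 14, 2) - pnorm(12, 14, 2) # ~0.683
pnorm(18, 14, 2) - pnorm(10, 14, 2) # ~0.954
pnorm(20, 14, 2) - pnorm(8, 14, 2)  # ~0.997
```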

**By the way**, even though we talk about $d$ as a *discrete* number of days, we can still represent it in the model as a *continuous*, real-valued random variable.
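Concretely, even when a draw of $d$ is fractional, the condition used in the model ($T \leq t < T + d$) still switches the effect on for a whole number of days. A quick sketch with a hypothetical draw of $d = 14.5$ and $T = 100$ as above:

```r
T <- 100; d <- 14.5 # d here is a hypothetical (fractional) posterior draw
t <- 1:200
on <- (t >= T) & (t < T + d) # the indicator used in the model below
range(t[on]) # effect is on for days 100 through 114
sum(on)      # i.e. 15 whole days
```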

The probability density looks like this:

In [28]:

```
curve(dnorm(x, mean = 14, sd = 2), from = 7, to = 21, lty = "dotted",
      ylab = "density", xlab = "days", main = "Prior distribution on d")
curve(dnorm(x, mean = 14, sd = 2), from = 8, to = 20, lwd = 2, add = TRUE)
abline(v = c(8, 20), lty = "dashed")
abline(v = 18, lty = "dashed", col = "red", lwd = 2)
legend("topleft", "actual value of d", lty = "dashed", col = "red", lwd = 2, bty = "n")
```

In this particular case where we have this additional uncertainty from having $d$ as an unknown, we're going to help the model by setting the lower bound on $\delta_0$ to 0 since it's clear that the intervention had a positive impact.

In [29]:

```
model_2ec <- stan_model(model_name = "pattern_2ec", model_code = "
data {
  int<lower = 1> T; // known time of intervention
  int<lower = 1> N; // number observations in series
  real y[N];        // observations
}
parameters {
  real<lower = -1, upper = 1> phi1;
  real<lower = -1, upper = 1> theta1;
  real<lower = 0> delta0;
  real<lower = 0> sigma;
  real<lower = 1> d; // days the change from intervention lasts
}
model {
  vector[N] nu;      // prediction at time t
  vector[N] z;       // effect of intervention at t
  vector[N] epsilon; // diff between pred & obs at t
  // priors:
  phi1 ~ normal(0, 1);
  theta1 ~ normal(0, 1);
  delta0 ~ normal(0, 5);
  sigma ~ cauchy(0, 5);
  d ~ normal(14, 2);
  // likelihood:
  z[1] = 0;
  epsilon[1] = 0;
  for (t in 2:N) {
    if ((t >= T) && (t < (T + d))) {
      z[t] = delta0;
    } else {
      z[t] = 0;
    }
    nu[t] = (phi1 * y[t - 1]) + (theta1 * epsilon[t - 1]) + z[t];
    epsilon[t] = y[t] - nu[t];
  }
  epsilon ~ normal(0, sigma);
}")
```

In [30]:

```
tic()
fit_2eca <- vb(model_2ec, data = list(T = T, N = N, y = y_2))
toc()
compare(fit_2eca, c(phi1, theta1, delta0, sigma, d), pars = c("phi1", "theta1", "delta0", "sigma", "d"))
```

In [31]:

```
tic()
fit_2ecb <- sampling(model_2ec, data = list(T = T, N = N, y = y_2))
toc()
compare(fit_2ecb, c(phi1, theta1, delta0, sigma, d), pars = c("phi1", "theta1", "delta0", "sigma", "d"))
```