Forecasting

Dan | Sep 2, 2018

Introduction

I looked into a number of resources to learn how to handle time series data and conduct informed statistical forecasting. Many of these resources jumped right into the equations and lacked a comprehensive introduction and discussion to facilitate learning. I found that a combination of two resources provided the right introduction for a self-learner such as myself:

  1. Introduction to Time Series Analysis is a website available from the National Institute of Standards and Technology (NIST) Information Technology Library. The website provides an overview of the important terminology and major concepts of time series analysis with limited mathematics.

  2. The online ebook, Forecasting: Principles and Practice, provides an in-depth, applied introduction to time series analysis and forecasting. The statistics are introduced and explained in ways that make sense and allow the reader to reason through the material rather than accepting algorithms at face value without any understanding of how they operate. After finishing this book, I feel comfortable reading journal articles on applied forecasting and using what I know to work through concepts I do not understand yet.

Application

I will walk through the development of progressively improved models to mirror the logical flow introduced in Forecasting. I do not intend to go through the concepts or algorithms in-depth, but I will reference the resources mentioned above when needed. I will be analyzing the Capital Bikeshare monthly ridership for the past eight years because I have the data easily accessible from a past project and I am familiar with potential covariates that might be useful later on. The workflow will be:

  1. preparing time series data for modeling, including wrangling and checking modeling assumptions
  2. building multiple models to determine the most accurate, including averaging models to potentially boost accuracy
  3. building dynamic regression models to incorporate predictors
  4. building hierarchical and grouped models

Packages for time series analysis

The fpp2 package is associated with the Forecasting ebook and loads a number of datasets used throughout the book. The package also loads several related packages for fitting and analyzing time series models.

library(fpp2)

Import the data for univariate analysis

Capital Bikeshare ridership data is released quarterly, with information about the date and time of each ride and whether the user was a member or a casual rider. I have aggregated and stored the data at the monthly level for total ridership, and separately for members and casual riders. These three time series will be the forecast variables for the analysis.
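The aggregation itself is not part of this post, but a minimal sketch of how it could look is below, assuming a hypothetical trip-level data frame named trips with columns start_date and member_type (the actual column names in the Capital Bikeshare exports vary by year).

# Sketch: monthly ride counts from hypothetical trip-level data
trips %>% 
  mutate(year = lubridate::year(start_date),
         month = lubridate::month(start_date)) %>% 
  group_by(year, month) %>% 
  summarise(n = n(),
            Casual = sum(member_type == "Casual", na.rm = TRUE),
            Member = sum(member_type == "Member", na.rm = TRUE))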

Preparing the time series data for modeling

Cleaning

library(tidyverse) # provides read_csv() and the dplyr/tidyr verbs used below

raw <- read_csv("./time_series/processed_data/count_byyearandmonth_casual_member_total.csv")

head(raw)

The data consists of five columns: the first two denote the year and month and serve as identifiers. The other three columns are counts for the total ridership (“n”), casual riders (“Casual”), and member riders (“Member”). It is worth noting that about 30 rides were missing a Casual/Member classification, so they are counted in the total but not in either rider column. This might be relevant during the hierarchical and grouped modeling because the two columns will not add up to the total ridership in some rows.
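A quick check, using the column names above, flags the rows where the two rider types do not sum to the total:

# Rows where Casual + Member does not equal the total count n
raw %>% 
  filter(Casual + Member != n)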

Creating time series (ts) objects

The first month of data has been removed because it is substantially lower than even the second month in the dataset. Inclusion of the first month creates a large outlier.

While the month and year are useful when importing new data, those columns are not explicitly used to create the ts object. Instead, those columns are removed, and the remaining three columns, all of which are monthly counts, are piped into the ts() function and the start month and frequency are set to assign the counts to the appropriate months.

The output object does not look very different from the raw data, but the structure of the object is a multivariate time series (“mts”) where all three variables are outcome variables that we want to model. Later on, we will use multivariate time series that include predictor variables intended to refine model accuracy.

raw.ts <- 
  raw %>% 
  slice(-1) %>% 
  select(-year, -month) %>% 
  ts(., start = c(2010, 9), frequency = 12)

head(raw.ts)

Descriptive patterns

We can use the time plot to determine the features of the time series to inform model building.

raw.ts[,3] %>% 
  autoplot(., series = "Total") +
  autolayer(raw.ts[,1], series = "Casual") +
  autolayer(raw.ts[,2], series = "Member")

The time series all have a general upward trend over the eight years. The time series also have a clear seasonal pattern; ridership increases markedly from January through the summer and then declines until the next cycle begins. The seasonal pattern appears to increase in amplitude over time, meaning the difference between the high and low points in each season seems to increase as time goes on. This pattern likely has to do with the growing availability and popularity of Capital Bikeshare over the time period. This hypothesis will be explored below.

On a more technical note, the seasonality and trend evident in the plot indicate that the time series are non-stationary. The KPSS unit root test confirms this; the test statistic is higher than the 1 percent critical value, so the null hypothesis of stationarity is rejected and the observed data is non-stationary. I will need to convert the data to a stationary format before proceeding.

library(urca)

raw.ts[,3] %>% ur.kpss() %>% summary()

raw.ts[,3] %>% ndiffs()
nsdiffs(raw.ts[,3])

The ndiffs() function indicates that one differencing operation will be suitable to create a stationary time series. After taking the logarithm (to stabilize the variance) and differencing once (to stabilize the mean), the output does pass the KPSS test. The large initial value is surprising, but we will move on for now.

total.diff <-
  raw.ts[,3] %>% 
  log() %>% 
  diff()

total.diff %>% ur.kpss() %>% summary()

total.diff %>% autoplot()

Building a model

ARIMA models and Exponential Smoothing (ES) are the two most common methods for time series forecasting. While ARIMA models pick up on autocorrelation in the observed data, ES estimates and leverages the underlying seasonal and trend components in the observed data. Here we will build both of these models and compare the accuracy.

For now, we will only work with the time series for the total ridership. The training set includes monthly counts through January 2017. The test set and prediction horizon will be the period beyond the training data, from February 2017 through February 2018.

total.ts <- raw.ts[,3]

train <- window(total.ts, end=c(2017, 1))

h <- length(total.ts) - length(train)

auto.arima() chooses an ARIMA model with 0 autoregressive terms, 1 difference, and 3 lagged forecast error terms, with an additional seasonal component (2,1,1)[12]. Because the first model was created with a Box-Cox transformation and bias adjustment, its point predictions for the test period assume a stronger increasing trend than the simple optimized model chosen by auto.arima() without those parameters. The test data does not follow the increasing trend in the arima.lambda model, so the plain automated ARIMA model is more accurate on the test data according to root mean squared error (RMSE), a common metric for comparing forecasts of the same series.

arima.lambda <- 
  train %>% 
  auto.arima(., lambda=0, biasadj=T) %>% 
  forecast(h = h)

arima.lambda %>% summary()

arima.auto <- forecast(auto.arima(train),  h=h)

total.ts %>% 
  autoplot()+
  autolayer(arima.auto, series = "ARIMA", PI = F)+
  autolayer(arima.lambda, series = "ARIMA: lambda = 0", PI = F)

accuracy(arima.auto, total.ts)['Test set', 'RMSE']
accuracy(arima.lambda, total.ts)['Test set', 'RMSE']

Building and comparing multiple models

ets() will fit the exponential smoothing model (the function name stands for Error, Trend, Seasonal, the components that exponential smoothing estimates). I also include forecasts from stlf(), which models the seasonally adjusted series from an STL decomposition, and from tbats(). Several of the models are fit both with and without a Box-Cox transformation (lambda = 0).

models <- list()

models$arima.auto <- arima.auto
models$arima.lambda <- arima.lambda
models$ETS.lambda <- forecast(ets(train, lambda = 0), h=h)
models$ETS <- forecast(ets(train), h=h)
models$STLF.lambda <- stlf(train, lambda = 0, h = h)
models$STLF <- stlf(train, h = h)
models$TBATS <- forecast(tbats(train, lambda = 0), h = h)

total.ts %>% 
  autoplot(series = "Observed Data")+
  autolayer(models$STLF, series = "STLF", PI = F)+
  autolayer(models$ETS, series = "ETS", PI = F)+
  autolayer(models$arima.auto, series = "ARIMA", PI = F)+
  autolayer(models$arima.lambda, series = "ARIMA: Lambda = 0", PI = F)+
  autolayer(models$TBATS, series = "TBATS", PI = F)


lapply(models, function(x){accuracy(x, total.ts)['Test set', 'RMSE']})

It appears that the ETS method performed the best among these models. Unfortunately, nearly all of the models violate the residual assumptions; the model residuals are distinguishable from white noise and do not fully capture the dynamics of the observed data. Only the ARIMA model with a Box-Cox transformation passes the Ljung-Box test. This does not mean the other models are not worthwhile (all of them outperformed the transformed ARIMA model in terms of accuracy), but they are not fully utilizing patterns in the data because there is still autocorrelation left in their residuals.

lapply(models, function(x){checkresiduals(x, plot = F)})
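Because checkresiduals() returns the Ljung-Box test result invisibly, the p-values can also be collected into a single vector for an at-a-glance comparison:

# Ljung-Box p-values for each model's residuals
sapply(models, function(x){checkresiduals(x, plot = F)$p.value})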

Improving accuracy through averaging

As a form of ensemble modeling, averaging the point predictions from multiple models often improves accuracy. In this case, the test set RMSE from combining three models (STLF.lambda, ETS, and TBATS) is better than that of any individual model. Adding a fourth model that performs well individually, such as STLF or ETS.lambda, increases the combined RMSE, so those are left out (a four-model average is checked below for comparison). The likely reason is that the added model shares its underlying form with one already in the average, so there is little new information to be gained; averaging models that take different approaches to detecting patterns, such as ETS and STLF, complements each approach and improves performance. Using the three models is better than using only the two top-performing models.

Combination <- (models$STLF.lambda[["mean"]] + models$ETS[["mean"]] + models$TBATS[["mean"]])/3

accuracy(Combination, total.ts)['Test set', 'RMSE']

Combination2 <- (models$STLF.lambda[["mean"]] + models$ETS[["mean"]])/2

accuracy(Combination2, total.ts)['Test set', 'RMSE']
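As a quick check of the claim that a fourth model hurts the average, STLF can be added to the combination and scored the same way:

Combination4 <- (models$STLF.lambda[["mean"]] + models$ETS[["mean"]] + models$TBATS[["mean"]] + models$STLF[["mean"]])/4

accuracy(Combination4, total.ts)['Test set', 'RMSE']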

total.ts %>% 
  autoplot()+
  autolayer(Combination2, series = "Combination of STLF.lambda and ETS")+
  autolayer(Combination, series = "Combination of three models")+
  autolayer(models$ETS, series = "ETS", PI = F)+
  autolayer(models$arima.auto, series = "ARIMA", PI = F)+
  autolayer(models$TBATS, series = "TBATS", PI = F)+
  autolayer(models$STLF.lambda, series = "STLF.lambda", PI = F)+
  autolayer(models$STLF, series = "STLF", PI = F)

Neural Network Model

Another potential model is a neural network (NN) autoregression, which models non-linear relationships between lagged values of the series and the outcome. The NN model predicts a more rapid increase in ridership after 2018 begins, while the observed data lags behind.

nn.fit <- nnetar(train, lambda = 0)

total.ts %>% 
  autoplot()+
  autolayer(forecast(nn.fit,  h = h, PI = F))

accuracy(forecast(nn.fit, h = h), total.ts)

Multivariate Analysis

The models presented up to this point only use patterns in the ridership count itself. Other variables might assist in forecasting future ridership, including monthly average temperature, precipitation, and bicycle availability. I will gather these three data sources and incorporate the information into a dynamic regression model to see how these variables affect forecast accuracy.

Importing predictors

The National Weather Service has monthly average temperature and precipitation for Washington, DC going back to 1871. These records likely come from one or a few monitoring sites, but they will give a rough estimate of how these variables affect ridership. The tabulizer package provides a function to extract tables from a PDF, and then it is straightforward to wrangle that output into a table covering the pertinent range (2010-2018).

library(tabulizer)

temp.table <- extract_tables("https://www.weather.gov/media/lwx/climate/dcatemps.pdf")

header <- function(df){
  names(df) <- as.character(unlist(df[1,]))
  df[-1,]
}

ts.temp <-  
  temp.table[[5]][c(1, 20, 22:26, 28:30),1:13] %>% 
  data.frame(stringsAsFactors = F) %>% 
  header(.) %>% 
  tidyr::gather(., "Month", "Temperature (Degrees Fahrenheit)", 2:13) %>% 
  arrange(-desc(YEAR)) %>% 
  select(-YEAR, -Month) %>% 
  ts(start = c(2010, 09), end = c(2018, 02), frequency = 12)

precip.table <- extract_tables("https://www.weather.gov/media/lwx/climate/dcaprecip.pdf")

ts.precip <-  
  precip.table[[5]][c(1, 16, 18:22, 24:26),1:13] %>% 
  data.frame(stringsAsFactors = F) %>% 
  header(.) %>% 
  tidyr::gather(., "Month", "Temperature (Degrees Fahrenheit)", 2:13) %>% 
  arrange(-desc(YEAR)) %>% 
  select(-YEAR, -Month) %>% 
  ts(start = c(2010, 09), end = c(2018, 02), frequency = 12) 

stations <- read_csv("./learning_forecasting/station_firstuse_latlon.csv")

ts.stations <-
  stations %>% 
  mutate(MONTH = as.Date(paste(format(as.Date(Time), "%Y-%m"), "-01", sep = ""))) %>% 
  select(-Time, YEAR) %>% 
  count(MONTH) %>% 
  padr::pad(end_val = as.Date("2018-02-01")) %>% 
  tidyr::replace_na(list(n = 0)) %>% 
  mutate(total = cumsum(n)) %>% 
  select(total) %>% 
  ts(start = c(2010, 9), end = c(2018, 02), frequency = 12) 

Dynamic Regression Model

covars <- cbind(ts.temp, ts.precip, ts.stations) %>% 
  window(end = c(2018, 2))

# Restrict data so models use same fitting period
# use 1-4 lagged values in the xreg predictors
fit1 <- auto.arima(total.ts, xreg=covars[,1:3],
                   stationary=TRUE)

fit1

fcast <- forecast(fit1, xreg=cbind(
 # rep(mean(covars[,1], 8)),rep(mean(covars[,2], 8)),rep(mean(covars[,3], 8))
  rep(77.3, 20), rep(0, 20), rep(454, 20)
  ))

fcast %>% autoplot()

I generalized the function provided in Forecasting to create lags of various lengths. Unfortunately, the larger the lag, the smaller the usable dataset, so the information criteria are lower than those of models that use smaller lags and larger datasets simply because fewer observations enter the likelihood. Models fitted to different datasets cannot be compared on information criteria for that reason, among others.

predictors <- function(ts, var, lags){
  cbind(
    var = ts[,var],
    varLag1 = stats::lag(ts[,var],-lags[1]),
    varLag2 = stats::lag(ts[,var],-lags[2]),
    varLag3 = stats::lag(ts[,var],-lags[3])
  ) %>% 
    window(end = c(2018, 2))
}
lagpreds.mult <- cbind(predictors(covars, "ts.temp", c(1,2,3)),
                       predictors(covars, "ts.stations", c(1,2,3)))

fit1 <- auto.arima(total.ts[4:90], xreg=lagpreds.mult[4:90,1:5],
  stationary=T) # predictors: temperature (current value plus three lags) and the current station count

fit2 <- auto.arima(total.ts[4:90], xreg=lagpreds.mult[4:90, 1:3],
  stationary=TRUE)

c(fit1[["aicc"]], fit2[["aicc"]])

Dynamic regression with lagged predictor terms does not improve the model AICc beyond the best models produced using only the forecast variable. This is plausible; it seems unlikely that the number of stations would have a large impact on ridership in the short run.
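One caveat to that comparison: the dynamic regression models were fit to rows 4 through 90 only, so a covariate-free baseline should be refit on the same subset before reading too much into the AICc values. A minimal sketch:

# Refit the model without predictors on the same observations so the AICc
# values come from the same data
fit0 <- auto.arima(total.ts[4:90], stationary = TRUE)

c(fit0[["aicc"]], fit1[["aicc"]], fit2[["aicc"]])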