Forecast double seasonal time series with multiple linear regression in R
This article is originally published at https://petolau.github.io
I will continue describing forecast methods that are suitable for seasonal (or multi-seasonal) time series. In the previous post, smart meter data of electricity consumption were introduced and a forecast method using a similar-day approach was proposed. ARIMA and exponential smoothing (common methods of time series analysis) were used as forecast methods. The biggest disadvantage of this approach was that we created multiple models at once for different days of the week, which is computationally expensive and can be a little unclear. Regression methods are more suitable for multi-seasonal time series. They can handle multiple seasonalities through independent variables (inputs of a model), so just one model is needed. In this post, I will introduce the most basic regression method - multiple linear regression (MLR).
I have prepared a file with four aggregated time series for analysis and forecasting. It can be found on my github repo; the name of the file is DT_4_ind. Of course, I’m using EnerNOC smart meter data again, and the time series were aggregated by four industries. The file was created with the package feather (CRAN link), so you need this package to read the file again. feather is a useful tool for sharing data between R and Python users.
Data manipulation will be done with the data.table package, and visualizations with the ggplot2, plotly and animation packages.
The first step to do some “magic” is to load all of the needed packages.
Read the mentioned smart meter data with read_feather into one data.table.
Let’s plot what I have prepared for you - the aggregated time series of electricity consumption by industry.
An interesting fact is that the consumption of the industry Food Sales & Storage isn’t changing during holidays as much as others.
Multiple linear regression model for double seasonal time series
First, let’s formally define the multiple linear regression model. The aim of multiple linear regression is to model a dependent variable (output) by independent variables (inputs). Another goal can be to analyze the influence (correlation) of the independent variables on the dependent variable. Like in the previous post, we want to forecast consumption one week ahead, so the regression model must capture the weekly pattern (seasonality). The inputs will be seasonal dummy variables of two types - daily (\( d_1, \dots, d_{48} \)) and weekly (\( w_1, \dots, w_6 \)). A daily variable \( d_j \) equals \( 1 \) when the consumption is measured at the \( j \)-th of the 48 measurement times of the day, otherwise \( 0 \). A weekly variable \( w_j \) equals \( 1 \) when the consumption is measured on the \( j \)-th day of the week, otherwise \( 0 \).
The regression model can be formally written as:
\[ y_i = \beta_1 d_{i1} + \beta_2 d_{i2} + \dots + \beta_{48} d_{i48} + \beta_{49} w_{i1} + \dots + \beta_{54} w_{i6} + \varepsilon_i, \]
where \( y_i \) is the electricity consumption at the time \( i \), where \( i = 1, \dots, N \). \( \beta_1, \dots, \beta_{54} \) are regression coefficients, which we want to estimate. \( d_{i1}, \dots, d_{i48} \) and \( w_{i1}, \dots, w_{i6} \) are the values of the dummy independent variables \( d_1, \dots, d_{48} \) and \( w_1, \dots, w_6 \) at the time \( i \). \( \varepsilon_i \) is a random error. The errors are assumed to be independent and identically distributed (i.i.d.) with distribution \( \varepsilon_i \sim N(0,~\sigma^2) \).
Estimation of the regression coefficients is done by ordinary least squares (OLS). If we write our model in matrix form as:
\[ Y = \mathbf{X}\beta + \varepsilon, \]
where \( Y \) is a vector of the length \( N \), \( \beta \) is a vector of the length \( p \) and \( \mathbf{X} \) is a matrix of the size \( N\times p,\) then the OLS estimate of \( \beta \) is:
\[ \hat{\beta} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T Y. \]
You may be asking where the independent variable \( w_7 \) or the intercept \( \beta_0 \) is. We must omit them due to collinearity of the independent variables. The model matrix \( \mathbf{X} \) must have full column rank, so that \( \mathbf{X}^T \mathbf{X} \) is a regular (invertible) matrix, not a singular one. Moreover, an intercept makes little sense in a time series regression model, because we do not usually consider time 0.
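The OLS formula and the collinearity argument can be checked numerically. The following is a minimal base R sketch on synthetic data (the series, the seed and the variable names are my assumptions, not the EnerNOC data). Note that R drops the first level of the weekly factor rather than the seventh; the fitted model is the same.

```r
# Synthetic data: two weeks with 48 measurements per day (not the EnerNOC series)
set.seed(1)
period <- 48                                  # measurements per day
n_days <- 14                                  # two weeks
daily  <- factor(rep(1:period, times = n_days))
weekly <- factor(rep(rep(1:7, each = period), times = n_days / 7))
y <- 100 + 20 * sin(2 * pi * as.integer(daily) / period) +
  5 * as.integer(weekly) + rnorm(period * n_days)

# Model matrix without intercept: 48 daily dummies + 6 weekly dummies
X <- model.matrix(~ 0 + daily + weekly)
dim(X)

# OLS estimate from the normal equations: (X'X)^{-1} X'y
beta_hat <- solve(crossprod(X), crossprod(X, y))

# lm() arrives at the same coefficients
fit <- lm(y ~ 0 + daily + weekly)
all.equal(as.vector(beta_hat), unname(coef(fit)))

# Adding the omitted weekly dummy makes the columns linearly dependent
X_full <- cbind(X, weekly1 = as.numeric(weekly == "1"))
qr(X_full)$rank   # the rank does not grow with the extra column
```

With all seven weekly dummies (or an intercept) in the matrix, \( \mathbf{X}^T \mathbf{X} \) would be singular and the inverse in the OLS formula would not exist.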
Regression analysis of time series
Let’s finally do some regression analysis of our proposed model. First, prepare DT to work with a regression model. Transform the weekday names to integers.
Store the information about the type of industry, date, weekday and period in variables.
Let’s look at a chunk of the consumption data and do regression analysis on it. I have picked the aggregate consumption of education (school) buildings for two weeks. Store it in the variable data_r and plot it.
Let’s now create the mentioned independent dummy variables and store all of them in matrix_train. When we are using the function lm in R, it’s simple to define dummy variables in one vector: just use as.factor on a vector of classes. We don’t need to create 48 vectors for the daily dummy variables and 6 vectors for the weekly ones.
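To illustrate how factors stand in for the dummy vectors, here is a small sketch. It uses a base data.frame with placeholder values as a stand-in for matrix_train (the original uses data.table and real consumption), so the layout is an assumption.

```r
period <- 48   # measurements per day
n_days <- 14   # two weeks

# Hypothetical stand-in for matrix_train with placeholder load values
set.seed(10)
matrix_train <- data.frame(
  Load   = rnorm(period * n_days, mean = 100),
  Daily  = as.factor(rep(1:period, times = n_days)),
  Weekly = as.factor(rep(rep(1:7, each = period), times = n_days / 7))
)
str(matrix_train)

# lm() expands the factors internally; model.matrix() makes the expansion visible
X <- model.matrix(Load ~ 0 + Daily + Weekly, data = matrix_train)
colnames(X)[c(1, 48, 49, 54)]   # "Daily1" "Daily48" "Weekly2" "Weekly7"
```

Two factor columns are enough; the 54 dummy columns are generated from their levels.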
Let’s create our first multiple linear model with the function lm. lm automatically adds an intercept to the linear model, so we must suppress it by putting 0 into the formula. Also, we can simply put all the variables into the model with the dot (.).
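A minimal sketch of such a call on synthetic data follows (the data and the names smmr_1, r2_reported and r2_centered are assumptions). One detail worth knowing: for a model without an intercept, summary.lm computes R-squared against zero instead of against the mean of the response, so the reported value is never smaller than the usual centered one. The sketch computes both.

```r
set.seed(42)
period <- 48
n_days <- 14
slot <- rep(1:period, times = n_days)
wday <- rep(rep(1:7, each = period), times = n_days / 7)

# Synthetic stand-in for matrix_train
matrix_train <- data.frame(
  Load   = 100 + 20 * sin(2 * pi * slot / period) + 5 * wday + rnorm(length(slot)),
  Daily  = as.factor(slot),
  Weekly = as.factor(wday)
)

# "0 +" removes the intercept, "." stands for all remaining columns (Daily, Weekly)
lm_m_1 <- lm(Load ~ 0 + ., data = matrix_train)
length(coef(lm_m_1))   # 48 daily + 6 weekly coefficients

smmr_1 <- summary(lm_m_1)
f <- smmr_1$fstatistic
p_value <- unname(pf(f[1], f[2], f[3], lower.tail = FALSE))

# R-squared as reported (uncentered) and the usual centered version
r2_reported <- smmr_1$r.squared
r2_centered <- 1 - sum(residuals(lm_m_1)^2) /
  sum((matrix_train$Load - mean(matrix_train$Load))^2)
c(reported = r2_reported, centered = r2_centered, p_value = p_value)
```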
You can see a nice summary of the linear model, stored in the variable lm_m_1, but I will omit it here because of its length (we have 54 variables). So I’m showing you only the two most important statistics: R-squared and the p-value of the F-statistic of the goodness of fit. They seem pretty good.
Let’s look at the fitted values.
That’s horrible! Do you see that bad fit around 5th March (weekend)? We are missing something here.
Look at the fitted values vs. residuals now.
This is a typical example of heteroskedasticity - non-constant variance of the residuals in a regression model. Linear regression assumes that the residuals come from the \( N(0,~\sigma^2) \) distribution and that they are i.i.d. In other words, the residuals must be spread symmetrically around zero with constant variance.
Let’s look at further evidence that our residuals are not normal. We can use a normal Q-Q plot here. I’m using the function from this stackoverflow question to plot it with ggplot2.
Of course, it is absolutely not normal (the points must be close to the red line).
What can we do now? Use other regression methods (especially nonlinear ones)? No. Let’s think about why this happened. We have seen in the fitted values that the measurements during the day were only shifted by a constant, the estimated coefficient of the week variable, while the different behavior during the day wasn’t captured. We need to capture this behavior, because weekends especially behave completely differently. It can be handled by adding interactions between the daily and weekly dummy variables to the regression model, so we multiply every daily variable by every weekly one. Again, be careful with collinearity and singularity of the model matrix: we must omit one daily dummy variable (for example \( d_{1} \)) from the interactions. Fortunately, lm does this automatically when we use factors as variables.
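The effect of interactions can be sketched on synthetic data in which workdays have a daytime peak and weekends are flat (an assumed pattern, chosen only to mimic the problem above; the names lm_add, lm_int and centered_r2 are hypothetical).

```r
set.seed(7)
period <- 48
n_days <- 14
slot <- rep(1:period, times = n_days)
wday <- rep(rep(1:7, each = period), times = n_days / 7)

# Workdays (1-5) have a daytime peak, weekends (6-7) stay flat
peak <- ifelse(slot > 16 & slot <= 36, 50, 0)
Load <- 100 + ifelse(wday <= 5, peak, 0) + rnorm(length(slot), sd = 2)
matrix_train <- data.frame(Load, Daily = as.factor(slot), Weekly = as.factor(wday))

# Additive model vs. model with Daily x Weekly interactions
lm_add <- lm(Load ~ 0 + Daily + Weekly, data = matrix_train)
lm_int <- lm(Load ~ 0 + Daily + Weekly + Daily:Weekly, data = matrix_train)

# 54 coefficients vs. 48 + 6 + 47 * 6 = 336 coefficients
c(additive = length(coef(lm_add)), interactions = length(coef(lm_int)))

# Centered R-squared, comparable between the two fits
centered_r2 <- function(fit, y) 1 - sum(residuals(fit)^2) / sum((y - mean(y))^2)
r2 <- c(additive = centered_r2(lm_add, Load), interactions = centered_r2(lm_int, Load))
r2
```

The additive model can only shift the whole daily profile up or down for each weekday, whereas the interaction model gets a separate daily profile for every day of the week. With two weeks of training data, that means only two observations per coefficient.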
Let’s train a second linear model. Interactions should solve the problem that we saw in the plot of fitted values.
Look at R-squared of previous model and the new one with interactions:
R-squared seems better.
Look at the comparison of the residuals of the two fitted models, using an interactive plotly plot here.
This is much better than the previous model; it seems that the interactions are working.
Prove it with a sequence of three plots - fitted values, fit vs. residuals and Q-Q plot.
Everything seems much better than in the previous model. The fitted values seem almost perfect.
I also tried to work with a linear trend to boost this model, but it did not help (wasn’t significant). So go ahead and forecast consumption with this model.
Forecast with multiple linear regression
Again, I built a function (as in the previous post) that returns the forecast one week ahead, so we can simply compare it with the STL+ARIMA method (which was better than STL+ETS). The arguments of this function are just data and set_of_date, so it’s easy to work with. Let’s add everything needed to create a regression model and a forecast to the function predWeekReg.
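A function of this kind could look roughly like the base R sketch below. The name pred_week_reg_sketch and its arguments (a numeric vector of loads and a vector of weekday numbers) are hypothetical simplifications, not the signature of predWeekReg, and the sketch assumes the training window consists of whole weeks.

```r
# Hypothetical sketch: fit MLR with interactions and forecast one week ahead
pred_week_reg_sketch <- function(value, week_num, period = 48) {
  n_days <- length(value) / period
  train <- data.frame(
    Load   = value,
    Daily  = as.factor(rep(1:period, times = n_days)),
    Weekly = as.factor(week_num)
  )
  fit <- lm(Load ~ 0 + Daily + Weekly + Daily:Weekly, data = train)
  # The week to forecast repeats the (Daily, Weekly) combinations of the first
  # training week, so those rows can serve as new data for the prediction
  new_week <- train[1:(7 * period), c("Daily", "Weekly")]
  as.vector(predict(fit, newdata = new_week))
}

# Usage on synthetic data: two weeks for training
set.seed(5)
period <- 48
week_num <- rep(rep(1:7, each = period), times = 2)
value <- 100 + 10 * (week_num <= 5) + rnorm(length(week_num))
fc <- pred_week_reg_sketch(value, week_num, period)
length(fc)
```

Because the model has one coefficient per combination of time of day and weekday, the forecast for each combination is simply the mean of its observations in the training window.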
Define MAPE (Mean Absolute Percentage Error) for evaluation of our forecast.
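One common way to write MAPE in R is sketched below (the function name mape and its argument names are assumptions). It presumes that no real value equals zero.

```r
# Mean Absolute Percentage Error in percent
mape <- function(real, pred) {
  100 * mean(abs((real - pred) / real))
}

# Relative errors of 10 %, 10 % and 0 % average to about 6.67 %
example_mape <- mape(real = c(100, 200, 400), pred = c(110, 180, 400))
example_mape
```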
Now we are ready to produce forecasts. I set the length of the training set to two weeks, which was chosen experimentally. In the experiments, the whole data set of the length of one year is used, so forecasts for 50 weeks will be produced. A sliding window approach is used for training. Let’s compute the forecasts for every type of industry (4) to see the differences between them:
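The sliding window procedure can be sketched as follows. This is an illustration on six weeks of synthetic data with assumed variable names, not the experiment on the yearly data set: train on two weeks, forecast the following week, move the window by one week and repeat.

```r
set.seed(3)
period   <- 48
n_weeks  <- 6
week_len <- 7 * period
slot <- rep(1:period, times = 7 * n_weeks)
wday <- rep(rep(1:7, each = period), times = n_weeks)
# Synthetic load: daytime peak on workdays only
value <- 100 + 30 * (slot > 16 & slot <= 36) * (wday <= 5) +
  rnorm(length(slot), sd = 3)

mape <- function(real, pred) 100 * mean(abs((real - pred) / real))

n_train_weeks <- 2
n_fc <- n_weeks - n_train_weeks   # number of one-week forecasts
err <- numeric(n_fc)

for (k in seq_len(n_fc)) {
  train_idx <- ((k - 1) * week_len + 1):((k + n_train_weeks - 1) * week_len)
  test_idx  <- ((k + n_train_weeks - 1) * week_len + 1):((k + n_train_weeks) * week_len)
  train <- data.frame(Load   = value[train_idx],
                      Daily  = factor(slot[train_idx], levels = 1:period),
                      Weekly = factor(wday[train_idx], levels = 1:7))
  fit <- lm(Load ~ 0 + Daily + Weekly + Daily:Weekly, data = train)
  new_week <- data.frame(Daily  = factor(slot[test_idx], levels = 1:period),
                         Weekly = factor(wday[test_idx], levels = 1:7))
  err[k] <- mape(value[test_idx], predict(fit, newdata = new_week))
}
err   # one MAPE per forecasted week
```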
Similarly, you can do this with the function predWeek from the previous post. I used the STL+ARIMA method for comparison with the MLR with interactions. Here is the plotly plot of the MAPEs:
For every industry MLR was more accurate than STL+ARIMA, so our basic regression method is working very well for double seasonal time series.
I have created 4 (IMHO) interesting GIFs with the package animation to show the whole forecast for a year. I have done it this way for each of the four industries:
Here are the created GIFs:
In these animations we can see that the behavior of the electricity consumption can be very stochastic and many external factors influence it (holidays, weather etc.), so it’s a challenging task.
The aim of this post was to introduce the most basic regression method, MLR, for forecasting double seasonal time series. Using regression analysis, we showed that including interactions of the independent variables in the model can be very effective. With the interactions, MLR forecasts were more accurate than those of the STL+ARIMA method for every industry.
In my next post, I will continue with the introduction of regression methods, this time with GAM (Generalized Additive Model).
Script for the creation of this whole post can be found on my github repo.