Predicting Energy Consumption (Part 2)


In Part 1 of this article, we looked at some introductory topics in the domain of time series analysis. Topics covered in Part 1 included exploratory analysis, visualizations, seasonal decomposition, stationarity and ARIMA models. This included an evaluation of autoregression, moving averages, and differencing for development of time series forecasting models.

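In Part 2, we will expand on what we covered in Part 1 by looking at some alternative time series forecasting methods, including simple exponential smoothing, the Holt-Winters method (triple exponential smoothing), LSTM neural networks, and Prophet.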

Code samples are included in this article, but a full notebook containing examples of each forecasting method can also be found at this GitHub repository.

Dataset Background

The dataset used for this analysis contains hourly energy consumption data provided by PJM Interconnection. PJM Interconnection is a regional transmission organization (RTO) that coordinates distribution of electricity across a region including all or parts of 14 states in the northeastern United States. The data we will be looking at is composed of hourly energy consumption data from Duquesne Light Co. from January 1, 2005 to August 3, 2018. Duquesne Light Co. serves the Pittsburgh, PA, metropolitan area as well as large segments of Allegheny and Beaver Counties. A copy of the dataset we will be using can be found here.

Data preprocessing included removal of duplicate values and imputation of missing values using interpolation. The code sample below shows how the data was processed before analysis.
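A minimal sketch of the preprocessing steps described above, using pandas. The column names `Datetime` and `DUQ_MW`, the helper name `preprocess`, and the tiny inline frame standing in for the real CSV are all assumptions made for illustration.

```python
import pandas as pd

def preprocess(df, time_col="Datetime", value_col="DUQ_MW"):
    """Return an hourly series with duplicates removed and gaps interpolated."""
    df = df.copy()
    df[time_col] = pd.to_datetime(df[time_col])
    # Keep the first reading for any timestamp that appears more than once.
    df = df.drop_duplicates(subset=time_col, keep="first")
    series = df.set_index(time_col)[value_col].sort_index()
    # Reindex onto a complete hourly range so missing hours show up as NaN...
    full_index = pd.date_range(series.index.min(), series.index.max(),
                               freq=pd.Timedelta(hours=1))
    series = series.reindex(full_index)
    # ...then fill them by interpolating linearly in time.
    return series.interpolate(method="time")

# Toy input: 01:00 is duplicated and 02:00 is missing (illustrative values only).
raw = pd.DataFrame({
    "Datetime": ["2018-01-01 00:00", "2018-01-01 01:00",
                 "2018-01-01 01:00", "2018-01-01 03:00"],
    "DUQ_MW": [1500.0, 1520.0, 1520.0, 1600.0],
})
clean = preprocess(raw)
```

In this toy example the duplicated 01:00 row is dropped and the missing 02:00 reading is filled with the value halfway between its neighbors.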

Previous Results

In Part 1 of this article, we looked at standard ARIMA models and their seasonal variations to see how well they perform when forecasting hourly energy consumption over a 48-hour period. Mean squared error (MSE) was calculated for each model based on predictions forecasted over the test horizon (8/1/2018, 12:00 AM — 8/3/2018, 12:00 AM). A summary of the results from this comparison is shown in the table below:

The SARIMA model that performed the best included first order autoregressive and moving average terms, differencing, and a seasonal period of 24. This period represents daily seasonality since our dataset includes hourly measurements. While an expanded seasonal period of 24 x 365 = 8,760 may capture seasonal variation at the annual level, implementing the model with a period this large is inefficient since the training time will be prohibitively long.

When evaluating ARIMA models, we used a one-step forecast approach. This approach forecasts a value for the next time step in the series. The forecasted value is stored in a list. Once an actual value has been observed for the next time step, this value is appended to the training time series. The model is then retrained and used to predict the value for the next unobserved time step. This process is repeated over a user-defined window of time. Under this forecasting method, the model is only predicting a single time step ahead at any given time.
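The loop described above can be sketched as follows. This is an illustration rather than the code used in Part 1: `fit_and_predict` is a hypothetical callable standing in for "train the model on the history and forecast one step", and a seasonal-naive rule on synthetic data is used in place of ARIMA so the example runs with NumPy alone.

```python
import numpy as np

def one_step_forecasts(train, test, fit_and_predict):
    """Walk forward through `test`, refitting on all data observed so far.

    `fit_and_predict(history)` is a hypothetical callable that trains a model
    on `history` and returns a forecast for the next time step only.
    """
    history = list(train)
    predictions = []
    for actual in test:
        predictions.append(fit_and_predict(np.asarray(history)))
        # The observed value becomes available and joins the training data.
        history.append(actual)
    return np.asarray(predictions)

def seasonal_naive(history, period=24):
    """Stand-in model: repeat the value observed one period earlier."""
    return history[-period]

# Synthetic hourly series with a daily cycle (not the PJM data).
hours = np.arange(24 * 10)
load = 1500 + 200 * np.sin(2 * np.pi * hours / 24)
train, test = load[:-48], load[-48:]

preds = one_step_forecasts(train, test, seasonal_naive)
mse = float(np.mean((test - preds) ** 2))
```

Because the model is called once per test observation, the cost of this approach grows with both the length of the forecast window and the time needed for a single fit.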

One-step forecasting is practical for fast-training models such as ARIMA. However, more complex modeling methods such as the ones covered in this article have longer training times. While it is still possible to perform one-step forecasting using these methods, it can be computationally expensive. For simplicity, we will therefore generate predictions over the full forecasting horizon for all of the models evaluated in this article.

We will continue to use MSE as our evaluation metric, but it is important to note that the prediction methodology used to develop the models covered in this article differs from the one-step approach used in Part 1. Generating forecasts using a one-step approach means that the model is exposed to a larger set of training data and is only responsible for forecasting a single time-step. As a result, this method of training should produce a lower MSE than a method that generates forecasts over an extended horizon. We can expect that our models may not appear to perform as well as standard ARIMA models on the surface, but it is difficult to make a direct comparison between the two prediction methodologies.

Exponential Smoothing

Exponential smoothing is a data forecasting method that incorporates past observations but puts greater emphasis on more recent observations. Future values are forecasted by taking a weighted average over a window of past values. The weight associated with each past observation decreases exponentially as the length of time between the observation and the forecasted value increases. Observations that occurred further in the past are still incorporated into the forecast, but they have less influence over the final forecasted value than more recent observations.

Simple Exponential Smoothing

Simple exponential smoothing refers to calculation of a forecast value using a linear combination of past observations whose weights are adjusted using a smoothing constant (ɑ). The equation used for generating forecasts using simple exponential smoothing is as follows:

This equation represents a weighted average of all of the previous observations in the time series. The larger the time difference between an observed value and the predicted value, the smaller the amount of weight given to the observed value when calculating the predicted value.
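Written out in the notation commonly used for this method (for example in Forecasting: Principles and Practice; the exact symbols are an assumption about the notation intended here), the weighted-average form is:

```latex
\hat{y}_{T+1|T} = \alpha y_T + \alpha(1-\alpha)\, y_{T-1} + \alpha(1-\alpha)^2\, y_{T-2} + \cdots,
\qquad 0 \le \alpha \le 1
```

With α = 0.5, for instance, the three most recent observations receive weights of 0.5, 0.25, and 0.125, so each step back in time halves an observation's influence.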

This approach is best applied to data that is not characterized by seasonality. Our energy consumption data contains clear seasonality on multiple scales (daily, weekly, and annually), which means that simple exponential smoothing is likely not a good method to use to develop forecasts. However, we will generate forecasts and look at the mean squared error (MSE) to confirm this assumption.

To demonstrate simple exponential smoothing, we will be using the SimpleExpSmoothing class provided by statsmodels. When using this class we are able to select a smoothing constant at the time of model fitting. Alternatively, this value will be optimized if it is not specified. For this example we will allow the model to optimize the smoothing constant to produce the best fit for our data.

Forecasting over our test timeframe using simple exponential smoothing produces an MSE of 57,206. This value indicates that simple exponential smoothing performs very poorly for our dataset. We can also see evidence of poor performance by looking at a plot of the predictions:

There are a number of reasons why simple exponential smoothing does not work well in this instance. First, simple exponential smoothing does not incorporate seasonality. Repeating seasonal patterns are therefore not accounted for in forecast values. Second, since simple exponential smoothing is simply a weighted average, it produces a flat forecast: every step in the forecast horizon receives the same value. While this can be useful for producing very basic forecasts, data containing seasonality and trend components like the hourly energy consumption data we are using may be too complex for simple exponential smoothing to produce a useful forecast value.
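To see why the forecast is flat, here is a hand-rolled sketch of simple exponential smoothing in NumPy. It is not the statsmodels implementation: the smoothing constant is fixed rather than optimized, the level is initialized at the first observation (one common choice), and the data is synthetic.

```python
import numpy as np

def simple_exp_smoothing_forecast(y, alpha, horizon):
    """Simple exponential smoothing with a fixed smoothing constant."""
    level = y[0]  # assumed initialization: start the level at the first value
    for obs in y[1:]:
        # New level = weighted average of the latest observation and old level.
        level = alpha * obs + (1 - alpha) * level
    # No trend or seasonal term exists, so every future step gets this level.
    return np.full(horizon, level)

# Synthetic hourly series with a daily cycle (not the PJM data).
hours = np.arange(24 * 7)
load = 1500 + 200 * np.sin(2 * np.pi * hours / 24)

forecast = simple_exp_smoothing_forecast(load, alpha=0.8, horizon=48)
```

All 48 forecast values are identical, no matter how strongly the input series oscillates.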

Holt-Winters Method (Triple Exponential Smoothing)

The Holt-Winters Method (also known as Triple Exponential Smoothing) also makes use of exponential smoothing, using a weighted combination of past observations to forecast future values and placing a greater emphasis on more recent observations than on older ones. One of the primary strengths of the Holt-Winters method is its ability to incorporate trend and seasonality into time series forecasting, in addition to the base level / average value of the time series. For more on those topics, please see Part 1 of this article.

When using the Holt-Winters method, exponential smoothing is applied to each of the three constituent components of the time series (trend, seasonality, and base level). This is the reason it is also known as “triple exponential smoothing.” Future values are forecast by combining the influences of each component.

A description of the technical details behind how the Holt-Winters method is implemented is beyond the intended scope of this article. Chapter 7.3 of Forecasting: Principles and Practice by Rob J. Hyndman and George Athanasopoulos provides an excellent description of the mathematics behind the Holt-Winters method as well as some examples using R.

Seasonality and trend may each be modeled in two different ways: additive or multiplicative. An additive method indicates the presence of relatively constant variations, while a multiplicative method indicates the presence of variations that are proportional to changes in the base level of the data.
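The difference between the two is easy to see on a synthetic series. In the sketch below (illustrative numbers only), the base level steps up each day; the additive series keeps the same daily swing, while the multiplicative series swings in proportion to the level.

```python
import numpy as np

hours = np.arange(24 * 3)
level = np.repeat([1000.0, 1500.0, 2000.0], 24)   # base level for days 0, 1, 2
pattern = np.sin(2 * np.pi * hours / 24)          # daily shape between -1 and 1

additive = level + 100 * pattern                  # swing is fixed at +/-100
multiplicative = level * (1 + 0.1 * pattern)      # swing is 10% of the level

def daily_range(series, day):
    """Peak-to-trough range of one 24-hour block."""
    chunk = series[day * 24:(day + 1) * 24]
    return chunk.max() - chunk.min()
```

Comparing `daily_range` for day 0 and day 2 shows an unchanged range for the additive series and a doubled range for the multiplicative one, mirroring the doubling of the base level.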


For our energy consumption dataset, we will be assuming an additive trend component and a multiplicative seasonal component when preparing a Holt-Winters model. These classifications were selected based on trial and error and evaluation of MSE, and it is important to try different combinations of component classifications when fitting a new dataset.

Since our data shows daily seasonality, we will select a seasonal period value of 24. This represents the number of hourly measurements observed in one day. While a seasonal period value of 24 * 365 may provide better performance since it captures annual seasonality, using this seasonal period dramatically increases the model training time.

We will be using the ExponentialSmoothing class from statsmodels to create our Holt-Winters model. The code below sets up a basic Holt-Winters model using our selected parameters, fits it to the training dataset, and generates forecasts over the test horizon:

Forecasting over our test timeframe using the Holt-Winters method produces an MSE of 14,644, significantly outperforming simple exponential smoothing. Clearly, incorporating seasonality and trend produces a large improvement. The plot below demonstrates that our predicted values using Holt-Winters follow the actual values very closely at times, indicating that our model performance has improved over simple exponential smoothing.

Implementing a one-step forecasting approach may offer significant improvements in prediction performance when using Holt-Winters. However, because the model must be refit at every step, this method of prediction produces a very long training time.

LSTM Neural Networks

To understand Long Short-Term Memory (LSTM) Neural Networks, we first have to understand Recurrent Neural Networks (RNNs). Understanding RNNs is simpler once we have a grasp of how a simple, fully-connected, or ‘vanilla’, neural network works. Vanilla neural networks take an input vector with a fixed size, passing it to one or more hidden layers of neurons where an activation function is applied to the dot product of the input vector and a vector of weights. Weight vectors are adjusted through backpropagation during the training process. From the hidden layers, information is passed to an output vector which is used to make a prediction based on the input vector. It is assumed that all elements of the input vector are independent of one another.

The architecture described above works well for tabular data and similar formats, but it does not work well with sequential or time series data. RNNs are designed to take sequential data as input. Since RNNs process input sequentially, they are able to accept inputs of any length.

The word recurrent is used to describe the fact that the same operations are performed on each element of a sequence of data. This differs from a traditional fully-connected neural network, which assumes that inputs are not dependent upon one another.

RNNs maintain a “memory” that accumulates information about each element of the sequence as it is processed. The figure below shows a typical RNN architecture. The representation on the left is “rolled” while the representation on the right is “unrolled.”

The RNN possesses a hidden state that is updated as each element of the sequence is processed. The same weight parameters are shared across the length of the sequence, which means that only one set of parameters needs to be learned. Outputs produced at each time step are based on the hidden state memory available at that time step, and they do not include any information provided by elements occurring later in the sequence.
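The hidden state update can be sketched in a few lines of NumPy. This is a bare-bones illustration of a basic (Elman-style) recurrence with randomly chosen weights, not a trained model; the names `W_xh`, `W_hh`, and `b_h` are our own labels for the input weights, recurrent weights, and bias.

```python
import numpy as np

def rnn_forward(inputs, W_xh, W_hh, b_h):
    """Run a basic RNN over a sequence and return the hidden state at each step."""
    hidden = np.zeros(W_hh.shape[0])
    states = []
    for x_t in inputs:
        # The same weights are reused at every time step; only `hidden` changes.
        hidden = np.tanh(W_xh @ x_t + W_hh @ hidden + b_h)
        states.append(hidden)
    return np.stack(states)

rng = np.random.default_rng(0)
n_features, n_hidden = 1, 4
W_xh = rng.normal(scale=0.5, size=(n_hidden, n_features))
W_hh = rng.normal(scale=0.5, size=(n_hidden, n_hidden))
b_h = np.zeros(n_hidden)

# Two sequences of different lengths that share their first five elements.
short_seq = rng.normal(size=(5, n_features))
long_seq = np.vstack([short_seq, rng.normal(size=(7, n_features))])

short_states = rnn_forward(short_seq, W_xh, W_hh, b_h)
long_states = rnn_forward(long_seq, W_xh, W_hh, b_h)
```

The same parameters handle both sequence lengths, and the first five hidden states are identical in both runs, since a state never depends on elements that come later in the sequence.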

In an RNN, each time step within the sequence is transferred to a hidden state representation. Each hidden state representation is passed to the next time step until the end of the sequence is reached. This way, information from the beginning of the sequence is passed through the sequence, giving the RNN a type of “memory” about the information that occurred previously in the sequence. While this memory transfer is useful, the basic RNN architecture has a difficult time maintaining a memory of events that happened far in the past.

LSTM neural networks are a type of RNN that are designed to improve the default RNN architecture’s ability to learn long-term dependencies between time steps that are separated by wide distances. Like a traditional RNN, LSTMs are composed of a repeating series of modules. Within each module are “gates” that can be used to manage the flow of information into the module’s memory state. By controlling the parameters of these gates, we can control the way information flows in and out of the LSTM memory state, allowing for improved learning of long-term dependencies. An example of a single LSTM module is shown below. For a more detailed explanation of how LSTMs work, check out this article.

To prepare our data for modeling using an LSTM, we will use the TimeseriesGenerator class provided by TensorFlow (Keras) for sequence preprocessing. We can use this class to convert a sequence of data points into individual batches that can be processed by an LSTM model. Parameters that we must specify to use the TimeseriesGenerator include the length of the output sequences (length), the period between each individual timestep within the sequence (sampling_rate), the number of timesteps between each output sequence (stride), and the size of each training batch (batch_size). You can read more about how to use the TimeseriesGenerator by reading the TensorFlow documentation.
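To make these parameters concrete, here is a NumPy-only sketch of the kind of windowing they control. It is an analog written for illustration, not the TensorFlow class itself; `make_windows` and `batches` are hypothetical helper names.

```python
import numpy as np

def make_windows(series, length, sampling_rate=1, stride=1):
    """Build (input window, next value) pairs from a 1-D series."""
    series = np.asarray(series, dtype=float)
    X, y = [], []
    for end in range(length, len(series), stride):
        # `length` sets the span of each window; `sampling_rate` thins it out.
        X.append(series[end - length:end:sampling_rate])
        # The target is the value immediately after the window.
        y.append(series[end])
    # Shape (samples, timesteps, features), the layout LSTM layers expect.
    return np.asarray(X)[..., np.newaxis], np.asarray(y)

def batches(X, y, batch_size):
    """Yield consecutive (inputs, targets) batches."""
    for start in range(0, len(X), batch_size):
        yield X[start:start + batch_size], y[start:start + batch_size]

X, y = make_windows(np.arange(10), length=3)
first_X, first_y = next(batches(X, y, batch_size=4))
```

For the series 0 through 9 with a window length of 3, the first sample pairs the inputs 0, 1, 2 with the target 3, and the last pairs 6, 7, 8 with the target 9.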

We will keep our LSTM relatively simple, using a single LSTM layer with 100 neurons followed by a dense layer with one output to produce our forecast value. We will be using an Adam optimizer, and we will train the model to minimize MSE for 80 epochs.

To forecast values using the LSTM, we will use a recursive approach. This means that we will predict one value at a time, appending each predicted value to the training set and using this appended data as input when generating the next prediction in series. This is a similar approach to one-step forecasting except we are appending a predicted value to our training dataset instead of an observed value. Also, we are only using this appended training data to generate predictions rather than fully retraining the model each time the training set is appended.
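The recursive loop can be sketched independently of the network itself. In the snippet below, `predict_next` is a hypothetical stand-in for a trained model's single-step prediction (for the LSTM, a call to its predict method on the most recent window), and a trivial "repeat the value from 24 hours earlier" rule on synthetic data is used so the example runs without TensorFlow.

```python
import numpy as np

def recursive_forecast(history, predict_next, window, horizon):
    """Forecast `horizon` steps by feeding each prediction back in as input.

    `predict_next(window_values)` is a hypothetical stand-in for a trained
    model's one-step prediction; the model is never refit inside the loop.
    """
    buffer = list(history[-window:])
    forecasts = []
    for _ in range(horizon):
        next_value = float(predict_next(np.asarray(buffer[-window:])))
        forecasts.append(next_value)
        # Append the prediction (not an observation) so it shapes later inputs.
        buffer.append(next_value)
    return np.asarray(forecasts)

def repeat_yesterday(window_values):
    """Toy model: with a 24-step window, the oldest value is 24 hours back."""
    return window_values[0]

# Synthetic hourly series with a daily cycle (not the PJM data).
hours = np.arange(24 * 5)
load = 1500 + 200 * np.sin(2 * np.pi * hours / 24)

forecast = recursive_forecast(load, repeat_yesterday, window=24, horizon=48)
```

The second day of the forecast is built entirely from the first day's predictions, which is why errors can compound as the horizon grows under this approach.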

Forecasts generated by our LSTM produced an MSE of 5,390, demonstrating improved performance over the Holt-Winters method we looked at previously. The plot below illustrates how values predicted using an LSTM compared to actual observed values. In general, the LSTM tended to overestimate hourly energy consumption, but the predicted values were very close to actual observations.

The LSTM used in this example can be improved to increase the model’s predictive capacity. Possible methods for model improvement include the incorporation of additional LSTM layers, differing numbers of neurons in each layer, and addition of regularization via dropout or other methods.


Prophet

Prophet is an open source procedure developed by Facebook for use in time series forecasting. A description of how Prophet was developed can be found in a paper issued by Facebook entitled Forecasting at Scale.

Prophet models time series data using three components: trend, seasonality, and events/holidays. Trend components are characterized by non-periodicity, while seasonal components are characterized by periodicity. Two methods are used for trend modeling: a nonlinear, saturating growth model and a linear model with changepoints.

A nonlinear, saturating growth model is characterized by logistic growth, meaning that the rate of growth within the system decreases as the system’s carrying capacity is approached. This type of model is used in the natural sciences to model ecosystem populations, which are typically characterized by a logistic growth curve that flattens out as a species reaches the maximum population that can be supported by the ecosystem. An example plot showing a logistic growth curve is shown below:
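A saturating growth curve of this kind can be sketched with the basic logistic function. The parameter names `capacity`, `rate`, and `midpoint` and the values used are illustrative assumptions, and the carrying capacity is held constant here for simplicity.

```python
import numpy as np

def logistic_growth(t, capacity, rate, midpoint):
    """Saturating growth: g(t) = C / (1 + exp(-k * (t - m)))."""
    return capacity / (1.0 + np.exp(-rate * (t - midpoint)))

t = np.linspace(0, 20, 201)
g = logistic_growth(t, capacity=100.0, rate=0.8, midpoint=10.0)
```

The curve rises fastest at the midpoint, where it reaches half of the carrying capacity, and then flattens as it approaches the capacity without exceeding it.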

This type of model is modified to allow for changes in the system’s carrying capacity over time, which may be more applicable to human systems. In our example, the population of an area or the capacity of the available power generation infrastructure may establish a carrying capacity for energy consumption. However, this capacity can change if the population of the area changes or if new infrastructure is constructed. Linear trend models are used by Prophet to forecast trends that do not exhibit saturating growth (i.e. trends that are not expected to flatten out as a carrying capacity is reached).

Prophet allows for selection of changepoints, which represent specific points in the time series model where the trend growth rate is modified, capturing the effects of nonlinear changes to the system’s trend. Changepoints can be specified by the user if they know specific times at which the behavior of the time series may have changed based on external influences. Alternatively, Prophet can implement automatic changepoint identification.

One large advantage of Prophet over other forecasting methods is that it is able to incorporate multiple periods of seasonality using Fourier series. As described in Part 1 of this article, the hourly energy consumption dataset exhibits not just annual seasonality but also daily and weekly seasonal patterns. ARIMA models and exponential smoothing methods only allow for incorporation of a single seasonal period, meaning that some seasonality is missed.
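The idea behind Fourier-series seasonality can be sketched as a set of sine and cosine features, one pair per harmonic, built separately for each seasonal period. This is an illustration of the concept in NumPy rather than Prophet's internal code, and the Fourier order of 3 is an arbitrary choice, not Prophet's default.

```python
import numpy as np

def fourier_features(t_hours, period_hours, order):
    """Return sin/cos columns for harmonics 1..order of one seasonal period."""
    t = np.asarray(t_hours, dtype=float)
    columns = []
    for n in range(1, order + 1):
        angle = 2.0 * np.pi * n * t / period_hours
        columns.append(np.sin(angle))
        columns.append(np.cos(angle))
    return np.column_stack(columns)

t = np.arange(24 * 14)                                       # two weeks of hours
daily = fourier_features(t, period_hours=24, order=3)
weekly = fourier_features(t, period_hours=24 * 7, order=3)

# Stacking the blocks lets one model carry several seasonal periods at once.
design = np.hstack([daily, weekly])
```

A model that fits one coefficient per column can then represent daily and weekly patterns simultaneously; adding an annual block works the same way with a period of 24 * 365.25 hours.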

Prophet also allows the user to provide a list of holidays and other events that may impact time series observations. This includes adjustment of a window of time surrounding each holiday/event during which its impact is still detectable. This allows the model to capture the impact a holiday may have on the preceding and following days.

The code below generates a basic Prophet model that assumes additive seasonality and incorporates effects from US holidays. Also, please note that the model’s trend flexibility is modified by adjusting the changepoint_prior_scale when defining a Prophet model. The default value of this hyperparameter is 0.05. Increasing this value increases the model trend’s flexibility, while decreasing this value decreases trend flexibility. In this example, we have increased the changepoint_prior_scale value to 0.5, increasing the flexibility of our modeled trend component.

Forecasts produced using the Prophet model had an MSE of 8,515. Predicted values were closer to observed values during periods when energy consumption was rising or falling, but were further from observed values at peaks.

Forecasting Method Comparison

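The table below compares the MSE observed for each of the four time series forecasting methods evaluated in this article:

Forecasting Method | MSE
Simple Exponential Smoothing | 57,206
Holt-Winters | 14,644
LSTM | 5,390
Prophet | 8,515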

The LSTM model performed best (MSE of 5,390), followed by the Prophet model (8,515) and the Holt-Winters method (14,644). Simple exponential smoothing performed worst (57,206) and may be more applicable to time series data that does not contain seasonality. The intention of this article was to provide a basic introduction to time series forecasting using a variety of methods. Predictive performance of each of the methods that were reviewed can likely be improved through additional tuning.

Thanks for Reading!

Please give this article a wave if you found it useful or interesting!

Water/Wastewater Engineer ♦ Data Nerd
