A demonstration of time series regression techniques: Features are created for use as inputs to a XGBoost machine learning process used to forecast per-store daily sales. This results in an R2 of over 93%, and is applicable to a wide variety of store types and volumes. Code available at Github.
This is a demonstration of how to use Pandas and SKLearn to make time series predictions for future time periods. The data source is the Kaggle competition Rossman Store Sales, which provides over 1 million records of daily store sales for 1,115 store locations for a European drug store chain. The challenge is to find a means to forecast future store sales performance across a wide variety of locations with limited information, other than past performance.
This project will utilize the Pandas dataframe format, and use Python programming and some great Pandas functionality to create time series features. These time series features are used in an XGBoost regression procedure to create a model that effectively forecasts across the broad range of locations and non-linear sales values. All tools used are open source, python-based frameworks, and the code is always available at my Github.
There are 1,115 different stores, with sales spanning 31 months. Individual stores have a range in average daily sales of nearly 8x from top to bottom, from 2,711 to 21,690. In this I will endeavor to build a single model that can operate on any of these stores.
The data are provided in two files. First is the large file of store/sales day records – over 1M records spanning 31 months. Second is a small set of demographic data at the store level showing how close the nearest competition is to that store, when the competition opened, when that store runs periodic promotions, and some flags regarding the store type and inventory assortment. These need to be merged back to the sales records, and some of the data need to be time-filtered as well. For example, if a competitor store is nearby in the Store data, but doesn’t open until halfway through the Sales data timeframe, we only want our feature to acknowledge the competition after it opens.
This starts with a left join on the store number between Sales and Stores, but some careful programming was needed to have the flags set only after the appropriate dates. My initial prototype of this used index-based looping and was not surprisingly too slow on a dataset of this size. What you can see at my Github makes better use of Pandas efficient numeric processing.
Time Series Data
Pandas has a number of time-series functions available that make the creation of rolling window function data quite easy. However in this project, the Sales dataset isn’t really one time series – it’s 1,115 of them concatenated together. I wanted each sales-day-store record to have access to that store’s individual time series history in the regression to follow, so I built a loop that extracted each store, calculated the time series features, and reassembled them into the master dataset. This let me create features like rolling 30- and 7-day mean sales, 30-day max sale values, as well as convenient flags representing whether the store was closed the prior day, or would be closed the next, by shifting a few of the binary flags already available at the store-sale-day level.
The Rossman data includes records for days the stores are closed. These were useful for inclusion in the data through the preprocessing to keep the data aligned in the 30- and 7-day rolling windows. However, it’s not necessary to keep them included in the dataset for learning, as we should be able to forecast that the sales of a closed store are zero. Therefore all closed records are dropped. In addition, I dropped the first thirty days of records, as that is the maximum window for the time series features, and the features inside that window may be unstable. The Rossman dataset is happily quite large, so we are still left with sufficient data for our learning exercise.
Machine Learning Approach
XGBoost was used with a linear regression objective function to perform the model building. First, I performed cross-validation to check that the parameters were appropriate, and estimate the number of rounds of boosting. Then I ran the XGBoost learner on all but the last 3 months of data, and used that model to predict those 3 months of held out data. How well did it do?
In the cross-validation procedure, the best test RMSE was 698.02. However, we need to keep in mind that this error metric was produced with a random split of the training data. While it is a valid measure, in that we are measuring the predictive power of the model on data that the training algorithm hasn’t seen, it’s not a realistic measure of a forecasting solution. To get this, we will need to train on a dataset of earlier occurrences, and test on later occurrences that were held out of the training.
When we do this, by training on 27 months of data, and predicting the following 3 months, the model produced a RMSE of 761.07, and an R2 of .938. This means that the cross-validation was a good directional measure of a true time series train-test experiment, and that over 93% of the variance in the testing data is explained by the model.
Below are three charts, comparing the actual sales (blue) with the predicted values (red). As one of the original goals of this analysis was to have a single model that effectively can forecast across the wide range of store volumes in this dataset, I have graphed the smallest store by sales, the largest store, and one right in the middle of the distribution. We can see that for each store, the trends and changes in sales volume from day to day are correctly forecasted, +/- normal forecast error.
Strictly speaking, the charts above simulate the performance of a forecasting process that predicts the next day’s sales volume, using everything from the current day and the prior 30 days. While this would be useful in a business process, a likely improvement would be to build an iterative bootstrap process. This type of approach could produce forecast out many days in the future, by recalculating model inputs using forecasts of sales for the interim time periods. Additionally, one could use the modeling process as shown, but change the time series inputs to provide “padding” to allow for the future forecasting period (for example, use days 30-60). Both processes would be expected to lose forecast accuracy as the window moves out in time, but this is par for the course in any forecasting process.
Thanks – let me know what you think!