See into the Future with Time Series Data

Quick Summary

A demonstration of time series regression techniques: Features are created for use as inputs to a XGBoost machine learning process used to forecast per-store daily sales.  This results in an R2 of over 93%, and is applicable to a wide variety of store types and volumes.  Code available at Github.


This is a demonstration of how to use Pandas and SKLearn to make time series predictions for future time periods. The data source is the Kaggle competition Rossman Store Sales, which provides over 1 million records of daily store sales for 1,115 store locations for a European drug store chain.  The challenge is to find a means to forecast future store sales performance across a wide variety of locations with limited information, other than past performance.

This project will utilize the Pandas dataframe format, and use Python programming and some great Pandas functionality to create time series features.  These time series features are used in an XGBoost regression procedure to create a model that effectively forecasts across the broad range of locations and non-linear sales values. All tools used are open source, python-based frameworks, and the code is always available at my Github.

There are 1,115 different stores, with sales spanning 31 months.  Individual stores have a range in average daily sales of nearly 8x from top to bottom, from 2,711 to 21,690.  In this I will endeavor to build a single model that can operate on any of these stores.


Feature Engineering

The data are provided in two files.  First is the large file of store/sales day records – over 1M records spanning 31 months.  Second is a small set of demographic data at the store level showing how close the nearest competition is to that store, when the competition opened, when that store runs periodic promotions, and some flags regarding the store type and inventory assortment.  These need to be merged back to the sales records, and some of the data need to be time-filtered as well.  For example, if a competitor store is nearby in the Store data, but doesn’t open until halfway through the Sales data timeframe, we only want our feature to acknowledge the competition after it opens.

This starts with a left join on the store number between Sales and Stores, but some careful programming was needed to have the flags set only after the appropriate dates.  My initial prototype of this used index-based looping and was not surprisingly too slow on a dataset of this size.  What you can see at my Github makes better use of Pandas  efficient numeric processing.

Time Series Data

Pandas has a number of time-series functions available that make the creation of rolling window function data quite easy.  However in this project, the Sales dataset isn’t really one time series – it’s 1,115 of them concatenated together.  I wanted each sales-day-store record to have access to that store’s individual time series history in the regression to follow, so I built a loop that extracted each store, calculated the time series features, and reassembled them into the master dataset.  This let me create features like rolling 30- and 7-day mean sales, 30-day max sale values, as well as convenient flags representing whether the store was closed the prior day, or would be closed the next, by shifting a few of the binary flags already available at the store-sale-day level.

The Rossman data includes records for days the stores are closed.  These were useful for inclusion in the data through the preprocessing to keep the data aligned in the 30- and 7-day rolling windows.  However, it’s not necessary to keep them included in the dataset for learning, as we should be able to forecast that the sales of a closed store are zero.  Therefore all closed records are dropped.  In addition, I dropped the first thirty days of records, as that is the maximum window for the time series features, and the features inside that window may be unstable.  The Rossman dataset is happily quite large, so we are still left with sufficient data for our learning exercise.


Machine Learning Approach

XGBoost was used with a linear regression objective function to perform the model building.  First, I performed cross-validation to check that the parameters were appropriate, and estimate the number of rounds of boosting.  Then I ran the XGBoost learner on all but the last 3 months of data, and used that model to predict those 3 months of held out data.  How well did it do?

In the cross-validation procedure, the best test RMSE was 698.02.  However, we need to keep in mind that this error metric was produced with a random split of the training data.  While it is a valid measure, in that we are measuring the predictive power of the model on data that the training algorithm hasn’t seen, it’s not a realistic measure of a forecasting solution.  To get this, we will need to train on a dataset of earlier occurrences, and test on later occurrences that were held out of the training.

When we do this, by training on 27 months of data, and predicting the following 3 months, the model produced a RMSE of 761.07, and an R2 of .938.  This means that the cross-validation was a good directional measure of a true time series train-test experiment, and that over 93% of the variance in the testing data is explained by the model.

Below are three charts, comparing the actual sales (blue) with the predicted values (red).  As one of the original goals of this analysis was to have a single model that effectively can forecast across the wide range of store volumes in this dataset, I have graphed the smallest store by sales, the largest store, and one right in the middle of the distribution.  We can see that for each store, the trends and changes in sales volume from day to day are correctly forecasted, +/- normal forecast error.

This slideshow requires JavaScript.

Next Steps

Strictly speaking, the charts above simulate the performance of a forecasting process that predicts the next day’s sales volume, using everything from the current day and the prior 30 days.  While this would be useful in a business process, a likely improvement would be to build an iterative bootstrap process.  This type of approach could produce forecast out many days in the future, by recalculating model inputs using forecasts of sales for the interim time periods.  Additionally, one could use the modeling process as shown, but change the time series inputs to provide “padding” to allow for the future forecasting period (for example, use days 30-60).  Both processes would be expected to lose forecast accuracy as the window moves out in time, but this is par for the course in any forecasting process.


Thanks – let me know what you think!


Computer Vision and Extreme Boosting

Quick Summary:  A demonstration of computer vision techniques to create feature vectors to feed an XGBoost machine learning process which results in over 90% accuracy in recognition of the presence of a particular invasive species of plant in a photograph.


This is an example of how to combine computer vision techniques with supervised machine learning to create a system that can recognize a class of objects.  The data source is the Kaggle competition Invasive Species Monitoring, which provides 2,296 color photographs of plants in a natural habitat.  The challenge is to find an automated means of classifying whether the photograph contains hydrangea plants.  While these plants are sold as ornamentals, in some parts of the US they thrive as an invasive species, and their progress and growth needs to be monitored.



Much of the focus on classifying photographs in machine learning today uses deep learning techniques, a class of computationally intensive techniques, usually requiring specialized hardware which are showing some really amazing results.  However, having just completed a Georgia Tech’s Computer Vision class as part of my Master’s program, I wanted to see how these techniques would fare in practice.

What you’ll see in this post is the transformation of the input photographs into a more traditional, and far more compact feature matrix.  Then, these features are used in an XGBoost classification process to create an effective model that can recognize the presence of a hydrangea plant in new photos.  All tools used are open source, python-based frameworks, and my scripts are available at my Github.

Here’s an example of one of the positive input photographs:



Feature Engineering


Due to the long runtime for feature engineering, the creation of features is separated from the running of the models.  Feature engineering scripts are here.

Four main classes of features are created.

  1. Color histogram distance
  2. Custom-coded blue detector
  3. Template matching intensity squared differences

After reviewing some of the training images of both positive and negative classes, it was apparent that in many of the positive cases, the hydrangea plants could be recognized by their large flowering structures.  These flowers tended towards blue at their peak, but could vary in color if the plant in question was pre- or post-flowering, or simply if the brightness of the image “washed-out” most of the color.  So I set out to build features that would indicate the presence of the unique shape and color of the flowering bodies.

First, I needed to find positive examples within the images that would highlight the presence of the invasive hydrangea.  I selected 8 training images, and manually cut out patches that represented a good feature (18 training patches total).  Clearly, there is a lot of subjectivity in this step, and I am by no means an expert in these flowers, but I needed to start somewhere.

Here’s a couple examples of the patches used as training images for the next steps:




Structure of images

I used OpenCV for most of the image handing.  Within OpenCV, the typical representation of an image is a 3-layer Numpy array.  Each layer represents the intensity of blue, green and red values at each pixel location.  When viewing the image, we only observe a single pixel, but the rendering is built using these three values.  Therefore, we can directly used these color values as a source of information.

Color Histogram Distances

For each of the training patches, I created a histogram of the patch’s BGR (blue, green, red) values.  I then broke each whole image into 60×60 segments and calculated the histogram for that patch.  After excluding low values (I didn’t want to match on dark segments of the photo), and normalizing these histograms to a consistent range, I calculated and recoded the Euclidian distances between then.  Therefore, if one of the 60×60 input segments has a very similar distribution of color values to one of the input templates, the Euclidean distance would be low, and vise versa.  The histogram approach is robust to small differences in scale, orientation or rotation.  As the goal of this analysis is to detect the presence of hydrangea, without needing to localize it within the image, the features created are both the minimum distance between a template and the image, and the number of 60×60 segments below various threshold levels.

OpenCV supports multiple color spaces, which are different ways of encoding color and intensity other than the traditional BGR layers.  I also converted the templates and the images into the HSV color space, and built a series of histogram distance features using the H (hue) and S (saturation) channels, to see if this approach might be more robust to the variation in lighting.


Custom-Coded Blue Detector

To develop a bit of an intuitive feel for the raw photographic data, I opened up several images in GIMP, which allows to easily navigate parts of the image to explore the underlying values.  It was apparent that the relationships between the BGR values on a full-bloomed and properly exposed image of a hydrangea flower were unique – very high blue values and mid-to-low green and red. So I built a quick function in Numpy, working directly with the photo’s matrix values that counted the number of pixels in each segment where there were very high blue values with low green and red, and recorded the highest proportion found in each image.


Template Matching Intensity

OpenCV has a powerful function, cv2.matchTemplate(), which will take a smaller image, and slide it across a larger image, calculating the differences for each position.  This approach has it yield the normalized squared differences between the two vectors, but there are several metrics to choose from.  Here, the program uses the smoothed greyscale versions of the patches to speed computation, and as the first two metrics are already analyzing the color content well.  This function is very fast, but the approach suffers if the scale or orientation of the being tracked is too different.


Features Created

This approach produced a series of features for each input template, plus one overall measure for the blue detector, resulting in 165 independent features for use in machine learning.  This process takes a long time to run (several hours on my MacBook), but once it’s completed, we have a very compact set of features (less than 3.5 Mb) that represents 2.3 Gb worth of jpegs.


Machine Learning Approach

Code for the ML steps are here, and assume that features have already been run and saved as above.

For this project, I’ll use XGBoost (Extreme Gradient Boosting), the library that has been at the top of so many Kaggle competitions since it came out.  It’s fast and powerful, but very sensitive to the parameter settings.  Initially I used CVGrid and RandomForestClassifier (RFC), just to make sure I was on the right track with the features.  RFC is very nice in that it generally produces good results with minimal tuning, so it provided a quick check that things were working the way I wanted them to.

Once I moved to XGBoost, I first used its cross-validation function to tune the parameter settings.  For this project, I chose AUC (“area under the curve”) as the evaluation metric, rather than a simpler accuracy rate.  AUC is sensitive to the strength of the recommendation, and better accounts for the fact that some classifications could be randomly right in any sample.  It’s both more stringent than accuracy rate, and more finely calibrated.

Once I arrived at my parameter settings, tuning the number of booster iterations, level of regularization, learning rate and random sampling, I split the data 80/20 into train and test segments.  This would be expected to have similar AUC measurements as the cv-experiments, but will let us calculate a final accuracy rate and create a ROC chart.



The AUC matrix in cross-validation was .954, which was very encouraging.  I then split the data into train/test samples and built the XGBoost model on the training set with the parameters defined in the cross-validation experiment.  When I used this model to predict the probability of the presence of a hydrangea plant on the holdout testing sample, it resulted in an AUC of .971.  Using these probabilities to make firm predictions resulted in an accuracy rate of 90.4%, with the breakdown below.

Correctly classified, no hydrangea:141
Correctly classified, with hydrangea:273
Incorrectly classified, no hydrangea:24
Incorrectly classified, with hydrangea:20


Next Steps

To improve the performance of this process, I would suggest two areas for next steps:

Input feature patches:  Experimentation with what feature patches work best, and manual selection/cleaning of the patches should provide better accuracy.  I wanted to build this as a minimum viable product to see what kind of results the algorithm could produce, so the input patches are pretty simple selections without cleaning of extraneous features, beyond a little simple Gaussian blurring.

Parallelization of feature creation: Creation of the feature vector is the long pole in this process, but clock time could be significantly improved by parallelizing the image processing.  Individual images have no bearing on each other in the image processing, so there is no reason this couldn’t be parallelized, either on a single computer, or across a cluster, for a significant speedup.


Hockey Alumni Analysis: The Results

It always hurts when your team is scored on – it’s that much worse when the opposing player used to be on your favorite team!


This is Part 1 of a two-part post, which will answer the question of why it seems that often while watching NHL hockey, your team is playing a former teammate.  This part will focus on the results of the analysis.  Part 2 will focus on the technical details of how data science was used to create this analysis.  For now, I’ll mention that the data used is a snapshot, based on Wikipedia, and is a point-in-time analysis.  If teams move their players up from their farm teams, or lose players to injury, it won’t be reflected here.


My hometown team is the Buffalo Sabres.  After last being a playoff contender in 2011, the Sabres have struggled competitively, and had changes in ownership, coaching and most of the players.  It seemed that lately the Sabres were continually matched up against an alumni player (someone who was formerly a Sabre).  I wondered how often this occurs (or is it just noticeable when someone traded away previously gets a goal against the Sabres)?  What team had the most former players active on other teams?  Was there a team that didn’t have any former players active?


It turns out, there’s quite a wide range, and the Sabres are not the top of the stack in active alumni.  It ranges from a low of 8 players for the Detroit Red Wings to 33 for the Chicago Blackhawks.  The Buffalo Sabres come in at 27 in the top 10.  Almost half the teams have enough players to field an opening day roster, currently playing elsewhere (although some players are counted for multiple teams as they move around a lot).


Some teams have more than one alumni playing for a particular team, such as the Winnipeg Jets, who have 4 former Sabres on their roster at the time of this writing.  This means that the number of actual teams with an alumni play that your team might play against is lower than the total number of players outstanding.  The Pittsburgh Penguins edge to the lead here with 21 teams staffing a former Penguin.  Detroit is, not surprisingly, at the bottom with only 6 teams playing a former Red Wing.  Buffalo is a little further down the ranking, possibly due to the heavy placement at Winnipeg, with 16.


During the 82 game regular season, a team will play certain teams more often that others as the schedule is designed around a hierarchy where a team plays within-division and within-conference teams more often than without- teams.  By combining the 2016-2017 regular season schedule with the presence of at least one alumni, one can quantify how often this will actually occur.  Here the Penguins are the leader, with well over half of their games playing against a former Penguin.  In fact, over half of the NHL teams are scheduled to play a former play half the time, with the Sabres, the Penguins and 16 other teams scheduled with 41 or more regular season games.


With the anticipated activity around the spring trade deadline, the numbers will shift around a little.  The addition of the Las Vegas Golden Knights as an expansion team will be a point of inflection as well, due to the expansion draft process.

So in summary, it turns out that our Buffalo Sabres are not the most prolific feeder team to the league.  But, if you put the Sabres together with the Pittsburgh Penguins, then every other team in the NHL but the Dallas Stars and Tampa Bay Lightning have one of our rust belt alumni playing for them.  And the Stars have Lindy Ruff as Head Coach, former Sabre both as a player and as a coach, which we’ll count for this analysis.  So, on behalf of the Sabres and the Pens to the rest of the NHL, you’re welcome.  And to Tampa Bay, maybe after the next round of trades you’ll get on board and simplify the Venn diagram below.


Total counts are higher in the Venn than show above, as this includes the fact that the Buffalo Sabres is of of course full of current Buffalo Sabres, and same with the Penguins.


The Python code used to create this analysis is available at my Github.  I’ll have a post up shortly explaining the web scraping and data analysis in detail.  Go Sabres!



Free Machine Learning eBook!

Andrew Ng’s early access copy of his new book Machine Learning Yearning is now available!  You can sign up for it at the below link.  Add your email to the list and you’ll get a link to download the book – I just downloaded my copy.  Andrew is an influential driver of machine learning and AI, and also has a talent and passion for teaching other, so I’m looking forward to digging into this early access copy.

Who is Andrew Ng?  He’s Chief Scientist at Baidu, co-founder at Coursera, author of a Machine Learning MOOC I mentioned in an earlier post, and professor at Stanford.


Computer Science Education Week

It’s Computer Science Education week, Dec 5-11!  I thought I’d share a few of the resources I’ve found helpful in my own data science journey.  All offer free or low cost options for you to pursue your own interests.

Data Science Education resources:

Udacity:  I’ve been a Udacity user since the early pilot days.  Their content has grown significantly over time, and they continue to experiment with innovative ways of packaging learning.  You can get free access to the course materials, or enroll in various guided learning experiences.  There are a few great ones that will get you started with python programming, and cover essentials such as GitHub.  You can also get access to course materials for Georgia Tech’s online masters program – more on this below.

Coursera: Another major MOOC platform, which has lots of course content, both technical and non-technical available.  If you’re interested in data science, Andrew Ng’s classic machine learning course is a must-do.  It uses Matlab/Octave for the programming assignments, which is a little non-standard, but Octave is free and it gives you some great insights into how machine learning algorithms work.

iTunes U: With apologies to Android users, iTunes U hosts a wide variety of lecture content, so you can find something on whatever topic you’re seeking.  I’ve used it to supplement many other courses.  Sometimes when you’re having trouble understanding a topic, you just need to hear it from a different angle.  I most often have accessed lectures published by MIT’s OCW (Open CourseWare) initiative, and you can get everything directly from their website as well.

Georgia Tech OMSCS: Ready to take the plunge for an accredited CS masters?  I’m currently studying with Georgia Tech, specializing in Machine Learning and Computational Perception.  This is a fully accredited masters program with a competitive application process and a great deal of academic rigor – it’s one of the top 10 CS schools!  This program has been a fantastic experience so far, and uses the Udacity platform to deliver lectures.  They’ve even taught artificial intelligence, using an AI bot!

There’s never been a better time to learn about data science – start or continue your journey today!