Azure Machine Learning (AML) is an exciting technology in the Microsoft Data Platform that is on the radar of more and more organisations. AML enables organisations to take their first step towards predictive analytics by offering a PaaS service with a simple but powerful user interface that supports both simple and complex workflows. There are many use cases for the technology, whether you are tackling regression, classification or clustering problems. This article focuses on the use case of predicting sales amounts using regression analysis.
So, you have made a start with AML and run a few experiments, but how can you get the best results from the technology? There is no quick answer that works in every case, but this article describes the considerations you should give when tackling an AML problem:
- Data Quality & Preparation
- Feature Selection
- Feature Engineering
- Algorithm Selection
- Tuning the model
- Increasing data volume & retraining
Before discussing these considerations, it is worth mentioning that one of the most important things you can do at the start of any Azure Machine Learning project is to set expectations within your organisation on what AML can achieve. Azure Machine Learning gives a prediction. Whilst you can improve the accuracy of that prediction by considering the topics below, it will never be 100% correct. Set this expectation from the start.
To get the best results from AML regression analysis, it is important that your training data set contains meaningful features for the model to consider – features that help to describe the scenario you are trying to predict. Often your starting point will be working with internal data sets. Model prediction performance can be significantly increased by including additional features in the data set, from either additional internal or external data sets. The types of data set to use will vary with the use case, but suggestions for enriching your data include:
- Demographic Data (Population Age, Income profile, split of young professionals/families/retired)
- Housing Data (Types of Home, Average House Price)
- Social Media Data (Twitter, Facebook, Yammer)
- Weather Data (type of weather, deviation of the temperature to the average for that day)
- Public Holidays
- External Events (Local/National/Sporting)
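As a sketch of what this enrichment looks like in practice, the snippet below joins a daily sales table to a weather feed and a public-holiday calendar using pandas. All column names and values here are illustrative assumptions, not a prescribed schema; AML itself supports running Python scripts like this.

```python
import pandas as pd

# Hypothetical daily sales data (column names are illustrative only).
sales = pd.DataFrame({
    "date": pd.to_datetime(["2023-07-01", "2023-07-02", "2023-07-03"]),
    "store_id": [1, 1, 1],
    "sales_amount": [1200.0, 950.0, 1430.0],
})

# Hypothetical external weather feed keyed on the same date; the deviation
# from the seasonal average is often more useful than the raw temperature.
weather = pd.DataFrame({
    "date": pd.to_datetime(["2023-07-01", "2023-07-02", "2023-07-03"]),
    "temp_deviation": [2.5, -1.0, 0.3],
})

# Hypothetical public-holiday calendar.
holidays = {pd.Timestamp("2023-07-03")}

# Left-join the external features onto the internal sales data.
enriched = sales.merge(weather, on="date", how="left")
enriched["is_public_holiday"] = enriched["date"].isin(holidays)
```

The same pattern extends to demographic or housing data: anything you can key on a shared attribute (date, postcode, store) can become an extra feature column.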
Data Quality & Preparation
Once you have meaningful data in your data set, it is then important to ensure its quality by preparing the data for machine learning.
- Ensure the data is clean (no corruption)
- Ensure the data is consistent (are you using master data?)
- Missing data – how are NULL values treated? (a feature with over 40% missing values is a candidate for rejection)
- How do we deal with outliers in the data (e.g. values more than 1.5x the interquartile range)?
- Are data types being converted for optimal AML processing?
- Remove duplicate rows
- Check values are in the expected range
- Profile your data (min, max, mean, median, standard deviation, number of rows, number of distinct values)
- Investigate class imbalance – should we synthetically add values?
AML has great inbuilt tools to help visualise the incoming data set, producing statistics for each column and histograms showing the distribution of the data. It may also be useful to use tools such as Excel or Power BI to visualise the data and determine whether it is representative of the application it is being used for.
Feature Selection

Feature selection is where we restrict which columns of data are used by the model. Too few features may result in a poor prediction, as you are not giving the model enough data to describe the influencers of the prediction. Too many columns can lead to what is known as the curse of dimensionality, which in turn can cause overfitting.
The Curse of Dimensionality
Here is some advice when considering feature selection:
- Aim to reduce features where possible
- Remove any features that have no effect on the model
- Examine correlation between features and remove features that are highly correlated with one another
- Visualise the data with AML, and use the built-in statistics
- Use the Filter Based Feature Selection module in AML, which helps identify the columns most likely to describe the data
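Outside the AML designer, the same filter-based idea can be sketched with scikit-learn: score each feature against the target and keep only the strongest. This is an open-source analogue of the AML module, not the module itself, and the synthetic data set is purely illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic regression data: 10 features, only 3 of which are informative.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=0.1, random_state=0)

# Score every feature against the target with an F-test and keep the top 3,
# roughly analogous to AML's Filter Based Feature Selection module.
selector = SelectKBest(score_func=f_regression, k=3)
X_reduced = selector.fit_transform(X, y)

retained = selector.get_support()  # boolean mask of the kept columns
```

Choosing `k` is itself a judgement call; in practice you would compare model accuracy at several values rather than fix it up front.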
Feature Engineering

Feature Engineering is where we construct new features so that the data set better describes the problem. Data domain knowledge is required here. Feature Engineering includes such tasks as:
- Bringing in external data
- Decomposing categorical values e.g. colour – would a binary “has colour” feature be useful, as well as a colour feature?
- Decomposing date/time features – day of the week, hour of the day, public holiday
- Reframing numerical quantities – change a rate into a time and amount
- Classifying/Binning continuous data into category data if this makes more sense
- Considering PCA (Principal Component Analysis) to reduce the number of features. This is a statistical procedure which converts the training data set into a set of linearly uncorrelated variables called principal components
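Two of the tasks above can be sketched briefly: decomposing a date/time column into model-friendly parts, and running PCA over correlated columns. The data below is an illustrative assumption; the pandas and scikit-learn calls are standard.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Decompose a date/time feature into parts a model can use directly.
df = pd.DataFrame({"timestamp": pd.to_datetime(
    ["2023-12-25 09:00", "2023-12-26 17:30"])})
df["day_of_week"] = df["timestamp"].dt.dayofweek  # Monday = 0
df["hour_of_day"] = df["timestamp"].dt.hour

# PCA: project correlated features onto linearly uncorrelated components.
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 1))
X = np.hstack([base,
               base * 2 + rng.normal(scale=0.01, size=(100, 1)),  # near-duplicate
               rng.normal(size=(100, 1))])
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
```

Because the first two columns are almost perfectly correlated, two principal components capture nearly all of the variance in the three original columns.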
Algorithm Selection

When looking at any problem you are trying to solve with AML, it is important to evaluate and score multiple models. The regression model that gives the best prediction may vary as you run your experiments, for example as the data set expands with additional features from other data sources or as the volume of training data increases. For regression analysis where you are making a prediction, the following (non-exhaustive) list of models should be considered:
- Linear regression
- Decision Tree regression
- Boosted Decision Tree regression
- Decision Forest regression
- Neural Networks
Consideration should also be given to using additional models offered by R/Python that aren't included in AML. Azure Machine Learning includes support for running R and Python scripts.
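The evaluate-multiple-models workflow can be sketched with scikit-learn, whose estimators are rough open-source counterparts of the AML modules listed above (the pairing is an assumption of this sketch, not an exact equivalence). Cross-validation gives each model a comparable score on the same data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for a sales training set.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# Approximate counterparts of the AML regression modules listed above.
models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree Regression": DecisionTreeRegressor(random_state=0),
    "Boosted Decision Tree Regression": GradientBoostingRegressor(random_state=0),
    "Decision Forest Regression": RandomForestRegressor(random_state=0),
}

# Score every model with 5-fold cross-validation on the same data.
scores = {name: cross_val_score(m, X, y, cv=5, scoring="r2").mean()
          for name, m in models.items()}
best = max(scores, key=scores.get)
```

The point is not which model wins on this synthetic data, but that the comparison is repeatable: rerun it whenever the data set changes and the ranking may change with it.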
Tuning the Model
For the regression models, each model has parameters that affect the predictions it makes. These are called hyperparameters. The accuracy of the model can be greatly enhanced by varying these parameters to suit the data set. The problem then becomes one of permutations – if you are evaluating multiple models, each with its own set of parameters, the number of experiments required increases very quickly. Azure Machine Learning comes to the rescue here with the parameter sweep module. When included in your workflow, this allows AML to vary the parameters of each model to automatically tune the hyperparameters for best effect.
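A grid search in scikit-learn illustrates what a parameter sweep is doing under the hood: it trains and cross-validates one model per combination in a hyperparameter grid and keeps the best. The grid values below are illustrative assumptions, not recommended settings.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, n_features=6, noise=5.0, random_state=0)

# Hypothetical hyperparameter grid: 2 x 2 x 2 = 8 combinations,
# each cross-validated -- this is why permutations grow so quickly.
param_grid = {
    "n_estimators": [50, 100],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}

sweep = GridSearchCV(GradientBoostingRegressor(random_state=0),
                     param_grid, cv=3, scoring="neg_mean_absolute_error")
sweep.fit(X, y)

best_params = sweep.best_params_  # the winning combination
```

Randomised search is a common alternative when the grid grows too large to sweep exhaustively.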
Increasing the Data Volume & Retraining
It is important to understand that once you have developed a model, and you are happy with the output, that this is not the end – the model will need to be maintained. The model will need to evolve over time, whenever there is significant business or data change, for example. The frequency of this will depend on the data set and the type of prediction, but it is important to constantly monitor the accuracy of the model. Feature selection and engineering may need to be revisited, as may the choice of model you have deployed.
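Monitoring can start very simply. The sketch below (plain Python, with an illustrative threshold that you would tune to your own use case) tracks mean absolute error over a rolling window of recent predictions and flags the model for retraining once the error drifts past its training-time baseline.

```python
def rolling_mae(actuals, predictions, window=30):
    """Mean absolute error over the most recent `window` observations."""
    pairs = list(zip(actuals, predictions))[-window:]
    return sum(abs(a - p) for a, p in pairs) / len(pairs)

def needs_retraining(actuals, predictions, baseline_mae, tolerance=1.2):
    """True when recent error exceeds the training-time baseline by 20%.

    The 20% tolerance is an illustrative assumption, not a standard value.
    """
    return rolling_mae(actuals, predictions) > baseline_mae * tolerance

# Example: recent error has drifted well above a baseline MAE of 10.0.
actuals = [100, 102, 98, 105, 110]
predictions = [90, 120, 80, 130, 85]
print(needs_retraining(actuals, predictions, baseline_mae=10.0))  # True
```

In AML this check would typically run on a schedule against the scored output of the published web service, triggering the retraining workflow when it fires.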
Machine Learning Workflow Tips
Finally, some tips on using Azure Machine Learning.
Use separate Production and Test/Dev Azure Machine Learning Workspaces to separate experiments from the production model.
Run experiments in batches against multiple models, and then capture and compare all models.
This results in a more automated approach to testing the permutations of using multiple models.

Document everything – the data set, methodology used, results etc. – to build up a knowledge base of each experiment performed: what was being tested and how, capturing a history/change log of what has been done.