An overview of Machine Learning concepts

This article looks to summarise some of my learnings from a semester studying Machine Learning.

I recently completed a graduate semester of Machine Learning at Georgia Tech as part of my Master of Computer Science. It was the subject I was looking forward to most, as I wanted a deeper understanding of the algorithms involved after some years of practical experience.

When it comes to using machine learning algorithms, I heard an analogy I quite like involving the difference between cooks and chefs. A cook can follow a recipe, but a chef knows more deeply why to follow the recipe, when to make substitutions, and what their impacts will be. The use of machine learning algorithms is similar: provided you know enough of the basic principles that you don’t get ‘burnt’, an analyst can apply the algorithms and examine the results.

The course covered the three major branches of machine learning – supervised learning, unsupervised learning, and reinforcement learning.

Supervised Learning
Supervised learning is where we use structured data to understand the relationship between many variables and a single variable. The most common example market researchers would be familiar with is driver analysis, where we look to understand how features of a customer’s experience affect their overall satisfaction, customer value, or retention. Supervised learning usually emphasises predictive strength (the ‘R squared’ component), that is, building a strong model that can generate a prediction when provided with a new record. Prediction of this kind is more common when using a text analysis model for sentiment or classification. Some of the more famous algorithms include ‘deep learning’ and ‘neural networks’, but the family also includes the humble ‘linear regression’ and ‘decision tree’.
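As a minimal sketch of the idea, here is a one-feature driver analysis using ordinary least squares, written with the Python standard library only. The ratings are hypothetical numbers I made up for illustration, and the names `fit_line` and `predict` are my own rather than part of any library.

```python
# Hypothetical (service rating, overall satisfaction) pairs; not real survey data.
service = [3, 5, 6, 7, 8, 9, 10]
satisfaction = [2, 4, 5, 7, 7, 9, 9]


def fit_line(x, y):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    variance = sum((a - mean_x) ** 2 for a in x)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(service, satisfaction)


def predict(service_rating):
    """Predicted overall satisfaction for a new record."""
    return intercept + slope * service_rating


print(f"slope={slope:.2f}, intercept={intercept:.2f}")
print(f"predicted satisfaction for a service rating of 4: {predict(4):.2f}")
```

The slope plays the role of the ‘driver’ strength, and `predict` is the part that lets the model score a record it has not seen before.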

Unsupervised Learning
Unsupervised learning is where we do not have an outcome variable in mind, but we want to understand the structure of our data better. Market researchers may have seen two applications of these techniques. To cut down on survey length, we may use principal components analysis to check whether there is a smaller set of ‘quasi-questions’ or ‘dimensions’ that is just as effective. We also use an unsupervised approach in segmentation projects, where we take many variables and collapse the space down to a single ‘dimension’, such that respondents are allocated to one of a few segments. More widely, these sorts of techniques can be used to pre-process data, particularly text and picture data, ready for supervised learning. Some common algorithms include ‘Nearest Neighbours’ and ‘Principal components analysis’, and the range extends up to neural network ‘embeddings’ and ‘t-SNE’.
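To make the segmentation idea concrete, here is a small sketch using k-means clustering. Note that k-means is my choice for illustration and is not one of the algorithms named above; the respondent ratings are invented, and the starting centroids are picked by hand to keep the example deterministic.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, centroids, iterations=10):
    """Assign each point to its nearest centroid, then move centroids to the cluster means."""
    for _ in range(iterations):
        labels = [
            min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if not members:  # leave an empty cluster's centroid where it was
                new_centroids.append(centroids[c])
                continue
            new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
        if new_centroids == centroids:  # nothing moved, so we have converged
            break
        centroids = new_centroids
    return labels, centroids


# Hypothetical (price rating, service rating) pairs for six respondents.
respondents = [(1, 2), (2, 1), (1, 1), (8, 9), (9, 8), (9, 9)]

# Assumption: hand-picked starting centroids; real projects usually use random restarts.
segments, centres = k_means(respondents, [respondents[0], respondents[-1]])
print(segments)
```

Each respondent ends up with a single segment number, which is the ‘collapse down to one dimension’ described above.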

Reinforcement Learning
Reinforcement learning is similar to supervised learning, but a key difference is that reinforcement learning learns from experience with the ‘world’, as opposed to a pre-formed data set.

Reinforcement learning algorithms produce an ‘agent’ which can act in the world based on being in some sort of state and wanting to move to some other state. The algorithms are designed to take into account concepts like time steps and the potential for there to be other agents in the world. We don’t see this approach used much in market research, but when we see headlines about backgammon, chess and Go champions being beaten by computers, the systems involved generally use this branch of machine learning.

Reinforcement learning algorithms attempt to provide guidance on how to act when a reward (such as a ‘win’ in a game) is far off in the future. They are concerned with identifying the steps and actions that lead to a win or a loss. There is some applicability in the domain of chatbots, in that we’re interested in having a conversation made up of a number of ‘steps’, at the end of which a respondent gives us feedback on how they enjoyed the conversation. This ‘reward’ can be propagated back through the earlier steps to help design more enjoyable conversations.
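One common way to credit earlier steps with a reward that only arrives at the end is a discounted return. The sketch below assumes a discount factor `gamma` of 0.9 and a four-step conversation with a single reward at the end; both are illustrative choices of mine, not figures from the course.

```python
def discounted_returns(rewards, gamma=0.9):
    """Work backwards from the last step so each step is credited with the reward still to come."""
    returns = []
    future = 0.0
    for reward in reversed(rewards):
        future = reward + gamma * future
        returns.append(future)
    returns.reverse()
    return returns


# Hypothetical conversation: no feedback until the respondent rates it at the end.
conversation_rewards = [0.0, 0.0, 0.0, 1.0]
print([round(r, 3) for r in discounted_returns(conversation_rewards)])
# [0.729, 0.81, 0.9, 1.0]
```

Steps closer to the reward receive more of the credit, which is one simple way of linking early actions to a far-off outcome.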

Of these three areas, supervised learning is popular in the business community for creating value. Many businesses have been conducting propensity modelling for some time, measuring the likelihood that a customer will defect or upgrade. Nowadays the tools and knowledge to be able to scope, model and drive improvements to key metrics are more available than ever. By way of example, let’s delve into one of the basic algorithms, and explore a fundamental consideration in machine learning.

Imagine a situation where we’re looking to model customer satisfaction, using features of the customer’s experience such as ratings of range, service, ease and price. An alternative might be to use customer churn or spend in the last six months as a target variable. Using this concept, we will explore the decision tree algorithm, and touch on the concept of over/under fitting.

Decision trees work on the principle of ‘if/else’ statements to divide up the sample so that differences in the target variable show more clearly. For example, we could divide the whole sample into two smaller samples, those who rated ‘Service’ more than 6 versus those who did not, and we might expect a difference in overall satisfaction ratings between the two groups. A decision tree starts at the root and always branches into two sub-samples. If we want a very ‘deep’ tree, we may end up with lots of branches and subsamples, perhaps so many that we are left with only one case for a given set of rules. When we’ve finished branching, the resulting subsample is called a ‘leaf’. The if/else branching can be specified manually, but usually we leave it to the algorithm to decide on the best split using ‘information gain’ (which is a topic in its own right). Despite it being a tree, we like to draw them upside down, so the following is an example of a tree with depth = 3 for our hypothetical data set.
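To give a feel for how the algorithm might pick a split, here is a rough sketch for a numeric target. Instead of information gain it uses the reduction in the sum of squared errors, a closely related criterion often used when the target is a score rather than a category. The data and the function names are hypothetical.

```python
def sum_squared_error(values):
    """Spread of the target within one subsample, measured around its mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(feature, target):
    """Return the threshold whose if/else split leaves the least spread in the target."""
    best_threshold, best_error = None, sum_squared_error(target)
    for threshold in sorted(set(feature))[:-1]:
        left = [t for f, t in zip(feature, target) if f <= threshold]
        right = [t for f, t in zip(feature, target) if f > threshold]
        error = sum_squared_error(left) + sum_squared_error(right)
        if error < best_error:
            best_threshold, best_error = threshold, error
    return best_threshold, best_error


# Hypothetical service ratings and overall satisfaction scores.
service = [2, 4, 5, 6, 7, 8, 9, 10]
satisfaction = [3, 3, 4, 4, 8, 8, 9, 9]

threshold, error = best_split(service, satisfaction)
print(f"if service <= {threshold}: lower satisfaction group, else: higher satisfaction group")
```

A full tree simply repeats this search inside each resulting subsample until the chosen depth is reached.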

[Figure: decision tree with depth = 3]

The bottom boxes indicate the overall satisfaction score for the subsamples after the rules are applied. The decision tree appears to have done a good job – there is a great deal of difference between the groups, and we have a clear set of rules we can apply to understand how a customer’s overall satisfaction score will be influenced by ratings on customer experience features. But before we call it a day, we should really look to understand the accuracy of this model. To do this, we can create a ‘predicted’ score using the decision tree model and compare it to the real overall satisfaction score.

[Figure: relationship between actual and predicted satisfaction]

An R-squared of 62% means we’re explaining 62% of the variability in the overall satisfaction score using just our four features. Not bad! But can we do better? Let’s try the same approach with a decision tree with max_depth=30.
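For reference, R-squared can be computed directly from the actual and predicted scores as one minus the ratio of unexplained to total variability. The numbers below are made up to show the calculation and do not reproduce the 62% above.

```python
def r_squared(actual, predicted):
    """Share of the variability in `actual` that the predictions account for."""
    mean_actual = sum(actual) / len(actual)
    ss_residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_residual / ss_total


# Hypothetical actual and predicted overall satisfaction scores.
actual = [3, 5, 6, 8, 9]
predicted = [4, 4, 7, 7, 9]
print(f"R-squared: {r_squared(actual, predicted):.2f}")
```

A model that always predicts the average score gets an R-squared of 0, and a perfect model gets 1.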

Seeing an R-squared of 86% we might conclude ‘That’s much better!’, but there is a devil in the detail of which we need to be aware. Instead of customer satisfaction, imagine we wanted to predict customer purchasing, and now we want to apply our model in the real world. What we may find using this model is that the performance we’ve cited here does not happen in reality. This could be because we have biased our analysis by using the same sample for testing accuracy as we used for creating the model.

What we need to do is split the original sample into three subsamples: one for creating models, a second for testing and selecting the models, and a third to give us an unbiased performance expectation. These are known as the ‘train, validation and test’ samples. There are no hard rules for what proportion of the sample you should have in each, and it may depend on the particular challenge; in this case we will use 60/20/20 proportions.
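A 60/20/20 split can be sketched in a few lines. The function name, the fixed seed and the stand-in records are my own assumptions; the important part is shuffling before slicing so that each subsample is a random draw from the original.

```python
import random


def train_validation_test_split(records, seed=0):
    """Shuffle a copy of the records, then slice it into 60/20/20 subsamples."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    n_train = len(shuffled) * 6 // 10
    n_validation = len(shuffled) * 2 // 10
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]
    return train, validation, test


# Hypothetical respondent IDs standing in for full survey records.
records = list(range(100))
train, validation, test = train_validation_test_split(records)
print(len(train), len(validation), len(test))  # 60 20 20
```

The test subsample should then be left untouched until the final model has been chosen.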

Now that we’ve run the decision tree algorithm over many possible tree depths, we see an interesting characteristic in the R-squared results reported for the training and validation samples. The training score keeps on increasing, getting better and better, but the validation score appears to hit a sweet spot at about depth = 6.
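The sweep over depths can be reproduced in miniature. The sketch below is a toy one-feature regression tree run on synthetic data that I generate inside the script, so it will not match the figures quoted here; the point is only to show the mechanics of scoring each depth on both the training and validation samples.

```python
import random


def mean(values):
    return sum(values) / len(values)


def sse(values):
    """Sum of squared errors around the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def build_tree(rows, depth):
    """rows are (service, satisfaction) pairs; returns a nested dict of if/else rules."""
    thresholds = sorted({x for x, _ in rows})
    if depth == 0 or len(thresholds) < 2:
        return {"leaf": mean([y for _, y in rows])}
    best = None
    for threshold in thresholds[:-1]:
        left = [r for r in rows if r[0] <= threshold]
        right = [r for r in rows if r[0] > threshold]
        error = sse([y for _, y in left]) + sse([y for _, y in right])
        if best is None or error < best[0]:
            best = (error, threshold, left, right)
    _, threshold, left, right = best
    return {
        "threshold": threshold,
        "left": build_tree(left, depth - 1),
        "right": build_tree(right, depth - 1),
    }


def predict(tree, x):
    """Follow the if/else rules down to a leaf."""
    while "leaf" not in tree:
        tree = tree["left"] if x <= tree["threshold"] else tree["right"]
    return tree["leaf"]


def r_squared(tree, rows):
    residual = sum((y - predict(tree, x)) ** 2 for x, y in rows)
    return 1 - residual / sse([y for _, y in rows])


# Synthetic data (an assumption for illustration): satisfaction steps up when
# service is rated above 6, plus random noise.
rng = random.Random(0)
rows = []
for _ in range(200):
    service = rng.uniform(1, 10)
    rows.append((service, (4 if service <= 6 else 8) + rng.gauss(0, 1)))

# 60/20/20 split; the final 40 rows are held back as the test sample.
train, validation = rows[:120], rows[120:160]

scores = []
for depth in range(1, 11):
    tree = build_tree(train, depth)
    scores.append((depth, r_squared(tree, train), r_squared(tree, validation)))

for depth, train_r2, validation_r2 in scores:
    print(f"depth={depth:2d}  train R2={train_r2:.3f}  validation R2={validation_r2:.3f}")
```

Because each extra level only subdivides the existing leaves, the training score can never fall as depth grows; it is the validation column that tells us when to stop.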

What we are doing when we make the decision tree deeper is adding more complexity to the model. There is a sweet spot where we have enough model complexity to describe the data, but not so much that it has a negative impact on real world performance. This is a fundamental concept in machine learning known as the bias/variance trade-off, which is related to overfitting and underfitting.

What are you currently doing with Machine Learning? I’d love to hear your experiences – leave a comment below.