Explaining Machine Learning to a 5th Grader

Machine LearningThis column is by Soumendra Mohanty, EVP & CDAO, L&T Infotech

This is a tough task, I was in this precarious situation trying to explain to my younger son. I had a curated list of top 10 frequently used Machine Learning algorithms, but the key was to do a backward mapping of these Machine Learning techniques to solve problems which are of interest and relevance to my son. As a starting point to the conversation I asked him, list down your decision making points, meaning there may be many situations when you had to make decisions but you may not have all the information. I took those and matched it to the machine learning algorithms while explaining the core concept behind the problem solving.

Here it goes.

  1. Decision trees: C4.5/C5.0

What does it do? C4.5/C5.0 develops a classifier in the form of a flow chart (decision tree) with conditions at each branching node.

Wait, what’s a classifier? A classifier is a machine learning technique that takes a bunch of data and attempts to predict which class the new data belongs to.

What’s an example of this? Sure, remember sometime back you were asking me who you want to invite to your birthday party and whether they will accept your invitation or not! Now, assume that you have got a data set about all the other 29 kids in your class. The information contains their hobby, kind of books they read, do they share their tiffin or not, are they friendly, in your last birthday did they come and did they bring nice gifts, etc.

Now, given these information, you want to predict whether your classmates will accept your invitation or not. Each classmate can fall into 1 of 2 classes: will accept your invitation or won’t.

Cool, but how exactly this works in a decision tree? Decision tree learning creates something similar to a flowchart, at each point in the flowchart is a question about the value of some attribute, and depending on those values, he or she gets classified.

For example: You want to invite Arnab but you are not sure whether he will accept your invitation or not! The decision tree builds a path with the following condition:

Stays near to our house, plays soccer with you, you had been to his birthday party hence if you invite him he will accept your invitation.

Is this supervised or unsupervised? This is supervised learning, since the training dataset is labeled with classes. Using the birthday party invitation example, the decision tree doesn’t learn on its own that a classmate will accept your invitation or won’t. We told it first, it generated a decision tree, and now it uses the decision tree to classify.

  1. The k-means algorithm

k-means algorithm divides a given data set into k number of clusters.

What does it do? k-means creates ‘k’ groups from a set of objects so that the members of a group are more similar.

what’s cluster analysis? Cluster analysis forms groups such that the group members are more similar versus non-group members. Clusters and groups are synonymous in the world of cluster analysis.

Is there an example of this? Sure, remember when I ask you about your friends you have this way of explaining who are the nerds, who are the sports guys, who are friendly, who are talkative and who are the troublesome guys! In essence in your school, by analyzing the behavior of the kids you have developed your own cluster analysis techniques.

Interesting part is you tell k-means algorithm how many clusters you want. K-means takes care of the rest.

How does k-means take care of the rest? k-means has lots of variations to optimize and forms the clusters. At a high level, it does something like this:

a. k-means picks points in multi-dimensional space to represent each of the k clusters. These are called centroids.

Every kid in your class will be closest to one of these k centroids. They hopefully won’t all be closest to the same one, so they’ll form a cluster around their nearest centroid.

What we have are k clusters, and each kid is now a member of a cluster.

b. k-means then finds the center for each of the k clusters based on its cluster members (yep, using the characteristics of each kid in your class!).

This center becomes the new centroid for the cluster.

c. As you add more and more behavioral aspects, the clusters get refined and start forming strong homogeneity and the cluster memberships stabilize. This is called convergence.

Is this supervised or unsupervised? It is unsupervised, other than we specifying the number of clusters we want to form, k-means “learns” the clusters on its own without any information about which cluster an observation belongs to.

If you have got a massive dataset, you can use clustering technique to get an overall feel of what kind of distinct behaviors are there in the data and then carry on your analysis from there onwards.

  1. Support vector machines (SVMs) –

It is similar to the classification algorithms (C4.5/C5.0/CART) we discussed earlier but it uses sophisticated math to solve classification problems.

What does it do? Support vector machine (SVM) learns a hyperplane to classify data into two classes.

What is a hyperplane? A hyperplane is a function which creates a distinct boundary separating the two classes. In fact, for a simple classification task with just two features, the hyperplane is actually a line.

Do you have an example? The simplest example is to take all your toys (working and not working), dump all of them on a table. If the toys aren’t too mixed together (meaning they are either working or not working, you can’t say that some are partially working) you could take a chalk and draw a line and keep the working toys to one side of the line and not working toys on the other side of the line.

Now, when you want to add another toy, by knowing which side of the line the toy is on, you can predict whether it is working or not working.

What do the toys, table and line represent? The toys represent data points, and the working toys and not working toys represent two classes. The line represents the hyperplane.

What if things get more complicated? If the toys are mixed together, a straight line won’t work. In those scenarios, a 2 dimensional view of your toys won’t work hence you elevate your observation space to 3 dimensions and minutely examine what other additional characteristics will help you determine the classes. SVM does this in an automated way, maps them into a higher dimension and then finds the hyperplane to separate the classes.

One more terminology you will encounter when doing SVM, it is called margin. The margin is the distance between the hyperplane and the two closest data points from each respective class. In the toy example, the distance between the line and the closest working toy and not working toy is the margin. SVM attempts to maximize the margin, so that the hyperplane is just as far away from working toy as the not working toy. In this way, it decreases the chance of misclassification.

Is this supervised or unsupervised? This is a supervised learning, since a dataset is used to first teach the SVM about the classes. Only then is the SVM capable of classifying new data.

  1. The Expectation-Minimization (EM) algorithm

The EM algorithm iterates and optimizes the likelihood of seeing observed data while estimating the parameters of a statistical model with unobserved variables. EM is useful in Catch-22 situations where it seems like you need to know A before you can calculate B and you need to know B before you can calculate A.

What does it do?  EM begins by making a guess at the parameters. Then it follows an iterative 3-step process:

  1. E-step: Based on the parameters, it calculates the probabilities for assignments of each data point to a cluster.
  2. M-step: Updates the parameters based on the cluster assignments from the E-step.
  3. Repeats until the parameters and cluster assignments stabilize.

This is very complicated, do you have an example? Alright, you don’t like your social studies paper, you have not seriously taken notes in class and have a very minimal understanding about the topics but you have an exam tomorrow. How will you prepare for your exam? There are two things you need to keep in mind. First your level of understanding of the social studies paper (we will call it X) and probable questions that are likely to occur in the exam (we will call it Y).  It is not possible to arrive at Y without X, and waste of time to improve X without knowing Y. The sensible thing to do is to first study for some time (optimize X) then predict what are possible questions for the exam (expectations for Y), then do this cycle again and again until you are satisfied that you understand the topics very well (optimal X) that you are prepared for questions that are very likely to come in the exam (based on expected Y).

By optimizing the likelihood, EM generates a model that assigns class labels to data points (This looks like one more clustering technique!).

Is this supervised or unsupervised? Since we do not provide labeled class information, this is unsupervised learning.

  1. AdaBoost

Is an iterative process where a set of algorithms are used to achieve better performance as opposed to applying a single algorithm. This process is called ensemble learning method.

What does it do? AdaBoost is a boosting algorithm which constructs a classifier. Boosting is an ensemble learning algorithm which takes multiple learning algorithms and combines them. The goal is to take an ensemble or group of weak learners and combine them to create a single strong learner.

Note: A weak learner classifies with accuracy barely above chance. A strong learner has much higher accuracy.

What’s an example of AdaBoost? Let’s start with 3 weak learners. We’re going to train them in 10 rounds on a training dataset containing your soccer team’s data. The dataset contains details about your past matches, goals scored, goals against, your key players and their performance, and also your opponent team’s performance, etc.

The question is, can you predict whether you will win your next match against a new team you have never played with them earlier?

Here’s how AdaBoost answers the question…

In round 1: AdaBoost takes a sample of the dataset and tests to see how accurate each learner is predicting the outcomes against the known teams. The end result is we find the best learner. In addition, the cases where the learner misclassified the outcome are given a heavier weight, so that they have a higher chance of being picked in the next round. Why would you do this? It’s like getting to the second level of a video game and not having to start all over again when your character is killed. Instead, you start at level 2 and focus all your efforts on getting to level 3.

Similarly, the best learner is also given a weight depending on its accuracy and incorporated into the ensemble of learners (right now there’s just 1 learner).

In round 2: AdaBoost again attempts to look for the best learner. The sample of your dataset is now influenced by the more heavily misclassified weights. In other words, previously misclassified outcomes have a higher chance of showing up in the sample. The best learner is again weighted and incorporated into the ensemble, misclassified outcomes are weighted so they have a higher chance of being picked and we repeat the steps.

At the end of the 10 rounds: We’re left with an ensemble of weighted learners trained and then repeatedly retrained on misclassified data from the previous rounds. Now when you introduce the data about the new team you have got a higher chance of predicting the right outcome.

Is this supervised or unsupervised? This is supervised learning, since each iteration trains the weaker learners with the labelled dataset.

  1. Naïve Bayes

Naïve Bayes is a family of classification algorithms.

What does it do? It is based on a (naïve) assumption that all features of dataset are independent. The algorithm predicts the class given a set of features using probability.

What does independent mean? We say features are independent if the value of all features have no effect on each other. For example, if we look at the kids in your class, it’s reasonable to assume that the kid’s height and where they live are independent, similarly height and exam scores are independent. But, we can’t say the same thing for height and weight because if height increases, weight likely increases.

In real world scenario, the features of a dataset are generally not all independent.

The simplified equation for classification looks something like this:

p(A|B) = p(B|A) * p(A) / p(B)

  • p(A|B) is the posterior (usually read ‘probability of A given B’) is the probability of finding observation A, given that some piece of evidence B is present.
  • p(B|A) is the likelihood. This is the probability of the evidence turning up, given that the outcome obtains.
  • p(A) is the prior. This is the strength in our belief of A without considering the evidence B.
  • p(B) is the evidence. This is the probability of the evidence arising, without regard for the outcome.

Let’s dig deeper into this through an example. Let’s assume that there is a cricket match between two teams: India and South Africa, and you want to determine which team will win. Below table illustrates their past encounters and outcomes.

Won Lost Total
India 5 7 12
South Africa 7 5 12

It turns out that India and South Africa have played against each other on twelve previous occasions. Of these last twelve matches, India won five and South Africa won the other seven. Therefore, all other things being equal, the probability of India winning the next match can be estimated from previous wins: 5 / 12, or 0.417, or 41.7%. South Africa, on the other hand, appears to be a better bet at 7/12, or 0.583 or 58.3%. So, given only the information that we have, and everything else being equal you’d want to back South Africa. A 58.3% likelihood of winning is more favourable than a 41.7% likelihood.

Now let’s add a new factor into the scenario. It turns out that on three of India’s previous five wins, it had rained just before the match. However, it had lost only once when it had rained prior to the match. On the day of the match in question, it is raining in the morning. Now, how does this extra information affect where you should put your money? If you ignore the information about the weather, and use only the counts of previous wins and losses, you should, of course back South Africa as we previously determined. On the other hand, if you use only the new information about the weather, and neglect the previous counts of wins and losses, you would perhaps back India. This is because three of its previous five wins have been on rainy days. You might estimate India’s probability of winning as 3 / 5, or 60%, on this basis. However, to do this would be to ignore a crucial piece of information — that overall India has won fewer matches than South Africa. What we need to do is to combine the two pieces of information to get some kind of overall probability of India winning the match.

To do this, we need to examine four possible situations: India wins when it rains; India wins when it doesn’t rain; India loses when it rains; India loses when it doesn’t rain. We know that India achieved three of its five wins on rainy days, and it only rained once when it lost. This means India must have won twice on sunny days. Since they have played twelve matches, the number of times India lost on a sunny day must be 12 – (3 + 1 + 2), which is 6. We can tabulate these results like this:

Raining                   Not Raining

India wins    3                  2

India loses   1                   6

What we need to know is the probability of India winning, given that it is raining. Of course, we could work out the probability of South Africa winning instead, but since (we assume) one team must win, we can always work out one team’s probability by subtracting the other’s from 1.0. Note that the probability of India winning given that it is raining is not at all the same as the probability of its being raining when India wins. We have already calculated this probability as 60% (three rainy days out of five wins).

So what is the probability of India winning, given that it is raining? We know that India won on three occasions on which it rained, and there were four rainy days in total. So India’s probability of winning, given that it is now raining, is 3 / 4, or 0.75, or 75%. Note that the additional information — that India won three times out of five on a rainy day — shifts its probability of winning this current match from 41.7%, to 75%. More importantly, it means that now India is more likely to win than South Africa, even though South Africa has won more matches overall. This is important information if you plan to bet on — if it is raining you should back India; if it is not, you should back South Africa.

Now, if we go back to our Naïve Bayes Theorem, ‘A’ was the outcome in which India wins, and ‘B’ is the piece of evidence that it is raining. So, p(A|B) is what we want to find out. Since it was raining three of the five times that India won, p(B|A) is 3 / 5, or 0.6. p(A) is the probability of the outcome occurring, without knowledge of the new evidence. In our example, p(A) is 5 / 12, or 0.417, because India won on five out of twelve occasions. Finally, in our example, it is the probability of rain, without regard for which team won the match. Since we know it rained on four days, this value is 4 / 12, or 0.333. So the value of p(A|B) — the probability of India winning given that it is raining — is

p(A|B) = p(B|A) * p(A) / p(B)

= 0.6 * 0.417 / 0.333

= 0.75

Is this supervised or unsupervised? This is supervised learning, since Naive Bayes is provided a labeled training dataset.

  1. The Apriori algorithm –

This algorithm discovers frequent sets of items (for example, items purchased together in a supermarket) and then finds out association rules based on these itemsets.

What does it do? The Apriori algorithm tries to find correlations and relations among variables in a database.

What’s an example of Apriori? Let’s say we have data about supermarket transactions, where each row is a customer transaction and every column represents a different grocery item.

Transaction ID Soft Drinks Chips Eggs Bread Noodles
1001 Yes Yes No No No
1002 Yes Yes Yes No No
1003 No No No Yes Yes
1004 Yes Yes Yes No No
1005 No Yes Yes No No

By applying the Apriori algorithm, we can learn the items that are purchased together (this is known as association rules).

From our example dataset, you can very well see that soft drinks + chips and bread + noodles seem to frequently occur together. These are called 2-itemsets.

How does this information help me? Imagine, our neighbor also goes to the same supermarket but only picks up soft drink, from our example we already know that people who buy soft drink also buy chips, the sales person at the counter now can succinctly push our neighbor to buy chips.

But, before you apply the algorithm, you’ll need to define 3 things:

  1. The first is the size of your itemset. Do you want to see patterns for a 2-itemset, 3-itemset, etc.?
  2. The second is your support or the number of transactions containing the itemset divided by the total number of transactions. In our example, this is {Soft Drinks, Chips, Eggs} has a support 2/5 = 0.4.
  3. The third is your confidence or the conditional probability of some item given you have certain other items in your itemset. The confidence of the rule {Chips, Eggs} => {Soft Drinks} is calculated as dividing support count of {Soft Drinks, Chips, Eggs} by support count of {Chips, Eggs}, thus we arrive at 2/3 = 0.67. This means 67% percent of time we can be sure that a person buying chips and eggs will also buy soft drinks.

Is this supervised or unsupervised? Apriori is generally considered an unsupervised learning approach, since it’s often used to discover interesting patterns and relationships without any external inputs.

  1. kNN

kNN, or k-Nearest Neighbors, is a classification algorithm. However, it differs from the classifiers previously described because it’s a lazy learner.

A lazy learner doesn’t do much during the training process other than storing the training data. Only when new unlabeled data is fed into it, a lazy learner tries to classify. On the other hand, an eager learner builds a classification model during training. When new unlabeled data is fed into it, the eager learner feeds that into the classification model.

The other classification algorithms we discussed earlier are all eager learners.

  • 5 builds a decision tree classification model during training.
  • SVM builds a hyperplane classification model during training.
  • AdaBoost builds an ensemble classification model during training.

What does it do? kNN builds no such classification model during the training process. When new unlabeled data comes in, kNN operates in 2 basic steps:

  • It looks at the closest labeled training data points — in other words, the k-nearest neighbors.
  • Using the neighbors’ classes, kNN gets a better idea of how the new data should be classified.

Do you have an example? Sure, Let’s consider few ’drinking items’ which are rated on two parameters on a scale of 1 to 10. The two parameters are “sweetness” and “fizziness”. The ratings of the items look somewhat as:

Name Sweetness Fizziness Type of Drink
Redbull 6 6 Health Drink
Gatorade 6 3 Health Drink
Coke 8 8 Soft Drink
Pepsi 8 8 Soft Drink
Appy 7 5 Health Drink
Lemonade 7 6 Soft Drink
Tropicana Mango 7 2 Health Drink

“Sweetness” determines the perception of the sugar content in the items. “Fizziness” ascertains the presence of bubbles in the drink due to the carbon dioxide content in the drink. We also have identified 2 distinct groups namely, ‘Health Drinks’ and ‘Soft Drinks’.

Now, what if I ask you to determine where ‘Maaza’ as drink will fit into your table? kNN Algorithm does this by applying a ‘distance measure’, the most popular being Euclidian distance formula. Don’t worry the math is actually simple.

Let’s say that the x1, x2, … represents ‘sweetness’ and y1,y2,… represents ‘fizziness’ in our table. Now, assume that Maaza has the parameters as (8,2). Using the co-ordinates of Maaza (8,2) we will calculate the distance between  ‘Maaza’ and all other items and update our table.

Name Sweetness Fizziness Type of Drink Distance to Maaza
Redbull 6 6 Health Drink 4.47
Gatorade 6 3 Health Drink 2.23
Coke 8 8 Soft Drink 6
Pepsi 8 8 Soft Drink 6
Appy 7 5 Health Drink 3.16
Lemonade 7 6 Soft Drink 4.12
Tropicana Mango 7 2 Health Drink 1.41

In our table we see that the distance between Maaza and Tropicana Mango is the least, hence we can conclude that Maaza belongs to the group of health drinks.

Is this supervised or unsupervised? This is supervised learning, since kNN is provided a labeled training dataset.

  1. CART

CART stands for classification and regression trees.  It is a decision tree learning technique.

What does it do? Depending on the outcome you are looking for (classification or numeric/continuous), CART develops a classification tree or a regression tree. For example, if you have got a dataset of your classmates performance in class and you want to determine whether they will pass or not then CART develops a classification tree with outcomes as ‘pass’ or ‘fail’. However, if you want to determine what ranges of %marks (30%-40%, 40%-60%, 60%-80%, etc) they will pass with then CART develops a regression tree.

Is this supervised or unsupervised? CART is a supervised learning technique, since it is provided a labeled training dataset in order to construct the classification or regression tree model.

  1. Recommendation Engines

The main objective of this is to provide the most relevant and accurate items to the user by filtering useful stuff from of a huge pool of information base.

What does it do? Recommendation systems are mostly seen in action in the online shopping sites like Amazon, Flipkart, etc. As an online shopping business, if you have lots of items and lots of customers, how do you determine to showcase the best choices to the consumers according to their tastes and preferences?

This sounds interesting, explain me how this works? There are basically two things one need to construct – Item-Feature Matrix and then User-Feature Matrix.

let’s take the smartphones example, with a simple criteria like good battery performance and good display quality, let’s develop the item-feature matrix.

Based on these features, you look at various smartphones and develop an item-feature matrix like below.

Smartphone Battery Display
Phone 1 0.9 0.1
Phone 2 1.0 0.3
Phone 3 0.8 0.5
Phone 4 0.9 0.9
Phone 5 0.6 1.0

Phone 1 has good battery life but poor display

Phone 2 has an amazing battery performance but very rough display

Phone 3 battery is one of the best and improved display quality

Phone 4 has robust battery and also amazing display quality

Phone 5 has poor battery performance but amazing display quality.

Now, from the past purchases and user feedbacks you have got a dataset which you will use to develop a user-feature matrix as below.

User Battery Display
Pratik 0.9 0.1
Pratyush 0.8 0.3
Max 0.8 0.5
Arnab 0.9 0.1
Mehta 0.6 1.0

Pratik prefers battery over display.

Pratyush likes a long lasting battery and a decent display performance.

Max prefers good battery performance and also has higher expectation from the display quality.

Arnab wants robust battery performance, display is not that important for him

Mehta wants decent battery performance but puts more emphasis on display quality

What is missing in our work so far is the item-user matrix, once we get that done it will be easier to recommend other users based on the similarity of the profiles. How do we do that? We will follow the formula: Max(User 1 across all item sets across all features).

So for Pratik the recommendations will come out in a ranked order as Phone 2, Phone 1, Phone 3, Phone 5 and lastly Phone 4.

Note: In this illustration I have over simplified the recommendation engine algorithm, in reality it deals with matrix products, weightages, behavioral preferences, cost, etc.


There is a widespread belief that at a broad level you do classification or regression using Machine Learning algorithms, it is true to a large extent. Besides the curated list of top 10 Machine Learning algorithms we discussed above, there are hundreds of algorithms that have various modifications and implementation approaches. The important aspect to keep in mind is when you are selecting an appropriate class of algorithms and an algorithm within the class, you should deeply analyze what kind of problem you are trying to solve and define what you should measure or predict, without a clarity on these aspects you will be lost in the long list of machine learning algorithms out there.

Disclaimer: This is a curated post. The statements, opinions and data contained in these publications are solely those of the individual authors and contributors and not of iamwire and the editor(s). The article was published in its original form by the author here.

Have ideas to share? Submit a post on iamwire

One comment

  1. 1

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>