0:00
Hey everyone, my name is Asta Chohan. Welcome to Tutorials Point
0:04
In the previous video we learned all about unsupervised machine learning algorithms
0:09
And in this video we are going to talk about K means clustering
0:13
So let's see what's in it for you in this video. We are going to talk about what clustering is, where clustering is used, types of clustering,
0:23
what K-means clustering is, the elbow method, the working of K-means clustering, and lastly the
0:29
applications of K-means clustering. Let's first discuss what clustering is
0:35
Clustering is the process of dividing a data set into groups consisting of similar data points
0:41
That means we cluster the data set in such a way that two data points that are similar will lie in the same group
0:50
and two data points that are dissimilar will lie in separate groups
0:55
We can observe clustering in our day-to-day life as well. When we go to the vegetable market, we observe clusters of vegetables: there is a
1:05
cluster of potatoes, a cluster of tomatoes, a cluster of green vegetables. They are all grouped on the basis of similarity
1:15
Also, my school had a house system. There were four color-coded houses, that is red, yellow, green,
1:26
and blue. So we were grouped on the basis of the color
1:29
of the uniform, and we competed with each other every Saturday. Clustering is used in many
1:36
areas. Some famous examples are the Amazon recommendation system and Netflix's recommended movies
1:43
Now let's talk about types of clustering. There are three types of clustering. First is
1:48
exclusive clustering, second is overlapping clustering, and third is hierarchical clustering. Let's first discuss exclusive clustering. As you can observe from this diagram, one item belongs to only one group. That means this triangle belongs to only one group and these circular data points
2:10
belong to another group. K-means clustering does this type of exclusive clustering. Second is
2:16
overlapping clustering. As the name suggests, in overlapping clustering an item
2:23
can belong to multiple clusters. From this diagram, you can observe that these rectangular data points belong to this triangular group
2:33
and this circular group as well. Next is hierarchical clustering. When the clusters
2:39
have a parent-child relationship between them, or they build a tree-like structure,
2:45
then that is called hierarchical clustering. Now let's talk about K-means clustering
2:51
The definition says K-means clustering performs division of the objects into
2:57
clusters which are similar to each other and are dissimilar to the objects belonging to
3:03
another cluster. That means two data points or two objects that are similar will lie in the
3:10
same group and two data points or objects that are dissimilar will lie in another group. That
3:16
means two objects that lie in separate groups are dissimilar to each other and two
3:22
data points that lie in the same group are similar to each other. Now let's
3:27
talk about the working of K means clustering. Here are four steps
3:32
Let's see what they are. First is: pick K random points as cluster centers, called centroids
3:39
Second is, assign each data point to the nearest cluster by calculating its distance
3:46
to each centroid. Third is: find the new centroids by taking the average of all assigned points
3:55
And the fourth one is: repeat steps 2 and 3 until none of the cluster assignments change. Now the question arises: how do we choose the optimum number of clusters? The answer is that for this we use the elbow method
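The four steps above can be sketched as a minimal NumPy implementation (a sketch only; the sample data, seed, and choice of k=2 are made up for illustration):

```python
import numpy as np

def kmeans(points, k, seed=0, max_iter=100):
    """Minimal K-means sketch: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k random data points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to the nearest centroid by distance.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: new centroids are the averages of the assigned points.
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: repeat until the centroids (and hence assignments) stop changing.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Toy 2-D data: two clearly separated blobs.
data = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                 [8.0, 8.0], [8.2, 7.9], [7.9, 8.1]])
centroids, labels = kmeans(data, k=2)
```

Each comment maps to one of the four steps; production libraries such as scikit-learn add smarter initialization (k-means++) and empty-cluster handling on top of this loop.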
4:10
There is a term WSS, the within sum of squares, which is defined as the sum of the squared
4:17
distances between each member of a cluster and its centroid. It is represented as the sum, for i equals 1 to m, of x i minus c i to the
4:29
whole square, where x i is a data point and c i represents the centroid closest to x i
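The WSS quantity can be sketched as a small helper function (hypothetical names; `points`, `labels`, and `centroids` mirror the x i and c i of the formula, and the tiny example data is made up):

```python
import numpy as np

def wss(points, labels, centroids):
    """Within sum of squares: sum over i of ||x_i - c_i||^2,
    where c_i is the centroid of the cluster that point x_i belongs to."""
    return float(np.sum((points - centroids[labels]) ** 2))

# Tiny check: two points assigned to a single centroid at their mean.
pts = np.array([[0.0, 0.0], [2.0, 0.0]])
labs = np.array([0, 0])
cents = np.array([[1.0, 0.0]])
print(wss(pts, labs, cents))  # → 2.0
```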
4:36
After that, we plot the curve between WSS and the number of clusters. In this curve, we observe
4:43
that after some point the graph changes very slowly. And this is the point that we choose as
4:50
the value of k; in this case, we choose the value of k as two. Now it's quiz time, and here is a
4:57
question for you all, so comment down the answer below. The question is: what is the role
5:03
of the k parameter in K-means clustering? The options are: it represents the maximum number of iterations
5:10
allowed for the algorithm; it determines the number of clusters that the algorithm will partition
5:17
the data into; it controls the rate of convergence for the algorithm; and the last option is, it sets
5:25
the threshold for the minimum similarity required between data points. This question is for you all
5:32
So comment down the answer below. Now let's understand the working of K means clustering with an example
5:39
So we have a data set that contains runs and wickets gained in the last 10 matches
5:45
And we have to identify whether a given player belongs to the batsman group or to the bowler group
5:54
So it is clear that a player who has more wickets than runs will be a bowler, and a player who has more runs than wickets will be a batsman. That means this group represents the bowlers and this group represents the batsmen
6:13
As we discussed before, what was the first step? Selecting K random points as centroids
6:20
So in this case, we are selecting these stars as random centroids
6:25
After that, we will calculate the distance between each centroid and each data point
6:30
and assign them accordingly. But now the question arises, how do we calculate the distance between two points
6:38
The best way to calculate the distance between two data points is the Euclidean distance method
6:44
It says: if there are two data points, point A (x1, y1) and point B (x2, y2),
6:51
then the distance between them will be the square root of (x2 minus x1) squared plus (y2 minus y1) squared
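That formula translates directly into code (a minimal sketch; the sample points are made up):

```python
import math

def euclidean_distance(a, b):
    """Distance between point A = (x1, y1) and point B = (x2, y2)."""
    (x1, y1), (x2, y2) = a, b
    return math.sqrt((x2 - x1) ** 2 + (y2 - y1) ** 2)

# A classic 3-4-5 right triangle, so the distance is exactly 5.
print(euclidean_distance((0, 0), (3, 4)))  # → 5.0
```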
7:00
After calculating the distance between the centroids and each data point, we assign these data points to this green star centroid
7:09
and these data points to this pink star centroid. After that, we have to relocate the centroids,
7:16
and for that, we take the average of the data points assigned to each cluster. So after taking these averages,
7:24
these are the relocated centroids. For the first group, the centroid is here
7:30
And for the second group, the centroid is here. Now here are some applications of K-means clustering
7:37
First is academic performance. Second is diagnostic systems. Third is search engines
7:43
And the last one is wireless sensor networks. So that was it for this video
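The batsman/bowler walk-through above can be reproduced with scikit-learn's `KMeans` (a sketch, assuming scikit-learn is installed; the run/wicket numbers are invented for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Invented (runs, wickets) over the last 10 matches for six players.
stats = np.array([
    [450, 2], [380, 1], [500, 3],   # more runs than wickets: batsman-like
    [60, 18], [45, 22], [80, 15],   # more wickets than runs: bowler-like
])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(stats)
print(model.labels_)           # cluster id for each player
print(model.cluster_centers_)  # final (relocated) centroids
```

The fitted `cluster_centers_` are the relocated centroids from the final iteration, and `labels_` tells you which of the two groups each player fell into.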
7:48
We have already discussed the supervised machine learning algorithms, and we have started the unsupervised machine learning part
7:55
And in this video, we have covered K-means clustering. In the next video, we are
8:00
going to learn about hierarchical clustering. So stay tuned with Tutorials Point
8:06
Thanks for watching and have a nice day