Machine Learning - Preprocessing Structured Data - Detecting Outliers
Oct 10, 2024
0:00
In this video we are discussing detecting outliers, so we shall give you one demonstration on this very topic.
0:08
So here is the demonstration. Let us discuss outlier detection and outlier handling.
0:15
So before going for that, let me discuss the types of outliers.
0:19
The first type is known as the global outlier. So what is a global outlier?
0:24
Here the object significantly deviates from the rest of the dataset.
0:30
The next one is a contextual outlier. What is a contextual outlier? The object deviates
0:35
significantly based on a selected context. For example, 28 degrees centigrade is an outlier for a Moscow winter, but it is not an outlier in
0:46
another context; that is, 28 degrees centigrade is not an outlier for a Moscow
0:51
summer. So that is a contextual outlier. The last one is a collective outlier.
0:56
Here a subset of data objects collectively deviates significantly from the whole dataset, even if the individual data objects
1:05
may not be outliers. For example, a large set of transactions of the same
1:11
stock among a small party in a short period of time can be considered as
1:17
evidence of market manipulation. So in this way we have three different types
1:23
of outliers: global outliers, contextual outliers and collective outliers. So let us go
1:30
for the coding now. Here we have imported the respective modules, and here we have
1:34
created one data frame with three attributes: price, bedrooms and size. The data
1:40
frame name is flat; that is, a house or flat I'm considering. So here we are
1:46
going for the respective printing. I'm executing this, so here is my data frame.
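The exact values shown on screen are not captured in this transcript, so the following is only a sketch with made-up numbers that match the description: a data frame named flat with price, bedrooms and size columns, where the last record is far out of range. The copy named df is kept so the original values can be restored later.

```python
import pandas as pd

# Hypothetical values: the real numbers from the video are not in the transcript.
flat = pd.DataFrame({
    "price": [534433, 392333, 293222, 451000, 4322032],
    "bedrooms": [2.0, 3.0, 2.0, 4.0, 125.0],
    "size": [1500, 2500, 1500, 3200, 48000],
})

# Keep an untouched copy so later steps can start again from the original data.
df = flat.copy()

print(flat)
```

The size column is accessed as flat["size"] rather than flat.size, because size is already an attribute of every pandas data frame.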
1:51
Consider that last record here. The prices are all
1:56
six-digit numbers, but this one is a seven-digit number. The bedrooms are all less than 10, or you can consider 20 also, but here it is 125, which is very big compared to the rest. So this we can consider a global outlier.
2:14
Next, all the sizes are four digits, but here the size is of five digits. So here we have some outliers. At least one record, I'm finding, contains some outlier values. So now, how to detect it? The method we are showing here is the elliptic
2:31
envelope. It assumes that the data is normally distributed and, based on that
2:37
assumption, draws an ellipse around the data, classifying any observation inside the ellipse as an inlier, labeled as 1, and any observation outside
2:47
the ellipse as an outlier, labeled as minus 1. A
2:52
major limitation of this approach is the need to specify a contamination parameter,
2:58
which is the proportion of observations that are outliers, a value we don't know. Here we're
3:05
setting this contamination parameter on the assumption that 10% of the data will be
3:11
outliers, but it is very difficult to guess. If we know that 10% of the data is outliers,
3:17
then we have a good idea of the dataset, but that might not always be
3:23
available. So this elliptic envelope method has some limitations
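The elliptic envelope approach described here can be sketched with scikit-learn's EllipticEnvelope class. This is a minimal sketch, not the code from the video: the data values are made up, ten rows are used instead of the handful on screen so that the covariance estimate is better behaved, and random_state is added only for reproducibility.

```python
import pandas as pd
from sklearn.covariance import EllipticEnvelope

# Hypothetical data (not the values from the video); the last row is extreme.
flat = pd.DataFrame({
    "price": [534433, 392333, 293222, 451000, 610500,
              375900, 488700, 329400, 559800, 4322032],
    "bedrooms": [2, 3, 2, 4, 5, 3, 4, 2, 1, 125],
    "size": [1500, 2500, 1500, 3200, 3900,
             2100, 2800, 1700, 3000, 48000],
})

# contamination=0.1 is the assumed share of outliers in the data;
# random_state is an addition here to make the sketch repeatable.
outlier_detector = EllipticEnvelope(contamination=0.1, random_state=0)
outlier_detector.fit(flat)

# predict returns 1 for inliers and -1 for outliers.
labels = outlier_detector.predict(flat)
print(labels)
```

Changing contamination changes how many rows are labeled as minus 1, which is exactly why guessing that value is the weak point of the method.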
3:28
for that. So here we're defining one outlier_detector object of the
3:34
class EllipticEnvelope. The contamination parameter has been passed as 10%, that is,
3:39
0.1. Then we fit the detector. So, executing the fit method and also the predict method,
3:45
you can find that here the records are being labeled as 1, 1, 1, so these four
3:51
records have been considered as non-outliers, that is, the inliers, and the
3:56
last record has been labeled as minus 1, so it has been treated as an outlier. Next is how to deal with them. At first we can go for the
4:06
drop. Here we are using one condition: if the number of bedrooms is less than 20 then it is okay; otherwise the record has to be dropped. So if we execute
4:20
the same, you can find that the last record, the last row, has been dropped
4:26
because its bedroom count was 125, which is greater than 20. You can easily find it out here.
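The drop step can be sketched as a simple boolean filter. The data values and the name flat_dropped are assumptions for illustration; only the threshold of 20 bedrooms comes from the video.

```python
import pandas as pd

# Hypothetical values: the real numbers from the video are not in the transcript.
flat = pd.DataFrame({
    "price": [534433, 392333, 293222, 451000, 4322032],
    "bedrooms": [2.0, 3.0, 2.0, 4.0, 125.0],
    "size": [1500, 2500, 1500, 3200, 48000],
})

# Keep only the rows whose bedroom count is below the threshold of 20.
flat_dropped = flat[flat["bedrooms"] < 20]

print(flat_dropped)
```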
4:31
It is 125.0. Okay, next one is the mark. That means I do not want
4:38
to delete, but I want to mark it with 0 and 1. So here you can go for
4:42
this: when the flat's bedroom count is less than 20, mark it as 0; otherwise mark it
4:48
as 1. And I'm creating another column, that is, the flat outlier column. So now you can
4:53
find that this column has got filled up with the values zeros and ones, and
4:57
zeros have been put against those rows where, here you can find,
5:04
the bedroom count is less than 20. But whenever the bedroom
5:08
count is greater than 20, I'm finding this one has been labeled as 1,
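The marking step can be sketched with numpy's where function. The data values and the column name outlier are assumptions; the rule itself (0 below 20 bedrooms, 1 otherwise) is the one described in the video.

```python
import numpy as np
import pandas as pd

# Hypothetical values: the real numbers from the video are not in the transcript.
flat = pd.DataFrame({
    "price": [534433, 392333, 293222, 451000, 4322032],
    "bedrooms": [2.0, 3.0, 2.0, 4.0, 125.0],
    "size": [1500, 2500, 1500, 3200, 48000],
})

# 0 where the bedroom count is below 20, 1 otherwise; no rows are removed.
flat["outlier"] = np.where(flat["bedrooms"] < 20, 0, 1)

print(flat)
```

Marking keeps every record, so the flag can later be used as a feature or as a filter.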
5:12
according to the given code. Next one is rescaling. So here we're creating one
5:18
new column, that is, the log of square feet. So the log of square feet is there.
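The rescaling step can be sketched as taking the natural log of each flat size. The data values and the column name log_of_square_feet are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical values: the real numbers from the video are not in the transcript.
flat = pd.DataFrame({
    "price": [534433, 392333, 293222, 451000, 4322032],
    "bedrooms": [2.0, 3.0, 2.0, 4.0, 125.0],
    "size": [1500, 2500, 1500, 3200, 48000],
})

# Take the natural log of every size value x; this compresses the large value.
flat["log_of_square_feet"] = [np.log(x) for x in flat["size"]]

print(flat)
```

After the log, the extreme size is still the largest value but sits much closer to the others, which reduces its influence without deleting the record.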
5:25
So here we are taking x, taking the log of x, and
5:31
x is the respective flat size. So on the flat size, that means on this very
5:35
column, you have taken the log, and you have done the scaling to some extent. So now
5:42
if you print the respective data frame, it is getting printed with one
5:46
extra column, that is, the log of square feet. So now let me go for the
5:52
IQR, that is, the interquartile range. So here you can find that
5:59
flat is equal to df. I kept this df earlier also
6:05
so that I can work on the original data frame. So df is this one. I've kept
6:11
the original values there in df. So now here we have this interquartile range, IQR. The flat has been initialized with the old values, so those two extra columns have got deleted now, and then we are
6:27
calculating quartile one and quartile three. We know that quartile one means 25% of the data will be
6:33
below that value, and quartile two means the median, that is, 50% of the data will be below that value,
6:39
and quartile three means 75% of the data will be below that value. So flat dot
6:44
quantile, that is 0.25, so we are calculating quartile one, and here we
6:50
are calculating quartile three. The method name is quantile, and
6:55
0.25 and 0.75 I have passed. So, the interquartile range:
7:00
here you can find this one as Q3 minus Q1; that is the IQR. Print IQR, so
7:06
I'm printing this value. Here you can find that if you print this IQR,
7:10
I'm finding the IQR values for price, bedrooms and size, so
7:14
they are these. So now we take Q1 minus 1.5 into IQR and Q3 plus 1.5 into IQR, and any flat
7:25
whose corresponding value is less than the first or greater than the second will be
7:31
considered as the respective outlier. This has been combined with any, with axis equal to
7:38
1, so the condition is checked across the columns of each row, and then the result will be
7:42
the new data frame. You can find that the last record has got deleted. Here you can find that we are putting this one as not;
7:50
that means those records which do not satisfy either of these two conditions will
7:56
persist, and those records which satisfy any one of the conditions will be deleted.
8:02
And so the third row has got deleted from here. So this is my total code here.
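The IQR filter walked through above can be sketched as follows. The data values and the names lower, upper, is_outlier and flat_out are assumptions for illustration; the quantile method, the 1.5 multiplier, any with axis=1 and the final negation follow the description in the video.

```python
import pandas as pd

# Hypothetical values: the real numbers from the video are not in the transcript.
flat = pd.DataFrame({
    "price": [534433, 392333, 293222, 451000, 4322032],
    "bedrooms": [2.0, 3.0, 2.0, 4.0, 125.0],
    "size": [1500, 2500, 1500, 3200, 48000],
})

# Quartile one and quartile three, computed separately for each column.
Q1 = flat.quantile(0.25)
Q3 = flat.quantile(0.75)
IQR = Q3 - Q1
print(IQR)

# Fences: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] count as outliers.
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

# A row is an outlier if any one of its columns falls outside the fences.
is_outlier = ((flat < lower) | (flat > upper)).any(axis=1)

# The negation (~) keeps only the rows that violate neither fence.
flat_out = flat[~is_outlier]
print(flat_out)
```

Unlike the elliptic envelope, this rule needs no contamination guess, but it treats each column on its own and ignores relationships between columns.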
8:09
You can also type the code and do the respective experiments on it
8:14
to have a better understanding of how to detect and how to deal with outliers, and what the different types of outliers are.
8:22
We have discussed this in detail. Thanks for watching this.