MapReduce and Design Patterns - Median and Standard Deviation MapReduce
Oct 18, 2024
0:00
In this video we are discussing the median and standard deviation MapReduce task.
0:05
Whenever we have a set of data, we first sort it, and the middle value
0:10
is known as the median. The standard deviation measures how far
0:15
our data deviates from the average value. So let us discuss median and standard deviation
0:20
in MapReduce. We start with the median of a data set.
0:25
The median is the numerical value that separates the data
0:29
set into lower and upper halves. The lower half holds all the values
0:34
less than the median, and the upper half holds all the values higher
0:39
than the median. Finding it requires the complete data set, in sorted order.
0:44
In a simple MapReduce task it is very difficult to sort the data, because we are
0:50
dealing with a huge amount of data, so sorting it is a very difficult
0:55
task. We should use some better logic to perform the operation
0:59
efficiently. Next we discuss the standard deviation of a data set.
1:08
The standard deviation shows how much the data varies from its average
1:14
value, so we need to find the average first, before the deviation can be computed in the reduce task.
1:21
In our example we will take the comments.
1:24
We already have an XML file, comments.xml, and from it we find the median and standard deviation of the comment lengths.
1:35
So from the comments.xml file we shall calculate the median and the standard deviation
1:41
of the comment lengths. Let us go to a demonstration for an easier understanding of this concept.
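For reference, the two quantities can be written out as below for sorted values x_1 ≤ … ≤ x_n. This is a sketch of the standard definitions, using 1-based positions; the sample form of the standard deviation (dividing by n − 1) is chosen because the demonstration divides by the comment count minus one.

```latex
\[
\operatorname{median}(x) =
\begin{cases}
x_{(n+1)/2}, & n \text{ odd},\\[4pt]
\dfrac{x_{n/2} + x_{n/2+1}}{2}, & n \text{ even},
\end{cases}
\qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,
\qquad
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2}.
\]
```

With 0-based list indices, as in Java, the two middle positions for even n are n/2 − 1 and n/2.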
1:50
In this demonstration we look at our median and standard deviation MR task,
1:57
which falls under the summarization design pattern. comments.xml is our input
2:03
file, under the folder input/comments. Under this folder we have comments.xml.
2:09
Let me check its current size: it is 37.98 MB. We have taken
2:17
a portion of this comments.xml and are opening it.
2:23
Within the comments tag there are a number of rows,
2:27
and every row has multiple attributes.
2:34
The attributes are the ID, post ID, score, the comment text, creation date and user ID.
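A row with attributes like these could be turned into a map of name to value with a small helper. The sketch below is only an illustration: the class name `RowAttributes`, the attribute spellings (`Id`, `PostId`, `Score`, `Text`, `CreationDate`, `UserId`) and the sample row are assumptions, and the regular expression does not unescape XML entities.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: turns one <row ... /> line into a map of
// attribute name -> attribute value.
class RowAttributes {
    // Matches name="value" pairs; assumes values contain no raw double quotes.
    private static final Pattern ATTR = Pattern.compile("(\\w+)=\"([^\"]*)\"");

    static Map<String, String> parse(String rowXml) {
        Map<String, String> attrs = new HashMap<>();
        Matcher m = ATTR.matcher(rowXml);
        while (m.find()) {
            attrs.put(m.group(1), m.group(2));
        }
        return attrs;
    }

    public static void main(String[] args) {
        // Made-up sample row, not taken from the real comments.xml.
        String row = "<row Id=\"1\" PostId=\"35\" Score=\"2\" Text=\"Nice answer\" "
                + "CreationDate=\"2010-07-28T19:36:59.773\" UserId=\"17\" />";
        Map<String, String> attrs = parse(row);
        System.out.println(attrs.get("Text") + " / " + attrs.get("CreationDate"));
    }
}
```

A full XML parser would be the safer choice for real data; the regular expression just keeps the sketch short.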
2:43
Those are the attributes of every row. We have two classes: MedStdData, the median and standard deviation data class, and MedStdMRTask.java. MedStdData
2:57
implements Writable, so the Writable interface is implemented here. We
3:03
have a set of getter and setter methods, and the member
3:07
variables are median and standard deviation. For these two member variables we
3:11
have the getter and setter methods accordingly: getMedian, getSd,
3:16
setMedian, setSd. We are overriding readFields;
3:21
write is also overridden, and we are also
3:25
overriding the toString method for proper output. In
3:30
readFields, the two variables, median and standard deviation, are
3:34
read from the input with readDouble. Then we have the
3:40
write method, which writes out the median and standard
3:44
deviation accordingly, and the toString method is
3:49
there for proper display.
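The shape of such a value class can be sketched with the JDK alone. This is not the code from the video: the class name `MedStdDataSketch` is hypothetical, and the `implements org.apache.hadoop.io.Writable` clause a real job would need is deliberately left out so the example compiles without Hadoop on the classpath. The `write` and `readFields` methods follow the same pattern of writing and reading the two doubles in a fixed order.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Sketch of the value class holding a median and a standard deviation.
class MedStdDataSketch {
    private double median;
    private double sd;

    double getMedian() { return median; }
    void setMedian(double median) { this.median = median; }
    double getSd() { return sd; }
    void setSd(double sd) { this.sd = sd; }

    // Serialize the two fields in a fixed order.
    void write(DataOutput out) throws IOException {
        out.writeDouble(median);
        out.writeDouble(sd);
    }

    // Deserialize in the same order used by write.
    void readFields(DataInput in) throws IOException {
        median = in.readDouble();
        sd = in.readDouble();
    }

    @Override
    public String toString() {
        return median + "\t" + sd;
    }

    // Round-trips an instance through a byte array to show that
    // write and readFields are symmetric.
    static MedStdDataSketch roundTrip(MedStdDataSketch original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            original.write(new DataOutputStream(bytes));
            MedStdDataSketch copy = new MedStdDataSketch();
            copy.readFields(new DataInputStream(
                    new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The `toString` output is what ends up as the value columns in a text output file, which is why it is overridden.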
3:55
Next is MedStdMRTask.java. This class reads the users'
4:01
comments from comments.xml and determines the median
4:05
and standard deviation of the comment lengths per hour of the day. The
4:10
date format is also defined here,
4:14
because we need the hour from the creation date;
4:18
you can find the respective format there. We have two inner classes: one is the
4:25
MedStdMapper, which extends the Mapper class, and
4:32
the reducer, which extends the Reducer class, is another inner class.
4:38
Within the mapper we have
4:44
two variables, outputHour and outputCommentLength, and we also
4:50
override the map method. xmlParsed is assigned the output of the method
4:56
xmlToMap. xmlToMap is a method we have written already, which takes
5:01
the XML as input and returns a HashMap object as output. So it reads the
5:07
XML and returns a HashMap, which is stored in the xmlParsed
5:11
variable. We get the creation date and the comment text from it, with xmlParsed.get for the creation date and xmlParsed.get for the text.
5:20
Then outputHour is set from the hour of the creation date. The
5:27
getHours method is deprecated, but it still works. Then outputCommentLength
5:32
is set to the comment's length.
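The core of the map step, taking the hour of the creation date as the key and the comment length as the value, can be sketched without any Hadoop types. Several things here are assumptions: the class name `HourLengthSketch`, the attribute names `CreationDate` and `Text`, and the ISO-style timestamp format. The sketch also swaps in `java.time.LocalDateTime` where the video uses the deprecated `getHours` method.

```java
import java.time.LocalDateTime;
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Plain-Java stand-in for the body of map(): no Hadoop types are used.
class HourLengthSketch {
    // Returns (hour of day, comment length) for one parsed row,
    // or null when either attribute is missing.
    static Map.Entry<Integer, Integer> toHourAndLength(Map<String, String> parsed) {
        String created = parsed.get("CreationDate"); // assumed attribute name
        String text = parsed.get("Text");            // assumed attribute name
        if (created == null || text == null) {
            return null;
        }
        // LocalDateTime.parse accepts ISO-8601 local date-times such as
        // 2010-07-28T19:36:59.773 (an assumed format for this data set).
        int hour = LocalDateTime.parse(created).getHour();
        return new SimpleEntry<>(hour, text.length());
    }
}
```

In the real mapper the pair would be wrapped in Hadoop writable types and passed to the context instead of being returned.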
5:36
These key-value pairs are written to the context. Next we have the MedStdReducer, the median and standard deviation reducer, which extends the Reducer class. As in other
5:52
programs, here too we override the reduce method.
5:57
In this method we calculate the median
6:03
and also the standard deviation. We have defined one ArrayList,
6:09
the comment length list. Initially
6:13
outputValue.setSd is set to zero and the comment
6:19
length list is cleared. Then we go through
6:26
the values: for each one, the list's add method is called, so we get the
6:32
lengths and add them to the list. The total length is increased by the current
6:37
value of the length, and the comment count is also incremented. After doing
6:43
all this we sort the list, because we are going to calculate the
6:47
median, and the median is the middle value after sorting. We have defined one
6:52
variable for the median value. If the comment count is even, we
6:59
take the lengths at positions n/2 - 1 and n/2,
7:04
add them, and divide the sum by two. So we are taking the average of the lengths at
7:14
position n/2 - 1 and at position n/2; those two values are
7:19
averaged, and the result is taken as the median value. But if the comment
7:25
count is odd, then in the else part we take the value found
7:30
at the middle, and that is the median value. With outputValue.setMedian we store this as the
7:37
median value. In this way the median has been calculated.
7:42
Now let us calculate the standard deviation. First we calculate the mean, that is the
7:46
total length divided by the comment count. Then we go for the sum of squares: we
7:50
calculate (val - mean) times (val - mean) for each value, which means we
7:54
square the difference between val and the mean, and add up those
7:59
squares. Then we take the square root after
8:05
dividing the sum by comment count minus one. So we calculate the sum of squares, and
8:10
then outputValue.setSd is called with Math.sqrt of the
8:15
sum of squares divided by comment count minus one. In this
8:20
way the standard deviation has been calculated, and the result is written to the
8:24
context: the key-value pair is written to the context.
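The arithmetic done inside reduce can be tried out on a plain list. The following is a minimal sketch under stated assumptions, not the code shown in the video: the class and method names are hypothetical, no Hadoop types are used, and the guard for a single value is an addition so the sketch never divides by zero.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Plain-Java stand-in for the body of reduce(): takes all comment lengths
// seen for one key and returns {median, sample standard deviation}.
class MedianStdDevSketch {
    static double[] medianAndSd(List<Integer> lengths) {
        if (lengths.isEmpty()) {
            throw new IllegalArgumentException("need at least one value");
        }
        List<Integer> sorted = new ArrayList<>(lengths);
        long total = 0;
        for (int len : sorted) {
            total += len;
        }
        int count = sorted.size();
        Collections.sort(sorted);

        double median;
        if (count % 2 == 0) {
            // Even count: average the two middle values (0-based n/2 - 1 and n/2).
            median = (sorted.get(count / 2 - 1) + sorted.get(count / 2)) / 2.0;
        } else {
            // Odd count: the single middle value.
            median = sorted.get(count / 2);
        }

        double mean = (double) total / count;
        double sumOfSquares = 0.0;
        for (int len : sorted) {
            sumOfSquares += (len - mean) * (len - mean);
        }
        // Sample form (n - 1). With a single value this sketch returns 0
        // instead of dividing by zero; that guard is an addition.
        double sd = count > 1 ? Math.sqrt(sumOfSquares / (count - 1)) : 0.0;
        return new double[] { median, sd };
    }
}
```

For example, the lengths 5, 1, 4, 2, 3 give a median of 3 and a sum of squares of 10, so the standard deviation is the square root of 10 / 4. Holding every value for a key in memory, as this does, is the cost of the straightforward approach.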
8:30
Here we have the main function. We require two arguments from the command line; if the argument count is not equal to 2, we exit with an error message and exit code 2. Otherwise a Job instance is created.
8:44
Then we call addInputPath and setOutputPath with this job.
8:49
We also know the respective mapper class and reducer class that we have extended here,
8:55
so we write job.setMapperClass, job.setReducerClass, job.setOutputKeyClass and job.setOutputValueClass,
9:05
which set the output key class and output value class as well. Then we get a boolean: depending on whether the
9:10
job succeeds or fails, a boolean value is returned. In case of success
9:15
we return zero, and in case of failure we return one. That completes the code.
9:20
Now let us create the JAR file. We go
9:25
for Export on the package name, then give the proper path and the file name,
9:30
and then choose Next and Finish. But we have already created this JAR file, so we are
9:35
skipping this particular step now. The JAR file has been created, so we come back to the
9:41
command prompt. Here the command is hadoop jar, followed by the JAR path
9:49
and the JAR file name. Then comes medstd, the package name,
9:54
and MedStdMRTask, the class name. The input file,
10:01
comments.xml, is available under the /input/comments folder, and the output will be
10:06
created under output, the folder where the part file will be created
10:11
as the output file. So let me execute the command here.
10:16
We are getting the output; yes, it ran smoothly, so we got the
10:22
output. It is not created under input; we go to the root, and
10:27
under output we have part-r-00000, which is the
10:32
part file name. Let me open that one: we go back to the
10:37
terminal once again and issue the command hdfs dfs -cat, then give the path and the part file
10:47
name. We use part*, so all files starting with part will be printed,
10:52
but we have only one file there, so that one is printed. Here you can see
10:57
that the first column is the key, that is the hour, the second column is the median,
11:02
and the third column is the standard deviation. So I have the median
11:07
and standard deviation calculated from the comment lengths. I have shown you
11:14
how to do that. Let me delete this output folder; it is not mandatory, but we are
11:19
deleting it so that we can run other MapReduce programs.
11:25
Now it has been deleted. Thanks for watching.