MapReduce and Design Patterns - Min Max Count MapReduce
Oct 18, 2024
0:00
In this video we are discussing Min Max Count MapReduce.
0:05
We shall also go through the implementation of this problem, working with one XML file containing
0:12
such data. So let us discuss this problem in more detail.
0:17
So what is Min Max Count MapReduce? In the MapReduce task we will group the badges.xml data on the user ID.
0:26
We have one XML file, badges.xml, which exists in the respective folder.
0:32
We have shown that in the previous videos. There is one column known as the user ID, and we will form groups over it.
0:41
For each user, we will find the minimum and the maximum date-time of getting a badge.
0:48
So against each and every user ID, we shall find the minimum and the maximum date-time of getting a badge.
0:55
We also count the number of badges of that user. So these are the tasks we are going to do.
1:01
That is why it is called Min Max Count MapReduce.
1:06
So in this example, the user ID is the key, and min_date, max_date and count are the respective values.
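To make the key and value idea concrete, here is a minimal sketch in plain Java, without Hadoop, of what the job computes for each user ID. The class and method names (MinMaxCountConcept, Summary, summarize) and the sample dates are assumptions made for illustration; they are not taken from the video's project.

```java
import java.time.LocalDateTime;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: plain Java, no Hadoop. All names here are hypothetical.
class MinMaxCountConcept {

    // The value kept for each user ID: earliest date, latest date, number of records.
    static final class Summary {
        final LocalDateTime minDate;
        final LocalDateTime maxDate;
        final long count;

        Summary(LocalDateTime minDate, LocalDateTime maxDate, long count) {
            this.minDate = minDate;
            this.maxDate = maxDate;
            this.count = count;
        }

        // Combine two partial summaries for the same user.
        Summary merge(Summary other) {
            LocalDateTime min = minDate.isBefore(other.minDate) ? minDate : other.minDate;
            LocalDateTime max = maxDate.isAfter(other.maxDate) ? maxDate : other.maxDate;
            return new Summary(min, max, count + other.count);
        }
    }

    // Each event is {userId, ISO date-time}. One record starts as (date, date, 1),
    // and records with the same key are merged.
    static Map<String, Summary> summarize(List<String[]> events) {
        Map<String, Summary> byUser = new TreeMap<>();
        for (String[] event : events) {
            LocalDateTime date = LocalDateTime.parse(event[1]);
            byUser.merge(event[0], new Summary(date, date, 1), Summary::merge);
        }
        return byUser;
    }

    public static void main(String[] args) {
        // Made-up sample data, not rows from the real input file.
        List<String[]> events = List.of(
            new String[] {"3", "2010-07-19T19:39:07"},
            new String[] {"5", "2010-07-20T08:15:00"},
            new String[] {"3", "2010-08-02T10:00:00"});
        summarize(events).forEach((userId, s) ->
            System.out.println(userId + "\t" + s.minDate + "\t" + s.maxDate + "\t" + s.count));
    }
}
```

In the real job the grouping by key is done by the MapReduce framework between the map and reduce phases; the in-memory map above only stands in for that step.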
1:15
So let us go for a practical demonstration of an easy implementation of this problem.
1:21
We are discussing the Min Max Count problem in the summarization design pattern. In this file system we have the file
1:32
badges.xml, which is our input file. We will discuss
1:38
badges.xml later, but first we want to concentrate on the Java files.
1:46
badges.xml is our input file, which is under the folder input/badge;
1:52
the badges.xml file is under this folder. Now we are opening Eclipse.
1:57
We have two Java classes here. One is MinMaxSumData.java and the other
2:03
is MinMaxSumMRTask.java. First we discuss MinMaxSumData.java.
2:09
It implements the Writable interface, and it has the member
2:15
variables minDate, maxDate and sum. Here sum actually means the count that we will be
2:20
calculating. We have the constructor, MinMaxSumData,
2:25
as the respective constructor, and the other methods are also
2:31
there in this class. This is the date format, the
2:35
format in which the date will be interpreted. We have
2:40
getMinDate, getMaxDate and getSum, which are the getter methods, and
2:45
setMinDate, setMaxDate and setSum, which are the setter methods; readFields, write and
2:50
toString are the other methods. So we have readFields, write and
2:54
toString, the getter and setter methods, and the respective constructor.
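A value class of this shape might be sketched as follows. This is not the video's source code: the field types, the date pattern and the RoundTripDemo helper are assumptions, and the `implements org.apache.hadoop.io.Writable` clause is left out so that the example compiles without Hadoop on the classpath. The write and readFields signatures match the ones that the Writable interface declares.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.text.SimpleDateFormat;
import java.util.Date;

// Sketch of the value class. In the Hadoop project it would also declare
// "implements Writable"; that is omitted here to avoid the Hadoop dependency.
class MinMaxSumData {
    // Assumed pattern; the transcript does not state the actual date format.
    private static final String DATE_PATTERN = "yyyy-MM-dd'T'HH:mm:ss.SSS";

    private Date minDate;
    private Date maxDate;
    private long sum; // holds the count; the type is an assumption

    MinMaxSumData() {}

    MinMaxSumData(Date minDate, Date maxDate, long sum) {
        this.minDate = minDate;
        this.maxDate = maxDate;
        this.sum = sum;
    }

    public Date getMinDate() { return minDate; }
    public Date getMaxDate() { return maxDate; }
    public long getSum() { return sum; }
    public void setMinDate(Date minDate) { this.minDate = minDate; }
    public void setMaxDate(Date maxDate) { this.maxDate = maxDate; }
    public void setSum(long sum) { this.sum = sum; }

    // Serialize the three fields in a fixed order.
    public void write(DataOutput out) throws IOException {
        out.writeLong(minDate.getTime());
        out.writeLong(maxDate.getTime());
        out.writeLong(sum);
    }

    // Read the fields back in the same order they were written.
    public void readFields(DataInput in) throws IOException {
        minDate = new Date(in.readLong());
        maxDate = new Date(in.readLong());
        sum = in.readLong();
    }

    @Override
    public String toString() {
        SimpleDateFormat format = new SimpleDateFormat(DATE_PATTERN);
        return format.format(minDate) + "\t" + format.format(maxDate) + "\t" + sum;
    }
}

// Hypothetical helper: writes a value to bytes and reads it back.
class RoundTripDemo {
    static MinMaxSumData roundTrip(MinMaxSumData value) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            value.write(new DataOutputStream(buffer));
            MinMaxSumData copy = new MinMaxSumData();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(buffer.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        MinMaxSumData value = new MinMaxSumData(new Date(1000L), new Date(5000L), 3L);
        System.out.println(roundTrip(value));
    }
}
```

The point of the round trip is that write and readFields must agree on field order, because Hadoop uses exactly these two methods to move values between the map and reduce phases.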
3:01
Now we discuss MinMaxSumMRTask.java. This is the other Java class, and it contains a class
3:06
that extends the Mapper class, that is, it inherits from Mapper, and in it we have the map method. The other methods are also there, and we will discuss them. The class is MinMaxSumMapper,
3:27
which extends the Mapper class.
3:36
Then there is the output data: the Text opUserId and the MinMaxSumData opData. These are the variables we will be
3:46
using. We also have one method, xmlToMap, which
3:52
converts the XML to a hash map. So let us discuss this method first.
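A conversion helper of this kind might look like the sketch below. It is not the video's implementation: the regular expression approach, the class name XmlRowParser and the sample row (including attribute names such as UserId and Date) are assumptions chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: turns one XML row into a map of attribute name to value.
class XmlRowParser {
    // Matches name="value" pairs; assumes values contain no double quotes.
    private static final Pattern ATTRIBUTE = Pattern.compile("(\\w+)=\"([^\"]*)\"");

    static Map<String, String> xmlToMap(String xmlRow) {
        Map<String, String> attributes = new HashMap<>();
        Matcher matcher = ATTRIBUTE.matcher(xmlRow);
        while (matcher.find()) {
            attributes.put(matcher.group(1), matcher.group(2));
        }
        return attributes;
    }

    public static void main(String[] args) {
        // Made-up sample row; the attribute names are assumed.
        String row = "<row Id=\"1\" UserId=\"3\" Name=\"Autobiographer\" "
                   + "Date=\"2010-07-19T19:39:07.047\" Class=\"3\" TagBased=\"False\" />";
        Map<String, String> xmlParsed = xmlToMap(row);
        System.out.println(xmlParsed.get("UserId") + " " + xmlParsed.get("Date"));
    }
}
```

A regular expression is enough here because each input line is a single self-closing element; a general XML document would need a real parser.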
3:57
xmlToMap takes a string as its input argument and
4:02
returns a hash map. So this is a conversion method, and it has been
4:07
called here, with its output kept in xmlParsed. Then xmlParsed.get for the date and xmlParsed.get for the
4:16
user ID give us the string user and the string
4:23
creation date. So these two variables are initialized using the xmlParsed.get
4:29
method. Then we have the date format, which we have already
4:35
discussed; this is the format in which we expect the dates in our
4:40
data file. You can see the respective setter methods here,
4:48
which have been kept in the try block, and the catch block is there
4:53
to catch any exception that
4:57
occurs. Then there is context.write with the user ID and opData, so we will
5:05
be counting how many times this particular pair has occurred. And this is
5:10
our MinMaxSumReducer, which extends the Reducer class.
5:18
The Reducer class has been inherited here, and we have
5:23
overridden the reduce method. In this section we call opData.setMaxDate(null), opData.setMinDate(null) and opData.setSum(0); that is
5:34
the initial count for this value, which we have kept as
5:39
zero at first, and finalSum has also been initialized with zero. Then, for each
5:45
MinMaxSumData value, we have two if conditions. If opData.getMinDate()
5:49
is equal to null, or opData.getMinDate().compareTo(data.getMinDate()) shows that the incoming
5:56
min date is earlier, then depending upon this condition we update the min
6:02
date with data.getMinDate(). We compare for the max date also.
6:08
If the max date is equal to null, or
6:12
opData.getMaxDate().compareTo(data.getMaxDate()) is less than zero, then the opData
6:17
max date will be set to data.getMaxDate().
6:23
So in this way the setMaxDate and setMinDate methods are called on the opData object to keep the min date and max date respectively, depending upon the conditions. The final sum is also increased by the current count: finalSum += data.getSum(), so the current sum is added to the previous sum.
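The reduce logic described here can be sketched in plain Java as follows. The Data class is a minimal stand-in for the value class and ReduceStepSketch is a hypothetical name; in the actual job this loop would sit inside the overridden reduce method of the Reducer subclass, and the result would be written to the context instead of returned.

```java
import java.util.Arrays;
import java.util.Date;
import java.util.List;

// Illustrative only: the merge loop of the reducer, without Hadoop types.
class ReduceStepSketch {

    // Minimal stand-in for the value class (hypothetical).
    static final class Data {
        Date minDate;
        Date maxDate;
        long sum;

        Data(Date minDate, Date maxDate, long sum) {
            this.minDate = minDate;
            this.maxDate = maxDate;
            this.sum = sum;
        }
    }

    // Merges all values that arrived for one user ID.
    static Data reduce(Iterable<Data> values) {
        Data opData = new Data(null, null, 0); // start with no dates and a zero count
        long finalSum = 0;
        for (Data data : values) {
            // Keep the earliest min date seen so far.
            if (opData.minDate == null || opData.minDate.compareTo(data.minDate) > 0) {
                opData.minDate = data.minDate;
            }
            // Keep the latest max date seen so far.
            if (opData.maxDate == null || opData.maxDate.compareTo(data.maxDate) < 0) {
                opData.maxDate = data.maxDate;
            }
            finalSum += data.sum; // add the current count to the running total
        }
        opData.sum = finalSum;
        return opData;
    }

    public static void main(String[] args) {
        List<Data> values = Arrays.asList(
            new Data(new Date(3000L), new Date(3000L), 1),
            new Data(new Date(1000L), new Date(1000L), 1),
            new Data(new Date(2000L), new Date(2000L), 1));
        Data result = reduce(values);
        System.out.println(result.minDate.getTime() + " " + result.maxDate.getTime() + " " + result.sum);
    }
}
```

Because minimum, maximum and sum can all be merged in any order, the same logic can also serve as a combiner, although the video does not say whether one is configured.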
6:41
After that we write to the context;
6:46
we have used the ctx.write method for that. There is the map
6:52
method as well, and now let us come to the main method.
6:56
Here we require two arguments to be passed. If the number
7:02
of arguments is not two, we exit after printing an error
7:07
message. The first argument is accessed as args[0] and the second
7:12
argument is accessed as args[1], because we require two arguments to be passed here.
7:18
Then Job job = Job.getInstance, which takes the configuration and the name of the min max sum job,
7:26
and then job.setJarByClass(MinMaxSumMRTask.class), so the
7:32
class we have defined is mentioned there. We require two arguments: the
7:36
first one, args[0], is the input, used in addInputPath, and the second argument, args[1],
7:40
is the output, used in setOutputPath. These we pass as command-line
7:45
arguments whenever we run the code. setMapperClass takes
7:50
MinMaxSumMapper and setReducerClass takes MinMaxSumReducer, so these two classes we are
7:56
just assigning through setMapperClass and setReducerClass. We also have
8:03
a boolean, success: if the job succeeds, the exit is done with zero, otherwise the exit
8:08
is done with one. We shall now go to badges.xml. This is badges.xml.
8:19
You can see that there are multiple rows; we have taken a snapshot of it. Here
8:25
it is one XML file. We have the ID, which is 1, 2, 3, 4 and so on; the user ID is
8:31
1, 3, 5, 8, 10 and so on. We have different types of users:
8:36
we can have Autobiographer, Teacher, Student and
8:40
Supporter, so there are many different categories of users. We also have the
8:45
respective date and time, in the same format that we expected. We
8:50
have the class IDs, and there is one Boolean field, tag based. In our code we
8:55
did not use the tag-based field, but we could use it as well. The file is under the folder
8:59
/input/badge and the file name is badges.xml. So we have
9:06
seen the Java files, and we have got the idea of badges.xml, the
9:10
input file. Now we shall show you how to run the code. We have opened the
9:14
terminal; let us clear it. Now we shall write the command. It will be
9:25
hadoop jar, then the path, MapReduce design pattern, slash the jar file,
9:32
the summarization patterns jar. This jar file we create from here: we
9:37
go to the summarization patterns project, right-click, choose Export, and export it as a JAR file. The JAR file is kept in the respective folder
9:51
as a Java archive.
9:59
In this way the jar file is created. Now let us come back to
10:03
the command prompt. So this is the jar file, and its path has been
10:08
provided. Next comes minmaxsum.MinMaxSumMRTask, which is the class
10:20
that will be executed. Then we require the input file. The input file is under the
10:25
folder /input/badge,
10:30
so the input file exists under this folder.
10:30
Then we also have the
10:35
output: output is the folder in which the output will be created. Now we shall
10:40
execute this command. The output is to be created under the output folder in the
10:51
form of a part file. The output has now been created in the output folder as a
10:57
part file. Let me open it. This is badges.xml, the input,
11:04
and here we have the output folder. If we open the output folder, we have
11:10
part-r-00000 there; that is the output file. We are supposed to see the
11:17
content of this output file, so let me go to the terminal
11:27
and use the cat command through HDFS.
11:40
We give the path name and then any file starting with part; we need not write
11:45
the full file name, so any file starting with part will be printed,
11:52
because the option we have given is cat, so the content will be
11:56
printed. Yes, this is the content. You can see that in the first column
12:05
we have the IDs, then the min date, then the max date,
12:10
and the number of badges given to the respective ID. Here the count is about 29.
12:16
So the last column contains the count: ID, min date, max date and the count.
12:24
That was what we intended to print, and it has been written to the output folder.
12:32
Now, as we will be running many other jobs, we can also delete this output folder.
12:37
We shall delete the output folder; let me issue the command here. We delete the output folder so that
12:46
next time we do not face any problem; it is not mandatory. So it is -rm for
12:54
remove, with the recursive option. The folder has been deleted. Thanks for watching this video.
#Data Management
#Programming