MapReduce and Design Patterns - Cartesian Product Pattern Example
Oct 18, 2024
0:00
In this video we are discussing the Cartesian product pattern example.
0:05
So here we shall go for one implementation of this concept. In the previous video we discussed what the Cartesian product design pattern is.
0:13
So now let us go for the implementation through one example. So what is the assignment?
0:19
The assignment is something like this. In this example we have designed the job to find the cross product of the same data file with itself, and we are
0:27
providing the tags.xml file to find the cross product. So here we'll be using
0:32
only one XML file, that is tags.xml. We can use any other file, but as this
0:39
process is costly, we are providing the XML file which is the smallest
0:44
in size. So now let us go for one practical implementation of this concept to
0:50
get the idea in a better way. Under the category of the join design pattern
0:56
we are going to discuss the implementation of the Cartesian product example, and
1:02
we shall handle this tags.xml. So that will be our input XML file, under the folder
1:08
/input/tags. So this is the one XML file we will be using, and we shall go for the
1:15
Cartesian product operation. So let me show you the current content of this tags.xml: what
1:22
tags are there, residing under this /input/tags. So let me go for this
1:27
tags.xml. Under the tags element we're having multiple rows, and here I'm
1:32
highlighting the respective attributes under the row tags: Id, TagName, Count,
1:37
ExcerptPostId and then WikiPostId. So these are the respective attributes, and
1:45
multiple rows are there in this tags.xml. Let me jump to our Java file. It is
1:51
having only one Java file, that is CartesianProductMRTask. So in CartesianProductMRTask
1:55
we're having multiple inner classes under this class. The first inner class is
2:00
CartesianInputFormat, which extends FileInputFormat. So here we're defining
2:06
some public static final String variables, and we're having the variables like
2:12
left input format, which is equal to "cart.left.inputformat", and all this
2:18
right-hand side, whatever I am writing, is available under the Hadoop API. So left input format is "cart.left.inputformat".
2:25
Then we shall be having our left input path, so another
2:32
public static final String. So left input path will be initialized with
2:37
its own "cart.left" key, and then come right input format and right input path. So all these public static final
2:45
Strings will be initialized here. So we have initialized all four of them.
2:49
Then here we're going to have one method, that is setLeftInput. This method we're
2:55
defining, and here we're writing only two lines on the job conf: the left input
3:00
format will be set with inputFormat.getCanonicalName, and the left input path
3:07
will be set with inputPath. So those are the two lines we have written under this
3:12
setLeftInput. These two lines we have written I'm just marking so that you can understand.
3:19
And now we are having setRightInput, another method we have written under this class.
3:25
So here we are having two lines: on the job conf, the right input format is set with
3:31
inputFormat.getCanonicalName, and the right input path will be initialized with this inputPath.
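The two setter methods described above can be sketched in plain Java. This is a minimal sketch under stated assumptions, not the code shown in the video: a HashMap stands in for Hadoop's JobConf so that the example runs without Hadoop on the classpath, the class name CartesianConfigSketch is hypothetical, and the exact key strings (in particular the two path keys, which the walkthrough does not spell out) are assumptions.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for the setLeftInput / setRightInput helpers.
class CartesianConfigSketch {
    // Key names are assumptions modelled on the "cart.left..." naming in the walkthrough.
    static final String LEFT_INPUT_FORMAT = "cart.left.inputformat";
    static final String LEFT_INPUT_PATH = "cart.left.path";
    static final String RIGHT_INPUT_FORMAT = "cart.right.inputformat";
    static final String RIGHT_INPUT_PATH = "cart.right.path";

    // Two lines per setter: store the format's class name and the input path.
    static void setLeftInput(Map<String, String> job, Class<?> inputFormat, String inputPath) {
        job.put(LEFT_INPUT_FORMAT, inputFormat.getCanonicalName());
        job.put(LEFT_INPUT_PATH, inputPath);
    }

    static void setRightInput(Map<String, String> job, Class<?> inputFormat, String inputPath) {
        job.put(RIGHT_INPUT_FORMAT, inputFormat.getCanonicalName());
        job.put(RIGHT_INPUT_PATH, inputPath);
    }

    public static void main(String[] args) {
        Map<String, String> job = new HashMap<>();
        // Object.class is only a placeholder; the real job passes an input format class.
        setLeftInput(job, Object.class, "/input/tags");
        setRightInput(job, Object.class, "/input/tags");
        System.out.println(job);
    }
}
```

Storing the class name as a string is what lets the format be re-created later by reflection, which is how the walkthrough uses these values in getSplits and in the record reader.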
3:36
Next we are going to override one method, that is getSplits.
3:41
We're having the method getSplits and we're overriding that one. It returns an InputSplit
3:46
array. So here we're defining two InputSplit array objects, leftSplits and rightSplits.
3:54
So here you can see getInputSplits called with conf, conf.get of the left input format and
4:01
the left input path, then the right input format and the right input path, and numSplits in both calls.
4:07
So those we have passed as the input arguments; we have just mentioned them there.
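getSplits combines the two split arrays named above into a single array in which every left split is paired with every right split. A minimal plain-Java sketch of that pairing is given here; strings stand in for Hadoop InputSplit objects, a two-element String array stands in for CompositeInputSplit(2), and the class and method names are hypothetical.

```java
import java.util.Arrays;

// Plain-Java illustration of pairing every left split with every right split.
class SplitPairingSketch {
    static String[][] pairSplits(String[] leftSplits, String[] rightSplits) {
        // The result size is the product of the two lengths.
        String[][] finalSplits = new String[leftSplits.length * rightSplits.length][];
        int i = 0;
        for (String left : leftSplits) {
            for (String right : rightSplits) {
                // Stand-in for: new CompositeInputSplit(2), then add(left), add(right).
                finalSplits[i] = new String[] { left, right };
                i++;
            }
        }
        return finalSplits;
    }

    public static void main(String[] args) {
        String[][] pairs = pairSplits(new String[] { "L0", "L1" }, new String[] { "R0", "R1", "R2" });
        System.out.println("Total splits to process: " + pairs.length);
        for (String[] pair : pairs) {
            System.out.println(Arrays.toString(pair));
        }
    }
}
```

With 2 left splits and 3 right splits the sketch produces 6 pairs, which is why the job becomes expensive quickly as the inputs grow.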
4:12
Next we're going to have the size of finalSplits, that is the left size times the right
4:16
size. That is obvious, because in case of a Cartesian product we are going to have
4:20
all the possible combinations of the left schema and the right schema, so the lengths
4:24
will be multiplied, as we have mentioned here: that is leftSplits.length
4:29
times rightSplits.length. The number of columns, that is the number of
4:34
attributes, will be added in the final result, but the number of rows will
4:40
be multiplied in the final result; that is a feature of the Cartesian product. So here we're declaring int i equal to zero. InputSplit
4:49
left item will be iterating over leftSplits, and InputSplit right item will
4:56
be iterating over rightSplits. So starting from i equal to zero we are executing these
5:01
two nested for loops. So finalSplits[i] is equal to new CompositeInputSplit(2).
5:07
So we're going to have this left item and right item, which we're going to
5:13
add here using these lines, and then i++. In this way we shall go on
5:19
adding: finalSplits[i] will be populated with this left item and with this right
5:24
item, and finalSplits[i] will at first be instantiated with this new
5:28
CompositeInputSplit(2). So in this way the for loops get executed; I am marking it so that
5:35
you can understand it better. The i++ will be done for each and every
5:39
loop iteration, and we are writing one log line only for the final check, saying how many final splits there are to process, so that will
5:49
be printed when we execute our code, and then we return finalSplits. Everything has been kept in the try-catch block for exception handling. This
5:58
finalSplits.length will be printed when we show you the output there, and the
6:02
value will be returned. Now we are going for another method, that is getRecordReader, which will return the
6:17
CartesianRecordReader object, and here it will be having the composite input
6:21
split, converting the split: there is a typecast we're doing to CompositeInputSplit; then the job
6:26
conf and the reporter. So these three parameters will be passed to the constructor of this
6:33
particular object, which we have created from the class CartesianRecordReader. So I'm
6:38
just marking that one; you can see from where the values came: they came from the input parameters. Now the next method is getInputSplits.
6:49
So in case of getInputSplits we're defining one FileInputFormat object,
6:56
that is inputFormat, and this particular
7:00
object has got instantiated using ReflectionUtils.newInstance of Class.forName of the input format class, comma
7:09
the job conf. So for this particular class the new instance has been created with
7:14
the job conf, which is passed as the input parameter. So they have been mentioned there. Then inputFormat.setInputPaths with the job configuration,
7:22
that is the job conf, and the input path. So the input path is there; that has been
7:28
passed as an input string. When we'll be calling these functions you can see the
7:32
parameters being passed. And then we're returning inputFormat.getSplits
7:36
of job conf and number of splits. So in this way the methods are ready.
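The getInputSplits helper relies on creating an object from a class name stored as a string. The walkthrough names Hadoop's ReflectionUtils.newInstance together with Class.forName; the sketch below shows only the underlying idea with plain Java reflection, so it is an approximation rather than the Hadoop call itself. The class name ReflectionSketch and the helper newInstanceByName are hypothetical, and java.util.ArrayList is used purely as a placeholder for an input format class.

```java
// Plain-Java illustration of instantiating a class from its name.
class ReflectionSketch {
    static Object newInstanceByName(String className) {
        try {
            // Look the class up by name, then call its no-argument constructor.
            Class<?> cls = Class.forName(className);
            return cls.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot instantiate " + className, e);
        }
    }

    public static void main(String[] args) {
        // In the job, the name would come from the configuration (left or right input format).
        Object instance = newInstanceByName("java.util.ArrayList");
        System.out.println(instance.getClass().getCanonicalName());
    }
}
```

This is why the setter methods store getCanonicalName in the configuration: a string survives in the job configuration, and the format object is rebuilt from it where it is needed.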
7:43
We're having another inner class, that is CartesianRecordReader, which implements the interface RecordReader, and under this record reader we're
7:53
having some methods; the methods are to be implemented, but first we are defining
7:59
private RecordReader objects initialized with null: the left record reader and the right record reader, both initialized with null. Then we store
8:08
configuration to recreate the right record reader. So of type FileInputFormat
8:15
we have created one object, that is rightFIF, the right input format. Then we are
8:23
having the JobConf object, that is rightConf; the InputSplit object, that is
8:31
rightIS; and the Reporter object, that is rightReporter. So these are the stored
8:38
configuration to recreate the right record reader. We're going for the helper
8:42
variables like left key, left value, right key and right value, and we are
8:48
initializing next left with true and done with false. So these
8:53
boolean variables are there; they are known as helper variables. So all this
8:56
we have defined under this inner class. You can find that CartesianRecordReader is one
9:06
inner class, and this is the constructor under this class. So
9:11
this constructor has got multiple arguments, and those arguments are being passed to initialize rightConf, rightIS and rightReporter. So
9:20
these variables are getting set from the input parameters passed to this
9:25
record reader constructor. You can find that I'm just marking them so that you can
9:29
see from where the values are coming, and next one is the right reporter
9:36
which will be initialized with this reporter which we passed as the input argument to this constructor. Now we have kept the next block within the try-catch.
9:45
FileInputFormat leftFIF, the left file input format: so leftFIF we're
9:52
initializing, passing ReflectionUtils.newInstance.
9:59
So Class.forName of conf.get of CartesianInputFormat's left input format, and then conf.
10:09
So in this way we are just initializing some variables, and all these variables will be required
10:15
for the left record reader. To create the left record reader we are just creating all this.
10:20
So left record reader is equal to leftFIF.getRecordReader of split
10:25
.get(0), which means we take the first split out of the composite split as the first
10:29
argument, then conf, and then there is the reporter. These
10:37
are the arguments passed. Now going for the right input format: we are
10:43
going for the right input format, so here we are having this one: Class.forName
10:48
is having conf.get of CartesianInputFormat's right input format, and conf.
10:54
So the same thing which we did for leftFIF we are doing here also. Then the right record reader has been initialized accordingly.
11:08
So everything has been kept in the try-catch block. Next I'm going to create the key-value pairs for parsing.
11:15
So left key will be initialized with createKey and left value will be initialized accordingly with createValue; then
11:24
right key and right value will also be initialized in the same way, as I have marked
11:29
here. Now here we're overriding a few methods, that is public Text createKey, which
11:35
returns a new Text object; that is the createKey we were referring to there. The next
11:45
one is createValue, which will be returning a new Text object. We're having
11:51
this public long getPos, which throws IOException and returns leftRecordReader.getPos, so that will be returned
11:59
there. Then we are overriding the method next, in which we are having the Text key and
12:07
the Text value. So now here is the body of the next method: if next left is true, and if leftRecordReader.next returns false, then done is equal to true and break.
12:25
That means it will come out from the loop, and that check will be done for the next key-value pair.
12:32
So if leftRecordReader.next has nothing more to return, then done is equal to true and break.
12:39
Else key.set of leftValue.toString, so the value will be converted to a string and the key will be set, and next
12:47
left is equal to done is equal to false. So in this way we have just written this code. Then reset
12:55
the right record reader: you are going for the next line, that is "reset the right record reader".
13:02
You can find each and every line has been associated with the respective comments. The right record
13:07
reader has been initialized with the right input format's getRecordReader, and the required parameters
13:12
have been passed. Then we check whether rightRecordReader.next is available; these are the
13:18
parameters, I'm showing them, just see, these are the parameters that are there;
13:22
we have passed them. So if rightRecordReader.next of right key and right
13:28
value, then we're having that if block there, so value.set; that is, if
13:39
rightRecordReader.next of right key and right value returns true, then value.set of rightValue.toString, converting it to a
13:47
string and assigning it to the value. Else next left is equal to true, so on completing
13:52
the right side we go to the left again. So the loop repeats while next left is true. So in
13:57
this way the things will be done, and we return not done: that means if done
14:02
is equal to true then it will be returning false; if done is equal to false then it
14:06
will be returning true. So that is the boolean output that will be there.
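The control flow of the next method described above can be sketched in plain Java. This is a simplified illustration, not the video's code: in-memory lists and iterators stand in for the Hadoop record readers, re-creating the right iterator stands in for resetting the right record reader, and the class name CartesianReaderSketch and the helper readAll are hypothetical. The flags nextLeft and done play the roles of the two helper booleans.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Plain-Java illustration of the record reader's next() loop.
class CartesianReaderSketch {
    private final Iterator<String> left;
    private final List<String> rightSource;
    private Iterator<String> right;
    private boolean nextLeft = true;
    private boolean done = false;
    String key;
    String value;

    CartesianReaderSketch(List<String> leftRecords, List<String> rightRecords) {
        this.left = leftRecords.iterator();
        this.rightSource = rightRecords;
    }

    // Advances to the next (key, value) pair; returns false when all pairs are consumed.
    boolean next() {
        do {
            if (nextLeft) {
                if (!left.hasNext()) {
                    done = true;          // no more left records: everything is finished
                    break;
                }
                key = left.next();        // the left record becomes the key
                nextLeft = false;
                done = false;
                right = rightSource.iterator(); // reset the right side for this left record
            }
            if (right.hasNext()) {
                value = right.next();     // the right record becomes the value
            } else {
                nextLeft = true;          // right side exhausted: move to the next left record
            }
        } while (nextLeft);
        return !done;
    }

    // Convenience helper: collects every pair produced by next().
    static List<String[]> readAll(List<String> leftRecords, List<String> rightRecords) {
        CartesianReaderSketch reader = new CartesianReaderSketch(leftRecords, rightRecords);
        List<String[]> pairs = new ArrayList<>();
        while (reader.next()) {
            pairs.add(new String[] { reader.key, reader.value });
        }
        return pairs;
    }

    public static void main(String[] args) {
        for (String[] pair : readAll(Arrays.asList("a", "b"), Arrays.asList("x", "y"))) {
            System.out.println(pair[0] + "\t" + pair[1]);
        }
    }
}
```

The right side is re-read from the start once per left record, which is the main reason the pattern is costly in both time and I/O.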
14:09
So the function will have some output, that means a return value. Now overriding the close method: we are not doing anything special, just
14:17
closing the respective left record reader and right record reader; we are
14:22
closing them here. So these two lines were written under the overridden close method.
14:28
Now we are having getProgress, so return leftRecordReader.getProgress;
14:33
so that is our getProgress. We have also overridden the respective
14:38
method. Now let me go for the mapper class. So now here CartesianPatternMapper extends
14:49
MapReduceBase and implements the interface Mapper. You know that whenever you are going to implement the
14:54
Mapper interface, you must write the map method,
15:00
because the map method is only declared, not defined, in the Mapper interface. So here this is the case. Now we'll
15:06
be going for private Text outputKey equal to new Text; we're defining one new Text, and then the
15:12
map function. After defining this output key we have written
15:18
the map function. Within the map function we have checked key.toString
15:23
.equals of value.toString; if it is not true, that is when both are not equal,
15:30
then we are defining two String arrays: one is leftTokens and the other one is
15:36
rightTokens. You can find that they are getting
15:41
initialized by splitting key.toString and value.toString. Here "\\s" is actually
15:47
the regular expression for a blank space; the blank space
15:51
in a regular expression we express as \s, but we are writing that one within
15:55
double quotes, so we're writing it as "\\s". So with leftTokens and rightTokens
16:00
you can find we are defining two HashSet objects, that is leftSet and
16:06
rightSet. These are the two HashSet objects we have defined
16:10
and instantiated using Arrays.asList of leftTokens and rightTokens.
16:19
So int sameWordCount is equal to zero. A StringBuilder object is defined as words, equal to new StringBuilder.
16:28
We're executing one for loop. That means a String element which will be iterating over this leftSet.
16:35
If rightSet contains the element, that is, if the element is also available in the right set,
16:40
then words.append of element plus comma, so comma will be the delimiter
16:46
here, and sameWordCount will be increased by one. After
16:52
executing this loop, if sameWordCount is
16:58
greater than two, then outputKey.set of words plus one tab and then the key,
17:04
and output.collect will be called with the output key and the
17:10
value. So I'm just marking them so that you can understand: output
17:16
.collect of output key comma value. Now let me come to the main function.
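The map logic described above can be sketched without Hadoop types. This is a minimal sketch under stated assumptions: plain strings stand in for the Text key and value, an Optional return value stands in for the call to output.collect, and the names MapperSketch and mapPair are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Optional;
import java.util.Set;

// Plain-Java illustration of the mapper's comparison of one record pair.
class MapperSketch {
    // Returns {outputKey, value} when the pair shares more than two words, otherwise empty.
    static Optional<String[]> mapPair(String key, String value) {
        if (key.equals(value)) {
            return Optional.empty();               // skip a record paired with itself
        }
        // "\\s" in Java source is the regular expression \s (a whitespace character).
        String[] leftTokens = key.split("\\s");
        String[] rightTokens = value.split("\\s");
        Set<String> leftSet = new HashSet<>(Arrays.asList(leftTokens));
        Set<String> rightSet = new HashSet<>(Arrays.asList(rightTokens));

        int sameWordCount = 0;
        StringBuilder words = new StringBuilder();
        for (String element : leftSet) {
            if (rightSet.contains(element)) {
                words.append(element).append(","); // comma is the delimiter
                sameWordCount++;
            }
        }
        if (sameWordCount > 2) {
            String outputKey = words + "\t" + key; // common words, a tab, then the original key
            return Optional.of(new String[] { outputKey, value });
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Optional<String[]> out = mapPair("java hadoop mapreduce hdfs", "hadoop java mapreduce spark");
        out.ifPresent(pair -> System.out.println(pair[0] + " => " + pair[1]));
    }
}
```

Because a HashSet has no defined iteration order, the order of the common words in the output key can vary between runs; only the set of words is stable.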
17:23
Within the main function we have defined one long startTime, which is the current time
17:28
in milliseconds: currentTimeMillis, under the System class;
17:35
that is the method we have called, and the returned value has
17:39
been put in this long startTime, as it is of long type. Then JobConf
17:45
config is a new JobConf with "Cartesian Product"; we define one job with that name,
17:50
and here we require two arguments to be passed, so the usage is given there:
17:55
CartesianProductMRTask, that is the class name, then input folder and output
17:59
folder. The input folder and output folder are to be given; if the argument count is
18:04
not two, then an error message will be shown with this usage. Then we are defining
18:09
setJarByClass and setMapperClass; we are defining each one of them.
18:16
So in setJarByClass there is the jar class name, and in setMapperClass
18:22
the inner class we defined already, that is CartesianPatternMapper.class. And here it is a map-only job, so there is no reducer. Now we are going to set the input format, that is
18:37
CartesianInputFormat.class. Then setLeftInput takes config, TextInputFormat
18:44
.class and argument zero, that is the first argument, and setRightInput
18:50
will also take argument zero, that is the first argument, that is the input path. Then TextOutputFormat.setOutputPath with
18:59
config and new Path of args[1], which means the second argument; whatever
19:04
we will be passing there is the output file path.
19:08
Then config.setOutputKeyClass and config.setOutputValueClass; so both of
19:15
them will be the Text class. The job will be made running, so JobClient.runJob of config, and while the
19:23
job is not complete (the not, that is the complement, is there) we shall wait
19:28
for one second: Thread.sleep of 1,000 milliseconds. The finish time will be
19:33
calculated: we define one long finishTime with System.currentTimeMillis,
19:38
and the respective value will be there. So now we'll be just going for the
19:45
difference of finish time and start time to get the total time in milliseconds, and if the job is successful the return will be zero, otherwise the return will be two.
19:52
So finish time minus start time will be the total time taken in milliseconds.
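The argument check, the timing and the exit code in the main function can be sketched as follows. This is only an illustration of that bookkeeping: a BooleanSupplier stands in for configuring and running the Hadoop job, the class name DriverTimingSketch and the method run are hypothetical, and the return value 1 for a wrong argument count is an assumption, since the walkthrough only says that a usage message is shown.

```java
import java.util.function.BooleanSupplier;

// Plain-Java illustration of the driver's usage check, timing and exit code.
class DriverTimingSketch {
    static int run(String[] args, BooleanSupplier job) {
        if (args.length != 2) {
            System.err.println("Usage: CartesianProductMRTask <input folder> <output folder>");
            return 1; // assumed value; the walkthrough does not state it
        }
        long startTime = System.currentTimeMillis();
        boolean successful = job.getAsBoolean();   // stand-in for running the job to completion
        long finishTime = System.currentTimeMillis();
        System.out.println("Time in ms = " + (finishTime - startTime));
        return successful ? 0 : 2;                 // 0 on success, 2 otherwise, as in the walkthrough
    }

    public static void main(String[] args) {
        int code = run(new String[] { "/input/tags", "/output" }, () -> true);
        System.out.println("Exit code: " + code);
    }
}
```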
19:57
So that will be calculated here. So now we shall go for the jar file creation.
20:02
So you know: on the package name, right button click, then Export.
20:06
Then the jar file will be created by giving the jar file name and the path.
20:10
So that we did so many times in the earlier programs also; the same process will be executed.
20:15
But we have already created the jar file, so we can skip this part here.
20:19
Now we can directly go to the console to show you how to run this program and
20:27
how the outputs will be coming, so that will be a better idea for us. So here is the
20:33
command we have listed: there is hadoop jar, then the respective jar file
20:40
folder and the jar file name. Then we'll be having, let me come down, yes, we're
20:45
having the package name; the next one is the class name. Then
20:51
we are having the input path, that is /input/tags, the path where the XML file
20:56
is residing, and then the output path is there. If I execute my command, you can find that we shall be
21:05
executing this command now. So let's see: the time in milliseconds is 35374. We printed the time,
21:15
that is finish time minus start time, so 35374 is the time in milliseconds.
21:21
Now we shall show you the output, how the Cartesian product has been done.
21:30
So the command is hdfs, then you shall go for -cat, then a blank space
21:39
will be there, and then the output path. And then you shall go for part and a star,
21:50
that is part*, okay. So now if I press enter... You know that in case of
21:58
a Cartesian product between two data sources the number of columns will be added
22:03
and the number of rows of the two data sources will be multiplied in the resultant
22:07
table, so a huge number of records are there. You can find that it is a big
22:11
output the part file is containing; so big, so long, huge
22:17
outputs are there because the number of rows is getting multiplied. It is producing all possible combinations of the two record sources, so
22:27
that's why it is a big output we're getting. You see, for a
22:32
number of seconds it is just scrolling up and the outputs are getting printed.
22:36
So I hope that you got the idea of how to write the code, how to execute the
22:42
commands, how to get the outputs, how to see the outputs; everything we have
22:47
done step by step. So I think now you are confident to work on this Cartesian product example falling under the join pattern. You
22:57
can do it at your end also: pause the video, type the code and get it done.
23:06
It is a long output; we are finding all possible combinations of the two record sources.
23:14
The number of rows will be multiplied and the number of columns will be added up in the final outcome.
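The rule that rows are multiplied and columns are added can be checked on a small in-memory example. This is a generic sketch of a Cartesian product of two tables, not the job from the video (which also filters pairs in the mapper); the class name CrossProductSketch, the method cross and the sample rows are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration: rows multiply, columns add.
class CrossProductSketch {
    static List<String[]> cross(List<String[]> left, List<String[]> right) {
        List<String[]> result = new ArrayList<>();
        for (String[] l : left) {
            for (String[] r : right) {
                // Each output row is the left row followed by the right row.
                String[] row = new String[l.length + r.length];
                System.arraycopy(l, 0, row, 0, l.length);
                System.arraycopy(r, 0, row, l.length, r.length);
                result.add(row);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // 3 rows x 2 columns on the left, 2 rows x 3 columns on the right (made-up data).
        List<String[]> left = Arrays.asList(
                new String[] { "1", "a" }, new String[] { "2", "b" }, new String[] { "3", "c" });
        List<String[]> right = Arrays.asList(
                new String[] { "x", "10", "p" }, new String[] { "y", "20", "q" });
        List<String[]> product = cross(left, right);
        System.out.println("rows = " + product.size() + ", columns = " + product.get(0).length);
    }
}
```

Crossing a file with itself, as in this example job, means n input records can produce up to n times n pairs before any filtering, which explains the large output.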
23:20
That is the basic logic behind the Cartesian product. So the Cartesian product will
23:30
be used as and when required, not always, because it will occupy huge
23:34
memory, so that's why you should always try to avoid it. So this is a
23:38
huge output we have got. You can find that I'm just putting that output
23:44
on the screen. Let me see the file size, how big it is. So let me go
23:50
for the output folder. The input file was here:
23:56
106.26 KB, the tags.xml. Let me go for the output folder.
24:01
See, for the output folder the file size is 296.15 MB. So you see it has got so
24:11
much bigger in size compared to the input file size: the input file was in KB, it is now in
24:18
MB. So let me see the output here; let me delete this output folder. Thanks for watching.