MapReduce and Design Patterns - Inverted Index MR Example
Oct 18, 2024
0:00
In this video we are going to discuss the inverted index MapReduce example.
0:06
This video includes a practical demonstration to show you how to write
0:11
the code and how to execute it to get the output. Inverted index example.
0:17
We'll be considering this example: from posts.xml, find the Wikipedia links and build the inverted index.
0:25
So we have one XML file, posts.xml, to which we'll be applying the inverted index.
0:32
In this program we also find the type of each post using the post type ID.
0:37
The type of a post may be comments, questions, etc.
0:43
So there are different types of posts. Let us go to a practical demonstration for easier understanding of the implementation.
0:52
In this program, we are going to demonstrate and implement the summarization design pattern,
0:59
under which we are going to implement the inverted index MR task. We have one
1:04
XML file, posts.xml, which is under the folder
1:09
/input/post. So under this particular path we have posts.xml.
1:15
This posts.xml is a huge file;
1:21
it has a size of 108.93 MB. So we are going to show you that in /input/post
1:29
the file exists and it is a huge one, but I shall be showing only some of the rows
1:34
in this file for your better understanding. So let me go to
1:41
the file content. We're opening the text editor, and we can see that it is an XML file. It is a
1:46
long file, but I have shown some of the records here. Under the posts tag we have row
1:52
tags, and you can see this is a complete row with multiple attributes.
1:57
Under the row tag we have attributes like ID, post type ID,
2:02
accepted answer ID, creation date, then the score, then
2:09
view count, then body, then owner user ID, last editor user
2:20
ID, last edit date, last activity date, title, then tags, then answer count, comment count, favorite count, community owned date. So
2:38
these are the different attributes present. What we shall do is search
2:43
each and every row: we are supposed to find those posts which are
2:48
related to Wikipedia. That is what we are
2:52
supposed to find out. We have only one Java file here, InvertedIndexMRTask.java.
3:04
Within it we have two inner classes: one extends
3:08
the Mapper class and the other extends the Reducer class. So this
3:13
is the mapper class, the inverted index mapper, which has the member
3:18
variables ID type and keyword, both of them of Text type.
3:23
Here we're overriding the map method. We have the parsed XML, which is a
3:29
HashMap object, and here there is an XML to map method. Let me show you that
3:34
method first. This is a method which takes one XML row as input and
3:39
returns a HashMap as output.
3:47
Under this we have
3:53
the try block, and here we have the ID, the main text, then the body,
4:00
and the next
4:06
one is our doc type. So these are the different variables we have defined. Now let me
4:11
discuss the logic. If the body is not equal to null, that means if the
4:18
body has some content, then we check the post type.
4:23
You know that we
4:28
also get the post ID. Then, regarding the post type: if
4:32
the post type is not equal to null and equals 1, then the doc type is
4:38
question; if the post type is not equal to null and equals 2, then it is answer;
4:42
otherwise the doc type will be unknown. And if the
4:49
body is null but the main text is not equal to null, then the doc type is comment;
4:58
otherwise the method returns. So here we are finding the doc type using this particular logic.
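The doc type logic just described can be sketched in plain Java, without any Hadoop classes. This is only an illustration under assumptions: the transcript does not show the actual XML-to-map helper, so the regex-based `xmlToMap` here is a hypothetical stand-in, and the attribute names `Body`, `Text`, `PostTypeId` and `Id`, as well as the `+` separator between ID and doc type, are assumed.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DocTypeSketch {
    // Matches name="value" pairs inside a <row ... /> element.
    private static final Pattern ATTRIBUTE = Pattern.compile("(\\w+)=\"([^\"]*)\"");

    // Hypothetical stand-in for the XML-to-map helper: one row in, attribute map out.
    static Map<String, String> xmlToMap(String row) {
        Map<String, String> attributes = new HashMap<>();
        Matcher m = ATTRIBUTE.matcher(row);
        while (m.find()) {
            attributes.put(m.group(1), m.group(2));
        }
        return attributes;
    }

    // Returns the doc type, or null when the row has neither a body nor comment
    // text (the case where the mapper would simply return).
    static String docType(Map<String, String> row) {
        String body = row.get("Body");            // assumed attribute name
        String mainText = row.get("Text");        // assumed attribute name for comments
        String postType = row.get("PostTypeId");  // assumed attribute name
        if (body != null) {
            if ("1".equals(postType)) return "question";
            if ("2".equals(postType)) return "answer";
            return "unknown";
        }
        return mainText != null ? "comment" : null;
    }

    public static void main(String[] args) {
        Map<String, String> row = xmlToMap("<row Id=\"7\" PostTypeId=\"2\" Body=\"some text\" />");
        String type = docType(row);
        // Assumption: the ID and the doc type are joined with "+".
        System.out.println(row.get("Id").concat("+").concat(type)); // prints 7+answer
    }
}
```

Writing `"1".equals(postType)` covers the null check and the comparison in one step, which is why the sketch has no separate "not equal to null" test for the post type.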
5:05
After finding the doc type, we have id.concat, where the doc type is added to the
5:12
ID. Then we define a
5:18
text: if the doc type equals comment, text is equal to the main
5:23
text converted with toLowerCase; otherwise text is equal to body.toLowerCase. So we are
5:28
converting them to lowercase. We define one string array,
5:34
text words, and from the text, using the blank space as the delimiter, we
5:39
split it and put the pieces in this array, text words. Now we
5:44
check each and every word: if the word contains wikipedia.org,
5:49
then href=&quot; is replaced by a blank, and &quot; is replaced by a blank again, using word.replaceAll twice.
6:06
Then we set the ID type and call keyword.set(word), and we write this key-value pair to the context
6:12
with ctx.write(keyword, idType). So this keyword, ID type pair is written to the context.
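The word-level filtering in the mapper can be tried outside Hadoop. The following is a minimal sketch under assumptions, not the code from the video: the class and method names are hypothetical, and it assumes the escaped markup is replaced by an empty string rather than by a space.

```java
import java.util.ArrayList;
import java.util.List;

final class WikipediaLinkSketch {
    // Returns the cleaned-up words of the text that mention wikipedia.org.
    static List<String> extractLinks(String text) {
        List<String> links = new ArrayList<>();
        // Lowercase first, then split on the blank space, as described above.
        for (String word : text.toLowerCase().split(" ")) {
            if (word.contains("wikipedia.org")) {
                // Assumption: the escaped markup around the URL is removed entirely.
                String cleaned = word.replaceAll("href=&quot;", "").replaceAll("&quot;", "");
                links.add(cleaned);
            }
        }
        return links;
    }

    public static void main(String[] args) {
        // Hypothetical post body and ID-plus-doc-type value, for illustration only.
        String body = "See href=&quot;http://en.wikipedia.org/wiki/MapReduce&quot; for details";
        String idType = "101+question";
        for (String link : extractLinks(body)) {
            // In the real mapper this pair would be written to the context.
            System.out.println(link + "\t" + idType);
        }
    }
}
```

Each link becomes the key and the ID-plus-doc-type string becomes the value, which is what makes the index inverted: the reducer later sees every post that mentions a given link.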
6:20
We have kept this in a try-catch block accordingly. Now we shall discuss the
6:27
reducer. The class name is the inverted index reducer, and we
6:32
override the reduce method. We have passed the parameters, as you can see.
6:36
We define one StringBuilder object, and using this
6:41
iteration we keep appending to this StringBuilder
6:47
object, strBuild: each row ID's toString plus a blank.
6:52
Then there is the output value, which we defined as a Text type; we
6:58
set it and then write this key-value pair to the
7:02
context. For the value we take the substring starting from
7:07
index 0 to strBuild.length minus 1, convert it to a string, and
7:12
write it to the context as the key-value pair. This is our main function. This
7:17
main function takes the command-line arguments and checks whether the
7:20
length is two or not. If the length is not two, then an error message is
7:24
printed and the system exits. Then we define
7:29
one Job object, giving the job name and so on, and the respective class
7:34
is set with setJarByClass. For the input and output files we use
7:39
FileInputFormat and FileOutputFormat: argument zero contains the
7:43
input file path and argument one contains the output file path. Then we have
7:47
setMapperClass and setReducerClass, where we give the respective inner classes,
7:55
so these two classes are assigned to the
7:58
job. Then come job.setOutputKeyClass and job.setOutputValueClass, and then we
8:04
have the success flag: if the job succeeds, then zero is returned,
8:08
otherwise one is returned. It is the same thing we did in the earlier
8:11
programs; you can pause the video and look at it. Now what we shall do, we
8:17
shall create the JAR file.
8:20
To create the JAR file, we go to Export,
8:27
choose JAR file,
8:35
then give the respective path and the file name properly,
8:39
then Next and Finish. We have already created the JAR file, so we're not going to do that now,
8:45
but these are the steps to be followed to create the JAR file.
8:50
Then we come to the command prompt, and this is the command to execute the code: hadoop jar, then the path and the JAR file name, then inverted index
9:02
is the package name, and the inverted index MR task class is the class name.
9:09
input/post is the input file path and output is the output file path.
9:15
We know that the output file will have the name part-r-00000. So if I execute this command, I think it
9:25
will work successfully. But no, it didn't work. I feel that the NameNode is in
9:32
safe mode, so let me take it out of safe mode now. Let me issue the command,
9:41
and that is hadoop dfsadmin -safemode leave. After pressing Enter, you can see that it has
10:02
come out of safe mode. So let me execute the command once again. I
10:10
hope now the output file will be created under the output folder with the
10:15
name starting with part. Yes, it has been created. So let me show you the file system.
10:25
input/post was my present directory, so I've gone to the root. You can see that
10:30
under the root output folder we have part-r-00000.
10:36
This is the file which contains the output. So let me execute the cat
10:42
command, that is hdfs dfs -cat, then
10:49
the output path ending in part*, so all the files starting with part
10:56
will be printed. You can see that here we have the output: we're
11:02
getting all those links where Wikipedia is present. So, for example, 124314 plus answer; you can
11:08
see that in the first part, 98806 is the ID and question is the doc type,
11:15
which we decided using those nested if statements. So we can see the respective post link, each one of them containing Wikipedia, and then the ID plus the doc type.
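The output lines described here, a link followed by its ID-plus-doc-type entries, come from the StringBuilder step of the reducer explained earlier. A minimal plain-Java sketch of that step follows; it is not the code from the video, and the class name, the sample pairs and the `TreeMap` that stands in for Hadoop's shuffle are all assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class InvertedIndexReduceSketch {
    // Joins all values for one key with blanks, then drops the trailing blank.
    static String joinIds(Iterable<String> idTypes) {
        StringBuilder strBuild = new StringBuilder();
        for (String idType : idTypes) {
            strBuild.append(idType).append(" ");
        }
        if (strBuild.length() == 0) {
            return ""; // guard added for the sketch; a reducer always gets at least one value
        }
        return strBuild.substring(0, strBuild.length() - 1);
    }

    public static void main(String[] args) {
        // Hypothetical mapper output: (link, ID+docType) pairs.
        String[][] mapperOutput = {
            {"http://en.wikipedia.org/wiki/hadoop", "101+question"},
            {"http://en.wikipedia.org/wiki/mapreduce", "202+answer"},
            {"http://en.wikipedia.org/wiki/hadoop", "303+comment"},
        };
        // The TreeMap stands in for the shuffle phase, which groups values by key.
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] pair : mapperOutput) {
            grouped.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
        }
        for (Map.Entry<String, List<String>> entry : grouped.entrySet()) {
            System.out.println(entry.getKey() + "\t" + joinIds(entry.getValue()));
        }
    }
}
```

The substring call mirrors the transcript's "index 0 to length minus 1" step, which removes the blank appended after the last ID.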
11:30
So this is the output we have got. Now let me delete the output folder here.
11:39
This is the output folder, as you can easily see. With hdfs dfs -rm -r and then the output folder, we delete it.
11:49
So the folder has been deleted. Thanks for watching.