Spark Wordcount Example
6K views
Nov 28, 2024
0:00
In this video we are discussing a Spark word count example. In the word count problem we have one text file, the text file contains many different lines of text, and we are supposed to count how many times each word occurs in the respective text file. So let us go for some further discussion on it, and I shall also give you a practical demonstration of how to implement the word count problem in Spark.
0:28
Using the Java MapReduce program, we have already seen how to count the frequency of words in one or more text files. For this example we are going to count words in the same file which we selected in our MapReduce program and which was used in the earlier MapReduce example. The file is stored on the HDFS, so at first we should start up Hadoop before accessing the HDFS files. So let us go for one practical demonstration to show you how this word count problem can be written and executed. My Hadoop system is on, so it is running.
1:08
Now here you can find the Hadoop root, that is, the HDFS root. Under this we are having one folder, HadoopMyFiles, and under this folder we are having one file, sample_file.txt. Let me show you the content of the file: we shall press Ctrl+Alt+T to open one terminal, and then we shall go for hdfs dfs -cat /HadoopMyFiles/sample_file.txt, with the folder HadoopMyFiles and the file name sample_file.txt. So I am going to see the content of the file, and this is the content.
1:58
Now we shall open our Spark shell, and in the Spark shell we shall execute, not a program exactly, but a set of statements: we will be writing some lines here one by one to perform the word count problem on this sample_file.txt. That is the purpose, and that is the demonstration we are going to give you right now. So let me go for the initiation of the Spark shell; to initialize it we shall go for spark-shell, and once it initializes, the Scala prompt will be coming. At first we shall create one variable which will read all of this sample_file.txt content, so let the Scala prompt come.
2:49
The Scala prompt has come, so I shall go for val sampleFile = sc.textFile(...), where sc stands for SparkContext. The path should be enclosed within double quotes: hdfs://localhost:9000/HadoopMyFiles/sample_file.txt, where 9000 is the port number, HadoopMyFiles is the folder, and sample_file.txt is the file name. So this is the total path with the filename, and it opens the txt file stored in the HDFS.
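For reference, the statement typed at the Scala prompt looks roughly like this (a sketch assuming HDFS on localhost at port 9000 and the folder and file names from this demo):

    // sc is the SparkContext that spark-shell creates automatically.
    // textFile returns an RDD[String] with one element per line of the file.
    val sampleFile = sc.textFile("hdfs://localhost:9000/HadoopMyFiles/sample_file.txt")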
3:38
Now, to see the content of the text file as an array, we shall go for sampleFile.collect. You can find that the content is getting shown; you see, the content is coming in the form of an array.
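As a note, collect brings the whole RDD back to the driver as a local Array[String], so it is only suitable for small demo files like this one:

    // Materialize the RDD on the driver; each array element is one line.
    sampleFile.collect()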
4:04
Now we shall split this particular content, so as to split out all the words, which are separated by blank spaces. We shall go for val wCount = sampleFile.flatMap(line => line.split(" ")), that is flatMap with a capital M, not a capital F, and the delimiter, a space, is enclosed within double quotes; I could give another name than wCount also, no issues. To see the contents inside wCount, with all the words separated in the array, we shall go for wCount.collect. You see, in the array all the words have got separated; you can find the output here, I have just marked it.
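A sketch of this step, using the same names as dictated above:

    // flatMap flattens the per-line arrays of words into one RDD[String] of words.
    val wCount = sampleFile.flatMap(line => line.split(" "))
    // Inspect the split words as a local array.
    wCount.collect()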
5:13
Now we shall put a 1 after each word in the wCount RDD. How to do this one? We shall go for val mapOutput = wCount.map(w => (w, 1)), I am just writing this one as mapOutput, so each and every word will have a 1 after it. I am pressing Enter. What is the value we will be getting? A key-value pair type of thing, so let me show you that one also: mapOutput.collect will show us the key-value pairs. You see, it is the key-value pair; here the key is the word and the value is 1. So for each and every word, we have treated that word as a key, and the value is 1.
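The mapping step, sketched with the demo's names:

    // Pair every word with the count 1, giving an RDD[(String, Int)].
    val mapOutput = wCount.map(w => (w, 1))
    // Show the (word, 1) key-value pairs.
    mapOutput.collect()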
6:12
Now we shall call the reduceByKey method, so let me go for that one: val reduceOutput = mapOutput.reduceByKey(_ + _). Now, what is the final output? We have called the reducer also: initially we had the map output, and now we are having this reduced output. To get this one, I shall go for reduceOutput.collect, and you can find that it is coming like this. Each and every key is there; when the key is unique, not having any further occurrences, it has frequency one, the count is one, but when a particular key has got repeated multiple times, the respective counts are coming. In this way we are getting this.
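The reduce step, sketched the same way:

    // Sum the 1s of all pairs sharing the same key (word), yielding
    // one (word, totalCount) pair per distinct word.
    val reduceOutput = mapOutput.reduceByKey(_ + _)
    // Show the final word counts.
    reduceOutput.collect()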
7:26
Now let me save this output onto some file in the HDFS. How to do that one? reduceOutput.saveAsTextFile, that is the method, and we pass one parameter, the path: hdfs://localhost:9000, which we wrote earlier also. Now let me decide some path, some directories, which will be created; let it be /sparkOutput/wcSpark. Before going for that, let me show you that there is no folder called sparkOutput; under that folder, obviously, the wcSpark folder will be created, but there is no folder called sparkOutput.
8:33
Now, if I execute this one, reduceOutput.saveAsTextFile, we are giving this total path; here the file will get created automatically, and the path should be enclosed within double quotes. So I am just pressing Enter. Now let me show you that the corresponding sparkOutput folder has got created; under this we are having wcSpark, and under this we are having _SUCCESS and part-00000. This is the file which actually contains the output.
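Sketched, assuming the same host, port, and output path as in the demo:

    // Write the result back to HDFS; Spark creates the directories, a
    // _SUCCESS marker, and one part-NNNNN file per partition.
    reduceOutput.saveAsTextFile("hdfs://localhost:9000/sparkOutput/wcSpark")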
9:08
So let me show you the output also. How to show it? I shall go to the terminal. One second... okay, this is the terminal I am having. Let me come out from this, so I shall go for exit; coming out, I have got the dollar prompt back again; clear. Now I shall go for hdfs dfs -cat to see the content here: -cat /sparkOutput, the first folder, under which we are having the next folder, wcSpark, and then we are having this part-00000, with one, two, three, four, five zeros. So this is the content we are going to get, and you can find that the content has been written onto this part-00000. Instead of writing this full file name, we can also put the respective wildcard characters; that will also work for us and will produce the same output.
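Put together, the whole spark-shell session from this demonstration condenses to a few lines (again assuming the localhost:9000 HDFS and the paths used above):

    val sampleFile   = sc.textFile("hdfs://localhost:9000/HadoopMyFiles/sample_file.txt")
    val wCount       = sampleFile.flatMap(line => line.split(" "))
    val mapOutput    = wCount.map(w => (w, 1))
    val reduceOutput = mapOutput.reduceByKey(_ + _)
    reduceOutput.saveAsTextFile("hdfs://localhost:9000/sparkOutput/wcSpark")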
10:30
In this demonstration we have given you the idea of the different steps that should be followed to execute the word count problem on a text file in our Spark shell. Thanks for watching this video.
#Programming