Inner Join SQL Query in Fabric Notebook using Spark
1K views
Jan 29, 2024
This video shows how to write an inner join SQL query in a Microsoft Fabric notebook using Spark.
Video Transcript
0:00
In this video, I'm going to show you how to write an inner join SQL query using PySpark
0:12
in a Fabric notebook. So let's get started. Now, PySpark is a Python library
0:19
that provides an interface for programming Apache Spark jobs. So let's see how we can
0:24
use it in a Fabric notebook. Before we get started, let me explain what we
0:29
have on the screen. In this case, I've got this data in the lakehouse: you
0:34
can see the dimension tables, and I also have the fact table. There are quite a few dimension
0:39
tables. In this first cell, we define a df variable and call
0:45
the spark.sql method, passing in SELECT * FROM
0:53
wh_data_lakehouse.dbo_ftransaction. This is going to be the fact table, and
0:58
of course, we have the dimension tables too. We limit the rows to just 10
1:03
rows and display the content of the data frame.
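As a rough sketch, that first cell looks like the following. The lakehouse and table names (wh_data_lakehouse, dbo_ftransaction) are my rendering of what is spoken in the video, so treat them as placeholders for your own names:

```python
# spark and display are provided by the Fabric notebook runtime.
# The table name below follows the video's narration and is an
# assumption -- substitute the actual fact table in your lakehouse.
df = spark.sql("SELECT * FROM wh_data_lakehouse.dbo_ftransaction LIMIT 10")
display(df)  # render the first 10 rows of the fact table
```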
1:10
Now I'm going to scroll down, and let's see what we intend to do. We want to write a SQL query using inner joins
1:15
to calculate the total sales by product and store. So I'm going to click on this add
1:23
cell button, and then we can begin to write. The first thing I want to do is define a variable to
1:28
store the SQL code in. I'm going to call it total sales by product and
1:38
store, then type the equals sign. Now, to write SQL in Spark, I'm going to
1:45
open a string with three double quotes, and don't forget the closing three double quotes too.
1:50
Then I'm going to move to line two, where I can begin to write the SQL query. I'll
1:56
select the product column and then the store column. If I scroll up, in the fact table
2:04
we have the sales amount column; that's the one we'll aggregate. So let's scroll down. Now I'm going
2:12
to use the SUM function, open the brackets, and pass in the sales amount column,
2:19
which we want to alias as total sales. Then press Enter. Now, we'll select from the
2:28
dbo_dimproduct dimension table, and
2:37
I'm going to alias that table as P. Then I'll come back to the product
2:42
column and type P dot in front of it, just to qualify the column properly. Press Enter, and
2:49
now I can use the inner join. For the inner join, I want to reference the fact table,
2:55
dbo_ftransaction, which I'm going to alias as F.
3:04
Then I'll come back to the sales amount and qualify it as F dot sales amount. Next, I want
3:11
to use the ON clause. First and foremost, we want to match on the product key, which is the
3:17
primary key in the dim product table. So I'm going to say P dot product key, the primary
3:25
key in the dim product table, equals F dot product key in the fact table.
3:33
Then press Enter. For the second join, we do the same thing again: another inner join,
3:39
this time referencing the dbo_dimstore dimension table. That's exactly what we have
3:46
in the lakehouse, dbo_dimstore, right next to dbo_dimproduct. I'm going to
3:52
alias this one as S, then come back to the store column and qualify it as S dot store. Let's continue.
4:00
I'm going to use the ON clause again. In this case, I want to match the store key in the
4:07
fact table, which is the foreign key, against the primary key in the
4:13
dim store table. So I can say F dot store key, the foreign key, equals S dot store
4:25
key, the primary key. Then press Enter, and we can group by the two
4:32
columns we selected: P dot product, comma, S dot store. And we can even order the results.
4:41
Press Enter; I'm going to order by the sum of the F
4:51
dot sales amount column, descending.
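Assembled, the query cell should look roughly like this. All table and column names (dbo_dimproduct, dbo_ftransaction, dbo_dimstore, Product, Store, SalesAmount, ProductKey, StoreKey) are my best rendering of what is spoken in the video, not confirmed identifiers:

```python
# SQL string built up in the video; every identifier here is an
# assumption based on the narration -- adapt it to your own schema.
total_sales_by_product_and_store = """
SELECT
    P.Product,
    S.Store,
    SUM(F.SalesAmount) AS TotalSales
FROM dbo_dimproduct AS P
INNER JOIN dbo_ftransaction AS F
    ON P.ProductKey = F.ProductKey   -- dim product PK = fact table FK
INNER JOIN dbo_dimstore AS S
    ON F.StoreKey = S.StoreKey       -- fact table FK = dim store PK
GROUP BY P.Product, S.Store
ORDER BY SUM(F.SalesAmount) DESC
"""
```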
5:00
All right, let's go through the code to be sure everything is fine. First, we selected two columns, the product and
5:06
the store. Then we aggregate the sales amount using the SUM function
5:11
and alias it as total sales. We select from the dbo_dimproduct
5:19
dimension table, which we alias as P. Then we use the
5:25
inner join and specify the dbo_ftransaction fact table, which
5:30
we alias as F. In the ON clause, we match the product key in the dim product table, which
5:39
is the primary key, against the product key in the fact table, which is
5:46
the foreign key. Then for the other part, the store, we also
5:52
use an inner join with dbo_dimstore, aliased as S, and we use the ON keyword
5:59
again: F dot store key, the foreign key, equals S dot store key, the primary key.
6:08
Then we group by the two columns we specified, P dot product and S dot store,
6:15
and we order by the sum of sales amount. And that's the code. Now
6:22
the next thing to do is to call Spark SQL again. So I'm going to type
6:27
df equals spark dot sql, then open the brackets. Inside the brackets,
6:35
I'm going to pass in the name of the variable, total sales by product and store,
6:40
which I'll just paste in
6:47
here. And finally, we can display the content of
6:57
the df data frame. Let's click on run and scroll down.
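As a sketch, the final cell just passes that string to spark.sql and displays the result (total_sales_by_product_and_store is the variable defined above):

```python
# Execute the SQL string from the previous cell and show the result.
df = spark.sql(total_sales_by_product_and_store)
display(df)  # Fabric's output includes a table view and chart options
```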
7:08
There we go, we can see the results: the total sales by product and store. And we can
7:16
easily tell that the largest total sales figure was achieved where the product is
7:22
Furnitures and the store is EB, which returned 1,956,607. And the amazing part is that
7:32
I can click on this charts object and see the charts. There we go: we can see
7:38
a clustered bar chart, which is quite amazing. I love this functionality. And we can see
7:44
the list of products, and we can see their volumes. We can even click on the
7:49
customize chart option, and I can switch from the bar chart to a clustered column chart. I'm going
7:56
to scroll down and click on apply. And there we go, we have the clustered column chart. Quite
8:02
amazing. I can even switch to a line chart: just scroll down, click on apply, and there we go. This
8:11
is super cool. So this is basically how we can use Spark SQL to write SQL queries in a
8:18
Fabric notebook. I trust you enjoyed this video. If you did, please like, share, and comment. Thank you and bye for now. Cheers!