Spark DataFrames & Datasets
Nov 28, 2024
0:00
In this video, we are discussing Spark DataFrames and Datasets
0:05
A DataFrame is quite similar to an RDBMS table, and it can deal with structured data
0:12
From Spark version 1.3.0 onwards we have had this DataFrame concept, and the DataFrame
0:12
is immutable. That means its content cannot be updated, and it is resilient as well
0:26
So let us go for some further discussion on DataFrames, and then
0:29
we will move on to Datasets. What is a Spark DataFrame? DataFrames are similar to RDBMS tables, and they are used to process large amounts
0:43
of structured data. The DataFrame was introduced in Spark release 1.3.0
0:51
Like RDDs, DataFrames are also immutable, which means no update operations can be
0:57
carried out, and they are in-memory; that means the data will be residing in primary memory (RAM), and they are resilient as well,
1:06
etc. We can also create DataFrames from other sources, such as Hive tables, external databases, or other existing RDDs. So the sources can be Hive, external data sources, or other RDDs.
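A minimal sketch of these creation paths in Scala (assuming a local SparkSession; the people.json path and table names are hypothetical):

```scala
import org.apache.spark.sql.SparkSession

// Local SparkSession for illustration
val spark = SparkSession.builder()
  .appName("DataFrameCreation")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// From an existing RDD of tuples, via toDF with column names
val rdd = spark.sparkContext.parallelize(Seq(("Alice", 30), ("Bob", 25)))
val dfFromRdd = rdd.toDF("name", "age")

// From an external source (hypothetical JSON file)
val dfFromJson = spark.read.json("people.json")

// From a Hive table (assumes Hive support is enabled on the session)
// val dfFromHive = spark.sql("SELECT * FROM some_hive_table")

dfFromRdd.show()
```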
1:28
So why should we use DataFrames? DataFrames have more advantages over RDDs,
1:35
and they provide memory management and optimized execution plans. That is very important:
1:41
these DataFrames have custom memory management and also an optimized execution plan.
1:48
That means whenever a query is going to get executed on a DataFrame, Spark will do
1:53
some optimization first so that the processing can be done in a faster way.
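A minimal sketch of seeing that optimization at work (reusing dfFromRdd from the earlier sketch; explain prints the plans Spark produces before execution):

```scala
// Build a query; nothing executes yet, Spark only records the plan
val adults = dfFromRdd.filter($"age" >= 18).groupBy($"name").count()

// Print the parsed, analyzed, and optimized logical plans plus the physical plan
adults.explain(extended = true)
```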
1:58
In the custom memory management scheme, a lot of space can be saved, because the data
2:05
is stored in off-heap memory. That means Spark is not always dealing with
2:11
on-heap, in-memory data; here the data will be stored in off-heap memory, and there is
2:18
no garbage collection overhead on it in the case of DataFrames.
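A minimal sketch of the related session configuration (spark.memory.offHeap.enabled and spark.memory.offHeap.size are real Spark settings; the size value here is only illustrative):

```scala
// Enable off-heap allocation so Spark-managed data can live outside the JVM heap,
// away from garbage collection
val sparkOffHeap = SparkSession.builder()
  .appName("OffHeapDemo")
  .master("local[*]")
  .config("spark.memory.offHeap.enabled", "true")
  .config("spark.memory.offHeap.size", "512m")
  .getOrCreate()
```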
2:25
In DataFrames, the query optimizer produces an optimized execution plan, and once the optimization is done, the final execution takes
2:31
place on RDDs. As the query optimizer works on the queries, the execution plan will take less time for query execution and processing. That is why we enjoy multiple advantages with DataFrames
2:48
compared to RDDs. Now, we shall discuss Spark Datasets. A Dataset is a data structure in Spark SQL, and it is an extension of the DataFrame.
3:02
So, it is another data structure in Spark, and it is nothing but an extension of DataFrames
3:08
that offers some more advantages. A Dataset provides an object-oriented programming interface, and it has an encoding feature.
3:17
The encoding feature is the primary mechanism for serialization and deserialization, and encoding techniques can be applied to, or adopted on, these Datasets.
3:28
That is why a Dataset enhances the possibility of serialization and also deserialization. Encoders are used to convert between JVM objects and Spark's internal binary
3:42
format. These are the different advantages of Datasets.
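A minimal sketch of a typed Dataset and its encoder (the Person case class is illustrative; its implicit Encoder is derived via spark.implicits._):

```scala
// A case class describes the schema; Spark derives an Encoder for it, which maps
// Person objects to and from the internal binary format
// (in compiled code, define the case class at the top level)
case class Person(name: String, age: Int)

import spark.implicits._

// Create a typed Dataset from an in-memory collection
val people = Seq(Person("Alice", 30), Person("Bob", 25)).toDS()

// Lambdas operate on Person objects, not on untyped rows
val grownUps = people.filter(p => p.age >= 18)
grownUps.show()
```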
3:50
So, why should we use Datasets? This question might come to our mind, so let us discuss it. Spark SQL queries on Datasets are highly optimized; query execution plans are prepared, and only after the query has been optimized will it get
4:07
executed, for faster and quicker retrieval of data. A tool called the Catalyst
4:14
query optimizer is used here. Catalyst performs the query optimization,
4:19
and this framework returns a data flow graph from the query. The data flow graph indicates how the
4:28
data will flow from the subqueries to the nested subqueries. Also, using Datasets, we can check our operations on the data at compile time, and this feature is not available with DataFrames.
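An illustrative contrast, reusing people and dfFromRdd from the earlier sketches (the misspelled field is deliberate):

```scala
// DataFrame API: a bad column name is only caught at runtime (AnalysisException)
// dfFromRdd.filter($"agee" >= 18)

// Dataset API: the same mistake fails at compile time, before the job ever runs
// people.filter(p => p.agee >= 18)   // does not compile: Person has no field 'agee'

// The correct, type-checked version
val checked = people.filter(p => p.age >= 18)
```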
4:39
So this checking can be performed on our operations before execution, which is absent
4:49
in DataFrames. Datasets are also serializable and queryable, so we can save them to persistent
4:57
storage, etc. We can keep them in the respective persistent storage and retrieve them whenever we require,
5:04
because they are serializable and queryable.
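A minimal sketch of that round trip (Parquet is just one choice of persistent format; the /tmp path is illustrative):

```scala
// Save the typed Dataset to persistent storage as Parquet
grownUps.write.mode("overwrite").parquet("/tmp/grown_ups.parquet")

// Later, read it back and recover the typed view via the Person encoder
val restored = spark.read.parquet("/tmp/grown_ups.parquet").as[Person]
restored.show()
```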
5:11
So in this way, we have discussed the basic features of DataFrames and Datasets and the respective comparisons. Thanks for watching this video
#Data Management
#Programming