Features of RDDs
0:00
In this video we are discussing the features of RDDs
0:04
We shall be discussing multiple different features of RDDs for your better understanding
0:09
So let us start with the discussion here: RDD features. Here we are discussing 10 different RDD features, and they are location stickiness, fault
0:19
tolerance, parallel processing, in-memory computation, no limitation, coarse-grained operations, partitioning, immutability, lazy evaluation and persistence
0:30
So we shall go one by one. First, we are starting with location stickiness
0:35
To compute partitions, RDDs are capable of defining a placement preference. Moreover, a placement preference refers to information about the location of an RDD
0:47
So the DAG scheduler places the partitions in such a way that each task is as close to the data as possible
0:54
So if the data on which the task will be operating is placed as close
0:59
as possible to the task, then obviously it will speed up the computation
1:06
Next we are going to discuss fault tolerance. If any worker node fails, then by using the lineage of operations we can recompute the lost partition of the RDD from the original one. So it is possible to recover
1:25
lost data very easily. Now let us discuss parallel processing. What is it? So while we talk
1:32
about parallel processing, an RDD processes the data in parallel over the cluster. So as a result of that, the computation time will be quite low
1:43
The next point we are discussing is in-memory computation. Basically, while storing data in an RDD, the data is stored in memory for as long as we want to
1:54
store it. So it improves the performance by an order of magnitude by keeping the data in the
2:01
memory. So it reduces disk read-write operations manifold, and up to 100 times faster operation can be observed with in-memory computation. Next we are going for no
2:15
limitation. So there is no limitation on the number of Spark RDDs we can use; we can use any number of RDDs, and basically the limit depends on the size of the disk and the memory. That is why there is no limitation on how many Spark RDDs we will be working with
2:36
So depending upon the workload, and upon the size of the disk and the memory, the number of RDDs can be selected
2:43
Next we are discussing coarse-grained operations. So generally we apply coarse-grained transformations to a
2:50
Spark RDD. It means the operation applies to the whole dataset, not to a single
2:58
element in the dataset of the RDD in Spark. Next we are discussing partitioning
3:05
Basically, an RDD partitions the records logically and also distributes the data across various nodes in the cluster. Moreover, the logical divisions are only
3:17
for processing, and internally there is no physical division. So hence it provides parallelism.
3:24
So immutability is the next feature of RDDs we are going to discuss
3:28
And immutability means once we create an RDD, we cannot manipulate it. However, we can create a new RDD by performing a transformation. Also, we achieve consistency through immutability
3:44
Next, a very interesting feature is lazy evaluation. Spark lazy evaluation means the data
3:52
inside RDDs is not evaluated on the go. Basically, only after an action triggers will all the changes
3:59
or the computation be performed. And therefore, it limits
4:03
how much work Spark has to do at that particular instant of time
4:09
As the last feature, we want to discuss persistence. We can store the frequently used RDDs in memory, and we can also retrieve them directly
4:20
from memory without going to disk. It results in faster execution
4:25
Moreover, we can perform multiple operations on the same data
4:30
and it is also possible by storing the data explicitly in memory by calling the persist or cache functions
4:41
So in this way, in this particular video, we have discussed RDD features, and we have gone through the different kinds of features in detail
4:49
Thanks for watching this video
#Data Management
#Programming

