Limitations of Apache Spark
0:00
In this video, we are discussing the limitations of Apache Spark.
0:04
We know that Apache Spark has certain advantages, but it also has some limitations.
0:10
So let us discuss them one by one. Here we have mentioned five different limitations.
0:16
The first one is the problem with small files. We know that each and every file will be represented as a small partition
0:25
in Apache Spark, and a large file will
0:29
be divided into multiple smaller partitions. So the problem with small files
0:35
arises there, and these particular partitions will be known as the small partitions.
0:40
Next, we have no file management system. In Apache Spark, there is no inbuilt
0:46
file management system; it depends on Hadoop. Then we have latency: Apache Spark's
0:52
latency is higher compared to Apache Flink, so latency is one of the problems.
0:57
We have manual optimization: there is no automatic optimization here.
1:03
The optimizations have to be done manually, and that is one of the limitations of Apache Spark.
1:09
Finally, we have the expense. We know that Apache Spark supports in-memory computation.
1:14
That means the data will be available in memory: during computation, the intermediate results that are produced will be kept
1:21
in memory at the same time. We know that memory,
1:26
that is, primary memory or RAM, is a very costly thing. So putting huge data into memory means we require a huge amount of memory, and that is a very expensive affair.
1:38
So let us discuss all of them in some more detail. The first limitation is the problem with small files. In the Spark RDD, each file is a small partition, and for a large
1:55
file there will be a large number of small partitions. To perform tasks in an
2:00
efficient way, we need to repartition them into a manageable format, and this particular repartitioning
2:08
will be time consuming.
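As a minimal sketch of the fix just described: the HDFS path and target partition count below are illustrative, not from the video.

```python
# a minimal sketch: many small input files yield many tiny partitions,
# which we then consolidate with repartition()
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("small-files-sketch").getOrCreate()

# textFile() creates at least one partition per input file, so a directory
# of thousands of small files means thousands of tiny partitions
logs = spark.sparkContext.textFile("hdfs:///data/logs/")
print(logs.getNumPartitions())

# consolidate into a manageable number; repartition() triggers a full
# shuffle, which is the time-consuming step mentioned above
logs = logs.repartition(16)
```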
2:15
Next one is no file management system. Spark has no inbuilt file management system; it depends on some other platform, such as Hadoop. So Spark itself has
2:23
no file management system and depends on other file management systems; we can consider
2:28
Hadoop as one of the examples.
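To make that dependency concrete, here is a hedged sketch: every read and write goes through an external file system such as HDFS. The namenode address and paths are made up for illustration.

```python
# a minimal sketch: Spark itself stores nothing; it reads from and writes to
# an external file system (HDFS here; the URI and paths are illustrative)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-sketch").getOrCreate()

df = spark.read.parquet("hdfs://namenode:8020/warehouse/sales/")
df.write.mode("overwrite").parquet("hdfs://namenode:8020/warehouse/sales_clean/")
```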
2:37
Next one is the expense. Spark becomes very expensive when we want cost-efficient processing of big data: keeping data in memory is costly, and we need lots of RAM to do such
2:48
work smoothly, because in-memory computation means that huge data will be loaded into memory, which
2:54
in turn requires a huge amount of primary memory, or RAM.
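One common way to soften the RAM cost, sketched here with an illustrative table path, is to persist with a storage level that spills to disk instead of holding everything in memory:

```python
# a minimal sketch: MEMORY_AND_DISK spills blocks that don't fit in RAM to
# disk, trading some speed for a smaller memory bill
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("memory-sketch").getOrCreate()

big = spark.read.parquet("hdfs:///data/big_table/")
big.persist(StorageLevel.MEMORY_AND_DISK)
big.count()  # an action materializes the persisted data
```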
3:01
Next one is manual optimization, which is a real headache:
3:06
a Spark job needs to be manually optimized for each specific data set,
3:12
and partitioning and caching in Spark also have to be set correctly by hand. So this manual optimization can be very troublesome for users.
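Here is a hedged sketch of the kind of hand-tuning meant above; the partition count, key, and table names are illustrative choices a developer would have to make per data set.

```python
# a minimal sketch: the developer, not Spark, picks the partition count,
# the partitioning key, and what to cache; all names here are illustrative
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("manual-tuning-sketch").getOrCreate()

orders = spark.read.parquet("hdfs:///data/orders/")

# hand-picked: 200 partitions keyed by customer_id
orders = orders.repartition(200, "customer_id")
orders.cache()  # cached manually because it is reused twice below

daily = orders.groupBy("order_date").count()
by_customer = orders.groupBy("customer_id").count()
```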
3:17
Finally, we have latency: Spark has higher latency compared to Apache Flink, since Spark processes streams in micro-batches while Flink processes events one at a time.
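As a hedged illustration of why micro-batching sets a latency floor (the source and trigger interval below are illustrative):

```python
# a minimal sketch: Structured Streaming runs in micro-batches, so end-to-end
# latency cannot drop below the trigger interval
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("latency-sketch").getOrCreate()

# 'rate' is a built-in test source that generates rows at a fixed rate
stream = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

query = (stream.writeStream
         .format("console")
         .trigger(processingTime="2 seconds")  # results at best every 2 seconds
         .start())
query.awaitTermination()
```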
3:23
So these are the different limitations of Apache Spark. We have discussed each of them with some diagrams and detail.
3:30
Thanks for watching this video
#Programming
#Software