Interview Questions & Answers on Apache Spark [Part 1]
Q1: When do you use apache spark? OR What are the benefits of Spark over Mapreduce?
Spark is really fast. As per their claims, it runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. It aptly utilizes RAM to produce the faster results.
In map reduce paradigm, you write many Map-reduce tasks and then tie these tasks together using Oozie/shell script. This mechanism is very time consuming and the map-reduce task have heavy latency.
And quite often, translating the output out of one MR job into the input of another MR job might require writing another code because Oozie may not suffice.
In Spark, you can basically do everything using single application / console (pyspark or scala console) and get the results immediately. Switching between 'Running something on cluster' and 'doing something locally' is fairly easy and straightforward. This also leads to less context switch of the developer and more productivity.
Spark kind of equals to MapReduce and Oozie put together.
Q2: Is there are point of learning Mapreduce, then?
Ans: Yes. For the following reason:
Mapreduce is a paradigm used by many big data tools including Spark. So, understanding the MapReduce paradigm and how to convert a problem into series of MR tasks is very important.
When the data grows beyond what can fit into the memory on your cluster, the Hadoop Map-Reduce paradigm is still very relevant.
Almost, every other tool such as Hive or Pig converts its query into MapReduce phases. If you understand the Mapreduce then you will be able to optimize your queries better.
Q3: When running Spark on Yarn, do I need to install Spark on all nodes of Yarn Cluster?
Since spark runs on top of Yarn, it utilizes yarn for the execution of its commands over the cluster's nodes.
So, you just have to install Spark on one node.
Q4: What are the downsides of Spark?
Spark utilizes the memory. The developer has to be careful. A casual developer might make following mistakes:
She may end up running everything on the local node instead of distributing work over to the cluster.
She might hit some webservice too many times by the way of using multiple clusters.
The first problem is well tackled by Hadoop Map reduce paradigm as it ensures that the data your code is churning is fairly small a point of time thus you can make a mistake of trying to handle whole data on a single node.
The second mistake is possible in Map-Reduce too. While writing Map-Reduce, user may hit a service from inside of map() or reduce() too many times. This overloading of service is also possible while using Spark.
Q5: What is a RDD?
The full form of RDD is resilience distributed dataset. It is a representation of data located on a network which is
Immutable - You can operate on the rdd to produce another rdd but you can’t alter it.
Partitioned / Parallel - The data located on RDD is operated in parallel. Any operation on RDD is done using multiple nodes.
Resilience - If one of the node hosting the partition fails, another nodes takes its data.
RDD provides two kinds of operations: Transformations and Actions.
Q6: What is Transformations?
Ans: The transformations are the functions that are applied on an RDD (resilient distributed data set). The transformation results in another RDD. A transformation is not executed until an action follows.
The example of transformations are:
map() - applies the function passed to it on each element of RDD resulting in a new RDD.
filter() - creates a new RDD by picking the elements from the current RDD which pass the function argument.
Q7: What are Actions?
An action brings back the data from the RDD to the local machine. Execution of an action results in all the previously created transformation. The example of actions are:
reduce() - executes the function passed again and again until only one value is left. The function should take two argument and return one value.
take() - take all the values back to the local node form RDD.