MapReduce is a framework or a programming model that is used for processing large data sets over clusters of computers using distributed, parallel processing.
What are 'maps' and 'reduces'?
'Maps' and 'Reduces' are two phases of solving a query in HDFS. 'Map' is responsible for reading data from the input location and, based on the input type, generating a key-value pair, that is, an intermediate output on the local machine. 'Reducer' is responsible for processing the intermediate output received from the mapper and generating the final output.
What are the four basic parameters of a mapper?
The four basic parameters of a mapper are LongWritable, Text, Text and IntWritable. The first two represent the input parameters and the second two represent the intermediate output parameters.
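As an illustration, here is a minimal word-count mapper sketch using the classic org.apache.hadoop.mapred API; the class name and the tokenizing logic are assumptions of the sketch, not part of the question above:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Mapper<LongWritable, Text, Text, IntWritable>:
// the first two type parameters are the input key/value,
// the second two are the intermediate output key/value.
public class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
        // key = byte offset of the line, value = the line itself
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            output.collect(word, ONE); // emit intermediate (word, 1) pairs
        }
    }
}
```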
What are the four basic parameters of a reducer?
The four basic parameters of a reducer are Text, IntWritable, Text and IntWritable. The first two represent the intermediate output parameters and the second two represent the final output parameters.
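A matching reducer sketch, again in the classic mapred API and with an assumed class name:

```java
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Reducer<Text, IntWritable, Text, IntWritable>:
// the first two type parameters are the intermediate key/value,
// the second two are the final output key/value.
public class WordCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get(); // add up all counts for this word
        }
        output.collect(key, new IntWritable(sum)); // final (word, total) pair
    }
}
```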
What do the master class and the output class do?
The master class is defined to update the master or the job tracker, and the output class is defined to write data onto the output location.
What is the input type/format in MapReduce by default?
By default, the input type in MapReduce is 'text'.
Is it mandatory to set input and output type/format in MapReduce?
No, it is not mandatory to set the input and output type/format in MapReduce. By default, the cluster takes the input and the output type as 'text'.
What does the text input format do?
In the text input format, each line creates a line offset, which is the byte offset of the start of the line within the file. The key is taken as the line offset and the value as the whole line of text. This is how the data gets processed by a mapper: the mapper receives the 'key' as a 'LongWritable' and the value as a 'Text'.
What does the JobConf class do?
The JobConf class is used to logically separate different jobs running on the same cluster and to do job-level settings, such as declaring a job in a real environment. It is recommended that the job name be descriptive and represent the type of job being executed.
What does conf.setMapperClass do?
Conf.setMapperClass sets the mapper class and all the stuff related to the map job, such as reading the data and generating a key-value pair out of the mapper.
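Putting the two together, here is a minimal driver sketch showing JobConf and conf.setMapperClass in context; the class names and argument positions are assumptions of the sketch:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCount {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount"); // descriptive name shown by the job tracker

        conf.setMapperClass(WordCountMapper.class);   // map side of the job
        conf.setReducerClass(WordCountReducer.class); // reduce side of the job

        conf.setOutputKeyClass(Text.class);          // final output key type
        conf.setOutputValueClass(IntWritable.class); // final output value type

        FileInputFormat.setInputPaths(conf, new Path(args[0]));  // input location
        FileOutputFormat.setOutputPath(conf, new Path(args[1])); // output location

        JobClient.runJob(conf);
    }
}
```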
What do sorting and shuffling do?
Sorting and shuffling are responsible for creating a unique key and a list of values. Making similar keys available at one location is known as sorting, and the process by which the intermediate output of the mapper is sorted and sent across to the reducers is known as shuffling.
What does a split do?
Before transferring the data from the hard disk location to the map method, there is a phase or method called the 'split'. The split method pulls a block of data from HDFS to the framework. It does not write anything, but reads data from the block and passes it to the mapper. By default, the split is taken care of by the framework; the split size is equal to the block size and is used to divide the input data into a bunch of splits.
How can we change the split size if our commodity hardware has less storage space?
If our commodity hardware has less storage space, we can change the split size by writing a 'custom splitter'. There is a feature of customization in Hadoop for this, which can be invoked from the main method.
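As a sketch of that customization, the split size can also be influenced through configuration in a driver like the one above; the property name is from the classic API, and the sizes shown are illustrative assumptions:

```java
// Classic (mapred) API: raise the minimum split size so each split can
// cover more data than one block, reducing the number of map tasks.
conf.setLong("mapred.min.split.size", 128L * 1024 * 1024); // 128 MB

// Newer (mapreduce) API alternative: cap the maximum split size instead, e.g.
// org.apache.hadoop.mapreduce.lib.input.FileInputFormat
//     .setMaxInputSplitSize(job, 32L * 1024 * 1024); // 32 MB
```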
What does a MapReduce partitioner do?
A MapReduce partitioner makes sure that all the values of a single key go to the same reducer, thus allowing an even distribution of the map output over the reducers. It redirects the mapper output to the reducer by determining which reducer is responsible for a particular key.
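For illustration, here is a custom partitioner in the classic mapred API that mirrors the default hash-based behaviour; the class name is an assumption of the sketch:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class WordPartitioner implements Partitioner<Text, IntWritable> {

    @Override
    public void configure(JobConf job) {
        // no per-job configuration needed for this sketch
    }

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Identical keys hash identically, so every value of a given key
        // lands on the same reducer.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

It would be registered in the driver with conf.setPartitionerClass(WordPartitioner.class).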
How is Hadoop different from other data processing tools?
In Hadoop, based upon your requirements, you can increase or decrease the number of mappers without bothering about the volume of data to be processed. This is the beauty of parallel processing, in contrast to the other data processing tools available.
Can we rename the output file?
Yes, we can rename the output file by implementing a multiple-format output class.
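One hedged sketch of this, using the classic MultipleTextOutputFormat; the class name and the naming scheme are assumptions for illustration:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

public class RenamingOutputFormat
        extends MultipleTextOutputFormat<Text, IntWritable> {

    @Override
    protected String generateFileNameForKeyValue(Text key, IntWritable value,
                                                 String name) {
        // 'name' is the default file name such as "part-00000";
        // return a custom name for the output file instead.
        return "wordcount-" + name;
    }
}
```

It would be wired into the driver with conf.setOutputFormat(RenamingOutputFormat.class).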
Why can't we do aggregation (addition) in a mapper? Why do we require the reducer for that?
We cannot do aggregation (addition) in a mapper because sorting is not done in a mapper; sorting happens only on the reducer side. Mapper initialization depends upon each input split: for each split a new mapper gets initialized, so while doing aggregation we would lose the value of the previous instance and have no track of the previous rows' values.
What is Streaming?
Streaming is a feature of the Hadoop framework that allows us to do programming using MapReduce in any programming language that can accept standard input and produce standard output. It could be Perl, Python or Ruby, and need not necessarily be Java. However, customization in MapReduce can only be done using Java and not any other programming language.
What is a Combiner?
A 'Combiner' is a mini reducer that performs the local reduce task. It
receives the input from the mapper on a particular node and sends the
output to the reducer. Combiners help in enhancing the efficiency of
MapReduce by reducing the quantum of data that is required to be sent
to the reducers.
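In a driver like the sketch above, the reducer itself is commonly reused as the combiner, which is valid when the reduce function is commutative and associative, as summing counts is:

```java
// Run a local reduce on each mapper node before the shuffle,
// shrinking the data sent across the network to the reducers.
conf.setCombinerClass(WordCountReducer.class);
```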
What is the difference between an HDFS Block and an Input Split?
An HDFS Block is the physical division of the data, while an Input Split is the logical division of the data.
What happens in a TextInputFormat?
In TextInputFormat, each line in the text file is a record. The key is the byte offset of the line and the value is the content of the line. For instance, key: LongWritable, value: Text.
What do you know about KeyValueTextInputFormat?
In KeyValueTextInputFormat, each line in the text file is a 'record'. The first separator character divides each line: everything before the separator is the key and everything after the separator is the value. For instance, key: Text, value: Text.
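A sketch of wiring this up in the classic mapred API; the comma separator is an assumption of the sketch (the default is a tab):

```java
import org.apache.hadoop.mapred.KeyValueTextInputFormat;

// Treat everything before the first ',' as the key, the rest as the value.
conf.setInputFormat(KeyValueTextInputFormat.class);
conf.set("key.value.separator.in.input.line", ",");
```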
What do you know about SequenceFileInputFormat?
SequenceFileInputFormat is an input format for reading sequence files. Key and value are user defined. It is a specific compressed binary file format optimized for passing data between the output of one MapReduce job and the input of some other MapReduce job.
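A sketch of chaining two jobs through a sequence file, where conf1 and conf2 stand for the two jobs' assumed JobConf objects:

```java
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;

// Job 1 writes its output as a compressed binary sequence file...
conf1.setOutputFormat(SequenceFileOutputFormat.class);

// ...and Job 2 reads it back directly, with no text parsing in between.
conf2.setInputFormat(SequenceFileInputFormat.class);
```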
What do you know about NLineInputFormat?
NLineInputFormat splits 'n' lines of input as one split.
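A sketch using the classic mapred API; the line count of 1000 is an illustrative assumption:

```java
import org.apache.hadoop.mapred.lib.NLineInputFormat;

// Each split (and therefore each mapper) receives exactly 1000 input lines.
conf.setInputFormat(NLineInputFormat.class);
conf.setInt("mapred.line.input.format.linespermap", 1000);
```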