Hadoop Interview Questions

1. What are the most common Input Formats in Hadoop?

Text Input Format: the default input format in Hadoop. It reads files line by line; the byte offset of each line is the key and the line contents are the value.

Key Value Input Format: used for plain text files where each line is split into a key and a value by a separator character (tab by default).

Sequence File Input Format: used for reading Hadoop sequence files (binary key/value files) in sequence.
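
For illustration, here is a minimal MapReduce driver sketch in Java that selects one of these input formats on the Job object. The class name InputFormatDriver, the job name and the command-line paths are placeholders; the input format classes are the standard ones from the org.apache.hadoop.mapreduce.lib.input package.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class InputFormatDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "input-format-demo"); // job name is a placeholder
        job.setJarByClass(InputFormatDriver.class);

        // TextInputFormat is the default, so this call is only needed when
        // switching to another format such as KeyValueTextInputFormat
        // or SequenceFileInputFormat.
        job.setInputFormatClass(KeyValueTextInputFormat.class);

        // With the identity mapper and reducer, keys and values are both Text.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}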

2. What are the core methods of a Reducer?

setup(): called once at the start of the task; used for configuring various parameters such as the input data size and the distributed cache. Signature: public void setup(Context context).

reduce(): the heart of the reducer; called once per key with all the values associated with that key. Signature: public void reduce(Key key, Iterable<Value> values, Context context).

cleanup(): called only once, at the end of the task, to clean up temporary files and release resources. Signature: public void cleanup(Context context).
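
A minimal Reducer sketch showing where the three methods fit, assuming a word-count style job; the class name SumReducer and the summing logic are illustrative, not taken from the text above.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable total = new IntWritable();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Runs once before any reduce() call: read job configuration,
        // open side files from the distributed cache, and so on.
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Called once per key with all values grouped for that key.
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        total.set(sum);
        context.write(key, total);
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Runs once after the last reduce() call: release resources,
        // delete temporary files.
    }
}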

3. What are the different configuration files available in Hadoop?

core-site.xml: common settings for the whole cluster, such as fs.defaultFS (the NameNode URI) and I/O settings.

hdfs-site.xml: HDFS daemon settings, such as the replication factor and the NameNode/DataNode storage directories.

mapred-site.xml: MapReduce settings, including which framework (for example, yarn) runs the jobs.

yarn-site.xml: ResourceManager and NodeManager settings.

hadoop-env.sh: environment variables used by the Hadoop daemons, such as JAVA_HOME.
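
As a small sketch, core-site.xml typically carries the fs.defaultFS property that points clients at the NameNode; the hostname and port below are placeholders.

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode-host:9000</value>
  </property>
</configuration>
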
4. What is the problem with small files in Hadoop?

Hadoop is not well suited to small data: HDFS lacks the ability to support efficient random reading of small files. A small file in HDFS is one significantly smaller than the HDFS block size (128 MB by default). HDFS is designed to store large datasets as a small number of large files; it is not suitable for a large number of small files. Because the NameNode holds the entire HDFS namespace in memory, a huge number of small files overloads the NameNode.
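
As a rough, back-of-the-envelope estimate (not a figure from this text): each file, directory and block is commonly said to occupy about 150 bytes of NameNode heap, so 10 million single-block small files amount to roughly 20 million namespace objects, or around 3 GB of NameNode memory, before a single byte of data is processed.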

5. What is throughput in Hadoop?

Throughput is the amount of work done per unit of time. HDFS provides good throughput for the following reasons:

HDFS follows the Write Once, Read Many model. This simplifies data-coherency issues, because data written once cannot be modified, and thus allows high-throughput data access.

Hadoop works on the data locality principle: computation is moved to the data instead of moving data to the computation. This reduces network congestion and therefore improves overall system throughput.

6. What is configured in /etc/hosts and what is its role in setting Hadoop cluster?

The /etc/hosts file maps hostnames to IP addresses. In a Hadoop cluster, we store the hostnames of all nodes (master and slaves) together with their IP addresses in /etc/hosts, so that hostnames can be used throughout the configuration instead of IP addresses.
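
A sketch of what the file might contain on a small three-node cluster; the IP addresses and hostnames are placeholders.

192.168.1.10   hadoop-master
192.168.1.11   hadoop-slave1
192.168.1.12   hadoop-slave2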

7. How is indexing done in HDFS?

Hadoop has its own way of indexing. Once the Hadoop framework has stored data according to the block size, HDFS keeps storing the last part of the data, which indicates where the next part of the data is located. This is, in effect, the basis of indexing in HDFS.

8. What is Fault Tolerance in HDFS?

Fault tolerance in HDFS is the ability of the system to keep working under unfavorable conditions (such as node crashes or hardware failures). HDFS handles faults through replica creation. When a client stores a file in HDFS, the file is divided into blocks, and these blocks are distributed across different machines in the cluster. HDFS also creates a replica of each block on other machines; by default it keeps 3 copies of every block. If any machine goes down or fails, the user can still access the data from another machine in the cluster that holds a replica of the block.
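
The default of 3 replicas comes from the dfs.replication property in hdfs-site.xml, and the replication factor can also be changed per file. Below is a minimal Java sketch using the FileSystem API; the file path is a placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Request 3 replicas for a single file; the path is a placeholder.
        boolean accepted = fs.setReplication(new Path("/data/example.txt"), (short) 3);
        System.out.println("Replication change accepted: " + accepted);
        fs.close();
    }
}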

Frequently Asked Interview Questions

What are the different execution engines available for Hive?

How do you connect Hive to HDFS, and which property configures the connection?

What is the default metastore in Hive?

What are views in Hive?

What are skewed tables in Hive?

What is Tez?

What is the cluster size?

What are the different interfaces in Hive?

What is the trash interval?

What are partitioning and bucketing in Hive? Explain with a real-time example.