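Text Input Format (TextInputFormat): the default input format in Hadoop. It reads plain text files line by line; the key is the byte offset of the line and the value is the line itself.
Key Value Input Format (KeyValueTextInputFormat): used for plain text files that are broken into lines, where each line is split into a key and a value at a separator (a tab by default).
Sequence File Input Format (SequenceFileInputFormat): used for reading sequence files, Hadoop's binary key/value file format.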
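To make the difference between the first two formats concrete, the sketch below imitates in plain Java how each one turns a line of text into a key/value record. It does not use the Hadoop API; the class and method names are hypothetical, and the tab separator is the default assumed here.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Plain-Java imitation of how two Hadoop input formats form records.
// Not the Hadoop API: class and method names here are hypothetical.
class InputFormatSketch {

    // TextInputFormat style: key = byte offset of the line, value = the line.
    static List<String[]> textRecords(String content) {
        List<String[]> records = new ArrayList<>();
        long offset = 0;
        for (String line : content.split("\n")) {
            records.add(new String[] {Long.toString(offset), line});
            // +1 accounts for the newline byte that ended this line
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return records;
    }

    // KeyValueTextInputFormat style: split each line at the first separator;
    // with no separator the whole line becomes the key and the value is empty.
    static String[] keyValueRecord(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos < 0) {
            return new String[] {line, ""};
        }
        return new String[] {line.substring(0, pos), line.substring(pos + 1)};
    }

    public static void main(String[] args) {
        for (String[] r : textRecords("alice\t30\nbob\t25")) {
            System.out.println(r[0] + " -> " + r[1]);
        }
        String[] kv = keyValueRecord("alice\t30", '\t');
        System.out.println(kv[0] + " -> " + kv[1]);
    }
}
```

With the first style a mapper receives the line position as its key, while with the second it receives the text before the separator, which is convenient when the input is already in key/value form.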
setup(): called once at the start of the task, before any key is processed. It is used for configuring parameters such as the input data size and the distributed cache. protected void setup(Context context)
reduce(): the heart of the reducer, called once per key with all the values associated with that key. protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context)
cleanup(): called only once at the end of the task to clean up temporary files. protected void cleanup(Context context)
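The call order of these three methods can be sketched as follows. This is a plain-Java model and does not use the Hadoop API: the class name, the calls log and the summing logic are hypothetical. In Hadoop itself the methods belong to org.apache.hadoop.mapreduce.Reducer and also take a Context argument.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of the reducer call order; it does not use the Hadoop API.
// The class name and the 'calls' log are hypothetical, for illustration only.
class ReducerLifecycleSketch {
    final List<String> calls = new ArrayList<>();
    final Map<String, Integer> output = new TreeMap<>();

    void setup() {
        calls.add("setup");
    }

    // Receives one key together with all of the values grouped under it.
    void reduce(String key, Iterable<Integer> values) {
        calls.add("reduce:" + key);
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        output.put(key, sum);
    }

    void cleanup() {
        calls.add("cleanup");
    }

    // Drives the lifecycle: setup once, reduce once per key, cleanup once.
    void run(Map<String, List<Integer>> groupedInput) {
        setup();
        try {
            for (Map.Entry<String, List<Integer>> e : groupedInput.entrySet()) {
                reduce(e.getKey(), e.getValue());
            }
        } finally {
            cleanup(); // runs even if a reduce call throws
        }
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        grouped.put("apple", Arrays.asList(1, 1, 1));
        grouped.put("pear", Arrays.asList(1));
        ReducerLifecycleSketch r = new ReducerLifecycleSketch();
        r.run(grouped);
        System.out.println(r.calls);
        System.out.println(r.output);
    }
}
```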
Hadoop is not suited to small data. HDFS does not support efficient random reads of small files. A small file in HDFS is one that is significantly smaller than the HDFS block size (128 MB by default). HDFS is designed to store large datasets as a small number of large files, so it handles a very large number of small files poorly. Every file and every block is an entry in the HDFS namespace, which the NameNode stores, so a huge number of small files overloads the NameNode.
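A rough calculation shows why the number of files matters more than the amount of data. The sketch below counts one namespace object per file plus one per block, and multiplies by an assumed 150 bytes of NameNode memory per object. That figure is a commonly quoted rule of thumb used here as an assumption, not a measured value, and the class and method names are hypothetical.

```java
// Back-of-the-envelope estimate of NameNode metadata load.
// BYTES_PER_OBJECT is an assumed rule-of-thumb figure, not a measured value.
class SmallFilesSketch {
    static final long BLOCK_SIZE = 128L * 1024 * 1024; // default HDFS block size
    static final long BYTES_PER_OBJECT = 150;          // assumption

    // One namespace object per file plus one per block of each file.
    static long namespaceObjects(long fileCount, long bytesPerFile) {
        long blocksPerFile = (bytesPerFile + BLOCK_SIZE - 1) / BLOCK_SIZE; // ceiling
        return fileCount * (1 + blocksPerFile);
    }

    static long estimatedHeapBytes(long fileCount, long bytesPerFile) {
        return namespaceObjects(fileCount, bytesPerFile) * BYTES_PER_OBJECT;
    }

    public static void main(String[] args) {
        long oneGiB = 1024L * 1024 * 1024;
        // The same 1 GiB stored as one large file or as 8192 files of 128 KiB.
        System.out.println(namespaceObjects(1, oneGiB));
        System.out.println(namespaceObjects(8192, 128L * 1024));
    }
}
```

Under these assumptions the single large file needs 9 namespace objects (1 file and 8 blocks), while the same data as 8192 small files needs 16384, because every small file still occupies a block entry of its own.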
Throughput is the amount of work done per unit of time. HDFS provides good throughput for the following reasons:
HDFS follows a write-once, read-many model. Because data is not modified in place once it has been written, data coherency issues are simplified, which allows high-throughput data access.
Hadoop works on the data locality principle, which states that computation should move to the data rather than the data to the computation. This reduces network congestion and therefore increases overall system throughput.
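The data locality preference can be illustrated with a toy scheduler. This is a minimal sketch and not Hadoop's actual scheduler: it ignores rack awareness, and the node names and the pickNode method are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Toy scheduler showing the data-locality preference; not Hadoop's scheduler.
// Node names and the method are hypothetical.
class DataLocalitySketch {

    // Prefer a free node that already stores a replica of the block;
    // fall back to any free node, which then has to read over the network.
    static String pickNode(List<String> replicaHolders, List<String> freeNodes) {
        for (String node : freeNodes) {
            if (replicaHolders.contains(node)) {
                return node; // data-local: no block transfer needed
            }
        }
        return freeNodes.isEmpty() ? null : freeNodes.get(0);
    }

    public static void main(String[] args) {
        List<String> holders = Arrays.asList("node2", "node5", "node7");
        System.out.println(pickNode(holders, Arrays.asList("node1", "node5")));
        System.out.println(pickNode(holders, Arrays.asList("node1", "node3")));
    }
}
```

In the first call the task lands on node5, which already holds the block, so nothing crosses the network. In the second call no free node holds a replica, so the task runs on node1 and must fetch the block remotely.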
The /etc/hosts file maps IP addresses to hostnames. In a Hadoop cluster, we store the hostnames of all nodes (master and slaves) together with their IP addresses in /etc/hosts, so that we can refer to the nodes by hostname instead of by IP address.
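Hadoop has its own way of indexing, which differs from a database index. The framework stores the data in blocks according to the block size, and HDFS keeps track, for each file, of the ordered list of its blocks and of where each block is stored. This block metadata is what tells HDFS where the next part of the data is, and it is the basis of how HDFS works.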
Fault tolerance in HDFS is the ability of the system to keep working under unfavorable conditions (such as a node crash or a hardware failure). HDFS handles faults through replication. When a client stores a file in HDFS, the file is divided into blocks, and the blocks are distributed across different machines in the cluster. HDFS also creates replicas of each block on other machines in the cluster. The default replication factor is 3, so each block is stored on three machines in total. If any machine in the cluster goes down, the user can still access the data from another machine that holds a replica of the block.
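The effect of replication described above can be sketched with a toy model. The round-robin placement used here is an assumption made for simplicity and is not HDFS's rack-aware placement policy; the class and method names are hypothetical, and the model assumes the replication factor does not exceed the number of nodes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model of block replication; placement here is simple round-robin,
// which is an assumption and not HDFS's rack-aware placement policy.
class ReplicationSketch {
    static final long BLOCK_SIZE = 128L * 1024 * 1024; // default HDFS block size

    // Number of blocks needed for a file (ceiling division).
    static long blockCount(long fileBytes) {
        return (fileBytes + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    // Returns, for each block, the indexes of the nodes that hold a replica.
    static List<List<Integer>> place(long blocks, int nodes, int replication) {
        List<List<Integer>> placement = new ArrayList<>();
        for (int b = 0; b < blocks; b++) {
            List<Integer> holders = new ArrayList<>();
            for (int r = 0; r < replication; r++) {
                holders.add((b + r) % nodes);
            }
            placement.add(holders);
        }
        return placement;
    }

    // A file is readable if every block has a replica on at least one live node.
    static boolean readable(List<List<Integer>> placement, Set<Integer> failedNodes) {
        for (List<Integer> holders : placement) {
            if (failedNodes.containsAll(holders)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        long blocks = blockCount(300L * 1024 * 1024); // a 300 MiB file
        List<List<Integer>> placement = place(blocks, 5, 3);
        System.out.println(placement);
        System.out.println(readable(placement, new HashSet<>(Arrays.asList(0, 1))));
        System.out.println(readable(placement, new HashSet<>(Arrays.asList(0, 1, 2))));
    }
}
```

In this model a 300 MiB file occupies 3 blocks, each held by 3 of the 5 nodes. The file stays readable when two nodes fail, and becomes unreadable only when all three holders of some block are down at once.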