The Architecture of Apache Hadoop

The Apache Hadoop platform is highly fault-tolerant in the face of system disasters. At the core of Apache Hadoop are the Hadoop Distributed File System (HDFS) and MapReduce, which distribute high-velocity streams of big data across multiple racks of low-priced commodity servers. HDFS places no limit of its own on file size for storage, write, and read operations; any limit arises from the aggregate disk capacity of the cluster's machines, not from HDFS. HDFS also supports streaming analytics, such as video analytics, for domains ranging from weather-pattern monitoring to space exploration programs. To accommodate such streaming big data analytics, Apache Hadoop had to relax some requirements of the Portable Operating System Interface (POSIX) design. When Apache Hadoop stores data on HDFS, in all likelihood some nodes on the commodity servers will go down due to system crashes, hard disk failures from overheating, or power outages. The design and architecture of HDFS take these disasters into consideration: the system automatically reroutes work to the functioning nodes and restores the full machine network in very little time (Borthakur, 2013).

Designed as an enterprise-grade batch-processing file system, HDFS performs data processing on massive volumes, reaching petabytes of data. For file operations, HDFS follows a write-once, read-many access model: data is written into a file at creation time as a one-time operation. HDFS can read the file any number of times, but the file cannot be rewritten; this avoids overwrite conflicts and boosts input/output throughput on large data sets. A single server, even with multiple cores, cannot batch-process such data sets beyond a limit. HDFS, however, batch-processes with a large-scale parallelization technique across many servers to increase data-processing throughput. HDFS clusters can run on divergent operating systems such as Windows, Mac OS X, Unix, and Linux, and this platform independence provides the unique ability to migrate data between operating systems seamlessly (Borthakur, 2013).
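The write-once, read-many contract is visible directly in the client API. Below is a minimal Java sketch against the standard org.apache.hadoop.fs.FileSystem interface; it assumes a reachable HDFS cluster configured through fs.defaultFS, and the path /tmp/example.txt is purely illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteOnceReadMany {
    public static void main(String[] args) throws Exception {
        // Assumes fs.defaultFS in the client configuration points at a
        // running HDFS name node; the path below is illustrative.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/tmp/example.txt");

        // Write once: the file's contents are fixed when the stream closes.
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeUTF("written exactly once");
        }

        // Read many: the same immutable file can be opened repeatedly.
        for (int i = 0; i < 3; i++) {
            try (FSDataInputStream in = fs.open(file)) {
                System.out.println(in.readUTF());
            }
        }
        fs.close();
    }
}
```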

The architecture of the HDFS system encompasses the name node, data nodes, and blocks. The name node fundamentally orchestrates the file system namespace and administers the privileges and rights for file operations, performing namespace operations similar to those of other operating systems. Typically there is a one-to-one ratio between data nodes and machines in the cluster. A file is divided into multiple blocks, the basic data storage elements, and there is a logical mapping between the blocks and the data nodes. Based on top-down orchestration from the name node, each data node carries out the file and block operations, such as block creation, deletion, and replication (Borthakur, 2013).
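This mapping between blocks and data nodes can be observed from a client by asking the name node for a file's block locations. A small sketch, again assuming the standard FileSystem Java API and a hypothetical file path:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockMapping {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Ask the name node for the block layout of an existing file
        // (the path is illustrative).
        FileStatus status = fs.getFileStatus(new Path("/tmp/example.txt"));
        BlockLocation[] blocks =
            fs.getFileBlockLocations(status, 0, status.getLen());

        // Each entry is one block; getHosts() lists the data nodes
        // holding a replica of that block.
        for (int i = 0; i < blocks.length; i++) {
            System.out.printf("block %d offset=%d length=%d hosts=%s%n",
                i, blocks[i].getOffset(), blocks[i].getLength(),
                String.join(",", blocks[i].getHosts()));
        }
        fs.close();
    }
}
```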

HDFS files are usually broken into multiple chunks, stored as data blocks of homologous size. HDFS defines the size of the data blocks on a per-file basis. Backup data blocks (replicas) are generated for each file so that the data survives system disasters, and the name node handles all creation of these backup blocks. This replication design gives HDFS the fault tolerance to combat machine crashes. A rack houses a substantial number of nodes, and an HDFS cluster can span a large number of racks connected through the network infrastructure. Data exchange between nodes within a single rack differs from exchange between nodes mounted in separate racks: the latter is the harder case, as the traffic has to go through crossbar network switches. The data nodes continuously send signals back to the name node to indicate the availability of the primary and backup data blocks. The system administrator has to optimize this data exchange and the definition of the number of data blocks per node and per cluster.

When there is a power outage or system crash on a single rack, the name node can redirect write and read requests to backup data blocks on other racks. Because the backup data blocks are spread out among multiple racks, the cluster can switch over immediately after a machine crash, which keeps the system up and running for HDFS operations at all times. HDFS can also run across multiple global data centers, with a combination of cloud-based and on-premise infrastructure. When a disaster recovery plan is designed, the data blocks and backup data blocks are delineated with locality in mind to reduce the points of failure: requests are rerouted to other data nodes whenever particular data blocks stop transmitting signals (Borthakur, 2013).
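Both the block size and the replication factor can be chosen per file at creation time through the FileSystem API. A hedged sketch follows; the 64 MB block size matches the HDFS 1.x default and the triple replication mirrors the usual cluster default, but both values here are illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicatedWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        int bufferSize = 4096;               // client-side I/O buffer
        short replication = 3;               // replicas per block (a common cluster default)
        long blockSize = 64L * 1024 * 1024;  // 64 MB blocks (the HDFS 1.x default)

        // Create the file with an explicit replication factor and block size
        // (the path is illustrative).
        Path file = new Path("/tmp/replicated.txt");
        try (FSDataOutputStream out =
                 fs.create(file, true, bufferSize, replication, blockSize)) {
            out.writeUTF("each block of this file is stored on three data nodes");
        }

        // The replication factor can also be adjusted after the fact.
        fs.setReplication(file, (short) 2);
        fs.close();
    }
}
```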

At the core of the Apache Hadoop platform there is another critical component: MapReduce. Most traditional RDBMS systems schedule jobs as a single task running on a primary instance, or with multiple threads spawning across multiple application servers. MapReduce, by contrast, is the governing programming framework for highly scalable, high-speed computing that parallel-processes, monitors, and schedules task workloads on clusters of thousands of low-priced servers. The monitoring functionality detects batch tasks that did not complete due to machine failure and reschedules them. MapReduce operates on the data blocks that make up a file: a map function reads input records and produces intermediate key-value pairs; the framework groups the intermediate values by key; and a reduce function merges the values for each key into the final output key-value pairs. Applications invoke the map and reduce tasks through object-oriented (Java) application programming interfaces (Hadoop, 2013).
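The canonical illustration of this key-value pipeline is word count, adapted here from the Apache MapReduce tutorial cited above (Hadoop, 2013), using the newer org.apache.hadoop.mapreduce API. The mapper emits a (word, 1) pair per token, the framework groups the pairs by word, and the reducer sums the counts:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit (word, 1) for every token in the input split.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts gathered for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```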

MapReduce leverages the same replication concepts as HDFS backup data blocks. When map and reduce tasks are running, they show high fault tolerance and can switch over to a replica for recovery without having to physically move data from one block to another. The Message Passing Interface (MPI) technique, in contrast, requires the data to be available at a specific physical location when there is a system failure. This localization without data movement puts MapReduce ahead of the pack when it comes to fault tolerance. The map and reduce tasks are autonomous, self-reliant, and independent of one another. Such independence limits the points of failure, as no single task failure can cascade into a massive one. The JobTracker in MapReduce can seamlessly reschedule any failed task to keep the parallel computation moving. MPI, by comparison, requires data migration to reschedule tasks, which is far less efficient (Jin, Qiao, Sun, & Li, 2010).
The MapReduce JobTracker exchanges a periodic heartbeat signal with each worker that performs a task on a data block. When the signal from a task stops arriving, MapReduce treats that machine as failed, spots the pattern across its tasks, and reschedules the jobs on different nodes or data blocks. Thus, HDFS and MapReduce complement each other and are both highly fault-tolerant to machine failures. The design ensures that requests are rerouted to new tasks without sacrificing performance (Dean & Ghemawat, 2004).
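The retry budget behind this rescheduling is configurable per job. A small sketch using the classic org.apache.hadoop.mapred.JobConf API; the default in both cases is four attempts per task:

```java
import org.apache.hadoop.mapred.JobConf;

public class RetryConfig {
    public static void main(String[] args) {
        // The JobTracker reschedules a failed task up to this many times
        // before failing the job as a whole (default: 4 attempts per task).
        JobConf conf = new JobConf(RetryConfig.class);
        conf.setMaxMapAttempts(4);
        conf.setMaxReduceAttempts(4);
        System.out.println("map attempts: " + conf.getMaxMapAttempts());
    }
}
```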

References

Borthakur, D. (2013). HDFS Architecture Guide. Retrieved October 28, 2015, from https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html#Introduction
Dean, J., & Ghemawat, S. (2004). MapReduce: Simplified Data Processing on Large Clusters. Retrieved October 30, 2015, from http://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf
Hadoop, A. (2013). MapReduce Tutorial. Retrieved October 30, 2015, from https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
Jin, H., Qiao, K., Sun, X., & Li, Y. (2010). Performance under Failures of MapReduce Applications. Retrieved October 30, 2015, from http://www.cs.iit.edu/~scs/psfiles/PID1705069.pdf
