Hadoop is an open-source framework that allows distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, it is designed to detect and handle failures at the application layer, thereby delivering a very high degree of fault tolerance.
Origins and History
Hadoop was created by Doug Cutting, an engineer at Yahoo!, and owes much to his earlier work on Nutch, an open-source web search project. The first stable release, Hadoop 1.0, arrived at the end of 2011. Since then Hadoop has grown massively in popularity to become one of the dominant technologies for processing big data in a distributed manner. Much of the early development was driven by Yahoo!, and the project is now maintained by the Apache Software Foundation, where it continues as a top-level open-source project.
Architecture and Components
At its core, Hadoop consists of the Hadoop Distributed File System (HDFS) for storage and MapReduce for processing data in a distributed fashion across compute nodes. MapReduce is a programming model for parallel processing of large datasets: a job's input is split into independent chunks that are processed as separate tasks, executed in parallel on different nodes.
Hadoop Distributed File System (HDFS)
HDFS stores large amounts of data reliably across commodity machines in a large cluster. It provides high-throughput access to application data and is designed to be portable across heterogeneous hardware. Internally, it divides each file into large blocks (typically 128 MB) and stores multiple replicas of each block on the local disks of different machines in the cluster. This replication provides fault tolerance: if a node fails, another node already holds a copy of its data.
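For illustration, the short Java sketch below uses the standard org.apache.hadoop.fs API to ask the NameNode what block size and replication factor it recorded for a given file. The cluster address and file path are hypothetical placeholders; in practice these settings normally come from the cluster's core-site.xml and hdfs-site.xml rather than from code.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; normally picked up from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/data/sales/transactions.csv"); // placeholder path
            FileStatus status = fs.getFileStatus(file);

            // Block size (typically 128 MB) and replication factor (typically 3)
            // are per-file properties recorded by the NameNode.
            System.out.println("Block size:  " + status.getBlockSize() + " bytes");
            System.out.println("Replication: " + status.getReplication());
        }
    }
}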
MapReduce
MapReduce is Hadoop's original processing engine; it reads its input from HDFS blocks and processes them as a job. A typical MapReduce job consists of a Map stage and a Reduce stage. The Map stage performs filtering and sorting of individual records (for example, grouping customer records by country), while the Reduce stage aggregates the output of the Map stage by key (for example, totalling revenue per country). MapReduce automatically parallelizes, schedules, monitors and manages the execution of Map and Reduce tasks across the cluster.
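To make that flow concrete, here is a minimal, hypothetical MapReduce job in Java mirroring the example above: the Mapper assumes comma-separated input lines of the form customer,country,revenue and emits (country, revenue) pairs, and the Reducer sums revenue per country. The class names and input layout are illustrative assumptions, not taken from any official example.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RevenueByCountry {

    // Map stage: parse each line (assumed CSV: customer,country,revenue)
    // and emit (country, revenue) pairs.
    public static class RevenueMapper
            extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            if (fields.length == 3) {
                context.write(new Text(fields[1].trim()),
                        new DoubleWritable(Double.parseDouble(fields[2].trim())));
            }
        }
    }

    // Reduce stage: the framework groups Map output by country; sum the
    // values to get total revenue per country.
    public static class SumReducer
            extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text country, Iterable<DoubleWritable> revenues,
                Context context) throws IOException, InterruptedException {
            double total = 0;
            for (DoubleWritable r : revenues) {
                total += r.get();
            }
            context.write(country, new DoubleWritable(total));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "revenue-by-country");
        job.setJarByClass(RevenueByCountry.class);
        job.setMapperClass(RevenueMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}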
YARN (Yet Another Resource Negotiator)
YARN is the cluster resource-management layer introduced in Hadoop 2.0 to take over the scheduling and resource-management duties of the original MapReduce engine. Whereas MapReduce is specifically tailored for batch processing, YARN allows Hadoop clusters to be used for general-purpose distributed applications. It separates global resource management (the ResourceManager) from per-application scheduling and monitoring (the ApplicationMaster), which makes the platform more extensible. With YARN, a cluster can run applications such as interactive queries and stream processing alongside MapReduce jobs.
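As a small sketch of YARN's role as a general-purpose resource manager, the Java snippet below uses the YARN client API to list whatever applications the ResourceManager is currently tracking, MapReduce or otherwise. It assumes a reachable ResourceManager whose address is picked up from a yarn-site.xml on the classpath.

import java.util.List;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnApplications {
    public static void main(String[] args) throws Exception {
        // Connects to the ResourceManager configured in yarn-site.xml.
        YarnConfiguration conf = new YarnConfiguration();
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        // YARN tracks every kind of application it schedules, not only
        // MapReduce; the application type field distinguishes them.
        List<ApplicationReport> apps = yarnClient.getApplications();
        for (ApplicationReport app : apps) {
            System.out.printf("%s  type=%s  state=%s  name=%s%n",
                    app.getApplicationId(), app.getApplicationType(),
                    app.getYarnApplicationState(), app.getName());
        }

        yarnClient.stop();
    }
}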
Ecosystem
Over the years, a whole ecosystem of projects and tools has grown up around the Hadoop core, steadily extending its capabilities. Some of the most prominent ecosystem projects are: Hive, which provides a SQL-like interface to query and analyze massive datasets stored in Hadoop; Pig, a high-level platform for analyzing large datasets; HBase, a non-relational distributed database; Kafka, a distributed streaming platform; Spark, a fast, general-purpose distributed computing framework; and Ambari, a tool for provisioning, managing and monitoring Apache Hadoop clusters.
Benefits of Using Hadoop
– Distributed computing – Hadoop can process very large datasets that do not fit on a single machine by distributing storage and processing across commodity hardware.
– Fault tolerance – HDFS's redundant storage protects against machine failures by keeping replicas of every block on multiple nodes.
– Scalability – The Hadoop architecture scales roughly linearly as more commodity machines are added to the cluster; compute and storage capacity can be expanded without downtime.
– Economics – Hadoop clusters can be built from inexpensive commodity hardware because hardware faults are handled at the software level through replication and task rescheduling.
– Flexibility – With the rich ecosystem, workloads such as batch processing, interactive queries and stream processing can all run on the same distributed storage and compute platform.
– Programming model accessibility – With options such as MapReduce, Pig and Hive, developers are shielded from the complexity of parallel and distributed systems programming and can focus on their problem-domain logic.
About Author - Vaagisha Singh
Vaagisha brings over three years of expertise as a content editor in the market research domain. Originally a creative writer, she discovered her passion for editing, combining her flair for writing with a meticulous eye for detail. Her ability to craft and refine compelling content makes her an invaluable asset in delivering polished and engaging write-ups.