Apache Hadoop and the Hadoop Ecosystem
Although Hadoop is best known for MapReduce and its distributed filesystem (HDFS, renamed from NDFS), the term is also used for a family of related projects that fall under the umbrella of infrastructure for distributed computing and large-scale data processing. All of the core projects covered in this book are hosted by the Apache Software Foundation, which provides support for a community of open source software projects, including the original HTTP Server from which it gets its name. As the Hadoop ecosystem grows, more projects are appearing, not necessarily hosted at Apache, which provide complementary services to Hadoop, or build on the core to add higher-level abstractions.
Common: A set of components and interfaces for distributed filesystems and general I/O (serialization, Java RPC, persistent data structures).
Avro: A serialization system for efficient, cross-language RPC and persistent data storage.
MapReduce: A distributed data processing model and execution environment that runs on large clusters of commodity machines.
HDFS: A distributed filesystem that runs on large clusters of commodity machines.
Pig: A data flow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters.
Hive: A distributed data warehouse. Hive manages data stored in HDFS and provides a SQL-based query language, which the runtime engine translates to MapReduce jobs, for querying the data.
HBase: A distributed, column-oriented database. HBase uses HDFS for its underlying storage, and supports both batch-style computations using MapReduce and point queries (random reads).
ZooKeeper: A distributed, highly available coordination service. ZooKeeper provides primitives such as distributed locks that can be used for building distributed applications.
Sqoop: A tool for efficiently moving data between relational databases and HDFS.
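To make the distributed data processing model in the list above (MapReduce) a little more concrete, here is a minimal single-process sketch of its map, shuffle, and reduce steps, using word counting as the example. It does not use any Hadoop API and runs entirely in memory; the names map_fn, reduce_fn, and run_mapreduce are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def map_fn(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce step: combine all values collected for a single key."""
    return word, sum(counts)


def run_mapreduce(lines):
    """Run map, shuffle, and reduce in one process (illustrative only)."""
    # Shuffle: group the intermediate values by key. On a real cluster the
    # framework does this grouping across machines.
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    # Reduce: each key is processed independently of the others.
    return dict(reduce_fn(key, values) for key, values in sorted(groups.items()))


if __name__ == "__main__":
    print(run_mapreduce(["the quick brown fox", "the lazy dog"]))
```

On a real cluster the map and reduce functions would run in parallel on many machines, with input and output typically stored in a distributed filesystem; the sketch only shows the shape of the computation.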