Overview of Hadoop Stack
HDFS stores the data. YARN is the resource manager. MapReduce is one option for the execution engine; Spark is another. Tez is also an option in Hadoop 2.0. Applications are layered on top of these engines.
HBase - a scalable, distributed database that supports structured data storage for large tables
Hive - a data warehouse infrastructure that provides data summarization and ad hoc querying
Pig - a high-level data-flow language and execution framework for parallel computation
Spark - a fast and general compute engine for Hadoop data. Wide range of applications: ETL, machine learning, stream processing, and graph analytics.
Cloudera Setup:
HDFS and HDFS2
Concept:
Scalable distributed filesystem
Distribute data on local disks on several nodes
Low cost commodity hardware
Design goals:
Resilience - recover from failures of nodes or of components within a node
Scalability - spread the data in blocks across many nodes; scale namespace capacity
Application locality - the data scales out, but the application does not have to move with it: the computation is sent to each compute node, so each task runs on the node that holds its data
Portability - runs on commodity hardware across different operating systems, with little change needed
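The application locality goal can be sketched as a small placement rule. This is only an illustration, not Hadoop's actual scheduler: the function name pick_node, the node names, and the idea of "free slots" per node are all assumptions made for the example.

```python
from typing import Dict, List, Optional


def pick_node(block_id: str,
              block_locations: Dict[str, List[str]],
              free_slots: Dict[str, int]) -> Optional[str]:
    """Prefer a node that already stores the block; otherwise use any free node."""
    # First choice: a node that holds a replica of the block (data-local task).
    for node in block_locations.get(block_id, []):
        if free_slots.get(node, 0) > 0:
            return node
    # Fallback: any node with a free slot; the block must then be read remotely.
    for node, slots in sorted(free_slots.items()):
        if slots > 0:
            return node
    return None


# Hypothetical cluster state: which nodes hold which block, and free task slots.
block_locations = {"blk_1": ["node-b", "node-c"], "blk_2": ["node-c"]}
free_slots = {"node-a": 1, "node-b": 1, "node-c": 0}

local_choice = pick_node("blk_1", block_locations, free_slots)
remote_choice = pick_node("blk_2", block_locations, free_slots)
```

Here blk_1 is processed on a node that stores it, while blk_2 falls back to a node without a local copy because its only holder has no free slot.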
Architecture:
Single NameNode
Holds the metadata: filesystem state, block information, edit and transaction info, locks
Multiple DataNodes - data is spread in blocks across many nodes
Manage storage - blocks of data (downward)
Serve read/write requests from clients (upward)
Block creation, deletion, replication (horizontally) - the default replication factor is 3
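A minimal sketch of how a file could be cut into blocks and each block replicated on 3 DataNodes. The 128 MB block size is an assumption (it is configurable per cluster), and the round-robin placement is a deliberate simplification rather than HDFS's real placement policy; plan_blocks and the node names are hypothetical.

```python
import math
from typing import Dict, List

BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size (128 MB); configurable in practice
REPLICATION = 3                 # default replication factor from the notes above


def plan_blocks(file_size: int,
                datanodes: List[str],
                block_size: int = BLOCK_SIZE,
                replication: int = REPLICATION) -> Dict[int, List[str]]:
    """Map each block index to the DataNodes that would hold its replicas."""
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    num_blocks = math.ceil(file_size / block_size)
    placement: Dict[int, List[str]] = {}
    for i in range(num_blocks):
        # Round-robin over the nodes: a simplification, not the real HDFS policy.
        placement[i] = [datanodes[(i + r) % len(datanodes)]
                        for r in range(replication)]
    return placement


nodes = ["dn1", "dn2", "dn3", "dn4"]
plan = plan_blocks(300 * 1024 * 1024, nodes)  # a 300 MB file
```

With these assumptions a 300 MB file needs three blocks, and every block lands on three distinct DataNodes, so losing any single node still leaves two copies of each block.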
From Hadoop 2.0 (Federation):
Multiple NameNodes instead of a single one. Multiple namespaces provide scalability. Each namespace has its own block pool, the set of blocks that belong to that namespace. Block pools are spread out over all DataNodes.
Standby NameNode takes snapshots, but failover is handled manually.
Heterogeneous storage - archive storage, SSD, RAM_DISK
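With several namespaces, a client has to know which NameNode owns a given path. One way to picture this is a mount table that maps path prefixes to NameNodes. The sketch below is illustrative only: the table contents, the NameNode names, and resolve_namenode are hypothetical and not part of any Hadoop API.

```python
from typing import Dict

# Hypothetical mount table: path prefix -> NameNode that owns that namespace.
MOUNT_TABLE: Dict[str, str] = {
    "/user": "namenode-1",
    "/data": "namenode-2",
    "/tmp": "namenode-3",
}


def resolve_namenode(path: str, mount_table: Dict[str, str] = MOUNT_TABLE) -> str:
    """Return the NameNode whose mount point is the longest prefix of the path."""
    best = None
    for prefix in mount_table:
        # Match whole path components only, so "/data" does not match "/database".
        if path == prefix or path.startswith(prefix.rstrip("/") + "/"):
            if best is None or len(prefix) > len(best):
                best = prefix
    if best is None:
        raise KeyError(f"no namespace mounted for {path}")
    return mount_table[best]


owner = resolve_namenode("/user/alice/file.txt")
```

Each NameNode then only has to hold the metadata for its own part of the tree, which is where the extra namespace capacity comes from.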
MapReduce Framework
Basic idea: (1) A job splits the input data into chunks, and the MapReduce framework assigns map tasks to (2) the compute nodes to process those chunks. Once the chunks have been processed, the framework sorts the map output. Reduce tasks then take the sorted map output as input and perform a reduction operation.
Typically, compute and data nodes are the same, so MapReduce tasks and HDFS are running on the same nodes.
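The map, sort, and reduce steps above can be imitated in a single process with a word count. This is a sketch of the programming model only, not of the Hadoop API; all function names are made up for the example, and the "chunks" are just strings.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(chunk: str) -> List[Tuple[str, int]]:
    """Map task: emit (word, 1) for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle_and_sort(pairs: Iterable[Tuple[str, int]]) -> List[Tuple[str, List[int]]]:
    """Framework step: group the map output by key and sort by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce task: collapse all values for one key into a single count."""
    return key, sum(values)


def run_job(chunks: List[str]) -> Dict[str, int]:
    # On a cluster each chunk would be mapped on a different node in parallel.
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    return dict(reduce_phase(k, v) for k, v in shuffle_and_sort(mapped))


counts = run_job(["the quick brown fox", "the lazy dog", "The end"])
```

Only map_phase and reduce_phase are application code; splitting, grouping, and sorting are what the framework contributes.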
Before Hadoop 2.0 (before YARN was born):
Single master JobTracker (1) - schedules, monitors, and re-executes failed tasks. It is the main daemon in Hadoop MapReduce and dispatches work to the TaskTrackers on the slave nodes (compute nodes/data nodes).
One slave TaskTracker per cluster node (2) - executes tasks on request from the JobTracker (with HDFS handler).
YARN
Grew out of MapReduce. Main idea: separate resource management from job scheduling/monitoring.
Overall coordination -- ResourceManager: runs on the master node; receives job requests from clients, node status from the NodeManagers (which resources are available), and application status from the ApplicationMasters. Its pluggable scheduler (for example the Capacity Scheduler or the Fair Scheduler) chooses containers and allocates resources to jobs based on capacity and queues.
Resource management part -- NodeManager: one on each node; launches and monitors the containers on that node and reports resource usage to the ResourceManager.
Job scheduling/monitoring part -- ApplicationMaster: one per application, running on one of the nodes. Together, the ApplicationMasters take over the per-job part of the original single JobTracker.
So YARN takes over the JobTracker's role (1) from MapReduce, but it schedules jobs at a finer granularity, at the container level.
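The division of labour between the three YARN roles can be sketched with a toy model. Everything here is an assumption made for illustration: the class and method names only mirror the roles described above, a "container" is reduced to a memory/vcore reservation, and the first-fit allocation stands in for a real scheduler.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class NodeStatus:
    """What a NodeManager would report about its node (simplified)."""
    free_memory_mb: int
    free_vcores: int


@dataclass
class ResourceManager:
    nodes: Dict[str, NodeStatus] = field(default_factory=dict)

    def heartbeat(self, node_id: str, status: NodeStatus) -> None:
        # NodeManagers periodically report the resources available on their node.
        self.nodes[node_id] = status

    def allocate(self, memory_mb: int, vcores: int) -> Optional[str]:
        # First-fit allocation: a stand-in for a real capacity or fair scheduler.
        for node_id, status in sorted(self.nodes.items()):
            if status.free_memory_mb >= memory_mb and status.free_vcores >= vcores:
                status.free_memory_mb -= memory_mb
                status.free_vcores -= vcores
                return node_id
        return None


@dataclass
class ApplicationMaster:
    rm: ResourceManager
    containers: List[str] = field(default_factory=list)

    def request_containers(self, count: int, memory_mb: int, vcores: int) -> int:
        # One ApplicationMaster per application negotiates its own containers.
        for _ in range(count):
            node = self.rm.allocate(memory_mb, vcores)
            if node is None:
                break  # cluster is full; a real AM would wait and retry
            self.containers.append(node)
        return len(self.containers)


rm = ResourceManager()
rm.heartbeat("node-1", NodeStatus(free_memory_mb=4096, free_vcores=2))
rm.heartbeat("node-2", NodeStatus(free_memory_mb=2048, free_vcores=4))

am = ApplicationMaster(rm)
granted = am.request_containers(count=4, memory_mb=2048, vcores=1)
```

The application asks for four containers but is granted only as many as the reported node capacity allows, which is the container-level scheduling the notes refer to.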
YARN also provides the following features:
High-availability ResourceManager in newer Hadoop releases - one standby RM.
Timeline server - stores application history, such as how many map/reduce tasks ran and which resources were used.
cgroups - manage the resources used by containers; YARN also supports secure containers that are restricted to particular users.
RESTful APIs providing web-service access to the cluster.
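As a rough illustration of the web services, the sketch below builds a URL for the ResourceManager's application list and summarizes a response. The path /ws/v1/cluster/apps and port 8088 follow the usual ResourceManager defaults, but treat them, the host name rm.example.com, and the shape of the sample JSON as assumptions to check against your cluster's documentation. No request is actually sent; the payload is hand-written sample data.

```python
import json
from typing import Dict, List
from urllib.parse import urlunsplit


def cluster_apps_url(host: str, port: int = 8088, state: str = "") -> str:
    """Build the URL for listing applications, optionally filtered by state."""
    query = f"states={state}" if state else ""
    return urlunsplit(("http", f"{host}:{port}", "/ws/v1/cluster/apps", query, ""))


def summarize_apps(payload: str) -> List[Dict[str, str]]:
    """Extract id and state from a JSON response body (assumed layout)."""
    body = json.loads(payload)
    apps = (body.get("apps") or {}).get("app", [])
    return [{"id": app["id"], "state": app["state"]} for app in apps]


# Hand-written sample payload standing in for a real response.
sample = json.dumps({"apps": {"app": [
    {"id": "application_0001", "name": "wordcount", "state": "RUNNING"},
    {"id": "application_0002", "name": "etl", "state": "FINISHED"},
]}})

url = cluster_apps_url("rm.example.com", state="RUNNING")
apps = summarize_apps(sample)
```

Against a live cluster the same URL could be fetched with any HTTP client and the body passed to summarize_apps.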