Study plan
- Big Data Specialization from the University of California, San Diego
- Hadoop: The Definitive Guide
This post
- Hadoop Platform and Application Framework, Week 1: **Hadoop Basics**
- Hadoop: The Definitive Guide, Chapter 1: Meet Hadoop
What is Hadoop?
Apache Hadoop is an open source software framework for storage and large-scale processing of data sets on clusters of commodity hardware.
What are the basic modules of the Hadoop framework?
- Hadoop Common: contains the libraries and utilities needed by the other Hadoop modules.
- Hadoop Distributed File System (HDFS): a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
  - Very large files: files that are gigabytes, terabytes, or petabytes in size.
  - Streaming data access: write once, read many times.
  - Commodity hardware: HDFS does not need expensive, highly reliable hardware. The cost is low, but the chance of node failure across the cluster is high.
- Hadoop YARN (Yet Another Resource Negotiator): a resource management platform responsible for managing compute resources in the cluster and using them to schedule users' applications. The fundamental idea behind YARN (MapReduce 2.0) is to split the two major functions of the JobTracker, resource management and job scheduling/monitoring, into separate daemons.
- Hadoop MapReduce: a programming model for data processing.
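
To make the MapReduce programming model a little more concrete, here is a minimal single-process sketch of a word count in Python. It is not the Hadoop API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and everything runs in memory on one machine. It only illustrates the three conceptual steps (map, group by key, reduce) that a real framework would distribute across a cluster.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key (done by the framework in real MapReduce)."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over the input lines."""
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input, standing in for lines of a file stored in HDFS.
    sample = ["hadoop stores data", "hadoop processes data"]
    print(word_count(sample))
```

In a real cluster the map and reduce functions would run as separate tasks on different nodes and the grouping step would move data over the network. The sketch keeps only the shape of the computation.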
What are the main components of the Hadoop ecosystem?
- Apache Sqoop: a tool for efficiently moving data between relational databases and HDFS.
- Apache HBase: a distributed, column-oriented database. HBase uses HDFS for its underlying storage and supports both batch-style computation using MapReduce and point queries (random reads).
- Apache Pig: a scripting platform for exploring large data sets. It consists of two parts: Pig Latin, the language used to express data flows, and the execution environment that runs Pig Latin programs.
- Apache Hive: a distributed data warehouse. Hive manages data stored in HDFS and provides a query language based on SQL, which the runtime engine translates into MapReduce jobs, for querying that data.
- Apache Oozie: a workflow scheduler system that manages Apache Hadoop jobs.
- Apache Flume: a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of data.
- Apache ZooKeeper: provides a distributed configuration service, a synchronization service, and a naming registry, so that jobs across the entire distributed system can be coordinated.