Apache Hadoop 3.0.0-alpha1 incorporates a number of significant enhancements over the previous major release line (hadoop-2.x).
This is an alpha release to facilitate testing and the collection of feedback from downstream application developers and users. There are no guarantees regarding API stability or quality.
Overview
Minimum required Java version increased from Java 7 to Java 8
All Hadoop JARs are now compiled targeting Java 8, so users still running Java 7 or earlier must upgrade to Java 8.
Support for erasure coding in HDFS
Erasure coding is a method for durably storing data with significant space savings compared to replication. Standard encodings like Reed-Solomon (10,4) have a 1.4x space overhead (14 blocks stored for every 10 blocks of data), compared to the 3x overhead of standard HDFS replication.
Since erasure coding imposes additional overhead during reconstruction and performs mostly remote reads, it has traditionally been used for storing colder, less frequently accessed data. Users should consider the network and CPU overheads of erasure coding when deploying this feature.
See Apache Hadoop 3.0.0-alpha1 – HDFS Erasure Coding for more details.
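Directories are opted in through the erasure coding admin CLI. A minimal sketch, using the subcommand and policy name from the later 3.x documentation (the exact CLI syntax in the alpha releases may differ, and /cold-data is a hypothetical path):

    # List the available erasure coding policies, then apply RS(10,4) to a directory
    hdfs ec -listPolicies
    hdfs ec -setPolicy -path /cold-data -policy RS-10-4-1024k
    hdfs ec -getPolicy -path /cold-data

Because the policy only affects data written under the directory after it is set, it pairs naturally with the cold-data use case described above.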
Shell script rewrite
The Hadoop shell scripts have been rewritten to fix many long-standing bugs and include some new features. While an eye has been kept towards compatibility, some changes may break existing installations. Incompatible changes are documented in the release notes, with related discussion on HADOOP-9902. More details are available in the Unix Shell Guide documentation. Power users will also be pleased by the Unix Shell API documentation, which describes much of the new functionality, particularly related to extensibility.
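One user-visible example of the rewrite is unified daemon handling through the main entry scripts. A brief sketch, assuming a standard installation (see the Unix Shell Guide for the full option set):

    # New-style daemon management in the rewritten shell scripts
    hdfs --daemon start namenode    # replaces sbin/hadoop-daemon.sh start namenode
    hdfs --daemon stop namenode
    yarn --daemon start resourcemanager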
MapReduce task-level native optimization
MapReduce has added support for a native implementation of the map output collector. For shuffle-intensive jobs, this can lead to a performance improvement of 30% or more.
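The collector is selected per job. A hedged sketch of opting a job in, with the property and class name taken from the native task work (MAPREDUCE-2841); the job jar, class, and paths are hypothetical, and the native library must be available on the cluster:

    # Enable the native map output collector for a shuffle-intensive job
    hadoop jar shuffle-heavy-job.jar MyJob \
      -Dmapreduce.job.map.output.collector.class=org.apache.hadoop.mapred.nativetask.NativeMapOutputCollectorDelegator \
      /input /output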
Support for more than 2 NameNodes
The initial implementation of HDFS NameNode high-availability provided for a single active NameNode and a single Standby NameNode. By replicating edits to a quorum of three JournalNodes, this architecture is able to tolerate the failure of any one node in the system.
However, some deployments require higher degrees of fault-tolerance. This is enabled by this new feature, which allows users to run multiple standby NameNodes. For instance, by configuring three NameNodes and five JournalNodes, the cluster is able to tolerate the failure of two nodes rather than just one.
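The additional standbys are declared in hdfs-site.xml alongside the existing HA settings (dfs.ha.namenodes.<nameservice> plus the per-NameNode addresses), after which the standard HA tooling works unchanged. A sketch of verifying such a deployment, assuming a nameservice with NameNodes nn1 through nn3:

    # Inspect a three-NameNode HA deployment
    hdfs getconf -namenodes             # lists every configured NameNode
    hdfs haadmin -getServiceState nn1   # -> active
    hdfs haadmin -getServiceState nn2   # -> standby
    hdfs haadmin -getServiceState nn3   # -> standby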
Default ports of multiple services have been changed.
Previously, the default ports of multiple Hadoop services were in the Linux ephemeral port range (32768-61000). This meant that at startup, services would sometimes fail to bind to the port due to a conflict with another application.
These conflicting ports have been moved out of the ephemeral range, affecting the NameNode, Secondary NameNode, DataNode, and KMS. Our documentation has been updated appropriately, but see the release notes for HDFS-9427 and HADOOP-12811 for a list of port changes.
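Rather than relying on remembered defaults, an installation's effective ports can be queried directly. A small sketch (hdfs getconf is a standard HDFS utility; 9870 is, to the best of our knowledge, the new NameNode web UI default from HDFS-9427):

    # Check the effective NameNode HTTP address (default moved off the old 50070)
    hdfs getconf -confKey dfs.namenode.http-address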
Support for Microsoft Azure Data Lake filesystem connector
Hadoop now supports integration with Microsoft Azure Data Lake as an alternative Hadoop-compatible filesystem.
Intra-datanode balancer
A single DataNode manages multiple disks. During normal write operation, disks will be filled up evenly. However, adding or replacing disks can lead to significant skew within a DataNode. This situation is not handled by the existing HDFS balancer, which concerns itself with inter-, not intra-, DN skew.
This situation is handled by the new intra-DataNode balancing functionality, which is invoked via the hdfs diskbalancer CLI. See the disk balancer section in the HDFS Commands Guide for more information.
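A typical invocation sequence, following the disk balancer documentation (the hostname is illustrative, and the plan file path is whatever the -plan step reports):

    # Generate, execute, and monitor an intra-DataNode balancing plan
    hdfs diskbalancer -plan dn1.example.com               # writes a plan JSON for that DataNode
    hdfs diskbalancer -execute dn1.example.com.plan.json  # run the plan on the DataNode
    hdfs diskbalancer -query dn1.example.com              # check execution status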
Reworked daemon and task heap management
A series of changes have been made to heap management for Hadoop daemons as well as MapReduce tasks.
HADOOP-10950 introduces new methods for configuring daemon heap sizes. Notably, auto-tuning is now possible based on the memory size of the host, and the HADOOP_HEAPSIZE variable has been deprecated. See the full release notes of HADOOP-10950 for more detail.
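As a sketch of the new style in hadoop-env.sh (variable names per HADOOP-10950; the sizes are arbitrary examples):

    # New-style daemon heap settings; units are accepted
    export HADOOP_HEAPSIZE_MAX=4g   # replaces the deprecated HADOOP_HEAPSIZE
    export HADOOP_HEAPSIZE_MIN=1g
    # Leave both unset to let the JVM size itself from the host's memory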
MAPREDUCE-5785 simplifies the configuration of map and reduce task heap sizes, so the desired heap size no longer needs to be specified in both the task configuration and as a Java option. Existing configs that already specify both are not affected by this change. See the full release notes of MAPREDUCE-5785 for more details.
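In practice, a job can now set only its container sizes and let the task -Xmx be derived from them (scaled by mapreduce.job.heap.memory-mb.ratio, which defaults to 0.8 per MAPREDUCE-5785). A sketch with a hypothetical job jar and arbitrary sizes:

    # Set only the container sizes; the task heap (-Xmx) is derived automatically
    hadoop jar wordcount.jar WordCount \
      -Dmapreduce.map.memory.mb=2048 \
      -Dmapreduce.reduce.memory.mb=4096 \
      /input /output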