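Setting Up a Hadoop Cluster on Alibaba Cloud ECS Servers
Introduction
Hadoop is an open-source framework for distributed computing. Its two core components are the Hadoop Distributed File System (HDFS) and MapReduce, and in Hadoop 2 and later MapReduce jobs are scheduled by the YARN resource manager. Google's three best-known results in distributed systems are Bigtable, the Google File System, and MapReduce. HDFS is an open-source counterpart of the Google File System, and Hadoop MapReduce, which runs on top of YARN, is an open-source implementation of Google's MapReduce. This post describes how to set up a Hadoop cluster on several Alibaba Cloud ECS servers. To make the configuration below easier to follow, here is a brief outline of the HDFS architecture: HDFS consists of one NameNode, which stores the file system metadata, and a number of DataNodes, which store the file data. The NameNode manages the whole distributed file system and keeps the namespace data, so it is also called the master node, while the DataNodes are sometimes called slave nodes. (This tutorial is based mainly on [1].)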
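Steps to Set Up the Hadoop Cluster
Purchasing ECS Servers on Alibaba Cloud
The first step is to purchase the ECS servers that will run Hadoop on Alibaba Cloud (note that all servers must be in the same availability zone). The application I need to run requires a lot of memory, so I chose ecs.r5.2xlarge instances, each with 64 GB of memory and 8 vCPUs, and purchased 9 instances in total. You can set each server's hostname during purchase. For convenience, I named the server that will act as the Hadoop NameNode (master node) hadoop-master, and named the remaining servers, which act as DataNodes (slave nodes), "hadoop-slave001" through "hadoop-slave008".
Configuring Passwordless SSH Login Within the Cluster
Adding the hadoop Account
First, add an account named hadoop on every machine. For convenience, I gave the hadoop account sudo privileges on all machines. The commands are shown below (run them as root):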
useradd hadoop
passwd hadoop
usermod -aG wheel hadoop
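After adding the hadoop account as root, log out of the root account and log in as hadoop for all of the remaining steps.
Next, edit the /etc/hosts file on every machine (this requires sudo). Add a mapping from each machine in the cluster to its private IP address (the private IP, not the public one). After editing, the file should look like this: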
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
[private IP of hadoop-master] hadoop-master hadoop-master
[private IP of hadoop-slave001] hadoop-slave001 hadoop-slave001
[private IP of hadoop-slave002] hadoop-slave002 hadoop-slave002
[private IP of hadoop-slave003] hadoop-slave003 hadoop-slave003
[private IP of hadoop-slave004] hadoop-slave004 hadoop-slave004
[private IP of hadoop-slave005] hadoop-slave005 hadoop-slave005
[private IP of hadoop-slave006] hadoop-slave006 hadoop-slave006
[private IP of hadoop-slave007] hadoop-slave007 hadoop-slave007
[private IP of hadoop-slave008] hadoop-slave008 hadoop-slave008
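A missing or misspelled entry in /etc/hosts tends to surface only later, as a connection failure. The following is a minimal sketch of a check for the mappings above, assuming the nine hostnames used in this guide. The function names cluster_hosts and missing_hosts are made up for illustration.

```shell
# Print the nine hostnames used in this guide, one per line.
cluster_hosts() {
    echo "hadoop-master"
    i=1
    while [ "$i" -le 8 ]; do
        printf 'hadoop-slave%03d\n' "$i"
        i=$((i + 1))
    done
}

# Print every cluster hostname that has no entry in the given hosts file
# (defaults to /etc/hosts). No output means every host is mapped.
missing_hosts() {
    hosts_file="${1:-/etc/hosts}"
    for host in $(cluster_hosts); do
        # Skip comment lines; fields 2..NF of a hosts line are the names.
        if ! awk -v h="$host" '$1 !~ /^#/ { for (f = 2; f <= NF; f++) if ($f == h) found = 1 } END { exit !found }' "$hosts_file"; then
            echo "$host"
        fi
    done
}
```

Running missing_hosts with no argument on each machine should print nothing once the file is complete.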
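Setting Up SSH Key Login
Hadoop needs passwordless SSH login between the machines in order to communicate, so the next step is to set up SSH key login. First generate a local SSH key pair on each machine: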
ssh-keygen -b 4096
Then send the public key to all of the machines:
ssh-copy-id hadoop@hadoop-master
ssh-copy-id hadoop@hadoop-slave001
ssh-copy-id hadoop@hadoop-slave002
ssh-copy-id hadoop@hadoop-slave003
ssh-copy-id hadoop@hadoop-slave004
ssh-copy-id hadoop@hadoop-slave005
ssh-copy-id hadoop@hadoop-slave006
ssh-copy-id hadoop@hadoop-slave007
ssh-copy-id hadoop@hadoop-slave008
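The nine ssh-copy-id commands above differ only in the hostname, so they can also be written as a loop. This is a sketch under the naming scheme used in this guide. The helper names cluster_hosts and copy_keys are hypothetical. Any arguments passed to copy_keys are used as a command prefix, which gives a simple dry run.

```shell
# Print the nine hostnames used in this guide, one per line.
cluster_hosts() {
    echo "hadoop-master"
    i=1
    while [ "$i" -le 8 ]; do
        printf 'hadoop-slave%03d\n' "$i"
        i=$((i + 1))
    done
}

# Run ssh-copy-id for the hadoop account on every host.
# `copy_keys echo` only prints the commands; `copy_keys` runs them.
copy_keys() {
    for host in $(cluster_hosts); do
        "$@" ssh-copy-id "hadoop@${host}"
    done
}
```

Running copy_keys echo first shows exactly which commands would be executed. Each real ssh-copy-id call still prompts for the hadoop password of the target machine.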
Installing Hadoop
Installing the JDK
Hadoop runs on the JVM, so the Java development kit (JDK) has to be installed, as shown below:
sudo yum update
sudo yum install java-1.8.0-openjdk-devel
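Downloading the Prebuilt Hadoop Binaries
Prebuilt Hadoop binaries are available at https://hadoop.apache.org/releases.html. The application I need to run was developed on Hadoop 2, so I downloaded version 2.7.7. If you download a 3.x release, some of the later configuration steps differ slightly (for example, the slaves configuration file of 2.x is named workers in 3.x, and the port used to monitor HDFS is different).
The commands to download and unpack the Hadoop 2.7.7 binaries are: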
cd ~
wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.7/hadoop-2.7.7.tar.gz
tar xvf hadoop-2.7.7.tar.gz
mv hadoop-2.7.7 hadoop
Configuring Environment Variables
Next, to make commands such as hdfs easy to call, set the PATH environment variable:
vim ~/.bash_profile
Then add the following line before the "export PATH" line:
PATH=/home/hadoop/hadoop/bin:/home/hadoop/hadoop/sbin:$PATH
Then save the file and exit.
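As an alternative to editing the file by hand, the same PATH entry can be appended from a script. This is a sketch assuming Hadoop was unpacked to /home/hadoop/hadoop as above. The function name add_hadoop_path is made up for illustration. It appends an export line at the end of the profile instead of inserting before the existing one, which has the same effect for login shells.

```shell
# Append the Hadoop PATH line to a profile file unless it is already there.
add_hadoop_path() {
    profile="${1:-$HOME/.bash_profile}"
    # Single quotes keep $PATH literal so it is expanded at login time.
    line='export PATH=/home/hadoop/hadoop/bin:/home/hadoop/hadoop/sbin:$PATH'
    touch "$profile"
    # -F: fixed string, -x: match the whole line, -q: no output.
    if ! grep -Fxq "$line" "$profile"; then
        printf '%s\n' "$line" >> "$profile"
    fi
}
```

Calling add_hadoop_path twice leaves only one copy of the line, so the function is safe to rerun. Log in again or run source ~/.bash_profile for the change to take effect.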
Configuring Hadoop
Hadoop has to be told where the JDK is installed. To find the location, run the following command:
update-alternatives --display java
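The output looks like the listing below. The path of the form xxx/bin/java is the location of the java binary, and xxx is the directory to use as the JDK directory. Here that directory is "/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.212.b04-0.el7_6.x86_64/jre".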
java - status is auto.
link currently points to /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.212.b04-0.el7_6.x86_64/jre/bin/java
/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.212.b04-0.el7_6.x86_64/jre/bin/java - family java-1.8.0-openjdk.x86_64 priority 1800212
slave jre: /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.212.b04-0.el7_6.x86_64/jre
slave jre_exports: /usr/lib/jvm-exports/jre-1.8.0-openjdk-1.8.0.212.b04-0.el7_6.x86_64
slave jjs: /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.212.b04-0.el7_6.x86_64/jre/bin/jjs
slave keytool: /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.212.b04-0.el7_6.x86_64/jre/bin/keytool
slave orbd: /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.212.b04-0.el7_6.x86_64/jre/bin/orbd
slave pack200: /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.212.b04-0.el7_6.x86_64/jre/bin/pack200
slave rmid: /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.212.b04-0.el7_6.x86_64/jre/bin/rmid
slave rmiregistry: /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.212.b04-0.el7_6.x86_64/jre/bin/rmiregistry
slave servertool: /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.212.b04-0.el7_6.x86_64/jre/bin/servertool
slave tnameserv: /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.212.b04-0.el7_6.x86_64/jre/bin/tnameserv
slave policytool: /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.212.b04-0.el7_6.x86_64/jre/bin/policytool
slave unpack200: /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.212.b04-0.el7_6.x86_64/jre/bin/unpack200
slave java.1.gz: /usr/share/man/man1/java-java-1.8.0-openjdk-1.8.0.212.b04-0.el7_6.x86_64.1.gz
slave jjs.1.gz: /usr/share/man/man1/jjs-java-1.8.0-openjdk-1.8.0.212.b04-0.el7_6.x86_64.1.gz
slave keytool.1.gz: /usr/share/man/man1/keytool-java-1.8.0-openjdk-1.8.0.212.b04-0.el7_6.x86_64.1.gz
slave orbd.1.gz: /usr/share/man/man1/orbd-java-1.8.0-openjdk-1.8.0.212.b04-0.el7_6.x86_64.1.gz
slave pack200.1.gz: /usr/share/man/man1/pack200-java-1.8.0-openjdk-1.8.0.212.b04-0.el7_6.x86_64.1.gz
slave rmid.1.gz: /usr/share/man/man1/rmid-java-1.8.0-openjdk-1.8.0.212.b04-0.el7_6.x86_64.1.gz
slave rmiregistry.1.gz: /usr/share/man/man1/rmiregistry-java-1.8.0-openjdk-1.8.0.212.b04-0.el7_6.x86_64.1.gz
slave servertool.1.gz: /usr/share/man/man1/servertool-java-1.8.0-openjdk-1.8.0.212.b04-0.el7_6.x86_64.1.gz
slave tnameserv.1.gz: /usr/share/man/man1/tnameserv-java-1.8.0-openjdk-1.8.0.212.b04-0.el7_6.x86_64.1.gz
slave policytool.1.gz: /usr/share/man/man1/policytool-java-1.8.0-openjdk-1.8.0.212.b04-0.el7_6.x86_64.1.gz
slave unpack200.1.gz: /usr/share/man/man1/unpack200-java-1.8.0-openjdk-1.8.0.212.b04-0.el7_6.x86_64.1.gz
Current `best' version is /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.212.b04-0.el7_6.x86_64/jre/bin/java.
Next, open ~/hadoop/etc/hadoop/hadoop-env.sh for editing, find the line "export JAVA_HOME=${JAVA_HOME}", and replace it with "export JAVA_HOME={the JDK directory found above}". The directory may differ between machines. On this machine the line becomes "export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.212.b04-0.el7_6.x86_64/jre".
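The lookup and the edit to hadoop-env.sh can also be scripted. The sketch below assumes the java binary sits at <directory>/bin/java, as in the output above. The function names java_home_from_bin and set_java_home are hypothetical.

```shell
# Strip the trailing /bin/java from the full path of a java binary.
java_home_from_bin() {
    dirname "$(dirname "$1")"
}

# Rewrite the "export JAVA_HOME=..." line of a hadoop-env.sh file.
# The result is written to a temporary file first, then moved into place.
set_java_home() {
    env_file="$1"
    java_home="$2"
    sed "s|^export JAVA_HOME=.*|export JAVA_HOME=${java_home}|" "$env_file" > "${env_file}.tmp" &&
        mv "${env_file}.tmp" "$env_file"
}
```

On a machine with GNU coreutils, one possible invocation is set_java_home ~/hadoop/etc/hadoop/hadoop-env.sh "$(java_home_from_bin "$(readlink -f "$(command -v java)")")", which resolves the alternatives symlink to the real binary before stripping the suffix.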
Next, configure the location of the NameNode (the node that holds the HDFS metadata) by editing "~/hadoop/etc/hadoop/core-site.xml". The edited file looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://hadoop-master:9000</value>
</property>
</configuration>
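Once PATH and JAVA_HOME are set, the hdfs getconf command can confirm that Hadoop reads the value configured above. The wrapper below is a small sketch, and the name expect_conf is made up for illustration. It queries fs.defaultFS, the current name of the property. Hadoop 2.x treats the older name fs.default.name as a deprecated alias of it.

```shell
# Compare the value Hadoop resolves for a configuration key with the expected one.
expect_conf() {
    key="$1"
    expected="$2"
    actual="$(hdfs getconf -confKey "$key")"
    if [ "$actual" = "$expected" ]; then
        echo "OK $key=$actual"
    else
        echo "MISMATCH $key: expected '$expected', got '$actual'" >&2
        return 1
    fi
}
```

For the configuration in this guide the call would be expect_conf fs.defaultFS hdfs://hadoop-master:9000.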
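Next, configure the directories in which the NameNode and the DataNodes store their data on their respective machines by editing "~/hadoop/etc/hadoop/hdfs-site.xml". The edited file looks like this: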
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>dfs.namenode.name.dir</name>
<value>/home/hadoop/data/nameNode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/home/hadoop/data/dataNode</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
Next, configure MapReduce to run on YARN. First run the following commands:
cd ~/hadoop/etc/hadoop
mv mapred-site.xml.template mapred-site.xml
Then edit "~/hadoop/etc/hadoop/mapred-site.xml". The edited file looks like this:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobtracker.address</name>
<value>yarn</value>
</property>
</configuration>
Edit "~/hadoop/etc/hadoop/yarn-site.xml". The edited file looks like this:
<?xml version="1.0"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<configuration>
<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.acl.enable</name>
<value>false</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop-master</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
Next, configure the list of slave nodes. Open "~/hadoop/etc/hadoop/slaves" and edit it so that it contains:
hadoop-slave001
hadoop-slave002
hadoop-slave003
hadoop-slave004
hadoop-slave005
hadoop-slave006
hadoop-slave007
hadoop-slave008
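Because the hostnames follow a fixed pattern, the slaves file can also be generated instead of typed. This is a minimal sketch assuming the hadoop-slaveNNN naming used in this guide. The function name worker_hosts is hypothetical.

```shell
# Print COUNT worker hostnames: hadoop-slave001, hadoop-slave002, ...
worker_hosts() {
    count="$1"
    i=1
    while [ "$i" -le "$count" ]; do
        # %03d pads the index with zeros to three digits.
        printf 'hadoop-slave%03d\n' "$i"
        i=$((i + 1))
    done
}
```

For this cluster, worker_hosts 8 > ~/hadoop/etc/hadoop/slaves would write the eight lines shown above.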
Next, copy the configured hadoop directory to each slave machine:
cd ~
scp -r hadoop hadoop-slave001:~
scp -r hadoop hadoop-slave002:~
scp -r hadoop hadoop-slave003:~
scp -r hadoop hadoop-slave004:~
scp -r hadoop hadoop-slave005:~
scp -r hadoop hadoop-slave006:~
scp -r hadoop hadoop-slave007:~
scp -r hadoop hadoop-slave008:~
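The copy step can reuse the slaves file as its list of targets, so the host list lives in one place. The sketch below assumes one hostname per line in that file. The function name distribute is made up for illustration, and any extra arguments act as a command prefix for a dry run.

```shell
# Copy a directory to the home directory of every host listed in a slaves file.
# `distribute hadoop FILE echo` only prints the scp commands.
distribute() {
    src_dir="$1"
    slaves_file="$2"
    shift 2
    while read -r host; do
        # Skip blank lines.
        [ -n "$host" ] || continue
        # Redirect stdin so the command cannot swallow the rest of the host list.
        "$@" scp -r "$src_dir" "${host}:~" < /dev/null
    done < "$slaves_file"
}
```

Run from the home directory, distribute hadoop ~/hadoop/etc/hadoop/slaves performs the same copies as the eight scp commands above.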
Next, format the HDFS file system on the master node:
hdfs namenode -format
HDFS can now be started:
start-dfs.sh
Use the jps command to check whether HDFS is running properly. On hadoop-master, the output of "jps" should look like this (the process IDs will differ):
hadoop@hadoop-master ~> jps
5536 SecondaryNameNode
5317 NameNode
5691 Jps
On a hadoop-slave machine, the output of jps should look like this:
[hadoop@hadoop-slave001 ~]$ jps
16753 Jps
16646 DataNode
To stop HDFS, use the command:
stop-dfs.sh
Next, start YARN with the command:
start-yarn.sh
If YARN started properly, "jps" shows an additional process named "ResourceManager" on hadoop-master and an additional process named "NodeManager" on each hadoop-slave machine.
To stop YARN, run:
stop-yarn.sh
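The jps checks described above can be turned into a small pass/fail helper. This is a sketch that relies only on jps printing one "<pid> <name>" pair per line, as in the sample output. The function name require_daemons is hypothetical.

```shell
# Succeed only if every named daemon appears in the jps output on this machine.
require_daemons() {
    running="$(jps)"
    for daemon in "$@"; do
        # Compare the second field exactly, so "NameNode" does not
        # match "SecondaryNameNode".
        if ! printf '%s\n' "$running" | awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }'; then
            echo "not running: $daemon" >&2
            return 1
        fi
    done
}
```

With both HDFS and YARN started, require_daemons NameNode SecondaryNameNode ResourceManager should succeed on hadoop-master, and require_daemons DataNode NodeManager should succeed on each slave.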
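Testing the Hadoop Installation
Next, run a simple example (counting how often each word occurs in a few text files) to check that Hadoop works on the cluster.
First, create a home directory on HDFS: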
hdfs dfs -mkdir /home
hdfs dfs -mkdir /home/hadoop
Download the data set and copy it to HDFS (with the put command):
hdfs dfs -mkdir /home/hadoop/books
mkdir -p ~/books
cd ~/books
wget -O alice.txt https://www.gutenberg.org/files/11/11-0.txt
wget -O holmes.txt https://www.gutenberg.org/ebooks/1661.txt.utf-8
wget -O frankenstein.txt https://www.gutenberg.org/ebooks/84.txt.utf-8
hdfs dfs -put alice.txt holmes.txt frankenstein.txt /home/hadoop/books
Inspect the data set on HDFS:
hdfs dfs -ls /home/hadoop/books
hdfs dfs -cat /home/hadoop/books/alice.txt
Use the word count example that ships with Hadoop to count the occurrences of every word in the data set:
yarn jar ~/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.7.jar wordcount "/home/hadoop/books/*" /home/hadoop/output
If everything went well, the output files appear in the /home/hadoop/output/ directory on HDFS:
hadoop@hadoop-master ~> hdfs dfs -ls /home/hadoop/output
Found 2 items
-rw-r--r-- 1 hadoop supergroup 0 2019-05-28 14:59 /home/hadoop/output/_SUCCESS
-rw-r--r-- 1 hadoop supergroup 789726 2019-05-28 14:59 /home/hadoop/output/part-r-00000
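To look at the result itself and not only the file listing, the output can be sorted by count. The sketch below assumes the wordcount example writes one word and its count per line, separated by a tab. The function name top_words is made up for illustration.

```shell
# Read "word<TAB>count" lines on stdin and print the N most frequent words.
top_words() {
    n="${1:-10}"
    # Use a tab as the field separator and sort field 2 numerically, descending.
    sort -t "$(printf '\t')" -k2,2nr | head -n "$n"
}
```

For the job above, hdfs dfs -cat /home/hadoop/output/part-r-00000 | top_words 10 would print the ten most frequent words in the three books.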
At this point the Hadoop cluster environment is essentially set up.
References
[1] https://www.linode.com/docs/databases/hadoop/how-to-install-and-set-up-hadoop-cluster/