Spark's built-in parallel and distributed computing makes it an excellent platform for big-data analysis. Many operations that used to be done in a database can be handled directly in Spark, and it is very fast, far more pleasant than set operations in a database, so I decided to dive into Spark.
First comes the environment setup. A purely single-machine Spark environment is quite simple, and there are plenty of options: Docker, a virtual machine, or simply downloading the archive and unpacking it.
The environment setup mainly covers the following:
1. SSH. Spark has several deployment modes (local, standalone, cluster), and they all rely on passwordless SSH. Searching for "ssh-keygen passwordless login" will turn up the setup steps. On a single machine, make sure SSH to the machine itself needs no password; on a cluster, make sure SSH between the nodes needs no password (personally I think only master-to-slaves is required, not slave-to-slave; if you have configured this, please let me know!).
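A minimal sketch for the single-machine case, assuming OpenSSH is installed and the default key path is used:
# generate a key pair without a passphrase (skip if ~/.ssh/id_rsa already exists)
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
# authorize the key for login to this machine
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
# this should now log in without asking for a password
ssh localhost
For a cluster, copying the master's public key to each slave with ssh-copy-id user@slave achieves the same thing.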
2. Unpack your Spark archive, spark-2.3.1-bin-hadoop2.7.tgz. If you need it, install Hadoop as well (unpack hadoop-2.7.7.tar.gz). The Hadoop and Spark versions need to match each other, otherwise you will get errors.
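A sketch of the unpacking step, assuming the archives were downloaded to the current directory and /usr/local is the install location (matching the paths used later in this post):
tar -zxf spark-2.3.1-bin-hadoop2.7.tgz -C /usr/local
tar -zxf hadoop-2.7.7.tar.gz -C /usr/local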
3. Java. Recent Spark releases require JDK 1.8 or newer; an official Oracle JDK (the rpm package) is preferred.
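A quick sanity check after installing the JDK (the directory below matches the JAVA_HOME used later in this post; adjust it to wherever your JDK actually lives):
# should report java version "1.8.0_192" or similar
/usr/local/java/jdk1.8.0_192/bin/java -version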
4. Environment configuration, e.g. ~/.bashrc and spark-2.3.1-bin-hadoop2.7/conf/spark-env.sh. First, in ~/.bashrc:
export JAVA_HOME=/usr/local/java/jdk1.8.0_192
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export HADOOP_HOME=/usr/local/hadoop-2.7.7
export SPARK_HOME=/usr/local/spark-2.3.1-bin-hadoop2.7
export PATH=${JAVA_HOME}/bin:${JRE_HOME}/bin:${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin:${SPARK_HOME}/bin:${SPARK_HOME}/sbin:${PATH}
Then reload the bash environment:
source ~/.bashrc
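A quick check that the variables took effect (hadoop and spark-submit are the commands shipped in the distributions unpacked above):
echo $JAVA_HOME $HADOOP_HOME $SPARK_HOME
hadoop version          # should report Hadoop 2.7.7
spark-submit --version  # should report version 2.3.1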
You also need to set Spark's runtime environment by editing spark-2.3.1-bin-hadoop2.7/conf/spark-env.sh.
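If conf/spark-env.sh does not exist yet, it can be created from the template that ships with the Spark distribution (a one-line sketch), and then the settings below are appended:
cp /usr/local/spark-2.3.1-bin-hadoop2.7/conf/spark-env.sh.template /usr/local/spark-2.3.1-bin-hadoop2.7/conf/spark-env.sh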
export JAVA_HOME=/usr/local/java/jdk1.8.0_192
export HADOOP_HOME=/usr/local/hadoop-2.7.7
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_HOME=/usr/local/spark-2.3.1-bin-hadoop2.7
export SPARK_LIBRARY_PATH=.:$JAVA_HOME/lib:$JAVA_HOME/jre/lib:$HADOOP_HOME/lib/native
export SPARK_MASTER_IP=127.0.0.1
export SPARK_MASTER_HOST=127.0.0.1
export SPARK_LOCAL_DIRS=/usr/local/spark-2.3.1-bin-hadoop2.7
export SPARK_YARN_USER_ENV="CLASSPATH=/usr/local/hadoop-2.7.7/etc/hadoop"
5. Let Hadoop (YARN) use Spark's shuffle service:
cp /usr/local/spark-2.3.1-bin-hadoop2.7/yarn/spark-2.3.1-yarn-shuffle.jar /usr/local/hadoop-2.7.7/share/hadoop/yarn/lib
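To confirm the jar landed where YARN's NodeManager will pick it up:
ls /usr/local/hadoop-2.7.7/share/hadoop/yarn/lib/ | grep yarn-shuffle
# expected: spark-2.3.1-yarn-shuffle.jar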
6. Configure Hadoop: edit core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml under hadoop-2.7.7/etc/hadoop. Details as follows:
core-site.xml
<configuration>
<!-- Configure the default filesystem (fs.defaultFS) -->
<property>
<name>fs.defaultFS</name>
<value>hdfs://127.0.0.1:8020</value>
</property>
<!-- Configure the temporary data directory -->
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/local/hadoop/hadoop-2.7.7/data/tmp</value>
</property>
<!-- Configure the ZooKeeper quorum -->
<property>
<name>ha.zookeeper.quorum</name>
<value>127.0.0.1:2181</value>
</property>
</configuration>
hdfs-site.xml
<configuration>
<!-- Configure the number of data replicas -->
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<!-- Configure the NameNode metadata directory -->
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop/hadoop-2.7.7/dfs/name</value>
</property>
<!-- Configure the DataNode data directory -->
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/hadoop/hadoop-2.7.7/dfs/data</value>
</property>
<!-- Configure the number of NameNode handler threads -->
<property>
<name>dfs.namenode.handler.count</name>
<value>100</value>
</property>
<!-- Enable WebHDFS so HDFS is accessible from the web UI -->
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>
</configuration>
mapred-site.xml
<configuration>
<!-- Configure the computation framework (YARN) -->
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
<final>true</final>
</property>
<!-- Set the job history server address -->
<property>
<name>mapreduce.jobhistory.address</name>
<value>127.0.0.1:10020</value>
</property>
<!-- Job history web UI address -->
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>127.0.0.1:19888</value>
</property>
</configuration>
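Note: the Hadoop 2.7 distribution only ships mapred-site.xml.template, so if mapred-site.xml is missing, create it from the template first (a one-line sketch):
cp /usr/local/hadoop-2.7.7/etc/hadoop/mapred-site.xml.template /usr/local/hadoop-2.7.7/etc/hadoop/mapred-site.xml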
yarn-site.xml
<configuration>
<!-- Site specific YARN configuration properties -->
<!-- Configure NodeManager auxiliary services -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle,spark_shuffle</value>
</property>
<!-- Configure the shuffle handler classes -->
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
<value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
<!-- Configure the NodeManager log directory -->
<property>
<name>yarn.nodemanager.log-dirs</name>
<value>/usr/local/hadoop/hadoop-2.7.7/logs</value>
</property>
<!-- Configure the ResourceManager hostname -->
<property>
<name>yarn.resourcemanager.hostname</name>
<value>127.0.0.1</value>
</property>
<!-- ResourceManager RPC address -->
<property>
<name>yarn.resourcemanager.address</name>
<value>127.0.0.1:8032</value>
</property>
<!-- ResourceManager web UI address -->
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>127.0.0.1:8088</value>
</property>
<!-- If you use JDK 1.8, add these settings to disable the physical/virtual memory checks -->
<property>
<name>yarn.nodemanager.pmem-check-enabled</name>
<value>false</value>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
</configuration>
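The data and log directories referenced in the property values above are worth creating up front (Hadoop creates most of them when formatting or starting, but pre-creating them avoids permission surprises); a sketch using the same paths as in the configs:
mkdir -p /usr/local/hadoop/hadoop-2.7.7/data/tmp
mkdir -p /usr/local/hadoop/hadoop-2.7.7/dfs/name /usr/local/hadoop/hadoop-2.7.7/dfs/data
mkdir -p /usr/local/hadoop/hadoop-2.7.7/logs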
7. Startup. First, format the NameNode (in Docker I once skipped the format and nothing would work at all).
/usr/local/hadoop-2.7.7/bin/hadoop namenode -format
Then start Hadoop:
/usr/local/hadoop-2.7.7/sbin/start-all.sh
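After the script finishes, jps (bundled with the JDK) should list the Hadoop daemons:
jps
# expected on a single node: NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager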
Next, start Spark:
/usr/local/spark-2.3.1-bin-hadoop2.7/sbin/start-all.sh
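jps should now also show Master and Worker, and the standalone web UI is reachable at http://127.0.0.1:8080. As a smoke test, the SparkPi example bundled with the distribution can be submitted to YARN (a sketch; the examples jar path is the one shipped in the 2.3.1 binary package):
/usr/local/spark-2.3.1-bin-hadoop2.7/bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode client \
  /usr/local/spark-2.3.1-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.3.1.jar 10
# the driver output should contain a line like "Pi is roughly 3.14..."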
8. Shutdown. Stop Spark first:
/usr/local/spark-2.3.1-bin-hadoop2.7/sbin/stop-all.sh
Then stop Hadoop:
/usr/local/hadoop-2.7.7/sbin/stop-all.sh