背景
公司小组分配了三台虚拟机,在虚拟机上面意欲装hadoop集群及spark on yarn
版本
Hadoop 2.7.2
spark 2.3.2
问题
配置好hadoop集群与spark配置后,启动spark-shell --master yarn报错如下
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
19/04/07 14:14:35 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
19/04/07 14:14:47 ERROR TransportClient: Failed to send RPC 4632628449949268526 to /172.17.1.150:41470: java.nio.channels.ClosedChannelException
java.nio.channels.ClosedChannelException
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
19/04/07 14:14:47 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to get executor loss reason for executor id 1 at RPC address 172.17.1.150:41480, but got no response. Marking as slave lost.
java.io.IOException: Failed to send RPC 4632628449949268526 to /172.17.1.150:41470: java.nio.channels.ClosedChannelException
at org.apache.spark.network.client.TransportClient.lambda$sendRpc$2(TransportClient.java:237)
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
at io.netty.util.concurrent.DefaultPromise.access$000(DefaultPromise.java:34)
at io.netty.util.concurrent.DefaultPromise$1.run(DefaultPromise.java:431)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.nio.channels.ClosedChannelException
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
19/04/07 14:14:47 ERROR YarnScheduler: Lost executor 1 on wangwei01: Slave lost
Spark context Web UI available at http://wangwei02:4040
Spark context available as 'sc' (master = yarn, app id = application_1554616735892_0001).
Spark session available as 'spark'.
19/04/07 14:14:52 ERROR YarnClientSchedulerBackend: Yarn application has already exited with state FINISHED!
19/04/07 14:14:52 ERROR TransportClient: Failed to send RPC 7884125299619016997 to /172.17.1.150:41490: java.nio.channels.ClosedChannelException
java.nio.channels.ClosedChannelException
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
19/04/07 14:14:52 ERROR YarnSchedulerBackend$YarnSchedulerEndpoint: Sending RequestExecutors(0,0,Map(),Set()) to AM was unsuccessful
java.io.IOException: Failed to send RPC 7884125299619016997 to /172.17.1.150:41490: java.nio.channels.ClosedChannelException
at org.apache.spark.network.client.TransportClient.lambda$sendRpc$2(TransportClient.java:237)
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:122)
at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:987)
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:869)
at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1316)
at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738)
at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730)
at io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:38)
at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:1081)
at io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1128)
at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1070)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.nio.channels.ClosedChannelException
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
19/04/07 14:14:53 ERROR Utils: Uncaught exception in thread Yarn application state monitor
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:567)
at org.apache.spark.scheduler.cluster.YarnSchedulerBackend.stop(YarnSchedulerBackend.scala:95)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:155)
at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:508)
at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1804)
at org.apache.spark.SparkContext$$anonfun$stop$8.apply$mcV$sp(SparkContext.scala:1931)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1361)
at org.apache.spark.SparkContext.stop(SparkContext.scala:1930)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$MonitorThread.run(YarnClientSchedulerBackend.scala:112)
Caused by: java.io.IOException: Failed to send RPC 7884125299619016997 to /172.17.1.150:41490: java.nio.channels.ClosedChannelException
at org.apache.spark.network.client.TransportClient.lambda$sendRpc$2(TransportClient.java:237)
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:122)
at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:987)
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:869)
at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1316)
at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738)
at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730)
at io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:38)
at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:1081)
at io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1128)
at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1070)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.nio.channels.ClosedChannelException
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.3.2
/_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_201)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
ApplicationMaster日志如下
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/tmp/hadoop/hadoop-root/nm-local-dir/usercache/root/filecache/10/__spark_libs__2498524289906141930.zip/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop-2.7.2/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
19/04/07 14:14:41 INFO SignalUtils: Registered signal handler for TERM
19/04/07 14:14:41 INFO SignalUtils: Registered signal handler for HUP
19/04/07 14:14:41 INFO SignalUtils: Registered signal handler for INT
19/04/07 14:14:41 INFO SecurityManager: Changing view acls to: root
19/04/07 14:14:41 INFO SecurityManager: Changing modify acls to: root
19/04/07 14:14:41 INFO SecurityManager: Changing view acls groups to:
19/04/07 14:14:41 INFO SecurityManager: Changing modify acls groups to:
19/04/07 14:14:41 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
19/04/07 14:14:42 INFO ApplicationMaster: Preparing Local resources
19/04/07 14:14:42 INFO ApplicationMaster: ApplicationAttemptId: appattempt_1554616735892_0001_000001
19/04/07 14:14:43 INFO ApplicationMaster: Waiting for Spark driver to be reachable.
19/04/07 14:14:43 INFO ApplicationMaster: Driver now available: wangwei02:34614
19/04/07 14:14:43 INFO TransportClientFactory: Successfully created connection to wangwei02/172.17.1.151:34614 after 47 ms (0 ms spent in bootstraps)
19/04/07 14:14:43 INFO ApplicationMaster:
===============================================================================
YARN executor launch context:
env:
CLASSPATH -> {{PWD}}<CPS>{{PWD}}/__spark_conf__<CPS>{{PWD}}/__spark_libs__/*<CPS>$HADOOP_CONF_DIR<CPS>$HADOOP_COMMON_HOME/share/hadoop/common/*<CPS>$HADOOP_COMMON_HOME/share/hadoop/common/lib/*<CPS>$HADOOP_HDFS_HOME/share/hadoop/hdfs/*<CPS>$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*<CPS>$HADOOP_YARN_HOME/share/hadoop/yarn/*<CPS>$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*<CPS>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*<CPS>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*<CPS>{{PWD}}/__spark_conf__/__hadoop_conf__
SPARK_YARN_STAGING_DIR -> hdfs://wangwei01:9000/user/root/.sparkStaging/application_1554616735892_0001
SPARK_USER -> root
command:
{{JAVA_HOME}}/bin/java \
-server \
-Xmx1024m \
-Djava.io.tmpdir={{PWD}}/tmp \
'-Dspark.driver.port=34614' \
-Dspark.yarn.app.container.log.dir=<LOG_DIR> \
-XX:OnOutOfMemoryError='kill %p' \
org.apache.spark.executor.CoarseGrainedExecutorBackend \
--driver-url \
spark://CoarseGrainedScheduler@wangwei02:34614 \
--executor-id \
<executorId> \
--hostname \
<hostname> \
--cores \
1 \
--app-id \
application_1554616735892_0001 \
--user-class-path \
file:$PWD/__app__.jar \
1><LOG_DIR>/stdout \
2><LOG_DIR>/stderr
resources:
__spark_libs__ -> resource { scheme: "hdfs" host: "wangwei01" port: 9000 file: "/user/root/.sparkStaging/application_1554616735892_0001/__spark_libs__2498524289906141930.zip" } size: 235228182 timestamp: 1554617677670 type: ARCHIVE visibility: PRIVATE
__spark_conf__ -> resource { scheme: "hdfs" host: "wangwei01" port: 9000 file: "/user/root/.sparkStaging/application_1554616735892_0001/__spark_conf__.zip" } size: 183160 timestamp: 1554617678303 type: ARCHIVE visibility: PRIVATE
===============================================================================
19/04/07 14:14:43 INFO RMProxy: Connecting to ResourceManager at wangwei01/172.17.1.150:8030
19/04/07 14:14:43 INFO YarnRMClient: Registering the ApplicationMaster
19/04/07 14:14:43 INFO YarnAllocator: Will request 2 executor container(s), each with 1 core(s) and 1408 MB memory (including 384 MB of overhead)
19/04/07 14:14:43 INFO YarnAllocator: Submitted 2 unlocalized container requests.
19/04/07 14:14:43 INFO ApplicationMaster: Started progress reporter thread with (heartbeat : 3000, initial allocation : 200) intervals
19/04/07 14:14:44 INFO AMRMClientImpl: Received new token for : wangwei01:41680
19/04/07 14:14:44 INFO AMRMClientImpl: Received new token for : wangwei02:35865
19/04/07 14:14:44 INFO YarnAllocator: Launching container container_1554616735892_0001_01_000002 on host wangwei01 for executor with ID 1
19/04/07 14:14:44 INFO YarnAllocator: Launching container container_1554616735892_0001_01_000003 on host wangwei02 for executor with ID 2
19/04/07 14:14:44 INFO YarnAllocator: Received 2 containers from YARN, launching executors on 2 of them.
19/04/07 14:14:44 INFO ContainerManagementProtocolProxy: yarn.client.max-cached-nodemanagers-proxies : 0
19/04/07 14:14:44 INFO ContainerManagementProtocolProxy: yarn.client.max-cached-nodemanagers-proxies : 0
19/04/07 14:14:44 INFO ContainerManagementProtocolProxy: Opening proxy : wangwei01:41680
19/04/07 14:14:44 INFO ContainerManagementProtocolProxy: Opening proxy : wangwei02:35865
19/04/07 14:14:46 ERROR ApplicationMaster: RECEIVED SIGNAL TERM
19/04/07 14:14:46 INFO ApplicationMaster: Final app status: UNDEFINED, exitCode: 16, (reason: Shutdown hook called before final status was reported.)
19/04/07 14:14:46 INFO ShutdownHookManager: Shutdown hook called
上图可以看出ApplicationMaster进程丢失,导致连接失败
在AM启动节点上面查看NM的日志信息如下(部分日志)
2019-04-07 14:14:44,317 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: launchContainer: [bash, /tmp/hadoop/hadoop-root/nm-local-dir/usercache/root/appcache/application_1554616735892_0001/container_1554616735892_0001_01_000002/default_container_executor.sh]
2019-04-07 14:14:44,321 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1554616735892_0001_01_000002 transitioned from LOCALIZED to RUNNING
2019-04-07 14:14:46,166 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Starting resource-monitoring for container_1554616735892_0001_01_000002
2019-04-07 14:14:46,176 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 29146 for container-id container_1554616735892_0001_01_000002: 234.3 MB of 2 GB physical memory used; 2.8 GB of 4.2 GB virtual memory used
2019-04-07 14:14:46,188 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 29105 for container-id container_1554616735892_0001_01_000001: 330.8 MB of 1 GB physical memory used; 2.3 GB of 2.1 GB virtual memory used
2019-04-07 14:14:46,188 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Process tree for container: container_1554616735892_0001_01_000001 has processes older than 1 iteration running over the configured limit. Limit=2254857728, current usage = 2498314240
2019-04-07 14:14:46,188 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Container [pid=29105,containerID=container_1554616735892_0001_01_000001] is running beyond virtual memory limits. Current usage: 330.8 MB of 1 GB physical memory used; 2.3 GB of 2.1 GB virtual memory used. Killing container.
Dump of the process-tree for container_1554616735892_0001_01_000001 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 29110 29105 29105 29105 (java) 542 24 2382413824 84382 /opt/jdk1.8.0_201/bin/java -server -Xmx512m -Djava.io.tmpdir=/tmp/hadoop/hadoop-root/nm-local-dir/usercache/root/appcache/application_1554616735892_0001/container_1554616735892_0001_01_000001/tmp -Dspark.yarn.app.container.log.dir=/opt/hadoop-2.7.2/logs/userlogs/application_1554616735892_0001/container_1554616735892_0001_01_000001 org.apache.spark.deploy.yarn.ExecutorLauncher --arg wangwei02:34614 --properties-file /tmp/hadoop/hadoop-root/nm-local-dir/usercache/root/appcache/application_1554616735892_0001/container_1554616735892_0001_01_000001/__spark_conf__/__spark_conf__.properties
|- 29105 29104 29105 29105 (bash) 0 0 115900416 307 /bin/bash -c /opt/jdk1.8.0_201/bin/java -server -Xmx512m -Djava.io.tmpdir=/tmp/hadoop/hadoop-root/nm-local-dir/usercache/root/appcache/application_1554616735892_0001/container_1554616735892_0001_01_000001/tmp -Dspark.yarn.app.container.log.dir=/opt/hadoop-2.7.2/logs/userlogs/application_1554616735892_0001/container_1554616735892_0001_01_000001 org.apache.spark.deploy.yarn.ExecutorLauncher --arg 'wangwei02:34614' --properties-file /tmp/hadoop/hadoop-root/nm-local-dir/usercache/root/appcache/application_1554616735892_0001/container_1554616735892_0001_01_000001/__spark_conf__/__spark_conf__.properties 1> /opt/hadoop-2.7.2/logs/userlogs/application_1554616735892_0001/container_1554616735892_0001_01_000001/stdout 2> /opt/hadoop-2.7.2/logs/userlogs/application_1554616735892_0001/container_1554616735892_0001_01_000001/stderr
2019-04-07 14:14:46,189 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Removed ProcessTree with root 29105
2019-04-07 14:14:46,189 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1554616735892_0001_01_000001 transitioned from RUNNING to KILLING
有此日志可以看出由于AM的vmem内存使用(2.3G)超出了默认的(2.1G),因此被kill掉了
解决方案
- 设置虚拟内存与物理内存比率
在yarn-site.xml中修改yarn.nodemanager.vmem-pmem-ratio(default:2.1)
<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>10</value>
</property>
- yarn.nodemanager.vmem-check-enabled,false掉
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>