Spark Cluster Hardware Configuration Reference

Tags: Spark

Hardware Provisioning

A common question received by Spark developers is how to configure hardware for it. While the right hardware will depend on the situation, we make the following recommendations.

Storage Systems

Because most Spark jobs will likely have to read input data from an external storage system (e.g. the Hadoop File System, or HBase), it is important to place Spark as close to this system as possible. We recommend the following:

If at all possible, run Spark on the same nodes as HDFS. The simplest way is to set up a Spark standalone mode cluster on the same nodes, and configure Spark and Hadoop’s memory and CPU usage to avoid interference (for Hadoop, the relevant options are mapred.child.java.opts for the per-task memory and mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum for number of tasks). Alternatively, you can run Hadoop and Spark on a common cluster manager like Mesos or Hadoop YARN.

If this is not possible, run Spark on different nodes in the same local-area network as HDFS.

For low-latency data stores like HBase, it may be preferable to run computing jobs on different nodes than the storage system to avoid interference.

Local Disks

While Spark can perform a lot of its computation in memory, it still uses local disks to store data that doesn’t fit in RAM, as well as to preserve intermediate output between stages. We recommend having 4-8 disks per node, configured without RAID (just as separate mount points). In Linux, mount the disks with the noatime option to reduce unnecessary writes. In Spark, configure the spark.local.dir variable to be a comma-separated list of the local disks. If you are running HDFS, it’s fine to use the same disks as HDFS.
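As a small illustration of the spark.local.dir setting, the sketch below joins a list of mount points into the comma-separated value described above and prints it in the key-value style of conf/spark-defaults.conf. The mount paths (/mnt/disk1 and so on), the "spark" subdirectory, and the helper name build_local_dir are hypothetical; only the spark.local.dir property comes from the text.

```python
def build_local_dir(mount_points, subdir="spark"):
    """Return a comma-separated value suitable for spark.local.dir.

    mount_points: one path per physical disk (hypothetical paths in the demo).
    subdir: scratch directory under each mount (an assumption, not required by Spark).
    """
    # Drop empty entries and trailing slashes so the joined paths are uniform.
    cleaned = [p.rstrip("/") for p in mount_points if p.strip()]
    return ",".join(f"{p}/{subdir}" for p in cleaned)


if __name__ == "__main__":
    # Hypothetical layout: four disks mounted separately, without RAID.
    disks = ["/mnt/disk1", "/mnt/disk2", "/mnt/disk3", "/mnt/disk4"]
    print("spark.local.dir " + build_local_dir(disks))
```

Listing every disk lets Spark spread its scratch files across all of them, which is the point of keeping them as separate mount points.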

Memory

In general, Spark can run well with anywhere from 8 GB to hundreds of gigabytes of memory per machine. In all cases, we recommend allocating only at most 75% of the memory for Spark; leave the rest for the operating system and buffer cache.
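The 75% guideline is simple arithmetic; a minimal sketch follows. The function name spark_memory_gb and the example machine sizes are hypothetical, and the 0.75 cap is just the recommendation stated above, not a limit enforced by Spark.

```python
def spark_memory_gb(total_gb, spark_fraction=0.75):
    """Upper bound (in GB) on the memory to give Spark on one machine.

    The remainder is left for the operating system and buffer cache.
    """
    if total_gb <= 0:
        raise ValueError("total_gb must be positive")
    if not 0 < spark_fraction <= 0.75:
        raise ValueError("spark_fraction should be in (0, 0.75]")
    return total_gb * spark_fraction


if __name__ == "__main__":
    for total in (8, 64, 256):  # hypothetical machine sizes in GB
        print(f"{total} GB machine: at most {spark_memory_gb(total)} GB for Spark")
```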

How much memory you will need will depend on your application. To determine how much your application uses for a certain dataset size, load part of your dataset in a Spark RDD and use the Storage tab of Spark’s monitoring UI (http://<driver-node>:4040) to see its size in memory. Note that memory usage is greatly affected by storage level and serialization format – see the tuning guide for tips on how to reduce it.
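One way to use the number read from the Storage tab is to extrapolate from the loaded sample to the full dataset. The sketch below assumes memory use grows linearly with the amount of data loaded, which is only a rough approximation; the function name and the example figures are hypothetical.

```python
def estimate_full_size_gb(sample_size_gb, sample_fraction):
    """Extrapolate in-memory size of the whole dataset from a cached sample.

    sample_size_gb: size shown in the Storage tab for the cached sample.
    sample_fraction: share of the dataset that was loaded, in (0, 1].
    Assumes linear scaling with data volume (an approximation).
    """
    if not 0 < sample_fraction <= 1:
        raise ValueError("sample_fraction must be in (0, 1]")
    return sample_size_gb / sample_fraction


if __name__ == "__main__":
    # Hypothetical reading: a 25% sample occupies 1.5 GB in memory.
    print(estimate_full_size_gb(1.5, 0.25), "GB estimated for the full dataset")
```

Because storage level and serialization format change the result, the estimate only holds for the same settings used when caching the sample.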

Finally, note that the Java VM does not always behave well with more than 200 GB of RAM. If you purchase machines with more RAM than this, you can run multiple worker JVMs per node. In Spark’s standalone mode, you can set the number of workers per node with the SPARK_WORKER_INSTANCES variable in conf/spark-env.sh, and the number of cores per worker with SPARK_WORKER_CORES.
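The worker-splitting advice can be sketched as a small calculation: give Spark at most 75% of the node's RAM, then use enough workers that no single JVM exceeds roughly 200 GB. The helper names, the even split of cores, and the example node are assumptions for illustration; only SPARK_WORKER_INSTANCES and SPARK_WORKER_CORES come from the text.

```python
import math


def plan_workers(node_ram_gb, node_cores, max_jvm_gb=200, spark_fraction=0.75):
    """Return (instances, cores_per_worker, memory_per_worker_gb) for one node."""
    spark_ram_gb = node_ram_gb * spark_fraction
    # Enough workers that each JVM stays at or below max_jvm_gb.
    instances = max(1, math.ceil(spark_ram_gb / max_jvm_gb))
    # Split cores evenly (an assumption); keep at least one core per worker.
    cores_per_worker = max(1, node_cores // instances)
    memory_per_worker_gb = int(spark_ram_gb // instances)
    return instances, cores_per_worker, memory_per_worker_gb


def render_spark_env(instances, cores_per_worker):
    """Format the two settings as lines for conf/spark-env.sh."""
    return (
        f"export SPARK_WORKER_INSTANCES={instances}\n"
        f"export SPARK_WORKER_CORES={cores_per_worker}"
    )


if __name__ == "__main__":
    # Hypothetical node: 512 GB of RAM and 32 cores.
    instances, cores, mem_gb = plan_workers(512, 32)
    print(render_spark_env(instances, cores))
    print(f"# roughly {mem_gb} GB available per worker")
```

For the hypothetical 512 GB node this yields two workers with 16 cores each, keeping each JVM under the 200 GB mark.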

Network

In our experience, when the data is in memory, a lot of Spark applications are network-bound. Using a 10 Gigabit or higher network is the best way to make these applications faster. This is especially true for “distributed reduce” applications such as group-bys, reduce-bys, and SQL joins. In any given application, you can see how much data Spark shuffles across the network from the application’s monitoring UI (http://<driver-node>:4040).
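To get a feel for why link speed matters, the sketch below estimates how long it takes to move a given amount of shuffle data over a single link. It is an idealized calculation: it assumes decimal gigabytes, one link running at full line rate, and no protocol overhead or contention. The function name and the 100 GB figure are hypothetical.

```python
def transfer_seconds(shuffle_gb, link_gbit_per_s):
    """Idealized time to move shuffle_gb gigabytes over one network link."""
    if link_gbit_per_s <= 0:
        raise ValueError("link_gbit_per_s must be positive")
    gigabits = shuffle_gb * 8  # 1 byte = 8 bits
    return gigabits / link_gbit_per_s


if __name__ == "__main__":
    shuffle_gb = 100  # hypothetical shuffle volume read from the monitoring UI
    for link in (1, 10):
        secs = transfer_seconds(shuffle_gb, link)
        print(f"{link} Gigabit link: about {secs:.0f} s")
```

Under these assumptions, 100 GB of shuffle data takes about 80 seconds on a 10 Gigabit link and about 800 seconds on a 1 Gigabit link.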

CPU Cores

Spark scales well to tens of CPU cores per machine because it performs minimal sharing between threads. You should likely provision at least 8-16 cores per machine. Depending on the CPU cost of your workload, you may also need more: once data is in memory, most applications are either CPU- or network-bound.

Original document (source of this translation): http://spark.apache.org/docs/1.6.3/hardware-provisioning.html
