Spark Tungsten-sort shuffle write流程解析

Tungsten简介

tungsten-sort这个名字作为一种排序方法，听起来有点怪异。下面简单介绍一下Tungsten。

Project Tungsten（“钨丝计划”）是DataBricks在3~4年前提出的Spark优化方案。由于Spark主要用Scala来开发，底层依赖JVM，因此不可避免地带来一些额外的性能开销（所谓overhead）。Tungsten致力于优化Spark的CPU与内存效率，主要有三方面：

显式内存管理与基于二进制的处理：由Spark应用自己管理（序列化的）对象和内存，消除JVM对象模型和GC等带来的overhead；
对缓存有感知的计算：提出高效的、能充分利用计算机存储体系的算法和数据结构；
代码生成技术：充分利用最新的编译器和CPU的特性，提高运行效率。

关于它的详情，可以参看DataBricks的官方说明，以及Tungsten对应的Jira issue。

目前，Tungsten在Spark SQL方面的应用最广泛。在其他方面，tungsten-sort shuffle就是比较重要的应用。从上面我们可以感觉到，tungsten-sort shuffle与前面的两种方式区别非常大，也会更难理解。下面来具体探索它的shuffle write细节，注释会写得尽量详细一点。

shuffle write入口

#1 - o.a.s.shuffle.sort.UnsafeShuffleWriter.write()方法

类名之所以叫UnsafeShuffleWriter，是因为Tungsten内部使用了很多sun.misc.Unsafe的API。

  @Override
  public void write(scala.collection.Iterator<Product2<K, V>> records) throws IOException {
    // Keep track of success so we know if we encountered an exception
    // We do this rather than a standard try/catch/re-throw to handle
    // generic throwables.
    boolean success = false;
    try {
      while (records.hasNext()) {
        //【#2 - 将shuffle数据插入排序器ShuffleExternalSorter进行处理】
        insertRecordIntoSorter(records.next());
      }
      //【#9 - 合并与写输出文件】
      closeAndWriteOutput();
      success = true;
    } finally {
      if (sorter != null) {
        try {
          sorter.cleanupResources();
        } catch (Exception e) {
          // Only throw this error if we won't be masking another
          // error.
          if (success) {
            throw e;
          } else {
            logger.error("In addition to a failure during writing, we failed during " +
                         "cleanup.", e);
          }
        }
      }
    }
  }

看起来好像和SortShuffleWriter是一样的套路，甚至还更简单？那是不可能的。

序列化

#2 - o.a.s.shuffle.sort.UnsafeShuffleWriter.open()与insertRecordIntoSorter()方法

  //【这个方法用来初始化insertRecordIntoSorter()用到的东西】
  private void open() {
    assert (sorter == null);
    //【sorter是ShuffleExternalSorter类的实例】
    sorter = new ShuffleExternalSorter(
      memoryManager,
      blockManager,
      taskContext,
      initialSortBufferSize,
      partitioner.numPartitions(),
      sparkConf,
      writeMetrics);
    //【MyByteArrayOutputStream类是ByteArrayOutputStream的简单封装，只是将内部byte[]数组暴露出来】
    //【DEFAULT_INITIAL_SER_BUFFER_SIZE常量值是1024 * 1024，即缓冲区初始1MB大】
    serBuffer = new MyByteArrayOutputStream(DEFAULT_INITIAL_SER_BUFFER_SIZE);
    //【获取序列化流SerializationStream】
    serOutputStream = serializer.serializeStream(serBuffer);
  }

  @VisibleForTesting
  void insertRecordIntoSorter(Product2<K, V> record) throws IOException {
    assert(sorter != null);
    //【获取到键和分区】
    final K key = record._1();
    final int partitionId = partitioner.getPartition(key);
    serBuffer.reset();
    //【序列化键和值，将它们都当做Object写入缓冲区】
    serOutputStream.writeKey(key, OBJECT_CLASS_TAG);
    serOutputStream.writeValue(record._2(), OBJECT_CLASS_TAG);
    serOutputStream.flush();
    //【取得序列化数据的大小】
    final int serializedRecordSize = serBuffer.size();
    assert (serializedRecordSize > 0);
    //【#3 - 将序列化之后的二进制数据插入ShuffleExternalSorter处理】
    //【BYTE_ARRAY_OFFSET对应Unsafe中的native方法arrayBaseOffset()】
    sorter.insertRecord(
      serBuffer.getBuf(), Platform.BYTE_ARRAY_OFFSET, serializedRecordSize, partitionId);
  }

这样一来，shuffle数据就已经序列化了，在shuffle write阶段结束前，都将在这些序列化的数据上操作，因此使用tungsten-sort的其中一条限制就是完全没有聚合操作。

下面会涉及到一些类似C语言风格的、偏底层的逻辑。这倒是符合Tungsten的“显式内存管理与基于二进制的处理”特征。

那么，向ShuffleExternalSorter插入数据之后又发生了什么呢？

内存缓存与磁盘溢写

#3 - o.a.s.shuffle.sort.ShuffleExternalSorter.insertRecord()方法

  /**
   * Write a record to the shuffle sorter.
   */
  public void insertRecord(Object recordBase, long recordOffset, int length, int partitionId)
    throws IOException {

    // for tests
    //【inMemSorter在之前已经初始化过了。顾名思义，它是个内存排序器】
    //【this.inMemSorter = new ShuffleInMemorySorter(this, initialSize, conf.getBoolean("spark.shuffle.sort.useRadixSort", true));】
    assert(inMemSorter != null);
    //【如果已经有超过spark.shuffle.spill.numElementsForceSpillThreshold条数据】
    if (inMemSorter.numRecords() >= numElementsForSpillThreshold) {
      logger.info("Spilling data because number of spilledRecords crossed the threshold " +
        numElementsForSpillThreshold);
      //【#5 - 直接溢写磁盘】
      spill();
    }

    //【#4 - 检查是否需要对排序缓存扩容及溢写】
    //【注意这个“pointerArray”，指针数组，是inMemSorter里的东西。看到本方法最后一句就能明白一点】
    growPointerArrayIfNecessary();
    // Need 4 bytes to store the record length.
    //【数据长度是int型，占额外4个字节】
    final int required = length + 4;
    //【如果需要更多内存，向TaskMemoryManager申请新的页】
    //【注意"页"（page）这个概念。Tungsten的内存管理与操作系统中的分页内存管理非常像】
    acquireNewPageIfNecessary(required);

    assert(currentPage != null);
    //【currentPage是一个MemoryBlock实例，它又是MemoryLocation的子类】
    //【由于引用传递，这里的base和下面的recordBase都代表一个Object在堆内的基地址】
    final Object base = currentPage.getBaseObject();
    //【#6 - 将当前这条数据的逻辑内存地址（页号+偏移量）编码成一个长整型】
    final long recordAddress = taskMemoryManager.encodePageNumberAndOffset(currentPage, pageCursor);
    //【先写长度值，并移动指针。Platform类中几乎所有方法都是直接调用unsafe API】
    Platform.putInt(base, pageCursor, length);
    pageCursor += 4;
    //【再写序列化之后的数据】
    Platform.copyMemory(recordBase, recordOffset, base, pageCursor, length);
    pageCursor += length;
    //【#7 - 将分区ID和编码的逻辑地址（“指针”）送给ShuffleInMemorySorter进行排序】
    inMemSorter.insertRecord(recordAddress, partitionId);
  }

这段代码信息量大而晦涩，不过我们可以清楚地知道，在tungsten-sort机制中，也存在数据缓存和溢写，这与sort shuffle是类似的。但是，这里不再借助像PartitionedPairBuffer之类的高级数据结构，而是由程序自己完成，并且是直接操作内存空间。另外，真正完成排序工作的是最后出场的ShuffleInMemorySorter。

代码注释中提到的“分页内存管理”是大学操作系统课程中会讲到的东西，下面只用一幅图来简单展示下。

分页内存管理逻辑地址到物理地址的转换

如果不了解或者忘记了逻辑地址、物理地址、页表、快表（TLB）等概念的话，可以参看操作系统方面的书籍，如Abraham Silberschatz所著《操作系统概念》（Operating System Concepts）。这也是我上大学时用的教材，非常好。

往回拢一拢，看看tungsten-sort机制中是如何溢写的。

#4 - o.a.s.shuffle.sort.ShuffleExternalSorter.growPointerArrayIfNecessary()方法

  /**
   * Checks whether there is enough space to insert an additional record in to the sort pointer
   * array and grows the array if additional space is required. If the required space cannot be
   * obtained, then the in-memory data will be spilled to disk.
   */
  private void growPointerArrayIfNecessary() throws IOException {
    assert(inMemSorter != null);
    //【pointerArray缓存中写不下新数据了】
    if (!inMemSorter.hasSpaceForAnotherRecord()) {
      //【取得当前内存占用量】
      long used = inMemSorter.getMemoryUsage();
      LongArray array;
      try {
        // could trigger spilling
        //【试图给缓存分配原来两倍大的容量。除以8是因为long占64bit】
        array = allocateArray(used / 8 * 2);
      } catch (TooLargePageException e) {
        // The pointer array is too big to fix in a single page, spill.
        //【#5 - 如果分配内存超出了一页的限制，就直接溢写】
        spill();
        return;
      } catch (SparkOutOfMemoryError e) {
        // should have trigger spilling
        //【如果OOM了，就是扩容与溢写都没成功，抛异常出去】
        if (!inMemSorter.hasSpaceForAnotherRecord()) {
          logger.error("Unable to grow the pointer array");
          throw e;
        }
        return;
      }
      // check if spilling is triggered or not
      if (inMemSorter.hasSpaceForAnotherRecord()) {
        //【如果上面一波操作过后又有了剩余空间，表示已经溢写了，没必要扩容。释放掉刚才分配的内存】
        freeArray(array);
      } else {
        //【否则就是没溢写，真正去扩容】
        //【这个方法非常简单，就是直接把原来的内容复制到新数组，然后free掉原来的空间】
        inMemSorter.expandPointerArray(array);
      }
    }
  }

与sort shuffle机制类似，在溢写之前仍然会先申请内存扩容，不过这里会受到页大小的限制。那么一页最大是多少呢？在PackedRecordPointer类中定义有常量：

static final int MAXIMUM_PAGE_SIZE_BYTES = 1 << 27;

static final int MAXIMUM_PARTITION_ID = (1 << 24) - 1;

可见，一页最大是2^27字节，即128MB。之前提到的最大分区数目也是在这里定义的。另外由于页大小的限制，序列化的数据单条也不能超过128MB，这是前述三个限制条件之外的第四个条件。

上面代码中的内存操作与C语言中的malloc()/free()非常相似，底层都是依赖TaskMemoryManager类来管理的。之后在讨论Tungsten内存管理机制时，会着重分析它的源码。

下面来看具体的溢写方法。

#5 - o.a.s.shuffle.sort.ShuffleExternalSorter.spill()与writeSortedFile()方法

  /**
   * Sort and spill the current records in response to memory pressure.
   */
  @Override
  public long spill(long size, MemoryConsumer trigger) throws IOException {
    if (trigger != this || inMemSorter == null || inMemSorter.numRecords() == 0) {
      return 0L;
    }

    logger.info("Thread {} spilling sort data of {} to disk ({} {} so far)",
      Thread.currentThread().getId(),
      Utils.bytesToString(getMemoryUsage()),
      spills.size(),
      spills.size() > 1 ? " times" : " time");
    //【将数据按序写入文件】
    //【boolean参数表示是否为最终的输出文件。这里为false，表示是溢写文件】
    writeSortedFile(false);
    final long spillSize = freeMemory();
    inMemSorter.reset();
    // Reset the in-memory sorter's pointer array only after freeing up the memory pages holding the
    // records. Otherwise, if the task is over allocated memory, then without freeing the memory
    // pages, we might not be able to get memory for the pointer array.
    taskContext.taskMetrics().incMemoryBytesSpilled(spillSize);
    return spillSize;
  }

  /**
   * Sorts the in-memory records and writes the sorted records to an on-disk file.
   * This method does not free the sort data structures.
   *
   * @param isLastFile if true, this indicates that we're writing the final output file and that the
   *                   bytes written should be counted towards shuffle spill metrics rather than
   *                   shuffle write metrics.
   */
  private void writeSortedFile(boolean isLastFile) {
    final ShuffleWriteMetrics writeMetricsToUse;

    //【判断是最终输出文件还是溢写文件，然后更新不同的指标】
    if (isLastFile) {
      // We're writing the final non-spill file, so we _do_ want to count this as shuffle bytes.
      writeMetricsToUse = writeMetrics;
    } else {
      // We're spilling, so bytes written should be counted towards spill rather than write.
      // Create a dummy WriteMetrics object to absorb these metrics, since we don't want to count
      // them towards shuffle bytes written.
      writeMetricsToUse = new ShuffleWriteMetrics();
    }

    // This call performs the actual sort.
    //【#6 - 用inMemSorter排序，返回排序结果的"迭代器"（不是Java内部的迭代器，是单独实现的，像指针）】
    final ShuffleInMemorySorter.ShuffleSorterIterator sortedRecords =
      inMemSorter.getSortedIterator();

    // Small writes to DiskBlockObjectWriter will be fairly inefficient. Since there doesn't seem to
    // be an API to directly transfer bytes from managed memory to the disk writer, we buffer
    // data through a byte array. This array does not need to be large enough to hold a single
    // record;
    //【创建一个写入缓存作为内存与DiskBlockObjectWriter之间的中转】
    //【diskWriteBufferSize大小是SHUFFLE_DISK_WRITE_BUFFER_SIZE常量，即1MB】
    final byte[] writeBuffer = new byte[diskWriteBufferSize];

    // Because this output will be read during shuffle, its compression codec must be controlled by
    // spark.shuffle.compress instead of spark.shuffle.spill.compress, so we need to use
    // createTempShuffleBlock here; see SPARK-3426 for more details.
    //【创建临时的shuffle块】
    final Tuple2<TempShuffleBlockId, File> spilledFileInfo =
      blockManager.diskBlockManager().createTempShuffleBlock();
    final File file = spilledFileInfo._2();
    final TempShuffleBlockId blockId = spilledFileInfo._1();
    final SpillInfo spillInfo = new SpillInfo(numPartitions, file, blockId);

    // Unfortunately, we need a serializer instance in order to construct a DiskBlockObjectWriter.
    // Our write path doesn't actually use this serializer (since we end up calling the `write()`
    // OutputStream methods), but DiskBlockObjectWriter still calls some methods on it. To work
    // around this, we pass a dummy no-op serializer.
    //【构造DiskBlockObjectWriter时必须要一个序列化器，所以这里新建一个dummy（不做任何转化的）序列化器】
    final SerializerInstance ser = DummySerializerInstance.INSTANCE;

    final DiskBlockObjectWriter writer =
      blockManager.getDiskWriter(blockId, file, ser, fileBufferSizeBytes, writeMetricsToUse);

    int currentPartition = -1;
    while (sortedRecords.hasNext()) {
      //【按分区遍历已经排好序的数据】
      sortedRecords.loadNext();
      final int partition = sortedRecords.packedRecordPointer.getPartitionId();
      assert (partition >= currentPartition);
      if (partition != currentPartition) {
        // Switch to the new partition
        //【检查分区号如果发生了变化，就先提交当前分区】
        if (currentPartition != -1) {
          //【提交写操作，得到FileSegment，并记录分区的大小】
          final FileSegment fileSegment = writer.commitAndGet();
          spillInfo.partitionLengths[currentPartition] = fileSegment.length();
        }
        //【然后换到下一个分区继续】
        currentPartition = partition;
      }

      //【取得指针，再通过指针取得页号与偏移量，就得到了数据的起始地址和长度】
      final long recordPointer = sortedRecords.packedRecordPointer.getRecordPointer();
      final Object recordPage = taskMemoryManager.getPage(recordPointer);
      final long recordOffsetInPage = taskMemoryManager.getOffsetInPage(recordPointer);
      //【先取得数据前面存储的长度，然后让指针跳过它】
      int dataRemaining = Platform.getInt(recordPage, recordOffsetInPage);
      long recordReadPosition = recordOffsetInPage + 4; // skip over record length
      while (dataRemaining > 0) {
        //【将数据拷贝到上面创建的缓存中，通过缓存转到DiskBlockObjectWriter】
        final int toTransfer = Math.min(diskWriteBufferSize, dataRemaining);
        Platform.copyMemory(
          recordPage, recordReadPosition, writeBuffer, Platform.BYTE_ARRAY_OFFSET, toTransfer);
        //【写数据】
        writer.write(writeBuffer, 0, toTransfer);
        //【移动指针，更新余量】
        recordReadPosition += toTransfer;
        dataRemaining -= toTransfer;
      }
      writer.recordWritten();
    }

    //【提交最后一个分区】
    final FileSegment committedSegment = writer.commitAndGet();
    writer.close();
    // If `writeSortedFile()` was called from `closeAndGetSpills()` and no records were inserted,
    // then the file might be empty. Note that it might be better to avoid calling
    // writeSortedFile() in that case.
    if (currentPartition != -1) {
      //【记录溢写文件的列表】
      spillInfo.partitionLengths[currentPartition] = committedSegment.length();
      spills.add(spillInfo);
    }

    if (!isLastFile) {  // i.e. this is a spill file
      //【原注释太长太长了，略去】
      //【如果是溢写文件，更新溢写的记录指标】
      writeMetrics.incRecordsWritten(writeMetricsToUse.recordsWritten());
      taskContext.taskMetrics().incDiskBytesSpilled(writeMetricsToUse.bytesWritten());
    }
  }

可见，在溢写之前仍然先对数据排序，并且一定会按分区号有序。每趟溢写产生的文件数与分区数相同，这点与基本sort shuffle不一样，sort shuffle的写批次是有参数控制的。

至此，溢写逻辑就完成了。那么，数据是以什么形式进入ShuffleInMemorySorter的，具体排序逻辑又是怎么样的？下面来看。

数据插入和排序

先来看数据是如何定址的。

#6 - o.a.s.memory.TaskMemoryManager.encodePageNumberAndOffset()方法

  /** The number of bits used to address the page table. */
  private static final int PAGE_NUMBER_BITS = 13;

  /** The number of bits used to encode offsets in data pages. */
  @VisibleForTesting
  static final int OFFSET_BITS = 64 - PAGE_NUMBER_BITS;  // 51

  /**
   * Given a memory page and offset within that page, encode this address into a 64-bit long.
   * This address will remain valid as long as the corresponding page has not been freed.
   *
   * @param page a data page allocated by {@link TaskMemoryManager#allocatePage}/
   * @param offsetInPage an offset in this page which incorporates the base offset. In other words,
   *                     this should be the value that you would pass as the base offset into an
   *                     UNSAFE call (e.g. page.baseOffset() + something).
   * @return an encoded page address.
   */
  public long encodePageNumberAndOffset(MemoryBlock page, long offsetInPage) {
    //【如果开启了堆外内存，由于在堆外不存在对象的基地址，所以偏移量也是绝对地址，有可能会非常大】
    //【故需要减去当前页的基地址，变成相对的】
    if (tungstenMemoryMode == MemoryMode.OFF_HEAP) {
      // In off-heap mode, an offset is an absolute address that may require a full 64 bits to
      // encode. Due to our page size limitation, though, we can convert this into an offset that's
      // relative to the page's base offset; this relative offset will fit in 51 bits.
      offsetInPage -= page.getBaseOffset();
    }
    //【当然我们现在主要考虑堆内内存的情况】
    return encodePageNumberAndOffset(page.pageNumber, offsetInPage);
  }

  @VisibleForTesting
  public static long encodePageNumberAndOffset(int pageNumber, long offsetInPage) {
    assert (pageNumber >= 0) : "encodePageNumberAndOffset called with invalid page";
    //【页号左移51位，然后拼上偏移量与上一个低51bit都为1的掩码（0x7FFFFFFFFFFFFL）】
    return (((long) pageNumber) << OFFSET_BITS) | (offsetInPage & MASK_LONG_LOWER_51_BITS);
  }

从这段代码中，我们可以看出Tungsten中逻辑地址的表示方式：高13bit为页号，低51bit为偏移量。可是，前面已经说过一页的最大容量是128MB，也就是说偏移量只占用了27bit，剩下的24bit哪里去了？答案是给了分区号来使用，下面马上会看到。

#7 - o.a.s.shuffle.sort.ShuffleInMemorySorter.insertRecord()与PackedRecordPointer.packPointer()方法

  /**
   * Inserts a record to be sorted.
   *
   * @param recordPointer a pointer to the record, encoded by the task memory manager. Due to
   *                      certain pointer compression techniques used by the sorter, the sort can
   *                      only operate on pointers that point to locations in the first
   *                      {@link PackedRecordPointer#MAXIMUM_PAGE_SIZE_BYTES} bytes of a data page.
   * @param partitionId the partition id, which must be less than or equal to
   *                    {@link PackedRecordPointer#MAXIMUM_PARTITION_ID}.
   */
  public void insertRecord(long recordPointer, int partitionId) {
    if (!hasSpaceForAnotherRecord()) {
      throw new IllegalStateException("There is no space for new record");
    }
    //【这里的array就是前文提到过的pointerArray，是LongArray类型的，里面存储的是"打包"过的指针】
    array.set(pos, PackedRecordPointer.packPointer(recordPointer, partitionId));
    pos++;
  }

  /**
   * Pack a record address and partition id into a single word.
   *
   * @param recordPointer a record pointer encoded by TaskMemoryManager.
   * @param partitionId a shuffle partition id (maximum value of 2^24).
   * @return a packed pointer that can be decoded using the {@link PackedRecordPointer} class.
   */
  public static long packPointer(long recordPointer, int partitionId) {
    assert (partitionId <= MAXIMUM_PARTITION_ID);
    // Note that without word alignment we can address 2^27 bytes = 128 megabytes per page.
    // Also note that this relies on some internals of how TaskMemoryManager encodes its addresses.
    //【所谓“打包”其实是压缩。将页号右移了24位，然后与低27位拼在一起，逻辑地址就被压缩到了40位】
    final long pageNumber = (recordPointer & MASK_LONG_UPPER_13_BITS) >>> 24;
    final long compressedAddress = pageNumber | (recordPointer & MASK_LONG_LOWER_27_BITS);
    //【将分区号放在空出来的高位上】
    return (((long) partitionId) << 40) | compressedAddress;
  }

由此可见，进入ShuffleInMemorySorter中待排序的指针都符合下图的格式。

tungsten-sort指针位图

既然排序是对指针的排序，那么数据本身自然是不会保证有序性的。看一下排序逻辑的入口。

#8 - o.a.s.shuffle.sort.ShuffleInMemorySorter.getSortedIterator()方法

  /**
   * Return an iterator over record pointers in sorted order.
   */
  public ShuffleSorterIterator getSortedIterator() {
    int offset = 0;
    //【useRadixSort由spark.shuffle.sort.useRadixSort参数控制。默认true】
    //【如果为true，采用最低位（LSD）优先的基数排序算法。由于分区号在最高位，因此分区是有序的】
    if (useRadixSort) {
      offset = RadixSort.sort(
        array, pos,
        PackedRecordPointer.PARTITION_ID_START_BYTE_INDEX,
        PackedRecordPointer.PARTITION_ID_END_BYTE_INDEX, false, false);
    } else {
      //【false就采用与sort shuffle相同的TimSort算法，在Sorter类中】
      MemoryBlock unused = new MemoryBlock(
        array.getBaseObject(),
        array.getBaseOffset() + pos * 8L,
        (array.size() - pos) * 8L);
      LongArray buffer = new LongArray(unused);
      Sorter<PackedRecordPointer, LongArray> sorter =
        new Sorter<>(new ShuffleSortDataFormat(buffer));
      //【SORT_COMPARATOR是简单的对分区ID的比较器】
      sorter.sort(array, 0, pos, SORT_COMPARATOR);
    }
    return new ShuffleSorterIterator(pos, array, offset);
  }

可见有两种排序算法，并且都保证分区号有序。默认的基数排序效率非常高，可以自行去参考它的源码，这里就不详细讲解了。

最后，还剩下文件合并与输出部分。加油。

文件合并与输出

#9 - o.a.s.shuffle.sort.UnsafeShuffleWriter.closeAndWriteOutput()方法

  @VisibleForTesting
  void closeAndWriteOutput() throws IOException {
    assert(sorter != null);
    //【更新峰值内存用量】
    updatePeakMemoryUsed();
    //【取得所有溢写文件，然后释放掉缓存、输出流和排序器】
    serBuffer = null;
    serOutputStream = null;
    final SpillInfo[] spills = sorter.closeAndGetSpills();
    sorter = null;
    final long[] partitionLengths;
    //【创建输出数据文件和临时数据文件】
    final File output = shuffleBlockResolver.getDataFile(shuffleId, mapId);
    final File tmp = Utils.tempFileWith(output);
    try {
      try {
        //【#10 - 合并溢写文件到临时输出文件】
        partitionLengths = mergeSpills(spills, tmp);
      } finally {
        //【然后删除它们】
        for (SpillInfo spill : spills) {
          if (spill.file.exists() && ! spill.file.delete()) {
            logger.error("Error while deleting spill file {}", spill.file.getPath());
          }
        }
      }
      //【创建索引文件。这个方法在之前有详细的解释过】
      shuffleBlockResolver.writeIndexFileAndCommit(shuffleId, mapId, partitionLengths, tmp);
    } finally {
      if (tmp.exists() && !tmp.delete()) {
        logger.error("Error while deleting temp file {}", tmp.getAbsolutePath());
      }
    }
    //【填充MapStatus结果】
    mapStatus = MapStatus$.MODULE$.apply(blockManager.shuffleServerId(), partitionLengths);
  }

#10 - o.a.s.shuffle.sort.UnsafeShuffleWriter.mergeSpills()方法

  /**
   * Merge zero or more spill files together, choosing the fastest merging strategy based on the
   * number of spills and the IO compression codec.
   *
   * @return the partition lengths in the merged file.
   */
  private long[] mergeSpills(SpillInfo[] spills, File outputFile) throws IOException {
    //【该参数用来控制是否做shuffle阶段的压缩】
    final boolean compressionEnabled = sparkConf.getBoolean("spark.shuffle.compress", true);
    final CompressionCodec compressionCodec = CompressionCodec$.MODULE$.createCodec(sparkConf);
    //【该参数用来控制是否要做快速的合并】
    final boolean fastMergeEnabled =
      sparkConf.getBoolean("spark.shuffle.unsafe.fastMergeEnabled", true);
    //【但上面的还不够，还得判断是否支持快速合并。如果shuffle阶段不启用压缩，或者启用了lz4/lzf/snappy/zstd压缩，就支持】
    final boolean fastMergeIsSupported = !compressionEnabled ||
      CompressionCodec$.MODULE$.supportsConcatenationOfSerializedStreams(compressionCodec);
    final boolean encryptionEnabled = blockManager.serializerManager().encryptionEnabled();
    try {
      if (spills.length == 0) {
        //【如果根本没有溢写文件，写一个空文件】
        new FileOutputStream(outputFile).close(); // Create an empty file
        return new long[partitioner.numPartitions()];
      } else if (spills.length == 1) {
        //【如果只有一个溢写文件，就直接将它合并进输出文件中】
        // Here, we don't need to perform any metrics updates because the bytes written to this
        // output file would have already been counted as shuffle bytes written.
        Files.move(spills[0].file, outputFile);
        return spills[0].partitionLengths;
      } else {
        //【如果有多个溢写文件】
        final long[] partitionLengths;
        //【英文注释过长，是关于metrics的，略去】
        if (fastMergeEnabled && fastMergeIsSupported) {
          // Compression is disabled or we are using an IO compression codec that supports
          // decompression of concatenated compressed streams, so we can perform a fast spill merge
          // that doesn't need to interpret the spilled bytes.
          //【如果启用并支持快速合并，并且启用了上一篇文章中的transferTo机制，还没有加密】
          //【就使用NIO zero-copy来合并到输出文件】
          if (transferToEnabled && !encryptionEnabled) {
            logger.debug("Using transferTo-based fast merge");
            partitionLengths = mergeSpillsWithTransferTo(spills, outputFile);
          } else {
            //【如果不用transferTo的话，就使用不经压缩的BIO FileStream来合并到输出文件】
            logger.debug("Using fileStream-based fast merge");
            partitionLengths = mergeSpillsWithFileStream(spills, outputFile, null);
          }
        } else {
          //【如果不启用或不支持快速合并，就使用压缩的BIO FileStream来合并到输出文件。这个方式比较"慢"】
          logger.debug("Using slow merge");
          partitionLengths = mergeSpillsWithFileStream(spills, outputFile, compressionCodec);
        }
        // When closing an UnsafeShuffleExternalSorter that has already spilled once but also has
        // in-memory records, we write out the in-memory records to a file but do not count that
        // final write as bytes spilled (instead, it's accounted as shuffle write). The merge needs
        // to be counted as shuffle write, but this will lead to double-counting of the final
        // SpillInfo's bytes.
        writeMetrics.decBytesWritten(spills[spills.length - 1].file.length());
        writeMetrics.incBytesWritten(outputFile.length());
        //【返回分区大小，索引文件要用到】
        return partitionLengths;
      }
    } catch (IOException e) {
      if (outputFile.exists() && !outputFile.delete()) {
        logger.error("Unable to delete output file {}", outputFile.getPath());
      }
      throw e;
    }
  }

可见，tungsten-sort shuffle的文件合并方式与bypass sort shuffle的方式比较类似，有NIO和BIO两种方式，但判断条件更严格一些。至此，整个tungsten-sort shuffle write流程就结束了。

总结

下图来自https://0x0fff.com/spark-architecture-shuffle/。

tungsten-sort shuffle write流程简图

暂未涉及细节的知识点

Tungsten的内存管理机制详情
NIO在shuffle处理中的应用

最后编辑于：2019.05.10 13:21:24

人面猴
序言：七十年代末，一起剥皮案震惊了整个滨河市，随后出现的几起案子，更是在滨河造成了极大的恐慌，老刑警刘岩，带你破解...
沈念sama阅读 204,590评论 6赞 478
死咒
序言：滨河连续发生了三起死亡事件，死亡现场离奇诡异，居然都是意外死亡，警方通过查阅死者的电脑和手机，发现死者居然都...
沈念sama阅读 86,808评论 2赞 381
救了他两次的神仙让他今天三更去死
文/潘晓璐我一进店门，熙熙楼的掌柜王于贵愁眉苦脸地迎上来，“玉大人，你说我怎么就摊上这事。” “怎么了？”我有些...
开封第一讲书人阅读 151,151评论 0赞 337
道士缉凶录：失踪的卖姜人
文/不坏的土叔我叫张陵，是天一观的道长。经常有香客问我，道长，这世上最难降的妖魔是什么？我笑而不...
开封第一讲书人阅读 54,779评论 1赞 277
港岛之恋（遗憾婚礼）
正文为了忘掉前任，我火速办了婚礼，结果婚礼上，老公的妹妹穿的比我还像新娘。我一直安慰自己，他们只是感情好，可当我...
茶点故事阅读 63,773评论 5赞 367
恶毒庶女顶嫁案：这布局不是一般人想出来的
文/花漫我一把揭开白布。她就那样静静地躺着，像睡着了一般。火红的嫁衣衬着肌肤如雪。梳的纹丝不乱的头发上，一...
开封第一讲书人阅读 48,656评论 1赞 281
城市分裂传说
那天，我揣着相机与录音，去河边找鬼。笑死，一个胖子当着我的面吹牛，可吹牛的内容都是我干的。我是一名探鬼主播，决...
沈念sama阅读 38,022评论 3赞 398
双鸳鸯连环套：你想象不到人心有多黑
文/苍兰香墨我猛地睁开眼，长吁一口气：“原来是场噩梦啊……” “哼！你这毒妇竟也来了？” 一声冷哼从身侧响起，我...
开封第一讲书人阅读 36,678评论 0赞 258
万荣杀人案实录
序言：老挝万荣一对情侣失踪，失踪者是张志新（化名）和其女友刘颖，没想到半个月后，有当地人在树林里发现了一具尸体，经...
沈念sama阅读 41,038评论 1赞 299
护林员之死
正文独居荒郊野岭守林人离奇死亡，尸身上长有42处带血的脓包…… 初始之章·张勋以下内容为张勋视角年9月15日...
茶点故事阅读 35,659评论 2赞 321
白月光启示录
正文我和宋清朗相恋三年，在试婚纱的时候发现自己被绿了。大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
茶点故事阅读 37,756评论 1赞 330
活死人
序言：一个原本活蹦乱跳的男人离奇死亡，死状恐怖，灵堂内的尸体忽然破棺而出，到底是诈尸还是另有隐情，我是刑警宁泽，带...
沈念sama阅读 33,411评论 4赞 321
日本核电站爆炸内幕
正文年R本政府宣布，位于F岛的核电站，受9级特大地震影响，放射性物质发生泄漏。R本人自食恶果不足惜，却给世界环境...
茶点故事阅读 39,005评论 3赞 307
男人毒药：我在死后第九天来索命
文/蒙蒙一、第九天我趴在偏房一处隐蔽的房顶上张望。院中可真热闹，春花似锦、人声如沸。这庄子的主人今日做“春日...
开封第一讲书人阅读 29,973评论 0赞 19
一桩弑父案，背后竟有这般阴谋
文/苍兰香墨我抬头看了看天上的太阳。三九已至，却和暖如春，着一层夹袄步出监牢的瞬间，已是汗流浃背。一阵脚步声响...
开封第一讲书人阅读 31,203评论 1赞 260
情欲美人皮
我被黑心中介骗来泰国打工，没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留，地道东北人。一个月前我还...
沈念sama阅读 45,053评论 2赞 350
代替公主和亲
正文我出身青楼，却偏偏与公主长得像，于是被迫代替她去往敌国和亲。传闻我的和亲对象是个残疾皇子，可洞房花烛夜当晚...
茶点故事阅读 42,495评论 2赞 343