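An OOM caused by Sleuth failing to send spans to Zipkin

I. Background

A production incident: an online service was responding slowly.
As standard practice, the service's JVM startup options are configured to write a heap dump when an OOM occurs: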

-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/dump-path/

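This is a good habit.

Analyzing the dump file with Eclipse MAT gives the following biggest-objects view:

[Image: biggest objects]

One kind of object occupies 1.8 GB of heap, while the configured maximum heap size is 2 GB. Clearly a program problem caused a single kind of object to be created in large numbers while staying reachable, so the JVM could not collect them and ran out of memory.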

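II. MAT analysis

Next, continue with MAT to trace where these objects come from:

[Image: summary description]

This summary says that one instance of zipkin2.reporter.InMemoryReporterMetrics occupies 96.09% of the heap, and that the memory is accumulated in an instance of java.util.concurrent.ConcurrentHashMap$Node[].

From this summary we roughly know to look for the problem in the InMemoryReporterMetrics class.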

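1) Shortest paths to the accumulation point

MAT also provides the Shortest Paths to the Accumulation Point view, which shows the reference chain that keeps the big object alive:

[Image: shortest paths]

According to this view, the reference chain of the big object is: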

AsyncReporter.Builder ->
AsyncReporter.BoundedAsyncReporter (metrics field) ->
InMemoryReporterMetrics (messagesDropped field)

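2) Contents of the big object

Since the big object is made up of ConcurrentHashMap$Node instances, looking at what the Nodes actually contain should help locate the problem;

MAT can also show the contents of the accumulated objects.

The way to do it is:

[Image: outgoing references]

This gives the contents of the big object:

[Image: big object contents]

Pick any one of the objects and look at the contents of the Map's Node:

The key is an exception, specifically a ResourceAccessException
The value is an AtomicLong
The exception in the key was caused by a POST to http://localhost:9411/api/v2/spans being refused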

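III. Source code analysis

The MAT analysis of the dump already gives a lot of information, arguably even the cause of the problem. But the source code still needs a closer look to understand exactly how the problem arises and how to fix it.

1) InMemoryReporterMetrics

Following the reference chain found with MAT, look at the class InMemoryReporterMetrics: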

private final ConcurrentHashMap<Throwable, AtomicLong> messagesDropped =
      new ConcurrentHashMap<Throwable, AtomicLong>();

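messagesDropped is a ConcurrentHashMap whose key is a Throwable and whose value is an AtomicLong.

As the name suggests, InMemoryReporterMetrics is an in-memory reporter metric. Specifically, it keeps statistics on all messages that Sleuth sends to the Zipkin server, both the ones sent successfully and the ones that failed. Note that these statistics are held in memory.
Within this metric, messagesDropped records the messages that failed to send: the key is the exception and the value is the number of occurrences.

The key is the Throwable instance itself, and Throwable does not override equals() or hashCode(), so every new exception object becomes a new entry. The inference is that if sending to Zipkin keeps failing, messagesDropped keeps growing and will eventually cause an OOM.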
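To make that growth concrete, here is a minimal sketch. Only the map type mirrors the messagesDropped field quoted above; the class ThrowableKeyDemo and its method are made up for illustration and are not part of zipkin-reporter.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical demo class, not zipkin-reporter code.
class ThrowableKeyDemo {

    // Records `failures` fresh exceptions in a map keyed by Throwable
    // and returns how many entries the map ends up holding.
    static int countEntriesAfterFailures(int failures) {
        ConcurrentHashMap<Throwable, AtomicLong> dropped = new ConcurrentHashMap<>();
        for (int i = 0; i < failures; i++) {
            // Each failed send produces a new exception object.
            Throwable cause = new IllegalStateException("connection refused");
            // Throwable uses identity equals/hashCode, so this never finds an existing key.
            dropped.computeIfAbsent(cause, k -> new AtomicLong()).incrementAndGet();
        }
        return dropped.size();
    }

    public static void main(String[] args) {
        // Same exception type and message every time, yet one entry per failure.
        System.out.println(countEntriesAfterFailures(1000)); // prints 1000
    }
}
```

Every entry also keeps the exception's stack trace and cause chain reachable, which is why the map can get so large.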

2) AsyncReporter

According to the reference chain, InMemoryReporterMetrics is referenced by the metrics field of AsyncReporter.BoundedAsyncReporter:

static final class BoundedAsyncReporter<S> extends AsyncReporter<S> {

final ReporterMetrics metrics;
}

The flush() method of this class contains the following code:

void flush(BufferNextMessage<S> bundler) {
    // ... (code that builds nextMessage from the bundler is elided)
    try {
        sender.sendSpans(nextMessage).execute();
    } catch (IOException | RuntimeException | Error t) {
        // In failure case, we increment messages and spans dropped.
        metrics.incrementMessagesDropped(t);
    }
}

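As you can see, when the sender fails to send a message to zipkin, the exception instance itself is stored in the metrics' messagesDropped.

AsyncReporter uses the builder pattern to create the async reporter, and the concrete class of that reporter is AsyncReporter's nested class BoundedAsyncReporter.

The build() method of AsyncReporter.Builder starts a thread that, in a while loop, keeps flushing the messages in the queue to zipkin. This is where the "async" in the reporter's name comes from.

3) zipkin auto-configuration

Spring Boot auto-configuration essentially registers beans with various capabilities in the Spring context when the required conditions are met. The zipkin auto-configuration is no exception:

The auto-configuration class ZipkinAutoConfiguration creates the async reporter like this: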

@Bean
@ConditionalOnMissingBean
public Reporter<Span> reporter(
        ReporterMetrics reporterMetrics,
        ZipkinProperties zipkin,
        Sender sender,
        BytesEncoder<Span> spanBytesEncoder
) {
    return AsyncReporter.builder(sender)
            .queuedMaxSpans(1000) // historical constraint. Note: AsyncReporter supports memory bounds
            .messageTimeout(zipkin.getMessageTimeout(), TimeUnit.SECONDS)
            .metrics(reporterMetrics)
            .build(spanBytesEncoder);
}

This class also creates the sender needed to send to zipkin, as well as the ReporterMetrics we are interested in:

@Bean
@ConditionalOnMissingBean
ReporterMetrics sleuthReporterMetrics() {
    return new InMemoryReporterMetrics(); 
}

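IV. Cause of the problem

During development and testing, the service used zipkin for call-chain tracing. When it went to production, zipkin could not be used for certain reasons, so the zipkin configuration was commented out.

As a result, the service still has the zipkin dependency: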

<dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-starter-zipkin</artifactId>
</dependency>

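But it has no zipkin configuration:

# Call-chain tracing
#  zipkin:
#    base-url: http://172.20.6.23:9412
#  sleuth:
#    sampler:
#      probability: 1.0 # sampling rate; the default is 0.1, i.e. 10% of requests are sampled

Looking at the zipkin auto-configuration class ZipkinAutoConfiguration: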

@EnableConfigurationProperties({ZipkinProperties.class, SamplerProperties.class})
@ConditionalOnProperty(value = "spring.zipkin.enabled", matchIfMissing = true)
public class ZipkinAutoConfiguration {
    @Bean
    @ConditionalOnMissingBean
    public Reporter<Span> reporter(
            ReporterMetrics reporterMetrics,
            ZipkinProperties zipkin,
            Sender sender,
            BytesEncoder<Span> spanBytesEncoder
    ) {
        return AsyncReporter.builder(sender)
                .queuedMaxSpans(1000) // historical constraint. Note: AsyncReporter supports memory bounds
                .messageTimeout(zipkin.getMessageTimeout(), TimeUnit.SECONDS)
                .metrics(reporterMetrics)
                .build(spanBytesEncoder);
    }
}

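Because of matchIfMissing = true, an async reporter is created even when there is no zipkin configuration at all. The default sampling rate is: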

private float probability = 0.1f;

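So even without any of the related configuration items, spans are sent to zipkin at the default 10% sampling rate, and the address used in that case is the default one: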

@ConfigurationProperties("spring.zipkin")
public class ZipkinProperties {
    /**
     *  URL of the zipkin query server instance. You can also provide
     *  the service id of the Zipkin server if Zipkin's registered in
     *  service discovery (e.g. http://zipkinserver/)
     */
    private String baseUrl = "http://localhost:9411/";
}

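Sending to localhost obviously gets connection refused. The exception instances then pile up in the metrics, and the result is the OOM.

V. Fixing the problem

From the MAT analysis and the source analysis, it is easy to see that the cause is the zipkin address, so configuring the address correctly should solve the problem.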
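If zipkin cannot be used in production at all, as in this incident, another option follows from the @ConditionalOnProperty(value = "spring.zipkin.enabled", matchIfMissing = true) annotation quoted above: switch the auto-configuration off explicitly instead of just commenting the settings out. The fragment below is an untested sketch for application.yml; whether other Sleuth components are affected by it is not covered by the analysis here.

```yaml
spring:
  zipkin:
    enabled: false   # ZipkinAutoConfiguration is skipped, so no AsyncReporter is created
```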

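A deeper problem

The analysis shows that the InMemoryReporterMetrics created along with the async reporter is itself flawed;
if sending to zipkin fails for some unforeseen reason, the exception is stored in the in-memory metrics (InMemoryReporterMetrics), and there is no mechanism to remove it. If the exceptions keep piling up, the OOM will happen again.

I am not sure whether this counts as a design flaw in zipkin.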
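One way to keep a drop count without unbounded growth is to key the counters by exception class rather than by exception instance. The sketch below only illustrates that idea; ClassKeyedDropCounter and its methods are hypothetical names, and this is not presented as how zipkin-reporter implements its metrics.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper, not part of zipkin-reporter.
class ClassKeyedDropCounter {

    // One entry per exception type, so the map size is bounded by the number of types seen.
    private final ConcurrentHashMap<Class<? extends Throwable>, AtomicLong> dropped =
            new ConcurrentHashMap<>();

    void incrementMessagesDropped(Throwable cause) {
        // Only the class is retained; the exception object itself can be collected.
        dropped.computeIfAbsent(cause.getClass(), k -> new AtomicLong()).incrementAndGet();
    }

    int distinctCauses() {
        return dropped.size();
    }

    long count(Class<? extends Throwable> type) {
        AtomicLong n = dropped.get(type);
        return n == null ? 0L : n.get();
    }

    public static void main(String[] args) {
        ClassKeyedDropCounter counter = new ClassKeyedDropCounter();
        for (int i = 0; i < 1000; i++) {
            counter.incrementMessagesDropped(new IllegalStateException("connection refused #" + i));
        }
        System.out.println(counter.distinctCauses());                   // prints 1
        System.out.println(counter.count(IllegalStateException.class)); // prints 1000
    }
}
```

The trade-off is that the individual messages and stack traces are no longer available from the metrics, only the per-type counts.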

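Solution

A colleague suggested creating an empty (no-op) metrics implementation to replace the in-memory one. Since the sleuthReporterMetrics bean in the auto-configuration is marked @ConditionalOnMissingBean, defining our own ReporterMetrics bean takes its place: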

@Bean
public ReporterMetrics metrics() {
    return new ReporterMetrics() {

        @Override
        public void incrementMessages() {
            
        }

        @Override
        public void incrementMessagesDropped(Throwable cause) {

        }

        @Override
        public void incrementSpans(int quantity) {

        }

        @Override
        public void incrementSpanBytes(int quantity) {

        }

        @Override
        public void incrementMessageBytes(int quantity) {

        }

        @Override
        public void incrementSpansDropped(int quantity) {

        }

        @Override
        public void updateQueuedSpans(int update) {

        }

        @Override
        public void updateQueuedBytes(int update) {

        }
    };
}