Background
I received a notification from the ops team that a project I'm responsible for had accumulated a lot of core files, generated by the system after a Python process crashed. How core files are produced won't be covered here; look it up if you're interested.
Analysis
Analyzing a core file requires gdb. Because of the symbol files involved, you need a debugging environment identical to the one where the problem occurred; otherwise all you'll see is garbage.
Since the program runs in Kubernetes, the image has to be pulled and the debugging done inside a container.
The first step is to copy the core file into the container, which can be done with a command like:
docker cp /local/path/file <container_id>:/container_path
The container_id can be obtained from docker ps.
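Putting the pieces together, the sequence looks roughly like this (the core file name and the paths are placeholders for whatever your environment uses):
docker ps                                          # find the id of the running container
docker cp /local/path/core.17445 <container_id>:/tmp/
docker exec -it <container_id> /bin/bash           # get a shell inside the container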
The analysis then proceeds as follows:
- Start gdb
gdb /home/chunyu/workspace/ENV/bin/python core.17445
- Inspect the failing call stack
A plain bt does not map the frames back to Python code; the Python debug info has to be installed first. Running file python shows whether the debug info is currently loaded. gdb usually prints the exact command needed to install it; in my case it was:
yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/8d/75b23c27b98a6fc5656327f915409f6f1fba5b.debug
After that, the stack can be examined with the py-bt command.
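For reference, the relevant gdb commands look roughly like this (output omitted; py-bt comes from the python-gdb extensions shipped with the debug info):
(gdb) file python    # check whether the Python debug info is loaded
(gdb) py-bt          # Python-level backtrace of the current thread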
From the py-bt output it can be seen that the thread was reading data from HBase when the exception that produced the core file occurred.
- Root cause analysis
The initial suspicion was a problem caused by multi-threaded access. Looking at the call stacks of all threads, thread 1 and thread 35 were both talking to HBase: thread 35 was creating a new connection while thread 1 was still using the old one, which led directly to the crash, as the stacks below show.
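The per-thread stacks below were collected with commands along these lines (a sketch; the thread numbers come from info threads):
(gdb) info threads              # list every thread captured in the core
(gdb) thread 35                 # switch to a specific thread
(gdb) py-bt                     # Python-level stack of that thread
(gdb) thread apply all py-bt    # or dump the Python stacks of all threads at once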
Thread 1 (Thread 0x7f66b14aa700 (LWP 32386)):
#18 Frame 0x7f66a002ac80, for file /home/workspace/ENV/lib/python2.7/site-packages/thriftpy/thrift.py, line 150, in read (self=<getRowsWithColumns_result(io=None, success=None) at remote 0x7f66b1537890>, iprot=<cybin.TCyBinaryProtocol at remote 0x7f66b1554b40>)
iprot.read_struct(self)
#22 Frame 0x7f66a005d860, for file /home/workspace/ENV/lib/python2.7/site-packages/thriftpy/thrift.py, line 217, in _recv (self=<TClient(_seqid=0, _service=<type at remote 0x41c81f0>, _iprot=<cybin.TCyBinaryProtocol at remote 0x7f66b1554b40>, _oprot=<cybin.TCyBinaryProtocol at remote 0x7f66b1554b40>) at remote 0x7f66b1544a10>, _api='getRowsWithColumns', fname=u'getRowsWithColumns', mtype=2, rseqid=0, result=<getRowsWithColumns_result(io=None, success=None) at remote 0x7f66b1537890>)
result.read(self._iprot)
#26 Frame 0x7f669c124950, for file /home/chunyu/workspace/ENV/lib/python2.7/site-packages/thriftpy/thrift.py, line 198, in _req (self=<TClient(_seqid=0, _service=<type at remote 0x41c81f0>, _iprot=<cybin.TCyBinaryProtocol at remote 0x7f66b1554b40>, _oprot=<cybin.TCyBinaryProtocol at remote 0x7f66b1554b40>) at remote 0x7f66b1544a10>, _api='getRowsWithColumns',
return self._recv(_api)
#36 Frame 0x7f66a000bd00, for file /home/workspace/ENV/lib/python2.7/site-packages/happybase/table.py, line 162, in rows (self=<Table(connection=<Connection(compat='0.96', _transport_class=<type at remote 0x7f670fdb2d00>, table_prefix=None, table_prefix_separator='_', _protocol_class=<type at remote 0x7f670f57b4c0>, _initialized=True, host='offline_hbase', client=<TClient(_seqid=0, _service=<type at remote 0x41c81f0>, _iprot=<cybin.TCyBinaryProtocol at remote 0x7f66b1554b40>, _oprot=<cybin.TCyBinaryProtocol at remote 0x7f66b1554b40>) at remote 0x7f66b1544a10>, timeout=1000, port=29090, transport=<thriftpy.transport.buffered.cybuffered.TCyBufferedTransport at remote 0x7f66b1554870>) at remote 0x7f66b1544f50>, name='') at remote 0x7f66b1537c10>
self.name, rows, columns, {})
Thread 35 (Thread 0x7f66b0ca9700 (LWP 32390)):
#14 Frame 0x7f672d8f9790, for file /usr/lib64/python2.7/socket.py, line 224, in meth (name='connect', self=<_socketobject at remote 0x7f66b14c5ad0>, args=(('offline_hbase', 29090),))
return getattr(self._sock,name)(*args)
#22 Frame 0x7f66b4ad7620, for file /home/workspace/ENV/lib/python2.7/site-packages/thriftpy/transport/socket.py, line 96, in open (self=<TSocket(socket_timeout=<float at remote 0x7f669c0b15e8>, sock=<_socketobject at remote 0x7f66b14c5ad0>, socket_family=2, unix_socket=None, host='offline_hbase', connect_timeout=<float at remote 0x7f669c0b15e8>, port=29090) at remote 0x7f66b14c6e50>, addr=('offline_hbase', 29090))
self.sock.connect(addr)
#31 Frame 0x7f66b56ceb00, for file /home/workspace/ENV/lib/python2.7/site-packages/happybase/connection.py, line 178, in open (self=<Connection(compat='0.96', _transport_class=<type at remote 0x7f670fdb2d00>, table_prefix=None, table_prefix_separator='_', _protocol_class=<type at remote 0x7f670f57b4c0>, host='offline_hbase', client=<TClient(_seqid=0, _service=<type at remote 0x41c81f0>, _iprot=<cybin.TCyBinaryProtocol at remote 0x7f66b1502dc0>, _oprot=<cybin.TCyBinaryProtocol at remote 0x7f66b1502dc0>) at remote 0x7f66b14c6790>, timeout=1000, port=29090, transport=<thriftpy.transport.buffered.cybuffered.TCyBufferedTransport at remote 0x7f66b15029b0>) at remote 0x7f66b14c6350>)
self.transport.open()
#34 Frame 0x2e41150, for file /home/workspace/ENV/lib/python2.7/site-packages/happybase/connection.py, line 148, in __init__ (self=<Connection(compat='0.96', _transport_class=<type at remote 0x7f670fdb2d00>, table_prefix=None, table_prefix_separator='_', _protocol_class=<type at remote 0x7f670f57b4c0>, host='offline_hbase', client=<TClient(_seqid=0, _service=<type at remote 0x41c81f0>, _iprot=<cybin.TCyBinaryProtocol at remote 0x7f66b1502dc0>, _oprot=<cybin.TCyBinaryProtocol at remote 0x7f66b1502dc0>) at remote 0x7f66b14c6790>, timeout=1000, port=29090, transport=<thriftpy.transport.buffered.cybuffered.TCyBufferedTransport at remote 0x7f66b15029b0>) at remote 0x7f66b14c6350>, host='offline_hbase', port=29090, timeout=1000, autoconnect=True, table_prefix=None, table_prefix_separator='_', compat='0.96', transport='buffered', protocol='binary')
self.open()
You can see that thread 35 was re-establishing the connection to HBase while thread 1 was still reading data on the old connection. Since happybase's Connection is not thread-safe, the program crashed.
The fix is to consider using happybase's connection pool.
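A minimal sketch of the pooled approach (the table name, row keys, and pool size are made up for illustration; ConnectionPool and pool.connection() are part of the happybase API):
import happybase

# One pool shared by all threads; each thread borrows its own connection,
# so two threads never touch the same Thrift connection at the same time.
pool = happybase.ConnectionPool(size=8, host='offline_hbase', port=29090,
                                timeout=1000, compat='0.96')

def fetch_rows(row_keys):
    # connection() blocks until a connection is free and automatically
    # returns it to the pool when the with-block exits
    with pool.connection() as conn:
        table = conn.table('my_table')  # hypothetical table name
        return table.rows(row_keys)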