title: Implementation Approach for a HiveServer2-Based Azkaban Plugin
date: 2017-02-05 13:45:03
tags: [Azkaban plugin,Hive,HiveServer2]
categories: "Azkaban"
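Keywords: HiveServer2, Azkaban plugin
I recently looked into HiveServer2 and implemented a HiveServer2-based plugin job type in Azkaban's plugin module. This post summarizes what I learned.
HiveServer & HiveServer2
Let's start with HiveServer. It is a server-side service that lets remote clients submit Hive jobs and retrieve their results through requests. Because HiveServer is built on the Thrift framework, it is sometimes called the Thrift server. The later HiveServer2 is also built on Thrift, however, so that name is ambiguous, and the original service is now often referred to as HiveServer1.
HiveServer2 improves on HiveServer by supporting concurrent clients and authentication. HiveServer is no longer recommended. The Hive documentation describes it as follows: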
HiveServer is an optional service that allows a remote client to submit requests to Hive, using a variety of programming languages, and retrieve results. HiveServer is built on Apache Thrift™ (http://thrift.apache.org/), therefore it is sometimes called the Thrift server although this can lead to confusion because a newer service named HiveServer2 is also built on Thrift. Since the introduction of HiveServer2, HiveServer has also been called HiveServer1. \
HiveServer cannot handle concurrent requests from more than one client. This is actually a limitation imposed by the Thrift interface that HiveServer exports, and can't be resolved by modifying the HiveServer code. \
HiveServer2 is a rewrite of HiveServer that addresses these problems, starting with Hive 0.11.0. Use of HiveServer2 is recommended.
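How Azkaban Plugins Work
Azkaban's Hadoop-related job plugin classes all extend JavaProcessJob. A JavaProcessJob is essentially a standalone Java process, and inside that process a client is invoked to run the Hadoop-related job. The schematic is as follows:
Every plugin job type must implement the run method and the cancel method.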
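The run/cancel contract can be illustrated with a minimal sketch. The `SketchJob` interface and `StepJob` class below are hypothetical stand-ins written for this post, not Azkaban's actual classes; they only show the usual pattern where `cancel` sets a flag that `run` checks between units of work.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for the run/cancel contract; not Azkaban's real interface.
interface SketchJob {
    void run() throws Exception;
    void cancel() throws Exception;
}

// Works through a fixed number of steps and stops early once cancel() has been called.
class StepJob implements SketchJob {
    private final int totalSteps;
    private final AtomicBoolean cancelled = new AtomicBoolean(false);
    private volatile int completedSteps = 0;

    StepJob(int totalSteps) {
        this.totalSteps = totalSteps;
    }

    @Override
    public void run() {
        for (int i = 0; i < totalSteps; i++) {
            // Check the flag before each step so a cancel request takes effect promptly.
            if (cancelled.get()) {
                throw new IllegalStateException("job cancelled after " + completedSteps + " steps");
            }
            completedSteps = completedSteps + 1; // a real plugin would submit work here
        }
    }

    @Override
    public void cancel() {
        // Typically called from a different thread than the one executing run().
        cancelled.set(true);
    }

    int completedSteps() {
        return completedSteps;
    }
}
```

In a real plugin, `cancel` would also have to stop whatever remote work `run` started, which is why the client API needs a cancel call of its own.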
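Implementing the HiveServer2-Based Plugin
Here is an old diagram of HiveServer:
A close look at the code of Azkaban's Hive plugin shows that Azkaban submits Hive jobs through the Hive client, that is, the CLI path in the diagram. This approach has several problems: because it bypasses HiveServer2 entirely, it gets neither HiveServer2's concurrency support nor its authentication, which leaves a number of risks.
That makes a HiveServer2-based Azkaban plugin worth developing.
Azkaban's implementation is briefly covered in the article "Azkaban Learning". The new plugin can largely be modeled on the HadoopJava job type plugin, so I won't go into detail here.
Jobs are usually submitted to HiveServer2 over JDBC, and the JDBC driver in turn talks to HiveServer2's Thrift service. Let's look at the client APIs HiveServer2 provides: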
- Submit HQL synchronously:
  `ThriftCLIServiceClient.executeStatement(SessionHandle sessionHandle, String statement, Map<String, String> confOverlay) throws HiveSQLException`
- Submit HQL asynchronously:
  `ThriftCLIServiceClient.executeStatementAsync(SessionHandle sessionHandle, String statement, Map<String, String> confOverlay) throws HiveSQLException`
- Fetch logs or results:
  `ThriftCLIServiceClient.fetchResults(OperationHandle opHandle, FetchOrientation orientation, long maxRows, FetchType fetchType) throws HiveSQLException`
  (FetchType is either FetchType.LOG or FetchType.QUERY_OUTPUT, for logs and query results respectively.)
- Query the execution status:
  `ThriftCLIServiceClient.getOperationStatus(OperationHandle opHandle) throws HiveSQLException`
  (The states are: INITIALIZED, RUNNING, FINISHED, CANCELED, CLOSED, ERROR, UNKNOWN, PENDING.)
- Cancel execution:
  `ThriftCLIServiceClient.cancelOperation(OperationHandle opHandle) throws HiveSQLException`
- Close the handle:
  `ThriftCLIServiceClient.closeOperation(OperationHandle opHandle) throws HiveSQLException`
  The handle must be closed after every SQL statement finishes.
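One way these calls might fit together in a plugin's run method is an asynchronous submit followed by a polling loop. This is only a sketch under stated assumptions: `StatementClient` is a hypothetical, trimmed-down mirror of the methods listed above (string ids instead of SessionHandle/OperationHandle, no HiveSQLException), and `FakeClient` is an in-memory fake so the example runs without a Hive cluster.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical mirror of the calls listed above; the real client uses handle types.
interface StatementClient {
    String executeStatementAsync(String statement); // returns an operation id
    String getOperationStatus(String opId);         // e.g. RUNNING, FINISHED, ERROR
    List<String> fetchLogs(String opId);            // stands in for fetchResults with FetchType.LOG
    void cancelOperation(String opId);
    void closeOperation(String opId);
}

// In-memory fake: reports RUNNING for the first two polls, then FINISHED.
class FakeClient implements StatementClient {
    int polls = 0;
    boolean closed = false;

    public String executeStatementAsync(String statement) { return "op-1"; }

    public String getOperationStatus(String opId) {
        polls++;
        return polls < 3 ? "RUNNING" : "FINISHED";
    }

    public List<String> fetchLogs(String opId) {
        List<String> lines = new ArrayList<>();
        lines.add("log line for poll " + polls);
        return lines;
    }

    public void cancelOperation(String opId) { }

    public void closeOperation(String opId) { closed = true; }
}

class StatementRunner {
    // Submits one statement, polls until it leaves the in-progress states,
    // forwards log lines to logSink, and always closes the operation handle.
    static String runOne(StatementClient client, String hql, List<String> logSink, long pollMillis) {
        String opId = client.executeStatementAsync(hql);
        try {
            while (true) {
                String state = client.getOperationStatus(opId);
                logSink.addAll(client.fetchLogs(opId));
                boolean inProgress = state.equals("INITIALIZED")
                        || state.equals("PENDING")
                        || state.equals("RUNNING");
                if (!inProgress) {
                    return state;
                }
                try {
                    Thread.sleep(pollMillis);
                } catch (InterruptedException e) {
                    // Treat an interrupt as a cancel request from the scheduler.
                    client.cancelOperation(opId);
                    Thread.currentThread().interrupt();
                    return "CANCELED";
                }
            }
        } finally {
            client.closeOperation(opId); // close the handle after every statement
        }
    }
}
```

Calling `StatementRunner.runOne(new FakeClient(), "SELECT 1", logs, 100L)` polls three times against the fake and returns the terminal state. Putting `closeOperation` in a `finally` block is a design choice that keeps handles from leaking when a statement fails or is cancelled.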
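These APIs are enough to implement the plugin. I can't publish the implementation code itself, but feel free to contact me privately with questions.
References: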
- https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Overview
- https://hive.apache.org/
- https://github.com/azkaban/azkaban-plugins
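============= Update, 2017.06.14 ====================
The HiveServer2 API described above is now outdated and too low-level. There is now an API that closely resembles JDBC, and under the hood it still calls the interfaces above.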
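For comparison, a minimal sketch of the JDBC route, assuming the Hive JDBC driver (the hive-jdbc artifact) is on the classpath and a HiveServer2 instance is reachable. The class and method names here are hypothetical; the `jdbc:hive2://` URL scheme and the default port 10000 come from the Hive documentation.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

class HiveJdbcSketch {
    // Builds a HiveServer2 JDBC URL such as jdbc:hive2://localhost:10000/default.
    static String jdbcUrl(String host, int port, String database) {
        return "jdbc:hive2://" + host + ":" + port + "/" + database;
    }

    // Runs one query and returns the number of rows read.
    // Needs the Hive JDBC driver on the classpath and a live server to succeed.
    static int countRows(String url, String user, String password, String hql) throws SQLException {
        try (Connection conn = DriverManager.getConnection(url, user, password);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(hql)) {
            int rows = 0;
            while (rs.next()) {
                rows++;
            }
            return rows;
        }
    }
}
```

With this route, the try-with-resources block closes the statement and connection, and a plugin's cancel method could call the standard `Statement.cancel()` from another thread.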