Preface:
I set up an HBase secondary-indexing environment for a project. Every tutorial I found online had its pitfalls, so I decided to write up a reasonably complete one. This article deliberately trims or skips things an experienced operator will recognize at a glance; it is aimed at readers new to this area who have some Linux and Hadoop background, not at complete beginners.
Environment:
OS: CentOS 6.7 x86_64
JDK: jdk1.7.0_79
hadoop-2.6.0+cdh5.4.1
hbase-solr-1.5+cdh5.4.1 (hbase-indexer-1.5-cdh5.4.1)
solr-4.10.3-cdh5.4.1
zookeeper-3.4.5-cdh5.4.1
hbase-1.0.0-cdh5.4.1
Download page for the CDH packages used in this article:
CDH 5.4.x Packaging and Tarball Information | 5.x | Cloudera Documentation
I. Basic environment preparation
1. A three-node Hadoop cluster (node1, node2, node3). The planned role layout, as used throughout this article: NameNode, HBase Master, and Solr on node1; DataNodes, RegionServers, and hbase-indexer on node2 and node3; ZooKeeper on all three nodes.
Get the NameNode, DataNodes, ZooKeeper, JournalNodes, and ZKFC running first; that is standard Hadoop work, not the focus of this article, so it is not covered here.
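As a quick sanity check, run jps on each node and confirm the daemons planned for that node are listed:
jps
# expect, depending on the node's role: NameNode, DataNode, JournalNode,
# DFSZKFailoverController (ZKFC), QuorumPeerMain (ZooKeeper)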
2. Download the required CDH tarballs:
Download the tarballs from the page linked above. Note that the HBase-solr tarball contains the entire project tree, but all we need is its deployment artifact: extract the hbase-solr-1.5+cdh5.4.1 tarball and locate hbase-indexer-1.5-cdh5.4.1.tar.gz under hbase-solr-1.5-cdh5.4.1/hbase-indexer-dist/target; we will use it shortly.
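For example, assuming the tarball keeps the naming shown on the download page:
tar zxvf hbase-solr-1.5+cdh5.4.1.tar.gz
ls hbase-solr-1.5-cdh5.4.1/hbase-indexer-dist/target/
# hbase-indexer-1.5-cdh5.4.1.tar.gz should be listed here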
II. Deploying hbase-indexer
Copy hbase-indexer-1.5-cdh5.4.1.tar.gz to node2 (we will scp the configured directory to node3 afterwards).
Extract hbase-indexer-1.5-cdh5.4.1.tar.gz:
tar zxvf hbase-indexer-1.5-cdh5.4.1.tar.gz
Edit the hbase-indexer settings:
vim hbase-indexer-1.5-cdh5.4.1/conf/hbase-indexer-site.xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hbaseindexer.zookeeper.connectstring</name>
    <!-- Adjust to match your ZooKeeper cluster -->
    <value>node1:2181,node2:2181,node3:2181</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <!-- Adjust to match your ZooKeeper cluster -->
    <value>node1,node2,node3</value>
  </property>
</configuration>
Configure hbase-indexer-env.sh:
vim hbase-indexer-1.5-cdh5.4.1/conf/hbase-indexer-env.sh
Set JAVA_HOME:
# Set environment variables here.
# This script sets variables multiple times over the course of starting an hbase-indexer process,
# so try to keep things idempotent unless you want to take an even deeper look
# into the startup scripts (bin/hbase-indexer, etc.)
# The java implementation to use. Java 1.6 required.
export JAVA_HOME=/usr/java/jdk1.7.0/
# Adjust to match your environment
# Extra Java CLASSPATH elements. Optional.
# export HBASE_INDEXER_CLASSPATH=
# The maximum amount of heap to use, in MB. Default is 1000.
# export HBASE_INDEXER_HEAPSIZE=1000
# Extra Java runtime options.
# Below are what we set by default. May only work with SUN JVM.
# For more on why as well as other possible settings,
# see http://wiki.apache.org/hadoop/PerformanceTuning
export HBASE_INDEXER_OPTS="$HBASE_INDEXER_OPTS -XX:+UseConcMarkSweepGC"
Use scp to copy the entire hbase-indexer-1.5-cdh5.4.1 directory to node3, for example:
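Assuming both nodes use the same /home/HBasetest deployment path (the path that appears later in hbase-env.sh) and SSH access is set up:
scp -r /home/HBasetest/hbase-indexer-1.5-cdh5.4.1 node3:/home/HBasetest/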
III. Deploying HBase
Extract the HBase tarball:
tar zxvf hbase-1.0.0-cdh5.4.1.tar.gz
Edit hbase-site.xml here as well:
vim hbase-1.0.0-cdh5.4.1/conf/hbase-site.xml
Add the following inside the <configuration> tag:
<property>
  <name>hbase.rootdir</name>
  <value>hdfs://node1:9000/hbase</value>
  <description>The directory shared by RegionServers</description>
</property>
<property>
  <name>hbase.master</name>
  <value>node1:60000</value>
</property>
<property>
  <name>hbase.cluster.distributed</name>
  <value>true</value>
  <description>The mode the cluster will be in. Possible values are
    false: standalone and pseudo-distributed setups with managed ZooKeeper
    true: fully-distributed with unmanaged ZooKeeper Quorum (see hbase-env.sh)
  </description>
</property>
<property>
  <name>hbase.replication</name>
  <value>true</value>
  <description>SEP is basically replication, so enable it</description>
</property>
<property>
  <name>replication.source.ratio</name>
  <value>1.0</value>
  <description>A source ratio of 100% makes sure that each SEP consumer is actually used (otherwise, some can sit idle, especially with small clusters)</description>
</property>
<property>
  <name>replication.source.nb.capacity</name>
  <value>1000</value>
  <description>Maximum number of hlog entries to replicate in one go. If this is large and a consumer takes a while to process the events, the HBase RPC call will time out.</description>
</property>
<property>
  <name>replication.replicationsource.implementation</name>
  <value>com.ngdata.sep.impl.SepReplicationSource</value>
  <description>A custom replication source that fixes a few things and adds some functionality (doesn't interfere with normal replication usage).</description>
</property>
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>node1,node2,node3</value>
  <description>Comma-separated list of servers in the ZooKeeper quorum</description>
</property>
<property>
  <name>hbase.zookeeper.property.dataDir</name>
  <!-- This must match the ZooKeeper cluster's data directory; see dataDir in zoo.cfg -->
  <value>/home/HBasetest/zookeeperdata</value>
  <description>Property from ZooKeeper's config zoo.cfg.
    The directory where the snapshot is stored.
  </description>
</property>
Similarly, edit hbase-env.sh:
vim hbase-1.0.0-cdh5.4.1/conf/hbase-env.sh
Set JAVA_HOME and HBASE_HOME:
# Set environment variables here.
# This script sets variables multiple times over the course of starting an hbase process,
# so try to keep things idempotent unless you want to take an even deeper look
# into the startup scripts (bin/hbase, etc.)
# The java implementation to use. Java 1.7+ required.
# export JAVA_HOME=/usr/java/jdk1.6.0/
export JAVA_HOME=/opt/jdk1.7.0_79
export HBASE_HOME=/home/HBasetest/hbase-1.0.0-cdh5.4.1
# Adjust to match your environment
# Extra Java CLASSPATH elements. Optional.
# export HBASE_CLASSPATH=
# The maximum amount of heap to use, in MB. Default is 1000.
# export HBASE_HEAPSIZE=1000
# Uncomment below if you intend to use off heap cache.
# export HBASE_OFFHEAPSIZE=1000
# For example, to allocate 8G of offheap, to 8G:
# export HBASE_OFFHEAPSIZE=8G
# Extra Java runtime options.
# Below are what we set by default. May only work with SUN JVM.
# For more on why as well as other possible settings,
# see http://wiki.apache.org/hadoop/PerformanceTuning
export HBASE_OPTS="-XX:+UseConcMarkSweepGC"
Copy these four files from the hbase-indexer-1.5-cdh5.4.1/lib directory into hbase-1.0.0-cdh5.4.1/lib/:
hbase-sep-api-1.5-cdh5.4.1.jar
hbase-sep-impl-1.5-hbase1.0-cdh5.4.1.jar
hbase-sep-impl-common-1.5-cdh5.4.1.jar
hbase-sep-tools-1.5-cdh5.4.1.jar
Edit hbase-1.0.0-cdh5.4.1/conf/regionservers to read:
node2
node3
Then copy the hbase-1.0.0-cdh5.4.1 directory to node2 and node3, for example as shown below.
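Again assuming the /home/HBasetest deployment path on every node:
for n in node2 node3; do scp -r /home/HBasetest/hbase-1.0.0-cdh5.4.1 $n:/home/HBasetest/; done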
IV. Deploying Solr
Just extract the Solr tarball on node1; nothing else is needed at this point.
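For example, with the tarball named as on the download page:
tar zxvf solr-4.10.3-cdh5.4.1.tar.gz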
V. Running and testing
1. Start HBase
On node1, run:
./hbase-1.0.0-cdh5.4.1/bin/start-hbase.sh
2. Start HBase-indexer
On node2 and node3, run:
./hbase-indexer-1.5-cdh5.4.1/bin/hbase-indexer server
If you want to run it in the background, use screen or nohup, for example:
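With nohup (the log file name here is arbitrary):
nohup ./hbase-indexer-1.5-cdh5.4.1/bin/hbase-indexer server > hbase-indexer.log 2>&1 &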
3. Start Solr
On node1, go into the example subdirectory of the extracted Solr directory and run:
java -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=myconf -DzkHost=node1:2181,node2:2181,node3:2181/solr -jar start.jar
Again, use screen or nohup if you want to run it in the background.
Visit http://node1:8983/solr/#/ to access the Solr admin page.
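To check from the command line that Solr is up and collection1 is loaded, you can query the core-status API:
curl 'http://node1:8983/solr/admin/cores?action=STATUS&wt=json'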
VI. Data indexing test
With the Hadoop cluster, HBase, HBase-Indexer, and Solr all running, first create a table in HBase.
From the HBase installation directory on any node, run:
./bin/hbase shell
create 'indexdemo-user', { NAME => 'info', REPLICATION_SCOPE => '1' }
On a node where HBase-Indexer is deployed, go into the deployment directory. An indexer is defined by an XML file that maps HBase columns to Solr fields; the demo/ directory ships with examples (such as user_indexer.xml) to start from. Create our field-definition file under hbase-indexer-1.5-cdh5.4.1/demo/:
<?xml version="1.0"?>
<indexer table="indexdemo-user">
  <field name="firstname_s" value="info:firstname"/>
  <field name="lastname_s" value="info:lastname"/>
  <field name="age_i" value="info:age" type="int"/>
</indexer>
Save it as indexdemo-indexer.xml.
Add the indexer instance. From the hbase-indexer-1.5-cdh5.4.1 directory, run:
./bin/hbase-indexer add-indexer -n myindexer -c demo/indexdemo-indexer.xml -cp \
solr.zk=node1:2181,node2:2181,node3:2181/solr -cp solr.collection=collection1 -z node1,node2,node3
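To confirm the registration, the CLI also provides a list-indexers subcommand (same -z ZooKeeper connect string as above):
./bin/hbase-indexer list-indexers -z node1,node2,node3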
Now prepare some test data. The project called for index tests over ten million records or more, so typing inserts at the shell was out of the question. HBase can also batch-execute a text file of shell commands, but that is still far too slow at this scale, so in the end I wrote a small Java program to bulk-insert rows quickly.
Create a new Java project in Eclipse and add everything under the HBase deployment directory's lib/ to the build path. The source code is as follows:
package com.hbasetest.hbtest;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;

public class DataInput {
    private static Configuration configuration;
    static {
        configuration = HBaseConfiguration.create();
        configuration.set("hbase.zookeeper.property.clientPort", "2181");
        configuration.set("hbase.zookeeper.quorum", "node1,node2,node3");
    }

    public static void main(String[] args) {
        try {
            HTable table = new HTable(configuration, "indexdemo-user");
            List<Put> putList = new ArrayList<Put>();
            for (int i = 0; i < 14000000; i++) {
                Put put = new Put(Integer.toString(i).getBytes());
                put.add("info".getBytes(), "firstname".getBytes(),
                        ("Java.value.firstname" + i).getBytes());
                put.add("info".getBytes(), "lastname".getBytes(),
                        ("Java.value.lastname" + i).getBytes());
                putList.add(put);
                // Flush in batches so the client never holds millions of Puts in memory.
                if (putList.size() >= 10000) {
                    table.put(putList);
                    putList.clear();
                    System.out.println("rows written: " + (i + 1));
                }
            }
            // Write any remaining puts from the last partial batch.
            if (!putList.isEmpty()) {
                table.put(putList);
            }
            table.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
The code batches its puts and flushes every 10,000 rows so the client JVM never has to hold all fourteen million Puts at once; if the machine running it is short on memory, reduce the batch size further.
The remaining retrieval test is straightforward, so I won't walk through it in detail; a minimal smoke test is sketched below.
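For a quick end-to-end check (the row key and values here are arbitrary; the Solr field names come from the indexer definition above), insert a row from the HBase shell:
./bin/hbase shell
put 'indexdemo-user', 'row1', 'info:firstname', 'John'
put 'indexdemo-user', 'row1', 'info:lastname', 'Smith'
Then query Solr; the row should come back as an indexed document:
curl 'http://node1:8983/solr/collection1/select?q=firstname_s:John&wt=json'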