Preface:
I set up an HBase secondary-indexing environment for a project. Every tutorial I found online had its pitfalls, so I decided to write up a reasonably complete one. This article deliberately trims or skips things an experienced operator will recognize at a glance; it is aimed at readers new to this area who have some Linux and Hadoop background, not at complete beginners.
Environment:
OS: CentOS 6.7 x86_64
JDK: jdk1.7.0_79
hadoop-2.6.0+cdh5.4.1
hbase-solr-1.5+cdh5.4.1 (hbase-indexer-1.5-cdh5.4.1)
solr-4.10.3-cdh5.4.1
zookeeper-3.4.5-cdh5.4.1
hbase-1.0.0-cdh5.4.1
Download page for the CDH packages used in this article:
CDH 5.4.x Packaging and Tarball Information | 5.x | Cloudera Documentation
I. Basic environment preparation
1. A three-node Hadoop cluster (node1, node2, node3). The planned role layout, as used throughout this article: NameNode, HBase Master, and Solr on node1; DataNodes, RegionServers, and hbase-indexer on node2 and node3; ZooKeeper on all three nodes.
Get the NameNode, DataNodes, ZooKeeper, JournalNodes, and ZKFC running first; that is standard Hadoop work, not the focus of this article, so it is not covered here.
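As a quick sanity check, run jps on each node and confirm the daemons planned for that node are listed:
jps
# expect, depending on the node's role: NameNode, DataNode, JournalNode,
# DFSZKFailoverController (ZKFC), QuorumPeerMain (ZooKeeper)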
2. Download the required CDH tarballs:
Download the tarballs from the page linked above. Note that the HBase-solr tarball contains the entire project tree, but all we need is its deployment artifact: extract the hbase-solr-1.5+cdh5.4.1 tarball and locate hbase-indexer-1.5-cdh5.4.1.tar.gz under hbase-solr-1.5-cdh5.4.1/hbase-indexer-dist/target; we will use it shortly.
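For example, assuming the tarball keeps the naming shown on the download page:
tar zxvf hbase-solr-1.5+cdh5.4.1.tar.gz
ls hbase-solr-1.5-cdh5.4.1/hbase-indexer-dist/target/
# hbase-indexer-1.5-cdh5.4.1.tar.gz should be listed here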
II. Deploying hbase-indexer
Copy hbase-indexer-1.5-cdh5.4.1.tar.gz to node2 (we will scp the configured directory to node3 afterwards).
Extract hbase-indexer-1.5-cdh5.4.1.tar.gz:
tar zxvf hbase-indexer-1.5-cdh5.4.1.tar.gz
Edit the hbase-indexer settings:
vim hbase-indexer-1.5-cdh5.4.1/conf/hbase-indexer-site.xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hbaseindexer.zookeeper.connectstring</name>
    <!-- Adjust to match your ZooKeeper cluster -->
    <value>node1:2181,node2:2181,node3:2181</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <!-- Adjust to match your ZooKeeper cluster -->
    <value>node1,node2,node3</value>
  </property>
</configuration>
Configure hbase-indexer-env.sh:
vim hbase-indexer-1.5-cdh5.4.1/conf/hbase-indexer-env.sh
Set JAVA_HOME:
# Set environment variables here.
# This script sets variables multiple times over the course of starting an hbase-indexer process,
# so try to keep things idempotent unless you want to take an even deeper look
# into the startup scripts (bin/hbase-indexer, etc.)
# The java implementation to use. Java 1.6 required.
export JAVA_HOME=/usr/java/jdk1.7.0/
# Adjust to match your environment
# Extra Java CLASSPATH elements. Optional.
# export HBASE_INDEXER_CLASSPATH=
# The maximum amount of heap to use, in MB. Default is 1000.
# export HBASE_INDEXER_HEAPSIZE=1000
# Extra Java runtime options.
# Below are what we set by default. May only work with SUN JVM.
# For more on why as well as other possible settings,
# see http://wiki.apache.org/hadoop/PerformanceTuning
export HBASE_INDEXER_OPTS="$HBASE_INDEXER_OPTS -XX:+UseConcMarkSweepGC"
Use scp to copy the entire hbase-indexer-1.5-cdh5.4.1 directory to node3, for example:
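Assuming both nodes use the same /home/HBasetest deployment path (the path that appears later in hbase-env.sh) and SSH access is set up:
scp -r /home/HBasetest/hbase-indexer-1.5-cdh5.4.1 node3:/home/HBasetest/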
III. Deploying HBase
Extract the HBase tarball:
tar zxvf hbase-1.0.0-cdh5.4.1.tar.gz
Edit hbase-site.xml here as well:
vim hbase-1.0.0-cdh5.4.1/conf/hbase-site.xml
Add the following inside the <configuration> tag:
<property>
  <name>hbase.rootdir</name>
  <value>hdfs://node1:9000/hbase</value>
  <description>The directory shared by RegionServers</description>
</property>
<property>
  <name>hbase.master</name>
  <value>node1:60000</value>
</property>
<property>
  <name>hbase.cluster.distributed</name>
  <value>true</value>
  <description>The mode the cluster will be in. Possible values are
    false: standalone and pseudo-distributed setups with managed ZooKeeper
    true: fully-distributed with unmanaged ZooKeeper Quorum (see hbase-env.sh)
  </description>
</property>
<property>
  <name>hbase.replication</name>
  <value>true</value>
  <description>SEP is basically replication, so enable it</description>
</property>
<property>
  <name>replication.source.ratio</name>
  <value>1.0</value>
  <description>A source ratio of 100% makes sure that each SEP consumer is actually used (otherwise, some can sit idle, especially with small clusters)</description>
</property>
<property>
  <name>replication.source.nb.capacity</name>
  <value>1000</value>
  <description>Maximum number of hlog entries to replicate in one go. If this is large and a consumer takes a while to process the events, the HBase RPC call will time out.</description>
</property>
<property>
  <name>replication.replicationsource.implementation</name>
  <value>com.ngdata.sep.impl.SepReplicationSource</value>
  <description>A custom replication source that fixes a few things and adds some functionality (doesn't interfere with normal replication usage).</description>
</property>
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>node1,node2,node3</value>
  <description>Comma-separated list of servers in the ZooKeeper quorum</description>
</property>
<property>
  <name>hbase.zookeeper.property.dataDir</name>
  <!-- This must match the ZooKeeper cluster's data directory; see dataDir in zoo.cfg -->
  <value>/home/HBasetest/zookeeperdata</value>
  <description>Property from ZooKeeper's config zoo.cfg.
    The directory where the snapshot is stored.
  </description>
</property>
Similarly, edit hbase-env.sh:
vim hbase-1.0.0-cdh5.4.1/conf/hbase-env.sh
Set JAVA_HOME and HBASE_HOME:
# Set environment variables here.
# This script sets variables multiple times over the course of starting an hbase process,
# so try to keep things idempotent unless you want to take an even deeper look
# into the startup scripts (bin/hbase, etc.)
# The java implementation to use. Java 1.7+ required.
# export JAVA_HOME=/usr/java/jdk1.6.0/
export JAVA_HOME=/opt/jdk1.7.0_79
export HBASE_HOME=/home/HBasetest/hbase-1.0.0-cdh5.4.1
# Adjust to match your environment
# Extra Java CLASSPATH elements. Optional.
# export HBASE_CLASSPATH=
# The maximum amount of heap to use, in MB. Default is 1000.
# export HBASE_HEAPSIZE=1000
# Uncomment below if you intend to use off heap cache.
# export HBASE_OFFHEAPSIZE=1000
# For example, to allocate 8G of offheap, to 8G:
# export HBASE_OFFHEAPSIZE=8G
# Extra Java runtime options.
# Below are what we set by default. May only work with SUN JVM.
# For more on why as well as other possible settings,
# see http://wiki.apache.org/hadoop/PerformanceTuning
export HBASE_OPTS="-XX:+UseConcMarkSweepGC"
Copy these four files from the hbase-indexer-1.5-cdh5.4.1/lib directory into hbase-1.0.0-cdh5.4.1/lib/:
hbase-sep-api-1.5-cdh5.4.1.jar
hbase-sep-impl-1.5-hbase1.0-cdh5.4.1.jar
hbase-sep-impl-common-1.5-cdh5.4.1.jar
hbase-sep-tools-1.5-cdh5.4.1.jar
Edit hbase-1.0.0-cdh5.4.1/conf/regionservers to read:
node2
node3
Then copy the hbase-1.0.0-cdh5.4.1 directory to node2 and node3, for example as shown below.
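Again assuming the /home/HBasetest deployment path on every node:
for n in node2 node3; do scp -r /home/HBasetest/hbase-1.0.0-cdh5.4.1 $n:/home/HBasetest/; done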
IV. Deploying Solr
Just extract the Solr tarball on node1; nothing else is needed at this point.
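For example, with the tarball named as on the download page:
tar zxvf solr-4.10.3-cdh5.4.1.tar.gz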
V. Running and testing
1. Start HBase
On node1, run:
./hbase-1.0.0-cdh5.4.1/bin/start-hbase.sh
2. Start HBase-indexer
On node2 and node3, run:
./hbase-indexer-1.5-cdh5.4.1/bin/hbase-indexer server
If you want to run it in the background, use screen or nohup, for example:
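With nohup (the log file name here is arbitrary):
nohup ./hbase-indexer-1.5-cdh5.4.1/bin/hbase-indexer server > hbase-indexer.log 2>&1 &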
3. Start Solr
On node1, go into the example subdirectory of the extracted Solr directory and run:
java -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=myconf -DzkHost=node1:2181,node2:2181,node3:2181/solr -jar start.jar
Again, use screen or nohup if you want to run it in the background.
Visit http://node1:8983/solr/#/ to access the Solr admin page.
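To check from the command line that Solr is up and collection1 is loaded, you can query the core-status API:
curl 'http://node1:8983/solr/admin/cores?action=STATUS&wt=json'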
VI. Data indexing test
With the Hadoop cluster, HBase, HBase-Indexer, and Solr all running, first create a table in HBase.
From the HBase installation directory on any node, run:
./bin/hbase shell
create 'indexdemo-user', { NAME => 'info', REPLICATION_SCOPE => '1' }
On a node where HBase-Indexer is deployed, go into the deployment directory. An indexer is defined by an XML file that maps HBase columns to Solr fields; the demo/ directory ships with examples (such as user_indexer.xml) to start from. Create our field-definition file under hbase-indexer-1.5-cdh5.4.1/demo/:
<?xml version="1.0"?>
<indexer table="indexdemo-user">
  <field name="firstname_s" value="info:firstname"/>
  <field name="lastname_s" value="info:lastname"/>
  <field name="age_i" value="info:age" type="int"/>
</indexer>
Save it as indexdemo-indexer.xml.
Add the indexer instance. From the hbase-indexer-1.5-cdh5.4.1 directory, run:
./bin/hbase-indexer add-indexer -n myindexer -c demo/indexdemo-indexer.xml -cp \
solr.zk=node1:2181,node2:2181,node3:2181/solr -cp solr.collection=collection1 -z node1,node2,node3
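To confirm the registration, the CLI also provides a list-indexers subcommand (same -z ZooKeeper connect string as above):
./bin/hbase-indexer list-indexers -z node1,node2,node3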
Now prepare some test data. The project called for index tests over ten million records or more, so typing inserts at the shell was out of the question. HBase can also batch-execute a text file of shell commands, but that is still far too slow at this scale, so in the end I wrote a small Java program to bulk-insert rows quickly.
Create a new Java project in Eclipse and add everything under the HBase deployment directory's lib/ to the build path. The source code is as follows:
package com.hbasetest.hbtest;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;

public class DataInput {
    private static Configuration configuration;
    static {
        configuration = HBaseConfiguration.create();
        configuration.set("hbase.zookeeper.property.clientPort", "2181");
        configuration.set("hbase.zookeeper.quorum", "node1,node2,node3");
    }

    public static void main(String[] args) {
        try {
            HTable table = new HTable(configuration, "indexdemo-user");
            List<Put> putList = new ArrayList<Put>();
            for (int i = 0; i < 14000000; i++) {
                Put put = new Put(Integer.toString(i).getBytes());
                put.add("info".getBytes(), "firstname".getBytes(),
                        ("Java.value.firstname" + i).getBytes());
                put.add("info".getBytes(), "lastname".getBytes(),
                        ("Java.value.lastname" + i).getBytes());
                putList.add(put);
                // Flush in batches so the client never holds millions of Puts in memory.
                if (putList.size() >= 10000) {
                    table.put(putList);
                    putList.clear();
                    System.out.println("rows written: " + (i + 1));
                }
            }
            // Write any remaining puts from the last partial batch.
            if (!putList.isEmpty()) {
                table.put(putList);
            }
            table.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
The code batches its puts and flushes every 10,000 rows so the client JVM never has to hold all fourteen million Puts at once; if the machine running it is short on memory, reduce the batch size further.
The remaining retrieval test is straightforward, so I won't walk through it in detail; a minimal smoke test is sketched below.
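For a quick end-to-end check (the row key and values here are arbitrary; the Solr field names come from the indexer definition above), insert a row from the HBase shell:
./bin/hbase shell
put 'indexdemo-user', 'row1', 'info:firstname', 'John'
put 'indexdemo-user', 'row1', 'info:lastname', 'Smith'
Then query Solr; the row should come back as an indexed document:
curl 'http://node1:8983/solr/collection1/select?q=firstname_s:John&wt=json'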