Hive + HBase + Hadoop Integration
I have already written about integrating the HBase and Hadoop environments, so I won't repeat that here; see:
HBase and Hadoop pseudo-distributed environment setup
This post adds Hive on top of that existing environment.
First, download the stable release from the official site:

apache-hive-2.3.3-bin.tar.gz

Extract it to the /opt directory.
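As a sketch, the extract step can look like this; the TARBALL and DEST values are assumptions based on this post's layout and should be adjusted to your own paths:

```shell
# Sketch of the extract-and-rename step (paths are assumptions from this post).
TARBALL=${TARBALL:-apache-hive-2.3.3-bin.tar.gz}
DEST=${DEST:-/opt}
tar -zxf "$TARBALL" -C "$DEST"
# The archive unpacks as apache-hive-2.3.3-bin; rename it to match $HIVE_HOME below.
mv "$DEST/apache-hive-2.3.3-bin" "$DEST/hive-2.3.3"
```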
Configure the environment variables:

export HIVE_HOME=/opt/hive-2.3.3
export PATH=$PATH:$HIVE_HOME/bin

Enter the Hive configuration directory:

# cd $HIVE_HOME/conf

Create a new hive-site.xml file:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?><!--
   Licensed to the Apache Software Foundation (ASF) under one or more
   contributor license agreements.  See the NOTICE file distributed with
   this work for additional information regarding copyright ownership.
   The ASF licenses this file to You under the Apache License, Version 2.0
   (the "License"); you may not use this file except in compliance with
   the License.  You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
-->
<configuration>

  <property>
    <name>datanucleus.fixedDatastore</name>
    <value>false</value>
  </property>
  <property>
    <name>datanucleus.autoCreateSchema</name>
    <value>true</value>
  </property>
  <property>
    <name>datanucleus.autoCreateTables</name>
    <value>true</value>
  </property>
  <property>
    <name>datanucleus.autoCreateColumns</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
  </property>

  <property>
    <name>hive.aux.jars.path</name>
    <value>file:///opt/hive-2.3.3/lib/hive-hbase-handler-2.3.3.jar,file:///opt/hbase-2.0.2/lib/hbase-client-2.0.2.jar,file:///opt/hbase-2.0.2/lib/hbase-server-2.0.2.jar,file:///opt/hbase-2.0.2/lib/hbase-protocol-2.0.2.jar,file:///opt/hbase-2.0.2/lib/zookeeper-3.4.10.jar</value>
    <description>A comma separated list (with no spaces) of the jar files required for Hive-HBase integration</description>
  </property>

  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>192.168.0.11</value>
    <description>A comma separated list (with no spaces) of the IP addresses of all ZooKeeper servers in the cluster.</description>
  </property>

</configuration>

Some posts say you must copy /opt/hive-2.3.3/lib/hive-hbase-handler-2.3.3.jar into HBase's lib directory, and copy all the jars under HBase's lib into Hive's lib; however, with the configuration above, Hive starts successfully and creates tables without any of that.
Others say you should first delete all the hbase-prefixed jars under Hive's lib and then copy in every jar from HBase's lib; in my tests that produced NoSuchMethodError-style failures. So don't follow those older blog posts for this integration: I verified that no HBase jars need to be copied at all, and the configuration above is sufficient.

For reference, the Hive, HBase, and Hadoop versions used in this integration are:

apache-hive-2.3.3-bin.tar.gz
hbase-2.0.2-bin.tar.gz
hadoop-2.7.7.tar.gz

Note that HBase and Hadoop have explicit version-compatibility constraints, which the official documentation describes.

Now let's create a simple table hbase_table_emp in Hive and map it to an emp table in HBase; the emp table is created automatically:

> CREATE TABLE hbase_table_emp(id int, name string, role string) 
> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
> WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:name,cf1:role")
> TBLPROPERTIES ("hbase.table.name" = "emp");

Viewing the automatically created emp table in HBase:

hbase(main):031:0> describe 'emp'
Table emp is ENABLED                                                                                                                                                                                                                                                          
emp                                                                                                                                                                                                                                                                           
COLUMN FAMILIES DESCRIPTION                                                                                                                                                                                                                                                   
{NAME => 'cf1', VERSIONS => '1', EVICT_BLOCKS_ON_CLOSE => 'false', NEW_VERSION_BEHAVIOR => 'false', KEEP_DELETED_CELLS => 'FALSE', CACHE_DATA_ON_WRITE => 'false', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', REPLICATION_SCOPE => '0', BLOOMFILTER
 => 'ROW', CACHE_INDEX_ON_WRITE => 'false', IN_MEMORY => 'false', CACHE_BLOOMS_ON_WRITE => 'false', PREFETCH_BLOCKS_ON_OPEN => 'false', COMPRESSION => 'NONE', BLOCKCACHE => 'true', BLOCKSIZE => '65536'}                                                                    
1 row(s)
Took 0.0229 seconds

Then insert two rows through Hive:

hive> insert into table hbase_table_emp values ('001','Tom','Admin'),('002','Alice','Student');

This fails with an error:

hive> insert into table hbase_table_emp values ('001','Tom','Admin'),('002','Alice','Student');
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20181105093018_02b369cd-d3a6-4c02-8fc4-f55122111e17
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1541380449256_0002, Tracking URL = http://bensongtan.com:8088/proxy/application_1541380449256_0002/
Kill Command = /opt/hadoop-2.7.7/bin/hadoop job  -kill job_1541380449256_0002
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2018-11-05 09:30:31,260 Stage-3 map = 0%,  reduce = 0%
2018-11-05 09:30:47,697 Stage-3 map = 100%,  reduce = 0%
Ended Job = job_1541380449256_0002 with errors
Error during job, obtaining debugging information...
Examining task ID: task_1541380449256_0002_m_000000 (and more) from job job_1541380449256_0002

Task with the most failures(4): 
-----
Task ID:
  task_1541380449256_0002_m_000000

URL:
  http://0.0.0.0:8088/taskdetails.jsp?jobid=job_1541380449256_0002&tipid=task_1541380449256_0002_m_000000
-----
Diagnostic Messages for this Task:
Error: java.lang.ClassNotFoundException: org.apache.commons.lang3.NotImplementedException
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at org.apache.hadoop.hive.hbase.HiveHBaseTableOutputFormat.getRecordWriter(HiveHBaseTableOutputFormat.java:105)
    at org.apache.hadoop.hive.ql.io.HivePassThroughOutputFormat.getHiveRecordWriter(HivePassThroughOutputFormat.java:65)
    at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getRecordWriter(HiveFileFormatUtils.java:286)
    at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:271)
    at org.apache.hadoop.hive.ql.exec.FileSinkOperator.createBucketForFileIdx(FileSinkOperator.java:619)
    at org.apache.hadoop.hive.ql.exec.FileSinkOperator.createBucketFiles(FileSinkOperator.java:563)
    at org.apache.hadoop.hive.ql.exec.FileSinkOperator.closeOp(FileSinkOperator.java:1026)
    at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:697)
    at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:711)
    at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:711)
    at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:711)
    at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.close(ExecMapper.java:189)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)


FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched: 
Stage-Stage-3: Map: 1   HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec

Downloading commons-lang3-3.1.jar from $HIVE_HOME/lib and inspecting it confirmed that this class really is missing there, whereas commons-lang3-3.6.jar in HBase's lib directory does contain NotImplementedException. Deleting the older version from Hive's lib and copying in the 3.6 version from HBase resolved the problem. The successful run:

hive> insert into table hbase_table_emp values ('001','Tom','Admin'),('002','Alice','Student');
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20181105093349_1d78acd0-b9c7-40a9-853e-c8dc5ab9f658
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1541380449256_0003, Tracking URL = http://bensongtan.com:8088/proxy/application_1541380449256_0003/
Kill Command = /opt/hadoop-2.7.7/bin/hadoop job  -kill job_1541380449256_0003
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2018-11-05 09:34:03,500 Stage-3 map = 0%,  reduce = 0%
2018-11-05 09:34:08,685 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 3.35 sec
MapReduce Total cumulative CPU time: 3 seconds 350 msec
Ended Job = job_1541380449256_0003
MapReduce Jobs Launched: 
Stage-Stage-3: Map: 1   Cumulative CPU: 3.35 sec   HDFS Read: 11092 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 350 msec
OK
Time taken: 21.341 seconds

You can also open the web UI to check the job's result (http://192.168.0.11:8088/cluster/apps).
From the output above, inserting just two rows took 21 seconds. This is because Hive is built for processing large datasets; earlier versions did not even support single-row INSERT statements, only bulk loading, so on small amounts of data Hive's strengths simply don't show. Let's use explain (similar to MySQL's EXPLAIN) to analyze why:

hive> explain insert into table hbase_table_emp values ('001','Tom','Admin'),('002','Alice','Student');
OK
STAGE DEPENDENCIES:
  Stage-0 is a root stage
  Stage-2
  Stage-1 is a root stage
  Stage-3 is a root stage

STAGE PLANS:
  Stage: Stage-0
      Alter Table Operator:
        Alter Table
          type: drop props
          old name: default.hbase_table_emp
          properties:
            COLUMN_STATS_ACCURATE 

  Stage: Stage-2
      Insert operator:
        Insert

  Stage: Stage-1
      Pre Insert operator:
        Pre-Insert task

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: values__tmp__table__2
            Statistics: Num rows: 1 Data size: 32 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: UDFToInteger(tmp_values_col1) (type: int), tmp_values_col2 (type: string), tmp_values_col3 (type: string)
              outputColumnNames: _col0, _col1, _col2
              Statistics: Num rows: 1 Data size: 32 Basic stats: COMPLETE Column stats: NONE
              File Output Operator
                compressed: false
                Statistics: Num rows: 1 Data size: 32 Basic stats: COMPLETE Column stats: NONE
                table:
                    input format: org.apache.hadoop.hive.hbase.HiveHBaseTableInputFormat
                    output format: org.apache.hadoop.hive.hbase.HiveHBaseTableOutputFormat
                    serde: org.apache.hadoop.hive.hbase.HBaseSerDe
                    name: default.hbase_table_emp

Time taken: 0.706 seconds, Fetched: 42 row(s)

As the plan shows, Hive translates the statement into a series of map-reduce stages that may involve communication across multiple nodes, so even a trivial amount of data incurs a fixed minimum cost; only as the data volume grows does Hive's advantage appear.

Verifying the result in Hive:

hive> select * from hbase_table_emp;
OK
1    Tom    Admin
2    Alice    Student
Time taken: 0.205 seconds, Fetched: 2 row(s)
hive>

Verifying the result in HBase:

hbase(main):012:0> scan 'emp'
ROW                                                                  COLUMN+CELL                                                                                                                                                                                              
 1                                                                   column=cf1:name, timestamp=1541381647725, value=Tom                                                                                                                                                      
 1                                                                   column=cf1:role, timestamp=1541381647725, value=Admin                                                                                                                                                    
 2                                                                   column=cf1:name, timestamp=1541381647725, value=Alice                                                                                                                                                    
 2                                                                   column=cf1:role, timestamp=1541381647725, value=Student                                                                                                                                                  
2 row(s)
Took 0.0262 seconds

The above covers only the environment integration; I won't go into the rest of Hive's syntax here, which you can explore in the documentation yourself.

Last modification: November 5th, 2018 at 09:54 am