Hive + HBase + Hadoop Integration
I have already written about integrating the HBase and Hadoop environments, so I won't repeat that here; see:
HBase and Hadoop pseudo-distributed environment setup
This post adds Hive on top of that existing environment.
First, download the stable release from the official site:

apache-hive-2.3.3-bin.tar.gz

Extract it to the /opt directory.
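As a sketch, the extract step can look like this; the TARBALL and DEST values are assumptions based on this post's layout and should be adjusted to your own paths:

```shell
# Sketch of the extract-and-rename step (paths are assumptions from this post).
TARBALL=${TARBALL:-apache-hive-2.3.3-bin.tar.gz}
DEST=${DEST:-/opt}
tar -zxf "$TARBALL" -C "$DEST"
# The archive unpacks as apache-hive-2.3.3-bin; rename it to match $HIVE_HOME below.
mv "$DEST/apache-hive-2.3.3-bin" "$DEST/hive-2.3.3"
```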
Configure the environment variables:

export HIVE_HOME=/opt/hive-2.3.3
export PATH=$PATH:$HIVE_HOME/bin

Enter the Hive configuration directory:

# cd $HIVE_HOME/conf

Create a new hive-site.xml file:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?><!--
   Licensed to the Apache Software Foundation (ASF) under one or more
   contributor license agreements.  See the NOTICE file distributed with
   this work for additional information regarding copyright ownership.
   The ASF licenses this file to You under the Apache License, Version 2.0
   (the "License"); you may not use this file except in compliance with
   the License.  You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
-->
<configuration>

  <property>
    <name>datanucleus.fixedDatastore</name>
    <value>false</value>
  </property>
  <property>
    <name>datanucleus.autoCreateSchema</name>
    <value>true</value>
  </property>
  <property>
    <name>datanucleus.autoCreateTables</name>
    <value>true</value>
  </property>
  <property>
    <name>datanucleus.autoCreateColumns</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
  </property>

  <property>
    <name>hive.aux.jars.path</name>
    <value>file:///opt/hive-2.3.3/lib/hive-hbase-handler-2.3.3.jar,file:///opt/hbase-2.0.2/lib/hbase-client-2.0.2.jar,file:///opt/hbase-2.0.2/lib/hbase-server-2.0.2.jar,file:///opt/hbase-2.0.2/lib/hbase-protocol-2.0.2.jar,file:///opt/hbase-2.0.2/lib/zookeeper-3.4.10.jar</value>
    <description>A comma separated list (with no spaces) of the jar files required for Hive-HBase integration</description>
  </property>

  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>192.168.0.11</value>
    <description>A comma separated list (with no spaces) of the IP addresses of all ZooKeeper servers in the cluster.</description>
  </property>

</configuration>

Some posts say you must copy /opt/hive-2.3.3/lib/hive-hbase-handler-2.3.3.jar into HBase's lib directory, and copy all the jars under HBase's lib into Hive's lib; however, with the configuration above, Hive starts successfully and creates tables without any of that.
Others say you should first delete all the hbase-prefixed jars under Hive's lib and then copy in every jar from HBase's lib; in my tests that produced NoSuchMethodError-style failures. So don't follow those older blog posts for this integration: I verified that no HBase jars need to be copied at all, and the configuration above is sufficient.

For reference, the Hive, HBase, and Hadoop versions used in this integration are:

apache-hive-2.3.3-bin.tar.gz
hbase-2.0.2-bin.tar.gz
hadoop-2.7.7.tar.gz

Note that HBase and Hadoop have explicit version-compatibility constraints, which the official documentation describes.

Now let's create a simple table hbase_table_emp in Hive and map it to an emp table in HBase; the emp table is created automatically:

> CREATE TABLE hbase_table_emp(id int, name string, role string) 
> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
> WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:name,cf1:role")
> TBLPROPERTIES ("hbase.table.name" = "emp");

Viewing the automatically created emp table in HBase:

hbase(main):031:0> describe 'emp'
Table emp is ENABLED                                                                                                                                                                                                                                                          
emp                                                                                                                                                                                                                                                                           
COLUMN FAMILIES DESCRIPTION                                                                                                                                                                                                                                                   
{NAME => 'cf1', VERSIONS => '1', EVICT_BLOCKS_ON_CLOSE => 'false', NEW_VERSION_BEHAVIOR => 'false', KEEP_DELETED_CELLS => 'FALSE', CACHE_DATA_ON_WRITE => 'false', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', REPLICATION_SCOPE => '0', BLOOMFILTER
 => 'ROW', CACHE_INDEX_ON_WRITE => 'false', IN_MEMORY => 'false', CACHE_BLOOMS_ON_WRITE => 'false', PREFETCH_BLOCKS_ON_OPEN => 'false', COMPRESSION => 'NONE', BLOCKCACHE => 'true', BLOCKSIZE => '65536'}                                                                    
1 row(s)
Took 0.0229 seconds

Then insert two rows through Hive:

hive> insert into table hbase_table_emp values ('001','Tom','Admin'),('002','Alice','Student');

This fails with an error:

hive> insert into table hbase_table_emp values ('001','Tom','Admin'),('002','Alice','Student');
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20181105093018_02b369cd-d3a6-4c02-8fc4-f55122111e17
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1541380449256_0002, Tracking URL = http://bensongtan.com:8088/proxy/application_1541380449256_0002/
Kill Command = /opt/hadoop-2.7.7/bin/hadoop job  -kill job_1541380449256_0002
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2018-11-05 09:30:31,260 Stage-3 map = 0%,  reduce = 0%
2018-11-05 09:30:47,697 Stage-3 map = 100%,  reduce = 0%
Ended Job = job_1541380449256_0002 with errors
Error during job, obtaining debugging information...
Examining task ID: task_1541380449256_0002_m_000000 (and more) from job job_1541380449256_0002

Task with the most failures(4): 
-----
Task ID:
  task_1541380449256_0002_m_000000

URL:
  http://0.0.0.0:8088/taskdetails.jsp?jobid=job_1541380449256_0002&tipid=task_1541380449256_0002_m_000000
-----
Diagnostic Messages for this Task:
Error: java.lang.ClassNotFoundException: org.apache.commons.lang3.NotImplementedException
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at org.apache.hadoop.hive.hbase.HiveHBaseTableOutputFormat.getRecordWriter(HiveHBaseTableOutputFormat.java:105)
    at org.apache.hadoop.hive.ql.io.HivePassThroughOutputFormat.getHiveRecordWriter(HivePassThroughOutputFormat.java:65)
    at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getRecordWriter(HiveFileFormatUtils.java:286)
    at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:271)
    at org.apache.hadoop.hive.ql.exec.FileSinkOperator.createBucketForFileIdx(FileSinkOperator.java:619)
    at org.apache.hadoop.hive.ql.exec.FileSinkOperator.createBucketFiles(FileSinkOperator.java:563)
    at org.apache.hadoop.hive.ql.exec.FileSinkOperator.closeOp(FileSinkOperator.java:1026)
    at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:697)
    at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:711)
    at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:711)
    at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:711)
    at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.close(ExecMapper.java:189)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)


FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched: 
Stage-Stage-3: Map: 1   HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec

Downloading commons-lang3-3.1.jar from $HIVE_HOME/lib and inspecting it confirmed that this class really is missing there, whereas commons-lang3-3.6.jar in HBase's lib directory does contain NotImplementedException. Deleting the older version from Hive's lib and copying in the 3.6 version from HBase resolved the problem. The successful run:

hive> insert into table hbase_table_emp values ('001','Tom','Admin'),('002','Alice','Student');
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20181105093349_1d78acd0-b9c7-40a9-853e-c8dc5ab9f658
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1541380449256_0003, Tracking URL = http://bensongtan.com:8088/proxy/application_1541380449256_0003/
Kill Command = /opt/hadoop-2.7.7/bin/hadoop job  -kill job_1541380449256_0003
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2018-11-05 09:34:03,500 Stage-3 map = 0%,  reduce = 0%
2018-11-05 09:34:08,685 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 3.35 sec
MapReduce Total cumulative CPU time: 3 seconds 350 msec
Ended Job = job_1541380449256_0003
MapReduce Jobs Launched: 
Stage-Stage-3: Map: 1   Cumulative CPU: 3.35 sec   HDFS Read: 11092 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 350 msec
OK
Time taken: 21.341 seconds

You can also open the web UI to check the job's result (http://192.168.0.11:8088/cluster/apps).
From the output above, inserting just two rows took 21 seconds. This is because Hive is built for processing large datasets; earlier versions did not even support single-row INSERT statements, only bulk loading, so on small amounts of data Hive's strengths simply don't show. Let's use explain (similar to MySQL's EXPLAIN) to analyze why:

hive> explain insert into table hbase_table_emp values ('001','Tom','Admin'),('002','Alice','Student');
OK
STAGE DEPENDENCIES:
  Stage-0 is a root stage
  Stage-2
  Stage-1 is a root stage
  Stage-3 is a root stage

STAGE PLANS:
  Stage: Stage-0
      Alter Table Operator:
        Alter Table
          type: drop props
          old name: default.hbase_table_emp
          properties:
            COLUMN_STATS_ACCURATE 

  Stage: Stage-2
      Insert operator:
        Insert

  Stage: Stage-1
      Pre Insert operator:
        Pre-Insert task

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: values__tmp__table__2
            Statistics: Num rows: 1 Data size: 32 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: UDFToInteger(tmp_values_col1) (type: int), tmp_values_col2 (type: string), tmp_values_col3 (type: string)
              outputColumnNames: _col0, _col1, _col2
              Statistics: Num rows: 1 Data size: 32 Basic stats: COMPLETE Column stats: NONE
              File Output Operator
                compressed: false
                Statistics: Num rows: 1 Data size: 32 Basic stats: COMPLETE Column stats: NONE
                table:
                    input format: org.apache.hadoop.hive.hbase.HiveHBaseTableInputFormat
                    output format: org.apache.hadoop.hive.hbase.HiveHBaseTableOutputFormat
                    serde: org.apache.hadoop.hive.hbase.HBaseSerDe
                    name: default.hbase_table_emp

Time taken: 0.706 seconds, Fetched: 42 row(s)

As the plan shows, Hive translates the statement into a series of map-reduce stages that may involve communication across multiple nodes, so even a trivial amount of data incurs a fixed minimum cost; only as the data volume grows does Hive's advantage appear.

Verifying the result in Hive:

hive> select * from hbase_table_emp;
OK
1    Tom    Admin
2    Alice    Student
Time taken: 0.205 seconds, Fetched: 2 row(s)
hive>

Verifying the result in HBase:

hbase(main):012:0> scan 'emp'
ROW                                                                  COLUMN+CELL                                                                                                                                                                                              
 1                                                                   column=cf1:name, timestamp=1541381647725, value=Tom                                                                                                                                                      
 1                                                                   column=cf1:role, timestamp=1541381647725, value=Admin                                                                                                                                                    
 2                                                                   column=cf1:name, timestamp=1541381647725, value=Alice                                                                                                                                                    
 2                                                                   column=cf1:role, timestamp=1541381647725, value=Student                                                                                                                                                  
2 row(s)
Took 0.0262 seconds

The above covers only the environment integration; I won't go into the rest of Hive's syntax here, which you can explore in the documentation yourself.

Last modification: November 5th, 2018 at 09:54 am