Hive + HBase + Hadoop Integration

Date: 2018-11-05 09:52 Category: hadoop

The integration of `HBase` with `Hadoop` was already covered in an earlier article, so I won't repeat it here: [HBase and Hadoop pseudo-distributed setup][1]. This post adds `Hive` on top of that environment.

First, download the stable release from the official site:

> apache-hive-2.3.3-bin.tar.gz

Extract it into `/opt`.

Configure the environment variables:

```
export HIVE_HOME=/opt/hive-2.3.3
export PATH=$PATH:$HIVE_HOME/bin
```

Enter the Hive configuration directory:

> \# cd $HIVE_HOME/conf

Create a new `hive-site.xml`:

```
<configuration>
  <property>
    <name>datanucleus.fixedDatastore</name>
    <value>false</value>
  </property>
  <property>
    <name>datanucleus.autoCreateSchema</name>
    <value>true</value>
  </property>
  <property>
    <name>datanucleus.autoCreateTables</name>
    <value>true</value>
  </property>
  <property>
    <name>datanucleus.autoCreateColumns</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
  </property>
  <property>
    <name>hive.aux.jars.path</name>
    <value>file:///opt/hive-2.3.3/lib/hive-hbase-handler-2.3.3.jar,file:///opt/hbase-2.0.2/lib/hbase-client-2.0.2.jar,file:///opt/hbase-2.0.2/lib/hbase-server-2.0.2.jar,file:///opt/hbase-2.0.2/lib/hbase-protocol-2.0.2.jar,file:///opt/hbase-2.0.2/lib/zookeeper-3.4.10.jar</value>
    <description>A comma separated list (with no spaces) of the jar files required for Hive-HBase integration</description>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>192.168.0.11</value>
    <description>A comma separated list (with no spaces) of the IP addresses of all ZooKeeper servers in the cluster.</description>
  </property>
</configuration>
```
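One silent pitfall with `hive.aux.jars.path` is that a wrong path only surfaces later, at query time. As a quick sanity check (a sketch only; the jar list simply mirrors the configuration above, so adjust it to your layout), you can verify that every `file://` URI actually points to an existing jar:

```shell
#!/bin/sh
# missing_jars: given the comma-separated file:// URI list used by
# hive.aux.jars.path, print every path that does not exist on disk.
missing_jars() {
  # printf '%s\n' guarantees a trailing newline so the last URI is not skipped
  printf '%s\n' "$1" | tr ',' '\n' | while IFS= read -r uri; do
    [ -n "$uri" ] || continue
    path=${uri#file://}
    [ -f "$path" ] || printf '%s\n' "$path"
  done
}

# List copied from the hive-site.xml above; edit to match your installation.
AUX_JARS="file:///opt/hive-2.3.3/lib/hive-hbase-handler-2.3.3.jar,\
file:///opt/hbase-2.0.2/lib/hbase-client-2.0.2.jar,\
file:///opt/hbase-2.0.2/lib/hbase-server-2.0.2.jar,\
file:///opt/hbase-2.0.2/lib/hbase-protocol-2.0.2.jar,\
file:///opt/hbase-2.0.2/lib/zookeeper-3.4.10.jar"

missing_jars "$AUX_JARS"
```

Empty output means every jar resolves; any printed path needs fixing before starting Hive.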
Some guides say you must copy `/opt/hive-2.3.3/lib/hive-hbase-handler-2.3.3.jar` into HBase's `lib` directory and copy every jar under HBase's `lib` into Hive's `lib`, but with the configuration above Hive started and created tables successfully without doing either. Others say to first delete every `hbase*.jar` under Hive's `lib` and then copy in all of HBase's jars; in my testing that produces `NoSuchMethod`-style errors. So don't follow those older blog posts: no HBase jars need to be copied at all, the configuration above is sufficient.

**For the record, the versions integrated here are `apache-hive-2.3.3-bin.tar.gz`, `hbase-2.0.2-bin.tar.gz`, and `hadoop-2.7.7.tar.gz`. Note that HBase has strict compatibility constraints on the Hadoop version; the official documentation describes them.**

Next, create a simple `hbase_table_emp` table in Hive and map it to an `emp` table in HBase; the `emp` table is created automatically:

```
> CREATE TABLE hbase_table_emp(id int, name string, role string)
> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
> WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:name,cf1:role")
> TBLPROPERTIES ("hbase.table.name" = "emp");
```

In HBase, check the automatically created `emp` table:

```
hbase(main):031:0> describe 'emp'
Table emp is ENABLED
emp
COLUMN FAMILIES DESCRIPTION
{NAME => 'cf1', VERSIONS => '1', EVICT_BLOCKS_ON_CLOSE => 'false', NEW_VERSION_BEHAVIOR => 'false', KEEP_DELETED_CELLS => 'FALSE', CACHE_DATA_ON_WRITE => 'false', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', REPLICATION_SCOPE => '0', BLOOMFILTER => 'ROW', CACHE_INDEX_ON_WRITE => 'false', IN_MEMORY => 'false', CACHE_BLOOMS_ON_WRITE => 'false', PREFETCH_BLOCKS_ON_OPEN => 'false', COMPRESSION => 'NONE', BLOCKCACHE => 'true', BLOCKSIZE => '65536'}
1 row(s)
Took 0.0229 seconds
```

Then insert two rows through Hive:

> \# insert into table hbase_table_emp values ('001','Tom','Admin'),('002','Alice','Student');

This fails:

```
hive> insert into table hbase_table_emp values ('001','Tom','Admin'),('002','Alice','Student');
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
```
```
Query ID = root_20181105093018_02b369cd-d3a6-4c02-8fc4-f55122111e17
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1541380449256_0002, Tracking URL = http://bensongtan.com:8088/proxy/application_1541380449256_0002/
Kill Command = /opt/hadoop-2.7.7/bin/hadoop job -kill job_1541380449256_0002
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2018-11-05 09:30:31,260 Stage-3 map = 0%, reduce = 0%
2018-11-05 09:30:47,697 Stage-3 map = 100%, reduce = 0%
Ended Job = job_1541380449256_0002 with errors
Error during job, obtaining debugging information...
Examining task ID: task_1541380449256_0002_m_000000 (and more) from job job_1541380449256_0002

Task with the most failures(4):
-----
Task ID:
  task_1541380449256_0002_m_000000

URL:
  http://0.0.0.0:8088/taskdetails.jsp?jobid=job_1541380449256_0002&tipid=task_1541380449256_0002_m_000000
-----
Diagnostic Messages for this Task:
Error: java.lang.ClassNotFoundException: org.apache.commons.lang3.NotImplementedException
	at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	at org.apache.hadoop.hive.hbase.HiveHBaseTableOutputFormat.getRecordWriter(HiveHBaseTableOutputFormat.java:105)
	at org.apache.hadoop.hive.ql.io.HivePassThroughOutputFormat.getHiveRecordWriter(HivePassThroughOutputFormat.java:65)
	at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getRecordWriter(HiveFileFormatUtils.java:286)
	at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:271)
	at org.apache.hadoop.hive.ql.exec.FileSinkOperator.createBucketForFileIdx(FileSinkOperator.java:619)
	at org.apache.hadoop.hive.ql.exec.FileSinkOperator.createBucketFiles(FileSinkOperator.java:563)
	at org.apache.hadoop.hive.ql.exec.FileSinkOperator.closeOp(FileSinkOperator.java:1026)
	at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:697)
	at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:711)
	at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:711)
	at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:711)
	at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.close(ExecMapper.java:189)
	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Stage-Stage-3: Map: 1   HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec
```

Inspecting `commons-lang3-3.1.jar` under `$HIVE_HOME/lib` confirms that it really does not contain this class, while the `commons-lang3-3.6.jar` in HBase's `lib` directory does include `NotImplementedException`. Deleting the older jar from Hive and copying over HBase's 3.6 version solved the problem, and the insert finally succeeds:

```
hive> insert into table hbase_table_emp values ('001','Tom','Admin'),('002','Alice','Student');
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
```
```
Query ID = root_20181105093349_1d78acd0-b9c7-40a9-853e-c8dc5ab9f658
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1541380449256_0003, Tracking URL = http://bensongtan.com:8088/proxy/application_1541380449256_0003/
Kill Command = /opt/hadoop-2.7.7/bin/hadoop job -kill job_1541380449256_0003
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2018-11-05 09:34:03,500 Stage-3 map = 0%, reduce = 0%
2018-11-05 09:34:08,685 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 3.35 sec
MapReduce Total cumulative CPU time: 3 seconds 350 msec
Ended Job = job_1541380449256_0003
MapReduce Jobs Launched:
Stage-Stage-3: Map: 1   Cumulative CPU: 3.35 sec   HDFS Read: 11092 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 350 msec
OK
Time taken: 21.341 seconds
```

We can also check the job's result in the web UI (http://192.168.0.11:8088/cluster/apps).

Notice from the output above that inserting just two rows took 21 seconds. Hive is built for large-scale data processing; earlier versions did not even support single-row inserts, only bulk loads, so on tiny datasets its strengths simply don't show. Let's use `explain` (similar to MySQL's `explain`) to see why:

```
hive> explain insert into table hbase_table_emp values ('001','Tom','Admin'),('002','Alice','Student');
OK
STAGE DEPENDENCIES:
  Stage-0 is a root stage
  Stage-2
  Stage-1 is a root stage
  Stage-3 is a root stage

STAGE PLANS:
  Stage: Stage-0
    Alter Table Operator:
      Alter Table
        type: drop props
        old name: default.hbase_table_emp
        properties:
          COLUMN_STATS_ACCURATE

  Stage: Stage-2
    Insert operator:
      Insert

  Stage: Stage-1
    Pre Insert operator:
      Pre-Insert task

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: values__tmp__table__2
            Statistics: Num rows: 1 Data size: 32 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: UDFToInteger(tmp_values_col1) (type: int), tmp_values_col2 (type: string), tmp_values_col3 (type: string)
              outputColumnNames: _col0, _col1, _col2
              Statistics: Num rows: 1 Data size: 32 Basic stats: COMPLETE Column stats: NONE
              File Output Operator
                compressed: false
```
```
                Statistics: Num rows: 1 Data size: 32 Basic stats: COMPLETE Column stats: NONE
                table:
                    input format: org.apache.hadoop.hive.hbase.HiveHBaseTableInputFormat
                    output format: org.apache.hadoop.hive.hbase.HiveHBaseTableOutputFormat
                    serde: org.apache.hadoop.hive.hbase.HBaseSerDe
                    name: default.hbase_table_emp

Time taken: 0.706 seconds, Fetched: 42 row(s)
```

As the plan shows, Hive translates the statement into several map-reduce stages, possibly communicating across multiple nodes, so even a tiny insert carries a fixed minimum overhead; the advantages only emerge as the data volume grows.

Verifying from Hive:

```
hive> select * from hbase_table_emp;
OK
1	Tom	Admin
2	Alice	Student
Time taken: 0.205 seconds, Fetched: 2 row(s)
hive>
```

Verifying from HBase:

```
hbase(main):012:0> scan 'emp'
ROW                   COLUMN+CELL
 1                    column=cf1:name, timestamp=1541381647725, value=Tom
 1                    column=cf1:role, timestamp=1541381647725, value=Admin
 2                    column=cf1:name, timestamp=1541381647725, value=Alice
 2                    column=cf1:role, timestamp=1541381647725, value=Student
2 row(s)
Took 0.0262 seconds
```

This covers only the environment integration; for the rest of Hive's syntax, consult the documentation at your own pace.

[1]: https://0o0.me/java/hadoop-hbase-install.html

Tags: hadoop hbase hive
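As a closing appendix, here is a toy sketch of what the `hbase.columns.mapping` value used above means: the first entry (`:key`) becomes the HBase row key, and each remaining `family:qualifier` entry pairs positionally with a Hive column. This only illustrates the resulting cell layout (printed as HBase shell `put` commands); it is not how the storage handler actually writes data:

```shell
#!/bin/sh
# row_to_puts: show which HBase cells one Hive row produces under the mapping
# ":key,cf1:name,cf1:role" by printing equivalent HBase shell put commands.
MAPPING=":key,cf1:name,cf1:role"
TABLE="emp"

row_to_puts() {  # arguments: one value per Hive column, in declaration order
  i=1
  key=""
  for col in $(printf '%s\n' "$MAPPING" | tr ',' ' '); do
    eval "val=\$$i"          # pick the i-th positional value
    if [ "$col" = ":key" ]; then
      key=$val               # :key maps to the row key, not to a cell
    else
      printf "put '%s', '%s', '%s', '%s'\n" "$TABLE" "$key" "$col" "$val"
    fi
    i=$((i + 1))
  done
}

row_to_puts 1 Tom Admin
```

The printed `put` commands correspond to the cells that `scan 'emp'` showed above for row 1.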