One of the things I’m frequently asked about is how to use HBase from
Apache Hive. Not just how to do it, but what works, how well it works,
and how to make good use of it. I’ve done a bit of research in this area, so
hopefully this will be useful to someone besides myself. This is a topic that
we did not get to cover in HBase in Action; perhaps these notes will become the
basis for the 2nd edition ;) These notes are applicable to Hive 0.11.x used in
conjunction with HBase 0.94.x. They should be largely applicable to 0.12.x and
0.96.x, though I haven’t tested everything yet.
The Hive project includes an optional library for
interacting with HBase. This is where the bridge layer between the two systems
is implemented. The primary interface you use when accessing HBase from Hive
queries is called the HBaseStorageHandler. You can
also interact with HBase tables directly via Input and Output formats, but
the handler is simpler and works for most uses.
HBase tables from Hive
Use the HBaseStorageHandler to register HBase tables
with the Hive metastore. You can optionally specify the HBase table as
EXTERNAL, in which case Hive will not create or drop that
table directly – you’ll have to use the HBase shell to do so.
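A minimal registration looks something like the following sketch; the column
list and the hbase.columns.mapping property are placeholders here, and are
covered properly in the schema mapping section below.

CREATE TABLE foo(rowkey STRING, a STRING, b STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,f:c1,f:c2')
TBLPROPERTIES ('hbase.table.name' = 'bar');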
The above statement registers the HBase table named bar in the Hive
metastore, accessible from Hive by the name foo.
Under the hood, HBaseStorageHandler is delegating
interaction with the HBase table to
HiveHBaseTableInputFormat and
HiveHBaseTableOutputFormat. You can register
your HBase table in Hive using those classes directly if you desire. The above
statement is roughly equivalent to:
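Sketched here from the class names above; the storage handler also wires up the
HBase SerDe and job configuration, so this form is only approximately the same.

CREATE TABLE foo(rowkey STRING, a STRING, b STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.hbase.HBaseSerDe'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,f:c1,f:c2')
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.hbase.HiveHBaseTableInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.hbase.HiveHBaseTableOutputFormat'
TBLPROPERTIES ('hbase.table.name' = 'bar');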
Also provided is the HiveHFileOutputFormat which
means it should be possible to generate HFiles for bulkloading from Hive as
well. In practice, I haven’t gotten this to work end-to-end (see
HIVE-4627).
Schema mapping
Registering the table is only the first step. As part of that registration, you
also need to specify a column mapping. This is how you link Hive column names
to the HBase table’s rowkey and columns. Do so using the
hbase.columns.mapping SerDe property.
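Here is the same registration again, this time with the mapping spelled out;
the table and column names follow the description that comes next.

CREATE TABLE foo(rowkey STRING, a STRING, b STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,f:c1,f:c2')
TBLPROPERTIES ('hbase.table.name' = 'bar');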
The values provided in the mapping property correspond one-for-one with column
names of the Hive table. HBase column names are fully qualified by column
family, and you use the special token :key to represent the rowkey. The above
example makes rows from the HBase table bar available via the Hive table
foo. The foo column rowkey maps to the HBase table’s rowkey, a to
c1 in the f column family, and b to c2, also in the f family.
You can also associate Hive’s MAP data structures to HBase column families.
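For example, mapping an entire column family onto a Hive MAP looks something
like this (note the trailing colon with no qualifier); every column in the f
family then shows up as a map entry keyed by its qualifier.

CREATE TABLE foo(rowkey STRING, stuff map<STRING,STRING>)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,f:')
TBLPROPERTIES ('hbase.table.name' = 'bar');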
In this case, only the STRING Hive type is used. The other Hive type
currently supported is BINARY. See the wiki page for more examples.
Interacting with data
With the column mappings defined, you can now access HBase data just like you
would any other Hive data. Only simple query predicates are currently
supported.
SELECT * FROM foo WHERE ...;
You can also populate an HBase table using Hive. This works with both
INTO and OVERWRITE clauses.
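A minimal sketch of such a write; source_hive_table is a hypothetical
Hive-native table whose three columns line up with foo’s rowkey, a, and b.

INSERT OVERWRITE TABLE foo
SELECT s.id, s.val_one, s.val_two
FROM source_hive_table s;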
Be advised that there is a regression in Hive 0.12.0 which breaks this feature,
see HIVE-5515.
In practice
There’s still a little finesse required to get everything wired up properly at
runtime. The HBase interaction module is completely optional, so you have to
make sure it and its HBase dependencies are available on Hive’s classpath.
The installation environment could do a better job of handling this for users,
but for the time being you must manage it yourself. Ideally, the hive bin
script could detect the presence of HBase and automatically make the necessary
CLASSPATH adjustments. This enhancement appears to be tracked in
HIVE-2055. The last mile is provided by the distribution itself,
ensuring the environment variables are set for hive. This functionality is
provided by BIGTOP-955.
You also need to make sure the necessary jars are shipped out to the MapReduce
jobs when you execute your Hive statements. Hive provides a mechanism for
shipping additional job dependencies via the auxjars feature.
$ export HIVE_AUX_JARS_PATH=...
$ hive -e "SELECT * FROM ..."
I did discover a small bug in HDP-1.3 builds which masks user-specified values
of HIVE_AUX_JARS_PATH. With administrative rights, this is easily fixed by
correcting the line in hive-env.sh to respect an existing value. The
work-around in user scripts is to use the SET statement to
provide a value once you’ve launched the Hive CLI.
SET hive.aux.jars.path=...
Hive should be able to detect which jars are necessary and add them itself.
HBase provides the TableMapReduceUtil#addDependencyJars
methods for this purpose. It appears that this is done in hive-0.12.0, at least
according to HIVE-2379.
Future work
Much has been said about proper support for predicate pushdown
(HIVE-1643, HIVE-2854, HIVE-3617,
HIVE-3684) and data type awareness (HIVE-1245,
HIVE-2599). These go hand-in-hand as predicate semantics are
defined in terms of the types upon which they operate. More could be done to
map Hive’s complex data types like Maps and Structs onto HBase column families
as well (HIVE-3211). Support for HBase timestamps is a bit of a
mess; they’re not made available to Hive applications with any level of
granularity (HIVE-2828, HIVE-2306). The only
interaction a user has is via a storage handler setting for writing a custom
timestamp with all operations.
From a performance perspective, there are things Hive can do today (i.e., not
dependent on data types) to take advantage of HBase. There’s also the
possibility of an HBase-aware Hive making use of HBase tables as an intermediate
storage location (HIVE-3565), facilitating map-side joins against
dimension tables loaded into HBase. Hive could make use of HBase’s natural
indexed structure (HIVE-3634, HIVE-3727), potentially
saving huge scans.
Currently, the user doesn’t have (any?) control over the
scans which are executed. Configuration on a per-job, or at least per-table
basis should be enabled (HIVE-1233). That would enable an
HBase-savvy user to provide Hive with hints regarding how it should interact
with HBase. Support for simple split sampling of HBase tables
(HIVE-3399) could also be easily done because HBase manages table
partitions already.
Other access channels
Everything discussed thus far has required Hive to interact with online HBase
RegionServers. Applications may stand to gain significant throughput and enjoy
greater flexibility by interacting directly with HBase data persisted to HDFS.
This also has the benefit of preventing Hive workloads from interfering with
online SLA-bound HBase applications (at least, until we see HBase improvements
in QOS isolation between tasks, HBASE-4441).
As mentioned earlier, there is the
HiveHFileOutputFormat. Resolving
HIVE-4627 should make Hive a straightforward way to generate
HFiles for bulk loading. Once you’ve created the HFiles using Hive, there’s
still the last step of running the
LoadIncrementalHFiles utility to copy and register
them in the regions. For this, the HiveStorageHandler
interface will need some kind of hook to influence the query plan as it’s
created, allowing it to append steps. Once in place, it should be possible to
SET a runtime flag, switching an INSERT
operation to use bulkload.
HBase recently introduced the table snapshot feature. This allows
a user to create a point-in-time view of a table, persisted to HDFS.
HBase is able to restore a table from a snapshot to a previous state, and to
create an entirely new table from an existing snapshot. Hive does not currently
support reading from an HBase snapshot. For that matter, HBase doesn’t yet
support MapReduce jobs over snapshots, though the feature is a work in progress
(HBASE-8369).
Conclusions
The interface between HBase and Hive is young, but has nice potential. There’s
a lot of low-hanging fruit that can be picked up to make things easier and
faster. The most glaring issue barring real application development is the
impedance mismatch between Hive’s typed, dense schema and HBase’s untyped,
sparse schema. This is as much a cognitive problem as technical issue.
Solutions here would allow a number of improvements to fall out, including much
in the way of performance improvements. I’m hopeful that continuing work to add
data types to HBase (HBASE-8089) can help bridge this gap.
Basic operations mostly work, at least in a rudimentary way. You can read data
out of and write data back into HBase using Hive. Configuring the environment
is an opaque and manual process, one which likely stymies novices from adopting
the tools. There’s also the question of bulk operations – support for writing
HFiles and reading HBase snapshots using Hive is entirely lacking at this
point. And of course, there are bugs sprinkled throughout. The biggest recent
improvement is the deprecation of HCatalog’s interface, removing the need for
an upfront decision regarding which interface to use.
Hive provides a very usable SQL interface on top of HBase, one which integrates
easily into many existing ETL workflows. That interface requires simplifying
some of the BigTable semantics HBase provides, but the result will be to open
up HBase to a much broader audience of users. The Hive interop complements the
experience provided by Phoenix extremely well. Hive has the benefit of not
requiring the deployment complexities currently required by that system.
Hopefully the common definition of types will allow a complementary future.
Grab some data and register it in Hive
We’ll need some data to work with. For this purpose, grab some
traffic stats from Wikipedia. Once we have some data, copy it up
to HDFS.
$ mkdir pagecounts ; cd pagecounts
$ for x in {0..9} ; do wget "http://dumps.wikimedia.org/other/pagecounts-raw/2008/2008-10/pagecounts-20081001-0${x}0000.gz" ; done
$ hadoop fs -copyFromLocal $(pwd) ./
For reference, this is what the data looks like.
$ zcat pagecounts-20081001-000000.gz | head -n5
aa.b Special:Statistics 1 837
aa Main_Page 4 41431
aa Special:ListUsers 1 5555
aa Special:Listusers 1 1052
aa Special:PrefixIndex/Comparison_of_Guaze%27s_Law_and_Coulomb%27s_Law 1 4332
As I understand it, each record is a count of page views of a specific page on
Wikipedia. The first column is the language code, second is the
page name, third is the number of page views, and fourth is the size of the
page in bytes. Each file contains an hour’s worth of aggregated data. None of
the above pages were particularly popular that hour.
Now that we have data and understand its raw schema, create a Hive table over
it. To do that, we’ll use a DDL script that looks like this.
$ cat 00_pagecounts.ddl
-- define an external table over raw pagecounts data
CREATE TABLE IF NOT EXISTS pagecounts (projectcode STRING, pagename STRING, pageviews STRING, bytes STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/ndimiduk/pagecounts';
Run the script to register our dataset with Hive.
$ hive -f 00_pagecounts.ddl
OK
Time taken: 2.268 seconds
Verify that the schema mapping works by calculating a simple statistic over the
dataset.
$ hive -e "SELECT count(*) FROM pagecounts;"Total MapReduce jobs= 1
Launching Job 1 out of 1
...
OK
36668549
Time taken: 25.31 seconds, Fetched: 1 row(s)
Hive says the 10 files we downloaded contain just over 36.6 million records. Let’s
just confirm things are working as expected by getting a second opinion. This
isn’t that much data, so confirm on the command line.
$ zcat * | wc -l
36668549
The record counts match up – excellent.
Transform the schema for HBase
The next step is to transform the raw data into a schema that makes sense for
HBase. In our case, we’ll create a schema that allows us to calculate aggregate
summaries of pages according to their titles. To do this, we want all the data
for a single page grouped together. We’ll manage that by creating a Hive view
that represents our target HBase schema. Here’s the DDL.
$ cat 01_pgc.ddl
-- create a view, building a custom hbase rowkey
CREATE VIEW IF NOT EXISTS pgc (rowkey, pageviews, bytes)
AS SELECT
  concat_ws('/',
    projectcode,
    concat_ws('/',
      pagename,
      regexp_extract(INPUT__FILE__NAME, 'pagecounts-(\\d{8}-\\d{6})\\..*$', 1))),
  pageviews,
  bytes
FROM pagecounts;
The SELECT statement uses Hive to build a compound rowkey for HBase. It
concatenates the project code, page name, and date, joined by the '/'
character. A handy trick: it uses a simple regex to extract the date from the
source file names. Run it now.
$ hive -f 01_pgc.ddl
OK
Time taken: 2.712 seconds
This is just a view, so the SELECT statement won’t be evaluated until we
query it for data. Registering it with Hive doesn’t actually process any data.
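If you want to sanity-check the date-extraction trick on its own, a hypothetical
one-off query (not part of the walkthrough scripts) against a literal file name
should return 20081001-000000.

SELECT regexp_extract(
    'hdfs:///user/ndimiduk/pagecounts/pagecounts-20081001-000000.gz',
    'pagecounts-(\\d{8}-\\d{6})\\..*$', 1)
FROM pagecounts LIMIT 1;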
Again, make sure it works by querying Hive for a subset of the data.
$hive-e"SELECT * FROM pgc WHERE rowkey LIKE 'en/q%' LIMIT 10;"TotalMapReducejobs=1LaunchingJob1outof1...OKen/q:Special:Search/Blues/20081001-09000011168en/q:Special:Search/rock/20081001-0900001985en/qadam_rasul/20081001-09000011108en/qarqay/20081001-0900001933en/qemu/20081001-09000011144en/qian_lin/20081001-0900001918en/qiang_(spear)/20081001-0900001973en/qin_dynasty/20081001-09000011120en/qinghe_special_steel_corporation_disaster/20081001-0900001963en/qmail/20081001-09000011146Timetaken:40.382seconds,Fetched:10row(s)
Register the HBase table
Now that we have a dataset in Hive, it’s time to introduce HBase. The first
step is to register our HBase table in Hive so that we can interact with it
using Hive queries. That means another DDL statement. Here’s what it looks
like.
$ cat 02_pagecounts_hbase.ddl
-- create a table in hbase to host the view
CREATE TABLE IF NOT EXISTS pagecounts_hbase (rowkey STRING, pageviews STRING, bytes STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,f:c1,f:c2')
TBLPROPERTIES ('hbase.table.name' = 'pagecounts');
This statement will tell Hive to go create an HBase table named pagecounts
with the single column family f. It registers that HBase table in the Hive
metastore by the name pagecounts_hbase with 3 columns: rowkey, pageviews,
and bytes. The SerDe property hbase.columns.mapping makes the association
from Hive column to HBase column. It says the Hive column rowkey is mapped to
the HBase table’s rowkey, the Hive column pageviews to the HBase column
f:c1, and bytes to the HBase column f:c2. To keep the example simple, we
have Hive treat all these columns as the STRING type.
In order to use the HBase library, we need to make the HBase jars and
configuration available to the local Hive process (at least until
HIVE-5518 is resolved). Do that by specifying a value for the
HADOOP_CLASSPATH environment variable before executing the statement.
$ export HADOOP_CLASSPATH=/etc/hbase/conf:/usr/lib/hbase/hbase-0.94.6.1.3.2.0-111-security.jar:/usr/lib/zookeeper/zookeeper.jar
$ hive -f 02_pagecounts_hbase.ddl
OK
Time taken: 4.399 seconds
Populate the HBase table
Now it’s time to write data to HBase. This is done using a regular Hive
INSERT statement, sourcing data from the view with SELECT. There’s one more
bit of administration we need to take care of though. This INSERT statement
will run a mapreduce job that writes data to HBase. That means we need to tell
Hive to ship the HBase jars and dependencies with the job. Note that this is a separate step from the classpath modification we did
previously. Normally you can do this with an export statement from the shell,
the same way we specified the HADOOP_CLASSPATH. However there’s a bug in
HDP-1.3 that requires me to use Hive’s SET statement in the script instead.
$ cat 03_populate_hbase.hql
-- ensure hbase dependency jars are shipped with the MR job
-- Should export HIVE_AUX_JARS_PATH but this is broken in HDP-1.3.x
SET hive.aux.jars.path=file:///etc/hbase/conf/hbase-site.xml,file:///usr/lib/hive/lib/hive-hbase-handler-0.11.0.1.3.2.0-111.jar,file:///usr/lib/hbase/hbase-0.94.6.1.3.2.0-111-security.jar,file:///usr/lib/zookeeper/zookeeper-3.4.5.1.3.2.0-111.jar;

-- populate our hbase table
FROM pgc
INSERT INTO TABLE pagecounts_hbase
SELECT pgc.* WHERE rowkey LIKE 'en/q%' LIMIT 10;
Note there’s a big ugly bug in Hive 0.12.0 which means this
doesn’t work with that version. Never fear though, we have a patch in progress.
Follow along at HIVE-5515.
If you choose to use a different method for setting Hive’s auxpath, be advised
that it’s a tricky process – depending on how you specify it
(HIVE_AUX_JARS_PATH, --auxpath), Hive will interpret the argument
differently. HIVE-2349 seeks to remedy this unfortunate state of
affairs.
$ hive -f 03_populate_hbase.hql
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
...
OK
Time taken: 40.296 seconds
Be advised also that this step is currently broken on secured HBase
deployments. Follow along at HIVE-5523 if that’s of interest to you.
Query data from HBase-land
40 seconds later, you now have data in HBase. Let’s have a look using the HBase
shell.
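The full scan output isn’t reproduced here; a quick look from the shell amounts
to something like this (output elided).

$ hbase shell
hbase(main):001:0> scan 'pagecounts'
...
10 row(s)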
Here we have 10 rows with two columns each containing the data loaded using
Hive. It’s now accessible in your online world using HBase. For example,
perhaps you receive an updated data file and have a corrected value for one of
the stats. You can update the record in HBase with a regular PUT command.
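For instance, a put against one of the rowkeys returned by the earlier Hive
query; the value '2' here is an invented correction, just for illustration.

hbase(main):002:0> put 'pagecounts', 'en/q:Special:Search/rock/20081001-090000', 'f:c1', '2'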
Verify the data from Hive
The HBase table remains available to your Hive world; Hive’s
HBaseStorageHandler works both ways, after all. Note that the command below
expects HADOOP_CLASSPATH to still be set, and HIVE_AUX_JARS_PATH as well if
your query is complex.
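A minimal check is a plain select from the Hive CLI, something like:

$ hive -e "SELECT * FROM pagecounts_hbase;"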
Since the HBase table is accessible from Hive, you can continue to use Hive for
your ETL processing with mapreduce. Keep in mind that the auxpath
considerations apply here too, so I’ve scripted out the query instead of just
running it directly at the command line.
$ cat 04_query_hbase.hql
-- ensure hbase dependency jars are shipped with the MR job
-- Should export HIVE_AUX_JARS_PATH but this is broken in HDP-1.3.x
SET hive.aux.jars.path=file:///etc/hbase/conf/hbase-site.xml,file:///usr/lib/hive/lib/hive-hbase-handler-0.11.0.1.3.2.0-111.jar,file:///usr/lib/hbase/hbase-0.94.6.1.3.2.0-111-security.jar,file:///usr/lib/zookeeper/zookeeper-3.4.5.1.3.2.0-111.jar;

-- query hive data
SELECT count(*) from pagecounts_hbase;
Run it the same way we did the others.
$ hive -f 04_query_hbase.hql
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
...
OK
10
Time taken: 19.473 seconds, Fetched: 1 row(s)
There you have it: a hands-on, end-to-end demonstration of interacting with
HBase from Hive. You can learn more about the nitty-gritty details in Enis’s
deck on the topic, or see the presentation he and Ashutosh
gave at HBaseCon. If you’re inclined to make the intersection of these
technologies work better (faster, stronger), I encourage you to pick up any of
the JIRA issues mentioned in this post or the previous.