[[hbase]]
HBase
~~~~~

[.tss-center.tss-width-250]
image:http://hbase.apache.org/images/hbase_logo.png[link="http://hbase.apache.org"]

[quote,'http://hbase.apache.org/[Apache HBase Homepage]']
Apache HBase is an open-source, distributed, versioned, non-relational database modeled after http://static.googleusercontent.com/media/research.google.com/en/us/archive/bigtable-osdi06.pdf[Google's Bigtable: A Distributed Storage System for Structured Data by Chang et al.] Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.

HBase Setup
^^^^^^^^^^^

The following sections outline the various ways in which Titan can be used in concert with HBase.

Local Server Mode
+++++++++++++++++

image:titan-modes-local.png[]

HBase can be run as a standalone database on the same local host as Titan and the end-user application. In this model, Titan and HBase communicate with one another via a `localhost` socket. Running Titan over HBase requires the following setup steps:

* Download and extract a stable HBase from http://www.apache.org/dyn/closer.cgi/hbase/stable/.
* Start HBase by invoking the `start-hbase.sh` script in the _bin_ directory inside the extracted HBase directory. To stop HBase, use `stop-hbase.sh`.

[source,bourne]
$ ./bin/start-hbase.sh 
starting master, logging to ../logs/hbase-master-machine-name.local.out

Now, you can create an HBase TitanGraph as follows:

[source,java]
Configuration conf = new BaseConfiguration();
conf.setProperty("storage.backend","hbase");
TitanGraph g = TitanFactory.open(conf);

Note that you do not need to specify a hostname since a connection to localhost is attempted by default. Also, in the Gremlin shell, you cannot define the types of the variables `conf` and `g`; simply leave off the type declarations.
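Once the graph is open, a minimal smoke test along the following lines can verify that writes reach HBase. This is an illustrative sketch using the standard Blueprints API; the property name and value are hypothetical.

[source,java]
----
// Create a vertex, persist it, and close the graph (illustrative sketch)
Vertex v = g.addVertex(null);        // Blueprints: the id argument is ignored by Titan
v.setProperty("name", "hercules");   // hypothetical example property
g.commit();                          // commit the transaction to HBase
g.shutdown();                        // cleanly close the graph
----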

Remote Server Mode
++++++++++++++++++

image:titan-modes-distributed.png[]

When the graph needs to scale beyond the confines of a single machine, HBase and Titan are logically separated onto different machines. In this model, the HBase cluster maintains the graph representation, and any number of Titan instances maintain socket-based read/write access to the HBase cluster. The end-user application runs in the same JVM as the Titan instance it interacts with.

For example, suppose we have a running HBase cluster with two machines at the IP addresses 77.77.77.77 and 77.77.77.78. Connecting Titan to the cluster is accomplished as follows:

[source,java]
Configuration conf = new BaseConfiguration();
conf.setProperty("storage.backend","hbase");
conf.setProperty("storage.hostname","77.77.77.77,77.77.77.78");
TitanGraph g = TitanFactory.open(conf);

`storage.hostname` accepts a comma-separated list of IP addresses and hostnames for any subset of machines in the HBase cluster that Titan should connect to. As before, in the Gremlin shell you cannot define the types of the variables `conf` and `g`; simply leave off the type declarations.

Remote Server Mode with Rexster
+++++++++++++++++++++++++++++++

image:titan-modes-rexster.png[]

Finally, Rexster can be wrapped around each Titan instance defined in the previous subsection. In this way, the end-user application need not be Java-based, since it can communicate with Rexster over REST. This type of deployment is great for polyglot architectures where various components written in different languages need to reference and compute on the graph.

----
http://rexster.titan.machine1/mygraph/vertices/1
http://rexster.titan.machine2/mygraph/tp/gremlin?script=g.v(1).out('follows').out('created')
----

In this case, each Rexster server would be configured to connect to the HBase cluster. The following shows the graph specific fragment of the Rexster configuration. Refer to <<rexster>> for a complete example.

[source,xml]
<graph>
  <graph-name>mygraph</graph-name>
  <graph-type>com.thinkaurelius.titan.tinkerpop.rexster.TitanGraphConfiguration</graph-type>
  <graph-location></graph-location>
  <graph-read-only>false</graph-read-only>
  <properties>
        <storage.backend>hbase</storage.backend>
        <storage.hostname>77.77.77.77,77.77.77.78</storage.hostname>
  </properties>
  <extensions>
    <allows>
      <allow>tp:gremlin</allow>
    </allows>
  </extensions>
</graph>

HBase Specific Configuration
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In addition to the general _TODO xref to Graph-Configuration_, there are the following HBase-specific Titan configuration options:

.HBase Configuration Options
[cols="4,7,2,2,1",options="header"]
|==========================
|Option |Description |Data Type |Default |Modifiable

|storage.tablename |
Name of the HBase table in which Titan stores its column families
|String |titan |No

|storage.hostname |
Comma-separated list of IP addresses or hostnames of the HBase cluster
nodes that this Titan instance connects to
|IP addresses or hostnames. Leave empty to connect to localhost. |- |Yes

|storage.port |
Port on which to connect to HBase cluster nodes. Leave empty to use default port.
|Integer |- |Yes

| storage.short-cf-names |
True to replace Titan's conventional column family names with single-character
mnemonics. http://hbase.apache.org/book/rowkey.design.html#keysize.cf[Short
CF names save space because HBase stores the CF name in every KeyValue].
|Boolean |false in 0.4.x, true in 0.5.x |No

|==========================
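For example, a configuration sketch exercising several of the options above might look as follows. All values are illustrative; in particular, the port only needs to be set when the cluster does not listen on the default port.

[source,java]
----
Configuration conf = new BaseConfiguration();
conf.setProperty("storage.backend", "hbase");
conf.setProperty("storage.hostname", "77.77.77.77,77.77.77.78");
conf.setProperty("storage.port", "2181");       // illustrative; omit to use the default port
conf.setProperty("storage.tablename", "titan"); // the default table name, shown explicitly
TitanGraph g = TitanFactory.open(conf);
----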

Please refer to the http://hbase.apache.org/book/config.files.html[HBase configuration documentation] for more HBase configuration options and their descriptions. Any HBase configuration option prefixed with `storage.hbase-config` in the Titan configuration is passed on to HBase at initialization time. This allows arbitrary HBase configuration options to be set through Titan.
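For instance, a minimal sketch of this pass-through mechanism, assuming the standard HBase client option `zookeeper.znode.parent` needs to be overridden (the value shown is hypothetical):

[source,java]
----
Configuration conf = new BaseConfiguration();
conf.setProperty("storage.backend", "hbase");
// Stripped of its prefix and handed to the HBase client as "zookeeper.znode.parent"
conf.setProperty("storage.hbase-config.zookeeper.znode.parent", "/hbase-titan");
TitanGraph g = TitanFactory.open(conf);
----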

Global Graph Operations
^^^^^^^^^^^^^^^^^^^^^^^

Titan over HBase supports global vertex and edge iteration. However, note that all of these vertices and/or edges will be loaded into memory, which can cause an `OutOfMemoryError`. Use http://thinkaurelius.github.com/faunus/[Faunus] to iterate over all vertices or edges in large graphs.
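On graphs small enough to fit in memory, global iteration via the standard Blueprints API might look like the following sketch:

[source,java]
----
// Iterate over all vertices (loads every vertex into memory)
for (Vertex v : g.getVertices()) {
    System.out.println(v.getId());
}
// Iterate over all edges (the same memory caveat applies)
for (Edge e : g.getEdges()) {
    System.out.println(e.getLabel());
}
----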

Deploying on Amazon EC2
^^^^^^^^^^^^^^^^^^^^^^^

[.tss-center.tss-width-250]
image:http://cdn001.practicalclouds.com/user-content/1_Dave%20McCormick/logos/Amazon%20AWS%20plus%20EC2%20logo_scaled.png[link="http://aws.amazon.com/ec2/"]

[quote,'http://aws.amazon.com/ec2/[Amazon EC2]']
Amazon EC2 is a web service that provides resizable compute capacity in the cloud. It is designed to make web-scale computing easier for developers.

Follow these steps to set up an HBase cluster on EC2 and deploy Titan over HBase. To follow these instructions, you need an Amazon AWS account with established authentication credentials and some basic knowledge of AWS and EC2.

The following commands first launch a four-node HBase cluster on EC2 via http://whirr.apache.org/[Whirr], then run a basic Titan test case using the cluster.

The configuration described below puts one HBase master server in charge of three HBase regionservers.  The master will be the sole member of the ZooKeeper quorum through which Titan connects to HBase.

Whirr 0.7.1 sometimes fails when run on a machine behind a NAT (https://issues.apache.org/jira/browse/WHIRR-459[WHIRR-459]).  For this reason, it's recommended to use at least Whirr 0.7.2.  Whirr 0.8.0 was used to test the following commands on a t1.micro instance running Amazon Linux 2012.03.  These commands might need tweaking to produce the intended results in environments other than a t1.micro instance running Amazon Linux 2012.03.

[source,bourne]
----
# These commands were executed on a t1.micro instance running Amazon Linux 2012.03 x86_64.
# The AMI identifier for Amazon Linux 2012.03 x86_64 is ami-aecd60c7.
# https://console.aws.amazon.com/ec2/home?region=us-east-1#launchAmi=ami-aecd60c7
export AWS_ACCESS_KEY_ID=... # Set your Access Key here
export AWS_SECRET_ACCESS_KEY=... # Set your Secret Key here
curl -O http://www.apache.org/dist/whirr/whirr-0.8.0/whirr-0.8.0.tar.gz
tar -xzf whirr-0.8.0.tar.gz && cd whirr-0.8.0
# Generate an SSH keypair with which Whirr will deploy and manage instances
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa_whirr
# Download a Whirr recipe for deploying HBase 0.94.1 with hadoop-core 1.0.3
pushd recipes && wget 'https://raw.github.com/thinkaurelius/titan/master/config/whirr-hbase.properties' ; popd
bin/whirr launch-cluster --config recipes/whirr-hbase.properties --private-key-file ~/.ssh/id_rsa_whirr
# Run a superficial ZooKeeper health check on the hbase-master node (this should print "imok")
echo "ruok" | nc $(awk '{print $3}' ~/.whirr/hbase-testing/instances | head -1) 2181; echo
# Login to the HBase master node to run the remaining commands
ssh -i ~/.ssh/id_rsa_whirr -o "UserKnownHostsFile /dev/null" \
      -o StrictHostKeyChecking=no \
      `grep hbase-master ~/.whirr/hbase-testing/instances \
      | awk '{print $3}'`
# Maven 2 is available through the package manager, but an incompatibility
# with surefire 2.12 makes it a pain to use; here we download Maven 3 without
# the OS package manager
wget 'http://archive.apache.org/dist/maven/maven-3/3.0.4/binaries/apache-maven-3.0.4-bin.tar.gz'
tar -xzf apache-maven-3.0.4-bin.tar.gz
# Install git
sudo apt-get install -y git-core
# Clone Titan
git clone 'git://github.com/thinkaurelius/titan.git' && cd titan
# Run a HBase-backed test of Titan
#
# This test should produce pages of output ending in something like this:
#
# -------------------------------------------------------
#  T E S T S
# -------------------------------------------------------
# Running com.thinkaurelius.titan.graphdb.hbase.ExternalHBaseGraphPerformanceTest
# Starting trial 1/1
# 10000
# 20000
# 30000
# 40000
# 50000
# Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 303.659 sec
#
# Results :
#
# Tests run: 1, Failures: 0, Errors: 0, Skipped: 0
#
# [INFO] ------------------------------------------------------------------------
# [INFO] BUILD SUCCESS
# [INFO] ------------------------------------------------------------------------
~/apache-maven-3.0.4/bin/mvn test -Dtest=ExternalHBaseGraphPerformanceTest#unlabeledEdgeInsertion
# Check on hadoop 
hadoop version # Should print 1.0.3
# List the hadoop root; should print something like:
#
# Found 4 items
# drwxr-xr-x   - hadoop supergroup          0 2012-09-20 00:20 /hadoop
# drwxr-xr-x   - hadoop supergroup          0 2012-09-20 00:42 /hbase
# drwxrwxrwx   - hadoop supergroup          0 2012-09-20 00:20 /tmp
# drwxrwxrwx   - hadoop supergroup          0 2012-09-20 00:20 /user
hadoop fs -ls /
----

Tips and Tricks for Managing an HBase Cluster
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The http://wiki.apache.org/hadoop/Hbase/Shell[HBase shell] on the master server can be used to get an overall status check of the cluster.

[source,bourne]
$HBASE_HOME/bin/hbase shell

From the shell, the following commands are generally useful for understanding the status of the cluster.

[source,ruby]
status 'summary'
status 'simple'
status 'detailed'

The above commands can identify whether a region server has gone down. If so, it is possible to `ssh` into the failed region server machine and run the following:

[source,bourne]
sudo -u hadoop $HBASE_HOME/bin/hbase-daemon.sh stop regionserver
sudo -u hadoop $HBASE_HOME/bin/hbase-daemon.sh start regionserver

The use of http://code.google.com/p/parallel-ssh/[pssh] can make this process easy as there is no need to log into each machine individually to run the commands. Put the IP addresses of the regionservers into a `hosts.txt` file and then execute the following.

[source,bourne]
pssh -h hosts.txt sudo -u hadoop $HBASE_HOME/bin/hbase-daemon.sh stop regionserver
pssh -h hosts.txt sudo -u hadoop $HBASE_HOME/bin/hbase-daemon.sh start regionserver

Next, sometimes the master server needs to be restarted (e.g. after connection refused exceptions). To do so, execute the following on the master:

[source,bourne]
sudo -u hadoop $HBASE_HOME/bin/hbase-daemon.sh stop master
sudo -u hadoop $HBASE_HOME/bin/hbase-daemon.sh start master

Finally, if an HBase cluster has already been deployed and more memory is required of the master or region servers, simply edit the `$HBASE_HOME/conf/hbase-env.sh` files on the respective machines with the requisite `-Xmx` and `-Xms` parameters. Once edited, stop/start the master and/or region servers as described previously.

Targeting Hadoop 2
^^^^^^^^^^^^^^^^^^

Titan's HBase storage backend depends on the Apache HBase 0.94 Maven artifact.  This in turn depends on Hadoop 1.0.  However, http://hbase.apache.org/book/configuration.html#hadoop[HBase supports multiple Hadoop versions and recommends Hadoop 2].

Titan is supported with its released version of HBase and the associated default Hadoop version (see <<version-compat>>).  Due to the sheer number of interoperable HBase and Hadoop versions, Titan's HBase backend is not tested in every version combination.  However, Titan attempts to maintain as wide a version compatibility range as possible by using only a minimal set of core Hadoop and HBase API classes and methods.  Furthermore, since version 0.4.1, Titan includes Maven build options to make targeting Hadoop 2.0 or 2.2 easier.  It should be possible to target any recent Hadoop version and compatible HBase version without modifying Titan's source code, though targeting Hadoop versions besides 1.0, 2.0, or 2.2 may require tweaking dependency versions in `titan-hbase/pom.xml`.

Broadly, targeting Titan for Hadoop 2 is a two step process:

. Build the HBase Maven artifact with a newer Hadoop version and install it in the local Maven repository
. Build the titan-hbase artifact using the HBase artifact from step #1

These steps are described in the following two subsections.

Rebuilding HBase with Hadoop 2
++++++++++++++++++++++++++++++

In this subsection, we download an HBase 0.94.12 release tarball and compile it against Hadoop 2.2.0.

. Download HBase 0.94.12 from the closest mirror on this page: http://www.apache.org/dyn/closer.cgi/hbase/hbase-0.94.12/hbase-0.94.12.tar.gz
. Extract the tarball and change directory: `tar -xzf hbase-0.94.12.tar.gz && cd hbase-0.94.12`
. Invoke `mvn clean install -Dhadoop.profile=2.0 -Dhadoop.version=2.2.0 -DskipTests=true`

The last command builds HBase using Hadoop 2.2.0 and installs the resulting Maven artifact into the local repository, which is an elaborate way of saying it copies the HBase jar into `~/.m2/repository/`.  This is necessary for the next step to work.

Rebuilding Titan-HBase
++++++++++++++++++++++

In this subsection, we clone the Titan repository from GitHub and rebuild the titan-hbase artifact.  These commands should be run on the same machine as the commands in the "Rebuilding HBase with Hadoop 2" section preceding this one, or at least on two machines that share a single `~/.m2/repository/` filesystem.

. Clone the repo and change directory: `git clone https://github.com/thinkaurelius/titan.git && cd titan`
. Optional: to build against a specific release, check out the associated tag.  For instance, to use 0.4.1, run `git checkout refs/tags/0.4.1`.  These instructions are only supported in Titan 0.4.1 and later.  Git checks out the master branch by default.  Titan's master branch usually corresponds to the latest release plus bugfixes and optimizations that have yet to be released.
. Rebuild and locally install Titan; this creates artifacts needed by titan-hbase, including titan-core, titan-test, and titan-es (for testing): `mvn clean install -DskipTests=true`
. Build and test Titan-HBase: `pushd titan-hbase; mvn clean install -Dhadoop.profile=2.0 -Dhadoop.version=2.2.0; popd`

The final step writes a `titan-hbase-$VERSION.jar` file to `titan-hbase/target/` and installs it in the local Maven repository.