The GridDB Hadoop MapReduce Connector allows you to use GridDB as the data storage engine for Hadoop MapReduce applications with only a few small changes to their source code. In this post we'll take a look at how to install and use the GridDB Hadoop MapReduce Connector.
Install Dependencies
Starting from a fresh CentOS 6.8 image, we first need to install the Oracle JDK, Hadoop (from Bigtop), and GridDB.
Oracle JDK
Download the Linux x64 JDK RPM from Oracle's Java SE Development Kit 8 Downloads page and install it using rpm.
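Assuming the downloaded file is named something like jdk-8u131-linux-x64.rpm (the exact name depends on the JDK 8 update you grab), the install is a single command:

# rpm -ivh jdk-8u131-linux-x64.rpm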
Bigtop
curl http://www.apache.org/dist/bigtop/bigtop-1.1.0/repos/centos6/bigtop.repo > /etc/yum.repos.d/bigtop.repo
yum -y install hadoop-\*
Ensure your /etc/hosts has an entry for your machine’s hostname like so:
10.2.0.4 griddbhadoopblog
Since we’re using an Azure VM to test, we also need to mount the local SSD as Hadoop’s data directory.
# mkdir /mnt/resource/hadoop-hdfs
# chown -R hdfs.hdfs /mnt/resource/hadoop-hdfs
# mount -o bind /mnt/resource/hadoop-hdfs/ /var/lib/hadoop-hdfs/
Now you can format the namenode and start the Hadoop HDFS services:
# sudo -u hdfs hadoop namenode -format
# service hadoop-hdfs-namenode start
# service hadoop-hdfs-datanode start
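As an optional check, you can confirm that both daemons came up and the DataNode registered with the NameNode:

# sudo -u hdfs hdfs dfsadmin -report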
Now we need to create all of the HDFS directories as required (from How to install Hadoop distribution from Bigtop 0.5.0):
sudo -u hdfs hadoop fs -mkdir /tmp
sudo -u hdfs hadoop fs -chmod -R 1777 /tmp
sudo -u hdfs hadoop fs -mkdir -p /var/log/hadoop-yarn
sudo -u hdfs hadoop fs -chown yarn:mapred /var/log/hadoop-yarn
sudo -u hdfs hadoop fs -mkdir -p /user/history
sudo -u hdfs hadoop fs -chown mapred:mapred /user/history
sudo -u hdfs hadoop fs -chmod 770 /user/history
sudo -u hdfs hadoop fs -mkdir -p /tmp/hadoop-yarn/staging
sudo -u hdfs hadoop fs -chmod -R 1777 /tmp/hadoop-yarn/staging
sudo -u hdfs hadoop fs -mkdir -p /tmp/hadoop-yarn/staging/history/done_intermediate
sudo -u hdfs hadoop fs -chmod -R 1777 /tmp/hadoop-yarn/staging/history/done_intermediate
sudo -u hdfs hadoop fs -chown -R mapred:mapred /tmp/hadoop-yarn/staging
For each system user that will have access to HDFS, create an HDFS home directory:
sudo -u hdfs hadoop fs -mkdir -p /user/$USER
sudo -u hdfs hadoop fs -chown $USER:$USER /user/$USER
sudo -u hdfs hadoop fs -chmod 770 /user/$USER
Now we can start YARN and test the cluster.
# service hadoop-yarn-resourcemanager start
# service hadoop-yarn-nodemanager start
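If you want a quick sanity check that the NodeManager registered with the ResourceManager before running a job, list the cluster's nodes:

$ yarn node -list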
My favourite Hadoop test is a wordcount of Java’s documentation.
$ hdfs dfs -put /usr/share/doc/java-1.6.0-openjdk-1.6.0.40/
$ time yarn jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount java-1.6.0-openjdk-1.6.0.40 wordcount-out
$ hdfs dfs -get wordcount-out
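The counts end up in the job's reducer output files (normally named part-r-00000 and so on inside the output directory), so a quick spot check looks like:

$ head wordcount-out/part-r-00000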
GridDB
We will install GridDB via the instructions in the GridDB Community Edition RPM Install Guide. If you’re using GridDB SE or AE, the details may be slightly different. Please refer to the GridDB SE/AE Quick Start Guide for more information.
# rpm -Uvh https://github.com/griddb/griddb_nosql/releases/download/v3.0.0/griddb_nosql-3.0.0-1.linux.x86_64.rpm
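The RPM should create the gsadm operating system user and the gridstore group used in the steps below; a quick way to verify the install is:

# rpm -qi griddb_nosql
# id gsadm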
We're going to use a profile script to set the GridDB environment variables. As root, create /etc/profile.d/griddb_nosql.sh:
#!/bin/bash
export GS_HOME=/var/lib/gridstore
export GS_LOG=$GS_HOME/log
Log out and back in and the settings will be applied. Now create the GridDB password for the admin user.
# sudo su - gsadm -c "gs_passwd admin"
(input your_password)
The default gs_cluster.json and gs_node.json configuration files are fine for single-node usage, except that you will need to enter a name for the cluster in gs_cluster.json (a sample fragment is shown after the mount commands below). Like with Hadoop, since we're using Azure we need to mount the local SSD as GridDB's data directory:
mkdir -p /mnt/resource/griddb_data
chown -R gsadm.gridstore /mnt/resource/griddb_data
mount -o bind /mnt/resource/griddb_data /var/lib/gridstore/data
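For reference, here is a minimal sketch of the cluster-name setting in /var/lib/gridstore/conf/gs_cluster.json, showing only the field relevant to this post and the defaultCluster name we join below; the real file contains many more settings that can be left at their defaults:

{
    "cluster": {
        "clusterName": "defaultCluster"
    }
}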
Now start the cluster and confirm the node is ACTIVE in gs_stat:
sudo su - gsadm -c gs_startnode
sudo su - gsadm -c "gs_joincluster -c defaultCluster -u admin/admin -n 1"
sudo su - gsadm -c "gs_stat -u admin/admin"
Maven
You’ll need Maven to build the Connector, and the easiest way to get it on CentOS/RHEL 6.8 is by downloading the binaries from Apache’s website:
wget http://apache.mirror.iweb.ca/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.tar.gz
tar zxvf apache-maven-3.3.9-bin.tar.gz
export PATH=$PATH:/path/to/apache-maven-3.3.9/bin
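Before building, it's worth confirming that Maven is now on your PATH:

$ mvn -version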
Since Maven is only needed for the initial build, we just set the PATH manually rather than creating a profile script.
Download and Build the GridDB Hadoop Connector
You can either use git to clone the GitHub repository with:
$ git clone https://github.com/griddb/griddb_hadoop_mapreduce.git
Or download the ZIP file and unzip it:
$ wget https://github.com/griddb/griddb_hadoop_mapreduce/archive/master.zip
$ unzip master.zip
Now build the Connector:
$ cd griddb_hadoop_mapreduce-master
$ cp /usr/share/java/gridstore.jar lib
$ mvn package
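If the build succeeds, the example jar is produced under the examples module's target directory (the exact artifact name depends on the version in the pom); you can locate the built jars with:

$ find . -name '*.jar'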
Using the Connector
One example (wordcount) is provided with the GridDB connector. It can be run with the following command:
$ cd gs-hadoop-mapreduce-examples/
$ ./exec-example.sh --job wordcount \
    --define clusterName=$CLUSTER_NAME \
    --define user=$GRIDDB_USER \
    --define password=$GRIDDB_PASSWORD \
    <list of files to count>
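For example, to mirror the earlier HDFS wordcount against the single-node GridDB cluster configured above, you would pass the cluster name, user, and password set earlier, plus whatever files you want to count (the file list below is just an illustration using the same Java documentation):

$ ./exec-example.sh --job wordcount \
    --define clusterName=defaultCluster \
    --define user=admin \
    --define password=your_password \
    /usr/share/doc/java-1.6.0-openjdk-1.6.0.40/*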
So how does performance compare? Well, for the above small example, HDFS typically takes 36 seconds to complete the job while GridDB takes 35 seconds. Of course the test is so small that the results are meaningless, so look forward to a future blog post that does a proper performance comparison across a variety of configurations. Also coming soon is a look at how to take an existing MapReduce application and port it to use the GridDB Hadoop Connector.
If you have any questions about the blog, please create a Stack Overflow post at https://stackoverflow.com/questions/ask?tags=griddb.
Make sure that you use the “griddb” tag so our engineers can quickly reply to your questions.