The GridDB Hadoop MapReduce Connector allows you to use GridDB as the data storage engine for Hadoop MapReduce applications with only a few small changes to their source code. In this post we'll take a look at how to install and use the GridDB Hadoop MapReduce Connector.
Install Dependencies
Starting from a fresh CentOS 6.8 image, we first need to install the Oracle JDK, Hadoop (from Bigtop), and GridDB.
Oracle JDK
Download the Linux x64 JDK RPM from Oracle's Java SE Development Kit 8 Downloads page and install it using rpm.
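Assuming the downloaded file is named something like jdk-8u131-linux-x64.rpm (the exact name depends on the JDK 8 update you grab), the install is a single command:

# rpm -ivh jdk-8u131-linux-x64.rpm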
Bigtop
curl http://www.apache.org/dist/bigtop/bigtop-1.1.0/repos/centos6/bigtop.repo > /etc/yum.repos.d/bigtop.repo
yum -y install hadoop-\*
Ensure your /etc/hosts has an entry for your machine’s hostname like so:
10.2.0.4 griddbhadoopblog
Since we’re using an Azure VM to test, we also need to mount the local SSD as Hadoop’s data directory.
# mkdir /mnt/resource/hadoop-hdfs
# chown -R hdfs.hdfs /mnt/resource/hadoop-hdfs
# mount -o bind /mnt/resource/hadoop-hdfs/ /var/lib/hadoop-hdfs/
Now you can format the namenode and start the Hadoop HDFS services:
# sudo -u hdfs hadoop namenode -format
# service hadoop-hdfs-namenode start
# service hadoop-hdfs-datanode start
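As an optional check, you can confirm that both daemons came up and the DataNode registered with the NameNode:

# sudo -u hdfs hdfs dfsadmin -report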
Now we need to create all of the HDFS directories as required (from How to install Hadoop distribution from Bigtop 0.5.0):
sudo -u hdfs hadoop fs -mkdir /tmp
sudo -u hdfs hadoop fs -chmod -R 1777 /tmp
sudo -u hdfs hadoop fs -mkdir -p /var/log/hadoop-yarn
sudo -u hdfs hadoop fs -chown yarn:mapred /var/log/hadoop-yarn
sudo -u hdfs hadoop fs -mkdir -p /user/history
sudo -u hdfs hadoop fs -chown mapred:mapred /user/history
sudo -u hdfs hadoop fs -chmod 770 /user/history
sudo -u hdfs hadoop fs -mkdir -p /tmp/hadoop-yarn/staging
sudo -u hdfs hadoop fs -chmod -R 1777 /tmp/hadoop-yarn/staging
sudo -u hdfs hadoop fs -mkdir -p /tmp/hadoop-yarn/staging/history/done_intermediate
sudo -u hdfs hadoop fs -chmod -R 1777 /tmp/hadoop-yarn/staging/history/done_intermediate
sudo -u hdfs hadoop fs -chown -R mapred:mapred /tmp/hadoop-yarn/staging
For each system user that will have access to HDFS, create an HDFS home directory:
sudo -u hdfs hadoop fs -mkdir -p /user/$USER
sudo -u hdfs hadoop fs -chown $USER:$USER /user/$USER
sudo -u hdfs hadoop fs -chmod 770 /user/$USER
Now we can start YARN and test the cluster.
# service hadoop-yarn-resourcemanager start
# service hadoop-yarn-nodemanager start
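If you want a quick sanity check that the NodeManager registered with the ResourceManager before running a job, list the cluster's nodes:

$ yarn node -list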
My favourite Hadoop test is a wordcount of Java’s documentation.
$ hdfs dfs -put /usr/share/doc/java-1.6.0-openjdk-1.6.0.40/
$ time yarn jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount java-1.6.0-openjdk-1.6.0.40 wordcount-out
$ hdfs dfs -get wordcount-out
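The counts end up in the job's reducer output files (normally named part-r-00000 and so on inside the output directory), so a quick spot check looks like:

$ head wordcount-out/part-r-00000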
GridDB
We will install GridDB via the instructions in the GridDB Community Edition RPM Install Guide. If you’re using GridDB SE or AE, the details may be slightly different. Please refer to the GridDB SE/AE Quick Start Guide for more information.
# rpm -Uvh https://github.com/griddb/griddb_nosql/releases/download/v3.0.0/griddb_nosql-3.0.0-1.linux.x86_64.rpm
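The RPM should create the gsadm operating system user and the gridstore group used in the steps below; a quick way to verify the install is:

# rpm -qi griddb_nosql
# id gsadm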
We're going to use a profile script to set the GridDB environment variables. As root, create /etc/profile.d/griddb_nosql.sh:
#!/bin/bash
export GS_HOME=/var/lib/gridstore
export GS_LOG=$GS_HOME/log
Log out and back in and the settings will be applied. Now create the GridDB password for the admin user.
# sudo su - gsadm -c "gs_passwd admin"
(input your_password)
The default gs_cluster.json and gs_node.json configuration files are fine for single-node usage, except that you will need to enter a name for the cluster in gs_cluster.json (a sample fragment is shown after the mount commands below). Like with Hadoop, since we're using Azure we need to mount the local SSD as GridDB's data directory:
mkdir -p /mnt/resource/griddb_data
chown -R gsadm.gridstore /mnt/resource/griddb_data
mount -o bind /mnt/resource/griddb_data /var/lib/gridstore/data
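For reference, here is a minimal sketch of the cluster-name setting in /var/lib/gridstore/conf/gs_cluster.json, showing only the field relevant to this post and the defaultCluster name we join below; the real file contains many more settings that can be left at their defaults:

{
    "cluster": {
        "clusterName": "defaultCluster"
    }
}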
Now start the cluster and confirm the node is ACTIVE in gs_stat:
sudo su - gsadm -c gs_startnode
sudo su - gsadm -c "gs_joincluster -c defaultCluster -u admin/admin -n 1"
sudo su - gsadm -c "gs_stat -u admin/admin"
Maven
You’ll need Maven to build the Connector, and the easiest way to get it on CentOS/RHEL 6.8 is by downloading the binaries from Apache’s website:
wget http://apache.mirror.iweb.ca/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.tar.gz
tar zxvf apache-maven-3.3.9-bin.tar.gz
export PATH=$PATH:/path/to/apache-maven-3.3.9/bin
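Before building, it's worth confirming that Maven is now on your PATH:

$ mvn -version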
Since Maven is only needed for the initial build, we just set the PATH manually rather than creating a profile script.
Download and Build the GridDB Hadoop Connector
You can either use git to clone the GitHub repository with:
$ git clone https://github.com/griddb/griddb_hadoop_mapreduce.git
Or download the ZIP file and unzip it:
$ wget https://github.com/griddb/griddb_hadoop_mapreduce/archive/master.zip
$ unzip master.zip
Now build the Connector:
$ cd griddb_hadoop_mapreduce-master
$ cp /usr/share/java/gridstore.jar lib
$ mvn package
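If the build succeeds, the example jar is produced under the examples module's target directory (the exact artifact name depends on the version in the pom); you can locate the built jars with:

$ find . -name '*.jar'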
Using the Connector
One example (wordcount) is provided with the GridDB connector. It can be run with the following command:
$ cd gs-hadoop-mapreduce-examples/
$ ./exec-example.sh --job wordcount \
    --define clusterName=$CLUSTER_NAME \
    --define user=$GRIDDB_USER \
    --define password=$GRIDDB_PASSWORD \
    <list of files to count>
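For example, to mirror the earlier HDFS wordcount against the single-node GridDB cluster configured above, you would pass the cluster name, user, and password set earlier, plus whatever files you want to count (the file list below is just an illustration using the same Java documentation):

$ ./exec-example.sh --job wordcount \
    --define clusterName=defaultCluster \
    --define user=admin \
    --define password=your_password \
    /usr/share/doc/java-1.6.0-openjdk-1.6.0.40/*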
So how does performance compare? Well, for the above small example, HDFS typically takes 36 seconds to complete the job while GridDB takes 35 seconds. Of course the test is so small that the results are meaningless, so look forward to a future blog post that does a proper performance comparison across a variety of configurations. Also coming soon is a look at how to take an existing MapReduce application and port it to use the GridDB Hadoop Connector.
If you have any questions about the blog, please create a Stack Overflow post at https://stackoverflow.com/questions/ask?tags=griddb.
Make sure that you use the “griddb” tag so our engineers can quickly reply to your questions.