Build a Custom Hadoop/Giraph Project with Maven

Context

 

Hey mates! Did you have dependency problems when packing a jar file and submitting it to the Hadoop/Giraph cluster? Well, you can see many people in the mailing lists suggesting the use of Apache Ant or Apache Maven, but without giving any concrete example of how to actually do it. In the next sections I will give you a quick example of how to do it.

Maven startup configuration

 

Well, let’s assume that you have Maven properly installed already – if not, check here.

To check whether Maven was properly installed, run:

mvn --version

You should see output similar to this (yours may differ slightly):

Apache Maven 3.0.5 (r01de14724cdef164cd33c7c8c2fe155faf9602da; 2013-02-19 14:51:28+0100)
Maven home: D:\apache-maven-3.0.5\bin\..
Java version: 1.6.0_25, vendor: Sun Microsystems Inc.
Java home: C:\Program Files\Java\jdk1.6.0_25\jre
Default locale: nl_NL, platform encoding: Cp1252
OS name: "windows 7", version: "6.1", arch: "amd64", family: "windows"

Creating the Maven Project

 

Ok, now create the directory where you would like your project to reside. Once you have done that, run the following command in the empty project directory.

mvn archetype:generate -DgroupId=com.marcolotz.giraph -DartifactId=TripletExample -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false

Please choose a groupId and an artifactId that match your project. The values above are just the ones I used in my personal project: the code belonged to the package com.marcolotz.giraph and the project name was TripletExample.
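For reference, the quickstart archetype generates a skeleton roughly like the one below (App.java and AppTest.java are just the placeholder classes that the archetype creates; you will replace them with your own code):

TripletExample/
    pom.xml
    src/main/java/com/marcolotz/giraph/App.java
    src/test/java/com/marcolotz/giraph/AppTest.java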

Modifying the POM file

 

POM stands for Project Object Model and it is the core of a project’s configuration in Maven. You may have already noticed that Giraph itself contains several POM files. When you created the Maven project with the command mentioned above, a pom.xml file was also created. We now need to modify this file in order to build Giraph code. Thus, change the pom.xml content to the following:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">

  <modelVersion>4.0.0</modelVersion>
  <groupId>com.marcolotz.giraph</groupId>
  <artifactId>tripletExample</artifactId>
  <packaging>jar</packaging>
  <version>1.0-SNAPSHOT</version>
  <name>tripletExample</name>
  <url>http://maven.apache.org</url>

  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>3.8.1</version>
      <scope>test</scope>
    </dependency>

    <dependency>
      <groupId>org.apache.giraph</groupId>
      <artifactId>giraph-core</artifactId>
      <version>1.1.0</version>
    </dependency>

    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-core</artifactId>
      <version>0.20.203.0</version>
    </dependency>
  </dependencies>

  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.1</version>
        <configuration>
          <source>1.7</source>
          <target>1.7</target>
        </configuration>
      </plugin>

      <plugin>
        <artifactId>maven-assembly-plugin</artifactId>
        <configuration>
          <archive>
          </archive>
          <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
          </descriptorRefs>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>

If you use this approach to build a plain Hadoop project instead, just make sure that the pom file references the Hadoop version of your cluster and remove the Giraph dependency (see the sketch below). Also please note that the “@Algorithm” annotation used in some Giraph examples is not defined in giraph-core, but in giraph-examples. This will cause build problems should your source code contain it.
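As a sketch of that dependency change (the artifact and version here are only examples; for a Hadoop 2.x cluster you would typically depend on hadoop-client instead of hadoop-core, with the version matching your cluster):

<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>2.2.0</version>
</dependency>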

Inserting Custom Code and Building

 

Now you just need to place your Java files in the package folder (for the example above, tripletExample/src/main/java/com/marcolotz/giraph) and build the solution. In order to build it, run the following command in the folder where the pom.xml file is located:

mvn clean compile assembly:single

This command will clean the target folder (if it is not already empty) and prepare a single jar file with all the project dependencies inside. The final product of the command will be in the target folder. Since it contains all dependencies, this jar file may be quite large. Finally, you can submit the jar file to your Hadoop cluster ;)
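Just to make the “insert your Java files” step concrete, here is a minimal sketch of a computation class you could drop into that package folder and pack with the command above. The class name is hypothetical; the type parameters follow the Giraph 1.1.0 BasicComputation API:

package com.marcolotz.giraph;

import java.io.IOException;

import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;

// Minimal sketch: long vertex ids, double vertex values, float edge weights
// and double messages. It does no real work; it simply halts every vertex.
public class TripletComputation extends
    BasicComputation<LongWritable, DoubleWritable, FloatWritable, DoubleWritable> {

  @Override
  public void compute(Vertex<LongWritable, DoubleWritable, FloatWritable> vertex,
      Iterable<DoubleWritable> messages) throws IOException {
    // A real algorithm would read the incoming messages, update
    // vertex.getValue() and send new messages to neighbours here.
    vertex.voteToHalt();
  }
}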

 

Please note that there are other ways to reduce the size of the jar file, such as passing a classpath argument to Hadoop/Giraph instead of packing everything inside the jar (a rough sketch follows below). This is also an elegant solution when the jar file would be too large to be easily distributed across a cluster.
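One common shape of that approach is to keep the application jar thin and point Hadoop at the Giraph jars separately. The jar paths below are placeholders and the computation class is the hypothetical one from the sketch above, so treat this only as an outline to adapt to your own setup:

export HADOOP_CLASSPATH=/path/to/giraph-core-1.1.0.jar
hadoop jar target/tripletExample-1.0-SNAPSHOT.jar org.apache.giraph.GiraphRunner \
  -libjars /path/to/giraph-core-1.1.0.jar \
  com.marcolotz.giraph.TripletComputation \
  -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat \
  -vip /user/hduser/input/tiny_graph.txt \
  -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
  -op /user/hduser/output/shortestpaths -w 1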

Giraph configuration details

Heeeey!

Today I had to deploy a single-node Apache Giraph installation on my machine. Giraph has a really good quick start guide (here). A few corrections are needed to keep it up to date with the latest version available (1.1.0), but hopefully I will be able to do that this week.

There are, however, a few glitches that can happen during the installation process. Here I will present quick solutions to them.

OpenJDK management.properties problem

I must confess that I’d rather use Oracle Java than OpenJDK. As most of you may know, in order to run Hadoop you need at least Java 1.6 installed. In my current configuration, I was trying to run Hadoop with OpenJDK Java 7 on Ubuntu 14.04.2.

With all my $JAVA_HOME variables correctly set, I was getting the following error:

hduser@prometheus-UX32A:/usr/lib/jvm/java-7-openjdk-amd64/jre/lib/management$ hadoop namenode -format
Error: Config file not found: /usr/lib/jvm/java-7-openjdk-amd64/jre/lib/management/management.properties

This error also appeared for other Hadoop scripts such as start-dfs.sh and start-mapred.sh – and of course start-all.sh.

Taking a quick look at the installation, I realized that management.properties was just a broken symbolic link. It also looks like it is directly related to the Java RMI (Remote Method Invocation) capabilities, which is probably how Hadoop starts all the daemons across the nodes. I must confess that I installed all the openjdk packages available in the Ubuntu repository and none of them solved my problem.

This can be solved with the following workaround:

  1. Download the equivalent Oracle Java version HERE. Keep in mind to download the same version as your OpenJDK (i.e. download Java 6, 7 or 8 depending on your JDK).
  2. Extract the Oracle Java archive and copy the file jre/lib/management/management.properties into the equivalent OpenJDK folder, as in the sketch below.
  3. Run the Hadoop command again.
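A rough sketch of step 2, assuming the Oracle archive was extracted to ~/jdk1.7.0 and that OpenJDK lives in the path shown in the error message above (adjust both paths to your versions):

# Remove the broken symbolic link, then put Oracle's file in its place.
sudo rm /usr/lib/jvm/java-7-openjdk-amd64/jre/lib/management/management.properties
sudo cp ~/jdk1.7.0/jre/lib/management/management.properties \
        /usr/lib/jvm/java-7-openjdk-amd64/jre/lib/management/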

Hopefully, this will solve that problem.

Giraph Mapper Allocation problem

Let’s suppose you installed Giraph correctly and the Hadoop MapReduce examples are running as expected on your machine.

Giraph, however, gives this very interesting line when you submit a job:

15/06/04 16:58:03 INFO job.GiraphJob: Waiting for resources... Job will start only when it gets all 2 mappers

If you want to make sure that we are talking about the same thing, here is a complete error output:

hduser@prometheus-UX32A:~$ hadoop jar /opt/apache/giraph/giraph-examples/target/giraph-examples-1.2.0-SNAPSHOT-for-hadoop-1.2.1-jar-with-dependencies.jar org.apache.giraph.GiraphRunner org.apache.giraph.examples.SimpleShortestPathsComputation -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat -vip /user/hduser/input/tiny_graph.txt -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat -op /user/hduser/output/shortestpaths -w 1
15/06/04 16:57:58 INFO utils.ConfigurationUtils: No edge input format specified. Ensure your InputFormat does not require one.
15/06/04 16:57:58 INFO utils.ConfigurationUtils: No edge output format specified. Ensure your OutputFormat does not require one.
15/06/04 16:57:59 INFO job.GiraphJob: run: Since checkpointing is disabled (default), do not allow any task retries (setting mapred.map.max.attempts = 1, old value = 4)
15/06/04 16:58:03 INFO job.GiraphJob: Tracking URL: http://hdhost:50030/jobdetails.jsp?jobid=job_201506041652_0003
15/06/04 16:58:03 INFO job.GiraphJob: Waiting for resources... Job will start only when it gets all 2 mappers
15/06/04 16:58:31 INFO mapred.JobClient: Running job: job_201506041652_0003
15/06/04 16:58:31 INFO mapred.JobClient: Job complete: job_201506041652_0003
15/06/04 16:58:31 INFO mapred.JobClient: Counters: 5
15/06/04 16:58:31 INFO mapred.JobClient:   Job Counters
15/06/04 16:58:31 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=31987
15/06/04 16:58:31 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
15/06/04 16:58:31 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
15/06/04 16:58:31 INFO mapred.JobClient:     Launched map tasks=2
15/06/04 16:58:31 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0

Good news: it is not your fault (unless you think that giving bad hostnames to your machine is your fault). According to this JIRA, there is a problem parsing hostnames: it does not cope with hostnames that contain upper-case letters. This can be solved by setting a new hostname with only lower-case letters. To do so, run as sudo:

 sudo hostname NEW_HOST_NAME 
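Also update your /etc/hosts table so that it uses the new lower-case name. As a sketch (these entries mirror my machine, prometheusmobile on the ironnetwork domain, so adapt them to yours):

127.0.0.1    localhost
127.0.1.1    prometheusmobile.ironnetwork    prometheusmobile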

Keep in mind that changing the hostname may affect other software that relies on it to run. Once you change it, restart the machine and the Hadoop daemons. Hopefully you will then get the correct output, which in my case is:

hduser@prometheusmobile:/opt/apache/giraph/giraph-examples$ hadoop jar target/giraph-examples-1.2.0-SNAPSHOT-for-hadoop-1.2.1-jar-with-dependencies.jar org.apache.giraph.GiraphRunner org.apache.giraph.examples.SimpleShortestPathsComputation -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat -vip /user/hduser/input/tiny_graph.txt -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat -op /user/hduser/output/shortestpaths -w 1
15/06/04 17:06:58 INFO utils.ConfigurationUtils: No edge input format specified. Ensure your InputFormat does not require one.
15/06/04 17:06:58 INFO utils.ConfigurationUtils: No edge output format specified. Ensure your OutputFormat does not require one.
15/06/04 17:06:58 INFO job.GiraphJob: run: Since checkpointing is disabled (default), do not allow any task retries (setting mapred.map.max.attempts = 1, old value = 4)
15/06/04 17:07:02 INFO job.GiraphJob: Tracking URL: http://hdhost:50030/jobdetails.jsp?jobid=job_201506041705_0002
15/06/04 17:07:02 INFO job.GiraphJob: Waiting for resources... Job will start only when it gets all 2 mappers
15/06/04 17:07:31 INFO job.HaltApplicationUtils$DefaultHaltInstructionsWriter: writeHaltInstructions: To halt after next superstep execute: 'bin/halt-application --zkServer prometheusmobile.ironnetwork:22181 --zkNode /_hadoopBsp/job_201506041705_0002/_haltComputation'
15/06/04 17:07:31 INFO mapred.JobClient: Running job: job_201506041705_0002
15/06/04 17:07:32 INFO mapred.JobClient:  map 100% reduce 0%
15/06/04 17:07:43 INFO mapred.JobClient: Job complete: job_201506041705_0002
15/06/04 17:07:43 INFO mapred.JobClient: Counters: 37
15/06/04 17:07:43 INFO mapred.JobClient:   Zookeeper halt node
15/06/04 17:07:43 INFO mapred.JobClient:     /_hadoopBsp/job_201506041705_0002/_haltComputation=0
15/06/04 17:07:43 INFO mapred.JobClient:   Zookeeper base path
15/06/04 17:07:43 INFO mapred.JobClient:     /_hadoopBsp/job_201506041705_0002=0
15/06/04 17:07:43 INFO mapred.JobClient:   Job Counters
15/06/04 17:07:43 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=40730
15/06/04 17:07:43 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
15/06/04 17:07:43 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
15/06/04 17:07:43 INFO mapred.JobClient:     Launched map tasks=2
15/06/04 17:07:43 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
15/06/04 17:07:43 INFO mapred.JobClient:   Giraph Timers
15/06/04 17:07:43 INFO mapred.JobClient:     Input superstep (ms)=188
15/06/04 17:07:43 INFO mapred.JobClient:     Total (ms)=9260
15/06/04 17:07:43 INFO mapred.JobClient:     Superstep 2 SimpleShortestPathsComputation (ms)=62
15/06/04 17:07:43 INFO mapred.JobClient:     Shutdown (ms)=8795
15/06/04 17:07:43 INFO mapred.JobClient:     Superstep 0 SimpleShortestPathsComputation (ms)=75
15/06/04 17:07:43 INFO mapred.JobClient:     Initialize (ms)=3254
15/06/04 17:07:43 INFO mapred.JobClient:     Superstep 3 SimpleShortestPathsComputation (ms)=44
15/06/04 17:07:43 INFO mapred.JobClient:     Superstep 1 SimpleShortestPathsComputation (ms)=61
15/06/04 17:07:43 INFO mapred.JobClient:     Setup (ms)=32
15/06/04 17:07:43 INFO mapred.JobClient:   Zookeeper server:port
15/06/04 17:07:43 INFO mapred.JobClient:     prometheusmobile.ironnetwork:22181=0
15/06/04 17:07:43 INFO mapred.JobClient:   Giraph Stats
15/06/04 17:07:43 INFO mapred.JobClient:     Aggregate edges=12
15/06/04 17:07:43 INFO mapred.JobClient:     Sent message bytes=0
15/06/04 17:07:43 INFO mapred.JobClient:     Superstep=4
15/06/04 17:07:43 INFO mapred.JobClient:     Last checkpointed superstep=0
15/06/04 17:07:43 INFO mapred.JobClient:     Current workers=1
15/06/04 17:07:43 INFO mapred.JobClient:     Aggregate sent messages=12
15/06/04 17:07:43 INFO mapred.JobClient:     Current master task partition=0
15/06/04 17:07:43 INFO mapred.JobClient:     Sent messages=0
15/06/04 17:07:43 INFO mapred.JobClient:     Aggregate finished vertices=5
15/06/04 17:07:43 INFO mapred.JobClient:     Aggregate sent message message bytes=267
15/06/04 17:07:43 INFO mapred.JobClient:     Aggregate vertices=5
15/06/04 17:07:43 INFO mapred.JobClient:   File Output Format Counters
15/06/04 17:07:43 INFO mapred.JobClient:     Bytes Written=0
15/06/04 17:07:43 INFO mapred.JobClient:   FileSystemCounters
15/06/04 17:07:43 INFO mapred.JobClient:     HDFS_BYTES_READ=200
15/06/04 17:07:43 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=43376
15/06/04 17:07:43 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=30
15/06/04 17:07:43 INFO mapred.JobClient:   File Input Format Counters
15/06/04 17:07:43 INFO mapred.JobClient:     Bytes Read=0
15/06/04 17:07:43 INFO mapred.JobClient:   Map-Reduce Framework
15/06/04 17:07:43 INFO mapred.JobClient:     Map input records=2
15/06/04 17:07:43 INFO mapred.JobClient:     Spilled Records=0
15/06/04 17:07:43 INFO mapred.JobClient:     Map output records=0
15/06/04 17:07:43 INFO mapred.JobClient:     SPLIT_RAW_BYTES=88

A New Dawn

Hey fellows! I know that quite a lot of time has passed since my last post here. I have, however, a really good excuse for it ;)

No, no, I didn’t quit Big Data, nor did I stop with Embedded Systems (to be sincere, I’ve been working a lot with the latter over the last few months). Actually, I spent all this time doing research and publishing a brand new Big Data project! I used it as my graduation thesis and it is all about using MapReduce to process CT images in order to detect lung nodules.

Yes! How about doctors using the strengths of Big Data to help in the diagnosis of the deadliest cancer in the US?
That’s what my work is all about :)
I’ll describe it in more detail later. If you want to check it out, the document (in Portuguese – but with an English abstract) is available here.

The project itself is going to be available on my GitHub soon, and when it is, I’ll post the links here too!

See you guys in a bit!

Update:

The GitHub links are below. Soon I will write a post going into more detail about how it works.
MatLung: A matlab script that shows the image processing features of the distributed software.
LungProcessor: A single machine program that is fully-code-compatible with the distributed hadoop application. Used for Hadoop code debugging.
LungPostProcessor: The Image metadata post-processor. Used to detect the lung nodules and remove false positives.
LungPostProcessor: The Hadoop image processing application

MapReduce 2.0: Hadoop with YARN

Context

After the release of Hadoop version 0.20.203 the framework was completely restructured. Actually, an almost completely new framework was created. This framework is called MapReduce 2.0 (MRv2) or YARN.

The acronym YARN stands for “Yet Another Resource Negotiator” (but if you are a GNU mate and you are into recursive acronyms, call it: YARN Application Resource Negotiator). The main goal of this modification is to make the MapReduce component less dependent on Hadoop itself. Furthermore, it improves the possibility of using programming interfaces other than MapReduce on top of it and increases the scalability of the framework.

Originally YARN was expected to be in production by the end of 2012. Guess what: that didn’t happen. Sounds like the World Cup construction in Brazil, huh? Actually, it only went from alpha to beta on the 25th of August 2013 (with version 2.1.0-beta). On the 7th of October 2013 the first stable YARN version was released, with Hadoop 2.2.0. The alpha and beta versions were, however, available in most of the third-party distributions, like Cloudera CDH4. Cloudera included warnings to remind users of the instability of the build at that time.

Why?

This is a solution to a scalability bottleneck that was happening with MapReduce 1.0. For very large clusters (and by that I mean clusters larger than 4000 nodes) Hadoop was not performing as it should. Then Yahoo thought “How about we make some modifications to the framework in order to solve these problems?”, and that is how it appeared. You can read more about it in the following link:
The Next Generation of Apache MapReduce

 

What are the differences?

One of the main characteristics of YARN is that the original JobTracker was split into two daemons: the ResourceManager and the ApplicationMaster. At first glance, this removes the single point of failure from the JobTracker and moves it to the ResourceManager. I will get into more detail on this later.

Here’s a small diagram of how it works, as depicted in the Apache Hadoop website:

Hadoop Yarn Diagram

ResourceManager

It monitors the cluster status, indicating which resources are available on which nodes. This is only possible due to the concept of a container (which I will explain later). The ResourceManager arbitrates the resources among all the applications in the system.

ApplicationMaster:

It manages the parallel execution of a job. In YARN this is done separately for each individual job. It is a framework-specific library that negotiates resources from the ResourceManager and works with the NodeManager to execute and monitor tasks.

Oh! I almost forgot to say: the TaskTracker became the NodeManager! So let’s get into it:

NodeManager:

The TaskTracker was transformed into the NodeManager, which together with the ResourceManager composes the data-computation framework. It is a per-machine framework agent that takes care of the containers by monitoring their resource usage (memory, disk, CPU, network) and reporting the results to the Scheduler inside the ResourceManager. This is quite an interesting modification, since with it other Apache frameworks no longer run on top of the Hadoop TaskTracker structure, but on the NodeManager structure that belongs to YARN. This way, they can freely implement how their agent behaves and communicates with the ApplicationMaster.

 

Back to the ResourceManager:

The ResourceManager daemon has two main components: the Scheduler and the ApplicationsManager. The Scheduler allocates resources to the various running applications, taking capacity and queue constraints into consideration. It is a pure scheduler: it does not track status or monitor applications. It only performs scheduling based on the resource requirements of each application, using the abstract notion of a ResourceContainer. The ApplicationsManager is covered a bit further below.
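To make the “capacity and queue constraints” part concrete: with the CapacityScheduler, for example, those constraints come from capacity-scheduler.xml. A minimal sketch splitting a cluster between the default queue and a hypothetical analytics queue could look like this (the values are only illustrative):

<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>default,analytics</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.default.capacity</name>
  <value>70</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.analytics.capacity</name>
  <value>30</value>
</property>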

ResourceContainer:

Also called a container – it incorporates elements such as memory, CPU, disk and network. In the early versions only memory is supported – and I guess that, at the moment I am writing this post, it is still like that.
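As an illustration of how this memory-only resource model shows up in practice, container memory is typically bounded in yarn-site.xml with properties along these lines (the values are just examples, not recommendations):

<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>8192</value> <!-- total memory a NodeManager may hand out as containers -->
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>1024</value> <!-- smallest container the Scheduler will grant -->
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>4096</value> <!-- largest container a single request may ask for -->
</property>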

Now, let’s get back to the ApplicationsManager. It accepts job submissions, negotiates the first container for executing the application-specific ApplicationMaster, and also provides the service for restarting the ApplicationMaster container on failure – therefore helping to solve one of the single points of failure that was present in the original Hadoop framework (yey! :D )

One of the major features of YARN is that it is backward compatible with non-YARN Hadoop. For this reason, applications that were developed for the previous versions generally do not need to be recompiled in order to run on a YARN cluster.