Build a Custom Hadoop/Giraph Project with Maven

Context

Hey mates! Have you run into dependency problems when packaging a jar file and submitting it to a Hadoop/Giraph cluster? On the mailing lists many people suggest using Apache Ant or Apache Maven, but without giving any concrete example of how to actually do it. In the next sections I will walk you through a quick example using Maven.

Maven startup configuration

Well, let’s assume that you have Maven properly installed already – if not, check here.

To check that Maven is properly installed, run:

mvn --version

You should see output similar to the following (the details will differ on your machine):

Apache Maven 3.0.5 (r01de14724cdef164cd33c7c8c2fe155faf9602da; 2013-02-19 14:51:28+0100)
Maven home: D:\apache-maven-3.0.5\bin\..
Java version: 1.6.0_25, vendor: Sun Microsystems Inc.
Java home: C:\Program Files\Java\jdk1.6.0_25\jre
Default locale: nl_NL, platform encoding: Cp1252
OS name: "windows 7", version: "6.1", arch: "amd64", family: "windows"

Creating the Maven Project

Ok, now create the directory where you would like your project to reside. Once you have done that, run the following command inside that empty directory. Maven will create a subdirectory named after the artifactId and put the project there.

mvn archetype:generate -DgroupId=com.marcolotz.giraph -DartifactId=tripletExample -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false

Please choose a groupId and an artifactId that match your project. The values above are just the ones I used in my personal project: the package was com.marcolotz.giraph and the project name was tripletExample.

Modifying the POM file

POM stands for Project Object Model and is the core of a project's configuration in Maven. You may have already noticed that Giraph itself contains several pom.xml files. When you created the Maven project with the command above, a pom.xml file was created as well. We now need to modify this file in order to build Giraph code, so replace the content of pom.xml with the following:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">

  <modelVersion>4.0.0</modelVersion>
  <groupId>com.marcolotz.giraph</groupId>
  <artifactId>tripletExample</artifactId>
  <packaging>jar</packaging>
  <version>1.0-SNAPSHOT</version>
  <name>tripletExample</name>
  <url>http://maven.apache.org</url>
  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>3.8.1</version>
      <scope>test</scope>
    </dependency>

    <dependency>
      <groupId>org.apache.giraph</groupId>
      <artifactId>giraph-core</artifactId>
      <version>1.1.0</version>
    </dependency>

    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-core</artifactId>
      <version>0.20.203.0</version>
    </dependency>

  </dependencies>
  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.1</version>
        <configuration>
          <source>1.7</source>
          <target>1.7</target>
        </configuration>
      </plugin>
      <plugin>
        <artifactId>maven-assembly-plugin</artifactId>
        <configuration>
          <archive>
          </archive>
          <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
          </descriptorRefs>
        </configuration>
      </plugin>

    </plugins>
  </build>
</project>
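
The archive element of the assembly plugin is left empty in the POM above. As a minimal sketch, it could carry a manifest entry naming the class that should run by default. Using org.apache.giraph.GiraphRunner here is an assumption about how you launch your job; swap in your own driver class if you have one.

```xml
<!-- Sketch: goes inside <configuration> of maven-assembly-plugin,
     in place of the empty <archive> element -->
<archive>
  <manifest>
    <!-- Assumption: the job is started through Giraph's runner class -->
    <mainClass>org.apache.giraph.GiraphRunner</mainClass>
  </manifest>
</archive>
```

With a Main-Class entry in the manifest, hadoop jar should no longer need the class name as its first argument after the jar path.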

If you want to use this to build a plain Hadoop project, make sure the POM declares the correct Hadoop version (the one your cluster runs) and remove the Giraph dependency. Also note that the "@Algorithm" annotation used in some Giraph examples is not defined in giraph-core, but in giraph-examples. The build will fail if your source code uses it and only giraph-core is declared as a dependency.
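
If your code does rely on the annotation, one option is to declare the examples module as an extra dependency. This is only a sketch: it assumes a giraph-examples artifact is available in your Maven repository at the same version as giraph-core, which you should verify before relying on it.

```xml
<!-- Sketch: add next to the giraph-core dependency inside <dependencies> -->
<dependency>
  <groupId>org.apache.giraph</groupId>
  <!-- Assumption: published under this artifactId and version -->
  <artifactId>giraph-examples</artifactId>
  <version>1.1.0</version>
</dependency>
```

The simpler alternative is to remove the annotation from your own classes.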

Inserting Custom Code and Building

Now you just need to put your Java files in the package folder (for the example above, tripletExample/src/main/java/com/marcolotz/giraph) and build the project. To build it, run the following command in the folder where the pom.xml file is located:

mvn clean compile assembly:single

This command cleans the target folder (if it is not already empty), compiles the code and packs a single jar file with all the project dependencies inside. The result ends up in the target folder; for the POM above it is named tripletExample-1.0-SNAPSHOT-jar-with-dependencies.jar. Since it contains all dependencies, this jar file may be quite large. Finally, you can submit the jar file to your Hadoop cluster ;)
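
If you would rather not type assembly:single every time, the assembly plugin can be bound to the package phase instead. The block below is a sketch of that variant, following the usage pattern documented for maven-assembly-plugin; the execution id make-assembly is just a label and can be anything.

```xml
<!-- Sketch: replaces the maven-assembly-plugin block inside <plugins> -->
<plugin>
  <artifactId>maven-assembly-plugin</artifactId>
  <configuration>
    <descriptorRefs>
      <descriptorRef>jar-with-dependencies</descriptorRef>
    </descriptorRefs>
  </configuration>
  <executions>
    <execution>
      <!-- Arbitrary label for this execution -->
      <id>make-assembly</id>
      <!-- Run the "single" goal whenever the package phase runs -->
      <phase>package</phase>
      <goals>
        <goal>single</goal>
      </goals>
    </execution>
  </executions>
</plugin>
```

With this in place, mvn clean package should leave both the regular jar and the jar-with-dependencies one in the target folder.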

Please note that there are other approaches that keep the jar file small, such as passing a class path argument to Hadoop/Giraph instead of packing everything inside the jar. This is also an elegant solution when the jar file would be too large to distribute easily across a cluster.
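
One way to leave out jars that the cluster already has is Maven's provided scope. The sketch below marks hadoop-core as provided, on the assumption that the Hadoop classes are already on the class path of every node that runs your job (normally true when you launch through the hadoop command). The predefined jar-with-dependencies descriptor should then skip it when packing.

```xml
<!-- Sketch: replaces the hadoop-core dependency inside <dependencies> -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-core</artifactId>
  <version>0.20.203.0</version>
  <!-- Needed to compile, but expected to be supplied by the cluster at run time -->
  <scope>provided</scope>
</dependency>
```

The same scope could be applied to giraph-core if Giraph is installed on the cluster, but then its jar has to be put on the class path by other means when the job is submitted.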
