Context

Hey mates! Have you ever had dependency problems when packing a jar file and submitting it to a Hadoop/Giraph cluster? Well, on the mailing lists you can see many people suggesting the use of Apache Ant or Apache Maven, but without giving any concrete example of how to actually do it. In the next sections I will give you a quick example of how to do exactly that.

Maven startup configuration

Well, let’s assume that you have Maven properly installed already – if not, check here.

To check whether Maven was properly installed, run:

mvn --version

You should see output similar to this (it may differ slightly):

Apache Maven 3.0.5 (r01de14724cdef164cd33c7c8c2fe155faf9602da; 2013-02-19 14:51:28+0100)
Maven home: D:\apache-maven-3.0.5\bin\..
Java version: 1.6.0_25, vendor: Sun Microsystems Inc.
Java home: C:\Program Files\Java\jdk1.6.0_25\jre
Default locale: nl_NL, platform encoding: Cp1252
OS name: "windows 7", version: "6.1", arch: "amd64", family: "windows"

Creating the Maven Project

Ok, now create the directory where you would like your project to reside. Once you have done that, run the following command from inside the empty project directory.

mvn archetype:generate -DgroupId=com.marcolotz.giraph -DartifactId=TripletExample -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false

Please choose a groupId and artifactId that match your project (the -D prefix just marks them as command-line properties). The names in the fields above were only examples that I used in my personal project: the code belonged to the package com.marcolotz.giraph and the project name was TripletExample.
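After the command finishes, the quickstart archetype should have generated a layout roughly like this (the exact files may vary with the archetype version):

TripletExample/
├── pom.xml
└── src/
    ├── main/java/com/marcolotz/giraph/App.java
    └── test/java/com/marcolotz/giraph/AppTest.java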

Modifying the POM file

POM stands for Project Object Model and is the core of a project's configuration in Maven. You may have already noticed that Giraph itself contains several POM files. When you created the Maven project with the command above, a pom.xml file was also created. We now need to modify this file in order to build Giraph code. Thus, change the pom.xml content to the following:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">

  <modelVersion>4.0.0</modelVersion>
  <groupId>com.marcolotz.giraph</groupId>
  <artifactId>tripletExample</artifactId>
  <packaging>jar</packaging>
  <version>1.0-SNAPSHOT</version>
  <name>tripletExample</name>
  <url>http://maven.apache.org</url>

  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>3.8.1</version>
      <scope>test</scope>
    </dependency>

    <dependency>
      <groupId>org.apache.giraph</groupId>
      <artifactId>giraph-core</artifactId>
      <version>1.1.0</version>
    </dependency>

    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-core</artifactId>
      <version>0.20.203.0</version>
    </dependency>
  </dependencies>

  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.1</version>
        <configuration>
          <source>1.7</source>
          <target>1.7</target>
        </configuration>
      </plugin>
      <plugin>
        <artifactId>maven-assembly-plugin</artifactId>
        <configuration>
          <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
          </descriptorRefs>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>

Should you use this setup to build a plain Hadoop project, just make sure that you reference the correct Hadoop version in the POM file and remove the Giraph dependency. Also, please note that the “@Algorithm” annotation used in some Giraph examples is not defined in giraph-core, but in giraph-examples. It will cause build problems if your source code contains it; a fix is sketched below.
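If you want to keep the annotation, one option is to also depend on the giraph-examples artifact. This is only a sketch, assuming the same 1.1.0 version used for giraph-core above:

  <dependency>
    <groupId>org.apache.giraph</groupId>
    <artifactId>giraph-examples</artifactId>
    <version>1.1.0</version>
  </dependency>

The simpler fix, of course, is just to delete the annotation from your source.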

Inserting Custom Code and Building

Now you just need to place your Java files in the package folder (for the example above, tripletExample/src/main/java/com/marcolotz/giraph) and build the solution. In order to build it, run the following command in the folder where the pom.xml file is located:

mvn clean compile assembly:single

This command will clean the target folder (if it is not already empty) and produce a single jar file with all the project dependencies inside. The final product, which for the POM above should be named something like tripletExample-1.0-SNAPSHOT-jar-with-dependencies.jar, will be in the target folder. Since it contains all dependencies, this jar file may be quite large. Finally, you can submit the jar file to your Hadoop cluster ;)

Please note that there are other solutions that reduce the size of the jar file, such as giving a classpath argument to hadoop/giraph instead of packing everything inside the jar. This is also an elegant solution for when the jar file would be too large to be easily distributed across a cluster.
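For completeness, here is the kind of Java file you might drop into src/main/java/com/marcolotz/giraph before building. This is only a minimal sketch against the giraph-core 1.1.0 API – the class name and the type parameters are illustrative, not part of any example above:

package com.marcolotz.giraph;

import java.io.IOException;

import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;

// Type parameters: vertex id, vertex value, edge value and message value.
public class TripletComputation extends
    BasicComputation<LongWritable, DoubleWritable, FloatWritable, DoubleWritable> {

  @Override
  public void compute(Vertex<LongWritable, DoubleWritable, FloatWritable> vertex,
      Iterable<DoubleWritable> messages) throws IOException {
    // A real algorithm would read the incoming messages and send new ones here;
    // this skeleton simply halts every vertex, ending the computation.
    vertex.voteToHalt();
  }
}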

Heeeey!

Today I had to deploy a single-node Apache Giraph installation on my machine. Giraph has a really good quick start guide (here). A few corrections need to be made to keep it up to date with the latest available version (1.1.0), but hopefully I will be able to do that this week.

There are, however, a few glitches that can happen during the installation process. Here I will present quick solutions to them.

OpenJDK management.properties problem

I must confess that I'd rather use Oracle Java than OpenJDK. As most of you may know, in order to run Hadoop you need at least Java 1.6 installed. In my current configuration, I was trying to run Hadoop with OpenJDK Java 7, on Ubuntu 14.04.2.

With all my $JAVA_HOME variables correctly set, I was getting the following error:

hduser@prometheus-UX32A:/usr/lib/jvm/java-7-openjdk-amd64/jre/lib/management$ hadoop namenode -format
Error: Config file not found: /usr/lib/jvm/java-7-openjdk-amd64/jre/lib/management/management.properties

This error also appeared for other Hadoop scripts like start-dfs.sh and start-mapred.sh – and of course start-all.sh.

Taking a quick look at the installation, I realized that management.properties was just a broken symbolic link. It also looks like it is directly related to Java RMI (Remote Method Invocation) capabilities, which is probably how Hadoop starts all the daemons across the nodes. I must confess that I installed all the OpenJDK packages available in the Ubuntu repository and none of them solved my problem.

This can be solved with the following workaround:

  1. Download the equivalent Oracle Java version HERE. Make sure to download the same major version as your OpenJDK (i.e. download Java 6, 7 or 8, depending on your JDK).
  2. Open the Oracle Java archive and copy the file jre/lib/management/management.properties into the equivalent OpenJDK folder (see the command sketched below).
  3. Run the failing command again.
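The copy boils down to something like this (the Oracle JDK path is illustrative and depends on where you unpacked the download; the OpenJDK path is the one from the error message above):

sudo cp /path/to/oracle-jdk/jre/lib/management/management.properties /usr/lib/jvm/java-7-openjdk-amd64/jre/lib/management/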

Hopefully, this will solve that problem.

Giraph Mapper Allocation problem

Let’s suppose you installed Giraph correctly and the Hadoop MapReduce examples are running as you expect on your machine.

Giraph, however, gets stuck on this very interesting line when you submit a job:

15/06/04 16:58:03 INFO job.GiraphJob: Waiting for resources... Job will start only when it gets all 2 mappers

If you want to make sure that we are talking about the same thing, here is a complete error output:

hduser@prometheus-UX32A:~$ hadoop jar /opt/apache/giraph/giraph-examples/target/giraph-examples-1.2.0-SNAPSHOT-for-hadoop-1.2.1-jar-with-dependencies.jar org.apache.giraph.GiraphRunner org.apache.giraph.examples.SimpleShortestPathsComputation -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat -vip /user/hduser/input/tiny_graph.txt -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat -op /user/hduser/output/shortestpaths -w 1
15/06/04 16:57:58 INFO utils.ConfigurationUtils: No edge input format specified. Ensure your InputFormat does not require one.
15/06/04 16:57:58 INFO utils.ConfigurationUtils: No edge output format specified. Ensure your OutputFormat does not require one.
15/06/04 16:57:59 INFO job.GiraphJob: run: Since checkpointing is disabled (default), do not allow any task retries (setting mapred.map.max.attempts = 1, old value = 4)
15/06/04 16:58:03 INFO job.GiraphJob: Tracking URL: http://hdhost:50030/jobdetails.jsp?jobid=job_201506041652_0003
15/06/04 16:58:03 INFO job.GiraphJob: Waiting for resources... Job will start only when it gets all 2 mappers
15/06/04 16:58:31 INFO mapred.JobClient: Running job: job_201506041652_0003
15/06/04 16:58:31 INFO mapred.JobClient: Job complete: job_201506041652_0003
15/06/04 16:58:31 INFO mapred.JobClient: Counters: 5
15/06/04 16:58:31 INFO mapred.JobClient:   Job Counters
15/06/04 16:58:31 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=31987
15/06/04 16:58:31 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
15/06/04 16:58:31 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
15/06/04 16:58:31 INFO mapred.JobClient:     Launched map tasks=2
15/06/04 16:58:31 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0

Good news: it is not your fault (unless you think that giving your machine a bad hostname is your fault). According to this JIRA, there is a problem parsing hostnames: basically, Giraph does not handle hostnames that contain uppercase letters. This can be solved by setting a new hostname with only lowercase letters. To do this, run as sudo:

 sudo hostname NEW_HOST_NAME 

Also, do not forget to update your /etc/hosts table with the new hostname. Keep in mind that this may affect other software that relies on your hostname to run. Once you have changed it, restart the machine and the Hadoop daemons.
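On Ubuntu, the relevant /etc/hosts entries would look roughly like this (the hostname below is just an illustration – use your new, all-lowercase name; note that hostname(1) alone is not persistent across reboots, so you may also want to update /etc/hostname):

127.0.0.1    localhost
127.0.1.1    prometheusmobile

Once everything is back up, hopefully you will get the correct output, which in my case is: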

hduser@prometheusmobile:/opt/apache/giraph/giraph-examples$ hadoop jar target/giraph-examples-1.2.0-SNAPSHOT-for-hadoop-1.2.1-jar-with-dependencies.jar org.apache.giraph.GiraphRunner org.apache.giraph.examples.SimpleShortestPathsComputation -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat -vip /user/hduser/input/tiny_graph.txt -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat -op /user/hduser/output/shortestpaths -w 1
15/06/04 17:06:58 INFO utils.ConfigurationUtils: No edge input format specified. Ensure your InputFormat does not require one.
15/06/04 17:06:58 INFO utils.ConfigurationUtils: No edge output format specified. Ensure your OutputFormat does not require one.
15/06/04 17:06:58 INFO job.GiraphJob: run: Since checkpointing is disabled (default), do not allow any task retries (setting mapred.map.max.attempts = 1, old value = 4)
15/06/04 17:07:02 INFO job.GiraphJob: Tracking URL: http://hdhost:50030/jobdetails.jsp?jobid=job_201506041705_0002
15/06/04 17:07:02 INFO job.GiraphJob: Waiting for resources... Job will start only when it gets all 2 mappers
15/06/04 17:07:31 INFO job.HaltApplicationUtils$DefaultHaltInstructionsWriter: writeHaltInstructions: To halt after next superstep execute: 'bin/halt-application --zkServer prometheusmobile.ironnetwork:22181 --zkNode /_hadoopBsp/job_201506041705_0002/_haltComputation'
15/06/04 17:07:31 INFO mapred.JobClient: Running job: job_201506041705_0002
15/06/04 17:07:32 INFO mapred.JobClient:  map 100% reduce 0%
15/06/04 17:07:43 INFO mapred.JobClient: Job complete: job_201506041705_0002
15/06/04 17:07:43 INFO mapred.JobClient: Counters: 37
15/06/04 17:07:43 INFO mapred.JobClient:   Zookeeper halt node
15/06/04 17:07:43 INFO mapred.JobClient:     /_hadoopBsp/job_201506041705_0002/_haltComputation=0
15/06/04 17:07:43 INFO mapred.JobClient:   Zookeeper base path
15/06/04 17:07:43 INFO mapred.JobClient:     /_hadoopBsp/job_201506041705_0002=0
15/06/04 17:07:43 INFO mapred.JobClient:   Job Counters
15/06/04 17:07:43 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=40730
15/06/04 17:07:43 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
15/06/04 17:07:43 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
15/06/04 17:07:43 INFO mapred.JobClient:     Launched map tasks=2
15/06/04 17:07:43 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
15/06/04 17:07:43 INFO mapred.JobClient:   Giraph Timers
15/06/04 17:07:43 INFO mapred.JobClient:     Input superstep (ms)=188
15/06/04 17:07:43 INFO mapred.JobClient:     Total (ms)=9260
15/06/04 17:07:43 INFO mapred.JobClient:     Superstep 2 SimpleShortestPathsComputation (ms)=62
15/06/04 17:07:43 INFO mapred.JobClient:     Shutdown (ms)=8795
15/06/04 17:07:43 INFO mapred.JobClient:     Superstep 0 SimpleShortestPathsComputation (ms)=75
15/06/04 17:07:43 INFO mapred.JobClient:     Initialize (ms)=3254
15/06/04 17:07:43 INFO mapred.JobClient:     Superstep 3 SimpleShortestPathsComputation (ms)=44
15/06/04 17:07:43 INFO mapred.JobClient:     Superstep 1 SimpleShortestPathsComputation (ms)=61
15/06/04 17:07:43 INFO mapred.JobClient:     Setup (ms)=32
15/06/04 17:07:43 INFO mapred.JobClient:   Zookeeper server:port
15/06/04 17:07:43 INFO mapred.JobClient:     prometheusmobile.ironnetwork:22181=0
15/06/04 17:07:43 INFO mapred.JobClient:   Giraph Stats
15/06/04 17:07:43 INFO mapred.JobClient:     Aggregate edges=12
15/06/04 17:07:43 INFO mapred.JobClient:     Sent message bytes=0
15/06/04 17:07:43 INFO mapred.JobClient:     Superstep=4
15/06/04 17:07:43 INFO mapred.JobClient:     Last checkpointed superstep=0
15/06/04 17:07:43 INFO mapred.JobClient:     Current workers=1
15/06/04 17:07:43 INFO mapred.JobClient:     Aggregate sent messages=12
15/06/04 17:07:43 INFO mapred.JobClient:     Current master task partition=0
15/06/04 17:07:43 INFO mapred.JobClient:     Sent messages=0
15/06/04 17:07:43 INFO mapred.JobClient:     Aggregate finished vertices=5
15/06/04 17:07:43 INFO mapred.JobClient:     Aggregate sent message message bytes=267
15/06/04 17:07:43 INFO mapred.JobClient:     Aggregate vertices=5
15/06/04 17:07:43 INFO mapred.JobClient:   File Output Format Counters
15/06/04 17:07:43 INFO mapred.JobClient:     Bytes Written=0
15/06/04 17:07:43 INFO mapred.JobClient:   FileSystemCounters
15/06/04 17:07:43 INFO mapred.JobClient:     HDFS_BYTES_READ=200
15/06/04 17:07:43 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=43376
15/06/04 17:07:43 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=30
15/06/04 17:07:43 INFO mapred.JobClient:   File Input Format Counters
15/06/04 17:07:43 INFO mapred.JobClient:     Bytes Read=0
15/06/04 17:07:43 INFO mapred.JobClient:   Map-Reduce Framework
15/06/04 17:07:43 INFO mapred.JobClient:     Map input records=2
15/06/04 17:07:43 INFO mapred.JobClient:     Spilled Records=0
15/06/04 17:07:43 INFO mapred.JobClient:     Map output records=0
15/06/04 17:07:43 INFO mapred.JobClient:     SPLIT_RAW_BYTES=88

Hey fellows! I know that quite a decent amount of time has passed since my last post here. I have, however, a really good excuse for it ;)

No, no, I didn’t quit Big Data, nor did I stop with Embedded Systems (to be honest, I’ve been working a lot with the latter over the past month). Actually, I spent all this time doing research and publishing a brand new Big Data project! I used it as my graduation thesis, and it is all about using MapReduce to process CT images to detect lung nodules.

Yes! How about doctors using the strengths of Big Data to help in the diagnosis of the deadliest cancer in the US?
It’s all about my work :)
I’ll describe it in more detail later. If you want to check it out, the document (in Portuguese – but with an English abstract) is available here.

The project itself is going to be available on my GitHub soon, and when it is, I’ll post the links here too!

See you guys in a bit!

Update:

The GitHub links are here. Soon I will write a post going into more detail about how it all works.
MatLung: A MATLAB script that shows the image processing features of the distributed software.
LungProcessor: A single-machine program that is fully code-compatible with the distributed Hadoop application. Used for Hadoop code debugging.
LungPostProcessor: The image metadata post-processor. Used to detect the lung nodules and remove false positives.
LungPostProcessor: The Hadoop image processing application

Graphs

“How cute! So many arrows and lines!”
– Anyone who never had to traverse a large graph.

Finally, I will be able to quote myself, yay!

"Graphs, in computer science, are an abstract data type that implements the graph and hypergraph concepts from mathematics. This data structure has a finite – and possibly mutable – set of ordered pairs of nodes (or vertices), called edges (or arcs). An edge can be described mathematically as edge(x, y), which is said to go from x to y. There may be a data structure associated with each edge, called the edge value. This structure can represent a numeric attribute (for example, a cost) or just a symbolic label. There is also the important definition of adjacent(G, x, y), which is a function that tests, for a given graph data structure G, whether the edge(x, y) exists."

As mentioned in (Lotz, 2013).
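To make that definition concrete, here is a minimal sketch of such a structure in Java. It is a toy adjacency-map implementation, not how Giraph or Pregel stores graphs, and all names are illustrative:

import java.util.HashMap;
import java.util.Map;

// A tiny directed graph with edge values, stored as adjacency maps.
public class DirectedGraph<V, E> {

    // adjacency.get(x) maps each neighbour y of x to the value of edge(x, y).
    private final Map<V, Map<V, E>> adjacency = new HashMap<>();

    // Adds edge(x, y) carrying the given edge value.
    public void addEdge(V x, V y, E value) {
        Map<V, E> out = adjacency.get(x);
        if (out == null) {
            out = new HashMap<>();
            adjacency.put(x, out);
        }
        out.put(y, value);
    }

    // adjacent(G, x, y): tests whether edge(x, y) exists in this graph.
    public boolean adjacent(V x, V y) {
        Map<V, E> out = adjacency.get(x);
        return out != null && out.containsKey(y);
    }

    // Returns the edge value of edge(x, y), or null if the edge does not exist.
    public E edgeValue(V x, V y) {
        Map<V, E> out = adjacency.get(x);
        return out == null ? null : out.get(y);
    }
}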

There are many flavours of graphs, each of them taking into account some detail of the graph structure. For example, an undirected graph is a graph in which the edges have no orientation, as opposed to directed graphs, in which they do.

Here’s a quick example of a directed graph:

Example of a directed graph (Wolfram Alpha)

Large Scale Graph Processing Problem


“Houston, we have a problem”
– The same person, after having to compute a large Graph.

In big data analysis, graphs are usually known to be hard to compute. (Malewicz et al., 2010) (not my quote this time) points out that the main factors contributing to these difficulties are:

  1. Poor locality of memory access.
  2. Very little work per vertex.
  3. Changing degree of parallelism.
  4. Running over many machines makes the problem worse (i.e. how will one machine communicate with another without creating an I/O bottleneck?).

Large Scale Graph Pre-Pregel Solutions

Before the introduction of Pregel (Malewicz et al., 2010) – yep, the same guy that I cited above… that’s because he knows everything in this field – the state-of-the-art solutions to this problem were:

  • The implementation of a customised framework: this demands lots of engineering effort. It’s like: “Oh, I have a large graph problem that will demand a Big Data approach. Since I have nothing else to do, I will implement my own framework for the problem. It’s OK if it only takes 10 years and ends up really problem-specific.”
  • Implementation of a graph analysis algorithm on the MapReduce platform: this proves to be inefficient, since too much data has to be stored after each MapReduce job and too much communication is required between the stages (which may cause excessive I/O traffic). Aside from that, it is quite hard to implement graph algorithms, since the framework offers no way to express computations in a vertex-oriented fashion.

In order to completely avoid the I/O problem, one may use a single-computer graph solution – did you say a supercomputer? This would, however, require non-commodity hardware and would not be scalable. The distributed graph systems that existed before the introduction of Pregel were not fault tolerant, which could cause the loss of a huge amount of work in the case of any failure.

And that’s why my next post will be about Pregel, a large-scale graph processing framework developed by Google.