Marco Aurelio Lotz

Thoughts about Big Data and Embedded Systems

Build a Custom Hadoop/Giraph Project with Maven

June 18, 2015

Marco Aurélio Lotz

Big Data

Context

Hey mates! Have you had dependency problems when packing a jar file and submitting it to a Hadoop/Giraph cluster? On the mailing lists you can see many people suggesting Apache Ant or Apache Maven, but without giving any concrete example of how to actually do it. In the next sections I will give you a quick example of how it is done.

Maven startup configuration

Well, let’s assume that you already have Maven properly installed. If not, you can download Maven here.

To check if maven was properly installed, run:

mvn --version

You should see an output similar to (may differ slightly):

Apache Maven 3.0.5 (r01de14724cdef164cd33c7c8c2fe155faf9602da; 2013-02-19 14:51:28+0100)
Maven home: D:\apache-maven-3.0.5\bin\..
Java version: 1.6.0_25, vendor: Sun Microsystems Inc.
Java home: C:\Program Files\Java\jdk1.6.0_25\jre
Default locale: nl_NL, platform encoding: Cp1252
OS name: "windows 7", version: "6.1", arch: "amd64", family: "windows"

Creating the Maven Project

Ok, now create the directory where you would like your project to reside. Once you have done that, run the following command in the empty project directory:

mvn archetype:generate -DgroupId=com.marcolotz.giraph -DartifactId=TripletExample -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false

Please choose a groupId and an artifactId (the -DgroupId and -DartifactId arguments) that match your project. The values above are just the ones that I used in my personal project: it belonged to the package com.marcolotz.giraph and the project name was TripletExample.

Modifying the POM file

POM stands for Project Object Model and is the core of a project's configuration in Maven. You may have already noticed that Giraph contains several pom.xml files. When you created the Maven project with the command above, a pom.xml file was also created. We now need to modify this file in order to build Giraph code. Thus, change the content of pom.xml to the following:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.marcolotz.giraph</groupId>
  <artifactId>tripletExample</artifactId>
  <packaging>jar</packaging>
  <version>1.0-SNAPSHOT</version>
  <name>tripletExample</name>
  <url>http://maven.apache.org</url>
  <dependencies>
      <dependency>
          <groupId>junit</groupId>
          <artifactId>junit</artifactId>
          <version>3.8.1</version>
          <scope>test</scope>
      </dependency>

      <!-- Giraph itself; remove this dependency for a plain Hadoop project -->
      <dependency>
          <groupId>org.apache.giraph</groupId>
          <artifactId>giraph-core</artifactId>
          <version>1.1.0</version>
      </dependency>

      <!-- Use the Hadoop version that matches your cluster -->
      <dependency>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-core</artifactId>
          <version>0.20.203.0</version>
      </dependency>

  </dependencies>
  <build>
      <plugins>
          <plugin>
              <groupId>org.apache.maven.plugins</groupId>
              <artifactId>maven-compiler-plugin</artifactId>
              <version>3.1</version>
              <configuration>
                  <source>1.7</source>
                  <target>1.7</target>
              </configuration>
          </plugin>
          <!-- Packs the project and all of its dependencies into a single jar -->
          <plugin>
              <artifactId>maven-assembly-plugin</artifactId>
              <configuration>
                  <archive>
                  </archive>
                  <descriptorRefs>
                      <descriptorRef>jar-with-dependencies</descriptorRef>
                  </descriptorRefs>
              </configuration>
          </plugin>

      </plugins>
  </build>
</project>

Should you use it to build a plain Hadoop project, just remove the Giraph dependency and make sure that the pom file uses the correct Hadoop version. Also please note that the “@Algorithm” annotation used in some Giraph examples is not defined in giraph-core, but in giraph-examples. This will cause build problems should your source code contain it.

Inserting Custom Code and Building

Now you just need to put your Java files in the package folder (for the example above, tripletExample/src/main/java/com/marcolotz/giraph) and build the solution. In order to build it, run the following command in the folder where the pom.xml file is located:

mvn clean compile assembly:single

This command will clean the target folder (if not already empty) and prepare a single jar file with all the project dependencies inside. The final product of the command will be in the target folder. Since it contains all the dependencies, this jar file may be quite large. Finally, you can submit the jar file to your Hadoop cluster ;)
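As a rough sketch of that last step, the snippet below derives the name that the assembly plugin gives to the jar and submits it through GiraphRunner. The computation class com.marcolotz.giraph.TripletComputation and the output path are hypothetical placeholders, and the input path and formats are only sample values; replace them with your own.

```shell
# The assembly plugin names the fat jar <artifactId>-<version>-jar-with-dependencies.jar
assembly_jar() {
  printf 'target/%s-%s-jar-with-dependencies.jar\n' "$1" "$2"
}

JAR="$(assembly_jar tripletExample 1.0-SNAPSHOT)"
# Hypothetical computation class: replace it with the one in your project.
COMPUTATION="com.marcolotz.giraph.TripletComputation"

if [ -f "$JAR" ] && command -v hadoop >/dev/null 2>&1; then
  # Input/output paths and formats are placeholders for illustration only.
  hadoop jar "$JAR" org.apache.giraph.GiraphRunner "$COMPUTATION" \
    -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat \
    -vip /user/hduser/input/tiny_graph.txt \
    -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
    -op /user/hduser/output/triplets \
    -w 1
else
  echo "Build the jar first (expected at $JAR) and make sure hadoop is on the PATH."
fi
```

Run it from the folder that contains pom.xml, since the jar path is relative to the target folder.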

Please note that there are other ways to reduce the size of the jar file, such as passing a class path argument to Hadoop/Giraph instead of packing everything inside the jar. This is also an elegant solution when the jar file would be too large to be easily distributed in a cluster.
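One possible shape of the class path approach, as an untested sketch: build a thin jar with mvn clean package, copy the dependencies next to it with mvn dependency:copy-dependencies -DincludeScope=runtime (which writes to target/dependency by default), and hand them to Hadoop through HADOOP_CLASSPATH and the generic -libjars option. The helper name join_jars is my own, and the snippet only prints the commands instead of running them.

```shell
# Prints the *.jar files found in directory $1, joined by the separator $2.
join_jars() {
  out=""
  for jar in "$1"/*.jar; do
    [ -e "$jar" ] || continue          # skip the unexpanded pattern when no jar exists
    out="${out:+$out$2}$jar"
  done
  printf '%s\n' "$out"
}

# Assumption: dependencies were copied with
#   mvn dependency:copy-dependencies -DincludeScope=runtime
DEPS_DIR="target/dependency"
# Thin jar produced by: mvn clean package
THIN_JAR="target/tripletExample-1.0-SNAPSHOT.jar"

# HADOOP_CLASSPATH (colon-separated) is for the client JVM,
# -libjars (comma-separated) ships the jars to the tasks.
echo "export HADOOP_CLASSPATH=\"$(join_jars "$DEPS_DIR" :)\""
echo "hadoop jar $THIN_JAR org.apache.giraph.GiraphRunner -libjars $(join_jars "$DEPS_DIR" ,) <your computation class and options>"
```

Printing the commands first makes it easy to check the jar list before anything is submitted to the cluster.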

Giraph configuration details

June 4, 2015

Marco Aurélio Lotz

Big Data

Heeeey!

Today I had to deploy a single-node Apache Giraph installation on my machine. Giraph has a really good quick start guide. A few corrections need to be made in order to keep it up-to-date with the latest version available (1.1.0), but hopefully I will be able to do it this week.

There are, however, a few glitches that can happen during the installation process. Here I will present quick solutions to them.

OpenJDK management.properties problem

I must confess that I prefer Oracle Java to OpenJDK. As most of you may know, in order to run Hadoop you need at least Java 1.6 installed. In my current configuration, I was trying to run Hadoop with OpenJDK Java 7 on Ubuntu 14.04.2.

With all my $JAVA_HOME variables correctly set, I was getting the following error:

hduser@prometheus-UX32A:/usr/lib/jvm/java-7-openjdk-amd64/jre/lib/management$ hadoop namenode -format
Error: Config file not found: /usr/lib/jvm/java-7-openjdk-amd64/jre/lib/management/management.properties

This error also appeared for other Hadoop scripts like start-dfs.sh, start-mapred.sh and, of course, start-all.sh.

Taking a quick look at the installation, I realized that management.properties was just a broken symbolic link. It also looks like it is directly related to the Java RMI (Remote Method Invocation) capabilities. This is probably how Hadoop is going to start all the daemons across the nodes. I must confess that I installed all the OpenJDK packages available in the Ubuntu repository and none of them solved my problem.

This can be solved with the following workaround:

Download the equivalent Oracle Java version. Remember to download the same version as your OpenJDK (i.e. download Java 6, 7 or 8 depending on your JDK).
Open the Oracle Java files and copy the file /jre/lib/management/management.properties into the equivalent OpenJDK folder.
Run the Hadoop command again.

Hopefully, this will solve that problem.
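The workaround above can be sketched as a small check. The OpenJDK path is the one from the error message; the Oracle path is a hypothetical placeholder for wherever you unpacked the download. The snippet only prints the suggested commands, and it removes the broken link before copying because cp may refuse to write through a dangling symlink.

```shell
# Succeeds when $1 is a symbolic link whose target does not exist.
is_broken_symlink() {
  [ -L "$1" ] && [ ! -e "$1" ]
}

# Path taken from the error message; override it to match your installation.
OPENJDK_MGMT="${OPENJDK_MGMT:-/usr/lib/jvm/java-7-openjdk-amd64/jre/lib/management/management.properties}"
# Hypothetical location of the unpacked Oracle Java download.
ORACLE_MGMT="${ORACLE_MGMT:-/path/to/oracle-java/jre/lib/management/management.properties}"

if is_broken_symlink "$OPENJDK_MGMT"; then
  echo "Broken symbolic link: $OPENJDK_MGMT"
  echo "Suggested fix:"
  echo "  sudo rm \"$OPENJDK_MGMT\""
  echo "  sudo cp \"$ORACLE_MGMT\" \"$OPENJDK_MGMT\""
else
  echo "No broken symbolic link at $OPENJDK_MGMT"
fi
```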

Giraph Mapper Allocation problem

Let’s suppose you installed Giraph correctly and the Hadoop MapReduce examples are running as you expect on your machine.

Giraph gives, however, this very interesting line when you submit a job:

15/06/04 16:58:03 INFO job.GiraphJob: Waiting for resources... Job will start only when it gets all 2 mappers

If you want to make sure that we are talking about the same thing, here is the complete output of the failing run:

hduser@prometheus-UX32A:~$ hadoop jar /opt/apache/giraph/giraph-examples/target/giraph-examples-1.2.0-SNAPSHOT-for-hadoop-1.2.1-jar-with-dependencies.jar org.apache.giraph.GiraphRunner org.apache.giraph.examples.SimpleShortestPathsComputation -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat -vip /user/hduser/input/tiny_graph.txt -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat -op /user/hduser/output/shortestpaths -w 1
15/06/04 16:57:58 INFO utils.ConfigurationUtils: No edge input format specified. Ensure your InputFormat does not require one.
15/06/04 16:57:58 INFO utils.ConfigurationUtils: No edge output format specified. Ensure your OutputFormat does not require one.
15/06/04 16:57:59 INFO job.GiraphJob: run: Since checkpointing is disabled (default), do not allow any task retries (setting mapred.map.max.attempts = 1, old value = 4)
15/06/04 16:58:03 INFO job.GiraphJob: Tracking URL: http://hdhost:50030/jobdetails.jsp?jobid=job_201506041652_0003
15/06/04 16:58:03 INFO job.GiraphJob: Waiting for resources... Job will start only when it gets all 2 mappers
15/06/04 16:58:31 INFO mapred.JobClient: Running job: job_201506041652_0003
15/06/04 16:58:31 INFO mapred.JobClient: Job complete: job_201506041652_0003
15/06/04 16:58:31 INFO mapred.JobClient: Counters: 5
15/06/04 16:58:31 INFO mapred.JobClient:   Job Counters
15/06/04 16:58:31 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=31987
15/06/04 16:58:31 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
15/06/04 16:58:31 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
15/06/04 16:58:31 INFO mapred.JobClient:     Launched map tasks=2
15/06/04 16:58:31 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0

Good news: it is not your fault (unless you think that giving bad hostnames to your machine is your fault). According to this JIRA, there is a problem parsing hostnames. Basically, Giraph does not cope with hostnames that mix upper-case and lower-case letters. This can be solved by setting a new hostname that contains only lower-case letters. To do this, run as root:

 sudo hostname NEW_HOST_NAME
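To see whether your machine is affected before renaming it, a quick check along these lines might help. The function name has_uppercase is my own, and the snippet only prints a lower-case suggestion without changing anything.

```shell
# Succeeds when the given name contains at least one upper-case letter.
has_uppercase() {
  case "$1" in
    *[[:upper:]]*) return 0 ;;
    *) return 1 ;;
  esac
}

# Fall back to uname -n on systems without the hostname command.
current="$(hostname 2>/dev/null || uname -n)"

if has_uppercase "$current"; then
  echo "Hostname '$current' contains upper-case letters. Lower-case suggestion:"
  printf '%s\n' "$current" | tr '[:upper:]' '[:lower:]'
else
  echo "Hostname '$current' has no upper-case letters."
fi
```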

Also, do not forget to update your /etc/hosts file with the new hostname. Keep in mind that this may affect other software that relies on your hostname. Once you have changed it, restart the machine and the Hadoop daemons. Hopefully you will get the correct output, which in my case is:

hduser@prometheusmobile:/opt/apache/giraph/giraph-examples$ hadoop jar target/giraph-examples-1.2.0-SNAPSHOT-for-hadoop-1.2.1-jar-with-dependencies.jar org.apache.giraph.GiraphRunner org.apache.giraph.examples.SimpleShortestPathsComputation -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat -vip /user/hduser/input/tiny_graph.txt -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat -op /user/hduser/output/shortestpaths -w 1
15/06/04 17:06:58 INFO utils.ConfigurationUtils: No edge input format specified. Ensure your InputFormat does not require one.
15/06/04 17:06:58 INFO utils.ConfigurationUtils: No edge output format specified. Ensure your OutputFormat does not require one.
15/06/04 17:06:58 INFO job.GiraphJob: run: Since checkpointing is disabled (default), do not allow any task retries (setting mapred.map.max.attempts = 1, old value = 4)
15/06/04 17:07:02 INFO job.GiraphJob: Tracking URL: http://hdhost:50030/jobdetails.jsp?jobid=job_201506041705_0002
15/06/04 17:07:02 INFO job.GiraphJob: Waiting for resources... Job will start only when it gets all 2 mappers
15/06/04 17:07:31 INFO job.HaltApplicationUtils$DefaultHaltInstructionsWriter: writeHaltInstructions: To halt after next superstep execute: 'bin/halt-application --zkServer prometheusmobile.ironnetwork:22181 --zkNode /_hadoopBsp/job_201506041705_0002/_haltComputation'
15/06/04 17:07:31 INFO mapred.JobClient: Running job: job_201506041705_0002
15/06/04 17:07:32 INFO mapred.JobClient:  map 100% reduce 0%
15/06/04 17:07:43 INFO mapred.JobClient: Job complete: job_201506041705_0002
15/06/04 17:07:43 INFO mapred.JobClient: Counters: 37
15/06/04 17:07:43 INFO mapred.JobClient:   Zookeeper halt node
15/06/04 17:07:43 INFO mapred.JobClient:     /_hadoopBsp/job_201506041705_0002/_haltComputation=0
15/06/04 17:07:43 INFO mapred.JobClient:   Zookeeper base path
15/06/04 17:07:43 INFO mapred.JobClient:     /_hadoopBsp/job_201506041705_0002=0
15/06/04 17:07:43 INFO mapred.JobClient:   Job Counters
15/06/04 17:07:43 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=40730
15/06/04 17:07:43 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
15/06/04 17:07:43 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
15/06/04 17:07:43 INFO mapred.JobClient:     Launched map tasks=2
15/06/04 17:07:43 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
15/06/04 17:07:43 INFO mapred.JobClient:   Giraph Timers
15/06/04 17:07:43 INFO mapred.JobClient:     Input superstep (ms)=188
15/06/04 17:07:43 INFO mapred.JobClient:     Total (ms)=9260
15/06/04 17:07:43 INFO mapred.JobClient:     Superstep 2 SimpleShortestPathsComputation (ms)=62
15/06/04 17:07:43 INFO mapred.JobClient:     Shutdown (ms)=8795
15/06/04 17:07:43 INFO mapred.JobClient:     Superstep 0 SimpleShortestPathsComputation (ms)=75
15/06/04 17:07:43 INFO mapred.JobClient:     Initialize (ms)=3254
15/06/04 17:07:43 INFO mapred.JobClient:     Superstep 3 SimpleShortestPathsComputation (ms)=44
15/06/04 17:07:43 INFO mapred.JobClient:     Superstep 1 SimpleShortestPathsComputation (ms)=61
15/06/04 17:07:43 INFO mapred.JobClient:     Setup (ms)=32
15/06/04 17:07:43 INFO mapred.JobClient:   Zookeeper server:port
15/06/04 17:07:43 INFO mapred.JobClient:     prometheusmobile.ironnetwork:22181=0
15/06/04 17:07:43 INFO mapred.JobClient:   Giraph Stats
15/06/04 17:07:43 INFO mapred.JobClient:     Aggregate edges=12
15/06/04 17:07:43 INFO mapred.JobClient:     Sent message bytes=0
15/06/04 17:07:43 INFO mapred.JobClient:     Superstep=4
15/06/04 17:07:43 INFO mapred.JobClient:     Last checkpointed superstep=0
15/06/04 17:07:43 INFO mapred.JobClient:     Current workers=1
15/06/04 17:07:43 INFO mapred.JobClient:     Aggregate sent messages=12
15/06/04 17:07:43 INFO mapred.JobClient:     Current master task partition=0
15/06/04 17:07:43 INFO mapred.JobClient:     Sent messages=0
15/06/04 17:07:43 INFO mapred.JobClient:     Aggregate finished vertices=5
15/06/04 17:07:43 INFO mapred.JobClient:     Aggregate sent message message bytes=267
15/06/04 17:07:43 INFO mapred.JobClient:     Aggregate vertices=5
15/06/04 17:07:43 INFO mapred.JobClient:   File Output Format Counters
15/06/04 17:07:43 INFO mapred.JobClient:     Bytes Written=0
15/06/04 17:07:43 INFO mapred.JobClient:   FileSystemCounters
15/06/04 17:07:43 INFO mapred.JobClient:     HDFS_BYTES_READ=200
15/06/04 17:07:43 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=43376
15/06/04 17:07:43 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=30
15/06/04 17:07:43 INFO mapred.JobClient:   File Input Format Counters
15/06/04 17:07:43 INFO mapred.JobClient:     Bytes Read=0
15/06/04 17:07:43 INFO mapred.JobClient:   Map-Reduce Framework
15/06/04 17:07:43 INFO mapred.JobClient:     Map input records=2
15/06/04 17:07:43 INFO mapred.JobClient:     Spilled Records=0
15/06/04 17:07:43 INFO mapred.JobClient:     Map output records=0
15/06/04 17:07:43 INFO mapred.JobClient:     SPLIT_RAW_BYTES=88

Electronics for Kids

January 19, 2015

Marco Aurélio Lotz

Embedded Systems

Hey mates!

Originally I had planned to release this as a Christmas Gift for all the readers and followers, but I ran short on time.

A mate of mine had the amazing idea to write an introductory book on electronics for kids, just to be able to explain to his son, in a really simple way, the principles behind a few well-known components and circuits. When he showed me his work, I found it so brilliant that I asked his permission to translate and adapt the material into other languages.

This material is dedicated to all the Electronic Engineers/fans/adepts who always wanted to explain to their beloved kids, in a simple way, a bit more about their work.

I hope you enjoy.

Update:

I would like to say a special thanks to Maína Worm Silva for fixing some typos in the Portuguese version of the document. Should you also find a typo in any language version of the document, please let me know ;)

Electronics for Kids – German Version
Electronics for Kids – English Version
Electronics for Kids – Portuguese Version