What is the Streaming Interface?
Let’s suppose you are not friends with Java. Let’s suppose you don’t like Java at all. Whether it is due to its slower Virtual Machine performance (although there are Java programs that run faster than optimized C in some special scenarios) or just because it is too high-level for you, that is fine. Hadoop understands you, and for this reason it implements the so-called Streaming Interface.
The Streaming Interface allows virtually any application to use Hadoop for MapReduce. Thus, you can write your mappers in Python or Jython, or just supply binaries (even shell binaries!).
The basic syntax for invoking it is the following:
$ hadoop jar $HADOOP_HOME/hadoop-streaming.jar -input [input dir] -output [output dir] -mapper [map application] -reducer [reducer application]
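To make this concrete, here is a minimal word-count pair in Python - a sketch of my own, not something shipped with Hadoop, and the file names mapper.py and reducer.py are made up. A Streaming mapper simply reads lines from standard input and writes tab-separated key-value pairs to standard output; the reducer receives its input already sorted by key:

#!/usr/bin/env python3
# mapper.py - emits one "word<TAB>1" pair per word read from stdin
import sys

for line in sys.stdin:
    for word in line.split():
        print(word + "\t1")

#!/usr/bin/env python3
# reducer.py - sums the counts per word; Streaming delivers reducer input sorted by key
import sys

current_word = None
count = 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t", 1)
    if word != current_word:
        # key changed: flush the count accumulated for the previous word
        if current_word is not None:
            print(current_word + "\t" + str(count))
        current_word = word
        count = 0
    count += int(value)
if current_word is not None:
    print(current_word + "\t" + str(count))

You would submit them with something like the following (the -file option ships the scripts to the cluster nodes):

$ hadoop jar $HADOOP_HOME/hadoop-streaming.jar -input [input dir] -output [output dir] -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py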
One needs to keep in mind that supplying a mapper is mandatory, since Hadoop’s default (identity) mapper won’t work here. This is because of the default input format, TextInputFormat: it generates LongWritable keys and Text values, while by convention Streaming keys and values are both of type Text. Since there is no conversion from LongWritable to Text, the identity mapper would pass a LongWritable key straight through and the job would tragically fail.
Yeah, yeah. But isn’t the LongWritable key produced by TextInputFormat really important? Not really. In TextInputFormat, the LongWritable key is just the byte offset of the line within the file, and the Text value is the line itself. You probably won’t use the offset (and if you do need it, it is probably better to code in Java and create your own InputFormat).
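For example, given a tiny two-line file (illustration only):

hello world
bye world

TextInputFormat would hand out the pairs (0, "hello world") and (12, "bye world"): each key is the byte offset at which the line starts, and it carries no real information about the content.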
Now, here’s a catchy part: although Hadoop keeps the default TextInputFormat for reading the file, it will only pass the value (the Text part) to the mapper. The key will be ignored.
(Hm, but I want to use another input format, so how can I make it ignore the key there as well? Just set stream.map.input.ignoreKey to true.)
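Configuration properties like this one are passed with the generic -D option, which has to come before the Streaming-specific options. A sketch, with placeholder paths and assuming the old-API KeyValueTextInputFormat as the alternative input format:

$ hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-D stream.map.input.ignoreKey=true \
-inputformat org.apache.hadoop.mapred.KeyValueTextInputFormat \
-input [input dir] \
-output [output dir] \
-mapper mapper.py \
-reducer reducer.py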
Ok, ok. What are all the parameters for the streaming process?
$ hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input [your input file] \
-output [your output file] \
-inputformat org.apache.hadoop.mapred.TextInputFormat [or another that you wish] \
-mapper [your defined mapper] \
-partitioner org.apache.hadoop.mapred.lib.HashPartitioner [or another that you wish] \
-numReduceTasks [how many reduce tasks you want] \
-reducer org.apache.hadoop.mapred.lib.IdentityReducer [or the reducer application that you want] \
-outputformat org.apache.hadoop.mapred.TextOutputFormat [or another one that you fancy] \
-combiner [the combiner that you want - optional]
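Filled in with the Python word-count scripts sketched earlier (the paths input/books and output/wordcount are made-up examples):

$ hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input input/books \
-output output/wordcount \
-inputformat org.apache.hadoop.mapred.TextInputFormat \
-mapper mapper.py \
-partitioner org.apache.hadoop.mapred.lib.HashPartitioner \
-numReduceTasks 2 \
-reducer reducer.py \
-outputformat org.apache.hadoop.mapred.TextOutputFormat \
-file mapper.py -file reducer.py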
What about the key-value separator?
By default the separator is a tab character (‘\t’). But worry not, one can change it by setting the following attributes:

stream.map.input.field.separator
stream.map.output.field.separator
stream.reduce.input.field.separator
stream.reduce.output.field.separator
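For instance, to make the map output use a semicolon instead of a tab (placeholder paths again):

$ hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-D stream.map.output.field.separator=";" \
-input [input dir] \
-output [output dir] \
-mapper mapper.py \
-reducer reducer.py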
Also, the key from the mapper/reducer may be composed of more than one field. Let’s suppose the mapper emits lines like cat;dog;1 and assume that the field separator is ‘;’. One should be able to select cat;dog as the key and define 1 as the value. This can be done by modifying the following attributes:

stream.num.map.output.key.fields
stream.num.reduce.output.key.fields

The key is taken to be the first n fields; the value will always be the remaining fields.
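On the command line this would be a sketch along these lines, assuming the mapper emits the cat;dog;1 lines mentioned above:

$ hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-D stream.map.output.field.separator=";" \
-D stream.num.map.output.key.fields=2 \
-input [input dir] \
-output [output dir] \
-mapper mapper.py \
-reducer reducer.py

With those settings, everything up to the second semicolon (cat;dog) becomes the key and the rest (1) the value.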
Nothing is true in Hadoop if it cannot be proved by a simple word count or a plain cat example. So, how about we use the classic Unix binaries to build the cat example? I’ll let you define your own input. After that, one just needs to run the following:
$ hadoop jar $HADOOP_HOME/hadoop-streaming.jar -input [your text input] -output [your output] -mapper /bin/cat
It should print the exact content of each file, split across several output files at the end (due to the partitioning scheme, which I will explain in a future post).
If you want to know how many newlines, words and bytes end up in each partition, use the Unix wc tool as the reducer. Just run the following command:

$ hadoop jar $HADOOP_HOME/hadoop-streaming.jar -input [your text input] -output [your output] -mapper /bin/cat -reducer /usr/bin/wc
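Once the job finishes, you can list and inspect the partitions straight from HDFS (placeholder output path):

$ hadoop fs -ls [your output]
$ hadoop fs -cat [your output]/part-00000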
And that’s it. Now you can use your own applications for MapReduce without getting into deep trouble with the Java side of it. Best of luck ;)