Running a MapReduce Job (Nov 2015)

Tested on:

  • Ubuntu 14.04.3 x64
  • Hadoop 2.7.1 (Pseudo-Distributed Mode)

I will use one of the example jobs that come with the Hadoop package.

1. Preparation
2. Pi
3. WordCount

3.1 Download example input data
3.2 Copy local example data to HDFS
3.3 Run the MapReduce job
3.4 Retrieve the job result from HDFS

1. Preparation

Change directory to $HADOOP_INSTALL.
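
Assuming $HADOOP_INSTALL points at the Hadoop 2.7.1 installation root (the variable name follows the text; adjust it if your setup uses HADOOP_HOME instead):

```shell
# Move into the Hadoop installation directory
cd "$HADOOP_INSTALL"

# The bundled example jobs live under share/hadoop/mapreduce
ls share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar
```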

MapReduce comes with a lot of commands for us to use.

Let's see what the bundled MapReduce examples (hadoop-mapreduce-examples-2.7.1.jar) can do.
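
Running the examples jar with no arguments makes it print the list of available example programs:

```shell
# With no program name, the jar prints its list of examples
# (pi, wordcount, grep, teragen, terasort, ...)
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar
```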

Ready to go

2. Pi

pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
Usage: org.apache.hadoop.examples.QuasiMonteCarlo <nMaps> <nSamples>
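
A first run might look like this; the values 2 and 10 are illustrative small first-try numbers, not values prescribed by the example:

```shell
# First try: 2 map tasks, 10 samples per map (small numbers, rough estimate)
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar pi 2 10
```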

Pi came out as 3.6, which is not very close yet. Let's adjust nMaps and nSamples.

Pi came out as 3.16, which is closer, so let's tweak it a bit more.
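
More maps and more samples per map tighten the estimate at the cost of a longer job; 16 and 1000 below are illustrative values, not the only correct ones:

```shell
# Larger run: 16 map tasks, 1000 samples each, for a much closer estimate
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar pi 16 1000
```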

OK, I got it!

3. WordCount

We will use the WordCount example job, which reads text files and counts how often each word occurs. The input is a set of text files, and the output is a set of text files in which each line contains a word and the number of times it occurred, separated by a tab.
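
Conceptually, the job computes the same thing as this small local shell pipeline (a sketch of the idea only, not the MapReduce implementation):

```shell
# Split the input into one word per line, then count occurrences of each word.
# printf stands in for the input text files.
printf 'to be or not to be\n' \
  | tr -s ' ' '\n' \
  | sort \
  | uniq -c \
  | awk '{print $2 "\t" $1}'
```

The `sort | uniq -c` step plays the role of the shuffle-and-reduce phase: identical words are brought together and tallied, and the final `awk` prints them in the same word-TAB-count format that WordCount emits.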

3.1 Download example input data

We will use three ebooks from Project Gutenberg for this example:

Download each ebook as a text file in Plain Text UTF-8 encoding and store the files in a local temporary directory of your choice, for example /tmp/gutenberg.
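
Something like the following works; the EBOOK_URL_* variables are placeholders, since the text does not name the three ebooks. Substitute the "Plain Text UTF-8" links of the books you pick on gutenberg.org:

```shell
# Create the local temp directory and fetch the ebooks into it.
# EBOOK_URL_1..3 are placeholders for the Plain Text UTF-8 links.
mkdir -p /tmp/gutenberg
cd /tmp/gutenberg
wget "$EBOOK_URL_1" "$EBOOK_URL_2" "$EBOOK_URL_3"

# Confirm the three .txt files are in place
ls -l /tmp/gutenberg
```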

3.2 Copy local example data to HDFS
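
The local files have to be copied into HDFS before the job can read them. The HDFS paths below follow the ones used later in this tutorial (/user/hduser/gutenberg):

```shell
# Make sure the user's HDFS home directory exists
bin/hdfs dfs -mkdir -p /user/hduser

# Copy the local directory into HDFS
bin/hdfs dfs -copyFromLocal /tmp/gutenberg /user/hduser/gutenberg

# Verify the files arrived
bin/hdfs dfs -ls /user/hduser/gutenberg
```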

3.3 Run the MapReduce job

Now, we actually run the WordCount example job.
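
Using the same examples jar as before, with the input and output paths in HDFS:

```shell
# Run the wordcount example against the ebooks in HDFS.
# The output directory must not exist yet; the job creates it and
# fails if it is already there.
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar \
  wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output
```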

Check if the result is successfully stored in HDFS directory /user/hduser/gutenberg-output:
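
A successful run leaves an empty _SUCCESS marker and the reducer output in the directory:

```shell
# List the output directory; expect _SUCCESS plus part-r-00000
bin/hdfs dfs -ls /user/hduser/gutenberg-output
```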

3.4 Retrieve the job result from HDFS
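
The result can be read directly from HDFS, or merged down to a single local file:

```shell
# Print the first lines of the result directly from HDFS
bin/hdfs dfs -cat /user/hduser/gutenberg-output/part-r-00000 | head -20

# Or merge all output files into one local file for inspection
bin/hdfs dfs -getmerge /user/hduser/gutenberg-output /tmp/gutenberg-output.txt
head -20 /tmp/gutenberg-output.txt
```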

Links

Pi, http://www.bogotobogo.com/Hadoop/BigData_hadoop_Running_MapReduce_Job.php

WordCount, http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/