Contents
Running a MapReduce Job (Nov 2015)
Tested on:
- Ubuntu 14.04.3 x64
- Hadoop 2.7.1 (Pseudo-Distributed Mode)
I will use some of the examples that come with the Hadoop package.
1. Preparation
2. Pi
3. WordCount
3.1 Download example input data
3.2 Copy local example data to HDFS
3.3 Run the MapReduce job
3.4 Retrieve the job result from HDFS
1. Preparation
Change directory to $HADOOP_INSTALL:
$ su hduser
$ cd /usr/local/hadoop
MapReduce ships with plenty of ready-made programs for us to use.
$ ls share/hadoop/mapreduce/ -gh
-rw-r--r-- 1 hadoop 501K Jun 29 13:15 hadoop-mapreduce-client-app-2.7.1.jar
-rw-r--r-- 1 hadoop 734K Jun 29 13:15 hadoop-mapreduce-client-common-2.7.1.jar
-rw-r--r-- 1 hadoop 1.5M Jun 29 13:15 hadoop-mapreduce-client-core-2.7.1.jar
-rw-r--r-- 1 hadoop 160K Jun 29 13:15 hadoop-mapreduce-client-hs-2.7.1.jar
-rw-r--r-- 1 hadoop 4.1K Jun 29 13:15 hadoop-mapreduce-client-hs-plugins-2.7.1.jar
-rw-r--r-- 1 hadoop 37K  Jun 29 13:15 hadoop-mapreduce-client-jobclient-2.7.1.jar
-rw-r--r-- 1 hadoop 1.5M Jun 29 13:15 hadoop-mapreduce-client-jobclient-2.7.1-tests.jar
-rw-r--r-- 1 hadoop 44K  Jun 29 13:15 hadoop-mapreduce-client-shuffle-2.7.1.jar
-rw-r--r-- 1 hadoop 268K Jun 29 13:15 hadoop-mapreduce-examples-2.7.1.jar
Let's see what the bundled MapReduce examples (hadoop-mapreduce-examples-2.7.1.jar) can do.
$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar
An example program must be given as the first argument.
Valid program names are:
  aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
  aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
  bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
  dbcount: An example job that count the pageview counts from a database.
  distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
  grep: A map/reduce program that counts the matches of a regex in the input.
  join: A job that effects a join over sorted, equally partitioned datasets
  multifilewc: A job that counts words from several files.
  pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
  pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
  randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
  randomwriter: A map/reduce program that writes 10GB of random data per node.
  secondarysort: An example defining a secondary sort to the reduce.
  sort: A map/reduce program that sorts the data written by the random writer.
  sudoku: A sudoku solver.
  teragen: Generate data for the terasort
  terasort: Run the terasort
  teravalidate: Checking results of terasort
  wordcount: A map/reduce program that counts the words in the input files.
  wordmean: A map/reduce program that counts the average length of the words in the input files.
  wordmedian: A map/reduce program that counts the median length of the words in the input files.
  wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.
Ready to go.
2. Pi
pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
Usage: org.apache.hadoop.examples.QuasiMonteCarlo <nMaps> <nSamples>
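The idea behind the pi example can be sketched without Hadoop at all: each of the nMaps map tasks throws nSamples "darts" at a unit square and counts how many land inside the inscribed quarter circle. The real example uses a quasi-random Halton sequence; this minimal single-process Python sketch uses plain pseudo-random points, which is enough to show why more samples give a better estimate. The function name and seed are my own, not part of Hadoop.

```python
import random

def estimate_pi(n_maps, n_samples, seed=0):
    """Toy stand-in for the Hadoop pi example: throw n_maps * n_samples
    random darts at the unit square and count how many land inside the
    quarter circle of radius 1; then pi is approximately 4 * inside / total."""
    rng = random.Random(seed)
    inside = 0
    total = n_maps * n_samples
    for _ in range(total):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / total

print(estimate_pi(2, 5))      # very few samples: a rough estimate
print(estimate_pi(1000, 50))  # many more samples: much closer to 3.14159
```

This mirrors what we will observe below: with only 2 maps of 5 samples the estimate is coarse, and it tightens as nMaps and nSamples grow.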
$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar pi 2 5
Number of Maps  = 2
Samples per Map = 5
15/11/21 22:17:13 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Wrote input for Map #0
Wrote input for Map #1
Starting Job
15/11/21 22:17:15 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
15/11/21 22:17:15 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
15/11/21 22:17:15 INFO input.FileInputFormat: Total input paths to process : 2
15/11/21 22:17:15 INFO mapreduce.JobSubmitter: number of splits:2
15/11/21 22:17:16 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local374679984_0001
15/11/21 22:17:17 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
15/11/21 22:17:17 INFO mapreduce.Job: Running job: job_local374679984_0001
...
File Input Format Counters
    Bytes Read=236
File Output Format Counters
    Bytes Written=97
Job Finished in 3.885 seconds
Estimated value of Pi is 3.60000000000000000000
Pi comes out to 3.6, which isn't very close yet. Let's try increasing nMaps and nSamples.
$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar pi 10 50
...
Job Finished in 7.021 seconds
Estimated value of Pi is 3.16000000000000000000
Now Pi comes out to 3.16, which is closer. Let's push the parameters up a bit more.
$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar pi 20 50
...
Job Finished in 8.837 seconds
Estimated value of Pi is 3.14800000000000000000
OK, I got it!
3. WordCount
We will use the WordCount example job which reads text files and counts how often words occur. The input is text files and the output is text files, each line of which contains a word and the count of how often it occurred, separated by a tab.
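Before running it on Hadoop, the logic of WordCount is easy to picture. This is not Hadoop's actual Java implementation, just a minimal single-process Python sketch of the two phases: the mapper emits a (word, 1) pair for every token, and the reducer sums the counts per word.

```python
from collections import defaultdict

def map_phase(lines):
    """Mapper: emit a (word, 1) pair for every whitespace-separated token."""
    for line in lines:
        for word in line.split():
            yield word, 1

def reduce_phase(pairs):
    """Reducer: sum the emitted counts for each distinct word."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

text = ["to be or not to be"]
print(reduce_phase(map_phase(text)))  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

Hadoop does the same thing, but shuffles the (word, 1) pairs across the cluster so each reducer sees all pairs for its subset of words.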
3.1 Download example input data
We will use three ebooks from Project Gutenberg for this example:
- The Outline of Science, Vol. 1 (of 4) by J. Arthur Thomson, http://www.gutenberg.org/etext/20417
- The Notebooks of Leonardo Da Vinci, http://www.gutenberg.org/etext/5000
- Ulysses by James Joyce, http://www.gutenberg.org/etext/4300
Download each ebook as a text file in Plain Text UTF-8 encoding and store the files in a local temporary directory of choice, for example /tmp/gutenberg.
$ ls -l /tmp/gutenberg/
total 3516
-rw-rw-r-- 1 hduser hadoop  661808 Nov 21 22:51 pg20417.txt
-rw-rw-r-- 1 hduser hadoop 1540094 Nov 21 22:51 pg4300.txt
-rw-rw-r-- 1 hduser hadoop 1396147 Nov 21 22:51 pg5000.txt
3.2 Copy local example data to HDFS
$ hadoop fs -put /tmp/gutenberg/ .
15/11/21 22:57:56 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
$ hadoop fs -ls
15/11/21 22:58:17 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 items
drwxr-xr-x   - hduser supergroup          0 2015-11-21 22:57 gutenberg
$ hadoop fs -ls gutenberg
15/11/21 22:58:26 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 3 items
-rw-r--r--   1 hduser supergroup  661808 2015-11-21 22:57 gutenberg/pg20417.txt
-rw-r--r--   1 hduser supergroup 1540094 2015-11-21 22:57 gutenberg/pg4300.txt
-rw-r--r--   1 hduser supergroup 1396147 2015-11-21 22:57 gutenberg/pg5000.txt
3.3 Run the MapReduce job
Now, we actually run the WordCount example job.
$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar wordcount gutenberg gutenberg-output
15/11/21 23:01:14 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/11/21 23:01:15 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
15/11/21 23:01:15 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
15/11/21 23:01:15 INFO input.FileInputFormat: Total input paths to process : 3
15/11/21 23:01:15 INFO mapreduce.JobSubmitter: number of splits:3
15/11/21 23:01:16 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local347951791_0001
15/11/21 23:01:17 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
15/11/21 23:01:17 INFO mapreduce.Job: Running job: job_local347951791_0001
15/11/21 23:01:17 INFO mapred.LocalJobRunner: OutputCommitter set in config null
15/11/21 23:01:17 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
15/11/21 23:01:17 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
15/11/21 23:01:17 INFO mapred.LocalJobRunner: Waiting for map tasks
15/11/21 23:01:17 INFO mapred.LocalJobRunner: Starting task: attempt_local347951791_0001_m_000000_0
15/11/21 23:01:17 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
15/11/21 23:01:17 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
15/11/21 23:01:17 INFO mapred.MapTask: Processing split: hdfs://localhost:54310/user/hduser/gutenberg/pg4300.txt:0+1540094
15/11/21 23:01:17 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
15/11/21 23:01:17 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
15/11/21 23:01:17 INFO mapred.MapTask: soft limit at 83886080
15/11/21 23:01:17 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
15/11/21 23:01:17 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
15/11/21 23:01:17 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
...
File Input Format Counters
    Bytes Read=3598049
File Output Format Counters
    Bytes Written=883466
Check that the result was successfully stored in the HDFS directory /user/hduser/gutenberg-output:
$ hadoop fs -ls
15/11/21 23:02:46 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 2 items
drwxr-xr-x   - hduser supergroup          0 2015-11-21 22:57 gutenberg
drwxr-xr-x   - hduser supergroup          0 2015-11-21 23:01 gutenberg-output
$ hadoop fs -ls gutenberg-output
15/11/21 23:02:58 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 2 items
-rw-r--r--   1 hduser supergroup      0 2015-11-21 23:01 gutenberg-output/_SUCCESS
-rw-r--r--   1 hduser supergroup 883466 2015-11-21 23:01 gutenberg-output/part-r-00000
You can also print the full result to the console:

$ hadoop fs -cat gutenberg-output/part-r-00000
3.4 Retrieve the job result from HDFS
$ hadoop fs -get gutenberg-output /tmp/
15/11/21 23:07:07 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/11/21 23:07:09 WARN hdfs.DFSClient: DFSInputStream has been closed already
15/11/21 23:07:09 WARN hdfs.DFSClient: DFSInputStream has been closed already
$ ls /tmp/gu*
/tmp/gutenberg:
pg20417.txt  pg4300.txt  pg5000.txt

/tmp/gutenberg-output:
part-r-00000  _SUCCESS
$ head /tmp/gutenberg-output/part-r-00000
"(Lo)cra"	1
"1490	1
"1498,"	1
"35"	1
"40,"	1
"A	2
"AS-IS".	1
"A_	1
"Absoluti	1
"Aesopi"	1
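Each line of part-r-00000 is a word and its count separated by a tab, which makes the output easy to post-process locally. As a small sketch (parse_counts is my own helper name, not part of Hadoop), here is one way to parse those lines and pick out the most frequent words:

```python
from collections import Counter

def parse_counts(lines):
    """Parse WordCount output lines of the form 'word<TAB>count'
    into a Counter mapping word -> count."""
    counts = Counter()
    for line in lines:
        word, _, count = line.rstrip("\n").partition("\t")
        if count:  # skip blank or malformed lines
            counts[word] = int(count)
    return counts

# A few lines in the same format as the head output above:
sample = ['"(Lo)cra"\t1', '"1490\t1', '"A\t2']
counts = parse_counts(sample)
print(counts.most_common(1))  # [('"A', 2)]
```

To run it against the real result, pass an open file instead of the sample list, e.g. parse_counts(open("/tmp/gutenberg-output/part-r-00000", encoding="utf-8")).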
Links
Pi, http://www.bogotobogo.com/Hadoop/BigData_hadoop_Running_MapReduce_Job.php
WordCount, http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/