Prerequisites
JDK
Download Oracle JDK 1.8 (jdk-8u251-linux-x64.tar.gz).
Configure the JAVA_HOME, CLASSPATH, and PATH environment variables.
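For example, append something like the following to /etc/profile (or ~/.bashrc) and reload it with source /etc/profile. The install path /opt/middleware/jdk1.8.0_251 is only an assumed example; point it at wherever you actually extracted the JDK.
# JDK environment variables (install path below is an assumed example)
export JAVA_HOME=/opt/middleware/jdk1.8.0_251
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH=$JAVA_HOME/bin:$PATH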
Hadoop Installation
Download Hadoop
wget https://mirrors.tuna.tsinghua.edu.cn/apache/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz
Extract it to /opt/middleware.
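A minimal sketch of the extract-and-configure step. The HADOOP_HOME export is optional but the commands below assume it; Hadoop itself also needs JAVA_HOME, which can alternatively be set in etc/hadoop/hadoop-env.sh.
# Extract the tarball and point HADOOP_HOME at the result
mkdir -p /opt/middleware
tar -zxvf hadoop-3.2.1.tar.gz -C /opt/middleware
export HADOOP_HOME=/opt/middleware/hadoop-3.2.1
export PATH=$HADOOP_HOME/bin:$PATH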
String search (grep) example
cd $HADOOP_HOME
mkdir input
cp etc/hadoop/*.xml input
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar grep input output 'dfs[a-z.]+'
cat output/*   # view the results
output is the job's output directory. It must not already exist; if it does, the job aborts with a "directory already exists" error. (To re-run the job, delete the old output directory first; see the sketch after the _SUCCESS note below.)
[root@bogon hadoop-3.2.1]# ll output
total 4
-rw-r--r--. 1 root root 11 Jun 18 00:23 part-r-00000
-rw-r--r--. 1 root root 0 Jun 18 00:23 _SUCCESS
_SUCCESS is just a marker file indicating that the job completed successfully.
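To re-run the grep example, remove the previous output directory first; in this local (standalone) mode it is an ordinary directory on the local filesystem:
# Delete the old results, then run the same job again
rm -rf output
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar grep input output 'dfs[a-z.]+'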
WordCount example
[root@bogon hadoop-3.2.1]# mkdir -p study/wcinput
[root@bogon hadoop-3.2.1]# vim study/wcinput/wc.input
Enter the following content (any text will do):
The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
[root@bogon hadoop-3.2.1]# hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar wordcount study/wcinput/ study/wcoutput
2020-06-18 02:52:27,056 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2020-06-18 02:52:27,189 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2020-06-18 02:52:27,189 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2020-06-18 02:52:27,473 INFO input.FileInputFormat: Total input files to process : 1
2020-06-18 02:52:27,503 INFO mapreduce.JobSubmitter: number of splits:1
2020-06-18 02:52:27,621 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local329843332_0001
2020-06-18 02:52:27,621 INFO mapreduce.JobSubmitter: Executing with tokens: []
2020-06-18 02:52:27,720 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
2020-06-18 02:52:27,721 INFO mapreduce.Job: Running job: job_local329843332_0001
2020-06-18 02:52:27,725 INFO mapred.LocalJobRunner: OutputCommitter set in config null
2020-06-18 02:52:27,731 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2
2020-06-18 02:52:27,731 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
2020-06-18 02:52:27,732 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
2020-06-18 02:52:27,758 INFO mapred.LocalJobRunner: Waiting for map tasks
2020-06-18 02:52:27,759 INFO mapred.LocalJobRunner: Starting task: attempt_local329843332_0001_m_000000_0
2020-06-18 02:52:27,782 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2
2020-06-18 02:52:27,782 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
2020-06-18 02:52:27,796 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
2020-06-18 02:52:27,800 INFO mapred.MapTask: Processing split: file:/opt/middleware/hadoop-3.2.1/study/wcinput/wc.input:0+661
2020-06-18 02:52:27,826 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
2020-06-18 02:52:27,826 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
2020-06-18 02:52:27,826 INFO mapred.MapTask: soft limit at 83886080
2020-06-18 02:52:27,826 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
2020-06-18 02:52:27,826 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
2020-06-18 02:52:27,830 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
2020-06-18 02:52:27,838 INFO mapred.LocalJobRunner:
2020-06-18 02:52:27,838 INFO mapred.MapTask: Starting flush of map output
2020-06-18 02:52:27,838 INFO mapred.MapTask: Spilling map output
2020-06-18 02:52:27,838 INFO mapred.MapTask: bufstart = 0; bufend = 1057; bufvoid = 104857600
2020-06-18 02:52:27,838 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214004(104856016); length = 393/6553600
2020-06-18 02:52:27,853 INFO mapred.MapTask: Finished spill 0
2020-06-18 02:52:27,871 INFO mapred.Task: Task:attempt_local329843332_0001_m_000000_0 is done. And is in the process of committing
2020-06-18 02:52:27,874 INFO mapred.LocalJobRunner: map
2020-06-18 02:52:27,874 INFO mapred.Task: Task 'attempt_local329843332_0001_m_000000_0' done.
2020-06-18 02:52:27,880 INFO mapred.Task: Final Counters for attempt_local329843332_0001_m_000000_0: Counters: 18
File System Counters
FILE: Number of bytes read=317372
FILE: Number of bytes written=837373
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
Map-Reduce Framework
Map input records=2
Map output records=99
Map output bytes=1057
Map output materialized bytes=1014
Input split bytes=121
Combine input records=99
Combine output records=75
Spilled Records=75
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=7
Total committed heap usage (bytes)=212860928
File Input Format Counters
Bytes Read=661
2020-06-18 02:52:27,880 INFO mapred.LocalJobRunner: Finishing task: attempt_local329843332_0001_m_000000_0
2020-06-18 02:52:27,881 INFO mapred.LocalJobRunner: map task executor complete.
2020-06-18 02:52:27,886 INFO mapred.LocalJobRunner: Waiting for reduce tasks
2020-06-18 02:52:27,887 INFO mapred.LocalJobRunner: Starting task: attempt_local329843332_0001_r_000000_0
2020-06-18 02:52:27,896 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2
2020-06-18 02:52:27,896 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
2020-06-18 02:52:27,897 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
2020-06-18 02:52:27,899 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@2d8d2e2
2020-06-18 02:52:27,900 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2020-06-18 02:52:27,922 INFO reduce.MergeManagerImpl: MergerManager: memoryLimit=617296704, maxSingleShuffleLimit=154324176, mergeThreshold=407415840, ioSortFactor=10, memToMemMergeOutputsThreshold=10
2020-06-18 02:52:27,926 INFO reduce.EventFetcher: attempt_local329843332_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
2020-06-18 02:52:27,949 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local329843332_0001_m_000000_0 decomp: 1010 len: 1014 to MEMORY
2020-06-18 02:52:27,953 INFO reduce.InMemoryMapOutput: Read 1010 bytes from map-output for attempt_local329843332_0001_m_000000_0
2020-06-18 02:52:27,955 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 1010, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->1010
2020-06-18 02:52:27,957 INFO reduce.EventFetcher: EventFetcher is interrupted.. Returning
2020-06-18 02:52:27,959 INFO mapred.LocalJobRunner: 1 / 1 copied.
2020-06-18 02:52:27,959 INFO reduce.MergeManagerImpl: finalMerge called with 1 in-memory map-outputs and 0 on-disk map-outputs
2020-06-18 02:52:27,965 INFO mapred.Merger: Merging 1 sorted segments
2020-06-18 02:52:27,965 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 1001 bytes
2020-06-18 02:52:27,967 INFO reduce.MergeManagerImpl: Merged 1 segments, 1010 bytes to disk to satisfy reduce memory limit
2020-06-18 02:52:27,967 INFO reduce.MergeManagerImpl: Merging 1 files, 1014 bytes from disk
2020-06-18 02:52:27,967 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce
2020-06-18 02:52:27,968 INFO mapred.Merger: Merging 1 sorted segments
2020-06-18 02:52:27,977 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 1001 bytes
2020-06-18 02:52:27,977 INFO mapred.LocalJobRunner: 1 / 1 copied.
2020-06-18 02:52:27,980 INFO Configuration.deprecation: mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
2020-06-18 02:52:27,985 INFO mapred.Task: Task:attempt_local329843332_0001_r_000000_0 is done. And is in the process of committing
2020-06-18 02:52:27,985 INFO mapred.LocalJobRunner: 1 / 1 copied.
2020-06-18 02:52:27,986 INFO mapred.Task: Task attempt_local329843332_0001_r_000000_0 is allowed to commit now
2020-06-18 02:52:27,987 INFO output.FileOutputCommitter: Saved output of task 'attempt_local329843332_0001_r_000000_0' to file:/opt/middleware/hadoop-3.2.1/study/wcoutput
2020-06-18 02:52:27,988 INFO mapred.LocalJobRunner: reduce > reduce
2020-06-18 02:52:27,988 INFO mapred.Task: Task 'attempt_local329843332_0001_r_000000_0' done.
2020-06-18 02:52:27,989 INFO mapred.Task: Final Counters for attempt_local329843332_0001_r_000000_0: Counters: 24
File System Counters
FILE: Number of bytes read=319432
FILE: Number of bytes written=839111
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
Map-Reduce Framework
Combine input records=0
Combine output records=0
Reduce input groups=75
Reduce shuffle bytes=1014
Reduce input records=75
Reduce output records=75
Spilled Records=75
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=0
Total committed heap usage (bytes)=212860928
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Output Format Counters
Bytes Written=724
2020-06-18 02:52:27,990 INFO mapred.LocalJobRunner: Finishing task: attempt_local329843332_0001_r_000000_0
2020-06-18 02:52:27,990 INFO mapred.LocalJobRunner: reduce task executor complete.
2020-06-18 02:52:28,725 INFO mapreduce.Job: Job job_local329843332_0001 running in uber mode : false
2020-06-18 02:52:28,728 INFO mapreduce.Job: map 100% reduce 100%
2020-06-18 02:52:28,729 INFO mapreduce.Job: Job job_local329843332_0001 completed successfully
2020-06-18 02:52:28,735 INFO mapreduce.Job: Counters: 30
File System Counters
FILE: Number of bytes read=636804
FILE: Number of bytes written=1676484
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
Map-Reduce Framework
Map input records=2
Map output records=99
Map output bytes=1057
Map output materialized bytes=1014
Input split bytes=121
Combine input records=99
Combine output records=75
Reduce input groups=75
Reduce shuffle bytes=1014
Reduce input records=75
Reduce output records=75
Spilled Records=150
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=7
Total committed heap usage (bytes)=425721856
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=661
File Output Format Counters
Bytes Written=724
[root@bogon hadoop-3.2.1]# cat study/wcoutput/*
Apache 1
Apache™ 1
Hadoop 1
Hadoop® 1
It 1
Rather 1
The 2
a 3
across 1
allows 1
and 2
application 1
at 1
be 1
cluster 1
clusters 1
computation 1
computers 1
computers, 1
computing. 1
data 1
deliver 1
delivering 1
designed 2
detect 1
develops 1
distributed 2
each 2
failures 1
failures. 1
for 2
framework 1
from 1
handle 1
hardware 1
high-availability, 1
highly-available 1
is 3
itself 1
large 1
layer, 1
library 2
local 1
machines, 1
may 1
models. 1
of 6
offering 1
on 2
open-source 1
processing 1
programming 1
project 1
prone 1
reliable, 1
rely 1
scalable, 1
scale 1
servers 1
service 1
sets 1
simple 1
single 1
so 1
software 2
storage. 1
than 1
that 1
the 3
thousands 1
to 5
top 1
up 1
using 1
which 1
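Note that the example WordCount splits on whitespace only, so 'The' and 'the', or 'computers' and 'computers,', are counted as different words. If you want case-insensitive counts with ASCII punctuation stripped, one rough option is to normalize the input before running the job; the wcinput_clean and wcoutput_clean names below are just illustrative:
# Lowercase the text and drop ASCII punctuation, then count again
mkdir -p study/wcinput_clean
tr '[:upper:]' '[:lower:]' < study/wcinput/wc.input | tr -d '[:punct:]' > study/wcinput_clean/wc.input
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar wordcount study/wcinput_clean study/wcoutput_clean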