Hadoop: Running WordCount Locally

This post walks through setting up a Hadoop development environment on Windows and writing a WordCount MapReduce job that runs in the local environment.

What this post covers:

  • 1. Setting up the local environment
  • 2. Writing WordCount and running it locally

Related posts:
1. Installing and Configuring CentOS 7 on VMware 12
2. Setting Up a Hadoop Cluster (Three Nodes)
3. Hadoop: Running WordCount Locally
4. Hadoop: Running WordCount on a Cluster
5. Log Collection with Log4j2 + Flume + HDFS

1. Setting Up the Local Environment

1.1. Unpack Hadoop

Download the desired Hadoop release from the official site:

hadoop-2.7.3.tar.gz

Extract the downloaded archive to any directory, then copy winutils.exe into the hadoop-2.7.3/bin directory (Windows builds of winutils matching each release are commonly obtained from community repositories such as github.com/steveloughran/winutils).

1.2. Configure Environment Variables

Create a new environment variable pointing at the Hadoop extraction path:

HADOOP_HOME: D:\soft\dev\hadoop-2.7.3

Then append to Path:

%HADOOP_HOME%\bin;
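
If the environment variable does not seem to take effect inside the IDE, you can also point Hadoop at the unpacked directory programmatically. This is a minimal sketch, assuming the install path from above; the property must be set before any Hadoop class touches the filesystem, e.g. at the top of main():

// Hypothetical alternative to the HADOOP_HOME environment variable:
// tell Hadoop where winutils.exe lives by setting hadoop.home.dir
// before the job is created.
System.setProperty("hadoop.home.dir", "D:\\soft\\dev\\hadoop-2.7.3");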

2. Writing WordCount

The input file has the following format:

hello java
hello hadoop

The expected output (keys come out sorted):

hadoop 1
hello 2
java 1

The project layout:

[Figure: project directory structure]

2.1. Add the Maven Dependencies

<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>2.7.3</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.7.3</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>2.7.3</version>
    </dependency>
</dependencies>

2.2. Add a log4j.properties Configuration File

log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.Target=System.out
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{ABSOLUTE} %5p %c{1}:%L - %m%n
log4j.rootLogger=INFO, console

2.3. Write the Mapper

Read each line of the input text, split it into words, and emit each word with a count of 1. The output types are Text and IntWritable, e.g. (java, 1).

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.util.StringUtils;

public class WcMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        System.out.println("--->Map-->" + Thread.currentThread().getName());
        // Split the line on spaces and emit (word, 1) for every word.
        String[] words = StringUtils.split(value.toString(), ' ');
        for (String w : words) {
            context.write(new Text(w), new IntWritable(1));
        }
    }
}

2.4. Write the Reducer

Receive the Mapper's output, grouped by key, and sum the counts. The input types are the Mapper's output types, Text and Iterable<IntWritable>, e.g. java → (1, 1); the output types are Text and IntWritable, e.g. (java, 2).

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WcReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        System.out.println("--->Reducer-->" + Thread.currentThread().getName());
        // Sum all counts for this word and emit (word, total).
        int sum = 0;
        for (IntWritable i : values) {
            sum = sum + i.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

2.5. Write the Job

Wire the Mapper and Reducer together into a Job, the unit of execution; computing WordCount is one such Job.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RunWcJob {
    public static void main(String[] args) throws Exception {
        // Create the job instance for this MapReduce program
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        // The main class of this job
        job.setJarByClass(RunWcJob.class);

        // The concrete Mapper and Reducer implementations for this job
        job.setMapperClass(WcMapper.class);
        job.setReducerClass(WcReducer.class);

        // Output data types of the map phase
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // Output data types of the reduce phase, i.e. the final output types of the whole job
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // The input directory to process and the directory the results are written to
        FileInputFormat.setInputPaths(job, "D:\\hadoop\\input");
        FileOutputFormat.setOutputPath(job, new Path("D:\\hadoop\\output"));

        // Submit the job and wait for it to finish
        boolean b = job.waitForCompletion(true);

        System.exit(b ? 0 : 1);
    }
}
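
Hardcoded paths are fine for a local smoke test, but a variant that reads them from the command line makes the same class reusable later on a cluster. A minimal sketch, assuming args[0] is the input directory and args[1] the output directory:

// Hypothetical variant of the two path lines above: take the
// directories from the command line instead of hardcoding them.
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));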

Create words.txt in the local folder D:\hadoop\input with the input given above, keep D:\hadoop\output as the output folder, and run the program directly. One caveat: the output directory must not already exist, or FileOutputFormat aborts the job; delete it between runs, either by hand or programmatically as in the sketch below.
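
A minimal cleanup sketch, assuming the conf and paths used in RunWcJob (requires an extra import of org.apache.hadoop.fs.FileSystem); it removes a stale output directory before the job is submitted:

// Hypothetical pre-flight cleanup, placed in main() before
// job.waitForCompletion(): delete the output directory if a
// previous run left it behind.
FileSystem fs = FileSystem.get(conf);
Path output = new Path("D:\\hadoop\\output");
if (fs.exists(output)) {
    fs.delete(output, true); // recursive delete
}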

Errors you may run into:

  • java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
    Cause:
    winutils.exe was not copied into hadoop-2.7.3/bin, or the HADOOP_HOME environment variable is not set, or it is set but has not taken effect.
    Fix:
    1. Download winutils.exe and copy it into hadoop-2.7.3/bin.
    2. Check that the environment variable is configured.
    3. If it is already configured, restart IDEA or the machine; the variable may simply not have been picked up yet.

  • Exception in thread "main" java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
    Cause:
    The JVM cannot bind the native method NativeIO$Windows.access0, typically because hadoop.dll is missing from the bin directory or does not match the Hadoop version or JVM bitness.
    Fix:
    Copy the org.apache.hadoop.io.nativeio.NativeIO source into your own project (same package) and rewrite the access method so it no longer calls into native code; see the sketch after this list.
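The override works because classes on your project's classpath shadow the identically named ones inside the Hadoop jars. A minimal sketch of the only change needed, assuming you copied the full NativeIO.java from the Hadoop 2.7.3 sources into src/main/java/org/apache/hadoop/io/nativeio/; everything else in the copied file stays as-is:

// Inside the copied NativeIO.java, in the Windows inner class:
public static boolean access(String path, AccessRight desiredAccess)
        throws IOException {
    // Original body: return access0(path, desiredAccess.accessRight());
    // Skip the native permission check, which only local runs hit.
    return true;
}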

2.6. Run Results

If running the job produces output like the following, it executed correctly.

14:40:01,813  WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14:40:02,058  INFO deprecation:1173 - session.id is deprecated. Instead, use dfs.metrics.session-id
14:40:02,060  INFO JvmMetrics:76 - Initializing JVM Metrics with processName=JobTracker, sessionId=
14:40:02,355  WARN JobResourceUploader:64 - Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
14:40:02,387  WARN JobResourceUploader:171 - No job jar file set.  User classes may not be found. See Job or Job#setJar(String).
14:40:02,422  INFO FileInputFormat:283 - Total input paths to process : 1
14:40:02,685  INFO JobSubmitter:198 - number of splits:1
14:40:02,837  INFO JobSubmitter:287 - Submitting tokens for job: job_local866013445_0001
14:40:03,035  INFO Job:1294 - The url to track the job: http://localhost:8080/
14:40:03,042  INFO Job:1339 - Running job: job_local866013445_0001
14:40:03,044  INFO LocalJobRunner:471 - OutputCommitter set in config null
14:40:03,110  INFO FileOutputCommitter:108 - File Output Committer Algorithm version is 1
14:40:03,115  INFO LocalJobRunner:489 - OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
14:40:03,211  INFO LocalJobRunner:448 - Waiting for map tasks
14:40:03,211  INFO LocalJobRunner:224 - Starting task: attempt_local866013445_0001_m_000000_0
14:40:03,238  INFO FileOutputCommitter:108 - File Output Committer Algorithm version is 1
14:40:03,383  INFO ProcfsBasedProcessTree:192 - ProcfsBasedProcessTree currently is supported only on Linux.
14:40:03,439  INFO Task:612 -  Using ResourceCalculatorProcessTree : org.apache.hadoop.yarn.util.WindowsBasedProcessTree@4d11cc8c
14:40:03,445  INFO MapTask:756 - Processing split: file:/D:/hadoop/input/words.txt:0+24
14:40:03,509  INFO MapTask:1205 - (EQUATOR) 0 kvi 26214396(104857584)
14:40:03,509  INFO MapTask:998 - mapreduce.task.io.sort.mb: 100
14:40:03,509  INFO MapTask:999 - soft limit at 83886080
14:40:03,509  INFO MapTask:1000 - bufstart = 0; bufvoid = 104857600
14:40:03,510  INFO MapTask:1001 - kvstart = 26214396; length = 6553600
14:40:03,515  INFO MapTask:403 - Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
--->Map-->LocalJobRunner Map Task Executor #0
--->Map-->LocalJobRunner Map Task Executor #0
14:40:03,522  INFO LocalJobRunner:591 - 
14:40:03,522  INFO MapTask:1460 - Starting flush of map output
14:40:03,522  INFO MapTask:1482 - Spilling map output
14:40:03,522  INFO MapTask:1483 - bufstart = 0; bufend = 40; bufvoid = 104857600
14:40:03,522  INFO MapTask:1485 - kvstart = 26214396(104857584); kvend = 26214384(104857536); length = 13/6553600
14:40:03,573  INFO MapTask:1667 - Finished spill 0
14:40:03,583  INFO Task:1038 - Task:attempt_local866013445_0001_m_000000_0 is done. And is in the process of committing
14:40:03,589  INFO LocalJobRunner:591 - map
14:40:03,589  INFO Task:1158 - Task 'attempt_local866013445_0001_m_000000_0' done.
14:40:03,589  INFO LocalJobRunner:249 - Finishing task: attempt_local866013445_0001_m_000000_0
14:40:03,590  INFO LocalJobRunner:456 - map task executor complete.
14:40:03,593  INFO LocalJobRunner:448 - Waiting for reduce tasks
14:40:03,593  INFO LocalJobRunner:302 - Starting task: attempt_local866013445_0001_r_000000_0
14:40:03,597  INFO FileOutputCommitter:108 - File Output Committer Algorithm version is 1
14:40:03,597  INFO ProcfsBasedProcessTree:192 - ProcfsBasedProcessTree currently is supported only on Linux.
14:40:03,627  INFO Task:612 -  Using ResourceCalculatorProcessTree : org.apache.hadoop.yarn.util.WindowsBasedProcessTree@2ae5eb6
14:40:03,658  INFO ReduceTask:362 - Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@72ddfb0b
14:40:03,686  INFO MergeManagerImpl:197 - MergerManager: memoryLimit=1314232704, maxSingleShuffleLimit=328558176, mergeThreshold=867393600, ioSortFactor=10, memToMemMergeOutputsThreshold=10
14:40:03,688  INFO EventFetcher:61 - attempt_local866013445_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
14:40:03,720  INFO LocalFetcher:144 - localfetcher#1 about to shuffle output of map attempt_local866013445_0001_m_000000_0 decomp: 50 len: 54 to MEMORY
14:40:03,729  INFO InMemoryMapOutput:100 - Read 50 bytes from map-output for attempt_local866013445_0001_m_000000_0
14:40:03,730  INFO MergeManagerImpl:315 - closeInMemoryFile -> map-output of size: 50, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->50
14:40:03,731  INFO EventFetcher:76 - EventFetcher is interrupted.. Returning
14:40:03,731  INFO LocalJobRunner:591 - 1 / 1 copied.
14:40:03,731  INFO MergeManagerImpl:687 - finalMerge called with 1 in-memory map-outputs and 0 on-disk map-outputs
14:40:03,744  INFO Merger:606 - Merging 1 sorted segments
14:40:03,744  INFO Merger:705 - Down to the last merge-pass, with 1 segments left of total size: 41 bytes
14:40:03,746  INFO MergeManagerImpl:754 - Merged 1 segments, 50 bytes to disk to satisfy reduce memory limit
14:40:03,748  INFO MergeManagerImpl:784 - Merging 1 files, 54 bytes from disk
14:40:03,748  INFO MergeManagerImpl:799 - Merging 0 segments, 0 bytes from memory into reduce
14:40:03,748  INFO Merger:606 - Merging 1 sorted segments
14:40:03,749  INFO Merger:705 - Down to the last merge-pass, with 1 segments left of total size: 41 bytes
14:40:03,749  INFO LocalJobRunner:591 - 1 / 1 copied.
14:40:03,847  INFO deprecation:1173 - mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
--->Reducer-->pool-3-thread-1
--->Reducer-->pool-3-thread-1
--->Reducer-->pool-3-thread-1
14:40:03,867  INFO Task:1038 - Task:attempt_local866013445_0001_r_000000_0 is done. And is in the process of committing
14:40:03,868  INFO LocalJobRunner:591 - 1 / 1 copied.
14:40:03,868  INFO Task:1199 - Task attempt_local866013445_0001_r_000000_0 is allowed to commit now
14:40:03,873  INFO FileOutputCommitter:535 - Saved output of task 'attempt_local866013445_0001_r_000000_0' to file:/D:/hadoop/output/_temporary/0/task_local866013445_0001_r_000000
14:40:03,877  INFO LocalJobRunner:591 - reduce > reduce
14:40:03,877  INFO Task:1158 - Task 'attempt_local866013445_0001_r_000000_0' done.
14:40:03,877  INFO LocalJobRunner:325 - Finishing task: attempt_local866013445_0001_r_000000_0
14:40:03,877  INFO LocalJobRunner:456 - reduce task executor complete.
14:40:04,044  INFO Job:1360 - Job job_local866013445_0001 running in uber mode : false
14:40:04,045  INFO Job:1367 -  map 100% reduce 100%
14:40:04,045  INFO Job:1378 - Job job_local866013445_0001 completed successfully
14:40:04,050  INFO Job:1385 - Counters: 30
    File System Counters
        FILE: Number of bytes read=488
        FILE: Number of bytes written=566782
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
    Map-Reduce Framework
        Map input records=2
        Map output records=4
        Map output bytes=40
        Map output materialized bytes=54
        Input split bytes=96
        Combine input records=0
        Combine output records=0
        Reduce input groups=3
        Reduce shuffle bytes=54
        Reduce input records=4
        Reduce output records=3
        Spilled Records=8
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=7
        Total committed heap usage (bytes)=498073600
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters 
        Bytes Read=24
    File Output Format Counters 
        Bytes Written=36

Process finished with exit code 0

The results are written to D:\hadoop\output:

[Figure: output directory contents]

part-r-00000 contains the following (keys are sorted, which is why hadoop comes first):

hadoop  1
hello   2
java    1

The next post covers running WordCount on a cluster: Hadoop: Running WordCount on a Cluster.
