Hadoop: Running WordCount Locally

This post walks through setting up a Hadoop development environment on Windows and writing a WordCount MapReduce job that runs in the local environment.

What this post covers:

  • 1. Setting up the local environment
  • 2. Writing WordCount and running it locally

Related posts:
1. Installing and Configuring CentOS 7 on VMware 12
2. Setting Up a Hadoop Cluster (Three Nodes)
3. Hadoop: Running WordCount Locally
4. Hadoop: Running WordCount on a Cluster
5. Log Collection with Log4j2 + Flume + HDFS

1. Setting Up the Local Environment

1.1. Unpack Hadoop

Download the desired Hadoop release from the official site:

hadoop-2.7.3.tar.gz

Extract the downloaded archive to any directory, then copy winutils.exe into the hadoop-2.7.3/bin directory (Windows builds of winutils matching each release are commonly obtained from community repositories such as github.com/steveloughran/winutils).

1.2. Configure Environment Variables

Create a new environment variable pointing at the Hadoop extraction path:

HADOOP_HOME: D:\soft\dev\hadoop-2.7.3

Then append to Path:

%HADOOP_HOME%\bin;
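
If the environment variable does not seem to take effect inside the IDE, you can also point Hadoop at the unpacked directory programmatically. This is a minimal sketch, assuming the install path from above; the property must be set before any Hadoop class touches the filesystem, e.g. at the top of main():

// Hypothetical alternative to the HADOOP_HOME environment variable:
// tell Hadoop where winutils.exe lives by setting hadoop.home.dir
// before the job is created.
System.setProperty("hadoop.home.dir", "D:\\soft\\dev\\hadoop-2.7.3");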

2. Writing WordCount

The input file has the following format:

hello java
hello hadoop

The expected output (keys come out sorted):

hadoop 1
hello 2
java 1

The project layout:

[Figure: project directory structure]

2.1. Add the Maven Dependencies

<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>2.7.3</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.7.3</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>2.7.3</version>
    </dependency>
</dependencies>

2.2. Add a log4j.properties Configuration File

log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.Target=System.out
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{ABSOLUTE} %5p %c{1}:%L - %m%n
log4j.rootLogger=INFO, console

2.3. Write the Mapper

Read each line of the input text, split it into words, and emit each word with a count of 1. The output types are Text and IntWritable, e.g. (java, 1).

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.util.StringUtils;

public class WcMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        System.out.println("--->Map-->" + Thread.currentThread().getName());
        // Split the line on spaces and emit (word, 1) for every word.
        String[] words = StringUtils.split(value.toString(), ' ');
        for (String w : words) {
            context.write(new Text(w), new IntWritable(1));
        }
    }
}

2.4. Write the Reducer

Receive the Mapper's output, grouped by key, and sum the counts. The input types are the Mapper's output types, Text and Iterable<IntWritable>, e.g. java → (1, 1); the output types are Text and IntWritable, e.g. (java, 2).

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WcReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        System.out.println("--->Reducer-->" + Thread.currentThread().getName());
        // Sum all counts for this word and emit (word, total).
        int sum = 0;
        for (IntWritable i : values) {
            sum = sum + i.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

2.5. Write the Job

Wire the Mapper and Reducer together into a Job, the unit of execution; computing WordCount is one such Job.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RunWcJob {
    public static void main(String[] args) throws Exception {
        // Create the job instance for this MapReduce program
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        // The main class of this job
        job.setJarByClass(RunWcJob.class);

        // The concrete Mapper and Reducer implementations for this job
        job.setMapperClass(WcMapper.class);
        job.setReducerClass(WcReducer.class);

        // Output data types of the map phase
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // Output data types of the reduce phase, i.e. the final output types of the whole job
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // The input directory to process and the directory the results are written to
        FileInputFormat.setInputPaths(job, "D:\\hadoop\\input");
        FileOutputFormat.setOutputPath(job, new Path("D:\\hadoop\\output"));

        // Submit the job and wait for it to finish
        boolean b = job.waitForCompletion(true);

        System.exit(b ? 0 : 1);
    }
}
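
Hardcoded paths are fine for a local smoke test, but a variant that reads them from the command line makes the same class reusable later on a cluster. A minimal sketch, assuming args[0] is the input directory and args[1] the output directory:

// Hypothetical variant of the two path lines above: take the
// directories from the command line instead of hardcoding them.
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));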

Create words.txt in the local folder D:\hadoop\input with the input given above, keep D:\hadoop\output as the output folder, and run the program directly. One caveat: the output directory must not already exist, or FileOutputFormat aborts the job; delete it between runs, either by hand or programmatically as in the sketch below.
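
A minimal cleanup sketch, assuming the conf and paths used in RunWcJob (requires an extra import of org.apache.hadoop.fs.FileSystem); it removes a stale output directory before the job is submitted:

// Hypothetical pre-flight cleanup, placed in main() before
// job.waitForCompletion(): delete the output directory if a
// previous run left it behind.
FileSystem fs = FileSystem.get(conf);
Path output = new Path("D:\\hadoop\\output");
if (fs.exists(output)) {
    fs.delete(output, true); // recursive delete
}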

Errors you may run into:

  • java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
    Cause:
    winutils.exe was not copied into hadoop-2.7.3/bin, or the HADOOP_HOME environment variable is not set, or it is set but has not taken effect.
    Fix:
    1. Download winutils.exe and copy it into hadoop-2.7.3/bin.
    2. Check that the environment variable is configured.
    3. If it is already configured, restart IDEA or the machine; the variable may simply not have been picked up yet.

  • Exception in thread "main" java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
    Cause:
    The JVM cannot bind the native method NativeIO$Windows.access0, typically because hadoop.dll is missing from the bin directory or does not match the Hadoop version or JVM bitness.
    Fix:
    Copy the org.apache.hadoop.io.nativeio.NativeIO source into your own project (same package) and rewrite the access method so it no longer calls into native code; see the sketch after this list.
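The override works because classes on your project's classpath shadow the identically named ones inside the Hadoop jars. A minimal sketch of the only change needed, assuming you copied the full NativeIO.java from the Hadoop 2.7.3 sources into src/main/java/org/apache/hadoop/io/nativeio/; everything else in the copied file stays as-is:

// Inside the copied NativeIO.java, in the Windows inner class:
public static boolean access(String path, AccessRight desiredAccess)
        throws IOException {
    // Original body: return access0(path, desiredAccess.accessRight());
    // Skip the native permission check, which only local runs hit.
    return true;
}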

2.6. Run Results

If running the job produces output like the following, it executed correctly.

14:40:01,813  WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14:40:02,058  INFO deprecation:1173 - session.id is deprecated. Instead, use dfs.metrics.session-id
14:40:02,060  INFO JvmMetrics:76 - Initializing JVM Metrics with processName=JobTracker, sessionId=
14:40:02,355  WARN JobResourceUploader:64 - Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
14:40:02,387  WARN JobResourceUploader:171 - No job jar file set.  User classes may not be found. See Job or Job#setJar(String).
14:40:02,422  INFO FileInputFormat:283 - Total input paths to process : 1
14:40:02,685  INFO JobSubmitter:198 - number of splits:1
14:40:02,837  INFO JobSubmitter:287 - Submitting tokens for job: job_local866013445_0001
14:40:03,035  INFO Job:1294 - The url to track the job: http://localhost:8080/
14:40:03,042  INFO Job:1339 - Running job: job_local866013445_0001
14:40:03,044  INFO LocalJobRunner:471 - OutputCommitter set in config null
14:40:03,110  INFO FileOutputCommitter:108 - File Output Committer Algorithm version is 1
14:40:03,115  INFO LocalJobRunner:489 - OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
14:40:03,211  INFO LocalJobRunner:448 - Waiting for map tasks
14:40:03,211  INFO LocalJobRunner:224 - Starting task: attempt_local866013445_0001_m_000000_0
14:40:03,238  INFO FileOutputCommitter:108 - File Output Committer Algorithm version is 1
14:40:03,383  INFO ProcfsBasedProcessTree:192 - ProcfsBasedProcessTree currently is supported only on Linux.
14:40:03,439  INFO Task:612 -  Using ResourceCalculatorProcessTree : org.apache.hadoop.yarn.util.WindowsBasedProcessTree@4d11cc8c
14:40:03,445  INFO MapTask:756 - Processing split: file:/D:/hadoop/input/words.txt:0+24
14:40:03,509  INFO MapTask:1205 - (EQUATOR) 0 kvi 26214396(104857584)
14:40:03,509  INFO MapTask:998 - mapreduce.task.io.sort.mb: 100
14:40:03,509  INFO MapTask:999 - soft limit at 83886080
14:40:03,509  INFO MapTask:1000 - bufstart = 0; bufvoid = 104857600
14:40:03,510  INFO MapTask:1001 - kvstart = 26214396; length = 6553600
14:40:03,515  INFO MapTask:403 - Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
--->Map-->LocalJobRunner Map Task Executor #0
--->Map-->LocalJobRunner Map Task Executor #0
14:40:03,522  INFO LocalJobRunner:591 - 
14:40:03,522  INFO MapTask:1460 - Starting flush of map output
14:40:03,522  INFO MapTask:1482 - Spilling map output
14:40:03,522  INFO MapTask:1483 - bufstart = 0; bufend = 40; bufvoid = 104857600
14:40:03,522  INFO MapTask:1485 - kvstart = 26214396(104857584); kvend = 26214384(104857536); length = 13/6553600
14:40:03,573  INFO MapTask:1667 - Finished spill 0
14:40:03,583  INFO Task:1038 - Task:attempt_local866013445_0001_m_000000_0 is done. And is in the process of committing
14:40:03,589  INFO LocalJobRunner:591 - map
14:40:03,589  INFO Task:1158 - Task 'attempt_local866013445_0001_m_000000_0' done.
14:40:03,589  INFO LocalJobRunner:249 - Finishing task: attempt_local866013445_0001_m_000000_0
14:40:03,590  INFO LocalJobRunner:456 - map task executor complete.
14:40:03,593  INFO LocalJobRunner:448 - Waiting for reduce tasks
14:40:03,593  INFO LocalJobRunner:302 - Starting task: attempt_local866013445_0001_r_000000_0
14:40:03,597  INFO FileOutputCommitter:108 - File Output Committer Algorithm version is 1
14:40:03,597  INFO ProcfsBasedProcessTree:192 - ProcfsBasedProcessTree currently is supported only on Linux.
14:40:03,627  INFO Task:612 -  Using ResourceCalculatorProcessTree : org.apache.hadoop.yarn.util.WindowsBasedProcessTree@2ae5eb6
14:40:03,658  INFO ReduceTask:362 - Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@72ddfb0b
14:40:03,686  INFO MergeManagerImpl:197 - MergerManager: memoryLimit=1314232704, maxSingleShuffleLimit=328558176, mergeThreshold=867393600, ioSortFactor=10, memToMemMergeOutputsThreshold=10
14:40:03,688  INFO EventFetcher:61 - attempt_local866013445_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
14:40:03,720  INFO LocalFetcher:144 - localfetcher#1 about to shuffle output of map attempt_local866013445_0001_m_000000_0 decomp: 50 len: 54 to MEMORY
14:40:03,729  INFO InMemoryMapOutput:100 - Read 50 bytes from map-output for attempt_local866013445_0001_m_000000_0
14:40:03,730  INFO MergeManagerImpl:315 - closeInMemoryFile -> map-output of size: 50, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->50
14:40:03,731  INFO EventFetcher:76 - EventFetcher is interrupted.. Returning
14:40:03,731  INFO LocalJobRunner:591 - 1 / 1 copied.
14:40:03,731  INFO MergeManagerImpl:687 - finalMerge called with 1 in-memory map-outputs and 0 on-disk map-outputs
14:40:03,744  INFO Merger:606 - Merging 1 sorted segments
14:40:03,744  INFO Merger:705 - Down to the last merge-pass, with 1 segments left of total size: 41 bytes
14:40:03,746  INFO MergeManagerImpl:754 - Merged 1 segments, 50 bytes to disk to satisfy reduce memory limit
14:40:03,748  INFO MergeManagerImpl:784 - Merging 1 files, 54 bytes from disk
14:40:03,748  INFO MergeManagerImpl:799 - Merging 0 segments, 0 bytes from memory into reduce
14:40:03,748  INFO Merger:606 - Merging 1 sorted segments
14:40:03,749  INFO Merger:705 - Down to the last merge-pass, with 1 segments left of total size: 41 bytes
14:40:03,749  INFO LocalJobRunner:591 - 1 / 1 copied.
14:40:03,847  INFO deprecation:1173 - mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
--->Reducer-->pool-3-thread-1
--->Reducer-->pool-3-thread-1
--->Reducer-->pool-3-thread-1
14:40:03,867  INFO Task:1038 - Task:attempt_local866013445_0001_r_000000_0 is done. And is in the process of committing
14:40:03,868  INFO LocalJobRunner:591 - 1 / 1 copied.
14:40:03,868  INFO Task:1199 - Task attempt_local866013445_0001_r_000000_0 is allowed to commit now
14:40:03,873  INFO FileOutputCommitter:535 - Saved output of task 'attempt_local866013445_0001_r_000000_0' to file:/D:/hadoop/output/_temporary/0/task_local866013445_0001_r_000000
14:40:03,877  INFO LocalJobRunner:591 - reduce > reduce
14:40:03,877  INFO Task:1158 - Task 'attempt_local866013445_0001_r_000000_0' done.
14:40:03,877  INFO LocalJobRunner:325 - Finishing task: attempt_local866013445_0001_r_000000_0
14:40:03,877  INFO LocalJobRunner:456 - reduce task executor complete.
14:40:04,044  INFO Job:1360 - Job job_local866013445_0001 running in uber mode : false
14:40:04,045  INFO Job:1367 -  map 100% reduce 100%
14:40:04,045  INFO Job:1378 - Job job_local866013445_0001 completed successfully
14:40:04,050  INFO Job:1385 - Counters: 30
    File System Counters
        FILE: Number of bytes read=488
        FILE: Number of bytes written=566782
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
    Map-Reduce Framework
        Map input records=2
        Map output records=4
        Map output bytes=40
        Map output materialized bytes=54
        Input split bytes=96
        Combine input records=0
        Combine output records=0
        Reduce input groups=3
        Reduce shuffle bytes=54
        Reduce input records=4
        Reduce output records=3
        Spilled Records=8
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=7
        Total committed heap usage (bytes)=498073600
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters 
        Bytes Read=24
    File Output Format Counters 
        Bytes Written=36

Process finished with exit code 0

The results are written to D:\hadoop\output:

[Figure: output directory contents]

part-r-00000 contains the following (keys are sorted, which is why hadoop comes first):

hadoop  1
hello   2
java    1

The next post covers running WordCount on a cluster: Hadoop: Running WordCount on a Cluster.
