[hadoop] 17. MapReduce - WordCount example
Published: 2019-06-27



Introduction

In this section you will learn: the WordCount example.

1. A simple implementation

1.1 The Mapper class

package com.zhaoyi.wordcount;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

/**
 * The four type parameters specify the input key-value types and the output key-value types.
 */
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // 1. convert the Text to a Java String; this is one line of input.
        String line = value.toString();
        // 2. split the line by " "
        String[] words = line.split(" ");
        // 3. write word-1 key-value pairs to the context.
        for (String word : words) {
            // use the word as the key and 1 as the value,
            // so that identical words are routed to the same reduce task
            context.write(new Text(word), new IntWritable(1));
        }
    }
}

A mapper is written by extending the Mapper class. Let's look at the Mapper source code:

//
// Source code recreated from a .class file by IntelliJ IDEA
// (powered by Fernflower decompiler)
//
package org.apache.hadoop.mapreduce;

import java.io.IOException;
import org.apache.hadoop.classification.InterfaceAudience.Public;
import org.apache.hadoop.classification.InterfaceStability.Stable;

@Public
@Stable
public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
    public Mapper() {
    }

    protected void setup(Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>.Context context) throws IOException, InterruptedException {
    }

    protected void map(KEYIN key, VALUEIN value, Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>.Context context) throws IOException, InterruptedException {
        context.write(key, value);
    }

    protected void cleanup(Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>.Context context) throws IOException, InterruptedException {
    }

    public void run(Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>.Context context) throws IOException, InterruptedException {
        this.setup(context);
        try {
            while (context.nextKeyValue()) {
                this.map(context.getCurrentKey(), context.getCurrentValue(), context);
            }
        } finally {
            this.cleanup(context);
        }
    }

    public abstract class Context implements MapContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
        public Context() {
        }
    }
}

As you can see, it requires us to specify four type parameters, corresponding to the Mapper's input key-value types and its output key-value types.

Our processing logic is very simple: we split each line into words on a single space. Strictly speaking, you would also need to handle runs of multiple spaces; if you need that, this is the place to add it, as in the sketch below.
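A minimal sketch of a more tolerant split, assuming you only want to change step 2 of the map() method above:

// Split on one or more whitespace characters (spaces, tabs) instead of a single space.
// trim() first so that leading whitespace does not produce an empty leading token.
String[] words = line.trim().split("\\s+");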

1.2 The Reducer class

The Reducer class follows much the same pattern as the Mapper: it also takes four type parameters, and its input key-value types correspond to the Mapper's output key-value types. Its output types here are Text and IntWritable. As for why we don't use Java's own wrapper types, we will come back to that later; a rough impression is enough for now.

package com.zhaoyi.wordcount;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

/**
 * The input K-V types are the Mapper's output K-V types. Since the result we want is word-count,
 * the output K-V types are Text-IntWritable.
 */
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int count = 0;
        // 1. sum up the counts for this key
        for (IntWritable value : values) {
            count += value.get();
        }
        // 2. emit the total for this key
        context.write(key, new IntWritable(count));
    }
}

1.3 The driver class

This class is responsible for wiring up the Mapper and Reducer and submitting the job.

package com.zhaoyi.wordcount;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        // 0. check the arguments
        if (args.length != 2) {
            System.out.println("Please enter the parameters: data input and output paths...");
            System.exit(-1);
        }
        // 1. create the job from the configuration
        Configuration configuration = new Configuration();
        Job job = Job.getInstance(configuration);
        // 2. set the driver class
        job.setJarByClass(WordCountDriver.class);
        // 3. specify the mapper and reducer classes
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        // 4. set the output key-value types (reducer output)
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // 5. set the input and output paths, taken from the program arguments
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // 6. submit the job and exit with 0 on success, 1 on failure
        // job.submit();
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}

1.4 Packaging

1. Go to the project directory and package it with Maven:

cd word-count
mvn install

When this finishes, you will find a wordcount-1.0-SNAPSHOT.jar file in the build output directory. Copy it to the user home directory on our Hadoop server.

1.5 Testing

Now upload a file to the /input directory (an HDFS directory) with the following content; this file will be the input for our analysis:

this is a test
just a test
Alice was beginning to get very tired of sitting by her sister on the bank
and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, `and what is the use of a book,' thought Alice `without pictures or conversation?' So she was considering in her own mind
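If the file is not on HDFS yet, it can be uploaded along these lines (the local file name wordcount.txt is just an example):

hadoop fs -mkdir -p /input
hadoop fs -put wordcount.txt /input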

Next, run the job directly:

[root@h133 ~]# hadoop jar wordcount-1.0-SNAPSHOT.jar com.zhaoyi.wordcount.WordCountDriver /input /output
...
19/01/07 10:21:20 INFO client.RMProxy: Connecting to ResourceManager at h134/192.168.102.134:8032
19/01/07 10:21:22 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
19/01/07 10:21:23 INFO input.FileInputFormat: Total input paths to process : 1
19/01/07 10:21:25 INFO mapreduce.JobSubmitter: number of splits:1
19/01/07 10:21:26 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1546821218045_0001
19/01/07 10:21:27 INFO impl.YarnClientImpl: Submitted application application_1546821218045_0001
19/01/07 10:21:27 INFO mapreduce.Job: The url to track the job: http://h134:8088/proxy/application_1546821218045_0001/
19/01/07 10:21:27 INFO mapreduce.Job: Running job: job_1546821218045_0001
...

com.zhaoyi.wordcount.WordCountDriver is the fully qualified name of our driver class; fill it in according to your actual environment. The two arguments that follow are the input path and the output path.
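The WARN line in the log above suggests implementing the Tool interface and running the job through ToolRunner, so that generic Hadoop command-line options are parsed for you. A minimal sketch of what that could look like for this driver (the class name WordCountToolDriver is hypothetical, not part of the original code):

package com.zhaoyi.wordcount;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Hypothetical alternative driver, shown only to illustrate the ToolRunner hint in the log above.
public class WordCountToolDriver extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf());
        job.setJarByClass(WordCountToolDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner parses generic options (-D, -files, ...) and passes the remaining args to run()
        System.exit(ToolRunner.run(new Configuration(), new WordCountToolDriver(), args));
    }
}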

Wait for the job to complete. You can also follow the job's progress through the web UI, using the tracking URL printed in the log above.

Finally, we get the output we want:

[root@h133 ~]# hadoop fs -cat /output/part-r-00000
Alice	2
So	1
`and	1
`without	1
a	3
and	1
and	1
bank	1
beginning	1
book	1
book,'	1
but	1
by	1
considering	1
conversation?'	1
conversations	1
...

Reposted from: https://my.oschina.net/u/3091870/blog/3000615
