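I have been interviewing a lot lately, and every round seems worth writing up.
Recently I was asked how to write a Mapper and a Reducer that compute an average, which is only a slightly more involved case than the usual examples.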
The input has four columns:
merchant name, category, transaction in 2014, transaction dollar in 2014
The task is to compute the average dollar transaction per category.
How should the Mapper and the Reducer be designed?
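The whole thing breaks down into two steps:
Average dollar transaction per category = total transaction dollars in 2014 / total number of transactions in 2014
So this division has to be the last step of the Reducer:
key (category) -> Sum(2014 transaction dollars) / Sum(2014 transaction count)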
The Mapper:
key (category) -> [Sum(2014 transaction dollars), Sum(2014 transaction count)]
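The sum-and-count pair that the Mapper emits can be wrapped in a small value type. Here is a minimal plain-Java sketch, not tied to any Hadoop API. The class name `SumCount` and its `merge` and `average` methods are my own assumptions for illustration; the two accessors mirror the ones used in the reducer loop further down.

```java
// Hypothetical value type: a running dollar total plus a transaction count.
final class SumCount {
    private final double sum;   // total dollar amount
    private final double count; // total number of transactions

    SumCount(double sum, double count) {
        this.sum = sum;
        this.count = count;
    }

    double getSum() { return sum; }

    double getCount() { return count; }

    // Combine two partial results by adding sums and counts separately.
    SumCount merge(SumCount other) {
        return new SumCount(sum + other.sum, count + other.count);
    }

    // Average dollars per transaction; only meaningful when count > 0.
    double average() {
        return sum / count;
    }
}
```

Keeping the sum and the count separate until the very end is what makes partial results from different mappers safe to combine.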
So the Mapper and the Reducer can be implemented as follows (pseudocode, not real code):
Store all the records in a HashMap:
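// Declared outside map() so it is not reset for every record.
Map<String, List<double[]>> map = new HashMap<>();

void map(String key, String value) {
    // Assumption: fields are comma-separated in the column order given above.
    String[] values = value.split(",");
    String category = values[1];
    double transactionNumber = Double.parseDouble(values[2]);
    double transactionValue = Double.parseDouble(values[3]);
    // Group the [count, dollars] pairs by category.
    map.computeIfAbsent(category, k -> new ArrayList<>())
       .add(new double[] {transactionNumber, transactionValue});
}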
Use a loop to compute the sums:
void sumCount(Map<String, List<double[]>> map) {
    for (String category : map.keySet()) {
        double sum = 0;   // total number of transactions
        double sum2 = 0;  // total dollar amount
        for (double[] pair : map.get(category)) {
            sum += pair[0];
            sum2 += pair[1];
        }
        // Emit the category as the key and the two totals as the value.
        output(category, new double[] {sum, sum2});
    }
}
So the output should look like this, each value being [transaction count, dollar total]:
A <- [15, 2222]
B <- [10, 12999]
C <- [25, 1390]
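The mapper-side steps above can be sketched as runnable plain Java. This is a simulation under stated assumptions, not Hadoop code: the class name `CategoryMapperSketch`, the comma delimiter, and the sample rows are all hypothetical. The rows were picked so that category A adds up to the [15, 2222] shown above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class CategoryMapperSketch {
    // Parses lines of the assumed form "merchant,category,transactions,dollars"
    // and returns category -> [transaction count, dollar total].
    static Map<String, double[]> aggregate(List<String> lines) {
        Map<String, double[]> totals = new TreeMap<>();
        for (String line : lines) {
            String[] fields = line.split(",");
            String category = fields[1].trim();
            double transactions = Double.parseDouble(fields[2].trim());
            double dollars = Double.parseDouble(fields[3].trim());
            // Create the [0, 0] pair on first sight of a category, then accumulate.
            double[] pair = totals.computeIfAbsent(category, k -> new double[2]);
            pair[0] += transactions;
            pair[1] += dollars;
        }
        return totals;
    }

    public static void main(String[] args) {
        // Hypothetical input rows, not taken from the interview question.
        List<String> input = Arrays.asList(
            "m1,A,10,1222",
            "m2,A,5,1000",
            "m3,B,10,12999");
        aggregate(input).forEach((category, pair) ->
            System.out.println(category + " <- [" + pair[0] + ", " + pair[1] + "]"));
    }
}
```

Accumulating directly into one pair per category avoids keeping every record in memory, which the list-based pseudocode would do.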
That is all the Mapper has to do; the reducer is then very simple:
If each mapper emitted its own mean, the reducer could not combine them correctly, because a mean of means is not the overall mean when the groups differ in size. If instead of emitting the mean we emit the sum of the values and the number of values, we overcome the problem. For example, if the first mapper emits the pair (30.0, 2) and the second (9.0, 3), summing the values and dividing by the sum of the counts gives 39.0 / 5 = 7.8, the right result.
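To make the point concrete, here is a small plain-Java sketch that compares the two approaches on the pairs (30.0, 2) and (9.0, 3) from the paragraph above. The class and method names are hypothetical, chosen only for this illustration.

```java
final class MeanOfMeansDemo {
    // Flawed approach: average the per-mapper means directly.
    static double meanOfMeans(double[][] sumCountPairs) {
        double total = 0;
        for (double[] p : sumCountPairs) {
            total += p[0] / p[1];
        }
        return total / sumCountPairs.length;
    }

    // Correct approach: add up all sums and all counts, then divide once.
    static double meanFromSumCounts(double[][] sumCountPairs) {
        double sum = 0;
        double count = 0;
        for (double[] p : sumCountPairs) {
            sum += p[0];
            count += p[1];
        }
        return sum / count;
    }

    public static void main(String[] args) {
        // Each pair is (sum, count), as in the text.
        double[][] pairs = { {30.0, 2}, {9.0, 3} };
        System.out.println(meanOfMeans(pairs));       // prints 9.0
        System.out.println(meanFromSumCounts(pairs)); // prints 7.8
    }
}
```

The first mapper's mean is 15.0 and the second's is 3.0, so averaging them gives 9.0, while the true mean over all five values is 7.8.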
Reducer:
Collect all the SumCount data from the Mapper and put it in a HashMap,
then iterate over the categories and compute the average for each one:
for (String category : map.keySet()) {
    double sum = map.get(category).getSum();
    double count = map.get(category).getCount();
    // Emit the average for this category.
    write(category, sum / count);
}
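When several mappers each emit a partial pair for the same category, the reducer presumably has to add those partials together before dividing. The following plain-Java sketch shows that step under my own assumptions: the class name `AverageReducerSketch` is hypothetical, the sample partials are invented, and each pair is ordered [transaction count, dollar total] as in the mapper output shown earlier. It is a simulation, not Hadoop's Reducer API.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class AverageReducerSketch {
    // Input: category -> partial [transaction count, dollar total] pairs.
    // Output: category -> average dollars per transaction.
    static Map<String, Double> reduce(Map<String, List<double[]>> partials) {
        Map<String, Double> averages = new TreeMap<>();
        for (Map.Entry<String, List<double[]>> entry : partials.entrySet()) {
            double count = 0;
            double sum = 0;
            for (double[] pair : entry.getValue()) {
                count += pair[0];
                sum += pair[1];
            }
            // Divide only once, after all partials have been added up.
            averages.put(entry.getKey(), sum / count);
        }
        return averages;
    }

    public static void main(String[] args) {
        Map<String, List<double[]>> partials = new HashMap<>();
        // Category A's [15, 2222] total, split across two hypothetical mappers.
        partials.put("A", Arrays.asList(new double[] {10, 1222}, new double[] {5, 1000}));
        partials.put("C", Arrays.asList(new double[] {25, 1390}));
        System.out.println(reduce(partials));
    }
}
```

Because the partials are plain sums, it does not matter how the records were split across mappers; the averages come out the same.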