COMP9313_WEEK3

声明:由于本人也是处于学习阶段,有些理解可能并不深刻,甚至会携带一定错误,因此请以批判的态度来进行阅读,如有错误,请留言或直接联系本人。

WEEK3内容概要:1)week2回顾;2)Design Pattern 1: In-mapper Combining;3)Design Pattern 2: Pairs vs Stripes;4)Writable(定义及相关Java语句); 5)Design Pattern 3: Order Inversion

关键词:In-mapper Combining; commutative; associative; Pairs; Stripes; Term Co-occurrence; serialization; Writable; Order Inversion

Overview of Previous Lecture
1)Motivation of MapReduce
2)Data Structures in MapReduce: (key, value) pairs
3)Map and Reduce Functions
4)Hadoop MapReduce Programming
4.1)Mapper
4.2)Reducer
4.3)Combiner
4.4)Partitioner
4.5)Driver

问题,combiner function的作用和使用规则是什么?
1)To minimize the data transferred between map and reduce tasks
2)Combiner function is run on the map output
3)Both input and output data types must be consistent with the output of mapper (or input of reducer)
4)But Hadoop do not guarantee how many times it will call combiner function for a particular map output record
4.1)It is just optimization
4.2)The number of calling (even zero) does not affect the output of Reducers
5)Applicable on problems that are commutative(分配律) and associative(结合律)(注意,只有满足交换律和结合律才能将reducer当作combiner用。PPT.13)
(记住:1)Combiners and reducers share same method signature因此Sometimes, reducers can serve as combiners;2)Remember: combiner are optional optimizations,所以Should not affect algorithm correctness,以及May be run 0, 1, or multiple times)

Design Pattern 1: In-mapper Combining
在处理数据情况下,Twice the data, twice the running time,如果我们Twice the resources, half the running time(可以理解为使用两台计算机并行处理,那么速度就倍增),但是Data synchronization requires communication, but communication kills performance.因此,我们通过先进行本地aggregation然后再通过Reducer的HTTP 请求传输给Reducer,以提高效率。我们已知,combiner是工作在mapper node中的optimization option,但是它的工作流程是mapper output(memory) -> disk -> combiner processing(memory)。因为要通过I/0传输给disk,所以受限于I/O带宽,效率也会受到影响。那么有没有什么好的算法设计,使得aggregation直接在memory中进行呢?有,就是采用In-mapper Combining。

下面我们用用Word Count来作为例子进行解释:


图一

图一伪代码将mapper output 直接作为reducer input,因此会有成百上千的(termt, count 1)的intermediate records 直接传输给reducer。由于网络带宽的限制,这会降低mapreduce的工作效率。


图二

由此,我们采用图二的方式,设一个ASSOCIATIVEARRAY将一个document下所有term t sum到H{t}中,然后EMIT(term t, count H{t})。
那么这里我们可以忽略combiner吗?

不可以,因为每次执行一个document,针对term t会建立一个新的ASSOCIATIVEARRAY。然而一个mapper node在一个task中可能会执行多个document(假设n个document),而每个document中都可能有term t,那么将会生成n个关于term t的ASSOCIATIVEARRAY。于是,还需要使用combiner将这n个term t的ASSOCIATIVEARRAY进行combine。

图三

由于combiner的效率还是不高,于是我们采用图三的in-mapper combining。这里的先创立一个总的ASSOCIATIVEARRAY,通过不断的读入documents来不断的将term t的数量sum到这个总的ASSOCIATIVEARRAY中的H{t}中。

由此,我们可以总结in-mapper combining的特点:
1)Fold the functionality of the combiner into the mapper by preserving state across multiple map calls
2)Advantages
2.1)Speed(通过网络传输的数据更加小(就wordcount而言,一个mapper传给一个reducer只有一个pair))
2.2)It's faster than actual combiners(因为in-mapper combining是完全在memory中进行的,而combiner combine数据要将map output从memory传到disk,再从disk传到memory进行combine)
3)Disadvantages
3.1)Explicit memory management required
3.2)Potential for order-dependent bugs
下面我们再用一个求平均值的例子更加详细的说明in-mapper combining:

图四

在图四中,我们直接将mapper output作为reducer input,而省略了combiner,那么这里我们将reducer当作combiner用可不可以呢?

不可以,因为只有满足交换律和结合律才能将reducer当作combiner用。

(注意,这是个错误示例,是为了说明combiner的使用是要满足commutative(分配律)和 associative(结合律)的)
class MAPPER
    method MAP(string t, interger r)
        EMIT(string t, integer r)

class COMBINER
    method COMBINER(string t, intgers[r1, r2, ...])
        sum <- 0
        count <- 0
        for all integer r 属于 intergers[r1, r2, ...] do
            sum <- sum + r
            count <- count + 1
        r(avg) <- sum/count
        EMIT(string t, integer r(avg))

class REDUCER
    method REDUCER(string t, integers[r1(avg), r2(avg),...])
        sum <- 0
        count <- 0
        for all integer r 属于 intergers[r1(avg), r2(avg),...] do
            sum <- sum + r(avg)
            count <- count + 1
        r_all(avg) <- sum/count
        EMIT(string t, integer r_all(avg))

然而在这里,如果每一个mapper node的关于string t的平均值都被combiner算出来了,那么在reducer中首先得到的是每个mapper node关于string t的平均值,接着这些平均值被sum然后求average,这是不正确的。例如:Mean(1, 2, 3, 4, 5) != Mean(Mean(1, 2), Mean(3, 4, 5))

图五

那么,我们采用图5的方法可以吗?
不可以,因为mapper input的类型与reducer input 的类型不符,这里经过combiner的combine之后,数据类型被改变了,而根据MapReduce的作业要求:Both input and output data types must be consistent with the output of mapper (or input of reducer)。所以图五的方式不正确。


图六

图六的写法是对的,然而有combiner还是不那么高效,可以将其转化为更高效的in-mapper combining 吗?可以,见图七。


图七

问题,How to Implement In-mapper Combiner in MapReduce?
1)Lifecycle: setup -> map -> cleanup
1.1)setup(): called once at the beginning of the task
1.2)map(): do the map
1.3)cleanup(): called once at the end of the task.
1.4)We do not invoke these functions
2)In-mapper Combining:
1.1)Use setup() to initialize the state preserving data structure
2.2)Use clearnup() to emit the final key-value pairs

LifeCycle

Design Pattern 2: Pairs vs Stripes
Term Co-occurrence Computation
1)Term co-occurrence matrix for a text collection
1.1)M = N x N matrix (N = vocabulary size)
1.2)Mij: number of times i and j co-occur in some context (for concreteness, let’s say context = sentence)
1.3)specific instance of a large counting problem
1.3.1)A large event space (number of terms)
1.3.2)A large number of observations (the collection itself)
1.3.3)Goal: keep track of interesting statistics about the events
2)Basic approach
2.1)Mappers generate partial counts
2.2)Reducers aggregate partial counts

那我们用什么方法来计算Term Co-occurrence呢?
1)Pairs
2)Stripes

pairs:
1)Each mapper takes a sentence
1.1)Generate all co-occurring term pairs
1.2)For all pairs, emit (a, b) → count
2)Reducers sum up counts associated with these pairs
3)Use combiners

image.png

Pairs’ Advantages
1)Easy to implement, easy to understand

Pairs’ Disadvantages
1)Lots of pairs to sort and shuffle around (upper bound?)
2)Not many opportunities for combiners to work(因为,相对于一个词的wordcount,相同单词对在一个mapper node中显得较少,因此combiner作用的机会少)

Stripes:

image.png

image.png

上图,MAPPER中,第一个for循环找出,key:w并创立新ASSOCIATIVEARRAY: H(用来存储stripes),然后第二个for循环找出w的neighbor word: u,然后对u计数存入H{u}。最后输出EMIT(Term w, Stripe H)。在Reducer中,通过对以Key:w的数据类中的stripes里的neighbor words的H进行sum,来得到一个最终的以key: w; stripe: Hf的数据类型输出。(例如, 2个mapper各向reducer传入:a→{b:1; d:5; e:3}; a→{b:1; c:2; d:2; f:2}, 于是reducer将他们aggregate后得到 a→{b:2; c:2; d:7; e:3; f:2})

Stripes’ Advantages
1)Far less sorting and shuffling of key-value pairs
2)Can make better use of combiners

Stripes’ Disadvantages
1)More difficult to implement
2)Underlying object more heavyweight
3)Fundamental limitation in terms of size of event space

Pairs vs. Stripes:
1)The pairs approach
1.1)Keep track of each team co-occurrence separately
1.2)Generates a large number of key-value pairs (also intermediate)
1.3)The benefit from combiners is limited, as it is less likely for a mapper to process multiple occurrences of a word
2)The stripe approach
2.1)Keep track of all terms that co-occur with the same term
2.2)Generates fewer and shorted intermediate keys
2.3)The framework has less sorting to do
2.4)Greatly benefits from combiners, as the key space is the vocabulary
2.5)More efficient, but may suffer from memory problem
3)These two design patterns are broadly useful and frequently observed in a variety of applications
3.1)Text processing, data mining, and bioinformatics
相较于Pairs,Stripes面对大量数据时体现出更高的时间效率。

问题,How to Implement “Pairs” and “Stripes”in MapReduce?
首先介绍以下Serialization:Process of turning structured objects into a byte stream for transmission over a network or for writing to persistent storage

Requirements:
1)Compact:To make efficient use of storage space
2)Fast:The overhead in reading and writing of data is minimal
3)Extensible:We can transparently read data written in an older format
4)Interoperable:We can read or write persistent data using different language

再介绍以下Writable Interface:Writable is a serializable object which implements a simple, efficient, serialization protocol

public interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

Writable Interface要求:
1)All values must implement interface Writable
2)All keys must implement interface WritableComparable
3)context.write(WritableComparable, Writable)
3.1)You cannot use java primitives here!!

Writable 再pairs和stripes中的应用:


image.png

image.png

Design Pattern 3: Order Inversion

在WordCount的时候我们,可能不仅需要统计一对单词组(例如,computer science)的出现次数,而且还要统计它(computer, science)在所有的出现computer的单词对(computer, *)中的出现频率,因此我们需要统计出所有(computer, *)(设为N(Wi, w’))的数量,然后再用(computer, science)(设为N(Wi, wj))出现的次数除以N(Wi, w’)。见下图


image.png

注意:1)N(·,·) is the number of times a co-occurring word pair is observed
2)The denominator is called the marginal

这里有两种方式统计词频:1)Stripes; 2)Pairs (统计词频是在reducer中进行的)
1)Stripes:(这个方法面对的问题是memory,因为所有的stripes的活动都得在内存中进行)
In the reducer, the counts of all words that co-occur with the conditioning variable (wi) are available in the associative array. Hence, the sum of all those counts gives the marginal.

a →  {b1:3, b2 :12, b3 :7, b4 :1, … }
f(b1|a) = 3 / (3 + 12 + 7 + 1 + …)

2)Pairs:
Reducer直接接收pairs然后进行词频统计,然而这里有个问题,就是分母必须是在扫描了所有的(wi, *)后才能得到,不然无法进行词频计算。

这里有两种解决方式:
2.1) Fortunately, as for the mapper, also the reducer can preserve state across multiple keys. We can buffer in memory all the words that co-occur with wi and their counts. This is basically building the associative array in the stripes method.(类似于stripes的方式,问题也是memory)
但是这个方法在mapper传来的pairs没有被sort的时候,需要对其先按照(wi, *)的方式进行sort,然后才是(wi, wj)的sort,然而在sort的过程中(即sort(wi, )),我们遍历了所有(wi,)对,所以分母就知道了。(所有的一切都是在memory中进行)
这里还有个问题就是,reducer1可能索取了(wi, wj),reducer2可能索取了(wi, wq),这就会产生词频统计不正确的现象(因为分母错误了),所以我们需要自己设定partitioner来使一个reducer来请求(wi, *)的单词对。这里的partitioner是基于采用hash value的方式,因为hash值是具有身份鉴别功能,所以只需要(wi, *)中的wi的hash value相同,就会被同一个reducer fetch。

2.2)将统计好的marginal先传输给reducer(例如,(pair(wi, *), all_count)),再将各个单词对及其出现次数(pair(wi, wu), count)传输给reducer进行词频计算(即 frequency_of_(wi, wu) = count/all_count)。在这里关键就是properly sequence data presented to reducers,那么我们就需要采用一种叫“order inversion”的方法:
(1)The mapper:
(1.1)additionally emits a “special” key of the form (wi, ∗)
(1.2)The value associated to the special key is one, that represents the contribution of the word pair to the marginal
(1.3)Using combiners, these partial marginal counts will be aggregated before being sent to the reducers
(2)The reducer:
(2.1)We must make sure that the special key-value pairs are processed before any other key-value pairs where the left word is wi (define sort order)
(2.2)We also need to guarantee that all pairs associated with the same word are sent to the same reducer (use partitioner)

image.png

方法二的特点:
(1)Memory requirements:
(1.1)Minimal, because only the marginal (an integer) needs to be stored
(1.2)No buffering of individual co-occurring word
(1.3)No scalability bottleneck
(2)Key ingredients for order inversion
(2.1)Emit a special key-value pair to capture the marginal
(2.2)Control the sort order of the intermediate key, so that the special key-value pair is processed first
(2.3)Define a custom partitioner for routing intermediate key-value pairs

Order Inversion:
(1)Common design pattern
(1.1)Computing relative frequencies requires marginal counts
(1.2)But marginal cannot be computed until you see all counts
(1.3)Buffering is a bad idea!
(1.4)Trick: getting the marginal counts to arrive at the reducer before the joint counts
(2)Optimizations
(2.1)Apply in-memory combining pattern to accumulate marginal counts

最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 206,968评论 6 482
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 88,601评论 2 382
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 153,220评论 0 344
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 55,416评论 1 279
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 64,425评论 5 374
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 49,144评论 1 285
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 38,432评论 3 401
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 37,088评论 0 261
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 43,586评论 1 300
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 36,028评论 2 325
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 38,137评论 1 334
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 33,783评论 4 324
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 39,343评论 3 307
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 30,333评论 0 19
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 31,559评论 1 262
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 45,595评论 2 355
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 42,901评论 2 345