How do we aggregate partial counts efficiently?
The "pairs" approach. This algorithm illustrates the use of complex keys to coordinate a distributed computation.
Each mapper takes a sentence; reducers sum up the counts associated with each pair.

// "pairs" approach
class MAPPER
    method MAP(docid a, doc d)
        for all term w ∈ doc d do
            for all term u ∈ NEIGHBORS(w) do
                EMIT(pair (w, u), count 1)    // emit a count of 1 for each co-occurrence

class REDUCER
    method REDUCE(pair p, counts [c1, c2, ...])
        s = 0
        for all count c ∈ counts [c1, c2, ...] do
            s = s + c
        EMIT(pair p, count s)

For each term, emit pairs of the form ((a, b), 1); the key is the pair (a, b) itself.
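As a concrete illustration, the following is a minimal single-process Python sketch of the pairs approach, assuming a window-based definition of NEIGHBORS and an in-memory dictionary standing in for the sort/shuffle; the helper names (neighbors, map_pairs, reduce_pairs) are illustrative, not part of any MapReduce framework.

from collections import defaultdict

def neighbors(terms, i, window=2):
    # Assumed definition of NEIGHBORS: terms within `window` positions of position i.
    lo, hi = max(0, i - window), min(len(terms), i + window + 1)
    return [terms[j] for j in range(lo, hi) if j != i]

def map_pairs(doc):
    # MAP: emit ((w, u), 1) for every co-occurring pair in the document.
    terms = doc.split()
    for i, w in enumerate(terms):
        for u in neighbors(terms, i):
            yield (w, u), 1

def reduce_pairs(emitted):
    # Stand-in for the shuffle + REDUCE: group identical keys and sum their counts.
    counts = defaultdict(int)
    for key, c in emitted:
        counts[key] += c
    return dict(counts)

docs = ["a b c b", "b c d"]
emitted = [kv for d in docs for kv in map_pairs(d)]
print(reduce_pairs(emitted))   # co-occurrence count per (w, u) pair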
Advantages
Easy to implement and easy to understand: the map step just finds pairs, the reduce step just counts them.

Disadvantages
Lots of pairs to sort and shuffle around: the key space is the set of all word pairs, upper-bounded by n² distinct keys for a vocabulary of n words.
Not many opportunities for combiners to work.

In the "stripes" approach, co-occurrence information is first stored in an associative array, denoted H. The mapper emits key-value pairs with words as keys and corresponding associative arrays as values, where each associative array encodes the co-occurrence counts of the neighbors of a particular word.
Each mapper takes a sentence; reducers perform an element-wise sum of the associative arrays.

// "stripes" approach
class MAPPER
    method MAP(docid a, doc d)
        for all term w ∈ doc d do
            H = new ASSOCIATIVEARRAY
            for all term u ∈ NEIGHBORS(w) do
                H{u} = H{u} + 1    // tally words co-occurring with w
            EMIT(term w, stripe H)

class REDUCER
    method REDUCE(term w, stripes [H1, H2, H3, ...])
        Hf = new ASSOCIATIVEARRAY
        for all stripe H ∈ stripes [H1, H2, H3, ...] do
            SUM(Hf, H)    // element-wise sum of associative arrays
        EMIT(term w, stripe Hf)

For each term a, emit a stripe: a → {b: 1, c: 2, d: 2, ...}; the key is the term "a" itself.
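A matching single-process Python sketch of the stripes approach, under the same assumptions as the pairs sketch above (window-based neighbors, in-memory stand-in for the shuffle, illustrative helper names); here each emitted value is a Counter playing the role of the associative array H.

from collections import Counter, defaultdict

def neighbors(terms, i, window=2):
    # Same assumed NEIGHBORS definition as in the pairs sketch.
    lo, hi = max(0, i - window), min(len(terms), i + window + 1)
    return [terms[j] for j in range(lo, hi) if j != i]

def map_stripes(doc):
    # MAP: for each term w, emit one associative array (stripe) of neighbor counts.
    terms = doc.split()
    for i, w in enumerate(terms):
        stripe = Counter(neighbors(terms, i))   # tally words co-occurring with w
        yield w, stripe

def reduce_stripes(emitted):
    # Stand-in for the shuffle + REDUCE: element-wise sum of all stripes per term.
    final = defaultdict(Counter)
    for w, stripe in emitted:
        final[w].update(stripe)                 # Counter.update adds counts element-wise
    return {w: dict(h) for w, h in final.items()}

docs = ["a b c b", "b c d"]
emitted = [kv for d in docs for kv in map_stripes(d)]
print(reduce_stripes(emitted))   # one merged stripe per term, e.g. 'b' → {'a': 1, 'c': 3, 'b': 2, 'd': 1}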
Advantages
Far less sorting and shuffling of key-value pairs.
Can make better use of combiners (see the sketch after these lists).

Disadvantages
More difficult to implement.
Underlying object is more heavyweight.
Fundamental limitation in terms of the size of the event space: each stripe must hold counts for every neighbor of a term, so it has to fit in memory.
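To make the combiner point concrete, here is a minimal sketch under the same assumptions and hypothetical helper names as the Python sketches above: because the element-wise sum of stripes is associative and commutative, the reducer logic can be reused as a combiner that merges a mapper's local stripes before the shuffle.

from collections import Counter

def combine_stripes(local_output):
    # Merge one mapper's stripes per term before they are shuffled,
    # using the same element-wise sum the reducer performs.
    partial = {}
    for w, stripe in local_output:
        partial[w] = partial.get(w, Counter()) + stripe
    return list(partial.items())

# Usage (assuming map_stripes from the earlier sketch): fewer, larger
# key-value pairs leave each mapper, which is what the combiner buys us.
# combined = combine_stripes(list(map_stripes("a b c b")))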