Cloud Computing(3)

    xiaoxiao2021-03-25  78

    How do we aggregate partial counts efficiently?

    Pairs

    An algorithm. This algorithm illustrates the use of complex keys in order to coordinate distributed computations.

    Each mapper takes a sentenceReducers sum up counts associated with these pairs //"pairs" approach class MAPPER method MAP(docid a, doc d) for all term w∈doc d do for all term u∈NEIGHBORS(w) do EMIT( pair(w, u) , count 1) //EMIT count for each co-occurrence class REDUCER method REDUCE(pair p, counts[c1, c2, ...]) s = 0 for all count c ∈counts[c1, c2, ...] do s = s + c EMIT(pair p, count s)

    For each term emit pairs: ( (a,b), 1 ) 键值是一个pair(a,b)

    “Pairs Analysis”(数组短,但数目多)

    Advantages

    Easy to implement, easy to understand: map就是找pair,reduce就是统计

    Disadvantages

    Lots of pairs to sort and shuffle around, upper bound = (n!)(n个单词,就有n的阶乘个pairs)Not many opportunities for combiners to work

    Stripes

    Co-occurrence information is first stored in an associative array, denoted H. The mapper emits key-value pairs with words as keys and corresponding associative arrays as values, where each associative array encodes the co-occurrence counts of the neighbors of a particular word.

    Each mapper takes a sentenceReducers perform element-wise sum of associative arrays //"stripes" approach class MAPPER method MAP(docid a, doc d) for all term w∈doc d do H = new ASSOCIATIVEARRAY for all term u∈NEIGHBORS(w) do H{u} = H{u} + 1 //Tally words co-occurring with w EMIT( term w , Stripe H) class REDUCER method REDUCE(term w , Stripes [H1, H2, H3,...]) Hf = new ASSOCIATIVEARRAY for all stripe H ∈stripes[H1, H2, H3, ...] do sum(Hf,H) EMIT(term w , Stripe Hf)

    For each term emit stripes: a->{b:1, c:2, d:2, ….} 键值是“a”

    “Stripes Analysis”(数组长,但数目少)

    Advantages

    Far less sorting and shuffling of key-value pairsCan make better use of combiners

    Disadvantages

    More difficult to implementUnderlying object more heavyweightFundamental limitation in terms of size of event space

    Pairs vs. Stripes

    处理量不大,处理资源数目少,用pairs;反之,stripes较优
    转载请注明原文地址: https://ju.6miu.com/read-16978.html

    最新回复(0)