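When learning MapReduce, the Word Count program is the most basic introductory exercise. Different implementations differ in execution efficiency; below is one example written in Python.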

Map:

#!/usr/bin/python
#
#  WordCount mapper in Python
#  Author: Zeng, Xi
#  SID:    1010105140
#  Email:  zengxi@cuhk.edu.hk

import sys
import re

def main(argv):
  # A word starts with a letter, followed by any letters or digits.
  pattern = re.compile("[a-zA-Z][a-zA-Z0-9]*")
  # Iterating over sys.stdin stops at end of input, so no EOF handling is needed.
  for line in sys.stdin:
    for word in pattern.findall(line):
      # Emit "word<TAB>1" for every occurrence of a word.
      sys.stdout.write(word + "\t1\n")

if __name__ == "__main__":
  main(sys.argv)
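
Since different implementations differ in efficiency, one possible variation is to aggregate counts inside the mapper before emitting, so that fewer lines reach the sort phase. The following is a minimal sketch and not part of the original program. The names WORD_PATTERN, count_words and main are hypothetical, and it assumes that the distinct words seen by one mapper fit in memory.

```python
import re
import sys
from collections import Counter

# Same word pattern as the mapper above.
WORD_PATTERN = re.compile("[a-zA-Z][a-zA-Z0-9]*")

def count_words(lines):
    """Return a Counter of word occurrences across all input lines."""
    counts = Counter()
    for line in lines:
        counts.update(WORD_PATTERN.findall(line))
    return counts

def main():
    # Emit one "word<TAB>count" line per distinct word (assumption: all
    # distinct words of this mapper's input fit in memory).
    for word, count in count_words(sys.stdin).items():
        sys.stdout.write("%s\t%d\n" % (word, count))
```

To use this as a mapper script, call main() under the usual if __name__ == "__main__": guard. The reducer does not need to change, because it already sums whatever count follows the tab.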

Reduce:

#!/usr/bin/python
#
#  WordCount reducer in Python
#  Author: Zeng, Xi
#  SID:    1010105140
#  Email:  zengxi@cuhk.edu.hk

import sys
word_list = {}

## collect (key,val) pairs from sort phase
for line in sys.stdin:
    try:
        # Split on the first tab only: the key is on the left, the count on the right.
        word, count = line.strip().split("\t", 1)

        if word not in word_list:
            word_list[word] = int(count)
        else:
            word_list[word] += int(count)

    except ValueError as err:
        sys.stderr.write("Value ERROR: %(err)s\n%(data)s\n" % {"err": str(err), "data": line})

## emit results
for word, count in word_list.items():
    sys.stdout.write(" ".join([word, str(count)]) + "\n")
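
The reducer above keeps every word in a dictionary until the input ends. If the input arrives sorted by key, as the "sort phase" comment suggests, an alternative is to sum each run of identical keys and emit the total as soon as the key changes. This is a hedged sketch rather than the original author's method. The names parse, reduce_sorted and main are hypothetical, and the approach gives wrong totals if the input is not sorted by word.

```python
import sys
from itertools import groupby

def parse(lines):
    """Yield (word, count) pairs from "word<TAB>count" lines, skipping malformed ones."""
    for line in lines:
        try:
            word, count = line.rstrip("\n").split("\t", 1)
            yield word, int(count)
        except ValueError:
            # No tab, or a count that is not an integer.
            continue

def reduce_sorted(lines):
    """Yield (word, total) pairs; assumes lines are already sorted by word."""
    for word, group in groupby(parse(lines), key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)

def main():
    # Same "word count" output format as the reducer above.
    for word, total in reduce_sorted(sys.stdin):
        sys.stdout.write("%s %d\n" % (word, total))
```

Only one group of counts is held in memory at a time, so memory use does not grow with the number of distinct words. To use it as a reducer script, call main() under an if __name__ == "__main__": guard.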