Hadoop Tutorial

Install Hadoop

  1. Download the Hadoop tarball from a stable Hadoop release (1.2.1 was used for this example)
  2. Extract the downloaded package into /usr/local, or somewhere else if you prefer
  3. Create a symlink: ln -s /usr/local/hadoop-1.2.1 /usr/local/hadoop
  4. Export the following environment variables:
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
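
To make these variables persistent and confirm that the hadoop binary is on your PATH, you can do something like the following (a sketch assuming a bash shell; adjust the profile file to your setup):

# Persist the environment variables across shells (bash assumed)
echo 'export HADOOP_HOME=/usr/local/hadoop' >> ~/.bash_profile
echo 'export PATH=$PATH:$HADOOP_HOME/bin' >> ~/.bash_profile

# Verify the installation
hadoop version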

MapReduce (Streaming) on a single node

In this section, we will run a simple word count MapReduce job using Hadoop Streaming in a single-node (local) setup.

Our map function breaks each input line into words and emits each word with a value of 1.

#!/usr/bin/env python
# src/word_count_map.py

import re
import sys

# Read lines from stdin, strip basic punctuation, and emit "word<TAB>1" for every word.
for line in sys.stdin:
  sentence = line.strip()
  sentence = re.sub(r'[,.]', '', sentence)
  for word in sentence.split():
    print('%s\t%s' % (word, 1))
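
As a quick check, the mapper turns one input line into one tab-separated line per word (the sample sentence here is just an illustration):

# Emits Hello, Hadoop, hello and MapReduce, each followed by a tab and the count 1
echo 'Hello Hadoop, hello MapReduce.' | python src/word_count_map.py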

Our reduce function counts the total occurrences of each word. Streaming sorts the map output by key before it reaches the reducer, so all lines for the same word arrive consecutively.

#!/usr/bin/env python
# src/word_count_reduce.py

import sys

(last_word, count) = (None, 0)

# Input arrives sorted by key, so all lines for the same word are consecutive.
for line in sys.stdin:
  (word, val) = line.strip().split('\t')
  if last_word and last_word != word:
    # A new word starts: emit the finished count and reset.
    print('%s\t%s' % (last_word, count))
    (last_word, count) = (word, int(val))
  else:
    (last_word, count) = (word, count + int(val))

# Emit the final word.
if last_word:
  print('%s\t%s' % (last_word, count))
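
Because Streaming mappers and reducers simply read stdin and write stdout, the whole job can be rehearsed locally with a shell pipeline, where sort stands in for Hadoop's shuffle-and-sort phase (data/sentences.txt is the sample input used below):

cat data/sentences.txt | python src/word_count_map.py | sort | python src/word_count_reduce.py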

Now we can run this MapReduce job with Hadoop Streaming (make sure both scripts are executable, e.g. chmod +x src/*.py):

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
  -input data/sentences.txt \
  -output output \
  -mapper src/word_count_map.py \
  -reducer src/word_count_reduce.py

If the job succeeds, it creates the output directory, and output/part-00000 contains the result.
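
To eyeball the result, you can sort the counts, for example to list the ten most frequent words (a small shell sketch):

# Sort by the count column, descending, and show the top ten words
sort -k2 -nr output/part-00000 | head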

Set up HDFS and MapReduce in pseudo-distributed mode

  1. On Mac OS X, make sure Remote Login (under System Preferences -> Sharing) is enabled for the current user
  2. Make sure you can ssh to localhost: ssh localhost
  3. Copy the pseudo-distributed configs into Hadoop's config directory: cp pseudodistributed/* $HADOOP_HOME/conf
  4. Format the HDFS namenode: hadoop namenode -format
  5. Start HDFS (a namenode, a secondary namenode and a datanode): start-dfs.sh (you can check that it succeeded by visiting http://localhost:50070/)
  6. Start MapReduce (a jobtracker and a tasktracker): start-mapred.sh (you can check that it succeeded by visiting http://localhost:50030/; see also the jps check after this list)
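
Besides the two web UIs, a quick way to confirm that all five daemons are up is the JDK's jps tool:

# Should list NameNode, SecondaryNameNode, DataNode, JobTracker and TaskTracker
jps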

That's all. You are now running HDFS and MapReduce locally!
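
Once both layers are running, the same word count job can be rerun against the cluster; this time the input is read from HDFS and -file ships the scripts to the tasktracker (a sketch assuming the same data/ and src/ layout as above):

# Copy the sample input into HDFS (lands under /user/<your-user>/data)
hadoop fs -mkdir data
hadoop fs -put data/sentences.txt data/sentences.txt

# Run the Streaming job; -file distributes the scripts with the job
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
  -input data/sentences.txt \
  -output output \
  -mapper word_count_map.py \
  -reducer word_count_reduce.py \
  -file src/word_count_map.py \
  -file src/word_count_reduce.py

# Read the result back from HDFS
hadoop fs -cat output/part-00000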
