MapReduce Concept
MapReduce is natively Java, but Hadoop Streaming lets it interface with other languages such as Python.
- Mapper: converts each input record into a key-value pair, like an entry in a dictionary data structure.
- Shuffle and sort: values that share a key are grouped together, and the keys are sorted before they reach the reducer.
- Reducer: processes the list of values for each key and produces the final output for that key.
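- Overall, the operation is: map each record to a key-value pair, shuffle and sort the pairs by key, then reduce the values of each key.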
- Failure handling: failed map or reduce tasks are rerun, possibly on a different node.
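The map, shuffle-and-sort, and reduce phases described in this section can be sketched in plain Python. This is only a single-machine illustration, not how Hadoop is implemented. The sample records and the names `records`, `mapper`, `shuffle_and_sort`, and `reducer` are made up for the example, which counts how many movies each user rated.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical sample records in the form "userID<TAB>movieID"
records = ["196\t242", "186\t302", "196\t377", "244\t51", "186\t265", "196\t51"]

def mapper(line):
    """Map phase: turn one input line into a (key, value) pair."""
    user_id, movie_id = line.split("\t")
    yield user_id, movie_id

def shuffle_and_sort(pairs):
    """Shuffle phase: sort the pairs by key and group each key's values."""
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]

def reducer(key, values):
    """Reduce phase: collapse all values of one key into a single result."""
    yield key, len(values)

mapped = [pair for line in records for pair in mapper(line)]
grouped = list(shuffle_and_sort(mapped))
reduced = [result for key, values in grouped for result in reducer(key, values)]
print(reduced)
```

Keys are compared as strings here, so "186" sorts before "196". On a real cluster the mappers and reducers run in parallel on different nodes, and the framework performs the shuffle between them.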
MapReduce Coding
- Problem: count how many times each rating value appears in the ratings file u.data.
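- Map Function:
def mapper_get_ratings(self, _, line):
    # Each input line is tab-separated: userID, movieID, rating, timestamp
    (userID, movieID, rating, timestamp) = line.split('\t')
    # Emit the rating as the key with a count of 1
    yield rating, 1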
- Reduce Function:
def reducer_count_ratings(self, key, values):
    # values holds every 1 emitted for this rating; their sum is the count
    yield key, sum(values)
- Putting them together: both functions are methods of a single job class, saved as RatingsBreakdown.py.
Installation & Preparation
- After logging in to the command-line interface:
- Run locally:
python RatingsBreakdown.py u.data
- Run with Hadoop:
python RatingsBreakdown.py -r hadoop --hadoop-streaming-jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar u.data
- The output should list each rating value with the number of times it occurs.
- Plus, code for a more complex problem (movies sorted by their number of ratings):
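The original code for this problem is not reproduced here, so what follows is only a sketch. It assumes the task is to count the ratings for each movie and then list the movies in order of that count. It uses plain Python as a stand-in for two chained MapReduce steps; `run_step`, the function names, and the sample lines are all hypothetical. In mrjob, chained steps like these would normally be declared by returning more than one `MRStep` from the job's `steps()` method.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical sample lines in the same layout as u.data:
# userID <TAB> movieID <TAB> rating <TAB> timestamp
lines = [
    "196\t242\t3\t881250949",
    "186\t302\t3\t891717742",
    "22\t377\t1\t878887116",
    "244\t242\t2\t880606923",
    "166\t302\t1\t886397596",
    "298\t242\t4\t884182806",
]

def run_step(mapper, reducer, records):
    """Stand-in for one MapReduce step: map, sort and group by key, reduce."""
    mapped = sorted((pair for r in records for pair in mapper(r)),
                    key=itemgetter(0))
    for key, group in groupby(mapped, key=itemgetter(0)):
        yield from reducer(key, [value for _, value in group])

# Step 1: count the ratings for each movie.
def mapper_get_movies(line):
    user_id, movie_id, rating, timestamp = line.split("\t")
    yield movie_id, 1

def reducer_count_ratings(movie_id, ones):
    # Zero-pad the count so that string-sorted keys also sort numerically.
    yield str(sum(ones)).zfill(5), movie_id

# Step 2: the count is now the key, so the shuffle orders movies by count.
def mapper_passthrough(pair):
    yield pair

def reducer_sorted_output(count, movie_ids):
    for movie_id in movie_ids:
        yield movie_id, int(count)

step1 = list(run_step(mapper_get_movies, reducer_count_ratings, lines))
step2 = list(run_step(mapper_passthrough, reducer_sorted_output, step1))
print(step2)
```

The trick is to let the shuffle do the sorting: the first step emits the count as the key, and the second step receives its input already ordered by that key. The zero-padding matters because keys passed through Hadoop Streaming are compared as strings.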