Building a Clustering Model with Spark (Part 1)

Extracting Features from the MovieLens Dataset

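Downloading the dataset
The dataset has three main parts: the movie ratings (in the u.data file), the user data (u.user), and the movie data (u.item). In addition, the genre file (u.genre) provides the genre names used to label each movie.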
```
[hadoop@master spark]$ wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
[hadoop@master spark]$ unzip ml-100k.zip 
[hadoop@master spark]$ hdfs dfs -put ml-100k ML/
```

Extracting the movie genre labels

val movies = sc.textFile("ML/ml-100k/u.item")
println(movies.first)
1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0
// Load the genre file, which maps each genre name to its index
val genres = sc.textFile("ML/ml-100k/u.genre")
genres.take(5).foreach(println)
unknown|0
Action|1
Adventure|2
Animation|3
Children's|4

// To build the genre mapping, split each line and collect <index, genre> key-value pairs (the index is kept as a String key)
val genreMap = genres.filter(!_.isEmpty).map(line => line.split("\\|")).map(array => (array(1), array(0))).collectAsMap
println(genreMap)
Map(2 -> Adventure, 5 -> Comedy, 12 -> Musical, 15 -> Sci-Fi, 8 -> Drama, 18 -> Western, 7 -> Documentary, 17 -> War, 1 -> Action, 4 -> Children's, 11 -> Horror, 14 -> Romance, 6 -> Crime, 0 -> unknown, 9 -> Fantasy, 16 -> Thriller, 3 -> Animation, 10 -> Film-Noir, 13 -> Mystery)
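
The mapping step can also be tried on a plain Scala collection, without a cluster. The following is a minimal sketch, assuming a few sample lines in the same "name|index" format as the u.genre output above; `sampleLines` and `sampleGenreMap` are hypothetical names used only for illustration.

```scala
// Hypothetical sample lines in the same "name|index" format as u.genre,
// including one empty line like the one the filter above guards against
val sampleLines = Seq("unknown|0", "Action|1", "Adventure|2", "")

// Drop empty lines, split on the literal pipe character,
// and key each genre name by its index (kept as a String)
val sampleGenreMap: Map[String, String] = sampleLines
  .filter(_.nonEmpty)
  .map(_.split("\\|"))
  .map(fields => (fields(1), fields(0)))
  .toMap

println(sampleGenreMap("1"))
```

Keeping the index as a String is why the later lookup calls `genreMap(idx.toString)` rather than `genreMap(idx)`.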

Create a new RDD from the movie data and the genre mapping, containing the movie ID, title, and genres

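// For each movie, extract its genres as Strings rather than Int indices.
// zipWithIndex pairs each genre flag with its position, so that every
// position can be mapped to the corresponding genre name.
val titlesAndGenres = movies.map(_.split("\\|")).map { array =>
  val genres = array.toSeq.slice(5, array.size)
  // Keep only the flags equal to "1" (the movie belongs to that genre),
  // then map each remaining index to its genre name
  val genresAssigned = genres.zipWithIndex.filter { case (g, idx) =>
    g == "1"
  }.map { case (g, idx) => genreMap(idx.toString) }
  // Return a pair of (movie ID, (title, genres))
  (array(0).toInt, (array(1), genresAssigned))
}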
println(titlesAndGenres.first)
(1,(Toy Story (1995),ArrayBuffer(Animation, Children's, Comedy)))

// The following REPL session walks through the steps used in the code above
scala> val first = movies.first.split("\\|")
first: Array[String] = Array(1, Toy Story (1995), 01-Jan-1995, "", http://us.imdb.com/M/title-exact?Toy%20Story%20(1995), 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
scala> first.toSeq.slice(5, first.size)
res7: Seq[String] = WrappedArray(0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)

scala> first.toSeq.slice(5, first.size).zipWithIndex
res8: Seq[(String, Int)] = ArrayBuffer((0,0), (0,1), (0,2), (1,3), (1,4), (1,5), (0,6), (0,7), (0,8), (0,9), (0,10), (0,11), (0,12), (0,13), (0,14), (0,15), (0,16), (0,17), (0,18))

scala> first.toSeq.slice(5, first.size).zipWithIndex.filter { case (g, idx) => g == "1" }
res9: Seq[(String, Int)] = ArrayBuffer((1,3), (1,4), (1,5))

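Training the recommendation model

To obtain the factor vectors for users and movies, we first need to train a new recommendation model.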

import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.Rating
val rawData = sc.textFile("ML/ml-100k/u.data")
val rawRatings = rawData.map(_.split("\t").take(3))
val ratings = rawRatings.map{ case Array(user, movie, rating) => Rating(user.toInt, movie.toInt, rating.toDouble) }
ratings.cache
val alsModel = ALS.train(ratings, 50, 10, 0.1)

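The alternating least squares (ALS) model returns two key-value RDDs (userFeatures and productFeatures). The key in each RDD is the user ID or the movie ID, and the value is the corresponding array of factors. We still need to extract these factors and convert them into MLlib Vectors so they can serve as training input for the clustering model.

The code below does this for the movies and then for the users.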

import org.apache.spark.mllib.linalg.Vectors
val movieFactors = alsModel.productFeatures.map { case (id, factor) => (id, Vectors.dense(factor)) }
res27: org.apache.spark.rdd.RDD[(Int, org.apache.spark.mllib.linalg.Vector)] = MapPartitionsRDD[224] at map at <console>:35
// _._2 selects the second element of each pair, i.e. the vector without its ID
val movieVectors = movieFactors.map(_._2)
res24: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = MapPartitionsRDD[225] at map at <console>:37

val userFactors = alsModel.userFeatures.map { case (id, factor) => (id, Vectors.dense(factor)) }
res25: org.apache.spark.rdd.RDD[(Int, org.apache.spark.mllib.linalg.Vector)] = MapPartitionsRDD[226] at map at <console>:35

val userVectors = userFactors.map(_._2)
res26: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = MapPartitionsRDD[227] at map at <console>:37
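
To make the meaning of these factor vectors concrete: in a matrix factorization model such as the one ALS trains, the estimated rating for a user and a movie is the dot product of their two factor vectors. The sketch below illustrates this on plain arrays; the 3-dimensional values and the names `userFactor`, `movieFactor`, and `dot` are hypothetical (the model above uses rank 50).

```scala
// Hypothetical 3-dimensional factors, chosen only for illustration
val userFactor = Array(0.5, 1.0, 2.0)
val movieFactor = Array(2.0, 1.0, 0.5)

// Dot product of two equal-length factor vectors
def dot(a: Array[Double], b: Array[Double]): Double = {
  require(a.length == b.length, "factor vectors must have the same length")
  a.zip(b).map { case (x, y) => x * y }.sum
}

println(dot(userFactor, movieFactor))
```

Movies whose factor vectors lie close together therefore receive similar estimated ratings from any given user, which is what makes these vectors a reasonable input for clustering.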

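Normalization

Before training the clustering model, it is worth inspecting the distribution of the input factor vectors, because this tells us whether the training data needs to be normalized. The results show no outliers pronounced enough to distort the clustering, so normalization is not necessary in this example.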

import org.apache.spark.mllib.linalg.distributed.RowMatrix
val movieMatrix = new RowMatrix(movieVectors)
val movieMatrixSummary = movieMatrix.computeColumnSummaryStatistics()
val userMatrix = new RowMatrix(userVectors)
val userMatrixSummary = userMatrix.computeColumnSummaryStatistics()
println("Movie factors mean: " + movieMatrixSummary.mean)
println("Movie factors variance: " + movieMatrixSummary.variance)
println("User factors mean: " + userMatrixSummary.mean)
println("User factors variance: " + userMatrixSummary.variance)
Movie factors mean: [0.28047737659519767,0.26886479057520024,0.2935579964446398,0.27821738264113755, ...
Movie factors variance: [0.038242041794064895,0.03742229118854288,0.044116961097355877,0.057116244055791986, ...
User factors mean: [0.2043520841572601,0.22135773814655782,0.2149706318418221,0.23647602029329481, ...
User factors variance: [0.037749421148850396,0.02831191551960241,0.032831876953314174,0.036775110657850954, ...
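
Had the summary shown columns on very different scales, the vectors could be standardized before clustering. The following is a minimal sketch of column-wise standardization (zero mean, unit variance) on plain arrays; the toy data and the `standardize` helper are hypothetical. On an RDD[Vector], MLlib's StandardScaler in org.apache.spark.mllib.feature would be the natural tool for the same job.

```scala
// Hypothetical toy data: 3 rows, 2 columns on very different scales
val rows = Seq(Array(1.0, 100.0), Array(2.0, 200.0), Array(3.0, 300.0))

def standardize(data: Seq[Array[Double]]): Seq[Array[Double]] = {
  val n = data.size.toDouble
  val dims = data.head.length
  // Mean of each column
  val means = Array.tabulate(dims)(j => data.map(_(j)).sum / n)
  // Standard deviation of each column (sample variance, dividing by n - 1)
  val stds = Array.tabulate(dims) { j =>
    math.sqrt(data.map(r => math.pow(r(j) - means(j), 2)).sum / (n - 1))
  }
  // Center every value and scale it; constant columns are only centered
  data.map { r =>
    Array.tabulate(dims) { j =>
      val centered = r(j) - means(j)
      if (stds(j) == 0.0) centered else centered / stds(j)
    }
  }
}

val scaled = standardize(rows)
scaled.foreach(r => println(r.mkString(",")))
```

After scaling, both columns span the same range, so neither dominates the distance computation that K-means relies on.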

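Training the clustering model

// First import the required class and set the model parameters: K (numClusters), the maximum number of iterations (numIterations), and the number of training runs (numRuns)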
import org.apache.spark.mllib.clustering.KMeans
val numClusters = 5
val numIterations = 10
val numRuns = 3
// Run K-means on the movie factor vectors
val movieClusterModel = KMeans.train(movieVectors, numClusters, numIterations, numRuns)
// Train again with a higher iteration limit (100) to give the model room to converge
val movieClusterModelConverged = KMeans.train(movieVectors, numClusters, 100)
// Train a K-means model on the user factor vectors
val userClusterModel = KMeans.train(userVectors, numClusters, numIterations, numRuns)

Making predictions with the clustering model

// Predict the cluster for a single sample
val movie1 = movieVectors.first
val movieCluster = movieClusterModel.predict(movie1)
println(movieCluster)
0
// Predict for many samples at once by passing an RDD[Vector]
val predictions = movieClusterModel.predict(movieVectors)
println(predictions.take(10).mkString(","))
0,2,4,2,4,2,4,1,4,4
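
Conceptually, a K-means prediction assigns a vector to the index of its nearest cluster center. The sketch below shows that rule on plain arrays; the 2-dimensional `centers` and the helpers `squaredDist` and `nearestCenter` are hypothetical and are not MLlib's implementation.

```scala
// Hypothetical 2-dimensional cluster centers (the model above has 5 centers of dimension 50)
val centers = Array(Array(0.0, 0.0), Array(5.0, 5.0), Array(10.0, 0.0))

// Squared Euclidean distance between two equal-length vectors
def squaredDist(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

// Index of the center with the smallest squared distance to the point
def nearestCenter(point: Array[Double]): Int =
  centers.indices.minBy(i => squaredDist(centers(i), point))

println(nearestCenter(Array(4.0, 6.0)))
```

The returned integer plays the same role as the cluster indices printed above: it identifies a center, and the numbering itself carries no meaning.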

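Interpreting the cluster predictions on the MovieLens dataset

Interpreting the movie clusters
Because the objective that K-means minimizes is the sum of squared Euclidean distances from each sample to its cluster center, we can define "closest to the cluster center" as having the smallest such distance. The function below defines this metric and returns the squared distance, which ranks samples in the same order as the plain Euclidean distance. Note that it uses the Breeze library (an MLlib dependency) for linear algebra and vector operations.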
```
import breeze.linalg._
import breeze.numerics.pow
// Squared Euclidean distance between two vectors
def computeDistance(v1: DenseVector[Double], v2: DenseVector[Double]): Double =
  pow(v1 - v2, 2).sum
```
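
For readers without Breeze on the classpath, the same metric can be sketched with plain arrays. This is a minimal illustration only; `squaredDistance` is a hypothetical helper, not part of MLlib or Breeze.

```scala
// Squared Euclidean distance: the sum of squared component differences
def squaredDistance(v1: Array[Double], v2: Array[Double]): Double = {
  require(v1.length == v2.length, "vectors must have the same length")
  v1.zip(v2).map { case (a, b) => math.pow(a - b, 2) }.sum
}

val d2 = squaredDistance(Array(1.0, 2.0), Array(4.0, 6.0))
println(d2)            // squared distance: 3^2 + 4^2
println(math.sqrt(d2)) // plain Euclidean distance
```

Because the square root is monotonic, sorting by the squared distance gives the same ranking as sorting by the plain distance, so the root can be skipped when only the ordering matters.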

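Next we use the function above to compute, for each movie, the distance between its feature vector and the center of the cluster it is assigned to. To make the results readable, the movie's title and genres are included in the output.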

val titlesWithFactors = titlesAndGenres.join(movieFactors)
val moviesAssigned = titlesWithFactors.map { case (id, ((title,genres), vector)) =>
val pred = movieClusterModel.predict(vector)
val clusterCentre = movieClusterModel.clusterCenters(pred)
val dist = computeDistance(DenseVector(clusterCentre.toArray),DenseVector(vector.toArray))
(id, title, genres.mkString(" "), pred, dist)
}
// Group the movies by cluster and collect the result to the driver as a Map
val clusterAssignments = moviesAssigned.groupBy { case (id, title, genres, cluster, dist) => cluster }.collectAsMap
clusterAssignments: scala.collection.Map[Int,Iterable[(Int, String, String, Int, Double)]] = Map(2 -> CompactBuffer((204,Back to the Future (1985),Comedy Sci-Fi,2,0.6735965658876787), (318,Schindler's List (1993),Drama War,2,0.7299714144239348), (228,Star Trek: The Wrath of Khan (1982),Action Adventure Sci-Fi,2,1.232051633746667), (1168,Little Buddha (1993),Drama,2,1.1096062423372306), (1550,Destiny Turns on the Radio (1995),Comedy,2,0.7740719996017212), (196,Dead Poets Society (1989),Drama,2,0.4990554824588539), (1176,Welcome To Sarajevo (1997),Drama War,2,3.205189858830396), (1144,Quiet Room, The (1996),Drama,2,0.7915200882484206), (292,Rosewood (1997),Drama,2,1.8089704369054382), (1262,Walking and Talking (1996),Romance,2,2.1070678491655315), (836,Ninotchka (1939),Comedy Romance,2,1....
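// After running this code we have a Map, collected to the driver, in which each entry is a
// key-value pair for one cluster: the key is the cluster ID and the value is the collection of
// movies in that cluster. Each movie record holds the movie ID, title, genres, cluster index,
// and the distance between the movie's feature vector and the cluster center.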
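
The group-then-rank pattern can be tried on a small in-memory collection first. The sketch below assumes hypothetical records with the same (id, title, genres, cluster, dist) shape as moviesAssigned; `sample` and `closestPerCluster` are illustrative names only.

```scala
// Hypothetical (id, title, genres, cluster, dist) records
val sample = Seq(
  (1, "Movie A", "Drama", 0, 0.9),
  (2, "Movie B", "Comedy", 1, 0.4),
  (3, "Movie C", "Drama", 0, 0.2),
  (4, "Movie D", "Action", 1, 0.7)
)

// Group by the cluster field, then keep the titles of the n movies closest to each center
def closestPerCluster(n: Int): Map[Int, Seq[String]] =
  sample.groupBy(_._4).map { case (cluster, movies) =>
    cluster -> movies.sortBy(_._5).take(n).map(_._2)
  }

println(closestPerCluster(1))
```

The same idea, applied to the collected Map, is used to print the 20 movies nearest to each cluster center.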

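// Iterate over the clusters and print the 20 movies closest to each cluster center.
// Tuple fields are read with underscore accessors, which are numbered from 1 rather than 0.
for ((k, v) <- clusterAssignments.toSeq.sortBy(_._1)) {
  println(s"Cluster $k:")
  // Sort by the distance d (the fifth tuple field)
  val m = v.toSeq.sortBy(_._5)
  println(m.take(20).map { case (_, title, genres, _, d) =>
    (title, genres, d) }.mkString("\n"))
  println("=====\n")
}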
Cluster 0:
(Angela (1995),Drama,0.29636049873225556)
(Moonlight and Valentino (1995),Drama Romance,0.3403447970733214)
(Blue Chips (1994),Drama,0.37733781105042424)
(Johns (1996),Drama,0.3931535460787371)
(For Love or Money (1993),Comedy,0.467658288041116)
(Mr. Jones (1993),Drama Romance,0.48646846535533833)
(Air Up There, The (1994),Comedy,0.4932269425293205)
(New Jersey Drive (1995),Crime Drama,0.5068923379274556)
(Outbreak (1995),Action Drama Thriller,0.5128207768005651)
(Pagemaster, The (1994),Action Adventure Animation Children's Fantasy,0.5268818392996645)
(Tainted (1998),Comedy Thriller,0.532524552773841)
(Wedding Bell Blues (1996),Comedy,0.532524552773841)
(Next Step, The (1995),Drama,0.532524552773841)
(Nightwatch (1997),Horror Thriller,0.5407591489002819)
(River Wild, The (1994),Action Thriller,0.5490537339724462)
(Stag (1997),Action Thriller,0.5563061689451448)
(Santa Clause, The (1994),Children's Comedy,0.5761877333414581)
(Cliffhanger (1993),Action Adventure Crime,0.5882511906571888)
(It Takes Two (1995),Comedy,0.602675373740639)
(Private Benjamin (1980),Comedy,0.631468704858662)
=====

Cluster 1:
(All Over Me (1997),Drama,0.2158047958023726)
(Gate of Heavenly Peace, The (1995),Documentary,0.34160593043095533)
(Killer: A Journal of Murder (1995),Crime Drama,0.3546233386391832)
(Wings of Courage (1995),Adventure Romance,0.35877274078657867)
(Two Friends (1986) ,Drama,0.3650324179215576)
(Dadetown (1995),Documentary,0.3650324179215576)
(Girls Town (1996),Drama,0.3650324179215576)
(Big One, The (1997),Comedy Documentary,0.3650324179215576)
(Hana-bi (1997),Comedy Crime Drama,0.3650324179215576)
(Á köldum klaka (Cold Fever) (1994),Comedy Drama,0.3650324179215576)
(Silence of the Palace, The (Saimt el Qusur) (1994),Drama,0.3650324179215576)
(Land and Freedom (Tierra y libertad) (1995),War,0.3650324179215576)
(Normal Life (1996),Crime Drama,0.3650324179215576)
(Eighth Day, The (1996),Drama,0.3650324179215576)
(Walking Dead, The (1995),Drama War,0.3814647434386409)
(Sweet Nothing (1995),Drama,0.4463225018862566)
(I Like It Like That (1994),Comedy Drama Romance,0.45973680012935836)
(Collectionneuse, La (1967),Drama,0.4834649131693116)
(Glass Shield, The (1994),Drama,0.49980690750059353)
(Lover's Knot (1996),Comedy,0.5029945315100043)
=====

Cluster 2:
(Last Time I Saw Paris, The (1954),Drama,0.17305682467275096)
(Substance of Fire, The (1996),Drama,0.18694335286663155)
(Wife, The (1995),Comedy Drama,0.3182552386461784)
(All Things Fair (1996),Drama,0.3258699566626903)
(Wedding Gift, The (1994),Drama,0.32828303127967334)
(Mr. Wonderful (1993),Comedy Romance,0.3365560627641174)
(Commandments (1997),Romance,0.35323014663969393)
(Prefontaine (1997),Drama,0.3676922615513928)
(Outlaw, The (1943),Western,0.3824177730525398)
(Apollo 13 (1995),Action Drama Thriller,0.38822865614351426)
(Sword in the Stone, The (1963),Animation Children's,0.44734438632925777)
(Intimate Relations (1996),Comedy,0.44789039227338406)
(20,000 Leagues Under the Sea (1954),Adventure Children's Fantasy Sci-Fi,0.49171018440451153)
(Glory (1989),Action Drama War,0.4988137149082063)
(Dead Poets Society (1989),Drama,0.4990554824588539)
(Abyss, The (1989),Action Adventure Sci-Fi Thriller,0.5125201683410132)
(Dave (1993),Comedy Romance,0.5133280291996677)
(When Harry Met Sally... (1989),Comedy Romance,0.5261110829107895)
(In the Line of Fire (1993),Action Thriller,0.5329236056524467)
(Target (1995),Action Drama,0.5534147386483806)
=====

Cluster 3:
(Machine, The (1994),Comedy Horror,0.06594718019318282)
(Amityville: A New Generation (1993),Horror,0.10348648983350936)
(Amityville 1992: It's About Time (1992),Horror,0.10348648983350936)
(Johnny 100 Pesos (1993),Action Drama,0.11550108120321055)
(War at Home, The (1996),Drama,0.13301389378075107)
(Amityville: Dollhouse (1996),Horror,0.13534519806809794)
(Venice/Venice (1992),Drama,0.1361353105626825)
(Gordy (1995),Comedy,0.14073028392471718)
(Being Human (1993),Drama,0.14405406656519718)
(Boys in Venice (1996),Drama,0.14457703868320995)
(Somebody to Love (1994),Drama,0.14457703868320995)
(Falling in Love Again (1980),Comedy,0.15156313949956887)
(Catwalk (1995),Documentary,0.16497776765767916)
(New Age, The (1994),Drama,0.16954557413198104)
(Beyond Bedlam (1993),Drama Horror,0.1705142860734866)
(Police Story 4: Project S (Chao ji ji hua) (1993),Action,0.17091934252191224)
(Babyfever (1994),Comedy Drama,0.17248672672017062)
(Small Faces (1995),Drama,0.17259895799733074)
(August (1996),Drama,0.17780437730622617)
(Leopard Son, The (1996),Documentary,0.17780437730622617)
=====

Cluster 4:
(Witness (1985),Drama Romance Thriller,0.1946379960877294)
(King of the Hill (1993),Drama,0.2580233704878727)
(Mamma Roma (1962),Drama,0.3126024876925699)
(Nelly & Monsieur Arnaud (1995),Drama,0.32101078512082826)
(Angel and the Badman (1947),Western,0.3427227302957313)
(Scream of Stone (Schrei aus Stein) (1991),Drama,0.3515308779440109)
(Cosi (1996),Comedy,0.38514605438035665)
(Beans of Egypt, Maine, The (1994),Drama,0.4074854418942155)
(Spirits of the Dead (Tre passi nel delirio) (1968),Horror,0.411316222180582)
(Object of My Affection, The (1998),Comedy Romance,0.41584384804887287)
(Ed's Next Move (1996),Comedy,0.42539143205503793)
(Celestial Clockwork (1994),Comedy,0.43545079573947953)
(Love and Other Catastrophes (1996),Romance,0.4539418426282623)
(Last Klezmer: Leopold Kozlowski, His Life and Music, The (1995),Documentary,0.47519050165111265)
(Spellbound (1945),Mystery Romance Thriller,0.4776835514404319)
(Price Above Rubies, A (1998),Drama,0.47779905950191665)
(They Made Me a Criminal (1939),Crime Drama,0.4794300819064454)
(Farewell to Arms, A (1932),Romance War,0.4852033525639032)
(Third Man, The (1949),Mystery Thriller,0.48551232209476797)
(Pushing Hands (1992),Comedy,0.4970817434925626)
=====
