- Extracting features from the MovieLens dataset
-
Downloading the dataset
The dataset has three main parts: the movie ratings (in the u.data file), the user data (u.user), and the movie data (u.item). In addition, the genre file (u.genre) lets us look up the genre names for each movie.

    [hadoop@master spark]$ wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
    [hadoop@master spark]$ unzip ml-100k.zip
    [hadoop@master spark]$ hdfs dfs -put ml-100k ML/
-
Extracting movie genre labels

    val movies = sc.textFile("ML/ml-100k/u.item")
    println(movies.first)
    // 1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0

    // Load the genre labels
    val genres = sc.textFile("ML/ml-100k/u.genre")
    genres.take(5).foreach(println)
    // unknown|0
    // Action|1
    // Adventure|2
    // Animation|3
    // Children's|4

    // To build the genre mapping, split each line and keep an <index, genre> key-value pair
    val genreMap = genres.filter(!_.isEmpty).map(line => line.split("\\|")).map(array => (array(1), array(0))).collectAsMap
    println(genreMap)
    // Map(2 -> Adventure, 5 -> Comedy, 12 -> Musical, 15 -> Sci-Fi, 8 -> Drama, 18 -> Western, 7 -> Documentary, 17 -> War, 1 -> Action, 4 -> Children's, 11 -> Horror, 14 -> Romance, 6 -> Crime, 0 -> unknown, 9 -> Fantasy, 16 -> Thriller, 3 -> Animation, 10 -> Film-Noir, 13 -> Mystery)
-
Creating a new RDD from the movie data and the genre mapping, containing the movie ID, title, and genres

    // For each movie, extract its genres (as Strings rather than Int indices).
    // zipWithIndex pairs each genre flag with its position, so that the
    // position can be mapped back to the genre name.
    val titlesAndGenres = movies.map(_.split("\\|")).map { array =>
      val genres = array.toSeq.slice(5, array.size)
      // keep only the flags equal to "1", i.e. the genres the movie belongs to
      val genresAssigned = genres.zipWithIndex.filter { case (g, idx) =>
        g == "1"
      // map each remaining index to its genre name
      }.map { case (g, idx) =>
        genreMap(idx.toString)
      }
      // return a pair (movie ID, (title, genres))
      (array(0).toInt, (array(1), genresAssigned))
    }
    println(titlesAndGenres.first)
    // (1,(Toy Story (1995),ArrayBuffer(Animation, Children's, Comedy)))

    // The REPL session below walks through the steps used above
    scala> val first = movies.first.split("\\|")
    first: Array[String] = Array(1, Toy Story (1995), 01-Jan-1995, "", http://us.imdb.com/M/title-exact?Toy%20Story%20(1995), 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
    scala> first.toSeq.slice(5, first.size)
    res7: Seq[String] = WrappedArray(0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
    scala> first.toSeq.slice(5, first.size).zipWithIndex
    res8: Seq[(String, Int)] = ArrayBuffer((0,0), (0,1), (0,2), (1,3), (1,4), (1,5), (0,6), (0,7), (0,8), (0,9), (0,10), (0,11), (0,12), (0,13), (0,14), (0,15), (0,16), (0,17), (0,18))
    scala> first.toSeq.slice(5, first.size).zipWithIndex.filter { case (g, idx) => g == "1" }
    res9: Seq[(String, Int)] = ArrayBuffer((1,3), (1,4), (1,5))
- Training the recommendation model
-
To obtain the user and movie factor vectors, we first need to train a new recommendation model.

    import org.apache.spark.mllib.recommendation.ALS
    import org.apache.spark.mllib.recommendation.Rating
    val rawData = sc.textFile("ML/ml-100k/u.data")
    // keep only the first three tab-separated fields: user, movie, rating
    val rawRatings = rawData.map(_.split("\t").take(3))
    val ratings = rawRatings.map { case Array(user, movie, rating) => Rating(user.toInt, movie.toInt, rating.toDouble) }
    ratings.cache
    // rank = 50, iterations = 10, lambda = 0.1
    val alsModel = ALS.train(ratings, 50, 10, 0.1)
The Alternating Least Squares (ALS) model returns two key-value RDDs (userFeatures and productFeatures). Their keys are user IDs and movie IDs respectively, and their values are the corresponding factor arrays. We still need to extract these factors and convert them to MLlib Vectors so that they can serve as training input for the clustering model.
-
The code below processes the movies and the users in turn

    import org.apache.spark.mllib.linalg.Vectors
    val movieFactors = alsModel.productFeatures.map { case (id, factor) => (id, Vectors.dense(factor)) }
    // res27: org.apache.spark.rdd.RDD[(Int, org.apache.spark.mllib.linalg.Vector)] = MapPartitionsRDD[224] at map at <console>:35
    // _._2 selects the second element of each pair, i.e. the vector
    val movieVectors = movieFactors.map(_._2)
    // res24: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = MapPartitionsRDD[225] at map at <console>:37
    val userFactors = alsModel.userFeatures.map { case (id, factor) => (id, Vectors.dense(factor)) }
    // res25: org.apache.spark.rdd.RDD[(Int, org.apache.spark.mllib.linalg.Vector)] = MapPartitionsRDD[226] at map at <console>:35
    val userVectors = userFactors.map(_._2)
    // res26: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = MapPartitionsRDD[227] at map at <console>:37
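To make the role of these factor vectors more concrete: in a matrix factorization model such as ALS, the predicted rating for a user and a movie is the dot product of their two factor vectors. The following is a minimal plain-Scala sketch of that computation, not the MLlib implementation. The object name `FactorSketch`, the `dot` helper, and the three-dimensional factor values are all made up for illustration (the model above uses rank 50).

```scala
// Plain-Scala sketch; runs without Spark. All names and values are hypothetical.
object FactorSketch {
  // Dot product of two factor arrays of the same length (the model's rank).
  def dot(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "factor vectors must have the same rank")
    a.zip(b).map { case (x, y) => x * y }.sum
  }

  def main(args: Array[String]): Unit = {
    val userFactor  = Array(0.5, 1.0, 2.0) // made-up user factor vector
    val movieFactor = Array(2.0, 1.0, 0.5) // made-up movie factor vector
    // Predicted rating = sum of the element-wise products.
    println(dot(userFactor, movieFactor))
  }
}
```

Movies whose factor vectors lie close together receive similar predicted ratings from any given user, which is one reason these vectors are a reasonable input for clustering.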
-
Normalization
Before training the clustering model, it is worth inspecting the distribution of the input factor vectors, which tells us whether the training data needs to be normalized. Judging from the per-dimension means and variances below, there are no outliers that would distort the clustering, so normalization is not necessary in this example.

    import org.apache.spark.mllib.linalg.distributed.RowMatrix
    val movieMatrix = new RowMatrix(movieVectors)
    val movieMatrixSummary = movieMatrix.computeColumnSummaryStatistics()
    val userMatrix = new RowMatrix(userVectors)
    val userMatrixSummary = userMatrix.computeColumnSummaryStatistics()
    println("Movie factors mean: " + movieMatrixSummary.mean)
    println("Movie factors variance: " + movieMatrixSummary.variance)
    println("User factors mean: " + userMatrixSummary.mean)
    println("User factors variance: " + userMatrixSummary.variance)
    // Movie factors mean: [0.28047737659519767,0.26886479057520024,0.2935579964446398,0.27821738264113755, ...
    // Movie factors variance: [0.038242041794064895,0.03742229118854288,0.044116961097355877,0.057116244055791986, ...
    // User factors mean: [0.2043520841572601,0.22135773814655782,0.2149706318418221,0.23647602029329481, ...
    // User factors variance: [0.037749421148850396,0.02831191551960241,0.032831876953314174,0.036775110657850954, ...
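As a rough illustration of what such summary statistics contain, and of what normalization would involve if it were needed, here is a plain-Scala sketch that computes per-column means and variances and then standardizes each column to zero mean and unit variance. It runs without Spark. The object name `ColumnStatsSketch`, its helpers, the sample rows, and the choice of the n - 1 denominator are assumptions of this example rather than anything taken from the code above.

```scala
// Plain-Scala sketch; all names and sample values are hypothetical.
object ColumnStatsSketch {
  // Mean of each column; assumes a non-empty list of equal-length rows.
  def columnMeans(rows: Seq[Array[Double]]): Array[Double] = {
    val n = rows.length.toDouble
    Array.tabulate(rows.head.length)(j => rows.map(_(j)).sum / n)
  }

  // Sample variance of each column (n - 1 denominator, an assumption here);
  // needs at least two rows.
  def columnVariances(rows: Seq[Array[Double]]): Array[Double] = {
    val means = columnMeans(rows)
    val n = rows.length
    Array.tabulate(means.length) { j =>
      rows.map(r => (r(j) - means(j)) * (r(j) - means(j))).sum / (n - 1)
    }
  }

  // Rescale every column to zero mean and unit variance.
  // Constant columns (variance 0) are mapped to 0.0 to avoid dividing by zero.
  def standardize(rows: Seq[Array[Double]]): Seq[Array[Double]] = {
    val means = columnMeans(rows)
    val stds = columnVariances(rows).map(math.sqrt)
    rows.map { r =>
      Array.tabulate(r.length) { j =>
        if (stds(j) == 0.0) 0.0 else (r(j) - means(j)) / stds(j)
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val rows = Seq(Array(1.0, 10.0), Array(3.0, 30.0)) // made-up sample rows
    println(columnMeans(rows).mkString(","))
    println(columnVariances(rows).mkString(","))
    println(standardize(rows).map(_.mkString(",")).mkString(" | "))
  }
}
```

In the made-up rows the second column has a variance 100 times larger than the first, so it would dominate any Euclidean distance. That kind of imbalance is what the inspection step is meant to detect.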
-
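Training the clustering model

    // Import the required module and set the model parameters: K (numClusters),
    // the maximum number of iterations (numIterations), and the number of runs (numRuns)
    import org.apache.spark.mllib.clustering.KMeans
    val numClusters = 5
    val numIterations = 10
    val numRuns = 3
    // Run K-means on the movie factor vectors.
    // Note: the runs argument comes from the Spark 1.x API; it has no effect since Spark 2.0.
    val movieClusterModel = KMeans.train(movieVectors, numClusters, numIterations, numRuns)
    // Use a larger number of iterations so that the model can converge
    val movieClusterModelConverged = KMeans.train(movieVectors, numClusters, 100)
    // Train a K-means model on the user factor vectors
    val userClusterModel = KMeans.train(userVectors, numClusters, numIterations, numRuns)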
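K-means results depend on the initial centers, which is why it can help to train several times and keep the best result. A common way to compare results is the within-cluster sum of squared distances (WCSS): the lower the value, the tighter the clusters. The plain-Scala sketch below computes WCSS for two hypothetical sets of centers and picks the better one. `WcssSketch`, `sqDist`, `wcss`, and the sample data are illustrative names and values, and this is not a description of how MLlib works internally.

```scala
// Plain-Scala sketch; runs without Spark. All names and values are hypothetical.
object WcssSketch {
  // Squared Euclidean distance between two points of the same dimension.
  def sqDist(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // WCSS: every point contributes its squared distance to the nearest center.
  def wcss(points: Seq[Array[Double]], centers: Seq[Array[Double]]): Double =
    points.map(p => centers.map(c => sqDist(p, c)).min).sum

  def main(args: Array[String]): Unit = {
    val points = Seq(Array(0.0, 0.0), Array(0.0, 2.0), Array(10.0, 0.0), Array(10.0, 2.0))
    val runA = Seq(Array(0.0, 1.0), Array(10.0, 1.0)) // centers from a made-up "good" run
    val runB = Seq(Array(5.0, 0.0), Array(5.0, 2.0))  // centers from a made-up "poor" run
    // Keep the run with the smaller objective value.
    val best = Seq(runA, runB).minBy(centers => wcss(points, centers))
    println(s"WCSS A = ${wcss(points, runA)}, WCSS B = ${wcss(points, runB)}")
    println("best centers: " + best.map(_.mkString("(", ",", ")")).mkString(" "))
  }
}
```

The same quantity can also be used to judge whether more iterations helped, for instance by comparing a model trained with 10 iterations against one trained with 100.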
-
Making predictions with the clustering model

    // Predict the cluster of a single sample
    val movie1 = movieVectors.first
    val movieCluster = movieClusterModel.predict(movie1)
    println(movieCluster)
    // 0

    // Predict for many samples at once by passing an RDD[Vector]
    val predictions = movieClusterModel.predict(movieVectors)
    println(predictions.take(10).mkString(","))
    // 0,2,4,2,4,2,4,1,4,4
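Conceptually, predicting a cluster for a sample means finding the center nearest to it. The plain-Scala sketch below shows that rule with squared Euclidean distance. `NearestCenterSketch`, its helpers, and the centers are hypothetical and only illustrate the idea; they are not the MLlib code path.

```scala
// Plain-Scala sketch; runs without Spark. All names and values are hypothetical.
object NearestCenterSketch {
  // Squared Euclidean distance between two points of the same dimension.
  def sqDist(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the center closest to the given point.
  def predict(centers: Seq[Array[Double]], point: Array[Double]): Int =
    centers.indices.minBy(i => sqDist(centers(i), point))

  def main(args: Array[String]): Unit = {
    val centers = Seq(Array(0.0, 0.0), Array(5.0, 5.0), Array(10.0, 0.0)) // made-up centers
    val samples = Seq(Array(1.0, -1.0), Array(4.0, 4.5), Array(9.0, 1.0)) // made-up samples
    // One cluster index per sample, in the same order as the samples.
    println(samples.map(s => predict(centers, s)).mkString(","))
  }
}
```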
- Interpreting cluster predictions on the MovieLens dataset
-
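Interpreting the movie clusters
First, because the objective function that K-means minimizes is the sum of squared Euclidean distances from each sample to its cluster center, we can define "closest to the cluster center" as having the smallest such distance. Let us define this metric below. Note that it imports the Breeze library (a dependency of MLlib) for linear algebra and vector operations.

    import breeze.linalg._
    import breeze.numerics.pow
    // squared Euclidean distance between two vectors
    def computeDistance(v1: DenseVector[Double], v2: DenseVector[Double]) = pow(v1 - v2, 2).sum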
-
Next, we use the function above to compute, for each movie, the distance between its factor vector and the center of the cluster it is assigned to. To make the results readable, the output also includes each movie's title and genres.

    val titlesWithFactors = titlesAndGenres.join(movieFactors)
    val moviesAssigned = titlesWithFactors.map { case (id, ((title, genres), vector)) =>
      val pred = movieClusterModel.predict(vector)
      val clusterCentre = movieClusterModel.clusterCenters(pred)
      val dist = computeDistance(DenseVector(clusterCentre.toArray), DenseVector(vector.toArray))
      (id, title, genres.mkString(" "), pred, dist)
    }
    // group by cluster
    val clusterAssignments = moviesAssigned.groupBy { case (id, title, genres, cluster, dist) => cluster }.collectAsMap
    // clusterAssignments: scala.collection.Map[Int,Iterable[(Int, String, String, Int, Double)]] = Map(2 -> CompactBuffer((204,Back to the Future (1985),Comedy Sci-Fi,2,0.6735965658876787), (318,Schindler's List (1993),Drama War,2,0.7299714144239348), (228,Star Trek: The Wrath of Khan (1982),Action Adventure Sci-Fi,2,1.232051633746667), (1168,Little Buddha (1993),Drama,2,1.1096062423372306), (1550,Destiny Turns on the Radio (1995),Comedy,2,0.7740719996017212), (196,Dead Poets Society (1989),Drama,2,0.4990554824588539), (1176,Welcome To Sarajevo (1997),Drama War,2,3.205189858830396), (1144,Quiet Room, The (1996),Drama,2,0.7915200882484206), (292,Rosewood (1997),Drama,2,1.8089704369054382), (1262,Walking and Talking (1996),Romance,2,2.1070678491655315), (836,Ninotchka (1939),Comedy Romance,2,1....

    // After running this code we have a local Map (collectAsMap brings the grouped
    // RDD back to the driver) in which each entry is a key-value pair for one cluster:
    // the key is the cluster ID and the value is a collection of movies with their details.
    // Each movie record holds the movie ID, title, genres, cluster index, and the
    // distance between the movie's factor vector and the cluster center.

    // Iterate over the clusters and print the 20 movies closest to each center.
    // Tuple accessors (_1, _2, ...) are 1-based, not 0-based.
    for ((k, v) <- clusterAssignments.toSeq.sortBy(_._1)) {
      println(s"Cluster $k:")
      // sort by d, the distance to the center
      val m = v.toSeq.sortBy(_._5)
      println(m.take(20).map { case (_, title, genres, _, d) => (title, genres, d) }.mkString("\n"))
      println("=====\n")
    }

The loop prints:

    Cluster 0:
    (Angela (1995),Drama,0.29636049873225556)
    (Moonlight and Valentino (1995),Drama Romance,0.3403447970733214)
    (Blue Chips (1994),Drama,0.37733781105042424)
    (Johns (1996),Drama,0.3931535460787371)
    (For Love or Money (1993),Comedy,0.467658288041116)
    (Mr. Jones (1993),Drama Romance,0.48646846535533833)
    (Air Up There, The (1994),Comedy,0.4932269425293205)
    (New Jersey Drive (1995),Crime Drama,0.5068923379274556)
    (Outbreak (1995),Action Drama Thriller,0.5128207768005651)
    (Pagemaster, The (1994),Action Adventure Animation Children's Fantasy,0.5268818392996645)
    (Tainted (1998),Comedy Thriller,0.532524552773841)
    (Wedding Bell Blues (1996),Comedy,0.532524552773841)
    (Next Step, The (1995),Drama,0.532524552773841)
    (Nightwatch (1997),Horror Thriller,0.5407591489002819)
    (River Wild, The (1994),Action Thriller,0.5490537339724462)
    (Stag (1997),Action Thriller,0.5563061689451448)
    (Santa Clause, The (1994),Children's Comedy,0.5761877333414581)
    (Cliffhanger (1993),Action Adventure Crime,0.5882511906571888)
    (It Takes Two (1995),Comedy,0.602675373740639)
    (Private Benjamin (1980),Comedy,0.631468704858662)
    =====

    Cluster 1:
    (All Over Me (1997),Drama,0.2158047958023726)
    (Gate of Heavenly Peace, The (1995),Documentary,0.34160593043095533)
    (Killer: A Journal of Murder (1995),Crime Drama,0.3546233386391832)
    (Wings of Courage (1995),Adventure Romance,0.35877274078657867)
    (Two Friends (1986) ,Drama,0.3650324179215576)
    (Dadetown (1995),Documentary,0.3650324179215576)
    (Girls Town (1996),Drama,0.3650324179215576)
    (Big One, The (1997),Comedy Documentary,0.3650324179215576)
    (Hana-bi (1997),Comedy Crime Drama,0.3650324179215576)
    (Á köldum klaka (Cold Fever) (1994),Comedy Drama,0.3650324179215576)
    (Silence of the Palace, The (Saimt el Qusur) (1994),Drama,0.3650324179215576)
    (Land and Freedom (Tierra y libertad) (1995),War,0.3650324179215576)
    (Normal Life (1996),Crime Drama,0.3650324179215576)
    (Eighth Day, The (1996),Drama,0.3650324179215576)
    (Walking Dead, The (1995),Drama War,0.3814647434386409)
    (Sweet Nothing (1995),Drama,0.4463225018862566)
    (I Like It Like That (1994),Comedy Drama Romance,0.45973680012935836)
    (Collectionneuse, La (1967),Drama,0.4834649131693116)
    (Glass Shield, The (1994),Drama,0.49980690750059353)
    (Lover's Knot (1996),Comedy,0.5029945315100043)
    =====

    Cluster 2:
    (Last Time I Saw Paris, The (1954),Drama,0.17305682467275096)
    (Substance of Fire, The (1996),Drama,0.18694335286663155)
    (Wife, The (1995),Comedy Drama,0.3182552386461784)
    (All Things Fair (1996),Drama,0.3258699566626903)
    (Wedding Gift, The (1994),Drama,0.32828303127967334)
    (Mr. Wonderful (1993),Comedy Romance,0.3365560627641174)
    (Commandments (1997),Romance,0.35323014663969393)
    (Prefontaine (1997),Drama,0.3676922615513928)
    (Outlaw, The (1943),Western,0.3824177730525398)
    (Apollo 13 (1995),Action Drama Thriller,0.38822865614351426)
    (Sword in the Stone, The (1963),Animation Children's,0.44734438632925777)
    (Intimate Relations (1996),Comedy,0.44789039227338406)
    (20,000 Leagues Under the Sea (1954),Adventure Children's Fantasy Sci-Fi,0.49171018440451153)
    (Glory (1989),Action Drama War,0.4988137149082063)
    (Dead Poets Society (1989),Drama,0.4990554824588539)
    (Abyss, The (1989),Action Adventure Sci-Fi Thriller,0.5125201683410132)
    (Dave (1993),Comedy Romance,0.5133280291996677)
    (When Harry Met Sally... (1989),Comedy Romance,0.5261110829107895)
    (In the Line of Fire (1993),Action Thriller,0.5329236056524467)
    (Target (1995),Action Drama,0.5534147386483806)
    =====

    Cluster 3:
    (Machine, The (1994),Comedy Horror,0.06594718019318282)
    (Amityville: A New Generation (1993),Horror,0.10348648983350936)
    (Amityville 1992: It's About Time (1992),Horror,0.10348648983350936)
    (Johnny 100 Pesos (1993),Action Drama,0.11550108120321055)
    (War at Home, The (1996),Drama,0.13301389378075107)
    (Amityville: Dollhouse (1996),Horror,0.13534519806809794)
    (Venice/Venice (1992),Drama,0.1361353105626825)
    (Gordy (1995),Comedy,0.14073028392471718)
    (Being Human (1993),Drama,0.14405406656519718)
    (Boys in Venice (1996),Drama,0.14457703868320995)
    (Somebody to Love (1994),Drama,0.14457703868320995)
    (Falling in Love Again (1980),Comedy,0.15156313949956887)
    (Catwalk (1995),Documentary,0.16497776765767916)
    (New Age, The (1994),Drama,0.16954557413198104)
    (Beyond Bedlam (1993),Drama Horror,0.1705142860734866)
    (Police Story 4: Project S (Chao ji ji hua) (1993),Action,0.17091934252191224)
    (Babyfever (1994),Comedy Drama,0.17248672672017062)
    (Small Faces (1995),Drama,0.17259895799733074)
    (August (1996),Drama,0.17780437730622617)
    (Leopard Son, The (1996),Documentary,0.17780437730622617)
    =====

    Cluster 4:
    (Witness (1985),Drama Romance Thriller,0.1946379960877294)
    (King of the Hill (1993),Drama,0.2580233704878727)
    (Mamma Roma (1962),Drama,0.3126024876925699)
    (Nelly & Monsieur Arnaud (1995),Drama,0.32101078512082826)
    (Angel and the Badman (1947),Western,0.3427227302957313)
    (Scream of Stone (Schrei aus Stein) (1991),Drama,0.3515308779440109)
    (Cosi (1996),Comedy,0.38514605438035665)
    (Beans of Egypt, Maine, The (1994),Drama,0.4074854418942155)
    (Spirits of the Dead (Tre passi nel delirio) (1968),Horror,0.411316222180582)
    (Object of My Affection, The (1998),Comedy Romance,0.41584384804887287)
    (Ed's Next Move (1996),Comedy,0.42539143205503793)
    (Celestial Clockwork (1994),Comedy,0.43545079573947953)
    (Love and Other Catastrophes (1996),Romance,0.4539418426282623)
    (Last Klezmer: Leopold Kozlowski, His Life and Music, The (1995),Documentary,0.47519050165111265)
    (Spellbound (1945),Mystery Romance Thriller,0.4776835514404319)
    (Price Above Rubies, A (1998),Drama,0.47779905950191665)
    (They Made Me a Criminal (1939),Crime Drama,0.4794300819064454)
    (Farewell to Arms, A (1932),Romance War,0.4852033525639032)
    (Third Man, The (1949),Mystery Thriller,0.48551232209476797)
    (Pushing Hands (1992),Comedy,0.4970817434925626)
    =====