从源数据中提取特征指标数据,这是一个比较典型且通用的步骤,因为我们的原始数据集里,经常会包含一些非指标数据,如 ID,Description 等。为方便后续模型进行特征输入,需要部分列的数据转换为特征向量,并统一命名,VectorAssembler类完成这一任务。VectorAssembler是一个transformer,将多列数据转化为单列的向量列。
- 代码如下
object VectorAssemblerExample {
def main(args: Array[String]) {
val spark = SparkSession.builder().master("local[*]").appName("QuantileDiscretizerExample").getOrCreate()
val sc = spark.sparkContext
val sqlContext = spark.sqlContext
import sqlContext.implicits._
val dataset = spark.createDataFrame(
Seq((0, 18, 1.0, Vectors.dense(0.0, 10.0, 0.5), 1.0))
).toDF("id", "hour", "mobile", "userFeatures", "clicked")
val assembler = new VectorAssembler()
.setInputCols(Array("hour", "mobile", "userFeatures"))
.setOutputCol("features")
val output = assembler.transform(dataset)
output.show(false)
println(output.select("features", "clicked").first())
sc.stop()
}
}
- 原始数据
+---+----+------+--------------+-------+
| id|hour|mobile| userFeatures|clicked|
+---+----+------+--------------+-------+
| 0| 18| 1.0|[0.0,10.0,0.5]| 1.0|
+---+----+------+--------------+-------+
- 转换之后的数据
+---+----+------+--------------+-------+-----------------------+
|id |hour|mobile|userFeatures |clicked|features |
+---+----+------+--------------+-------+-----------------------+
|0 |18 |1.0 |[0.0,10.0,0.5]|1.0 |[18.0,1.0,0.0,10.0,0.5]|
+---+----+------+--------------+-------+-----------------------+
- 特征数据
[[18.0,1.0,0.0,10.0,0.5],1.0]