Built-in Functions Introduced in Spark 1.5.x
Spark 1.5.x added a series of built-in functions to the DataFrame API, implemented with code-generation optimization. Unlike an ordinary function, a DataFrame built-in function does not execute and immediately return a result value; instead it returns a Column object, which is evaluated later as part of a parallel job. Columns can be used in DataFrame operations such as select, filter, and groupBy, and the inputs to these functions can themselves be Columns. A short sketch after the table below illustrates this usage.
Category | Functions |
---|---|
Aggregate functions | approxCountDistinct, avg, count, countDistinct, first, last, max, mean, min, sum, sumDistinct |
Collection functions | array_contains, explode, size, sort_array |
Date/time functions | Date/time conversion: unix_timestamp, from_unixtime, to_date, quarter, day, dayofyear, weekofyear, from_utc_timestamp, to_utc_timestamp; extracting fields from a date/time: year, month, dayofmonth, hour, minute, second |
Date/time functions | Date/time arithmetic: datediff, date_add, date_sub, add_months, last_day, next_day, months_between; current time and formatting: current_date, current_timestamp, trunc, date_format |
Math functions | abs, acos, asin, atan, atan2, bin, cbrt, ceil, conv, cos, cosh, exp, expm1, factorial, floor, hex, hypot, log, log10, log1p, log2, pmod, pow, rint, round, shiftLeft, shiftRight, shiftRightUnsigned, signum, sin, sinh, sqrt, tan, tanh, toDegrees, toRadians, unhex |
Miscellaneous functions | array, bitwiseNOT, callUDF, coalesce, crc32, greatest, if, inputFileName, isNaN, isnotnull, isnull, least, lit, md5, monotonicallyIncreasingId, nanvl, negate, not, rand, randn, sha1, sha2, sparkPartitionId, struct, when |
String functions | ascii, base64, concat, concat_ws, decode, encode, format_number, format_string, get_json_object, initcap, instr, length, levenshtein, locate, lower, lpad, ltrim, printf, regexp_extract, regexp_replace, repeat, reverse, rpad, rtrim, soundex, space, split, substring, substring_index, translate, trim, unbase64, upper |
Window functions | cumeDist, denseRank, lag, lead, ntile, percentRank, rank, rowNumber |
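As a minimal sketch of the Column-based style described above: df here is a hypothetical DataFrame with columns name, date, and amount, used only for illustration.
import org.apache.spark.sql.functions._
// each built-in function returns a Column rather than a value;
// the Column is only evaluated when the job actually runs
val upperName = upper(df("name"))              // a Column, not a String
df.select(df("date"), upperName)               // Columns can be used in select...
  .filter(length(df("name")) > 3)              // ...in filter...
  .groupBy("date")
  .agg(sum("amount"), countDistinct("name"))   // ...and in aggregations
  .show()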
Hands-on example: use the daily user access log and user purchase log to compute daily UV and daily sales revenue
Daily UV
First, a quick note on what UV means in business terms: many users visit the site every day, and a single user may visit many times in one day. UV (unique visitors) is therefore the visit count after deduplicating by user, i.e. the number of distinct users per day.
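Conceptually, daily UV is just a COUNT(DISTINCT ...) grouped by date. A rough SQL-equivalent sketch, assuming the DataFrame built in the examples below has been registered as a temp table (the table name "user_access_log" is only an illustrative assumption):
// the table must first be registered, e.g. df.registerTempTable("user_access_log")
sqlContext.sql(
  "SELECT data, COUNT(DISTINCT userid) AS uv " +
  "FROM user_access_log GROUP BY data").show()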
Java version
Be sure to add import static org.apache.spark.sql.functions.*;, otherwise the countDistinct function cannot be used directly.
import static org.apache.spark.sql.functions.*;

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class DailyUV {
public static void main(String[] args) {
SparkConf conf = new SparkConf().setAppName("DailyUVJava").setMaster("local");
JavaSparkContext sparkContext = new JavaSparkContext(conf);
SQLContext sqlContext = new SQLContext(sparkContext.sc());
// Build the user access log data and create a DataFrame from it
// Simulated access log: comma-separated, first column is the date, second is the user id
List<String> userAccessLog = new ArrayList<String>();
userAccessLog.add("2018-12-30,1122");
userAccessLog.add("2018-12-30,1122");
userAccessLog.add("2018-12-30,1123");
userAccessLog.add("2018-12-30,1124");
userAccessLog.add("2018-12-30,1125");
userAccessLog.add("2018-12-30,1126");
userAccessLog.add("2018-12-31,1126");
userAccessLog.add("2018-12-31,1127");
userAccessLog.add("2018-12-31,1128");
userAccessLog.add("2018-12-31,1129");
userAccessLog.add("2018-12-31,1130");
userAccessLog.add("2018-12-31,1131");
userAccessLog.add("2018-12-31,1132");
// Convert the simulated user access log RDD into a DataFrame
// First, convert the plain RDD into an RDD of Row
JavaRDD<String> userAccessLogRDD = sparkContext.parallelize(userAccessLog);
JavaRDD<Row> userAccessLogRowRDD = userAccessLogRDD.map(new Function<String, Row>() {
@Override
public Row call(String v1) throws Exception {
return RowFactory.create(v1.split(",")[0], v1.split(",")[1]);
}
});
// Define the DataFrame's schema (metadata)
List<StructField> fieldList = new ArrayList<StructField>();
fieldList.add(DataTypes.createStructField("data", DataTypes.StringType, true));
fieldList.add(DataTypes.createStructField("userid", DataTypes.StringType, true));
StructType structType = DataTypes.createStructType(fieldList);
// Create the DataFrame with the SQLContext
DataFrame df = sqlContext.createDataFrame(userAccessLogRowRDD, structType);
df.groupBy("data").agg(countDistinct("userid")).show();
}
}
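With the sample data above, 2018-12-30 has five distinct user ids (1122 appears twice) and 2018-12-31 has seven, so the show() call should print something like the following (the exact header of the aggregate column depends on the Spark version):
+----------+----------------------+
|      data|COUNT(DISTINCT userid)|
+----------+----------------------+
|2018-12-30|                     5|
|2018-12-31|                     7|
+----------+----------------------+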
Scala version
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.sql.functions._
object DailyUV {
def main(args: Array[String]): Unit = {
// First, as usual, create the SparkConf
val conf = new SparkConf().setAppName("DailyUVScala").setMaster("local")
// Create the SparkContext
val sparkContext = new SparkContext(conf)
val sqlContext = new SQLContext(sparkContext)
// Important note:
// the built-in functions themselves come from org.apache.spark.sql.functions._ (imported above);
// the SQLContext implicits imported here provide the 'col (Symbol) to Column conversion used below
import sqlContext.implicits._
// Build the user access log data and create a DataFrame from it
// Simulated access log: comma-separated, first column is the date, second is the user id
val userAccessLog = Array("2018-12-30,1122", "2018-12-30,1122", "2018-12-30,1123",
"2018-12-30,1124", "2018-12-30,1125", "2018-12-30,1126", "2018-12-31,1126",
"2018-12-31,1127", "2018-12-31,1128", "2018-12-31,1129", "2018-12-31,1130",
"2018-12-31,1131", "2018-12-31,1132")
// Convert the simulated user access log RDD into a DataFrame
// First, convert the plain RDD into an RDD of Row
val userAccessLogRDD = sparkContext.parallelize(userAccessLog,5)
val userAccessLogRowRDD = userAccessLogRDD.map(s => Row(s.split(",")(0), s.split(",")(1)))
val structType = StructType(Array(StructField("data",StringType, true), StructField("userid",StringType, true)))
val df = sqlContext.createDataFrame(userAccessLogRowRDD, structType)
df.groupBy("data")
.agg('data, countDistinct('userid))//注意格式,data前面是单引号
.foreach(row => println(row))//注意结果是3列
}
}
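If you only want two output columns (date and UV), a slightly cleaner variant, still just a sketch against the same df, is:
df.groupBy("data")
  .agg(countDistinct('userid).as("uv")) // the grouping column "data" is retained automatically
  .show()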
Daily sales revenue
A note on the business scenario: in practice this can be framed as a separate statistic, the sales revenue of the site's logged-in users. Log reporting sometimes produces errors and anomalies, for example records that have lost part of their data (in the sample data below, the sale amount is missing); such records are simply excluded from the statistics, as sketched right after this paragraph.
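A minimal sketch of the filtering idea, assuming each well-formed log line has exactly three comma-separated fields (date, user id, sale amount); rawSaleLogRDD is a hypothetical RDD[String] of raw log lines:
val cleanLog = rawSaleLogRDD.filter { line =>
  val fields = line.split(",")
  fields.length == 3 && fields(2).nonEmpty // drop records with a missing sale amount
}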
Java version
import static org.apache.spark.sql.functions.*;

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class DailySale {
public static void main(String[] args) {
SparkConf conf = new SparkConf().setAppName("DailySaleJava").setMaster("local");
JavaSparkContext sparkContext = new JavaSparkContext(conf);
SQLContext sqlContext = new SQLContext(sparkContext.sc());
// Build the user sale log data and create a DataFrame from it
// Simulated sale log: comma-separated, first column is the date, second is the user id, third is the sale amount
List<String> userSaleLog = new ArrayList<String>();
userSaleLog.add("2018-12-30,1122,112.23");
userSaleLog.add("2018-12-30,1122,222.23");
userSaleLog.add("2018-12-30,1123,110.00");
userSaleLog.add("2018-12-30,1124,23.25");
userSaleLog.add("2018-12-30,1125,23.33");
userSaleLog.add("2018-12-30,1126,210.3");
userSaleLog.add("2018-12-31,1126,666.66");
userSaleLog.add("2018-12-31,1127,");
userSaleLog.add("2018-12-31,1128,777.89");
userSaleLog.add("2018-12-31,1129,");
userSaleLog.add("2018-12-31,1130,33333");
userSaleLog.add("2018-12-31,1131,2301");
userSaleLog.add("2018-12-31,1132,333");
// Convert the simulated user sale log RDD into a DataFrame
// First, filter out malformed records, then convert the plain RDD into an RDD of Row
JavaRDD<String> userSaleLogRDD = sparkContext.parallelize(userSaleLog);
userSaleLogRDD = userSaleLogRDD.filter(new Function<String, Boolean>() {
@Override
public Boolean call(String v1) throws Exception {
// keep only well-formed records: date, userid and sale amount
return v1.split(",").length == 3;
}
});
JavaRDD<Row> userSaleLogRowRDD = userSaleLogRDD.map(new Function<String, Row>() {
@Override
public Row call(String v1) throws Exception {
return RowFactory.create(v1.split(",")[0], Double.parseDouble(v1.split(",")[2]));
}
});
// Define the DataFrame's schema (metadata)
List<StructField> fieldList = new ArrayList<StructField>();
fieldList.add(DataTypes.createStructField("data", DataTypes.StringType, true));
fieldList.add(DataTypes.createStructField("sale_amount", DataTypes.DoubleType, true));
StructType structType = DataTypes.createStructType(fieldList);
// Create the DataFrame with the SQLContext
DataFrame df = sqlContext.createDataFrame(userSaleLogRowRDD, structType);
df.groupBy("data").agg(sum("sale_amount")).show();
}
}
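With the sample data above, the two records with a missing sale amount (users 1127 and 1129 on 2018-12-31) are dropped by the filter, so the daily totals are roughly 701.34 for 2018-12-30 and 37411.55 for 2018-12-31; show() prints something like the following (aggregate column header and floating-point rounding may differ slightly):
+----------+----------------+
|      data|sum(sale_amount)|
+----------+----------------+
|2018-12-30|          701.34|
|2018-12-31|        37411.55|
+----------+----------------+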
Scala version
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}
import org.apache.spark.sql.functions._

object DailySale {
def main(args: Array[String]): Unit = {
// First, as usual, create the SparkConf
val conf = new SparkConf().setAppName("DailySaleScala").setMaster("local")
// Create the SparkContext
val sparkContext = new SparkContext(conf)
val sqlContext = new SQLContext(sparkContext)
// Important note:
// the built-in functions themselves come from org.apache.spark.sql.functions._ (imported above);
// the SQLContext implicits imported here provide the 'col (Symbol) to Column conversion used below
import sqlContext.implicits._
// Build the user sale log data and create a DataFrame from it
// Simulated sale log: comma-separated, first column is the date, second is the user id, third is the sale amount
val userSaleLog = Array("2018-12-30,1122,112.33", "2018-12-30,1122,112.23", "2018-12-30,1123,663",
"2018-12-30,1124,55.55", "2018-12-30,1125,44.44", "2018-12-30,1126,33.33", "2018-12-31,1126,69",
"2018-12-31,1127,66.66", "2018-12-31,1128,77.77", "2018-12-31,1129,88.88", "2018-12-31,1130,99.99",
"2018-12-31,1131,201.22", "2018-12-31,1132,100.1")
// Convert the simulated user sale log RDD into a DataFrame
// First, filter out malformed records, then convert the plain RDD into an RDD of Row
val userSaleLogRDD = sparkContext.parallelize(userSaleLog,5)
val userFilter = userSaleLogRDD.filter(_.split(",").length == 3) // keep only well-formed records (this particular sample data has none to drop)
val userSaleLogRowRDD = userFilter.map(s => Row(s.split(",")(0), s.split(",")(2).toDouble))
val structType = StructType(Array(StructField("data",StringType, true), StructField("sale_amount",DoubleType, true)))
val df = sqlContext.createDataFrame(userSaleLogRowRDD, structType)
df.groupBy("data")
.agg('data, sum("sale_amount"))//注意格式,data前面是单引号
.foreach(row => println(row))//注意结果是3列
}
}
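To produce daily UV and daily sales revenue in a single pass, as the section title suggests, the two aggregations can also be combined. A sketch assuming a combined DataFrame combinedDf with columns data, userid, and sale_amount (combinedDf is an assumption, not built in the examples above):
combinedDf.groupBy("data")
  .agg(countDistinct('userid).as("uv"), sum("sale_amount").as("daily_sales"))
  .show()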