I. Window Functions
Window functions have been supported by the Spark SQL module since version 1.4. They are mainly used when you need to operate on a group of rows while still returning a single value for every input row, for example computing the average over a specified set of rows, computing a cumulative sum, or accessing the rows that precede the current row. Such scenarios are hard to express with ordinary functions.
The group of rows a window function computes over is called a frame. Every row being processed has a unique frame associated with it.
Spark SQL supports three kinds of window functions: ranking functions, analytic functions, and aggregate functions.
The ranking and analytic functions supported by Spark SQL are covered in the sections below.
As for aggregate functions, any ordinary aggregate function can also be used as a window aggregate function.
Spark supports two ways of using window functions:
In SQL, add an OVER clause to a supported function, e.g. avg(revenue) OVER (…).
With the DataFrame API, call the over() method on a supported function, e.g. rank().over(…).
When a function is used as a window function, a window specification has to be defined for it. The window specification determines which rows are included in the frame associated with a given input row. A window specification has three parts:
Partitioning specification: defines which rows belong to the same partition, so that rows of the same partition can be collected onto the same machine before the frame is sorted and evaluated. If no partitioning specification is given, all the data is collected onto a single machine.
Ordering specification: defines how the rows within a partition are ordered, which determines the position of a given row inside its partition.
Frame specification: states which rows are included in the frame of the current input row, expressed through the positions of other rows relative to the current one.
In SQL, the PARTITION BY keyword defines the partitioning expression and the ORDER BY keyword defines the ordering expression, in the form OVER (PARTITION BY ... ORDER BY ...).
With the DataFrame API, dedicated functions are provided for defining the window specification, for example:
import org.apache.spark.sql.expressions.Window
val windowSpec = Window.partitionBy(...).orderBy(...)
In addition to the partitioning and ordering, the frame needs a start boundary, an end boundary, and a frame type; these are the three parts of a frame specification.
There are five kinds of boundaries:
UNBOUNDED PRECEDING (first row of the partition),
UNBOUNDED FOLLOWING (last row of the partition),
CURRENT ROW (the current row),
<value> PRECEDING (rows before the current row),
<value> FOLLOWING (rows after the current row).
There are two frame types: ROW frames and RANGE frames.
A ROW frame is based on physical offsets from the position of the current input row:
for example, CURRENT ROW used as a boundary means the current input row, while <value> PRECEDING and <value> FOLLOWING give the number of rows before and after the current row, respectively.
For example:
the SQL clause ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING describes a frame containing the current row, the one row before it, and the one row after it.
A RANGE frame is based on logical offsets from the position of the current row. The logical offset is the difference between the value of the current input row's ordering expression and the value of the boundary row's ordering expression. Because of this definition, only a single ordering expression is allowed with a RANGE frame. Also, for the purpose of boundary calculation, all rows whose ordering expression value equals that of the current input row are treated as the same row. For example, if the ordering expression is revenue and the SQL clause is RANGE BETWEEN 2000 PRECEDING AND 1000 FOLLOWING, the frame boundaries are [current revenue value - 2000, current revenue value + 1000].
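As a rough sketch of such a query (the sales table and its revenue column here are hypothetical, used only to mirror the example above; no output is shown):
scala> spark.sql("SELECT revenue, count(*) OVER(ORDER BY revenue RANGE BETWEEN 2000 PRECEDING AND 1000 FOLLOWING) AS cnt from sales").show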
With the DataFrame API, ROW frames and RANGE frames are defined with the following methods:
Window.partitionBy(...).orderBy(...).rowsBetween(start, end)
Window.partitionBy(...).orderBy(...).rangeBetween(start, end)
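In the DataFrame API the boundary keywords map to constants on the Window object (Window.unboundedPreceding, Window.currentRow, Window.unboundedFollowing), while <value> PRECEDING and <value> FOLLOWING become negative and positive offsets. A minimal sketch using the deptno and sal columns of the test data introduced below:
import org.apache.spark.sql.expressions.Window

// ROWS frame with physical offsets, i.e. ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
val rowsSpec = Window.partitionBy("deptno").orderBy("sal").rowsBetween(-1, 1)

// Frame from the start of the partition up to the current row,
// i.e. ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
val runningSpec = Window.partitionBy("deptno").orderBy("sal").rowsBetween(Window.unboundedPreceding, Window.currentRow)

// RANGE frame with the same unbounded boundaries,
// i.e. RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
val rangeSpec = Window.partitionBy("deptno").orderBy("sal").rangeBetween(Window.unboundedPreceding, Window.currentRow)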
Test data:
{"EMPNO": 7369,"ENAME": "SMITH","JOB": "CLERK","MGR": 7902,"HIREDATE": "1980-12-17 00:00:00","SAL": 800.00,"COMM": null,"DEPTNO": 20}
{"EMPNO": 7499,"ENAME": "ALLEN","JOB": "SALESMAN","MGR": 7698,"HIREDATE": "1981-02-20 00:00:00","SAL": 1600.00,"COMM": 300.00,"DEPTNO": 30}
{"EMPNO": 7521,"ENAME": "WARD","JOB": "SALESMAN","MGR": 7698,"HIREDATE": "1981-02-22 00:00:00","SAL": 1250.00,"COMM": 500.00,"DEPTNO": 30}
{"EMPNO": 7566,"ENAME": "JONES","JOB": "MANAGER","MGR": 7839,"HIREDATE": "1981-04-02 00:00:00","SAL": 2975.00,"COMM": null,"DEPTNO": 20}
{"EMPNO": 7654,"ENAME": "MARTIN","JOB": "SALESMAN","MGR": 7698,"HIREDATE": "1981-09-28 00:00:00","SAL": 1250.00,"COMM": 1400.00,"DEPTNO": 30}
{"EMPNO": 7698,"ENAME": "BLAKE","JOB": "MANAGER","MGR": 7839,"HIREDATE": "1981-05-01 00:00:00","SAL": 2850.00,"COMM": null,"DEPTNO": 30}
{"EMPNO": 7782,"ENAME": "CLARK","JOB": "MANAGER","MGR": 7839,"HIREDATE": "1981-06-09 00:00:00","SAL": 2450.00,"COMM": null,"DEPTNO": 10}
{"EMPNO": 7788,"ENAME": "SCOTT","JOB": "ANALYST","MGR": 7566,"HIREDATE": "1987-04-19 00:00:00","SAL": 1500.00,"COMM": null,"DEPTNO": 20}
{"EMPNO": 7839,"ENAME": "KING","JOB": "PRESIDENT","MGR": null,"HIREDATE": "1981-11-17 00:00:00","SAL": 5000.00,"COMM": null,"DEPTNO": 10}
{"EMPNO": 7844,"ENAME": "TURNER","JOB": "SALESMAN","MGR": 7698,"HIREDATE": "1981-09-08 00:00:00","SAL": 1500.00,"COMM": 0.00,"DEPTNO": 30}
{"EMPNO": 7876,"ENAME": "ADAMS","JOB": "CLERK","MGR": 7788,"HIREDATE": "1987-05-23 00:00:00","SAL": 1100.00,"COMM": null,"DEPTNO": 20}
{"EMPNO": 7900,"ENAME": "JAMES","JOB": "CLERK","MGR": 7698,"HIREDATE": "1981-12-03 00:00:00","SAL": 950.00,"COMM": null,"DEPTNO": 30}
{"EMPNO": 7902,"ENAME": "FORD","JOB": "ANALYST","MGR": 7566,"HIREDATE": "1981-12-03 00:00:00","SAL": 3000.00,"COMM": null,"DEPTNO": 20}
{"EMPNO": 7934,"ENAME": "MILLER","JOB": "CLERK","MGR": 7782,"HIREDATE": "1982-01-23 00:00:00","SAL": 1300.00,"COMM": null,"DEPTNO": 10}
Using it programmatically (the DataFrame is registered as a temp view named emp so that the SQL statements below can query it):
scala> val df=spark.read.json("/root/emp.json")
scala> df.createOrReplaceTempView("emp")
1. Ranking Functions
1) rank
Computes the rank of each row within its group. If rows in the same group have equal values in the ordering column, the current row gets the same rank as the previous row; otherwise the rank of the current row is its row number within the group. The rank sequence can therefore contain gaps.
scala> spark.sql("SELECT deptno,sal,rank() OVER(partition by deptno ORDER BY sal) AS rank from emp").show
+------+------+----+
|deptno| sal|rank|
+------+------+----+
| 10|1300.0| 1|
| 10|2450.0| 2|
| 10|5000.0| 3|
| 30| 950.0| 1|
| 30|1250.0| 2|
| 30|1250.0| 2|
| 30|1500.0| 4|
| 30|1600.0| 5|
| 30|2850.0| 6|
| 20| 800.0| 1|
| 20|1100.0| 2|
| 20|1500.0| 3|
| 20|2975.0| 4|
| 20|3000.0| 5|
+------+------+----+
Note:
Department 30 has two rows with the same salary 1250.0, so rank 3 is skipped.
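The same ranking can be written with the DataFrame API; a minimal sketch, assuming df is the DataFrame read from emp.json above ($-columns work in spark-shell because spark.implicits._ is imported automatically):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rank

// same window as the SQL above: partition by deptno, order by sal
val byDeptBySal = Window.partitionBy("deptno").orderBy("sal")
df.select($"deptno", $"sal", rank().over(byDeptBySal).as("rank")).show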
2) dense_rank
Computes the rank of each row within its group. If rows in the same group have equal values in the ordering column, the current row gets the same rank as the previous row; otherwise the rank of the current row is the previous row's rank plus 1. The rank sequence contains no gaps.
scala> spark.sql("SELECT deptno,sal,dense_rank() OVER(partition by deptno ORDER BY sal) AS rank from emp").show
+------+------+----+
|deptno| sal|rank|
+------+------+----+
| 10|1300.0| 1|
| 10|2450.0| 2|
| 10|5000.0| 3|
| 30| 950.0| 1|
| 30|1250.0| 2|
| 30|1250.0| 2|
| 30|1500.0| 3|
| 30|1600.0| 4|
| 30|2850.0| 5|
| 20| 800.0| 1|
| 20|1100.0| 2|
| 20|1500.0| 3|
| 20|2975.0| 4|
| 20|3000.0| 5|
+------+------+----+
Note:
Department 30 has two rows with the same salary 1250.0, so both get rank 2.
3) percent_rank
Computed as (rank within the group - 1) / (number of rows in the group - 1); if the group has only one row, the result is 0.
scala> spark.sql("SELECT deptno,sal,percent_rank() OVER(partition by deptno ORDER BY sal) AS rank from emp").show
+------+------+----+
|deptno| sal|rank|
+------+------+----+
| 10|1300.0| 0.0|
| 10|2450.0| 0.5|
| 10|5000.0| 1.0|
| 30| 950.0| 0.0|
| 30|1250.0| 0.2|
| 30|1250.0| 0.2|
| 30|1500.0| 0.6|
| 30|1600.0| 0.8|
| 30|2850.0| 1.0|
| 20| 800.0| 0.0|
| 20|1100.0|0.25|
| 20|1500.0| 0.5|
| 20|2975.0|0.75|
| 20|3000.0| 1.0|
+------+------+----+
Note:
Row 1 (department 10): (1-1)/(3-1) = 0.
Row 2 (department 10): (2-1)/(3-1) = 0.5.
4) ntile
Sorts the rows within the group and splits them into n buckets according to the specified n; the value is the bucket number of the current row (bucket numbers start at 1).
# split the data into 2 buckets
scala> spark.sql("SELECT deptno,sal,ntile(2) OVER(partition by deptno ORDER BY sal) AS rank from emp").show
+------+------+----+
|deptno| sal|rank|
+------+------+----+
| 10|1300.0| 1|
| 10|2450.0| 1|
| 10|5000.0| 2|
| 30| 950.0| 1|
| 30|1250.0| 1|
| 30|1250.0| 1|
| 30|1500.0| 2|
| 30|1600.0| 2|
| 30|2850.0| 2|
| 20| 800.0| 1|
| 20|1100.0| 1|
| 20|1500.0| 1|
| 20|2975.0| 2|
| 20|3000.0| 2|
+------+------+----+
5) row_number
After sorting the rows within the group, assigns each row a unique, increasing sequence number starting at 1 within its group.
scala> spark.sql("SELECT deptno,sal,row_number() OVER(partition by deptno ORDER BY sal) AS rank from emp").show
+------+------+----+
|deptno| sal|rank|
+------+------+----+
| 10|1300.0| 1|
| 10|2450.0| 2|
| 10|5000.0| 3|
| 30| 950.0| 1|
| 30|1250.0| 2|
| 30|1250.0| 3|
| 30|1500.0| 4|
| 30|1600.0| 5|
| 30|2850.0| 6|
| 20| 800.0| 1|
| 20|1100.0| 2|
| 20|1500.0| 3|
| 20|2975.0| 4|
| 20|3000.0| 5|
+------+------+----+
2. Analytic Functions
1) cume_dist
Computed as: number of rows in the group whose value is less than or equal to the current row's value / total number of rows in the group.
It gives the position of a value relative to all values in the partition.
scala> spark.sql("SELECT deptno,sal,CUME_DIST() OVER(ORDER BY deptno) AS rn1 from emp").show
20/04/04 09:34:17 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
+------+------+-------------------+
|deptno| sal| rn1|
+------+------+-------------------+
| 10|2450.0|0.21428571428571427|
| 10|5000.0|0.21428571428571427|
| 10|1300.0|0.21428571428571427|
| 20| 800.0| 0.5714285714285714|
| 20|2975.0| 0.5714285714285714|
| 20|1500.0| 0.5714285714285714|
| 20|1100.0| 0.5714285714285714|
| 20|3000.0| 0.5714285714285714|
| 30|1600.0| 1.0|
| 30|1250.0| 1.0|
| 30|1250.0| 1.0|
| 30|2850.0| 1.0|
| 30|1500.0| 1.0|
| 30| 950.0| 1.0|
+------+------+-------------------+
No partition is defined, so all the data forms a single group of 14 rows.
Rows with deptno = 10: there are 3 rows with deptno <= 10, so 3/14 = 0.21428571428571427.
Rows with deptno = 20: there are 3 + 5 = 8 rows with deptno <= 20, so 8/14 = 0.5714285714285714.
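The WARN message above appears because no PARTITION BY was specified, so all rows are moved into a single partition. A sketch of a partitioned variant with the DataFrame API, which computes the cumulative distribution of sal within each department and avoids the warning (assuming df from above):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.cume_dist

// partitioning by deptno keeps the window computation distributed
df.select($"deptno", $"sal", cume_dist().over(Window.partitionBy("deptno").orderBy("sal")).as("rn1")).show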
2) lag
Usage: lag(input, [offset, [default]]). Returns the value of the input column from the row that is offset rows before the current row within the group, according to the sort order. If offset reaches beyond the window (there are fewer than offset rows before the current row in the group), the default value is returned; default is null if not given.
scala> spark.sql("SELECT deptno,sal,lag(sal,2) OVER(partition by deptno ORDER BY deptno) AS row from emp").show
+------+------+------+
|deptno| sal| row|
+------+------+------+
| 10|2450.0| null|
| 10|5000.0| null|
| 10|1300.0|2450.0|
| 30|1600.0| null|
| 30|1250.0| null|
| 30|1250.0|1600.0|
| 30|2850.0|1250.0|
| 30|1500.0|1250.0|
| 30| 950.0|2850.0|
| 20| 800.0| null|
| 20|2975.0| null|
| 20|1500.0| 800.0|
| 20|1100.0|2975.0|
| 20|3000.0|1500.0|
+------+------+------+
3) lead
Usage: lead(input, [offset, [default]]). Returns the value of the input column from the row that is offset rows after the current row within the group, according to the sort order. If offset reaches beyond the window (there are fewer than offset rows after the current row in the group), the default value is returned; default is null if not given.
scala> spark.sql("SELECT deptno,sal,lead(sal,1) OVER(partition by deptno ORDER BY deptno) AS row from emp").show
+------+------+------+
|deptno| sal| row|
+------+------+------+
| 10|2450.0|5000.0|
| 10|5000.0|1300.0|
| 10|1300.0| null|
| 30|1600.0|1250.0|
| 30|1250.0|1250.0|
| 30|1250.0|2850.0|
| 30|2850.0|1500.0|
| 30|1500.0| 950.0|
| 30| 950.0| null|
| 20| 800.0|2975.0|
| 20|2975.0|1500.0|
| 20|1500.0|1100.0|
| 20|1100.0|3000.0|
| 20|3000.0| null|
+------+------+------+
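lag and lead are also available in the DataFrame API. A minimal sketch that orders by sal inside each department (the SQL examples above order by deptno, which is constant within a partition), assuming df from above:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lag, lead}

val w = Window.partitionBy("deptno").orderBy("sal")
// salary two rows before and one row after the current row, within the department
df.select($"deptno", $"sal", lag("sal", 2).over(w).as("prev_sal"), lead("sal", 1).over(w).as("next_sal")).show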
3. Aggregate Functions
1) Row numbering within groups
row_number gives the row number within each group.
scala> spark.sql("SELECT deptno,sal,row_number() OVER(partition by deptno ORDER BY deptno) AS row from emp").show
+------+------+---+
|deptno| sal|row|
+------+------+---+
| 10|2450.0| 1|
| 10|5000.0| 2|
| 10|1300.0| 3|
| 30|1600.0| 1|
| 30|1250.0| 2|
| 30|1250.0| 3|
| 30|2850.0| 4|
| 30|1500.0| 5|
| 30| 950.0| 6|
| 20| 800.0| 1|
| 20|2975.0| 2|
| 20|1500.0| 3|
| 20|1100.0| 4|
| 20|3000.0| 5|
+------+------+---+
2) Grouped sum
scala> spark.sql("SELECT deptno,sal,sum(sal) OVER(partition by deptno ORDER BY deptno) AS row from emp").show
+------+------+------+
|deptno| sal| row|
+------+------+------+
| 10|2450.0|8750.0|
| 10|5000.0|8750.0|
| 10|1300.0|8750.0|
| 30|1600.0|9400.0|
| 30|1250.0|9400.0|
| 30|1250.0|9400.0|
| 30|2850.0|9400.0|
| 30|1500.0|9400.0|
| 30| 950.0|9400.0|
| 20| 800.0|9375.0|
| 20|2975.0|9375.0|
| 20|1500.0|9375.0|
| 20|1100.0|9375.0|
| 20|3000.0|9375.0|
+------+------+------+
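The query above orders by deptno, which is constant inside each partition, so every row simply receives its department total. For a running total per department, the frame can be limited to the rows up to the current one; a minimal sketch, assuming df from above:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

// cumulative sum of sal within each department, ordered by sal
val running = Window.partitionBy("deptno").orderBy("sal").rowsBetween(Window.unboundedPreceding, Window.currentRow)
df.select($"deptno", $"sal", sum("sal").over(running).as("running_total")).show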
3) Grouped avg
scala> spark.sql("SELECT deptno,sal,avg(sal) OVER(partition by deptno ORDER BY deptno) AS row from emp").show
+------+------+------------------+
|deptno| sal| row|
+------+------+------------------+
| 10|2450.0|2916.6666666666665|
| 10|5000.0|2916.6666666666665|
| 10|1300.0|2916.6666666666665|
| 30|1600.0|1566.6666666666667|
| 30|1250.0|1566.6666666666667|
| 30|1250.0|1566.6666666666667|
| 30|2850.0|1566.6666666666667|
| 30|1500.0|1566.6666666666667|
| 30| 950.0|1566.6666666666667|
| 20| 800.0| 1875.0|
| 20|2975.0| 1875.0|
| 20|1500.0| 1875.0|
| 20|1100.0| 1875.0|
| 20|3000.0| 1875.0|
+------+------+------------------+
4) Grouped count
scala> spark.sql("SELECT deptno,sal,count(deptno) OVER(partition by deptno ORDER BY deptno) AS row from emp").show
+------+------+---+
|deptno| sal|row|
+------+------+---+
| 10|2450.0| 3|
| 10|5000.0| 3|
| 10|1300.0| 3|
| 30|1600.0| 6|
| 30|1250.0| 6|
| 30|1250.0| 6|
| 30|2850.0| 6|
| 30|1500.0| 6|
| 30| 950.0| 6|
| 20| 800.0| 5|
| 20|2975.0| 5|
| 20|1500.0| 5|
| 20|1100.0| 5|
| 20|3000.0| 5|
+------+------+---+
5) Grouped min
scala> spark.sql("SELECT deptno,sal,min(sal) OVER(partition by deptno ORDER BY deptno) AS row from emp").show
+------+------+------+
|deptno| sal| row|
+------+------+------+
| 10|2450.0|1300.0|
| 10|5000.0|1300.0|
| 10|1300.0|1300.0|
| 30|1600.0| 950.0|
| 30|1250.0| 950.0|
| 30|1250.0| 950.0|
| 30|2850.0| 950.0|
| 30|1500.0| 950.0|
| 30| 950.0| 950.0|
| 20| 800.0| 800.0|
| 20|2975.0| 800.0|
| 20|1500.0| 800.0|
| 20|1100.0| 800.0|
| 20|3000.0| 800.0|
+------+------+------+
6) Grouped max
Same as min.
4. Frame Specification
currentRow(): returns a special frame boundary that points to the current row in the window partition.
unboundedFollowing(): returns a special frame boundary that points to the last row in the window partition.
scala> spark.sql("SELECT deptno,sal, count(*) OVER(PARTITION BY deptno ORDER BY sal RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as row from emp").show
+------+------+---+
|deptno| sal|row|
+------+------+---+
| 10|1300.0| 1|
| 10|2450.0| 2|
| 10|5000.0| 3|
| 30| 950.0| 1|
| 30|1250.0| 3|
| 30|1250.0| 3|
| 30|1500.0| 4|
| 30|1600.0| 5|
| 30|2850.0| 6|
| 20| 800.0| 1|
| 20|1100.0| 2|
| 20|1500.0| 3|
| 20|2975.0| 4|
| 20|3000.0| 5|
+------+------+---+
scala> spark.sql("SELECT deptno,sal, max(sal) OVER(PARTITION BY deptno ORDER BY sal ROWS BETWEEN 1 PRECEDING AND 2 FOLLOWING) as row from emp").show
+------+------+------+
|deptno| sal| row|
+------+------+------+
| 10|1300.0|5000.0|
| 10|2450.0|5000.0|
| 10|5000.0|5000.0|
| 30| 950.0|1250.0|
| 30|1250.0|1500.0|
| 30|1250.0|1600.0|
| 30|1500.0|2850.0|
| 30|1600.0|2850.0|
| 30|2850.0|2850.0|
| 20| 800.0|1500.0|
| 20|1100.0|2975.0|
| 20|1500.0|3000.0|
| 20|2975.0|3000.0|
| 20|3000.0|3000.0|
+------+------+------+
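The second query can also be written with the DataFrame API; a minimal sketch where the negative offset means PRECEDING and the positive offset means FOLLOWING (assuming df from above):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.max

// ROWS BETWEEN 1 PRECEDING AND 2 FOLLOWING
val frame = Window.partitionBy("deptno").orderBy("sal").rowsBetween(-1, 2)
df.select($"deptno", $"sal", max("sal").over(frame).as("row")).show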
II. Miscellaneous (misc) Functions
1. crc32(e: Column): computes the CRC32 checksum, returning a bigint.
scala> spark.sql("SELECT crc32('Spark')").show
+----------------------------+
|crc32(CAST(Spark AS BINARY))|
+----------------------------+
| 1557323817|
+----------------------------+
2. hash(cols: Column*): computes a hash code, returning an int.
scala> spark.sql("SELECT hash('Spark')").show
+-----------+
|hash(Spark)|
+-----------+
| 228093765|
+-----------+
3. md5(e: Column): computes the MD5 digest, returning a 32-character hexadecimal string.
scala> spark.sql("SELECT md5('Spark')").show(false)
+--------------------------------+
|md5(CAST(Spark AS BINARY)) |
+--------------------------------+
|8cde774d6f7333752ed72cacddb05126|
+--------------------------------+
4. sha1(e: Column): computes the SHA-1 digest, returning a 40-character hexadecimal string.
scala> spark.sql("SELECT sha1('Spark')").show(false)
+----------------------------------------+
|sha1(CAST(Spark AS BINARY)) |
+----------------------------------------+
|85f5955f4b27a9a4c2aab6ffe5d7189fc298b92c|
+----------------------------------------+
5. sha2(e: Column, numBits: Int): computes a SHA-2 family digest of numBits bits, returned as a hexadecimal string. numBits supports 224, 256, 384, or 512.
scala> spark.sql("SELECT sha2('Spark',256)").show(false)
+----------------------------------------------------------------+
|sha2(CAST(Spark AS BINARY), 256) |
+----------------------------------------------------------------+
|529bc3b07127ecb7e53a4dcf1991d9152c24537d919178022b2c42657f79a26b|
+----------------------------------------------------------------+
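These functions are also exposed as Column functions in the DataFrame API; a minimal sketch that applies them to a literal value on a one-row DataFrame:
import org.apache.spark.sql.functions.{crc32, hash, lit, md5, sha1, sha2}

// spark.range(1) just provides a single-row DataFrame to select against
spark.range(1).select(crc32(lit("Spark")), hash(lit("Spark")), md5(lit("Spark")), sha1(lit("Spark")), sha2(lit("Spark"), 256)).show(false)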