Spark聚合下推思路以及demo

Spark原本预计在2.3版本实现聚合下推，虽然不知道是何原因最终没有能够在2.3版本最终实现，但是因为工作需要，必须要从聚合函数下手优化Spark SQL，遂思考之实现之。

一篇有意义的参考文章

网上有个牛人想在2.2版本实现聚合下推并提交代码到Spark，结果在pull request里被拒绝了，Spark的人说他们在2.3会实现一套新的DataSource API，即DataSource API v2，所以让他不要这么执着于提交这个代码，我也是很醉。。。这里贴出他的博文：

SparkSQL如何实现聚合下推

该大牛是基于物理计划实现的下推，局限性比较大。所以我参考了他的思路，从逻辑计划和物理计划两个方面都做了一些优化。这里只讲逻辑计划的下推。下推，必然最后是推到数据源层，而Spark没有实现DataSource的聚合数据源的接口，这里可以参考下刚刚分享博文实现的AggregatedFilteredScan接口，我也是基于这个接口的做法实现的。

下推的意义

无论是传统的谓词下推，还是聚合下推，意义都在于将一些操作推到数据源层，这样从数据源里返回的数据就会极大减少。磁盘读写和网络开销都会降低，性能会得到提升。

难点的实现思路

聚合下推的最大难点，我认为是遇到了join，当join的on的两列不属于group列或者aggregate列该肿么办。最开始我认为这种情况可能没有办法下推，因为这样势必要在在group列中加上了原本不属于group的某一join列，这样会影响聚合的结果并且会多一次聚合。但是经过大神提点，其实这样也是可以下推的，原因有二：

即使多一个聚合节点，SQL执行的结果也是对的，也就是最终结果来看其实不应该聚合结果。
一般来说，join on的两列不可能有相同的行数，如果行数相同，那么按照数据库的设计规范，这两张表就应该Union成一张表。所以多的这个聚合节点，也是会减少数据源的数据传输的。

这两点在后面的例子会有展示。

一个下推到join的简单思路以及结果

数据源准备

使用写一个Spark DataSource的随手笔记的Scott数据源。

初始SQL以及逻辑计划

SQL：

SELECT AVG(salary), deptName 
FROM emp 
JOIN dept 
ON emp.deptNo = dept.deptNo 
GROUP BY deptName;

LogicalPlan：

Aggregate [deptName#8], [avg(cast(salary#6 as double)) AS avg(salary)#19, deptName#8]
+- Project [salary#6, deptName#8]
   +- Join Inner, (deptNo#5 = deptNo#7)
      :- Project [deptNo#5, salary#6]
      :  +- Filter isnotnull(deptNo#5)
      :     +- Relation[empNo#2,empName#3,mgr#4,deptNo#5,salary#6] TestAggregatePushdownPlan2Scan(emp,StructType(StructField(empNo,IntegerType,true), StructField(empName,StringType,true), StructField(mgr,IntegerType,true), StructField(deptNo,IntegerType,true), StructField(salary,FloatType,true)))
      +- Project [deptNo#7, deptName#8]
         +- Filter isnotnull(deptNo#7)
            +- Relation[deptNo#7,deptName#8,loc#9] TestAggregatePushdownPlan2Scan(dept,StructType(StructField(deptNo,IntegerType,true), StructField(deptName,StringType,true), StructField(loc,StringType,true)))

逐步下推

Step 1，指针在最上层，aggregate节点推到其child节点project的下面，同时将project里的salary#6替换成avg(salary)#19：

Project [avg(salary)#19, deptName#8]
+- Aggregate [deptName#8], [avg(cast(salary#6 as double)) AS avg(salary)#19, deptName#8]
   +- Join Inner, (deptNo#5 = deptNo#7)
      :- Project [deptNo#5, salary#6]
      :  +- Filter isnotnull(deptNo#5)
      :     +- Relation[empNo#2,empName#3,mgr#4,deptNo#5,salary#6] TestAggregatePushdownPlan2Scan(emp,StructType(StructField(empNo,IntegerType,true), StructField(empName,StringType,true), StructField(mgr,IntegerType,true), StructField(deptNo,IntegerType,true), StructField(salary,FloatType,true)))
      +- Project [deptNo#7, deptName#8]
         +- Filter isnotnull(deptNo#7)
            +- Relation[deptNo#7,deptName#8,loc#9] TestAggregatePushdownPlan2Scan(dept,StructType(StructField(deptNo,IntegerType,true), StructField(deptName,StringType,true), StructField(loc,StringType,true)))

Step 2，指针在第二层，这次是下推一层后的aggregate节点，搜索join节点下面的左右子project节点，看哪个有salary#6，往salary#6所在的子节点上方添加一个以join字段为group by条件的聚合节点，假设新生成exprId是20：

Project [avg(salary)#19, deptName#8]
+- Aggregate [deptName#8], [avg(cast(salary#20 as double)) AS avg(salary)#19, deptName#8]
   +- Join Inner, (deptNo#5 = deptNo#7)
      :- Aggregate [deptNo#5], [avg(cast(salary#6 as double)) AS salary#20, deptNo#5]
      :  +- Project [deptNo#5, salary#6]
      :     +- Filter isnotnull(deptNo#5)
      :        +- Relation[empNo#2,empName#3,mgr#4,deptNo#5,salary#6] TestAggregatePushdownPlan2Scan(emp,StructType(StructField(empNo,IntegerType,true), StructField(empName,StringType,true), StructField(mgr,IntegerType,true), StructField(deptNo,IntegerType,true), StructField(salary,FloatType,true)))
      +- Project [deptNo#7, deptName#8]
         +- Filter isnotnull(deptNo#7)
            +- Relation[deptNo#7,deptName#8,loc#9] TestAggregatePushdownPlan2Scan(dept,StructType(StructField(deptNo,IntegerType,true), StructField(deptName,StringType,true), StructField(loc,StringType,true)))

Step 3，指针在第4层的aggregate，与1类似，将aggregate推到project下方，并将project中的salary#6替换成salary#20：

Project [avg(salary)#19, deptName#8]
+- Aggregate [deptName#8], [avg(cast(salary#20 as double)) AS avg(salary)#19, deptName#8]
   +- Join Inner, (deptNo#5 = deptNo#7)
      :- Project [deptNo#5, salary#20]
      :  +- Aggregate [deptNo#5], [avg(cast(salary#6 as double)) AS salary#20, deptNo#5]
      :     +- Filter isnotnull(deptNo#5)
      :        +- Relation[empNo#2,empName#3,mgr#4,deptNo#5,salary#6] TestAggregatePushdownPlan2Scan(emp,StructType(StructField(empNo,IntegerType,true), StructField(empName,StringType,true), StructField(mgr,IntegerType,true), StructField(deptNo,IntegerType,true), StructField(salary,FloatType,true)))
      +- Project [deptNo#7, deptName#8]
         +- Filter isnotnull(deptNo#7)
            +- Relation[deptNo#7,deptName#8,loc#9] TestAggregatePushdownPlan2Scan(dept,StructType(StructField(deptNo,IntegerType,true), StructField(deptName,StringType,true), StructField(loc,StringType,true)))

Step 2，此时aggregate的子节点是join，将avg(salary)#19的exprId(19)生成新id，假设这里变成avg(salary)#20：

Project [avg(salary)#20, deptName#8]
+- Aggregate [deptName#8], [avg(cast(salary#6 as double)) AS avg(salary)#20, deptName#8]
   +- Join Inner, (deptNo#5 = deptNo#7)
      :- Project [deptNo#5, salary#6]
      :  +- Filter isnotnull(deptNo#5)
      :     +- Relation[empNo#2,empName#3,mgr#4,deptNo#5,salary#6] TestAggregatePushdownPlan2Scan(emp,StructType(StructField(empNo,IntegerType,true), StructField(empName,StringType,true), StructField(mgr,IntegerType,true), StructField(deptNo,IntegerType,true), StructField(salary,FloatType,true)))
      +- Project [deptNo#7, deptName#8]
         +- Filter isnotnull(deptNo#7)
            +- Relation[deptNo#7,deptName#8,loc#9] TestAggregatePushdownPlan2Scan(dept,StructType(StructField(deptNo,IntegerType,true), StructField(deptName,StringType,true), StructField(loc,StringType,true)))

Step 3，搜索join节点下面的左右子project节点，看哪个有salary#6，往salary#6所在的子节点上方添加一个以join字段为group by条件的聚合节点，假设新生成exprId是21：

Project [avg(salary)#20, deptName#8]
+- Aggregate [deptName#8], [avg(cast(salary#21 as double)) AS avg(salary)#20, deptName#8]
   +- Join Inner, (deptNo#5 = deptNo#7)
      :- Aggregate [deptNo#5], [avg(cast(salary#6 as double)) AS salary#21, deptNo#5]
      :  +- Project [deptNo#5, salary#6]
      :     +- Filter isnotnull(deptNo#5)
      :        +- Relation[empNo#2,empName#3,mgr#4,deptNo#5,salary#6] TestAggregatePushdownPlan2Scan(emp,StructType(StructField(empNo,IntegerType,true), StructField(empName,StringType,true), StructField(mgr,IntegerType,true), StructField(deptNo,IntegerType,true), StructField(salary,FloatType,true)))
      +- Project [deptNo#7, deptName#8]
         +- Filter isnotnull(deptNo#7)
            +- Relation[deptNo#7,deptName#8,loc#9] TestAggregatePushdownPlan2Scan(dept,StructType(StructField(deptNo,IntegerType,true), StructField(deptName,StringType,true), StructField(loc,StringType,true)))

Step 4，与1类似，将aggregate推到project下方，并将project中的salary#6替换成salary#21：

Project [avg(salary)#20, deptName#8]
+- Aggregate [deptName#8], [avg(cast(salary#6 as double)) AS avg(salary)#20, deptName#8]
   +- Join Inner, (deptNo#5 = deptNo#7)
      :- Project [deptNo#5, salary#21]
      :  +- Aggregate [deptNo#5], [avg(cast(salary#6 as double)) AS salary#21, deptNo#5]
      :     +- Filter isnotnull(deptNo#5)
      :        +- Relation[empNo#2,empName#3,mgr#4,deptNo#5,salary#6] TestAggregatePushdownPlan2Scan(emp,StructType(StructField(empNo,IntegerType,true), StructField(empName,StringType,true), StructField(mgr,IntegerType,true), StructField(deptNo,IntegerType,true), StructField(salary,FloatType,true)))
      +- Project [deptNo#7, deptName#8]
         +- Filter isnotnull(deptNo#7)
            +- Relation[deptNo#7,deptName#8,loc#9] TestAggregatePushdownPlan2Scan(dept,StructType(StructField(deptNo,IntegerType,true), StructField(deptName,StringType,true), StructField(loc,StringType,true)))

这时aggregate-filter-relation的组合就会调用到上面提到的AggregatedFilteredScan接口，调用到数据源的buildscan()方法。

结果查看

这个下推，其实等价于下面的SQL和执行计划：

SQL：

SELECT AVG(salary), deptName 
FROM 
(SELECT AVG(salary), deptNo FROM emp GROUP BY deptNo) a 
JOIN dept 
ON a.deptNo = dept.deptNo 
GROUP BY deptName

LogicalPlan：

Project [avg(avgsalary)#21, deptName#9]
+- Aggregate [deptName#9], [avg(avgsalary#2) AS avg(avgsalary)#21, deptName#9]
   +- Join Inner, (deptNo#6 = deptNo#8)
      :- Project [deptNo#6, avgsalary#2]
      :  +- Aggregate [deptNo#6], [avg(cast(salary#7 as double)) AS avgsalary#2, deptNo#6]
      :     +- Filter isnotnull(deptNo#6)
      :        +- Relation[empNo#3,empName#4,mgr#5,deptNo#6,salary#7] TestAggregatePushdownPlan2Scan(emp,StructType(StructField(empNo,IntegerType,true), StructField(empName,StringType,true), StructField(mgr,IntegerType,true), StructField(deptNo,IntegerType,true), StructField(salary,FloatType,true)))
      +- Project [deptNo#8, deptName#9]
         +- Filter isnotnull(deptNo#8)
            +- Relation[deptNo#8,deptName#9,loc#10] TestAggregatePushdownPlan2Scan(dept,StructType(StructField(deptNo,IntegerType,true), StructField(deptName,StringType,true), StructField(loc,StringType,true)))

在没有聚合下推的情况下，返回的join两侧的数据源是：

data without aggregate function: 
  List(ListBuffer(20, 800.0), ListBuffer(20, 3000.0), ListBuffer(20, 2975.0), ListBuffer(30, 1600.0), ListBuffer(30, 1250.0), ListBuffer(30, 2950.0), ListBuffer(10, 5000.0))
data without aggregate function: 
  List(ListBuffer(10, accounting), ListBuffer(20, research), ListBuffer(30, sales), ListBuffer(40, operations))

而在下推的情况下，返回的join两侧数据源是：

data with aggregate function: 
  ListBuffer(List(30, 5800.0, 3), List(20, 6775.0, 3), List(10, 5000.0, 1))
data without aggregate function: 
  List(ListBuffer(10, accounting), ListBuffer(20, research), ListBuffer(30, sales), ListBuffer(40, operations))

可以看到，下推和没下推，在一侧数据源中拿到的数据有明显减少。这还只是数据量在不到20的情况下。在数据量大的情况下，那么聚合下推效果会更好。

二者查询结果都是：

+------------------+----------+
|avg(salary)       |deptName  |
+------------------+----------+
|1933.3333333333333|sales     |
|5000.0            |accounting|
|2258.3333333333335|research  |
+------------------+----------+